【CE314】Computer Science NLP

Deadline: Please follow the deadline on FASER.

Build a text classifier on the IMDB sentiment classification dataset. You can use any classification method, but you must train your model on the first 40,000 instances and test it on the last 10,000 instances. The IMDB dataset will be uploaded to the Moodle page for you to download.
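The split described above can be sketched as follows. This is only a minimal illustration: it assumes the dataset is a CSV file with columns named `review` and `sentiment`, which the brief does not state, and the helper names `read_reviews` and `split_first_last` are hypothetical. A tiny in-memory sample stands in for the real file.

```python
import csv
import io


def read_reviews(handle):
    """Return (text, label) pairs from a CSV file object.

    Assumes columns named 'review' and 'sentiment' (an assumption,
    adjust to the file actually provided on Moodle).
    """
    reader = csv.DictReader(handle)
    return [(row["review"], row["sentiment"]) for row in reader]


def split_first_last(instances, n_train, n_test):
    """Use the first n_train instances for training and the last n_test for testing."""
    if n_train + n_test > len(instances):
        raise ValueError("training and test slices would overlap")
    return instances[:n_train], instances[len(instances) - n_test:]


# Tiny in-memory stand-in for the real dataset file.
sample = io.StringIO(
    "review,sentiment\n"
    '"Great film, loved it",positive\n'
    '"Dull and far too long",negative\n'
    '"A real delight",positive\n'
    '"Not worth the ticket",negative\n'
    '"Terrible pacing",negative\n'
)
data = read_reviews(sample)
train, test = split_first_last(data, n_train=4, n_test=1)
print(len(train), len(test))
```

With the real file, the same helpers would be called with `n_train=40000` and `n_test=10000` after opening the file with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the download was saved.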

Your code should include:

1: Read the file and split the instances into a training set and a test set.

2: Pre-process the text. You can choose whether to apply stemming, stop-word removal, or removal of non-alphabetic tokens. (Not all classification models need this step. It is fine to skip it if you think your model performs better without it, and you can justify that choice in the report.)

3: Analyse the training set and report its linguistic features.

4: Build a text classification model, train it on the training set, and test it on the test set.

5: Summarize the performance of your model. (You can gain additional marks if you include graph visualizations.)

6: (Optional) Speculate on how your work could be improved, based on your proposed model.
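Steps 2 and 3 above can be sketched with the standard library alone. This is an illustrative sketch, not a required approach: the stop-word set is a short made-up list, the function names `preprocess` and `describe` are hypothetical, and stemming is left out (a library stemmer such as NLTK's `PorterStemmer` could be applied to the tokens if wanted). The statistics shown are only examples of linguistic features that might be reported.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real run would likely use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "this", "was", "to", "in"}


def preprocess(text, remove_stop_words=True):
    """Lower-case, strip HTML-like tags, keep alphabetic tokens, optionally drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)  # e.g. "<br />" line breaks left in review text
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def describe(instances):
    """Summarize a list of (text, label) pairs with a few simple corpus statistics."""
    label_counts = Counter(label for _, label in instances)
    token_lists = [preprocess(text) for text, _ in instances]
    lengths = [len(tokens) for tokens in token_lists]
    vocab = Counter(tok for tokens in token_lists for tok in tokens)
    return {
        "n_documents": len(instances),
        "label_counts": dict(label_counts),
        "mean_length": sum(lengths) / len(lengths),
        "vocabulary_size": len(vocab),
        "top_tokens": vocab.most_common(3),
    }


# Two made-up reviews standing in for the training set.
train = [
    ("This movie was great.<br />Great acting!", "positive"),
    ("A dull, boring movie.", "negative"),
]
stats = describe(train)
print(stats)
```

The same counts (class balance, document length, vocabulary size, most frequent tokens) are natural candidates for the graphs mentioned in step 5.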

After you have built the model and tested it on the test set, write a report (no longer than three A4 pages, in 11-point Arial) summarizing your work.

(You can use existing algorithms from GitHub or Kaggle, but you must not directly copy and paste their code! However, you are not allowed to use the Naïve Bayes algorithm or the VADER classifier, which were practised in Lab 4.)
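As one example of a method that is neither Naïve Bayes nor VADER, a linear perceptron over bag-of-words counts can be written in a few lines. This is a toy sketch under stated assumptions, not a recommended final model: the function names are hypothetical, labels are encoded as +1 and -1, and the four training documents are made up. In practice a library model such as logistic regression or a linear SVM would be a more typical choice.

```python
from collections import defaultdict


def featurize(tokens):
    """Bag-of-words counts for one tokenized document."""
    counts = defaultdict(int)
    for tok in tokens:
        counts[tok] += 1
    return counts


def train_perceptron(docs, labels, epochs=10):
    """Train a perceptron. docs: list of token lists; labels: +1 or -1."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for tokens, y in zip(docs, labels):
            feats = featurize(tokens)
            score = bias + sum(weights[t] * c for t, c in feats.items())
            if y * score <= 0:  # wrong (or undecided): move towards the true label
                for t, c in feats.items():
                    weights[t] += y * c
                bias += y
    return weights, bias


def predict(weights, bias, tokens):
    """Return +1 (positive) or -1 (negative) for one tokenized document."""
    feats = featurize(tokens)
    score = bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())
    return 1 if score > 0 else -1


# Made-up, already tokenized training documents.
docs = [["great", "film"], ["boring", "film"], ["great", "acting"], ["dull", "plot"]]
labels = [1, -1, 1, -1]
weights, bias = train_perceptron(docs, labels)
print([predict(weights, bias, d) for d in docs])
```

On real data the token lists would come from whatever pre-processing was chosen, and the predictions on the last 10,000 instances would feed the evaluation.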

Suggestions for bonus points:

Include the necessary comments in your code

Include proper references in your report

Include graph visualizations in your report

Investigate more evaluation methods. For example, do not only report the precision (P), recall (R), and F-score (F) of a single run, but also run the experiment multiple times and report the standard deviation of P, R, and F. (I am sure you can find more evaluation methods.)

Write your report like a mini conference paper. You can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489, San Diego, California. Association for Computational Linguistics.
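For the evaluation suggestion in the list above, precision, recall, and F1 for one run, and the mean and standard deviation across several runs, can be computed as sketched here. The function names are hypothetical, the label strings are assumed, and the per-run F1 values are placeholder numbers chosen for illustration, not real results.

```python
import statistics


def precision_recall_f1(gold, predicted, positive="positive"):
    """Precision, recall, and F1 for the given positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def summarise_runs(scores):
    """Mean and sample standard deviation of one metric over several runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# One made-up run: two positive and two negative gold labels.
gold = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "positive", "negative"]
p, r, f = precision_recall_f1(gold, predicted)

# Placeholder F1 values from three hypothetical runs (e.g. different random seeds).
f1_per_run = [0.80, 0.82, 0.84]
mean_f1, std_f1 = summarise_runs(f1_per_run)
print(p, r, f, round(mean_f1, 3), round(std_f1, 3))
```

Repeated runs only vary if something in the pipeline is random, such as weight initialization or data shuffling, so the seed used for each run is worth recording in the report.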