【CE314】Computer Science NLP

Deadline: Please follow the deadline on FASER.

Build a text classifier on the IMDB sentiment classification dataset. You can use any classification method, but you must train your model on the first 40000 instances and test it on the last 10000 instances. The IMDB dataset will be uploaded to the Moodle page for you to download.

Your code should include:

1: Read the file and split the instances into a training set and a test set.

2: Pre-process the text. You can choose whether to apply stemming, stop-word removal, or removal of non-alphabetical words. (Not all classification models need this step. It is OK to skip it if you think your model performs better without it, and you can give some justification in the report.)

3: Analyse the features of the training set and report its linguistic features.

4: Build a text classification model, train it on the training set, and test it on the test set.

5: Summarize the performance of your model. (You can gain additional marks if you include graph visualizations.)

6: (Optional) Speculate on how you could improve your work, based on your proposed model.
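Steps 1 to 3 above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a required implementation. The brief does not specify the file format, so `load_rows` assumes a CSV file with hypothetical column names `review` and `sentiment`. The stop-word list is a tiny placeholder, and the regular-expression tokenizer is one possible choice among many.

```python
import csv
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "was"}


def load_rows(path, text_col="review", label_col="sentiment"):
    """Read (text, label) pairs from a CSV file.

    The column names are assumptions; adjust them to the actual file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [(row[text_col], row[label_col]) for row in csv.DictReader(f)]


def split_by_position(rows, n_train, n_test):
    """Return the first n_train rows for training and the last n_test for testing."""
    if n_train + n_test > len(rows):
        raise ValueError("training and test slices would overlap")
    return rows[:n_train], rows[len(rows) - n_test:]


def preprocess(text, remove_stop_words=True):
    """Lowercase, drop HTML-like tags, and keep alphabetic tokens only."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def describe(rows):
    """Compute simple linguistic statistics for a list of (text, label) pairs."""
    token_lists = [preprocess(text) for text, _ in rows]
    vocab = Counter(t for tokens in token_lists for t in tokens)
    total_tokens = sum(len(tokens) for tokens in token_lists)
    return {
        "documents": len(rows),
        "label_counts": Counter(label for _, label in rows),
        "vocabulary_size": len(vocab),
        "mean_tokens_per_document": total_tokens / len(rows),
        "most_common": vocab.most_common(3),
    }
```

For the split required here, the call would be `split_by_position(rows, 40000, 10000)`. Statistics such as those returned by `describe` (class balance, vocabulary size, average document length) are one way to report the linguistic features of the training set.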

After you have built the model and tested it on the test set, write a report (no longer than three A4 pages, in 11-point Arial) summarizing your work.

(You can use existing algorithms from GitHub or Kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm or the VADER classifier, which were practised in Lab 4.)
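Since Naïve Bayes and VADER are excluded, one permitted alternative is logistic regression over bag-of-words counts. The sketch below is a minimal pure-Python version trained with stochastic gradient descent. It assumes each document has already been tokenized into a list of strings and that labels are encoded as 1 (positive) and 0 (negative). The function names and the default hyperparameters are illustrative choices, not values prescribed by the brief.

```python
import math
import random
from collections import Counter


def featurize(tokens):
    """Represent a document as sparse bag-of-words term counts."""
    return Counter(tokens)


def predict_proba(weights, bias, feats):
    """Return the estimated probability that the document is positive."""
    z = bias + sum(weights.get(tok, 0.0) * count for tok, count in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(examples, epochs=20, lr=0.1, seed=0):
    """Train binary logistic regression with SGD.

    examples: list of (tokens, label) pairs, label being 1 or 0.
    Returns (weights, bias), where weights maps token -> float.
    """
    rng = random.Random(seed)
    data = [(featurize(tokens), label) for tokens, label in examples]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for feats, label in data:
            # Gradient of the log-likelihood is (label - p) * feature value.
            error = label - predict_proba(weights, bias, feats)
            for tok, count in feats.items():
                weights[tok] = weights.get(tok, 0.0) + lr * error * count
            bias += lr * error
    return weights, bias


def predict(weights, bias, tokens):
    """Return 1 if the positive class is at least as likely as the negative, else 0."""
    return 1 if predict_proba(weights, bias, featurize(tokens)) >= 0.5 else 0
```

A library implementation with regularization and TF-IDF weighting would likely perform better on 40000 reviews. The point of the sketch is only to show the shape of a classifier that stays within the stated restrictions.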

Suggestions for bonus points:

Include necessary comments in your code

Include proper references in your report

Include graph visualizations in your report

Investigate more evaluation methods. For example, do not only show the precision, recall, and F-score (P, R, F), but also run the model multiple times and show the standard deviation of P, R, and F. (I am sure you can find more evaluation methods.)

Write your report like a mini conference paper. You can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
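The suggestion above about reporting the standard deviation of P, R, and F over multiple runs can be sketched as follows. This is a minimal standard-library illustration. The function names are hypothetical, and it assumes binary labels with one class treated as positive. Because the train/test split is fixed, the variation between runs would come from something like the random seed of the training procedure.

```python
from statistics import mean, stdev


def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize_runs(scores):
    """Aggregate per-run scores.

    scores: list of (precision, recall, f1) tuples, one per run (at least two).
    Returns {metric_name: (mean, sample standard deviation)}.
    """
    names = ("precision", "recall", "f1")
    columns = zip(*scores)  # regroup the scores by metric instead of by run
    return {name: (mean(col), stdev(col)) for name, col in zip(names, columns)}
```

Each run would contribute one tuple from `precision_recall_f1`, and the report could then show each metric as mean ± standard deviation.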