【CE314】Computer Science NLP

Deadline: Please follow deadline on FASER

Build a text classifier on the IMDB sentiment classification dataset, you can use any classification method, but you must training your model on the first 40000 instances and testing your model on the last 10000 instances. The IMDB dataset will be uploaded on the moodle page for you to download.

Your code should include:

1: Read the file, incorporate the instances into the training set and testing set.

2: Pre-processing the text, you can choose whether you need stemming, removing stop words, removing non-alphabetical words. (Not all classification models need this step, it is OK if you think your model can perform better without this step, and you can give some justification in the report.)

3: Analysing the feature of the training set, report the linguistic features of the training dataset.

4: Build a text classification model, train your model on the training set and test your model on the test set.

5: Summarize the performance of your model (You can gain additional marks if you have some graph visualization).

6: (Optional) You can speculate how you can improve your works based on your proposed model.

After you build such a model and test on the test set, you should write a report (no longer than three pages in A4, with Arial 11 fonts) to summarize your work.

(You can use the existing algorithms on github or kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm and VADER classifier, which practiced in Lab 4)

Suggestion: some bonus points:

Have necessary comments on your code

Have proper reference on your report

Have graph visualization on your report

Investigate more evaluation methods, like not only show the P R F score, but also run multiple times and show the standard derivation on P R F (I am sure you can find more evaluation methods.)

Write your report like a mini-conference paper (you can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1480--1489, San Diego, California. Association for Computational Linguistics.
相关推荐
罗西的思考39 分钟前
AI Agent框架探秘:拆解 OpenHands(10)--- Runtime
人工智能·算法·机器学习
冬奇Lab1 小时前
OpenClaw 源码精读(2):Channel & Routing——一条消息如何找到它的 Agent?
人工智能·开源·源码阅读
冬奇Lab1 小时前
一天一个开源项目(第38篇):Claude Code Telegram - 用 Telegram 远程用 Claude Code,随时随地聊项目
人工智能·开源·资讯
格砸3 小时前
从入门到辞职|从ChatGPT到OpenClaw,跟上智能时代的进化
前端·人工智能·后端
可观测性用观测云3 小时前
可观测性 4.0:教系统如何思考
人工智能
sunny8653 小时前
Claude Code 跨会话上下文恢复:从 8 次纠正到 0 次的工程实践
人工智能·开源·github
小笼包包仔3 小时前
OpenClaw 多Agent软件开发最佳实践指南
人工智能
smallyoung4 小时前
AgenticRAG:智能体驱动的检索增强生成
人工智能
_skyming_4 小时前
OpenCode 如何做到结果不做自动质量评估,为什么结果还不错?
人工智能
HXhlx4 小时前
CART决策树基本原理
算法·机器学习