【CE314】Computer Science NLP

Deadline: Please follow deadline on FASER

Build a text classifier on the IMDB sentiment classification dataset, you can use any classification method, but you must training your model on the first 40000 instances and testing your model on the last 10000 instances. The IMDB dataset will be uploaded on the moodle page for you to download.

Your code should include:

1: Read the file, incorporate the instances into the training set and testing set.

2: Pre-processing the text, you can choose whether you need stemming, removing stop words, removing non-alphabetical words. (Not all classification models need this step, it is OK if you think your model can perform better without this step, and you can give some justification in the report.)

3: Analysing the feature of the training set, report the linguistic features of the training dataset.

4: Build a text classification model, train your model on the training set and test your model on the test set.

5: Summarize the performance of your model (You can gain additional marks if you have some graph visualization).

6: (Optional) You can speculate how you can improve your works based on your proposed model.

After you build such a model and test on the test set, you should write a report (no longer than three pages in A4, with Arial 11 fonts) to summarize your work.

(You can use the existing algorithms on github or kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm and VADER classifier, which practiced in Lab 4)

Suggestion: some bonus points:

Have necessary comments on your code

Have proper reference on your report

Have graph visualization on your report

Investigate more evaluation methods, like not only show the P R F score, but also run multiple times and show the standard derivation on P R F (I am sure you can find more evaluation methods.)

Write your report like a mini-conference paper (you can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1480--1489, San Diego, California. Association for Computational Linguistics.
相关推荐
算.子12 分钟前
【Spring AI 实战】八、完整 RAG 问答实战:检索 + 重排序 + 生成全链路
java·人工智能·spring
Sendingab13 分钟前
2026年AI口播IP新风口:多模态大模型实操,让口播兼具质感与流量
人工智能·#数字人·ip口播
Rubin智造社17 分钟前
04月22日AI每日参考:OpenAI发布AI经济政策,Agent进入金融市场
人工智能·深度学习·openai·agent·开源模型·anthropic
老王谈企服18 分钟前
[信创选型] 2026国产化替代进入应用层:有没有通过国产化认证、能在麒麟系统上跑的合规Agent?
数据库·人工智能·ai
愚公搬代码19 分钟前
【愚公系列】《OpenClaw实战指南》012-分析与展示:一句话生成可发给老板的报表与 PPT(Excel/WPS 表格自动化处理)
人工智能·自动化·powerpoint·excel·飞书·wps·openclaw
wx_xkq128820 分钟前
优秘智能数字分身:行业首创的AI赋能新质生产力的技术落地实践,从企业到个人的全域孪生革新
人工智能
RoboWizard22 分钟前
移动固态硬盘摔了一下后无法识别,数据还能恢复吗?
大数据·人工智能·数码相机·智能手机·性能优化·无人机
ofoxcoding24 分钟前
GPT image-2 怎么调用?2026 完整接入教程 + 踩坑实录
人工智能·gpt·ai
传说故事29 分钟前
【论文阅读】StarVLA-α: Reducing Complexity in Vision-Language-Action Systems
论文阅读·人工智能·具身智能·vla
水如烟30 分钟前
孤能子视角:GPT Image 2 的发布,硅界“关系编织密度”突破人界“观察符阈值”的临界事件
人工智能