【CE314】Computer Science NLP

Deadline: Please follow deadline on FASER

Build a text classifier on the IMDB sentiment classification dataset, you can use any classification method, but you must training your model on the first 40000 instances and testing your model on the last 10000 instances. The IMDB dataset will be uploaded on the moodle page for you to download.

Your code should include:

1: Read the file, incorporate the instances into the training set and testing set.

2: Pre-processing the text, you can choose whether you need stemming, removing stop words, removing non-alphabetical words. (Not all classification models need this step, it is OK if you think your model can perform better without this step, and you can give some justification in the report.)

3: Analysing the feature of the training set, report the linguistic features of the training dataset.

4: Build a text classification model, train your model on the training set and test your model on the test set.

5: Summarize the performance of your model (You can gain additional marks if you have some graph visualization).

6: (Optional) You can speculate how you can improve your works based on your proposed model.

After you build such a model and test on the test set, you should write a report (no longer than three pages in A4, with Arial 11 fonts) to summarize your work.

(You can use the existing algorithms on github or kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm and VADER classifier, which practiced in Lab 4)

Suggestion: some bonus points:

Have necessary comments on your code

Have proper reference on your report

Have graph visualization on your report

Investigate more evaluation methods, like not only show the P R F score, but also run multiple times and show the standard derivation on P R F (I am sure you can find more evaluation methods.)

Write your report like a mini-conference paper (you can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1480--1489, San Diego, California. Association for Computational Linguistics.
相关推荐
Cxiaomu几秒前
AI 聊天流式交互基础:SSE、EventSource 与 ReadableStream
人工智能·交互
啦啦啦!2 分钟前
项目环境的搭建,项目的初步使用和deepseek的初步认识
开发语言·c++·人工智能·算法
Westward-sun.4 分钟前
OpenCV实战:摄像头实时文档扫描与透视矫正
人工智能·opencv·计算机视觉
V搜xhliang02464 分钟前
生成式人工智能、大语言模型在医学教育教学中的前沿探讨
人工智能
枫叶林FYL4 分钟前
【自然语言处理 NLP】7.1 机制可解释性(Mechanistic Interpretability)
人工智能·自然语言处理
任小栗4 分钟前
【实战干货】Vue3 + WebRTC + SIP + AI 实现全自动语音接警系统(远程流获取+实时ASR+TTS回播)
人工智能·webrtc
qq_348231858 分钟前
OpenClaw 完整安装教程
人工智能
杨浦老苏10 分钟前
轻量级RSS源处理中间件FeedCraft
人工智能·docker·ai·群晖·rss
平安的平安13 分钟前
Python 实现 AI 图像生成:调用 Stable Diffusion API 完整教程
人工智能·python·stable diffusion
IT观测16 分钟前
# 聚焦AI驱动数据分析:2026年智能BI工具市场的深度调研与趋势展望报告
人工智能·数据挖掘·数据分析