CE314 Computer Science NLP

Deadline: please follow the deadline on FASER.

Build a text classifier on the IMDB sentiment classification dataset. You may use any classification method, but you must train your model on the first 40,000 instances and test it on the last 10,000 instances. The IMDB dataset will be uploaded to the Moodle page for you to download.
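For illustration, a minimal sketch of loading and splitting the data is shown below. It assumes the Moodle file is the common 50,000-row CSV with "review" and "sentiment" columns; the file name and column names are assumptions, so adjust them to match the actual download.

```python
# Minimal sketch: read the CSV and split it into the required
# 40,000-instance training set and 10,000-instance test set.
# "IMDB Dataset.csv", "review" and "sentiment" are assumed names.
import pandas as pd

df = pd.read_csv("IMDB Dataset.csv")

train_df = df.iloc[:40000]   # first 40,000 instances for training
test_df = df.iloc[40000:]    # last 10,000 instances for testing

X_train, y_train = train_df["review"], train_df["sentiment"]
X_test, y_test = test_df["review"], test_df["sentiment"]
```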

Your code should include:

1: Read the file and split the instances into a training set and a test set.

2: Pre-process the text; you can choose whether you need stemming, stop-word removal, or removal of non-alphabetical tokens. (Not all classification models need this step; it is fine to skip it if you think your model performs better without it, as long as you give some justification in the report.)

3: Analyse the training set and report its linguistic features.

4: Build a text classification model, train it on the training set, and test it on the test set (a minimal sketch of steps 2-5 follows this list).

5: Summarize the performance of your model (you can gain additional marks for graph visualizations).

6: (Optional) You can speculate on how you could improve your work based on your proposed model.
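As referenced in step 4, here is a minimal end-to-end sketch of steps 2-5 under the same file and column assumptions as above. It uses TF-IDF features with stop-word removal and a logistic-regression classifier, which is only one of many acceptable approaches (and avoids the banned Naïve Bayes and VADER).

```python
# Sketch of steps 2-5: pre-processing, a simple linguistic feature,
# training, and evaluation. File and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("IMDB Dataset.csv")
X_train, y_train = df["review"][:40000], df["sentiment"][:40000]
X_test, y_test = df["review"][40000:], df["sentiment"][40000:]

# Step 2: pre-processing folded into the vectorizer
# (lowercasing, tokenisation, English stop-word removal).
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                             max_features=50000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 3: one simple linguistic feature of the training set.
print("Training vocabulary size:", len(vectorizer.vocabulary_))

# Step 4: train the classifier on the training set only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

# Step 5: precision / recall / F1 on the held-out test set.
print(classification_report(y_test, clf.predict(X_test_vec)))
```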

After you build the model and test it on the test set, you should write a report (no longer than three A4 pages, in 11-point Arial) to summarize your work.

(You can use existing algorithms from GitHub or Kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm or the VADER classifier, which were practised in Lab 4.)

Suggestions for bonus points:

Have the necessary comments in your code

Have proper references in your report

Have graph visualizations in your report

Investigate more evaluation methods: for example, not only report the precision/recall/F1 (P/R/F) scores, but also run the experiment multiple times and report the standard deviation of P/R/F (I am sure you can find more evaluation methods; see the sketch after this list.)

Write your report like a mini-conference paper (you can learn from the paper below):

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
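As mentioned in the bonus point on evaluation, one possible way to report variation across runs (a sketch under the same file and column assumptions as above) is to retrain a stochastic classifier, here a linear model trained with SGD, with different random seeds and report the mean and standard deviation of the macro-averaged P/R/F.

```python
# Sketch: repeat training with different random seeds and report the
# mean and standard deviation of macro precision / recall / F1.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv("IMDB Dataset.csv")   # assumed file/column names
X_train, y_train = df["review"][:40000], df["sentiment"][:40000]
X_test, y_test = df["review"][40000:], df["sentiment"][40000:]

vec = TfidfVectorizer(stop_words="english", max_features=50000)
X_tr, X_te = vec.fit_transform(X_train), vec.transform(X_test)

scores = []
for seed in range(5):                  # five independent runs
    clf = SGDClassifier(random_state=seed)
    clf.fit(X_tr, y_train)
    p, r, f, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_te), average="macro")
    scores.append((p, r, f))

mean, std = np.mean(scores, axis=0), np.std(scores, axis=0)
print("P/R/F mean:", mean, "std:", std)
```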