【CE314】Computer Science NLP

Deadline: Please follow deadline on FASER

Build a text classifier on the IMDB sentiment classification dataset, you can use any classification method, but you must training your model on the first 40000 instances and testing your model on the last 10000 instances. The IMDB dataset will be uploaded on the moodle page for you to download.

Your code should include:

1: Read the file, incorporate the instances into the training set and testing set.

2: Pre-processing the text, you can choose whether you need stemming, removing stop words, removing non-alphabetical words. (Not all classification models need this step, it is OK if you think your model can perform better without this step, and you can give some justification in the report.)

3: Analysing the feature of the training set, report the linguistic features of the training dataset.

4: Build a text classification model, train your model on the training set and test your model on the test set.

5: Summarize the performance of your model (You can gain additional marks if you have some graph visualization).

6: (Optional) You can speculate how you can improve your works based on your proposed model.

After you build such a model and test on the test set, you should write a report (no longer than three pages in A4, with Arial 11 fonts) to summarize your work.

(You can use the existing algorithms on github or kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm and VADER classifier, which practiced in Lab 4)

Suggestion: some bonus points:

Have necessary comments on your code

Have proper reference on your report

Have graph visualization on your report

Investigate more evaluation methods, like not only show the P R F score, but also run multiple times and show the standard derivation on P R F (I am sure you can find more evaluation methods.)

Write your report like a mini-conference paper (you can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1480--1489, San Diego, California. Association for Computational Linguistics.
相关推荐
喜欢吃豆32 分钟前
快速手搓一个MCP服务指南(九): FastMCP 服务器组合技术:构建模块化AI应用的终极方案
服务器·人工智能·python·深度学习·大模型·github·fastmcp
星融元asterfusion38 分钟前
基于路径质量的AI负载均衡异常路径检测与恢复策略
人工智能·负载均衡·异常路径
zskj_zhyl43 分钟前
智慧养老丨从依赖式养老到自主式养老:如何重构晚年生活新范式
大数据·人工智能·物联网
创小匠1 小时前
创客匠人视角下创始人 IP 打造与知识变现的底层逻辑重构
人工智能·tcp/ip·重构
xiangduanjava1 小时前
关于安装Ollama大语言模型本地部署工具
人工智能·语言模型·自然语言处理
zzywxc7871 小时前
AI 正在深度重构软件开发的底层逻辑和全生命周期,从技术演进、流程重构和未来趋势三个维度进行系统性分析
java·大数据·开发语言·人工智能·spring
超龄超能程序猿1 小时前
(1)机器学习小白入门 YOLOv:从概念到实践
人工智能·机器学习
大熊背2 小时前
图像处理专业书籍以及网络资源总结
人工智能·算法·microsoft
江理不变情2 小时前
图像质量对比感悟
c++·人工智能