【CE314】Computer Science NLP

Deadline: Please follow deadline on FASER

Build a text classifier on the IMDB sentiment classification dataset, you can use any classification method, but you must training your model on the first 40000 instances and testing your model on the last 10000 instances. The IMDB dataset will be uploaded on the moodle page for you to download.

Your code should include:

1: Read the file, incorporate the instances into the training set and testing set.

2: Pre-processing the text, you can choose whether you need stemming, removing stop words, removing non-alphabetical words. (Not all classification models need this step, it is OK if you think your model can perform better without this step, and you can give some justification in the report.)

3: Analysing the feature of the training set, report the linguistic features of the training dataset.

4: Build a text classification model, train your model on the training set and test your model on the test set.

5: Summarize the performance of your model (You can gain additional marks if you have some graph visualization).

6: (Optional) You can speculate how you can improve your works based on your proposed model.

After you build such a model and test on the test set, you should write a report (no longer than three pages in A4, with Arial 11 fonts) to summarize your work.

(You can use the existing algorithms on github or kaggle, but you must not directly copy and paste their code!

However, you are not allowed to use the Naïve Bayes algorithm and VADER classifier, which practiced in Lab 4)

Suggestion: some bonus points:

Have necessary comments on your code

Have proper reference on your report

Have graph visualization on your report

Investigate more evaluation methods, like not only show the P R F score, but also run multiple times and show the standard derivation on P R F (I am sure you can find more evaluation methods.)

Write your report like a mini-conference paper (you can learn from this paper:

  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 1480--1489, San Diego, California. Association for Computational Linguistics.
相关推荐
z小猫不吃鱼7 分钟前
13 Scaling Law 入门:模型规模、数据规模和计算量是什么关系?
人工智能·深度学习·机器学习
七牛开发者17 分钟前
如何从零开发一个工业级的 SKILL
人工智能·程序员·agent
瘦瘦瘦大人19 分钟前
豆包与抖音联动创作新手实战指南
人工智能
三无推导22 分钟前
ComfyUI 安装部署教程:Windows 下快速搭建可视化 AI 绘图工作流,零基础也能跑通
人工智能·pytorch·windows·stable diffusion·aigc·ai绘画·持续部署
春日见23 分钟前
5分钟入门强化学习之动态规划算法与实现
大数据·人工智能·python·算法·机器学习·计算机视觉
老虾头26 分钟前
AI工具在传统行业服务升级中的应用案例分享
人工智能
SNKXD_131 分钟前
2026品牌运营团队AI营销培训:TOP5轻量化课程适配常态化技能升级学习
大数据·人工智能·学习
Nan-h132 分钟前
AI 浏览器怎么选:侧边栏助手、浏览器 Agent 和可复用工作流的差别
人工智能·ai浏览器
TMT星球35 分钟前
AI时代的风控攻防战:Soul如何用AI治理AI
大数据·人工智能
Agent手记38 分钟前
电信运营商如何用AI实现携号转网自动处理?基于实在Agent的业务自动化落地与TARS大模型解析方案
运维·人工智能·ai·自动化