{"id":21076136,"url":"https://github.com/windrises/ml_contest","last_synced_at":"2025-07-02T13:35:01.988Z","repository":{"id":100240760,"uuid":"162253763","full_name":"windrises/ml_contest","owner":"windrises","description":"ML course assignment","archived":false,"fork":false,"pushed_at":"2018-12-27T10:44:30.000Z","size":227,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-20T22:52:49.853Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/windrises.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-18T08:15:46.000Z","updated_at":"2018-12-27T10:44:31.000Z","dependencies_parsed_at":"2023-05-13T03:00:21.093Z","dependency_job_id":null,"html_url":"https://github.com/windrises/ml_contest","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/windrises%2Fml_contest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/windrises%2Fml_contest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/windrises%2Fml_contest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/windrises%2Fml_contest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/windrises","download_url":"https://codeload.github.com/windrises/ml_contest/tar.gz/refs/heads/master","host":{"name":"G
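itHub","url":"https://github.com","kind":"github","repositories_count":243521247,"owners_count":20304184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T19:26:54.685Z","updated_at":"2025-03-14T03:43:15.382Z","avatar_url":"https://github.com/windrises.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"A summary of my recent work, with the code and the slides (ppt) I prepared.\n\nThe source data for the sentiment analysis contest is the IMDb dataset compiled by Stanford. The original data is 25k train + 25k test + 50k unsup; the course version drops the 25k test set and instead holds out 5k examples from train as the test set.\n\nI ran a few simple experiments here. Accuracy is around 90% in all of them. I did no further hyperparameter tuning, but I doubt it would bring a large improvement (the code has not been cleaned up carefully, so it may be a bit messy):\n\n- TF-IDF is the simplest and fastest model. The classifier can be SVM, LogReg, NaiveBayes, MLP, and so on, though the results should be about the same.\n\n- Deep_learning: a CNN model written with Keras. For the word representation I simply used texts_to_sequences, kept only some of the most frequent words to reduce dimensionality, and truncated the sentences; all of these are possible directions for improvement. Another option is to switch to word embeddings (such as Word2vec or GloVe) plus a deep model. Besides CNN, Keras also provides LSTM, CNN+LSTM, Bi_LSTM and other models. I tried them all, and because the dataset is fairly small the results differ little. Adding Attention is of course also worth trying.\n\n- Sentence_embedding embeds each sentence directly. The advantages are that the high-dimensionality problem mentioned above does not arise, it works better than simply averaging the vectors of all words in a sentence, and it can make use of the unsup data. SVM, LR, MLP, etc. can then be used directly as the classifier.\n\n- N-gram mainly follows the fastText approach; in experiments, setting n to 2 worked better.\n\n## Some research:\n\nState-of-the-art algorithms reach an error rate of about 5%, but they either come without source code or are hard to reproduce. Also, only a few papers make use of the unsup data. For example:\n\n- [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146), 2018, error rate of only 4.6% (https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts The code is poorly maintained. Without a GPU, a single pre-training epoch alone takes more than ten hours, so I gave up)\n\n- [Adversarial Training Methods for Semi-Supervised Text Classification](https://arxiv.org/abs/1605.07725), 2017, proposes the Adversarial model; it is semi-supervised, uses the unsup data, and provides source code written in TF together with the parameters (https://github.com/tensorflow/models/tree/master/research/adversarial_text 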
I tried it and found that pre-training needs 100k steps, so I gave up)\n\n- [Supervised and semi-supervised text categorization using LSTM for region embeddings](https://arxiv.org/abs/1602.02373), 2016; the authors published a series of papers proposing the ConText model, but the code is written in C++ and CUDA (https://github.com/riejohnson/ConText)\n\n- [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107), 2017, proposes the CoVe model (https://github.com/salesforce/cove)\n\n- [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053), 2014, co-authored by Mikolov, who is also an author of Word2vec. It proposes Doc2vec for sentence embedding and provides C++ code; the paper reports an error rate of 7.4%. Another paper says that using a two-layer MLP as the classifier raises accuracy to 94.5%. However, others who reproduced it with the Python gensim package found that, however they tuned it, accuracy topped out at about 90% (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb)\n\nThere are also some papers devoted specifically to sentiment analysis on the IMDb dataset, for example:\n\n- Sentiment Analysis with Deeply Learned Distributed Representations of Variable Length Texts, from Stanford; the paper says sentence embedding plus a two-layer MLP reaches 94.5% accuracy\n\n- Deep learning for sentiment analysis of movie reviews, also from Stanford; I did not read it closely\n\nThere are also some survey-style blog posts:\n\nhttps://blog.paralleldots.com/data-science/breakthrough-research-papers-and-models-for-sentiment-analysis/\n\nhttp://nlpprogress.com/english/sentiment_analysis.html\n\nand so on\n\n---\n\nIn class I looked at the other groups' methods; everyone's ideas and approaches were in fact much the same. Quite a few groups put a lot of emphasis on preprocessing and ensemble learning, which is worth learning from.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwindrises%2Fml_contest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwindrises%2Fml_contest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwindrises%2Fml_contest/lists"}