{"id":17263720,"url":"https://github.com/jrzaurin/nlp-stuff","last_synced_at":"2025-08-22T07:17:49.925Z","repository":{"id":79954791,"uuid":"93398531","full_name":"jrzaurin/nlp-stuff","owner":"jrzaurin","description":"A bit of everything about text and nlp [IN PROGRESS]","archived":false,"fork":false,"pushed_at":"2021-11-05T10:19:21.000Z","size":3164,"stargazers_count":28,"open_issues_count":0,"forks_count":10,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-14T08:48:15.732Z","etag":null,"topics":["deep-learning","python3","text-classification","text-mining","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jrzaurin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-06-05T11:48:42.000Z","updated_at":"2023-12-28T19:15:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"db4385b5-0e1c-48b0-9290-8d6886285a4d","html_url":"https://github.com/jrzaurin/nlp-stuff","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jrzaurin/nlp-stuff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrzaurin%2Fnlp-stuff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrzaurin%2Fnlp-stuff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrzaurin%2Fnlp-stuff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrzaurin%2Fnlp-stuff/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jrzaurin","download_url":"https://codeload.github.com/jrzaurin/nlp-stuff/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrzaurin%2Fnlp-stuff/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271601988,"owners_count":24788165,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","python3","text-classification","text-mining","text-processing"],"created_at":"2024-10-15T07:57:20.449Z","updated_at":"2025-08-22T07:17:49.892Z","avatar_url":"https://github.com/jrzaurin.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg width=\"450\" src=\"docs/figures/nlp_stuff_logo.png\"\u003e\n\u003c/p\u003e\n\n# NLP stuff\n\nI add here stuff related to NLP.\n\nWithin each directory there should be a README file to help guiding you through the code. So far, this is what I have included:\n\n\n1. 
`text_classification_DL_battle`\n\n\tAmazon Reviews classification (score prediction) using Hierarchical Attention Networks ([Zichao Yang, et al., 2016](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf)), BERT models from [Hugging Face's transformers library](https://github.com/huggingface/transformers) and the [Fastai Text API](https://fastai1.fast.ai/text.html).\n\n2. `text_classification_HAN`\n\n\tAmazon Reviews classification (score prediction) using Hierarchical Attention Networks ([Zichao Yang, et al., 2016](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf)). I have also used a number of Dropout mechanisms from the work *Regularizing and Optimizing LSTM Language Models* ([Stephen Merity, Nitish Shirish Keskar and Richard Socher, 2017](https://arxiv.org/pdf/1708.02182.pdf)). The companion Medium post can be found [here](https://towardsdatascience.com/predicting-amazon-reviews-scores-using-hierarchical-attention-networks-with-pytorch-and-apache-5214edb3df20).\n\n\n3. `text_classification_without_DL`\n\n\tPredicting the review score for Amazon reviews (Shoes, Clothes and Jewelry)\n\tusing tf-idf, LDA and [EnsembleTopics](https://github.com/lmcinnes/enstop)\n\talong with `lightGBM` and `hyperopt` for the final classification and\n\thyper-parameter optimization. I placed special emphasis on the text\n\tpreprocessing.\n\n4. `text_classification_EDA`\n\n\tAmazon Reviews classification using tf-idf and *EDA: Easy Data Augmentation\n\tTechniques for Boosting Performance on Text Classification Tasks* ([Jason Wei\n\tand Kai Zou 2019](https://github.com/jasonwei20/eda_nlp)) along with\n\t`lightGBM` and `hyperopt` for the final classification and hyper-parameter\n\toptimization. Following the philosophy of the previous exercise, I placed\n\tsome emphasis on the text preprocessing, in particular on the use of certain\n\ttokenizers.\n\n\n5. 
`rnn_character_tagging`\n\n\tTagging at character level using RNNs with the aim of distinguishing, for example, different programming languages or writing styles. The code here is based on a [post](http://nadbordrozd.github.io/blog/2017/06/03/python-or-scala/) by [Nadbor](https://www.linkedin.com/in/nadbor-drozd-12316063/).\n\n\n6. `textrank`\n\n\tThe simplest text summarization approach using the `Pagerank` algorithm via\n\tthe [networkx](https://networkx.github.io/documentation/networkx-1.10/index.html)\n\tpackage and comparing the results with the\n\tproper `Textrank` implementation *Variations of the Similarity Function of TextRank for Automated Summarization* ([Federico Barrios et al., 2016](https://github.com/summanlp/textrank)).\n\n\n7. `text_classification_CNN_with_tf`\n\n\tThis is a dir with very old Tensorflow code using the 20_newsgroup dataset.\n\tMy aim back then was simply to illustrate 3 different ways of building\n\ta Convolutional neural network for text classification using Tensorflow.\n\tLast time I checked (October 2019) the code still ran, but if you run it\n\tyou will get every possible warning to upgrade. This dir is mostly for me\n\tto keep track of the things I do more than anything else.\n\n\nFor any comments or suggestions, please email jrzaurin@gmail.com or, even better, open an issue.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrzaurin%2Fnlp-stuff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjrzaurin%2Fnlp-stuff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrzaurin%2Fnlp-stuff/lists"}