Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kavgan/nlp-in-practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
gensim machine-learning natural-language-processing nlp text-classification text-mining tf-idf word2vec
- Host: GitHub
- URL: https://github.com/kavgan/nlp-in-practice
- Owner: kavgan
- Created: 2018-01-28T07:18:36.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-12-02T18:46:45.000Z (about 4 years ago)
- Last Synced: 2025-01-12T01:04:44.797Z (10 days ago)
- Topics: gensim, machine-learning, natural-language-processing, nlp, text-classification, text-mining, tf-idf, word2vec
- Language: Jupyter Notebook
- Homepage: http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog
- Size: 91.8 MB
- Stars: 1,157
- Watchers: 51
- Forks: 794
- Open Issues: 4
Metadata Files:
- Readme: README.md
README
# NLP-IN-PRACTICE
Use these NLP, Text Mining and Machine Learning code samples and tools to solve real world text data problems.

## Notebooks / Source
Links in the first column take you to the subfolder/repository with the source code.
| Task | Related Article | Source Type | Description
|---|---| ---| --- |
| [Large Scale Phrase Extraction](https://github.com/kavgan/phrase-at-scale) | [phrase2vec article](http://kavita-ganesan.com/how-to-generate-phrase-embeddings-using-word2vec-in-3-easy-steps/) | python script | Extract phrases for large amounts of data using PySpark. Annotate text using these phrases or use the phrases for other downstream tasks.
| [Word Cloud for Jupyter Notebook and Python Web Apps](https://github.com/kavgan/word_cloud) | [word_cloud article](http://kavita-ganesan.com/word-cloud-for-data-scientists/#.W867cBNKj65) | python script + notebook | Visualize top keywords using word counts or TF-IDF
| [Gensim Word2Vec (with dataset)](word2vec/) | [word2vec article](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/) | notebook | How to work correctly with Word2Vec to get desired results
| [Reading files and word count with Spark](spark_wordcount/) | [spark article](http://kavita-ganesan.com/reading-csv-and-json-files-in-spark/) | python script | How to read files of different formats using PySpark with a word count example
| [Extracting Keywords with TF-IDF and SKLearn (with dataset)](tf-idf) | [tfidf article](http://kavita-ganesan.com/extracting-keywords-from-text-with-tf-idf-and-pythons-scikit-learn/#.W2TlD9hKhhE) | notebook | How to extract interesting keywords from text using TF-IDF and Python's scikit-learn
| [Text Preprocessing](text-pre-processing) | [text preprocessing article](http://kavita-ganesan.com/getting-started-with-text-preprocessing/#.XHa4-ZNKhuU) | notebook | A few code snippets on how to perform text preprocessing. Includes stemming, noise removal, lemmatization and stop word removal.
| [TFIDFTransformer vs. TFIDFVectorizer](tfidftransformer/) | [tfidftransformer and tfidfvectorizer usage article](http://kavita-ganesan.com/how-to-use-tfidftransformer-tfidfvectorizer-and-whats-the-difference/)| notebook | How to use TFIDFTransformer and TFIDFVectorizer correctly, how the two differ, and when to use each.
| [Accessing Pre-trained Word Embeddings with Gensim](pre-trained-embeddings/) |[Pre-trained word embeddings article](http://kavita-ganesan.com/easily-access-pre-trained-word-embeddings-with-gensim/#.XQCYP9NKhhE)| notebook | How to access pre-trained GloVe and Word2Vec Embeddings using Gensim and an example of how these embeddings can be leveraged for text similarity
| [Text Classification in Python (with news dataset)](text-classification/) |[Text classification with Logistic Regression article](https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.XT95_5NKhgc)| notebook | Get started with text classification. Learn how to build and evaluate a text classifier for news classification using Logistic Regression.
| [CountVectorizer Usage Examples](CountVectorizer/) |[How to Correctly Use CountVectorizer? An In-Depth Look article](https://kavita-ganesan.com/how-to-use-countvectorizer/#.XeqMhpNKhhE)| notebook | Learn how to maximize the use of CountVectorizer such that you are not just computing counts of words, but also preprocessing your text data appropriately as well as extracting additional features from your text dataset.
| [HashingVectorizer Examples](hashingvectorizer/) |[HashingVectorizer Vs. CountVectorizer article](https://kavita-ganesan.com/hashingvectorizer-vs-countvectorizer/#.XeqMhpNKhhP)| notebook | Learn the differences between HashingVectorizer and CountVectorizer and when to use which.
| [CBOW vs. SkipGram](cbow_skipgram_subword/) |[Word2Vec: A Comparison Between CBOW, SkipGram & SkipGramSI article](https://kavita-ganesan.com/comparison-between-cbow-skipgram-subword/#.X8fgvxNKiso)| notebook | A quick comparison of the three embedding architectures.

# Notes
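As a rough illustration of the TF-IDF keyword extraction idea covered by the notebooks listed above, here is a minimal, dependency-free sketch. It is not code from this repository: the function name `tfidf_keywords`, the whitespace tokenizer, and the smoothed IDF weighting are assumptions chosen for brevity, and the notebooks themselves rely on scikit-learn.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_k]])
    return results


docs = ["apple banana apple", "banana cherry"]
print(tfidf_keywords(docs, top_k=1))  # [['apple'], ['cherry']]
```

Terms that are frequent in one document but rare across the collection rank highest, which is why `banana` (present in both documents) loses to `apple` and `cherry`. On real text, stop word removal and proper tokenization would normally come before scoring.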
- For more articles, please [see this list](http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog).
- If you would like to receive articles via email, [subscribe to my mailing list](https://kavita-ganesan.com/subscribe/#.XTThjZNKhgc).

# Contact
This repository is maintained by [Kavita Ganesan](https://kavita-ganesan.com/about-me/#.XTTh6ZNKhgc). Connect with me on [LinkedIn](https://www.linkedin.com/in/kavita-ganesan/) or [Twitter](https://twitter.com/kav_gan).