https://github.com/piskvorky/topic_modeling_tutorial
Instructions & code for the EuroPython 2014 training session "Topic Modeling for Fun and Profit"
- Host: GitHub
- URL: https://github.com/piskvorky/topic_modeling_tutorial
- Owner: piskvorky
- Created: 2014-07-14T08:33:14.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2014-08-08T07:59:56.000Z (almost 10 years ago)
- Last Synced: 2024-01-21T02:44:50.043Z (5 months ago)
- Language: Python
- Homepage: https://ep2014.europython.eu/en/schedule/sessions/90/
- Size: 429 KB
- Stars: 107
- Watchers: 13
- Forks: 52
- Open Issues: 2
Metadata Files:
- Readme: README.md
This repository contains code and instructions for my EuroPython 2014 tutorial, **"Topic Modeling for Fun and Profit"**.
https://github.com/piskvorky/topic_modeling_tutorial
----------------------------------------------------

Tutorial setup
==============

Install the following packages **before the training starts**:
```bash
$ pip install six cython numpy scipy "ipython[notebook]"
$ pip install nltk gensim pattern requests textblob
$ python -m textblob.download_corpora lite
```

If you run into problems, follow the installation instructions of the specific package (e.g. [scipy instructions](http://www.scipy.org/install.html)), ask on its mailing list (don't forget to report your operating system and the actual error message) or [contact me](mailto:[email protected]) in advance. **There won't be much time for troubleshooting dependencies during the training itself!**
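Before reporting a problem, it can help to see which packages are actually importable. The following is a minimal sketch, assuming Python 3.4 or newer; the `REQUIRED_MODULES` list and the `find_missing` helper are illustrative names, not part of this repository. It only checks that each module can be located, not that it works correctly.

```python
import importlib.util

# Import names for the packages installed above. The import name can differ
# in case from the pip name (cython -> Cython, ipython -> IPython).
REQUIRED_MODULES = [
    "six", "Cython", "numpy", "scipy", "IPython",
    "nltk", "gensim", "pattern", "requests", "textblob",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All tutorial dependencies were found.")
```

Any name it prints is a good starting point for the error report mentioned above.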
For Windows users, it may be easier to use `conda` to manage the dependencies. Download miniconda from [here](http://conda.pydata.org/miniconda.html), install it, then run:
```bash
$ conda create -n topic_modeling six cython numpy scipy ipython-notebook nltk requests pip
$ source activate topic_modeling  # in the Windows command prompt, use: activate topic_modeling
$ pip install nltk pattern gensim textblob
$ python -m textblob.download_corpora lite
```

Then **download the corpora we'll be using for topic modeling and indexing**:
```bash
$ python download_data.py ./data
```

Alternatively, download these two files yourself: [14MB](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz), [95MB](http://dumps.wikimedia.org/simplewiki/20140623/simplewiki-20140623-pages-articles.xml.bz2). There is no need to unpack them; just copy them into the `./data/` directory of this repository.
You will need about **700MB of free disk space** to run all the tutorial examples fully.
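To confirm that the data is in place, a small check like the one below may be useful. It is only a sketch: the file names are taken from the two download URLs above, on the assumption that `download_data.py` stores the files under the same names, and the helpers `missing_files` and `has_enough_space` are hypothetical. Whether the 700MB figure includes the downloads themselves is not stated, so the threshold is a rough guide.

```python
import os
import shutil

# File names taken from the two download URLs above (assumed unchanged on disk).
EXPECTED_FILES = [
    "20news-bydate.tar.gz",
    "simplewiki-20140623-pages-articles.xml.bz2",
]
REQUIRED_FREE_BYTES = 700 * 1024 ** 2  # "about 700MB" from the text above


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


def has_enough_space(path, required=REQUIRED_FREE_BYTES):
    """Return True if the filesystem holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    # Run from the root of this repository.
    print("Missing files:", missing_files("./data") or "none")
    print("Enough free disk space:", has_enough_space("."))
```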
Check that everything works correctly by opening and running the first tutorial notebook:
```bash
$ ipython notebook '0 - Intro & Setup.ipynb'
```

Congratulations!

Objectives
==========

The tutorial shows how to
* **process very large corpora efficiently**, using practical NLP techniques,
* **automatically extract themes (topics)** from them, using unsupervised topic modeling,
* **index documents** for retrieval and
* **run semantic similarity queries** (*"Give me ten documents that are thematically the most similar to this one."*).

The focus is on building practical applications and engineering, rather than on the theory behind topic modeling and the math itself.
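To give a flavour of what processing a large corpus efficiently can mean in practice, here is a minimal sketch of streaming documents out of a `.tar.gz` archive one at a time with Python generators, so the whole corpus never has to sit in memory. It uses only the standard library and is not the code from the tutorial notebooks; the function names `iter_documents` and `iter_tokens` and the naive whitespace tokenization are assumptions made for illustration.

```python
import tarfile


def iter_documents(archive_path):
    """Yield (member name, text) for each regular file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = archive.extractfile(member).read()
            # latin1 maps every byte to a character, so decoding never fails
            yield member.name, raw.decode("latin1")


def iter_tokens(archive_path):
    """Yield one list of lowercase, whitespace-separated tokens per document."""
    for _name, text in iter_documents(archive_path):
        yield text.lower().split()
```

With the data downloaded as described above, `for tokens in iter_tokens("./data/20news-bydate.tar.gz"): ...` would visit the documents one by one. Only a single document is held in memory at any time, which is the property that matters for large corpora.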

Target audience
===============

This training assumes you are a reasonably experienced developer who knows at least the Python basics (dicts, lists, tuples, comprehensions). Knowing NumPy arrays and Python generators/iterators is a plus, but we'll go over what we need.
The same goes for the relevant NLP (natural language processing) and IR (information retrieval) concepts, such as lemmatization, collocations and unsupervised machine learning (clustering): I'll cover what we need during the training.

How it works
============

Get this repository either with `git clone https://github.com/piskvorky/topic_modeling_tutorial.git`, or by downloading and unzipping [this ZIP file](https://github.com/piskvorky/topic_modeling_tutorial/archive/master.zip).
The training materials are a set of IPython notebooks.
To run the notebooks interactively, type in shell:
```bash
$ ipython notebook
```

while in the folder of this repository.
This will open a new browser window, listing all the notebooks. Start from the first one, *"0 - Intro & Setup"*, and execute each cell in turn by pressing Shift+Enter.
You can also view the notebooks non-interactively (read-only mode), as HTML in your browser (no Python needed):

* [0 - Intro & Setup](http://radimrehurek.com/topic_modeling_tutorial/0%20-%20Intro%20%26%20Setup.html)
* [1 - Streamed Corpora](http://radimrehurek.com/topic_modeling_tutorial/1%20-%20Streamed%20Corpora.html)
* [2 - Topic Modeling](http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html)
* [3 - Indexing and Retrieval](http://radimrehurek.com/topic_modeling_tutorial/3%20-%20Indexing%20and%20Retrieval.html)

These static HTML notebooks also contain rendered cell output, so you can compare your results to mine.
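The HTML links above follow a simple pattern: the notebook title, percent-encoded, with `.html` appended to a common base URL. As a small illustration (the `html_url` helper is hypothetical and not part of this repository), the links can be rebuilt from the titles like this:

```python
from urllib.parse import quote

BASE_URL = "http://radimrehurek.com/topic_modeling_tutorial/"
NOTEBOOKS = [
    "0 - Intro & Setup",
    "1 - Streamed Corpora",
    "2 - Topic Modeling",
    "3 - Indexing and Retrieval",
]


def html_url(title):
    """Return the URL of the static HTML rendering of a notebook title."""
    # quote() percent-encodes spaces as %20 and "&" as %26
    return BASE_URL + quote(title) + ".html"


if __name__ == "__main__":
    for title in NOTEBOOKS:
        print(html_url(title))
```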
------
(c) 2014 Radim Řehůřek