https://github.com/urschrei/lovecraft
A basic NLTK demo, using the collected works of H. P. Lovecraft as a corpus
classification corpus frequency-count lovecraft matplotlib nltk
- Host: GitHub
- URL: https://github.com/urschrei/lovecraft
- Owner: urschrei
- Created: 2013-07-22T15:00:12.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2017-10-13T18:35:01.000Z (over 7 years ago)
- Last Synced: 2025-03-27T12:23:46.780Z (3 months ago)
- Topics: classification, corpus, frequency-count, lovecraft, matplotlib, nltk
- Language: Jupyter Notebook
- Size: 6.27 MB
- Stars: 14
- Watchers: 3
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Classifying and ranking text using NLTK and The Nameless Horror
This is a small demo showing basic NLTK functionality (tokenizing, classifying, frequency counting), using [The Collected Works of H.P. Lovecraft](http://gutenberg.net.au/ebooks06/0600031h.html) as a corpus.
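As a rough illustration of the three steps named above (tokenizing, classifying tokens by part of speech, and frequency counting), here is a minimal sketch. It is not the repository's code: to stay runnable without NLTK's model downloads it uses a regex tokenizer and a few toy suffix rules as stand-ins, where an NLTK-based script would more likely call `nltk.word_tokenize`, `nltk.pos_tag`, and `nltk.FreqDist`. The function names, the suffix rules, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Toy suffix rules standing in for a real part-of-speech tagger.
# Assumption: these are illustrative only and far cruder than NLTK's tagger.
SUFFIX_RULES = [
    (re.compile(r".*ly$"), "RB"),                  # adverb
    (re.compile(r".*(ous|ful|less|ish)$"), "JJ"),  # adjective
    (re.compile(r".*ing$"), "VBG"),                # gerund / present participle
]
DEFAULT_TAG = "NN"  # anything unmatched is treated as a noun


def tokenize(text):
    """Split text into lowercase word tokens (stand-in for nltk.word_tokenize)."""
    return re.findall(r"[a-z']+", text.lower())


def tag(tokens):
    """Pair each token with a Penn Treebank-style tag (stand-in for nltk.pos_tag)."""
    tagged = []
    for token in tokens:
        for pattern, label in SUFFIX_RULES:
            if pattern.match(token):
                tagged.append((token, label))
                break
        else:
            # No rule matched, so fall back to the default tag.
            tagged.append((token, DEFAULT_TAG))
    return tagged


def count_by_tags(tagged, wanted_tags):
    """Count tokens whose tag is in wanted_tags (stand-in for nltk.FreqDist)."""
    return Counter(token for token, label in tagged if label in wanted_tags)


sample = "The nameless horror crept slowly through the hideous, nameless dark."
adjectives = count_by_tags(tag(tokenize(sample)), {"JJ"})
print(adjectives.most_common(3))
```

Passing a different set to `count_by_tags`, for example `{"RB"}`, re-counts the already tagged tokens without tagging them again.
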
The repository's code ought to be fairly self-explanatory, however:

- The script will write a file, `results.pickle`, to your current working directory upon its first run, because classification is quite slow. This allows you to tune the [tag set](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to be used for frequency counting without having to wait for re-classification each time.
- There's a Jupyter notebook for interactive exploration

## Requirements
- Requests
- BeautifulSoup4
- NLTK
- Matplotlib >= 1.5.x

And for the Notebook:
- Pandas
- Jupyter

## License
MIT, copyright Stephan Hügel 2013
