
https://github.com/howardyclo/digestant

Modules for effectively digesting data from Twitter and Reddit using ML, NLP and statistics.

machine-learning named-entity-recognition natural-language-processing reddit summarization topic-modeling



Host: GitHub
URL: https://github.com/howardyclo/digestant
Owner: howardyclo
Created: 2017-07-18T09:15:06.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2022-12-08T00:44:05.000Z (about 2 years ago)
Last Synced: 2023-03-04T03:54:27.256Z (almost 2 years ago)
Topics: machine-learning, named-entity-recognition, natural-language-processing, reddit, summarization, topic-modeling
Language: Jupyter Notebook
Homepage:
Size: 165 MB
Stars: 8
Watchers: 5
Forks: 5
Open Issues: 16
Metadata Files:
- Readme: README.md


README

# Digestant
For more, see the [introduction slides](https://docs.google.com/presentation/d/18flIvwADXwQum-8xY6I3nSSQmWzqJCRrtQa9y23zVcM/edit?usp=sharing), [project survey](https://hackmd.io/s/rkh_rJY4-), and [demo](https://github.com/YuChunLOL/Digestant/blob/master/demo/demo_howard.ipynb).

## Dev Environment
- Python 3.x

## Setup
- We recommend creating a new virtual environment for this project.
- Install the Python packages listed in `requirements.txt`: `$ pip install -r requirements.txt`.
- Download NLTK data: `$ python -m nltk.downloader all`.
- Download the spaCy **`en_core_web_md`** model: `$ python -m spacy download en_core_web_md`.
- Download the Stanford NER model (the `stanford-ner-xxxx-xx-xx` zip file):
    1. Download it from [the official website](https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip).
    2. Unzip it, place the `stanford-ner-xxxx-xx-xx` folder in the project root, and rename the folder to **`stanford-ner/`**.
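
The setup steps above can be sanity-checked with a short script. This is a minimal sketch, not part of the repository: `check_setup` is a hypothetical helper, and it assumes that the spaCy model is installed as an importable package named `en_core_web_md` and that the script is run from the project root.

```python
import importlib.util
import sys
from pathlib import Path


def check_setup(project_root=".", modules=("nltk", "spacy", "en_core_web_md")):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: the module names and folder name are taken from the
    setup steps, but this function is not part of the Digestant repository.
    """
    problems = []

    if sys.version_info.major < 3:
        problems.append("Python 3.x is required")

    # find_spec returns None when a top-level module cannot be found.
    for module in modules:
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing Python module: {module}")

    # The unzipped Stanford NER folder is expected at <project root>/stanford-ner/.
    if not (Path(project_root) / "stanford-ner").is_dir():
        problems.append("stanford-ner/ folder not found in project root")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

The script only checks that files and modules are present; it does not verify that the NLTK data download completed or that the Stanford NER model loads.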

## Usage
1. Create Twitter and Reddit accounts, and follow the accounts that you are interested in.
2. Copy `config-sample.json` to `config.json` in the same directory, then fill in the keys in `config.json`. (To get the keys, go to your Twitter/Reddit developer console and create an application.)
3. Crawl the Twitter data by running the script `crawlers/twitter_crawler.py`. By default, it saves the crawled data to `dataset/twitter/`.
4. You can customize the data entities by modifying `domains.json` and `types.json`. (See the [demo](https://github.com/YuChunLOL/Digestant/blob/master/demo/demo_howard.ipynb).)
5. Currently, you can run [`demo/demo_howard.ipynb`](https://github.com/YuChunLOL/Digestant/blob/master/demo/demo_howard.ipynb) or the other notebooks to see the daily digest.
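
Step 2 above can be scripted as follows. This is a rough sketch under stated assumptions: `prepare_config` and `find_empty_values` are hypothetical helpers that are not part of the repository, and the sketch assumes `config-sample.json` is a JSON object whose unfilled keys are empty strings or `null`. The actual key names are not shown here, so the code walks whatever structure it finds.

```python
import json
import shutil
from pathlib import Path


def find_empty_values(value, prefix=""):
    """Recursively collect dotted key paths whose value is an empty string or None."""
    if isinstance(value, dict):
        empty = []
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            empty.extend(find_empty_values(child, path))
        return empty
    return [prefix] if value in ("", None) else []


def prepare_config(directory="."):
    """Copy config-sample.json to config.json if needed and list unfilled keys.

    Hypothetical helper; an existing config.json is never overwritten.
    """
    directory = Path(directory)
    config = directory / "config.json"
    if not config.exists():
        shutil.copyfile(directory / "config-sample.json", config)
    with config.open(encoding="utf-8") as f:
        return find_empty_values(json.load(f))


if __name__ == "__main__":
    for key in prepare_config():
        print(f"config.json: '{key}' still needs a value")
```

Running it from the directory that contains `config-sample.json` creates `config.json` on the first run and prints any keys that still look unfilled. The keys themselves must still be filled in by hand.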