Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/darkestfloyd/nlp_pipline
Demonstration of reproducible sentiment analysis pipeline using Airflow
- Host: GitHub
- URL: https://github.com/darkestfloyd/nlp_pipline
- Owner: darkestfloyd
- Created: 2019-10-27T20:33:51.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2019-10-27T20:57:30.000Z (about 5 years ago)
- Last Synced: 2024-11-23T16:29:04.463Z (2 months ago)
- Language: Python
- Size: 17.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Reproducible NLP pipeline using Airflow
Demonstration of reproducible sentiment analysis pipeline using Airflow

### Set up airflow and code
To run, make sure airflow is installed. You will also need the NLTK package with vader_lexicon. To download vader_lexicon, run:
`python -c "import nltk; nltk.download('vader_lexicon')"`
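
For orientation, here is a minimal sketch of how VADER scoring works in NLTK once the lexicon is downloaded. The function names `score_text` and `label_from_compound` are hypothetical and not taken from this repo, and the 0.05 cutoff is a commonly used convention, not necessarily what the pipeline applies.

```python
def score_text(text):
    """Return VADER polarity scores (neg, neu, pos, compound) for text."""
    # Imported here so the helper below stays usable without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()  # needs the vader_lexicon download
    return analyzer.polarity_scores(text)


def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse label (assumed cutoff)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Calling `label_from_compound(score_text(some_text)["compound"])` would give a coarse label for one document.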
```
# in a terminal
git clone git@github.com:nischalchand/nlp_pipline.git
# the clone directory is named after the repository (nlp_pipline)
mv nlp_pipline ~/airflow

# init airflow database
airflow initdb

# in new terminal
airflow webserver -p 8080

# in another new terminal
airflow scheduler
```

### Set up input
An `input.tsv` file is included in the repo. You can use [python-edgar](https://pypi.org/project/python-edgar/) to get more data.

To keep only the 10-K filings from the downloaded EDGAR files, run in a terminal:
```
cd
grep -h 10-K * > 10k.tsv
cp 10k.tsv ~/airflow/input.tsv
```

Go to `localhost:8080` in a web browser, then turn on and trigger "nlp_pipeline". The number of branches in the DAG depends on the number of rows in `input.tsv`.
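
One way a DAG can grow a branch per input row is sketched below. This is not the repo's actual DAG file: the names `read_rows`, `task_ids_for`, `build_dag`, `process_row`, and the DAG id `nlp_pipeline_sketch` are all hypothetical, the tab delimiter is an assumption about `input.tsv`, and the imports assume Airflow 1.10-style module paths (the version family that provides `airflow initdb`).

```python
import csv
from datetime import datetime


def read_rows(path, delimiter="\t"):
    """Return the non-empty rows of a delimited file as lists of strings."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter=delimiter) if row]


def task_ids_for(rows):
    """One task id per input row: sentiment_0, sentiment_1, ..."""
    return ["sentiment_{}".format(i) for i in range(len(rows))]


def build_dag(rows, process_row):
    """Build a DAG with one branch per row (assumes Airflow 1.10 imports)."""
    # Imported lazily so the helpers above work without Airflow installed.
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        "nlp_pipeline_sketch",  # hypothetical DAG id
        start_date=datetime(2019, 10, 1),
        schedule_interval=None,  # run only when triggered manually
    )
    start = DummyOperator(task_id="start", dag=dag)
    for task_id, row in zip(task_ids_for(rows), rows):
        branch = PythonOperator(
            task_id=task_id,
            python_callable=process_row,  # hypothetical per-row callable
            op_kwargs={"row": row},
            dag=dag,
        )
        start >> branch  # each row becomes its own branch off "start"
    return dag
```

In a real DAG file the result would be assigned at module level, for example `dag = build_dag(read_rows("input.tsv"), process_row)`, so that the scheduler can discover it. Because the rows are read when the file is parsed, changing `input.tsv` changes the number of branches on the next parse.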
### Output information
An `input_complete.tsv` file is created in the same directory, which is a copy of `input.tsv` with an additional
column `sentiment_file` for the path of the sentiment file. Each sentiment file is stored in `out_files`, with the file format
`_.txt`.
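
The relationship between `input.tsv` and `input_complete.tsv` can be illustrated with a short sketch. It is an approximation under stated assumptions, not the repo's code: the input is assumed to be tab-delimited with no header row, and `sentiment_path` uses a made-up `row_<n>.txt` naming scheme, since the real file name format is not reproduced here.

```python
import csv
import os


def sentiment_path(index, out_dir="out_files"):
    """Hypothetical naming scheme for a row's sentiment file."""
    return os.path.join(out_dir, "row_{}.txt".format(index))


def write_input_complete(src, dst, delimiter="\t"):
    """Copy src to dst, appending each row's sentiment file path as a last field.

    Returns the number of rows written. Assumes src has no header row.
    """
    count = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter=delimiter, lineterminator="\n")
        for row in csv.reader(fin, delimiter=delimiter):
            if not row:  # skip blank lines
                continue
            writer.writerow(row + [sentiment_path(count)])
            count += 1
    return count
```

For example, `write_input_complete("input.tsv", "input_complete.tsv")` would produce a copy in which every row carries one extra trailing field pointing into `out_files`.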