https://github.com/quving/newsminer
A service that analyzes news articles stored in https://newsbox.quving.com.
- Host: GitHub
- URL: https://github.com/quving/newsminer
- Owner: Quving
- Created: 2020-03-09T17:11:00.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T03:48:37.000Z (over 2 years ago)
- Last Synced: 2025-01-30T23:48:56.573Z (5 months ago)
- Language: Python
- Size: 64.5 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 7
# Newsminer
## Description
As the name suggests, this repository contains scripts for analyzing news articles.
For now, only German texts are analyzed. These texts are provided by the [Newsbox-API](https://newsbox.quving.com), which aggregates fresh news at regular intervals.
The goal is to obtain an LDA model that is capable of clustering the news.
## Result and Demo
The visualization of the LDA model can be found at https://newsmap.quving.com.
## Setup
### Installation
Install the required dependencies with the following steps.
1. ```virtualenv -p $(which python3.6) venv```
2. ```source venv/bin/activate```
3. ```pip install -r requirements.txt```
### Authentication
Credentials are required to retrieve articles stored in the Newsbox-API.
You can obtain them by mailing me ([email protected]).
```
curl -X POST \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"username": "string", "password": "string"}' \
  https://newsbox.quving.com/auth/token/
```
Example Response:
```json
{
"refresh": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI0NiJ10.eyJ0b2tlbl90eXBlIjoicmVmcmVzaCIsI2J...",
"access": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI2NiJ10.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZX..."
}
```
Now set the environment variable so that the scripts can use the access token.
```
export NEWSMINER_AUTH_TOKEN='eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI2NiJ10.eyJ0b2tlbl90eXBlIjoiYWNjZXNzIiwiZX...'
```
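The same token workflow can be sketched in Python with only the standard library. This is an illustration under assumptions, not code from the repository: the helper names `build_token_request`, `extract_access_token` and `token_from_env` are hypothetical, and the request is only built here, not sent.

```python
import json
import os
import urllib.request

TOKEN_URL = "https://newsbox.quving.com/auth/token/"  # endpoint from the curl example


def build_token_request(username, password):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=payload,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def extract_access_token(response_body):
    """Return the 'access' field of a JSON body shaped like the example response."""
    return json.loads(response_body)["access"]


def token_from_env():
    """Read the token that was exported as NEWSMINER_AUTH_TOKEN."""
    token = os.environ.get("NEWSMINER_AUTH_TOKEN")
    if not token:
        raise RuntimeError("NEWSMINER_AUTH_TOKEN is not set")
    return token
```

Passing the request to `urllib.request.urlopen` would perform the actual call; the response body could then be handed to `extract_access_token`.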
## Documentation
### Train German POS-Tagger
The German POS tagger is required for lemmatization and stemming, a preprocessing step that is necessary before an LDA model can be trained.
- First, download the training set. I recommend '[tiger_release_aug07.corrected.16012013.conll09](https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/download/start.html)', which is provided by the University of Stuttgart.
- Once downloaded, create a directory 'data' and move the downloaded file into it.
- Finally, execute ```python train_tagger_de.py```. It stores the tagger in pickle format in ```[repo]/artifacts/tagger/```.
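To make the idea of training and pickling a tagger concrete, here is a minimal stand-in that uses only the standard library. It is not the repository's `train_tagger_de.py`, which likely trains a stronger classifier; this sketch trains a simple most-frequent-tag baseline. It assumes the CoNLL-2009 column layout (word form in the second column, POS tag in the fifth), and all function names are hypothetical.

```python
import pickle
from collections import Counter, defaultdict


def read_conll09(lines):
    """Yield sentences as lists of (word, pos) pairs from CoNLL-2009 formatted lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append((cols[1], cols[4]))  # FORM and POS columns
    if sentence:
        yield sentence


def train_unigram_tagger(tagged_sentences):
    """Baseline model: map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_name in sentence:
            counts[word][tag_name] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default_tag="NN"):
    """Tag words with the model; unknown words fall back to default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


def save_tagger(model, path):
    """Serialize the model with pickle so it can be reloaded later."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


# Made-up lines in the CoNLL-2009 column layout, not taken from the TIGER corpus.
sample = [
    "1_1\tDer\tder\t_\tART",
    "1_2\tHund\tHund\t_\tNN",
    "1_3\tbellt\tbellen\t_\tVVFIN",
    "",
]
model = train_unigram_tagger(read_conll09(sample))
print(tag(model, ["Der", "Hund", "bellt"]))
```

A real run would read the downloaded corpus file from the 'data' directory instead of the three sample lines.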
### Stemming/Lemmatization
Example snippet.
```python
from lemmatizer import Lemmatizer

if __name__ == '__main__':
    lemmatizer = Lemmatizer()
    # German sample input; the lemmatizer targets German text.
    text = "Heute war ein wirklich langer Tag gewesen. Der Hund isst sein Leckerli."
    text = lemmatizer.lemmatize_text(text=text)
    print(text)
```
Output:
```heut sein ein wirklich lang tag sein der hund essen sein leckerli```
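The lemmatized output is a whitespace-separated string of lowercased tokens. As a minimal sketch of what can happen next, the snippet below turns such strings into bag-of-words counts, the usual input representation for LDA. The helpers `build_vocabulary` and `to_bag_of_words` are hypothetical names for illustration; the repository may rely on a topic-modelling library for this step instead.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(document, vocab):
    """Return sorted (token_id, count) pairs for one lemmatized document."""
    counts = Counter(vocab[token] for token in document.split() if token in vocab)
    return sorted(counts.items())


docs = ["heut sein ein wirklich lang tag sein der hund essen sein leckerli"]
vocab = build_vocabulary(docs)
bow = to_bag_of_words(docs[0], vocab)
print(len(vocab), bow)
```

In this example 'sein' occurs three times, so its id carries a count of 3 while every other token has a count of 1.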
### Train LDA
- Training an LDA model requires the POS tagger (stored locally in ```.../artifacts/tagger/...```).
- You also need access to the Newsbox-API (see the 'Authentication' section above).
- Execute ```python train_lda.py```. This can take several minutes, depending on whether cached files are used.
- Once training is done, the browser should open automatically.
## Approach
- If the tagger tags a word as a noun, the original word is kept instead of its stem.
- Only nouns are considered for the LDA model.
### References
- https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
- https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
## Troubleshooting
### MySQL Problems (Mac)
```
...copying MySQLdb/constants/FIELD_TYPE.py -> build/lib.macosx-10.9-x86_64-3.6/MySQLdb/constants
copying MySQLdb/constants/FLAG.py -> build/lib.macosx-10.9-x86_64-3.6/MySQLdb/constants
running build_ext
building 'MySQLdb._mysql' extension
creating build/temp.macosx-10.9-x86_64-3.6
creating build/temp.macosx-10.9-x86_64-3.6/MySQLdb
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g -Dversion_info=(1,4,6,'final',0) -D__version__=1.4.6 -I/usr/local/Cellar/mysql/8.0.19/include/mysql -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c MySQLdb/_mysql.c -o build/temp.macosx-10.9-x86_64-3.6/MySQLdb/_mysql.o
gcc -bundle -undefined dynamic_lookup -arch x86_64 -g build/temp.macosx-10.9-x86_64-3.6/MySQLdb/_mysql.o -L/usr/local/Cellar/mysql/8.0.19/lib -lmysqlclient -lssl -lcrypto -o build/lib.macosx-10.9-x86_64-3.6/MySQLdb/_mysql.cpython-36m-darwin.so
ld: library not found for -lssl
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for mysqlclient
```
Some pip dependencies require mysql_config. The following fix can be applied:
```
brew install mysql-client
brew install openssl
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/opt/openssl/lib/
```
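Before re-running `pip install`, it can help to check whether the fix took effect. The following is a small, hypothetical pre-flight check (not part of the repository) that looks for `mysql_config` on the `PATH` and for the OpenSSL directory from the fix in `LIBRARY_PATH`.

```python
import os
import shutil

OPENSSL_LIB = "/usr/local/opt/openssl/lib"  # directory used in the fix above


def mysqlclient_preflight(env=None):
    """Report whether the prerequisites from the fix appear to be in place."""
    env = os.environ if env is None else env
    # LIBRARY_PATH is colon-separated; ignore empty entries and trailing slashes.
    library_dirs = [p.rstrip("/") for p in env.get("LIBRARY_PATH", "").split(":") if p]
    return {
        "mysql_config_on_path": shutil.which("mysql_config", path=env.get("PATH")) is not None,
        "openssl_in_library_path": OPENSSL_LIB in library_dirs,
    }


if __name__ == "__main__":
    for check, ok in mysqlclient_preflight().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

If either check reports 'missing', the corresponding `brew install` or `export` step is probably not active in the current shell.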