Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/eea/eea.corpus
Machine Learning and Natural Language Processing of the EEA Corpus via spaCy, Textacy and pyLDAvis and other useful NLP algorithms.
https://github.com/eea/eea.corpus
data-visualization eea-corpus environment latent-dirichlet-allocation lda-visualisation machine-learning natural-language-processing nlp spacy text-mining topic-modeling
Last synced: about 1 month ago
JSON representation
Machine Learning and Natural Language Processing of the EEA Corpus via spaCy, Textacy and pyLDAvis and other useful NLP algorithms.
- Host: GitHub
- URL: https://github.com/eea/eea.corpus
- Owner: eea
- License: gpl-3.0
- Created: 2017-04-22T14:24:58.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T00:51:03.000Z (about 2 years ago)
- Last Synced: 2024-04-11T13:00:19.676Z (9 months ago)
- Topics: data-visualization, eea-corpus, environment, latent-dirichlet-allocation, lda-visualisation, machine-learning, natural-language-processing, nlp, spacy, text-mining, topic-modeling
- Language: Python
- Homepage:
- Size: 9.42 MB
- Stars: 14
- Watchers: 44
- Forks: 2
- Open Issues: 24
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# EEA Corpus (alpha stage)
This docker image is based on spaCy, Textacy, pyLDAvis & others to analyse the
EEA Corpus (the collection of all published EEA documents) or any other CSV
file with a column of text.It provides a number of Machine Learning and Natural Language Processing
algorthims that can be run on top of the EEA Corpus or a subset of it.~The idea is to provide these methods over a REST API when possible.~
## Current features
### Compose a text transformation pipeline to prepare a corpus
First upload a CSV file, then use the "Create a corpus" button to enter
the pipeline composition page.### Create and visualise topic models via pyLDAvis.
The topics are found via a text-mining technique called
[Topic Modeling](https://en.wikipedia.org/wiki/Topic_model).In machine learning and natural language processing, a topic model is a
type of statistical model for discovering the abstract "topics" that occur in a
collection of documents.[Video demonstration](https://www.youtube.com/watch?v=IksL96ls4o0&t=255s)
![LDA visualisation example](ldavis.png?raw=true "LDA visualisation example")
## How to run:
```
docker-compose build
docker-compose up -d
```This will (after some time) start the EEA Corpus application server on
[localhost:8181](http://0.0.0.0:8181)## EEA Corpus Data
The latest EEA Corpus dataset can be produced by visiting [global
catalogue](http://search.apps.eea.europa.eu/) > See all results > download
csv.Once the csv file is downloaded, you can pass it to this application to be
analysed. Make sure your first column is the "document text" to be analysed.
The other columns are considered metadata.You may download an already generated large EEA corpus data for testing like
this:```
curl -L -o data.csv https://www.dropbox.com/s/sihmoc4wwpl0kr2/data_all.csv?dl=1
```