https://github.com/louisguitton/ue22-nlp-news
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/louisguitton/ue22-nlp-news
- Owner: louisguitton
- Created: 2021-06-11T20:23:54.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-01-04T20:23:08.000Z (almost 3 years ago)
- Last Synced: 2025-02-03T22:40:46.672Z (9 months ago)
- Language: Python
- Size: 233 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- Changelog: newsapi-crawl/.gitignore
README
# station
> Industry-grade text analytics combining spaCy and Elasticsearch
## Thanks
- https://github.com/EMBEDDIA/texta-rest
- https://github.com/d-one/NLPeasy
- https://github.com/nestauk/clio-lite
- https://github.com/deepset-ai/haystack
## Flow
- get some JSONL text data
- specify the language of the text (in our case, ES_analyser = french)
- specify which fields contain text and should be copied to 'all_text'
- specify which field to use as '\_id'
- ingest into Elasticsearch
- add a 'Dataset' class that takes an ES query and returns documents
- add a 'Serialiser' class that takes a Dataset and returns docs for sklearn and spaCy
- parse documents with Language models (spaCy or stanza?) to get NER
## News Dataset
- https://commoncrawl.org/2016/10/news-dataset-available/
- https://github.com/commoncrawl/news-crawl/
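
The ingestion steps listed under Flow (choose an analyser, copy text fields to 'all_text', pick an '\_id' field, ingest) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the helper names `build_mapping` and `bulk_lines`, the index name `news`, and the fields `url`, `title` and `body` are hypothetical. It uses the standard library only and produces the index body and the newline-delimited payload that Elasticsearch's `_bulk` endpoint expects; sending them to a cluster is left out.

```python
import json
from typing import Dict, Iterable, Iterator, List


def build_mapping(text_fields: List[str], analyzer: str = "french") -> Dict:
    """Build an index body where each text field is copied into 'all_text'."""
    properties = {
        name: {"type": "text", "analyzer": analyzer, "copy_to": "all_text"}
        for name in text_fields
    }
    # Combined field that receives the copies and is analysed the same way.
    properties["all_text"] = {"type": "text", "analyzer": analyzer}
    return {"mappings": {"properties": properties}}


def bulk_lines(jsonl_lines: Iterable[str], index: str, id_field: str) -> Iterator[str]:
    """Yield NDJSON lines for the _bulk API: an action line, then the document."""
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in the JSONL input
        doc = json.loads(raw)
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        yield json.dumps(action, ensure_ascii=False)
        yield json.dumps(doc, ensure_ascii=False)


# Hypothetical sample records standing in for a JSONL file of news articles.
sample = [
    '{"url": "https://example.org/a", "title": "Un titre", "body": "Le texte."}',
    '{"url": "https://example.org/b", "title": "Autre titre", "body": "Encore du texte."}',
]

mapping = build_mapping(["title", "body"])
payload = "\n".join(bulk_lines(sample, index="news", id_field="url")) + "\n"
```

Here `mapping` would be the body used when creating the index, and `payload` the body of a bulk request. Using the article URL as '\_id' is an assumption that makes re-ingesting the same file overwrite documents rather than duplicate them.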