Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/d-one/nlpeasy
Easy Peasy Language Squeezy
https://github.com/d-one/nlpeasy
datascience elasticsearch kibana nlp spacy
Last synced: 2 months ago
JSON representation
Easy Peasy Language Squeezy
- Host: GitHub
- URL: https://github.com/d-one/nlpeasy
- Owner: d-one
- License: other
- Created: 2019-10-10T08:06:28.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-05-03T11:37:49.000Z (over 1 year ago)
- Last Synced: 2024-10-01T05:41:27.033Z (3 months ago)
- Topics: datascience, elasticsearch, kibana, nlp, spacy
- Language: Jupyter Notebook
- Homepage:
- Size: 520 KB
- Stars: 10
- Watchers: 4
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md
- Contributing: CONTRIBUTING.rst
- License: LICENSE
Awesome Lists containing this project
README
[![Travis Build Status]()](https://travis-ci.org/d-one/nlpeasy)
[![pypi Version](https://img.shields.io/pypi/v/nlpeasy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/nlpeasy/)NLPeasy
=======Build NLP pipelines the easy way
> **Disclaimer:** This is in Alpha stage, lot of things can go wrong.
> It could possibly change your Elasticsearch Data the API is not fixed yet
> and even the name NLPeasy might change.* Free software: Apache Software License 2.0
Usage
-----For this example to completely work you need to have Python at least in Version 3.6 installed.
Also you need to have install and start either- **Docker** , direct download links for
[Mac (DMG)](https://download.docker.com/mac/stable/Docker.dmg) and
[Windows (exe)](https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe).
- **Elasticsearch** and **Kibana**:
or
(pure Apache licensed version)Then on the terminal issue:
```bash
python -m venv venv
source venv/bin/activate
pip install nlpeasy scikit-learn
python -m spacy download en_core_web_md
```
The package `scikit-learn` is just used in this example to get the newsgroups data and preprocess it.
The last command downloads a spacy model for the english language -
for the following you need to have at least it's `md` (=medium) version which has wordvectors.```python
import pandas as pd
import nlpeasy as ne
from sklearn.datasets import fetch_20newsgroups# connect to running elastic or else start an Open Source stack on your docker
elk = ne.connect_elastic(docker_prefix='nlp', elk_version='7.10.2', mount_volume_prefix=None)
# If it is started on docker it will on the first time pull the images (1.3GB)!
# Setting mountVolumePrefix="./elastic-data/" would keep the data of elastic in your
# filesystems and then the data survives container restarts# read data as Pandas data frame
news_raw = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
news_groups = [news_raw['target_names'][i] for i in news_raw['target']]
news = pd.DataFrame({'newsgroup': news_groups, 'message': news_raw['data']})# setup NLPeasy pipeline with name for the elastic index and set the text column
pipeline = ne.Pipeline(index='news', text_cols=['message'], tag_cols=['newsgroup'], elk=elk)pipeline += ne.VaderSentiment('message', 'sentiment')
pipeline += ne.SpacyEnrichment(nlp='en_core_web_md', cols=['message'], vec=True)# do the pipeline - just for first 100, the whole thing would take 10 minutes
news_enriched = pipeline.process(news.head(10000000), write_elastic=True)# Create Kibana Dashboard of all the columns
pipeline.create_kibana_dashboard()# open Kibana in webbrowser
elk.show_kibana()
```Let's have some fun outside of Elastic/Kibana - but this needs `pip install matplotlib`
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
grouped = news_enriched.loc[~news_enriched.message_vec.isna()].groupby('newsgroup')
group_vec = grouped.apply(lambda z: np.stack(z.message_vec.values).mean(axis=0))
clust = linkage(np.stack(group_vec), 'ward')
# calculate full dendrogram
plt.figure(figsize=(10, 10))
plt.title('Hierarchical Clustering Dendrogram Newsgroups')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
clust,
leaf_rotation=0., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels=group_vec.index,
orientation='left'
)
plt.show()
```Installation
------------Prerequisites:
- Python 3 (we use Python 3.7)
- Elastic: Several possibilities
- Have Docker installed - needs to have the docker package installed (see below).
- Install and start Elasticsearch and Kibana:
or
(pure Apache licensed version)
- Use any running Elasticsearch and Kibana (on premise or cloud)...
- Pretrained Models: See below for Spacy Language Models and WordVectorsIt is recommended to use a virtual environment:
```bash
cd $PROJECT_DIR
python -m venv venv
source venv/bin/activate
```
The source statement has to be repeated whenever you open a new terminal.Then install
```bash
pip install nlpeasy
```
Or the development version from GitHub:
```bash
pip install --upgrade git+https://github.com/d-one/nlpeasy
```If you want to use spaCy language models download them (90-200 MB), e.g.
```bash
python -m spacy download en_core_web_md
# and/or
python -m spacy download de_core_news_md
```
If you want to use pretrained [FastText-Wordvectors](https://fasttext.cc/docs/en/pretrained-vectors.html) (each ~7GB):
```bash
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.de.zip
```If you want to use Jupyter, install it to the virtual environment:
```bash
pip install jupyterlab
```### Development
To install this module in Dev-mode, i.e. change files and reload module:
```bash
git clone https://github.com/d-one/nlpeasy
cd nlpeasy
```It is recommended to use a virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```Install the version in edit mode:
```bash
pip install -e .
```In Jupyter you can have reloaded code when you change the files as in:
```python
%load_ext autoreload
%autoreload 2
```Features
--------* Pandas based pipeline
* Support for any extensions - now includes some for Regex, spaCy, VaderSentiment
* Write results to ElasticSearch
* Automatic Kibana dashboard generation
* Have Elastic started in Docker if it is not installed locally or remotely
* Apache License 2.0Credits
-------This package was created with [Cookiecutter]() and the [`audreyr/cookiecutter-pypackage`] project template.