# o p t i m u s

A text processing pipeline for turning unstructured text data into hierarchical datasets.

## What does Optimus do?
The Data Science Campus has been exploring how to process unlabelled list data
that is collected manually, in an uncontrolled fashion, and with no
supplementary information to allow aggregation. Please note that this project
is intended for short descriptions of no more than around 10 words. For longer
text descriptions you may need to fork the repository and optimise some of the metrics.

For further information on the methodology please read our [blog](https://datasciencecampus.ons.gov.uk/o-p-t-i-m-u-s-turning-free-text-lists-into-hierarchical-datasets).

## Getting Started

These instructions will get you a copy of the project up and running on your
local machine for development and testing purposes.

Documentation on the methods utilised and how Optimus functions is pending. This
README will be updated to include links to this material once it is made available.

### Prerequisites

You will need the following tools in order to set up and use Optimus:

- A modern macOS or Linux installation (Windows is not supported; you are on
your own if you try it there)
- [curl](https://curl.haxx.se/)
- [zsh](https://github.com/robbyrussell/oh-my-zsh/wiki/Installing-ZSH)
- [python 3.6](https://www.python.org) or later
- [git](https://git-scm.com)

First, clone this git repository:
```sh
git clone https://github.com/datasciencecampus/optimus.git
```

Within the repo is a file named `setup.zsh`. This is a command line tool that
installs everything else you need. For help using it, invoke the script as

```sh
. setup.zsh -h
```

This script downloads the [FastText Wikipedia word
embeddings](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md)
model and places it in the Optimus directory. If your project lives elsewhere
and you are not working in the Optimus directory itself, it is still
recommended to use this script to download the model, then move it so it is
local to your working directory.

### Quick Start example

There is a quick start example script, `example.py`, in the root directory that demonstrates how to use the pipeline. The final dataset is written to `optimus_results.csv`, also in the root directory.
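
Assuming the prerequisites are installed and the model has been downloaded as
described above, it can be run directly (a sketch; exact output depends on
your local setup):

```sh
python example.py
# the final dataset is written to optimus_results.csv
```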

## A graphical UI for running Optimus

To make the tool more accessible, a web-app-based UI was developed. This user interface lets you process data without writing any Python code.

If this interests you, please read [this README.md](apps/pipeline_launcher/readme.md) for more information.

## How to use the python module

#### Importing Optimus
Import Optimus into Python either as the whole module

* `import optimus`

or by importing the Optimus class directly

* `from optimus import Optimus`

#### Customise settings for Optimus

Configuration of the pipeline is controlled with a `config.json` file in the
following format:

```json
{
    "data": "location/to/data.csv",
    "model": "location/to/wiki.en.bin",
    ...
}
```

After creating a `config.json` file, the location can be passed when creating an
instance of Optimus:

```python
o = Optimus(config_path='path/to/config.json', ...)
```

Further settings can be added on an ad hoc basis and will overwrite any
previous settings. To do so, pass valid arguments to the Optimus class on
construction, like so:

```python
o = Optimus(
    config_path='path/to/config.json',
    data="path/to/new_data.csv",
    cutoff=6,
    ...
)
```

Optimus has a default settings file to fall back on in case neither of these is
provided; however, using just the default settings may cause issues, mainly
because the paths to the data and models in the default settings will not
match your environment.

The file `etc/config.json` stores the default arguments used by Optimus. Please
do not edit this file.

Shortened reference:

1. `obj = Optimus()` -> uses the default settings
2. `obj = Optimus(config_path='path/to/user/config.json')` -> uses a custom
config file
3. `obj = Optimus(distance=10, stepsize=2, cutoff=16, ...)` -> replaces
specific parameter values instead of those defined in the config file.

#### Running the code & getting outputs

Optimus takes in `pandas.core.series.Series` objects. To run a configured
Optimus object on a series, simply call the object with the desired series as
its argument. For example, for a pandas series called `text`:

```python
import pandas as pd
from optimus import Optimus

# A series of short, free-text descriptions
text = pd.Series(['red woolly jumper', 'blue cotton shirt'])

o = Optimus()
results = o(text)
```

**NOTE**: If no data is passed into the Optimus object, the data defined in
the config file will be used.
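
For example, calling a configured object with no arguments (a sketch,
assuming your `config.json` sets the `"data"` key):

```python
from optimus import Optimus

o = Optimus(config_path='path/to/config.json')
results = o()  # falls back to the CSV referenced by the config's "data" key
```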

##### Additional arguments to Optimus:

* **save_csv**
Pass `save_csv` as an optional keyword argument. Setting `save_csv=True`
forces Optimus to save the output DataFrame, which includes the labels from
each iteration, to the working directory as `labelled.csv`.

* **full**
Similarly, if you just need the DataFrame returned rather than saved, use
`full=True` to receive back the DataFrame containing the mapped labels.

* **verbose**
A boolean value dictating how much is printed to the console as the code
runs. Some output is still printed even when `verbose=False`, to give some
idea of the progress of the processing.
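
A call combining these options might look like the following sketch (the
sample series is illustrative):

```python
import pandas as pd
from optimus import Optimus

text = pd.Series(['red woolly jumper', 'blue cotton shirt'])

o = Optimus()
# Save labelled.csv to the working directory, return the full DataFrame of
# mapped labels, and keep console output to a minimum.
results = o(text, save_csv=True, full=True, verbose=False)
```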

## Managing Memory

The fastText model is large and requires a sizeable amount of RAM. Each
instance of Optimus loads its own fastText model on the first processing call:
it checks whether the model has been loaded before and, if not, performs an
`ft.load_model()` operation. Once it is loaded, subsequent runs (based on the
same instance of Optimus) will not reload the model.
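
In practice this means it is worth reusing a single instance across datasets
rather than constructing a new one per run (a sketch; the sample data is
illustrative):

```python
import pandas as pd
from optimus import Optimus

o = Optimus()

# The first call loads the fastText model into memory.
results_a = o(pd.Series(['red woolly jumper', 'blue cotton shirt']))

# Later calls on the same instance reuse the already loaded model.
results_b = o(pd.Series(['green silk scarf', 'black leather boots']))
```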

#### Replacing models and freeing memory

The Optimus object has a `replace_model` method, which provides a way to
control the memory usage of the Optimus object. It allows a user to load a
replacement model, or simply to remove the loaded model from the Optimus
object.

The method takes a string path or an already loaded fastText model and assigns
it to the Optimus object. If no model parameter is passed, the method simply
deletes and garbage collects the existing loaded model.

```python
o = Optimus()  # construct as usual (config path and/or keyword arguments)
output = o(some_data)

# Load from a path
o.replace_model('string/path/to/model')

# Provide an already loaded model
o.replace_model(fastText.load_model('string/path/to/model'))

# Delete the existing model in the Optimus object
o.replace_model()
```
## Embedding plot functions

This pipeline comes with a helpful embedding visualiser module. This set of
functions allows users to pass in a pandas series of text entries together
with a fastText model; the model embeds the strings in an n-dimensional
space, which is then reduced to two dimensions using t-SNE.

The result is plotted and exported to an `embedding_plot.html` file, which is
fully interactive.

```python
import pandas as pd
from lib.emplot import plot

series = pd.Series(['string1', ..., 'string2'])
plot(series=series,
     model='path/to/model.bin',
     output_path='output_vectors.csv')
```

## Working with large datasets

Ward linkage is computationally expensive. The process needs to calculate a
pairwise distance matrix for all of the embedded vectors, and the memory
consumption of this is of order $n^2$ for $n$ data points. When you factor in
that the models for the fastText embedding are already gigabytes in size,
this can become a problem.
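
As a rough sense of scale, a back-of-the-envelope sketch (assuming float64
distances):

```python
# Memory for a condensed pairwise distance matrix of n points:
# n * (n - 1) / 2 float64 entries at 8 bytes each.
n = 100_000
bytes_needed = n * (n - 1) // 2 * 8
print(f"{bytes_needed / 1e9:.0f} GB")  # ~40 GB
```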

Where data starts to push the boundaries of what is available to the process,
we currently recommend sampling your data points, using Optimus to categorise
the sampled points, and then using (for example) a knn to 'smear' the
generated labels across the nearby points.

Example code to do this is provided in the `sampling/` directory. The program
performs a simple random sample of the contents of your list and then embeds
these words before using the approach outlined above to generate labels for
the out-of-sample words. This approach is naive, but can provide a starting
point for more complex sampling mechanisms such as the use of
[apricot](https://github.com/jmschrei/apricot).
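
A minimal sketch of the label-smearing step using scikit-learn's
`KNeighborsClassifier` (the arrays here are illustrative stand-ins for the
fastText embeddings and the labels Optimus produced for the sample):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins for the embedded vectors: a labelled sample plus the full dataset.
sampled_vectors = rng.normal(size=(100, 300))  # embeddings of the sampled strings
sampled_labels = rng.integers(0, 5, size=100)  # labels Optimus gave the sample
all_vectors = rng.normal(size=(10_000, 300))   # embeddings of every string

# Fit a knn on the labelled sample and 'smear' its labels across the
# nearest out-of-sample points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(sampled_vectors, sampled_labels)
smeared_labels = knn.predict(all_vectors)
```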

## Authors / Contributors

#### Data Science Campus - Office for National Statistics
* Steven Hopkins
* Gareth Clews
* Arturas Eidukas
* Lucy Gwilliam

#### Department for the Environment, Food and Rural Affairs
* Tom Hopkinson

## License

This project is licensed under the MIT License - see the
[LICENSE.md](LICENSE.md) file for details.

## References

### Bag of Tricks for Efficient Text Classification

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```