https://github.com/codait/text-extensions-for-pandas
Natural language processing support for Pandas dataframes.
- Host: GitHub
- URL: https://github.com/codait/text-extensions-for-pandas
- Owner: CODAIT
- License: apache-2.0
- Created: 2020-02-26T20:12:44.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-08-30T00:16:03.000Z (about 1 year ago)
- Last Synced: 2024-05-14T15:32:57.157Z (6 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 79.9 MB
- Stars: 213
- Watchers: 13
- Forks: 34
- Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
README
# Text Extensions for Pandas
[![Documentation Status](https://readthedocs.org/projects/text-extensions-for-pandas/badge/?version=latest)](https://text-extensions-for-pandas.readthedocs.io/en/latest/?badge=latest)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks)

Natural language processing support for Pandas dataframes.

Text Extensions for Pandas turns Pandas DataFrames into a universal data
structure for representing intermediate data in all phases of your NLP
application development workflow.

**Web site:** https://ibm.biz/text-extensions-for-pandas
**API docs:** https://text-extensions-for-pandas.readthedocs.io/
## Features
### SpanArray: A Pandas extension type for *spans* of text
* Connect features with regions of a document
* Visualize the internal data of your NLP application
* Analyze the accuracy of your models
* Combine the results of multiple models

### TensorArray: A Pandas extension type for tensors
* Represent BERT embeddings in a Pandas series
* Store logits and other feature vectors in a Pandas series
* Store an entire time series in each cell of a Pandas series

### Pandas front-ends for popular NLP toolkits
* [SpaCy](https://spacy.io/)
* [Transformers](https://github.com/huggingface/transformers)
* [IBM Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding)
* [IBM Watson Discovery Table Understanding](https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-understanding_tables)

## CoNLL-2020 Paper
Looking for the model training code from our CoNLL-2020 paper, ["Identifying Incorrect Labels in the CoNLL-2003 Corpus"](https://www.aclweb.org/anthology/2020.conll-1.16/)?
See the notebooks in [this directory](https://github.com/CODAIT/text-extensions-for-pandas/tree/master/tutorials/corpus). The associated data set is [here](https://github.com/CODAIT/Identifying-Incorrect-Labels-In-CoNLL-2003).
## Installation
This library requires Python 3.7+, Pandas, and NumPy.
To install the latest release, just run:
```
pip install text-extensions-for-pandas
```

Depending on your use case, you may also need the following additional
packages:
* `spacy` (for SpaCy support)
* `transformers` (for transformer-based embeddings and BERT tokenization)
* `ibm_watson` (for IBM Watson support)

Alternatively, the package is available from conda-forge and can be installed into a conda environment with:
```
conda install --channel=conda-forge text_extensions_for_pandas
```

## Installation from Source
If you'd like to try out the very latest version of our code,
you can install directly from the head of the master branch:
```
pip install git+https://github.com/CODAIT/text-extensions-for-pandas
```

You can also directly import our package from your local copy of the
`text_extensions_for_pandas` source tree. Just add the root of your local copy
of this repository to the front of `sys.path`.

## Documentation
For examples of how to use the library, take a look at the **example notebooks** in
[this directory](https://github.com/CODAIT/text-extensions-for-pandas/tree/master/notebooks). You can try out these notebooks on [Binder](https://mybinder.org/) by navigating to [https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks](https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks).

To run the notebooks on your local machine, follow these steps:
1. Install [Anaconda](https://docs.anaconda.com/anaconda/install/) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
1. Check out a copy of this repository.
1. Use the script `env.sh` to set up an Anaconda environment for running the code in this repository.
1. Type `jupyter lab` from the root of your local source tree to start a [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) environment.
1. Navigate to the `notebooks` directory and choose any of the notebooks there.

API documentation can be found at [https://text-extensions-for-pandas.readthedocs.io/en/latest/](https://text-extensions-for-pandas.readthedocs.io/en/latest/)
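
If you work from a checked-out copy of the repository rather than an installed release, a script or notebook can make the local package importable by placing the repository root at the front of `sys.path`. The following is a minimal sketch of that idea, not project code: the helper name `add_repo_to_path` and the path `text-extensions-for-pandas` are hypothetical and stand in for wherever your own clone lives.

```python
import os
import sys


def add_repo_to_path(repo_root):
    """Put the given directory at the front of sys.path and return its absolute path."""
    root = os.path.abspath(repo_root)
    # Drop any earlier occurrence so repeated calls do not pile up duplicates.
    while root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
    return root


# Hypothetical checkout location; replace it with the path to your own clone.
repo = add_repo_to_path("text-extensions-for-pandas")
print(sys.path[0])

# With a real checkout at that path, the import below would resolve to the
# local source tree instead of any installed release:
# import text_extensions_for_pandas as tp
```

Putting the entry first matters because Python searches `sys.path` in order, so the local copy takes precedence over a release installed in the same environment.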
## Contents of this repository
* **`text_extensions_for_pandas`**: Source code for the `text_extensions_for_pandas` module.
* **env.sh**: Script to create a conda environment `pd` capable of running the notebooks and test cases in this project
* **generate_docs.sh**: Script to build the [API documentation](https://readthedocs.org/projects/text-extensions-for-pandas/)
* **api_docs**: Configuration files for `generate_docs.sh`
* **binder**: Configuration files for [running notebooks on Binder](https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks)
* **config**: Configuration files for `env.sh`.
* **docs**: Project web site
* **notebooks**: example notebooks
* **resources**: various input files used by our example notebooks
* **test_data**: data files for regression tests. The tests themselves are
located adjacent to the library code files.
* **tutorials**: Detailed tutorials on using Text Extensions for Pandas to
cover complex end-to-end NLP use cases (work in progress).

## Contributing
This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.
To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.
## Building and Running Tests
Before building the code in this repository, we recommend that you use the
provided script `env.sh` to set up a consistent build environment:
```
$ ./env.sh --env_name myenv
$ conda activate myenv
```
(replace `myenv` with your choice of environment name).

To run tests, navigate to the root of your local copy and run:
```
pytest text_extensions_for_pandas
```

To build pip and source code packages:
```
python setup.py sdist bdist_wheel
```

(outputs go into `./dist`).
To build API documentation, run:
```
./generate_docs.sh
```
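
After installing a release, or a wheel built into `./dist`, it can help to confirm what the active environment actually provides. The sketch below uses only the standard library and imports nothing heavy. It assumes that `text_extensions_for_pandas`, `spacy`, `transformers`, and `ibm_watson` are the top-level module names, and the helper `check_install` is hypothetical, not part of this project.

```python
from importlib.util import find_spec

CORE = "text_extensions_for_pandas"
# Optional extras named in the installation notes; assumed to be import names.
OPTIONAL = ("spacy", "transformers", "ibm_watson")


def check_install(names):
    """Map each top-level module name to True if it can be found in this environment."""
    return {name: find_spec(name) is not None for name in names}


status = check_install((CORE,) + OPTIONAL)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A `missing` entry for one of the optional packages only matters if you use the corresponding front-end.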