https://github.com/yoctol/strpipe
text preprocessing pipeline
cython natural-language-processing
- Host: GitHub
- URL: https://github.com/yoctol/strpipe
- Owner: Yoctol
- License: other
- Created: 2018-08-21T10:37:42.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-11-30T11:23:02.000Z (over 7 years ago)
- Last Synced: 2025-04-25T04:50:03.178Z (about 1 year ago)
- Topics: cython, natural-language-processing
- Language: Python
- Homepage:
- Size: 694 KB
- Stars: 5
- Watchers: 7
- Forks: 0
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
# strpipe
[![travis-ci][travis-image]][travis-url] [![pypi-version][pypi-image]][pypi-url] [![codecov][codecov-image]][codecov-url]
[travis-image]: https://travis-ci.org/Yoctol/strpipe.svg?branch=master
[travis-url]: https://travis-ci.org/Yoctol/strpipe
[pypi-image]: https://badge.fury.io/py/strpipe.svg
[pypi-url]: https://badge.fury.io/py/strpipe
[codecov-image]: https://codecov.io/gh/Yoctol/strpipe/branch/master/graph/badge.svg
[codecov-url]: https://codecov.io/gh/Yoctol/strpipe
A reversible string-processing pipeline, featuring reproducibility, serializability, and performance.
## Installation
```
pip install strpipe
```
## Usage
```python
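import strpipe as sp

# Build the pipe step by step; each step is looked up by its op name.
p = sp.Pipe()
p.add_step_by_op_name('ZhCharTokenizer')
p.add_step_by_op_name('AddSosEos')
p.add_checkpoint()
p.add_step_by_op_name('Pad')
p.add_step_by_op_name('TokenToIndex')

data = [
    '你好啊',
    '早安',
    '你早上好',
]

# Fit the pipe on the data before transforming.
p.fit(data)

# Convention: "tx" is short for "transform".
result, tx_info, intermediates = p.transform(data)

# tx_info is passed back in to reverse the transformation.
back_data = p.inverse_transform(result, tx_info)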
```
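
To illustrate why `transform` returns extra information alongside the result, here is a minimal plain-Python sketch of a reversible padding step. It does not use strpipe and makes no claim about how strpipe implements `Pad`; the function names and the `PAD` token are assumptions made for this example only.

```python
from typing import List, Tuple

PAD = '<pad>'  # hypothetical padding token, chosen for this sketch only


def pad_transform(batch: List[List[str]]) -> Tuple[List[List[str]], List[int]]:
    """Pad every sequence to the longest length in the batch.

    Returns the padded batch plus the original lengths, which play the
    role of tx_info: the data needed to undo the padding later.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths, default=0)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in batch]
    return padded, lengths


def pad_inverse_transform(
    padded: List[List[str]], lengths: List[int]
) -> List[List[str]]:
    """Cut each sequence back to its recorded original length."""
    return [seq[:n] for seq, n in zip(padded, lengths)]


tokens = [list('你好啊'), list('早安'), list('你早上好')]
padded, lengths = pad_transform(tokens)
restored = pad_inverse_transform(padded, lengths)
```

Without the recorded lengths, the inverse step could not tell padding apart from a genuine token that happens to equal the padding symbol.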
### Serialization
```python
# Save it
p.save_json('/path/of/pipe')
# Load it
p = sp.Pipe.restore_from_json('/path/of/pipe')
result, tx_info, intermediates = p.transform(['你好'])
```
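
The idea behind serializing a fitted pipe can be sketched in plain Python: fitted state (here, a token-to-index vocabulary) is written to JSON and read back, and the restored state produces the same output. This is only an illustration; the `fit_vocab` helper and the file layout are assumptions and do not reflect strpipe's actual JSON format.

```python
import json
import os
import tempfile
from typing import Dict, List


def fit_vocab(batch: List[List[str]]) -> Dict[str, int]:
    """Assign an index to each distinct token in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in batch:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


vocab = fit_vocab([list('你好啊'), list('早安')])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab.json')  # hypothetical file name
    # ensure_ascii=False keeps the Chinese characters readable in the file.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(path, encoding='utf-8') as f:
        restored_vocab = json.load(f)

# The restored state gives the same indices as the original state.
indices_before = [vocab[token] for token in '你好']
indices_after = [restored_vocab[token] for token in '你好']
```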
## Test
```
$ make test
```
## Docs
```
$ make docs
```
Docs will be built in the `docs/build/html` folder. (Note: this also reinstalls the package because we need Cython code to be rebuilt.)
## Extend Ops
1. Create the new op by subclassing `BaseOp`.
2. Define `input_type` and `output_type`.
3. Implement op creation.
4. Implement `fit`, `transform`, and `inverse_transform`. If the op is stateless, the `fit` method should return `None`.
   > Note: An op's functionality can often be decomposed into several functions. These functions should be written into (or imported from) the `toolkit` package for easy reuse. Ops in the `ops` package will, for the most part, be wrappers for functions in `toolkit`.
5. Write tests.
6. Register the op in `op_factory`.
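
The steps above can be sketched roughly as follows. This is a self-contained illustration, not strpipe code: `SketchBaseOp` stands in for the real `BaseOp`, `SKETCH_OP_REGISTRY` stands in for `op_factory`, and `StripEdges` is a hypothetical op. The method signatures are assumptions, so check the actual `BaseOp` before copying them.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type


class SketchBaseOp(ABC):
    """Stand-in for strpipe's BaseOp; the real interface may differ."""

    input_type: type
    output_type: type

    @abstractmethod
    def fit(self, data):
        """Return fitted state, or None for a stateless op."""

    @abstractmethod
    def transform(self, state, data):
        """Return (output, tx_info)."""

    @abstractmethod
    def inverse_transform(self, state, data, tx_info):
        """Rebuild the input of transform from its output and tx_info."""


# Stand-in for op_factory: maps an op name to its class.
SKETCH_OP_REGISTRY: Dict[str, Type[SketchBaseOp]] = {}


class StripEdges(SketchBaseOp):
    """Hypothetical stateless op: strips leading and trailing whitespace."""

    input_type = str
    output_type = str

    def fit(self, data: List[str]) -> None:
        return None  # stateless, so there is nothing to learn

    def transform(
        self, state: None, data: List[str]
    ) -> Tuple[List[str], List[Tuple[str, str]]]:
        outputs, tx_info = [], []
        for text in data:
            left = text[:len(text) - len(text.lstrip())]
            rest = text[len(left):]
            core = rest.rstrip()
            right = rest[len(core):]
            outputs.append(core)
            tx_info.append((left, right))  # whitespace removed on each side
        return outputs, tx_info

    def inverse_transform(
        self, state: None, data: List[str], tx_info: List[Tuple[str, str]]
    ) -> List[str]:
        return [left + core + right for core, (left, right) in zip(data, tx_info)]


SKETCH_OP_REGISTRY['StripEdges'] = StripEdges

# A round-trip check like this is a natural first test for a new op.
op = SKETCH_OP_REGISTRY['StripEdges']()
samples = ['  你好 ', '早安', '   ']
state = op.fit(samples)
stripped, info = op.transform(state, samples)
round_trip = op.inverse_transform(state, stripped, info)
```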