https://github.com/eonm-pro/b-cleaner
Preprocess bibliographical data for alignment tasks
alignment data-normalization data-preprocessing
- Host: GitHub
- URL: https://github.com/eonm-pro/b-cleaner
- Owner: eonm-pro
- License: mit
- Created: 2020-07-21T14:39:24.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-07-05T04:46:33.000Z (over 4 years ago)
- Last Synced: 2024-07-08T12:56:43.343Z (over 1 year ago)
- Topics: alignment, data-normalization, data-preprocessing
- Language: Rust
- Homepage:
- Size: 19.5 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# B-cleaner (Bibliographical data cleaner)
[Project status: active](https://www.repostatus.org/#active)
[License: MIT](https://opensource.org/licenses/MIT)
[FOSSA status](https://app.fossa.com/projects/git%2Bgithub.com%2Feonm-abes%2Fb-cleaner?ref=badge_shield)
B-cleaner is a Rust library dedicated to preprocessing bibliographical data (simplification, normalization). It is used to prepare data for alignment tasks. B-cleaner is designed to have a small memory footprint and high performance.
B-cleaner offers **bindings for Python 3**.
To compile b-cleaner as a Python library, make sure you build it with the `python` feature enabled: `cargo build --release --lib --features=python`. You can also use [maturin](https://github.com/PyO3/maturin).
## Usage
B-cleaner works with tokenized data. Tokenized data should contain punctuation.
B-cleaner is able to clean:
* titles
* authors
* any text
### Rust usage
```rust
use b_cleaner::{TitleCleaner, Clean};
fn main() {
    let raw_data: Vec<&str> = "Lorem ipsum dolor: sit amet".split_whitespace().collect();
    let mut title = TitleCleaner::new(&raw_data);
    title.clean();
    assert_eq!(title.tokens(), &vec!["lorem", "ipsum", "dolor"]);
}
```
### Python usage
```python
>>> import b_cleaner as bc
>>> bc.clean_title(["Lorem", "ipsum", "dolor", "sit", "amet"])
#['lorem', 'ipsum', 'dolor', 'amet']
>>> bc.clean_author(["John", "W.", "Doe", "(1950-2018)"])
#['john', 'w', 'doe']
```
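The Python functions above take lists of tokens rather than raw strings. A minimal sketch of producing such lists follows, assuming that plain whitespace splitting (as in the Rust example) is an adequate tokenization; `tokenize` is a hypothetical helper and not part of b_cleaner.

```python
def tokenize(text: str) -> list:
    """Split raw text on whitespace, leaving punctuation attached to tokens.

    Hypothetical helper for illustration; it is not provided by b_cleaner.
    """
    return text.split()


title_tokens = tokenize("Lorem ipsum dolor: sit amet")
author_tokens = tokenize("John W. Doe (1950-2018)")

print(title_tokens)   # ['Lorem', 'ipsum', 'dolor:', 'sit', 'amet']
print(author_tokens)  # ['John', 'W.', 'Doe', '(1950-2018)']
```

With the module installed, these lists could then be passed to `bc.clean_title` and `bc.clean_author` respectively. Other tokenizers may work as well, provided they keep punctuation in the tokens.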
## Building B-cleaner for Python
B-cleaner can be built and installed with [maturin](https://github.com/PyO3/maturin), a tool for building native Python modules written in Rust with [pyo3](https://github.com/PyO3/pyo3).
Make sure maturin is installed on your system:
```sh
pip install maturin
```
Maturin and pyo3 might require some development dependencies to build the native module:
```sh
sudo apt install python3-dev python-dev
```
Then build and install b_cleaner with:
```sh
pip install .
```