https://github.com/pprzetacznik/patent-parsing-tools
USPTO patents dataset generator
- Host: GitHub
- URL: https://github.com/pprzetacznik/patent-parsing-tools
- Owner: pprzetacznik
- License: mit
- Created: 2015-09-26T23:26:19.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2024-12-29T21:15:20.000Z (9 months ago)
- Last Synced: 2025-05-25T08:50:06.563Z (5 months ago)
- Topics: dataset, dataset-generation, python, uspto
- Language: Python
- Homepage:
- Size: 1.69 MB
- Stars: 5
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
patent-parsing-tools
====================
USPTO patents dataset generator.

[Documentation status](https://patent-parsing-tools.readthedocs.io/en/latest/?badge=latest) | [CI](https://github.com/pprzetacznik/patent-parsing-tools/actions?query=workflow%3A"patent-parsing-tools+CI") | [PyPI](https://pypi.org/project/patent-parsing-tools/)

## Documentation
[Read the docs](https://patent-parsing-tools.readthedocs.io/en/latest/)
## System requirements
```Bash
sudo yum install python-devel libxslt-devel libxml2-devel
```

## Installation:
```Bash
pip install patent-parsing-tools
```

## Examples:
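
The commands in this section from the `supervisor` step onward are linked through shared paths: the train and test destinations written by `supervisor` are read by `dictionary_maker` and `bag_of_words`, and both `bag_of_words` runs read the same dictionary file. A minimal sketch of a driver that builds those commands from one set of paths is shown here. The helper names `build_steps` and `run_pipeline` are hypothetical and not part of the package; only the module names and flags are taken from the examples in this README. The downloader step is left out because this README does not state how its `--directory` relates to the supervisor's working directory.

```python
import subprocess
import sys


def build_steps(root="patents", dictionary="dictionary.txt",
                year_from=2014, year_to=2015):
    """Return the README commands as argument lists, in execution order.

    Hypothetical helper: only the module names and flags come from the README.
    """
    train = f"{root}/train_destination"
    test = f"{root}/test_destination"

    def module(name, *args):
        # Equivalent to `python -m patent_parsing_tools.<name> <args>`.
        return [sys.executable, "-m", f"patent_parsing_tools.{name}", *args]

    steps = [
        module("supervisor",
               "--working-directory", f"{root}/working_directory",
               "--train-destination", train,
               "--test-destination", test,
               "--year-from", str(year_from),
               "--year-to", str(year_to)),
        module("bow.dictionary_maker",
               "--train-directory", train,
               "--max-patents", "1000000000",
               "--dictionary", dictionary,
               "--dict-max-size", "4096"),
    ]
    # The same dictionary is applied to the train set and the test set.
    for source, target in ((train, "final_dataset_train"),
                           (test, "final_dataset_test")):
        steps.append(module("bow.bag_of_words",
                            "--serialized-patents", source,
                            "--destination-directory", f"{root}/{target}",
                            "--dictionary", dictionary,
                            "--batch-size", "1048576"))
    return steps


def run_pipeline(steps):
    """Run each step in order; check=True stops at the first failing step."""
    for command in steps:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Preview only: print the commands instead of executing them.
    for command in build_steps():
        print(" ".join(command[1:]))
```

Building the argument lists separately from running them makes it possible to inspect or log the commands first; calling `run_pipeline(build_steps())` would execute them in sequence.
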
Downloading dataset:
```Bash
python -m patent_parsing_tools.downloader \
--directory dataset \
--year-from 2010 \
--year-to 2010
```

Collecting and serializing data:
```Bash
python -m patent_parsing_tools.supervisor \
--working-directory patents/working_directory \
--train-destination patents/train_destination \
--test-destination patents/test_destination \
--year-from 2014 \
--year-to 2015
```

Generating dictionary with train set:
```Bash
python -m patent_parsing_tools.bow.dictionary_maker \
--train-directory patents/train_destination \
--max-patents 1000000000 \
--dictionary dictionary.txt \
--dict-max-size 4096
```

Generating bag of words with train set and test set:
```Bash
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/train_destination \
--destination-directory patents/final_dataset_train \
--dictionary dictionary.txt \
--batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/test_destination \
--destination-directory patents/final_dataset_test \
--dictionary dictionary.txt \
--batch-size 1048576
```

## Testing
```Bash
pytest
```

## Contributing and development
```Bash
$ mkvirtualenv ppt
$ workon ppt
(ppt) $ pip install -r requirements.txt
```

## Publish new release
```Bash
$ git tag v1.0
$ git push origin v1.0
```

## Building documentation
```Bash
(ppt) $ sphinx-build -M html docs docs_build
```

## References
Usage:
* Elton, *Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora*, 2019, online: [https://arxiv.org/abs/1903.00415](https://arxiv.org/abs/1903.00415).
* Lee, *Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review*, 2023, online: [https://doi.org/10.1007/s40684-023-00523-6](https://doi.org/10.1007/s40684-023-00523-6).

## License
The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. See the [LICENSE](LICENSE) file for more information.