https://github.com/laion-ai/riverbed
Tools for content datamining and NLP at scale
- Host: GitHub
- URL: https://github.com/laion-ai/riverbed
- Owner: LAION-AI
- License: apache-2.0
- Created: 2022-09-06T16:49:11.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-06-20T02:17:22.000Z (12 months ago)
- Last Synced: 2025-05-07T18:13:34.945Z (about 1 month ago)
- Language: Python
- Homepage:
- Size: 5.91 MB
- Stars: 43
- Watchers: 7
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# riverbed
Tools for content datamining and NLP at scale.

## motivation
Given a collection of text in human language or code, we would like to:
- Filter for quality, NSFW content, and potentially illegal text
- Label the content and create classifiers for it
- Search, store, and share the content with users and with other AI models

## installation
```
git clone https://github.com/ontocord/riverbed/
chmod ugo+x /content/riverbed/bin/lmplz
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install dataset datasets fasttext indexed_gzip whoosh transformers sentencepiece spacy nltk fast-pytorch-kmeans mmh3 tqdm
git clone --recursive https://github.com/seomoz/simhash-py
rm simhash-py/simhash/*.cpp
python simhash-py/setup.py install build_ext --inplace
pip install tsnecuda==3.0.1+cu112 -f https://tsnecuda.isx.ai/tsnecuda_stable.html
python -m spacy download en_core_web_md
python -m nltk.downloader stopwords
```

## history
Originally written by Ontocord, LLC, and donated to LAION for the open-source community.