https://github.com/oscar-project/ungoliant
:spider: The pipeline for the OSCAR corpus
common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar
- Host: GitHub
- URL: https://github.com/oscar-project/ungoliant
- Owner: oscar-project
- License: apache-2.0
- Created: 2021-02-15T03:19:32.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-12-18T16:31:48.000Z (11 months ago)
- Last Synced: 2024-07-09T02:02:50.111Z (4 months ago)
- Topics: common-crawl, commoncrawl, corpus-linguistics, crawler, fasttext, language-classification, nlp, oscar
- Language: Rust
- Homepage: https://oscar-corpus.com
- Size: 4.72 MB
- Stars: 154
- Watchers: 2
- Forks: 14
- Open Issues: 31
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Ungoliant
![](https://img.shields.io/crates/d/ungoliant?style=flat-square) ![](https://img.shields.io/crates/l/ungoliant?style=flat-square)
[![codecov](https://codecov.io/gh/oscar-corpus/ungoliant/branch/master/graph/badge.svg?token=Q3M8F86E2G)](https://codecov.io/gh/oscar-corpus/ungoliant)

🕷️ **Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl.** 🕷️

It is currently the generation pipeline for the [OSCAR corpus](https://oscar-corpus.com), which is built from [CommonCrawl](https://commoncrawl.org).
Ungoliant is a replacement for [goclassy](https://github.com/oscar-corpus/goclassy).

![](https://img.shields.io/github/workflow/status/oscar-corpus/ungoliant/Rust/master?label=main&style=flat-square) ![](https://img.shields.io/github/workflow/status/oscar-corpus/ungoliant/Rust/dev?label=dev&style=flat-square)
## Installation
### Installing/Compiling the binary
* Via `cargo`: `cargo install ungoliant`
* Via `git`: `cargo install --git https://github.com/oscar-corpus/ungoliant`

Ungoliant has numerous dependencies, which are compiled during installation. You may also need `cmake` and `gcc`, because the project uses [fasttext-rs](https://github.com/messense/fasttext-rs).
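Before installing, it can help to check that the build tools are on your `PATH`. The following is a minimal sketch: `missing_tools` is a hypothetical helper, not part of Ungoliant, and the tool list only mirrors the requirements mentioned above.

```bash
# Hypothetical helper (not part of ungoliant): print each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# cargo builds the crate; cmake and gcc may be needed for fasttext-rs.
missing="$(missing_tools cargo cmake gcc)"
if [ -n "$missing" ]; then
  printf 'Missing build tools:\n%s\n' "$missing"
fi
```

If nothing is printed, all three tools were found.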
### KenLM feature
The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.
To enable it, first install the KenLM requirements:
```bash
apt install -y libboost-all-dev libeigen3-dev
```

Then use `cargo install ungoliant --features kenlm`, or `cargo b --features kenlm` if you're building from source.
### Getting a language identification file (for fastText):
By default, `ungoliant` expects the `lid.176.bin` model by Meta.
Use `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin` to get it.

However, you can use any model you want: just point to its path using `ungoliant download --lid-path`.
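If you script the setup, you may want to skip the download when the model is already present. A minimal sketch, assuming `curl` is installed; `ensure_lid_model` is a hypothetical helper name, and the URL is the one given above.

```bash
# URL of the default fastText language identification model (from above).
LID_URL="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"

# Hypothetical helper: download the model to the given path unless a
# non-empty file already exists there.
ensure_lid_model() {
  if [ -s "$1" ]; then
    printf 'model already present: %s\n' "$1"
  else
    curl --fail --location "$LID_URL" -o "$1"
  fi
}

# Usage (needs network access on the first run):
#   ensure_lid_model lid.176.bin
```

The `-s` test only checks that the file exists and is non-empty; it does not validate that the file is a usable model.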
Other options include:
- NLLB model (https://huggingface.co/facebook/fasttext-language-identification)
- OpenLID model (https://github.com/laurieburchell/open-lid-dataset)

## Usage
The usual way of generating corpora is:
1. Fetch the `wet.paths.gz` file from the latest [CommonCrawl dump](https://commoncrawl.org/connect/blog/) and decompress it.
2. Download the files using the `download` command.
3. Generate the corpus using the `pipeline` command (it may take some time).
4. Head on to [oscar-tools](https://github.com/oscar-project/oscar-tools) for the packaging steps.

You can find more information in each command's `--help`.
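Step 1 above can be sketched as follows. This is only an illustration: the crawl identifier `CC-MAIN-2023-50` is an example, `https://data.commoncrawl.org` is assumed to be the public download endpoint, and both helper names are hypothetical. The arguments of `download` and `pipeline` are deliberately left out.

```bash
# Assumption: example crawl identifier; replace it with the dump you want.
CRAWL_ID="CC-MAIN-2023-50"
# Assumption: public Common Crawl HTTPS endpoint.
BASE_URL="https://data.commoncrawl.org"

# Hypothetical helper: print the URL of the wet.paths.gz index for a crawl.
wet_paths_url() {
  printf '%s/crawl-data/%s/wet.paths.gz\n' "$BASE_URL" "$1"
}

# Hypothetical helper: download the index into the current directory
# and decompress it to a plain-text file named wet.paths.
fetch_wet_paths() {
  curl --fail --location --remote-name "$(wet_paths_url "$1")"
  gunzip -f wet.paths.gz
}

# Usage (needs network access):
#   fetch_wet_paths "$CRAWL_ID"
#   ungoliant download --help   # arguments for step 2
#   ungoliant pipeline --help   # arguments for step 3
```

The top-level help output is reproduced here for reference: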
```text
ungoliant 2
corpus generation tool.

USAGE:
    ungoliant

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.
```

## Documentation
Ungoliant is not yet on docs.rs: run `cargo doc --bins --open` from the repository to build the documentation locally and open it.
Head on to [OSCAR Documentation](https://oscar-project.github.io/documentation/) for more info about the project.