{"id":13564938,"url":"https://github.com/oscar-project/ungoliant","last_synced_at":"2025-04-03T22:30:23.451Z","repository":{"id":38330055,"uuid":"338956175","full_name":"oscar-project/ungoliant","owner":"oscar-project","description":":spider: The pipeline for the OSCAR corpus","archived":false,"fork":false,"pushed_at":"2023-12-18T16:31:48.000Z","size":4946,"stargazers_count":162,"open_issues_count":31,"forks_count":14,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-04T18:45:38.572Z","etag":null,"topics":["common-crawl","commoncrawl","corpus-linguistics","crawler","fasttext","language-classification","nlp","oscar"],"latest_commit_sha":null,"homepage":"https://oscar-corpus.com","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oscar-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-02-15T03:19:32.000Z","updated_at":"2024-10-02T20:59:50.000Z","dependencies_parsed_at":"2024-01-16T19:19:10.448Z","dependency_job_id":null,"html_url":"https://github.com/oscar-project/ungoliant","commit_stats":{"total_commits":358,"total_committers":8,"mean_commits":44.75,"dds":0.05865921787709494,"last_synced_commit":"c2be3db7167d5d1c3dd3b5e89ee2dc8944ebd0b4"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oscar-project%2Fungoliant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oscar-project%2Fungoliant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oscar-project%
2Fungoliant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oscar-project%2Fungoliant/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oscar-project","download_url":"https://codeload.github.com/oscar-project/ungoliant/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247089601,"owners_count":20881802,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common-crawl","commoncrawl","corpus-linguistics","crawler","fasttext","language-classification","nlp","oscar"],"created_at":"2024-08-01T13:01:38.271Z","updated_at":"2025-04-03T22:30:22.740Z","avatar_url":"https://github.com/oscar-project.png","language":"Rust","funding_links":[],"categories":["Rust","Data Processing"],"sub_categories":[],"readme":"# Ungoliant\n\n\u003cimg align=\"left\" src=\"img/logo.png\" width=\"200\" height=\"200\" /\u003e \n\n![](https://img.shields.io/crates/d/ungoliant?style=flat-square) ![](https://img.shields.io/crates/l/ungoliant?style=flat-square) \n[![codecov](https://codecov.io/gh/oscar-corpus/ungoliant/branch/master/graph/badge.svg?token=Q3M8F86E2G)](https://codecov.io/gh/oscar-corpus/ungoliant)\n\n🕷️ **Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl.** 🕷️\n\nIt currently is the generation pipeline for [OSCAR corpus](https://oscar-corpus.com), from [CommonCrawl](https://commoncrawl.org).\nUngoliant is a replacement of 
[goclassy](https://github.com/oscar-corpus/goclassy).\n\n\n![](https://img.shields.io/github/workflow/status/oscar-corpus/ungoliant/Rust/master?label=main\u0026style=flat-square)                           ![](https://img.shields.io/github/workflow/status/oscar-corpus/ungoliant/Rust/dev?label=dev\u0026style=flat-square)\n\n## Installation\n\n### Installing/Compiling the binary\n* Via `cargo`: `cargo install ungoliant`\n* Via `git`: `cargo install --git https://github.com/oscar-corpus/ungoliant`\n\nUngoliant has numerous dependencies that are compiled during installation. However, `cmake` and `gcc` may be needed, as the project uses [fasttext-rs](https://github.com/messense/fasttext-rs).\n\n### KenLM feature\n\nThe KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.\n\nTo enable it, install the KenLM requirements:\n\n```bash\napt install -y libboost-all-dev libeigen3-dev\n```\n\nand use `cargo install ungoliant --features kenlm` or `cargo b --features kenlm` if you're building from source.\n\n### Getting a language identification file (for fastText):\n\nBy default, `ungoliant` expects the `lid.176.bin` model by Meta. \nUse `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin` to get it.\n\nHowever, you can use any model you want: just point to its path using `ungoliant download --lid-path \u003cpath to lid\u003e`.\n\nOther options include:\n\n- NLLB model (https://huggingface.co/facebook/fasttext-language-identification)\n- OpenLID model (https://github.com/laurieburchell/open-lid-dataset)\n\n\n## Usage \n\nThe usual way of generating corpora is:\n\n1. Fetch the `wet.paths.gz` file from the latest [CommonCrawl dump](https://commoncrawl.org/connect/blog/) and decompress it.\n2. Download the files using the `download` command.\n3. Generate the corpus using the `pipeline` command (it may take some time).\n4. 
Head over to [oscar-tools](https://github.com/oscar-project/oscar-tools) for the packaging steps.\n\nYou can find more information in each command's `--help` output.\n\n```text\nungoliant 2\ncorpus generation tool.\n\nUSAGE:\n    ungoliant \u003cSUBCOMMAND\u003e\n\nFLAGS:\n    -h, --help       Prints help information\n    -V, --version    Prints version information\n\nSUBCOMMANDS:\n    download    Download a CommonCrawl release\n    help        Prints this message or the help of the given subcommand(s)\n    pipeline    Run pipeline\n    rebuild     Rebuild the corpus for a given language.\n```\n\n## Documentation\n\nUngoliant is not yet on docs.rs: use `cargo doc --bins --open` to open the documentation.\n\nHead over to the [OSCAR Documentation](https://oscar-project.github.io/documentation/) for more info about the project.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foscar-project%2Fungoliant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foscar-project%2Fungoliant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foscar-project%2Fungoliant/lists"}