{"id":31447974,"url":"https://github.com/ashvardanian/stringwars","last_synced_at":"2026-01-18T18:01:58.071Z","repository":{"id":224163012,"uuid":"762493674","full_name":"ashvardanian/StringWars","owner":"ashvardanian","description":"Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit-distances, sketching, and sorting across CPUs and GPUs in Rust 🦀 and Python 🐍","archived":false,"fork":false,"pushed_at":"2025-09-27T14:54:04.000Z","size":572,"stargazers_count":90,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-27T16:02:02.344Z","etag":null,"topics":["benchmark","bioinformatics","database","dataframe","levenshtein-distance","libc","memchr","polars","rapids","string","string-search","strstr","substring-search"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/stringwars-on-gpus/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-02-23T22:31:57.000Z","updated_at":"2025-09-27T14:53:42.000Z","dependencies_parsed_at":"2024-02-24T07:29:00.343Z","dependency_job_id":"5f9f26c1-91a7-40f4-a33f-5373c1aa43c5","html_url":"https://github.com/ashvardanian/StringWars","commit_stats":null,"previous_names":["ashvardanian/memchr_vs_stringzilla","ashvardanian/stringzilla-benchmarks-rs","ashvardanian/stringwa.rs","ashvardanian/stringwars
"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/ashvardanian/StringWars","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/StringWars/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FStringWars/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277782799,"owners_count":25876209,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-01T02:00:09.286Z","response_time":88,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","bioinformatics","database","dataframe","levenshtein-distance","libc","memchr","polars","rapids","string","string-search","strstr","substring-search"],"created_at":"2025-10-01T02:19:08.179Z","updated_at":"2026-01-18T18:01:58.064Z","avatar_url":"https://github.com/ashvardanian.png","language":"Rust","funding_links":[],"cate
gories":[],"sub_categories":[],"readme":"# StringWars\n\n## Text Processing on CPUs \u0026 GPUs, in Python \u0026 Rust\n\n![StringWars Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringWa.rs.jpg?raw=true)\n\nThere are many __great__ libraries for string processing!\nMostly, of course, written in Assembly, C, and C++, but some in Rust as well.\n\nWhere Rust decimates C and C++ is the __simplicity__ of dependency management, making it great for benchmarking \"Systems Software\" and lining up apples-to-apples across native crates and their Python bindings.\nSo, to accelerate the development of the [`StringZilla`](https://github.com/ashvardanian/StringZilla) C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my and the community's most beloved Rust projects, like:\n\n- [`memchr`](https://github.com/BurntSushi/memchr) for substring search.\n- [`rapidfuzz`](https://github.com/rapidfuzz/rapidfuzz-rs) and [`bio`](https://github.com/rust-bio/rust-bio) for edit distances and alignments.\n- [`aHash`](https://github.com/tkaitchuck/aHash), [`xxhash-rust`](https://github.com/DoumanAsh/xxhash-rust), [`foldhash`](https://github.com/orlp/foldhash), and [`blake3`](https://github.com/BLAKE3-team/BLAKE3) for hashing.\n- [`aho_corasick`](https://github.com/BurntSushi/aho-corasick) and [`regex`](https://github.com/rust-lang/regex) for multi-pattern search.\n- [`arrow`](https://github.com/apache/arrow-rs) and [`polars`](https://github.com/pola-rs/polars) for collections and sorting.\n- [`icu`](https://github.com/unicode-org/icu4x) for Unicode processing.\n- [`ring`](https://github.com/briansmith/ring) and [`sodiumoxide`](https://github.com/sodiumoxide/sodiumoxide) for encryption.\n\nOf course, the functionality of the projects is different, as are the APIs and the usage patterns.\nSo, I focus on the workloads for which StringZilla was designed and compare the throughput of the core 
operations.\nNotably, I also favor modern hardware with support for a wider range of SIMD instructions, like mask-equipped AVX-512 on x86 starting from the 2017 Intel Skylake-X CPUs or more recent predicated variable-length SVE and SVE2 on Arm, that aren't often supported by existing libraries and tooling.\n\n\u003e [!IMPORTANT]  \n\u003e The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method.\n\u003e Most of them were obtained on Intel Sapphire Rapids __(SPR)__ and Granite Rapids __(GNR)__ CPUs and Nvidia Hopper-based __H100__ and Blackwell-based __RTX 6000__ Pro GPUs, using Rust with the `-C target-cpu=native` optimization flag.\n\u003e To replicate the results, please refer to the [Replicating the Results](#replicating-the-results) section below.\n\n## Benchmarks at a Glance\n\n### Hash\n\nMany hashing libraries exist, but they often lack reproducible outputs, streaming support, or cross-language availability.\nThroughput on short words and long lines:\n\n```\n                    Short Words                  Long Lines\nRust:\nstringzilla::hash   ████████████████████ 1.84    ████████████████████ 11.38 GB/s\naHash::hash_one     █████████████▍       1.23    ███████████████▏      8.61 GB/s\nxxh3::xxh3_64       ███████████▊         1.08    ████████████████▋     9.48 GB/s\nstd::hash           ████▋                0.43    ██████▌               3.74 GB/s\n\nPython:\nstringzilla.hash    ████████████████████ 0.14    ████████████████████  9.19 GB/s\nhash                ██████████████████▌  0.13    █████████▎            4.27 GB/s\nxxhash.xxh3_64      █████▋               0.04    █████████████▉        6.38 GB/s\n```\n\nSee [hash/README.md](hash/README.md) for details\n\n### Case-Insensitive UTF-8 Search\n\nUnicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς).\nThroughput searching across ~100MB multilingual corpora:\n\n```\nRust:\n                      English                 
     German\nstringzilla           ████████████████████ 12.79   ████████████████████ 10.67 GB/s\nicu                   ▏                     0.08   ▏                     0.08 GB/s\n\n                      Russian                      Korean\nstringzilla           ████████████████████  7.12   ████████████████████ 35.10 GB/s\nicu                   ▏                     0.14   ▏                     0.23 GB/s\n\nPython:\n                      English                      German\nstringzilla           ████████████████████  5.61   ████████████████████  6.08 GB/s\nregex                 ██▋                   0.77   ███                   0.90 GB/s\n\n                      Russian                      Korean\nstringzilla           ████████████████████  5.70   ████████████████████ 20.05 GB/s\nregex                 ████████              2.30   ████▋                 4.59 GB/s\n```\n\nSee [unicode/README.md](unicode/README.md) for details\n\n### Exact Substring Search\n\nSubstring search is offloaded to C's `memmem` or `strstr` in most languages, but SIMD-optimized implementations can do better.\nThroughput on long lines:\n\n```\n                    Left to right                Reverse order\nRust:\nmemmem::Finder      ████████████████████ 10.99\nstringzilla         ███████████████████▋ 10.82   ████████████████████ 10.66 GB/s\nstd::str            ███████████████████▊ 10.88   ███████████▏          5.94 GB/s\n\nPython:\nstringzilla         ████████████████████ 11.79   ████████████████████ 11.56 GB/s\nstr                 ██                    1.23   ██████▋               3.84 GB/s\n```\n\nSee [find/README.md](find/README.md) for details\n\n### Byte-Set Search\n\nSearching for character sets (tabs, HTML markup, digits) commonly uses regex or Aho-Corasick automata.\nThroughput counting all matches on long lines:\n\n```\nRust:\nstringzilla         ████████████████████   8.17 GB/s\nregex::find_iter    ████████████▊          5.22 GB/s\naho_corasick        █▏                     0.50 
GB/s\n\nPython:\nstringzilla         ████████████████████   8.79 GB/s\nre.finditer         ▍                      0.19 GB/s\n```\n\nSee [find/README.md](find/README.md) for details\n\n### UTF-8 Processing\n\nDifferent scripts stress UTF-8 differently: Korean has 3-byte Hangul with single-byte whitespace (representative for tokenization), Arabic uses 2-byte characters, English is mostly 1-byte ASCII.\nThroughput on AMD Zen5 Turin:\n\n```\nNewline splitting:\n                      English                     Arabic\nstringzilla           ████████████████ 15.45      ████████████████████ 18.34 GB/s\nstdlib                ██                1.90      ██                    1.82 GB/s\n\nWhitespace splitting:\n                      English                     Korean\nstringzilla           ████████████████████ 0.82   ████████████████████ 1.88 GB/s\nstdlib                ██████████████████▊  0.77   ██████████▍          0.98 GB/s\nicu::WhiteSpace       ██▋                  0.11   █▌                   0.15 GB/s\n```\n\nCase folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:\n\n```\nCase folding:\n                      English 16x                 German 6x\nstringzilla           ████████████████████ 7.53   ████████████████████ 2.59 GB/s\nstdlib                ██▌                  0.48   ███▎                 0.43 GB/s\n\n                      Russian 10x                 French 5x\nstringzilla           ████████████████████ 2.20   ████████████████████ 1.84 GB/s\nstdlib                ██                   0.22   ███▊                 0.35 GB/s\n\n                      Greek 5x                    Armenian 4x\nstringzilla           ████████████████████ 1.00   ████████████████████  908 MB/s\nstdlib                ████▍                0.22   ████▉                 223 MB/s\n\n                      Vietnamese 1.3x             Chinese 4x\nstringzilla           ████████████████████  352   ████████████████████ 1.21 GB/s\nstdlib                
█████████████▏        265   █████▍                325 MB/s\n```\n\nSee [unicode/README.md](unicode/README.md) for details\n\n### Sequence Operations\n\nDataframe libraries and search engines rely heavily on string sorting.\nSIMD-accelerated comparisons and specialized radix sorts can outperform generic algorithms.\nThroughput on short words:\n\n```\nRust:\nstringzilla         ████████████████████  213.73 M cmp/s\npolars::sort        ██████████████████▊   200.34 M cmp/s\narrow::lexsort      ███████████▍          122.20 M cmp/s\nstd::sort           █████                  54.35 M cmp/s\n\nPython:\npolars.sort         ████████████████████  223.38 M cmp/s\nstringzilla.sorted  ███████████████▎      171.13 M cmp/s\npyarrow.sort        █████▌                 62.17 M cmp/s\nlist.sort           ████▏                  47.06 M cmp/s\n```\n\nGPU: `cudf` on H100 reaches __9,463 M cmp/s__ on short words.\n\nSee [sequence/README.md](sequence/README.md) for details\n\n### Random Generation\n\nRandom byte generation and lookup tables are common in image processing and bioinformatics.\nThroughput on long lines:\n\n```\nRust:\nstringzilla         ████████████████████  10.57 GB/s\nzeroize             ████████▉              4.73 GB/s\nrand_xoshiro        ███████▎               3.85 GB/s\n\nPython:\nstringzilla         ████████████████████  20.37 GB/s\npycryptodome        ████████████▉         13.16 GB/s\nnumpy.Philox        █▌                     1.59 GB/s\n```\n\nSee [memory/README.md](memory/README.md) for details\n\n### Similarity Scoring\n\nEdit distance is essential for search engines, data cleaning, NLP, and bioinformatics.\nIt's computationally expensive with O(n\\*m) complexity, but GPUs and multi-core parallelism help.\nLevenshtein distance on ~1,000 byte lines (MCUPS = Million Cell Updates Per Second):\n\n```\nRust:\n                        1 Core                       1 Socket\nbio::levenshtein        █▏                      823\nrapidfuzz               ████████████████████ 
14,316\nstringzilla (384x GNR)  ██████████████████▎  13,084  ████████████████████ 3,084,270 MCUPS\nstringzilla (B200)                                   ██████▍                998,620 MCUPS\nstringzilla (H100)                                   ██████                 925,890 MCUPS\n```\n\nSee [similarities/README.md](similarities/README.md) for details\n\n### Fingerprinting\n\nConverting variable-length strings into fixed-length sketches (like Min-Hashing) enables fast approximate matching in large-scale retrieval.\nThroughput on ~1,000 byte lines:\n\n```\nRust:\n                        1 Core                       1 Socket\npc::MinHash             ████████████████████   3.16\nstringzilla (384x GNR)  ███▏                   0.51  ███████████████▍      302.30 MB/s\nstringzilla (H100)                                   ████████████████████  392.37 MB/s\n```\n\nSee [fingerprints/README.md](fingerprints/README.md) for details\n\n### Encryption\n\nChaCha20 and AES256 encryption throughput comparison on long lines:\n\n```\nRust:\nring::aes256        ████████████████████   2.89 GB/s\nring::chacha20      ████████▏              1.19 GB/s\nlibsodium::chacha20 █████                  0.71 GB/s\n```\n\nSee [encryption/README.md](encryption/README.md) for details\n\n## Replicating the Results\n\n### Replicating the Results in Rust\n\nBefore running benchmarks, install `cargo-criterion`, which the commands below invoke as `cargo criterion`:\n\n```bash\ncargo install cargo-criterion --locked\n```\n\nTo pull and compile all the dependencies, you can call:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" cargo build --benches --all-features                  # to compile everything\nRUSTFLAGS=\"-C target-cpu=native\" cargo check --benches --all-features --all-targets    # to fail on warnings\n```\n\nBy default, StringWars links `stringzilla` in CPU mode.\nIf the machine has an NVIDIA GPU with CUDA installed, enable the CUDA kernels explicitly when running benches, for example:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" 
\\\n    STRINGWARS_DATASET=README.md \\\n    STRINGWARS_TOKENS=lines \\\n    STRINGWARS_FILTER=GPU \\\n    cargo criterion --features \"cuda bench_similarities\" bench_similarities --jobs 1\n```\n\nWars always take long, and so do these benchmarks.\nEvery one of them includes a warm-up phase of a few seconds to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.\nEach of them accepts a few environment variables to control the dataset, the tokenization, and the error bounds.\nYou can list those by printing the file-level documentation using `awk` on Linux:\n\n```bash\nawk '/^\\/\\/!/ { print } !/^\\/\\/!/ { exit }' find/bench.rs\n```\n\nCommonly used environment variables are:\n\n- `STRINGWARS_DATASET` - the path to the textual dataset file.\n- `STRINGWARS_TOKENS` - the tokenization mode: `file`, `lines`, or `words`.\n- `STRINGWARS_ERROR_BOUND` - the maximum allowed error in the Levenshtein distance.\n\nHere is an example of a common benchmark run on a Unix-like system:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" \\\n    STRINGWARS_DATASET=README.md \\\n    STRINGWARS_TOKENS=lines \\\n    cargo criterion --features bench_hash bench_hash --jobs $(nproc)\n```\n\nOn Windows, using PowerShell, you'd need to set the environment variable differently, and `nproc` is not available, so the core count comes from `NUMBER_OF_PROCESSORS`:\n\n```powershell\n$env:STRINGWARS_DATASET=\"README.md\"\ncargo criterion --jobs $env:NUMBER_OF_PROCESSORS\n```\n\n### Replicating the Results in Python\n\nIt's recommended to use `uv` for Python dependency management and running the benchmarks.\nTo install all dependencies for all benchmarks:\n\n```sh\nuv venv --python 3.12\nuv pip install -r requirements.txt -r requirements-cuda.txt\nuv pip install --only-binary=:all: -r requirements.txt -r requirements-cuda.txt\n```\n\nTo install dependencies for individual benchmarks:\n\n```sh\nPIP_EXTRA_INDEX_URL=https://pypi.nvidia.com \\\nuv pip install '.[find,hash,sequence,fingerprints,similarities]'\n```\n\nTo run individual benchmarks, 
you can call:\n\n```sh\nuv run --no-project python hash/bench.py --help\nuv run --no-project python find/bench.py --help\nuv run --no-project python memory/bench.py --help\nuv run --no-project python sequence/bench.py --help\nuv run --no-project python similarities/bench.py --help\nuv run --no-project python fingerprints/bench.py --help\n```\n\n## Datasets\n\n### UTF8 Corpus\n\nFor mixed UTF-8 data, I've used the XL Sum dataset for multilingual extractive summarization.\nIt's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8.\nTo download, unpack, and run the benchmarks, execute the following bash script in your terminal:\n\n```bash\ncurl -fL -o xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz\ngzip -d xlsum.csv.gz\nSTRINGWARS_DATASET=xlsum.csv cargo criterion --jobs $(nproc)\n```\n\n### Multilingual Wikipedia Corpus\n\nThe Cohere Wikipedia dataset provides pre-processed JSONL files for different languages.\nThis may be the optimal dataset for relative comparison of UTF-8 decoding and matching engines in each individual environment.\nNot all Wikipedia languages are available, but the following have been selected specifically:\n\n- __Chinese (zh)__: 3-byte CJK characters, rare 1-byte punctuation\n- __Korean (ko)__: 3-byte Hangul syllables, frequent 1-byte punctuation\n- __Arabic (ar)__: 2-byte Arabic script, with regular 1-byte punctuation\n- __French (fr)__: Mixed 1-2 byte Latin with high diacritic density\n- __English (en)__: Mostly 1-byte ASCII baseline\n\nTo download and decompress one file from each language:\n\n```bash\ncurl -fL -o wiki_en.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/en/000.jsonl.gz \u0026\u0026 gunzip wiki_en.jsonl.gz\ncurl -fL -o wiki_zh.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/zh/000.jsonl.gz \u0026\u0026 gunzip wiki_zh.jsonl.gz\ncurl -fL -o wiki_ko.jsonl.gz 
https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/ko/000.jsonl.gz \u0026\u0026 gunzip wiki_ko.jsonl.gz\ncurl -fL -o wiki_ar.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/ar/000.jsonl.gz \u0026\u0026 gunzip wiki_ar.jsonl.gz\ncurl -fL -o wiki_fr.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/fr/000.jsonl.gz \u0026\u0026 gunzip wiki_fr.jsonl.gz\ncurl -fL -o wiki_de.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/de/000.jsonl.gz \u0026\u0026 gunzip wiki_de.jsonl.gz\ncurl -fL -o wiki_es.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/es/000.jsonl.gz \u0026\u0026 gunzip wiki_es.jsonl.gz\ncurl -fL -o wiki_it.jsonl.gz https://huggingface.co/datasets/Cohere/wikipedia-22-12/resolve/main/it/000.jsonl.gz \u0026\u0026 gunzip wiki_it.jsonl.gz\n```\n\nEach JSONL file contains one JSON object per line with fields: `id`, `title`, `text` (paragraph content), `url`, `wiki_id`, and `paragraph_id`.\n\n### CC-100 Corpus\n\nThe [CC-100](https://data.statmt.org/cc-100/) corpus provides large monolingual text files (1-80 GB) for 100+ languages, extracted from Common Crawl.\nFiles are XZ-compressed plain text with documents separated by double-newlines.\n\n| Workload                    | Relevant Scripts                  | Best Test Languages                                  |\n| --------------------------- | --------------------------------- | ---------------------------------------------------- |\n| __Case Folding__            | Latin, Cyrillic, Greek, Armenian  | Turkish (I/i), German (ss-\u003eSS), Greek, Russian       |\n| __Normalization__           | Indic, Arabic, Vietnamese, Korean | Vietnamese, Hindi, Korean, Arabic                    |\n| __Whitespace Tokenization__ | Most scripts except CJK/Thai      | English, Russian, Arabic vs. 
Chinese, Japanese, Thai |\n| __Grapheme Clusters__       | Indic, Thai, Khmer, Myanmar       | Thai, Tamil, Myanmar, Khmer                          |\n| __RTL Handling__            | Arabic, Hebrew                    | Arabic, Hebrew, Persian                              |\n\n__Bicameral scripts__ with various case folding rules:\n\n```bash\ncurl -fL https://data.statmt.org/cc-100/en.txt.xz | xz -d \u003e cc100_en.txt      # 82 GB - English\ncurl -fL https://data.statmt.org/cc-100/de.txt.xz | xz -d \u003e cc100_de.txt      # 18 GB - German\ncurl -fL https://data.statmt.org/cc-100/tr.txt.xz | xz -d \u003e cc100_tr.txt      # 5.4 GB - Turkish\ncurl -fL https://data.statmt.org/cc-100/ru.txt.xz | xz -d \u003e cc100_ru.txt      # 46 GB - Russian\ncurl -fL https://data.statmt.org/cc-100/uk.txt.xz | xz -d \u003e cc100_uk.txt      # 14 GB - Ukrainian\ncurl -fL https://data.statmt.org/cc-100/el.txt.xz | xz -d \u003e cc100_el.txt      # 7.4 GB - Greek\ncurl -fL https://data.statmt.org/cc-100/hy.txt.xz | xz -d \u003e cc100_hy.txt      # 776 MB - Armenian\ncurl -fL https://data.statmt.org/cc-100/ka.txt.xz | xz -d \u003e cc100_ka.txt      # 1.1 GB - Georgian\ncurl -fL https://data.statmt.org/cc-100/pl.txt.xz | xz -d \u003e cc100_pl.txt      # 12 GB - Polish\ncurl -fL https://data.statmt.org/cc-100/cs.txt.xz | xz -d \u003e cc100_cs.txt      # 4.4 GB - Czech\ncurl -fL https://data.statmt.org/cc-100/nl.txt.xz | xz -d \u003e cc100_nl.txt      # 7.9 GB - Dutch\ncurl -fL https://data.statmt.org/cc-100/fr.txt.xz | xz -d \u003e cc100_fr.txt      # 14 GB - French\ncurl -fL https://data.statmt.org/cc-100/es.txt.xz | xz -d \u003e cc100_es.txt      # 14 GB - Spanish\ncurl -fL https://data.statmt.org/cc-100/pt.txt.xz | xz -d \u003e cc100_pt.txt      # 13 GB - Portuguese\ncurl -fL https://data.statmt.org/cc-100/it.txt.xz | xz -d \u003e cc100_it.txt      # 7.8 GB - Italian\n```\n\n__Unicameral scripts__ without case folding, but with other normalization/segmentation 
challenges:\n\n```bash\ncurl -fL https://data.statmt.org/cc-100/ar.txt.xz | xz -d \u003e cc100_ar.txt      # 5.4 GB - Arabic (RTL)\ncurl -fL https://data.statmt.org/cc-100/he.txt.xz | xz -d \u003e cc100_he.txt      # 6.1 GB - Hebrew (RTL)\ncurl -fL https://data.statmt.org/cc-100/fa.txt.xz | xz -d \u003e cc100_fa.txt      # 20 GB - Persian (RTL)\ncurl -fL https://data.statmt.org/cc-100/hi.txt.xz | xz -d \u003e cc100_hi.txt      # 2.5 GB - Hindi (Devanagari)\ncurl -fL https://data.statmt.org/cc-100/bn.txt.xz | xz -d \u003e cc100_bn.txt      # 860 MB - Bengali\ncurl -fL https://data.statmt.org/cc-100/ta.txt.xz | xz -d \u003e cc100_ta.txt      # 1.3 GB - Tamil\ncurl -fL https://data.statmt.org/cc-100/te.txt.xz | xz -d \u003e cc100_te.txt      # 536 MB - Telugu\ncurl -fL https://data.statmt.org/cc-100/th.txt.xz | xz -d \u003e cc100_th.txt      # 8.7 GB - Thai (no spaces)\ncurl -fL https://data.statmt.org/cc-100/vi.txt.xz | xz -d \u003e cc100_vi.txt      # 28 GB - Vietnamese\ncurl -fL https://data.statmt.org/cc-100/zh-Hans.txt.xz | xz -d \u003e cc100_zh.txt # 14 GB - Chinese\ncurl -fL https://data.statmt.org/cc-100/ja.txt.xz | xz -d \u003e cc100_ja.txt      # 15 GB - Japanese\ncurl -fL https://data.statmt.org/cc-100/ko.txt.xz | xz -d \u003e cc100_ko.txt      # 14 GB - Korean (Jamo)\ncurl -fL https://data.statmt.org/cc-100/my.txt.xz | xz -d \u003e cc100_my.txt      # 46 MB - Myanmar\ncurl -fL https://data.statmt.org/cc-100/km.txt.xz | xz -d \u003e cc100_km.txt      # 153 MB - Khmer\ncurl -fL https://data.statmt.org/cc-100/am.txt.xz | xz -d \u003e cc100_am.txt      # 133 MB - Amharic (Ethiopic)\ncurl -fL https://data.statmt.org/cc-100/si.txt.xz | xz -d \u003e cc100_si.txt      # 452 MB - Sinhala\n```\n\n### Leipzig Corpora Collection\n\nThe [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/) provides pre-segmented sentences in 200+ languages.\nEach tar.gz contains `*-sentences.txt` (tab-separated `id\\tsentence`), `*-words.txt` (frequencies), and 
co-occurrence files.\nStandard sizes: 10K, 30K, 100K, 300K, 1M sentences. Check for newer years at the download page.\n\n__Bicameral scripts__ with various case folding rules:\n\n```bash\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/eng_wikipedia_2016_1M.tar.gz | tar -xzf - -O 'eng_wikipedia_2016_1M/eng_wikipedia_2016_1M-sentences.txt' | cut -f2 \u003e leipzig1M_en.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/deu_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'deu_wikipedia_2021_1M/deu_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_de.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/tur_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'tur_wikipedia_2021_1M/tur_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_tr.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/rus_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'rus_wikipedia_2021_1M/rus_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_ru.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/ukr_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ukr_wikipedia_2021_1M/ukr_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_uk.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/ell_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ell_wikipedia_2021_1M/ell_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_el.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/hye_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'hye_wikipedia_2021_1M/hye_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_hy.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/kat_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'kat_wikipedia_2021_300K/kat_wikipedia_2021_300K-sentences.txt' | cut -f2 \u003e leipzig300K_ka.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/pol_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'pol_wikipedia_2021_1M/pol_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_pl.txt\ncurl -fL 
https://downloads.wortschatz-leipzig.de/corpora/ces_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ces_wikipedia_2021_1M/ces_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_cs.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/nld_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'nld_wikipedia_2021_1M/nld_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_nl.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/fra_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'fra_wikipedia_2021_1M/fra_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_fr.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/spa_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'spa_wikipedia_2021_1M/spa_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_es.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/por_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'por_wikipedia_2021_1M/por_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_pt.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/ita_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ita_wikipedia_2021_1M/ita_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_it.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/lit_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'lit_wikipedia_2021_300K/lit_wikipedia_2021_300K-sentences.txt' | cut -f2 \u003e leipzig300K_lt.txt\n```\n\n__Unicameral scripts__ without case folding, but with other normalization/segmentation challenges:\n\n```bash\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/ara_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ara_wikipedia_2021_1M/ara_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_ar.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/heb_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'heb_wikipedia_2021_1M/heb_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_he.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/fas_wikipedia_2014_1M.tar.gz | tar -xzf - -O 
'fas_wikipedia_2014_1M/fas_wikipedia_2014_1M-sentences.txt' | cut -f2 \u003e leipzig1M_fa.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/hin_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'hin_wikipedia_2021_1M/hin_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_hi.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/ben_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'ben_wikipedia_2021_1M/ben_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_bn.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/tam_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'tam_wikipedia_2021_1M/tam_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_ta.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/tel_wikipedia_2021_300K.tar.gz | tar -xzf - -O 'tel_wikipedia_2021_300K/tel_wikipedia_2021_300K-sentences.txt' | cut -f2 \u003e leipzig300K_te.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/tha_wikipedia_2021_10K.tar.gz | tar -xzf - -O 'tha_wikipedia_2021_10K/tha_wikipedia_2021_10K-sentences.txt' | cut -f2 \u003e leipzig10K_th.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/vie_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'vie_wikipedia_2021_1M/vie_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_vi.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/zho_wikipedia_2018_1M.tar.gz | tar -xzf - -O 'zho_wikipedia_2018_1M/zho_wikipedia_2018_1M-sentences.txt' | cut -f2 \u003e leipzig1M_zh.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/jpn_wikipedia_2018_1M.tar.gz | tar -xzf - -O 'jpn_wikipedia_2018_1M/jpn_wikipedia_2018_1M-sentences.txt' | cut -f2 \u003e leipzig1M_ja.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/kor_wikipedia_2021_1M.tar.gz | tar -xzf - -O 'kor_wikipedia_2021_1M/kor_wikipedia_2021_1M-sentences.txt' | cut -f2 \u003e leipzig1M_ko.txt\ncurl -fL https://downloads.wortschatz-leipzig.de/corpora/amh_wikipedia_2021_30K.tar.gz | tar -xzf - -O 
'amh_wikipedia_2021_30K/amh_wikipedia_2021_30K-sentences.txt' | cut -f2 \u003e leipzig30K_am.txt\n```\n\nTo produce a mixed dataset with rows in all languages:\n\n```bash\ncat leipzig*.txt | shuf | head -c 1G \u003e leipzig1GB.txt\n```\n\n### DNA Corpus\n\nFor bioinformatics workloads, I use the following datasets with increasing string lengths:\n\n```bash\ncurl -fL -o acgt_100.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100.txt?download=true'\ncurl -fL -o acgt_1k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1k.txt?download=true'\ncurl -fL -o acgt_10k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10k.txt?download=true'\ncurl -fL -o acgt_100k.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_100k.txt?download=true'\ncurl -fL -o acgt_1m.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1m.txt?download=true'\ncurl -fL -o acgt_10m.txt 'https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10m.txt?download=true'\n```\n\n## Deep Profiling\n\nIf you are profiling some of the internal kernels of the mentioned libraries, here are a few example commands to get started.\nFor example, `ncu` on NVIDIA GPUs can evaluate the register usage and occupancy of the CUDA kernels used in StringZilla's Levenshtein distance calculation:\n\n```bash\n/usr/local/cuda/bin/ncu \\\n  --metrics launch__registers_per_thread,launch__occupancy_per_block_size,sm__warps_active.avg.pct_of_peak_sustained_active,sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,dram__bytes.sum \\\n  --target-processes all \\\n  --kernel-name \"levenshtein_on_each_cuda_thread\" \\\n  --launch-skip 5 \\\n  --launch-count 1 \\\n  bash -c 'STRINGWARS_DATASET=acgt_100.txt STRINGWARS_BATCH=65536 STRINGWARS_TOKENS=lines STRINGWARS_FILTER=\"uniform/stringzillas::LevenshteinDistances\\(1xGPU\\)\" 
cargo criterion --features \"cuda bench_similarities\" bench_similarities --jobs 1'\n```\n\nUsing `perf` on Linux to analyze the CPU-side performance of SIMD-accelerated substring search:\n\n```bash\nperf record -e cpu-clock -g graph,0x400000 -o perf.data -- cargo criterion --features \"bench_similarities\" bench_similarities --jobs 1\nperf report -i perf.data\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fstringwars","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fstringwars","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fstringwars/lists"}