{"id":18925251,"url":"https://github.com/newca12/dictionary-builder","last_synced_at":"2025-10-15T04:54:25.385Z","repository":{"id":1273897,"uuid":"1213137","full_name":"newca12/dictionary-builder","owner":"newca12","description":"Real world example to demonstrate advanced techniques to unmarshall very large xml document with very low memory footprint.","archived":false,"fork":false,"pushed_at":"2025-03-22T17:08:43.000Z","size":206,"stargazers_count":60,"open_issues_count":1,"forks_count":13,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-22T18:22:17.006Z","etag":null,"topics":["akka-stream","dictionary","jaxb","rust"],"latest_commit_sha":null,"homepage":"https://edla.org","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/newca12.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-01-01T22:49:45.000Z","updated_at":"2025-03-22T17:08:47.000Z","dependencies_parsed_at":"2025-03-22T18:20:27.212Z","dependency_job_id":"8c18ea12-23f5-4eba-86c4-fed7fef008f1","html_url":"https://github.com/newca12/dictionary-builder","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/newca12%2Fdictionary-builder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/newca12%2Fdictionary-builder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/newca12%2Fdictionary-builder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/newca12%2Fdictionary-builder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/newca12","download_url":"https://codeload.github.com/newca12/dictionary-builder/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249080220,"owners_count":21209490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["akka-stream","dictionary","jaxb","rust"],"created_at":"2024-11-08T11:10:21.751Z","updated_at":"2025-10-15T04:54:20.343Z","avatar_url":"https://github.com/newca12.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dictionary builder [![OpenHub](http://www.openhub.net/p/dictionary-builder/widgets/project_thin_badge.gif)](https://www.openhub.net/p/dictionary-builder)\n## About ##\nThis project allows you to build dictionaries based on [Wiktionary](http://www.wiktionary.org/) entries.   \n\nDictionary builder started as a demonstration of advanced JAXB techniques for unmarshalling very large XML documents with a very low memory footprint.   \nThe Java/JAXB implementation has been archived in the [java-jaxb branch](https://github.com/newca12/dictionary-builder/tree/java-jaxb)\n\nIt was then rewritten with Scala and Akka Streams.  
\nThe Scala/akka-stream implementation has been archived in the [scala-akka-streams branch](https://github.com/newca12/dictionary-builder/tree/scala-akka-streams)\n\nIt has now been rewritten in Rust.\n\nThe resulting dictionary is exactly the same with all three implementations.\nNone of these implementations was designed to be used as a benchmark, but the Rust results are nevertheless breathtaking. See below.\n\ndictionary-builder is an EDLA project.\n\nThe purpose of [edla.org](https://edla.org) is to promote the state of the art in various domains.\n\n## Warning ##\n\nDon't expect too much from this dictionary builder.  \nAfter running this program, you will find the following in the ```root``` folder configured in Settings.toml:  \n* a file, named by the ```words_file``` setting, that contains all the words found (and expressions if ```expression = true``` is configured)\n* a file, named by the ```excluded_words_file``` setting, that contains all pages in the dump that were filtered out  \n* if ```with_definition = true``` is configured, a set of folders (two levels deep) containing gzip-compressed files. Each file contains the definition in the **raw Wikimedia format**, which is probably not what you expect.\n\n\n## How to use it ##\n\n0. Rust [needs to be installed](https://doc.rust-lang.org/book/ch01-01-installation.html) to build the executable\n\n1. Get a fresh Wiktionary dump   \nChoose your favorite language and download the dump containing the current versions of article content [here](http://download.wikimedia.org/backup-index.html)  \nExample for the English dump:\nhttp://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles-multistream.xml.bz2\n\n2. Uncompress the freshly downloaded dump somewhere (note that you need up to 10 GB of free disk space)\n\n3. Edit Settings.toml to indicate the language you chose, where the dump is located and, last but not least, where the dictionary should be generated.  
\n(On Windows, paths need to be escaped, for example `C:\\\\dico\\\\words`. Note that you need at least 4 GB of free disk space to store your dictionary if you set `with_definition=true`.)  \n**The root folder must already exist; create it yourself if it does not.** \n\n4. Build the executable: `cargo build --release`  \n\n5. Launch the program: `./target/release/dictionary-builder`\n\n6. Some results:  \nFor the English dictionary, 1001051 entries are generated in less than 2 minutes, and the dictionary requires 3.5 GB of disk space.  \nNote: on some systems, antivirus software can slow generation down considerably if ```with_definition = true``` is configured.\n\nThat's it.\n\n## Performance comparison ##\n\nTests were run on a modest i7-4600U CPU @ 2.10GHz with an SSD.  \nThe results sound like a joke:\n\n| | Rust | Scala/Akka Streams | Java/JAXB |\n| :---: | :---: | :---: | :---: |\n| without definitions | 37s | 4min 47s | 7min 36s |\n| with definitions | 1min 53s | 5min 46s | 9min 1s |\n\nThe Rust implementation outperforms the other implementations by far and, as the icing on the cake, uses ten times less memory. :rocket:\n\n## Developer Notes ##\n\nSome words, such as `con`, are [reserved](https://superuser.com/questions/86999/why-cant-i-name-a-folder-or-file-con-in-windows) on Windows, but:\n``` rust\nFile::create(\"con\").expect(\"Unable to create file\");\n```\nwill not trigger any error. (This is not specific to Rust; Java will not throw an exception either.)\n\n### License ###\n© 2009-2023 Olivier ROLAND. Distributed under the GPLv3 License.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnewca12%2Fdictionary-builder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnewca12%2Fdictionary-builder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnewca12%2Fdictionary-builder/lists"}