{"id":20519457,"url":"https://github.com/bzaar/dawgsharp","last_synced_at":"2025-04-05T12:07:45.138Z","repository":{"id":16449474,"uuid":"19201293","full_name":"bzaar/DawgSharp","owner":"bzaar","description":"DAWG String Dictionary in C#","archived":false,"fork":false,"pushed_at":"2024-08-18T19:29:44.000Z","size":6122,"stargazers_count":120,"open_issues_count":6,"forks_count":18,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-05T12:07:39.105Z","etag":null,"topics":["c-sharp","dawg","dictionaries","graph","search","trie","trie-tree-autocomplete"],"latest_commit_sha":null,"homepage":"http://www.nuget.org/packages/DawgSharp/","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bzaar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-04-27T10:17:21.000Z","updated_at":"2025-03-23T02:01:16.000Z","dependencies_parsed_at":"2024-11-22T23:02:00.639Z","dependency_job_id":"d3858cbf-cb45-487d-9666-25abfd3d724f","html_url":"https://github.com/bzaar/DawgSharp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzaar%2FDawgSharp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzaar%2FDawgSharp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzaar%2FDawgSharp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzaar%2FDawgSharp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/bzaar","download_url":"https://codeload.github.com/bzaar/DawgSharp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332609,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-sharp","dawg","dictionaries","graph","search","trie","trie-tree-autocomplete"],"created_at":"2024-11-15T22:14:00.535Z","updated_at":"2025-04-05T12:07:45.120Z","avatar_url":"https://github.com/bzaar.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build status](https://ci.appveyor.com/api/projects/status/4htqh2lt5l5vfgxd?svg=true)](https://ci.appveyor.com/project/morpher/dawgsharp)\n  [[NuGet Package](https://www.nuget.org/packages/DawgSharp/)]   [[Get Commercial License](http://morpher.co.uk)]   \n\nDawgSharp, a clever string dictionary in C#\n===========================================\n\nDAWG (Directed Acyclic Word Graph) is a data structure for storing and searching large word lists and dictionaries.  It can be 40x more efficient than the .NET ```Dictionary``` class for certain types of data.\n\nAs an example, [my website](http://russiangram.com) hosts a 2 million word dictionary which used to take up 56 meg on disk and took 7 seconds to load (when using Dictionary and BinarySerializer).  After switching to DAWG, it now takes 1.4 meg on disk and 0.3 seconds to load.\n\nHow is this possible?  Why is the standard Dictionary not as clever as DAWG?  
The thing is, DAWG works well with natural language strings and may not work as well for generated strings such as license keys (OIN1r4Be2su+UXSeOj0TaQ).  Human language words tend to have lots of common letter sequences, e.g. _-ility_ in _ability_, _possibility_, _agility_ etc., and the algorithm takes advantage of that by finding those sequences and storing them only once for multiple words.  DAWG has also proved useful in representing DNA data (sequences of genes).  The history of DAWG dates back as far as 1985.  For more background, search for DAWG or DAFSA (Deterministic Acyclic Finite State Automaton).\n\nDawgSharp is an implementation of DAWG, one of many.  What makes it special?\n\n * It is written in pure C#, compiles to MSIL (AnyCPU) and runs on .NET 3.5 and above.\n * It has no dependencies.\n * It introduces no limitations on characters in keys.  Some competing implementations allow only 26 English letters.  This implementation handles any Unicode characters.\n * The compaction algorithm visits every node only once, which makes it really fast (5 seconds for my 2 million word list).\n * It offers out-of-the-box persistence: call ```SaveTo``` to write the data to disk and ```Load``` to read it back.\n * It has unit tests (using the Visual Studio testing framework).\n * It has received several man-hours of performance profiling sessions so it's pretty much as fast as it can be. 
The next step of making it faster would be rewriting the relevant code in IL.\n * It's GC-friendly: the Garbage Collector only sees 3 large arrays of ints where a Dictionary would store millions of strings.\n\nUsage\n-----\nIn this example we will simulate a usage scenario involving two programs, one to generate the dictionary and write it to disk and the other to load that file and use the read-only dictionary for lookups.\n\nFirst get the code by cloning this repository or installing the [NuGet package](https://www.nuget.org/packages/DawgSharp/).\n\nCreate and populate a ```DawgBuilder``` object:\n\n```csharp\nvar words = new [] { \"Aaron\", \"abacus\", \"abashed\" };\n\nvar dawgBuilder = new DawgBuilder \u003cbool\u003e (); // \u003cbool\u003e is the value type.\n                                             // Key type is always string.\nforeach (string key in words)\n{\n    dawgBuilder.Insert (key, true);\n}\n```\n\n(Alternatively, do ```var dawgBuilder = words.ToDawgBuilder(key =\u003e key, _ =\u003e true);```)\n\nCall ```BuildDawg``` on it to get the compressed version and save it to disk:\n\n```csharp\nDawg\u003cbool\u003e dawg = dawgBuilder.BuildDawg (); \n// Computer is working.  Please wait ...\n\nusing (var file = File.Create (\"DAWG.bin\")) \n    dawg.SaveTo (file);\n```\n\nNow read the file back in and check if a particular word is in the dictionary:\n\n```csharp\n// File.OpenRead opens the existing file for reading.\nvar dawg = Dawg \u003cbool\u003e.Load (File.OpenRead (\"DAWG.bin\"));\n\nif (dawg [\"chihuahua\"])\n{\n    Console.WriteLine (\"Word is found.\");\n}\n```\n\nThe Value Type, \u0026lt;TPayload\u0026gt;\n----------\n\nThe ```Dawg``` and ```DawgBuilder``` classes take a generic type parameter called ```\u003cTPayload\u003e```.  It can be any type you want.  If you only need to test whether a word is in the dictionary, a bool is enough.  You can also make it an ```int``` or a ```string``` or a custom class.  But beware of one important limitation.  
DAWG works well only when the set of values that TPayload can take is relatively small.  The smaller the better.  E.g. if you add a definition for each word, it will make each entry unique and your graph will become a tree (which may not be too bad!).\n\nMatchPrefix()\n-------------\nOne other attractive side of DAWG is its ability to efficiently retrieve all words starting with a particular prefix:\n\n```csharp\ndawg.MatchPrefix(\"awe\")\n```\n\nThe above query will return an ```IEnumerable\u003cKeyValuePair\u003e``` which might contain keys such as **awe, aweful** and **awesome**. The call ```dawg.MatchPrefix(\"\")``` will return all items in the dictionary.\n\nIf you need to look up by suffix instead, there is no MatchSuffix method. But the desired effect can be achieved\nby adding the reversed keys and then using MatchPrefix() on the reversed keys:\n\n```csharp\ndawgBuilder.Insert(\"ability\".Reverse(), true);\n...\ndawg.MatchPrefix(\"ility\".Reverse())\n```\n\nGetPrefixes()\n-------------\n\nGetPrefixes() returns all dictionary items whose keys are prefixes of a given string. For example:\n\n```csharp\ndawg.GetPrefixes(\"awesomenesses\")\n```\nThis might return keys such as **awe, awesome, awesomeness** and finally **awesomenesses**.\n\nGetLongestCommonPrefixLength()\n------------------------------\n\nOne other neat feature is the method ```int GetLongestCommonPrefixLength(IEnumerable\u003cchar\u003e word)```. If ```word``` is found in the dictionary, it will return its length; if not, it will return the length of the longest beginning of the given word that is also the beginning of some word in the dictionary. 
For example, if **prepare** is in the dictionary but **preempt** is not, then ```dawg.GetLongestCommonPrefixLength(\"preempt\")``` will return 3, which is the length of \"pre\".\n\nThread Safety\n-------------\n\nThe ```DawgBuilder``` class is *not* thread-safe and must be accessed by only one thread at any particular time.\n\nThe ```Dawg``` class is immutable and thus thread-safe.\n\n\nMultiDawg\n---------\n\nThe MultiDawg class can store multiple values against a single string key in a very memory-efficient manner.\n\n\nFuture plans\n------------\n### More usage scenarios\n\nThe API was designed to fit a particular usage scenario (see above) and can be extended to support other scenarios, e.g. being able to add new words to the dictionary after it's been compacted.  I just didn't need this so it's not implemented.  You won't get any exceptions.  There is just no ```Insert``` method on the ```Dawg``` class.\n\n### Better API\n\nImplement the IDictionary interface on both DawgBuilder and Dawg ([#5](https://github.com/bzaar/DawgSharp/issues/5)).\n\nLiterature\n----------\n * [Comparisons of Efficient Implementations for DAWG](http://www.ijcte.org/vol8/1018-C024.pdf)\n * [DotNetPerls](http://www.dotnetperls.com/directed-acyclic-word-graph)\n * [Radix trie - Wikipedia](https://en.wikipedia.org/wiki/Radix_tree)\n * [http://wutka.com/dawg.html](http://wutka.com/dawg.html)\n\nCompeting Implementations\n-------------------------\n * [DAWG (C#)](https://www.nuget.org/packages/DAWG)\n * [dawgdic (C++)](https://code.google.com/p/dawgdic/)\n * [MARISA (C++)](https://code.google.com/p/marisa-trie/)\n * [libdatrie (C)](http://linux.thai.net/~thep/datrie/datrie.html)\n\nLicense\n-------\nDawgSharp is licensed under GPLv3, which means it can be used free of charge in open-source projects. 
[Read the full license](LICENSE.txt)\n\nIf you would like to use DawgSharp in a proprietary project, please purchase a commercial license at [http://morpher.co.uk](http://morpher.co.uk).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzaar%2Fdawgsharp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbzaar%2Fdawgsharp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzaar%2Fdawgsharp/lists"}