{"id":21007064,"url":"https://github.com/jjfiv/csc212spellchecking","last_synced_at":"2026-03-03T03:36:17.948Z","repository":{"id":142094494,"uuid":"156628650","full_name":"jjfiv/CSC212SpellChecking","owner":"jjfiv","description":"Data Structure Analysis for Spell Checking","archived":false,"fork":false,"pushed_at":"2020-10-13T10:45:53.000Z","size":937,"stargazers_count":0,"open_issues_count":2,"forks_count":104,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-18T22:24:29.972Z","etag":null,"topics":["data-analysis","smith-csc212"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jjfiv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-11-08T00:52:39.000Z","updated_at":"2019-11-08T18:20:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"bb7d5630-b765-48fd-94aa-c485b4aff68b","html_url":"https://github.com/jjfiv/CSC212SpellChecking","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jjfiv/CSC212SpellChecking","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2FCSC212SpellChecking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2FCSC212SpellChecking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2FCSC212SpellChecking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2FCSC212SpellChecking/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/jjfiv","download_url":"https://codeload.github.com/jjfiv/CSC212SpellChecking/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2FCSC212SpellChecking/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30031188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T03:27:35.548Z","status":"ssl_error","status_checked_at":"2026-03-03T03:27:09.213Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","smith-csc212"],"created_at":"2024-11-19T08:54:44.824Z","updated_at":"2026-03-03T03:36:17.932Z","avatar_url":"https://github.com/jjfiv.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CSC212: Spell Checking\n\nIn this assignment, we will compare and contrast different data structures for creating a SpellChecker.\n\n## Rubric (100)\n\nThis assignment is out of 100 points. There are three pure coding sections, adding up to (25) points. The rest involves a little more data analysis. (You will have to code, though). All data analysis can be done in Google Sheets or Excel or R (etc.) -- you need not make graphs with Java.\n\n### (20) Report \u0026 Reflection.\n\nThis assignment is more like a scientific investigation. 
Prepare a report with answers to these questions, and submit it as a PDF.\n\n### (10) Conclusion: Which Data Structure do you recommend?\n\nIn your report, have a conclusion section where you recommend one of these data structures for \"Spell-Checking\". Justify your selection.\n\n### (15) Measure Creation (Insertion) Speed.\n\nConsider the code in ``CheckSpelling`` first.\n\nWe create a number of data structures: ``HashSet``, ``TreeSet``, ``SortedStringListSet``, ``CharTrie``, and ``LLHash``. Figure out how many nanoseconds per insert are required. This will involve studying my ``System.nanoTime()`` timing code. Since nanoseconds are metric, they are 10^-9 seconds, or 1e-9 in Java notation (hence my division by 1e9 to convert to seconds).\n\n- How long does it take to fill each data structure? \n- Plot insertion time per element for each of these data structures.\n- Is there a timing difference between constructing ``HashSet`` and ``TreeSet`` with their input data or calling ``add`` in a for loop? If yes, use the words \"balancing\" and \"resizing\" to explain what's going on.\n\n### (20) Plot Query Speed\n\nNow consider the code in ``FakeDatasetExperiment``.\n\nI have devised a method ``timeLookup`` that calculates per-item query time for all the words in a structure. It also prints out the \"fraction found\" of the dataset. \n\n- Construct a dataset that has Strings that are both in and not in the dictionary.\n- For full credit, devise a method to inject some percentage of hits and misses. Create a line plot as the percentage of hits goes from (0.0 to 1.0) in steps of 0.1, where each line is a different data structure.\n\n### (10) Spell-check a Project Gutenberg book\n\nNow consider the code in ``BookExperiment``. 
\n\n- What is the ratio that's \"mis-spelled\"?\n- Are the query speeds the same over real-world data?\n- What are some of the words that are \"mis-spelled\"?\n- I gave you ``WordSplitter`` again.\n\n## Now implement some code!\n\nThis is the more traditional \"data-structures\" portion of the assignment.\n\n### (10) CharTrie.countNodes()\n\nStudy the ``CharTrie`` implementation and complete the countNodes method on the ``CharTrie.Node`` class. A recursive solution will be simplest.\n\nYou might want to create a unit test so that you count the nodes of a CharTrie that you can draw by hand. (So you know if you get it correct).\n\nFor clarity, ``countNodes`` should return the number of characters stored in the tree. This should be more than the number of words in the vocabulary, but less than the number of characters in the vocabulary (since a Trie shares prefixes).\n\n### (10) SortedStringListSet.binarySearch\n\nRight now, this data structure merely calls Java's built-in ``Collections.binarySearch``. Replace it with your own implementation.\n\nFor bonus points, find out why [most binary search implementations are incorrect](https://ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html). Try to fix it in your implementation.\n\nDouble-check that your query speeds have not changed with your implementation of binary search. If they have, why might that be?\n\n### (5) LLHash.countCollisions() and LLHash.countUsedBuckets()\n\nLLHash maintains an ArrayList of Buckets inside of itself. Use this list to compute how many collisions occurred and how many buckets are used. CheckSpelling.main uses them in print statements to compute the load-factor.\n\nPlay with the size of LLHash. 
Does this change your perception of its speed?\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjfiv%2Fcsc212spellchecking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjjfiv%2Fcsc212spellchecking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjfiv%2Fcsc212spellchecking/lists"}