{"id":22895901,"url":"https://github.com/daniel-lima-lopez/n-gram-example","last_synced_at":"2026-04-27T18:03:19.345Z","repository":{"id":254949256,"uuid":"848036033","full_name":"daniel-lima-lopez/N-Gram-Example","owner":"daniel-lima-lopez","description":"Implementation of a BiGram-based language system in Python","archived":false,"fork":false,"pushed_at":"2024-11-28T22:32:38.000Z","size":5623,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-24T23:37:07.493Z","etag":null,"topics":["ngram","ngram-language-model","ngrams","nlp","nlp-machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daniel-lima-lopez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-27T02:30:15.000Z","updated_at":"2024-11-28T22:32:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"4b5fe829-074d-4ae2-a149-89c2baa01d19","html_url":"https://github.com/daniel-lima-lopez/N-Gram-Example","commit_stats":null,"previous_names":["daniel-lima-lopez/n-gram-example"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/daniel-lima-lopez/N-Gram-Example","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FN-Gram-Example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FN-Gram-Example/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FN-Gram-
Example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FN-Gram-Example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daniel-lima-lopez","download_url":"https://codeload.github.com/daniel-lima-lopez/N-Gram-Example/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daniel-lima-lopez%2FN-Gram-Example/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32348058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T17:12:42.749Z","status":"ssl_error","status_checked_at":"2026-04-27T17:12:41.658Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ngram","ngram-language-model","ngrams","nlp","nlp-machine-learning","python"],"created_at":"2024-12-13T23:32:39.262Z","updated_at":"2026-04-27T18:03:19.302Z","avatar_url":"https://github.com/daniel-lima-lopez.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# N-Gram-Example\n[This repository](https://github.com/daniel-lima-lopez/N-Gram-Example) shows the implementation of a bigram model built from the vocabulary of dialogues between characters in Shakespeare's works, taken from the [Shakespeare 
plays](https://www.kaggle.com/datasets/kingburrito666/shakespeare-plays) dataset.\n\n## Installation\nClone this repository:\n```bash\ngit clone git@github.com:daniel-lima-lopez/N-Gram-Example.git\n```\nThen move into the repository directory:\n```bash\ncd N-Gram-Example\n```\n\n## Method description\nThe bigram is implemented by counting every occurrence of each word pair in the text corpus. In this way, the system can identify the most frequent word pairs and therefore analyze the English language as a system of word pairs.\n\nThe [counts.ipynb](counts.ipynb) notebook shows the procedure used to analyze the text corpus and create the [word_id.csv](word_id.csv) and [CMatrix.csv](CMatrix.csv) files. The first file lists the unique words found in the corpus, while the second contains the count of each word pair found in the text. Note that, given the large amount of data, a full count matrix would contain mostly zeros (a sparse matrix), so only the pairs that actually occur are written.\n\nThe bigram is implemented in [BiGram.py](BiGram.py), which takes as input the files [word_id.csv](word_id.csv) and [CMatrix.csv](CMatrix.csv). When instantiating the class, the parameters `k` and `add` can be chosen: `k` is a factor that multiplies every element of the counting matrix, and `add` is a constant added to the result of the multiplication. This moves some of the counting mass to word pairs that are not in the corpus, which expands the set of word combinations the system can produce.\n\n## Example\nThe following example can be executed in the notebook [example.ipynb](example.ipynb).\n\nWe can instantiate the `BiGram` class as follows:\n```python\nfrom BiGram import BiGram\ntest = BiGram(k=5, add=1)\n```\nWe can then use the `next_word()` method to generate the next word given the previous word. Below is an example of 10 sentences of up to 5 words generated by the bigram. 
Note that each sentence starts from the start-of-sentence marker `s1`, and the (i+1)-th word is generated from the i-th word:\n```python\nfor _ in range(10):\n    ws = ['s1']  # start-of-sentence marker\n    for _ in range(5):\n        nw = test.next_word(ws[-1])\n        if nw == 'e1':  # end-of-sentence marker\n            break\n        ws.append(nw)\n    print(*ws[1:])\n```\nwhich results in:\n```\n- the grace insurrection module countenances\n- and for they prescriptions deiphobus\n- attend lettersdamnd eyases censureo smarting\n- humbling witch lade scions dearbeloved\n- incarnal cricket tellus exchequers overview\n- visit stubble each heros nursery\n- boarish lucentio luna godfather dire\n- you begin offend glorious sundaycitizens\n- come infixing dareful cuckooflowers minded\n- and shrilltongued everpardon blue uttering\n```\n\nNote that the `next_word()` method selects the next word by random sampling based on the counting matrix: every possible following word is a candidate, and combinations that are more frequent in the corpus are chosen with higher probability.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaniel-lima-lopez%2Fn-gram-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaniel-lima-lopez%2Fn-gram-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaniel-lima-lopez%2Fn-gram-example/lists"}