{"id":17166373,"url":"https://github.com/andreaferretti/lda","last_synced_at":"2025-10-12T19:35:41.938Z","repository":{"id":66316534,"uuid":"133983446","full_name":"andreaferretti/lda","owner":"andreaferretti","description":"Latent Dirichlet Allocation","archived":false,"fork":false,"pushed_at":"2018-05-18T17:17:20.000Z","size":515,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-10-12T19:35:37.322Z","etag":null,"topics":["lda","nim","topic-modeling"],"latest_commit_sha":null,"homepage":null,"language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andreaferretti.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-18T17:15:45.000Z","updated_at":"2022-07-06T14:19:58.000Z","dependencies_parsed_at":"2023-02-24T17:00:17.402Z","dependency_job_id":null,"html_url":"https://github.com/andreaferretti/lda","commit_stats":{"total_commits":18,"total_committers":1,"mean_commits":18.0,"dds":0.0,"last_synced_commit":"eee2501656532caad0a12290dc70022fadd98bca"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/andreaferretti/lda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Flda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Flda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Flda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/an
dreaferretti%2Flda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andreaferretti","download_url":"https://codeload.github.com/andreaferretti/lda/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreaferretti%2Flda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279012670,"owners_count":26085159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lda","nim","topic-modeling"],"created_at":"2024-10-14T23:05:25.269Z","updated_at":"2025-10-12T19:35:41.876Z","avatar_url":"https://github.com/andreaferretti.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"LDA\n===\n\nThis library implements a form of text clustering and topic modeling called\n[Latent Dirichlet Allocation](http://ethen8181.github.io/machine-learning/clustering_old/topic_model/LDA.html).\n\nIn order to use it, you have to have a seq of documents, each one being itself\na seq of strings. 
These documents can then be indexed through the use of a\nvocabulary, as follows:\n\n```nim\nimport sequtils, strutils\nimport lda\n\nlet\n  rawDocs = @[\n      \"eat turkey on turkey day holiday\",\n      \"i like to eat cake on holiday\",\n      \"turkey trot race on thanksgiving holiday\",\n      \"snail race the turtle\",\n      \"time travel space race\",\n      \"movie on thanksgiving\",\n      \"movie at air and space museum is cool movie\",\n      \"aspiring movie star\"\n    ]\n  docWords = rawDocs.mapIt(it.split(' '))\n  vocab = makeVocab(docWords)\n  docs = makeDocs(docWords, vocab)\n```\n\nOnce you have the vocabulary `vocab`, which is just the seq of all words appearing\nacross all documents, and the preprocessed documents, which are a nested\nsequence of integer indices, you can train the model through Collapsed Gibbs\nSampling using\n\n```nim\nlet ldaResult = lda(docs, vocabLen = vocab.len, K = 3, iterations = 1000)\n```\n\nHere `K` denotes the number of desired topics and `iterations` the number of\nrounds in the training phase. The result contains a document/topic matrix\nand a word/topic matrix. These can be used to find the most descriptive\nwords for a topic:\n\n```nim\nfor t in 0 ..\u003c 3:\n  echo \"TOPIC \", t\n  echo bestWords(ldaResult, vocab, t)\n```\n\nor to find the most relevant topic for a document:\n\n```nim\nfor d in 0 ..\u003c docs.len:\n  echo \"\u003e \", rawDocs[d]\n  echo \"topic: \", ldaResult.bestTopic(d)\n```\n\nor even to generate text with the same topic distribution as a given document:\n\n```nim\necho sample(ldaResult, vocab, doc = 6)\n```\n\n## TODO\n\n* parallel training\n* variational Bayes sampling\n* modified model to account for stop words","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreaferretti%2Flda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandreaferretti%2Flda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreaferretti%2Flda/lists"}