{"id":13564226,"url":"https://github.com/adjidieng/ETM","last_synced_at":"2025-04-03T21:30:33.348Z","repository":{"id":35673532,"uuid":"196301093","full_name":"adjidieng/ETM","owner":"adjidieng","description":"Topic Modeling in Embedding Spaces","archived":false,"fork":false,"pushed_at":"2023-10-03T22:32:12.000Z","size":201140,"stargazers_count":541,"open_issues_count":32,"forks_count":127,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-11-04T17:47:16.598Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adjidieng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-07-11T01:52:09.000Z","updated_at":"2024-11-02T13:18:06.000Z","dependencies_parsed_at":"2023-01-16T02:43:57.115Z","dependency_job_id":"dd2cb4f6-2135-472a-9b91-c7f170e2f123","html_url":"https://github.com/adjidieng/ETM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjidieng%2FETM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjidieng%2FETM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjidieng%2FETM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adjidieng%2FETM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adjidieng","download_url":"https://codeload.github.com/adjidieng/ETM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_c
ount":247082851,"owners_count":20880730,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:28.319Z","updated_at":"2025-04-03T21:30:32.953Z","avatar_url":"https://github.com/adjidieng.png","language":"Python","funding_links":[],"categories":["Python","Models"],"sub_categories":["Embedding based Topic Models"],"readme":"# ETM\n\nThis is code that accompanies the paper titled \"Topic Modeling in Embedding Spaces\" by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. (Arxiv link: https://arxiv.org/abs/1907.04907)\n\nETM defines words and topics in the same embedding space. The likelihood of a word under ETM is a Categorical whose natural parameter is given by the dot product between the word embedding and its assigned topic's embedding. 
ETM is a document model that learns interpretable topics and word embeddings and is robust to large vocabularies that include rare words and stop words.\n\n## Dependencies\nThe major project dependencies are:\n\n+ Python 3.6.7\n+ PyTorch 1.1.0\n\nInside or outside a virtual environment, you can install the other project requirements with:\n\n`pip install -r requirement.txt`\n\n## Datasets\n\nAll the datasets are pre-processed and can be found below:\n\n+ https://bitbucket.org/franrruiz/data_nyt_largev_4/src/master/\n+ https://bitbucket.org/franrruiz/data_nyt_largev_5/src/master/\n+ https://bitbucket.org/franrruiz/data_nyt_largev_6/src/master/\n+ https://bitbucket.org/franrruiz/data_nyt_largev_7/src/master/\n+ https://bitbucket.org/franrruiz/data_stopwords_largev_2/src/master/ (this one contains stop words and was used to showcase the robustness of ETM to stop words)\n+ https://bitbucket.org/franrruiz/data_20ng_largev/src/master/\n\nAll the scripts to pre-process a given dataset for ETM can be found in the `scripts` folder. The script for 20NewsGroup is self-contained, as it uses scikit-learn. To run ETM on your own dataset, follow the New York Times script, `data_nyt.py`, which is given as an example.\n\n## To Run\n\nTo learn interpretable embeddings and topics using ETM on the 20NewsGroup dataset, run\n```\npython main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --epochs 1000\n```\n\nTo evaluate perplexity on document completion, topic coherence, and topic diversity, and to visualize the topics/embeddings, run\n```\npython main.py --mode eval --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --tc 1 --td 1 --load_from CKPT_PATH\n```\n\nTo learn interpretable topics using ETM with pre-fitted word embeddings (called Labelled-ETM in the paper) on the 20NewsGroup dataset:\n\n+ first fit the word embeddings. 
For example, to use a simple skip-gram model, run\n```\npython skipgram.py --data_file PATH_TO_DATA --emb_file PATH_TO_EMBEDDINGS --dim_rho 300 --iters 50 --window_size 4\n```\n\n+ then run the following:\n```\npython main.py --mode train --dataset 20ng --data_path data/20ng --emb_path PATH_TO_EMBEDDINGS --num_topics 50 --train_embeddings 0 --epochs 1000\n```\n\n## Citation\n\n```\n@article{dieng2019topic,\n  title={Topic modeling in embedding spaces},\n  author={Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},\n  journal={arXiv preprint arXiv:1907.04907},\n  year={2019}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjidieng%2FETM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadjidieng%2FETM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadjidieng%2FETM/lists"}