{"id":38655843,"url":"https://github.com/haozhg/lmd","last_synced_at":"2026-01-17T09:27:16.546Z","repository":{"id":61686369,"uuid":"547907350","full_name":"haozhg/lmd","owner":"haozhg","description":"Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models","archived":false,"fork":false,"pushed_at":"2022-12-22T15:00:01.000Z","size":1910,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-26T02:48:23.551Z","etag":null,"topics":["bert","deep-learning","language-models","multilingual-bert","natural-language-processing","nlp","pretrained-models","python","pytorch","roberta","transformers","xlm-roberta"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/haozhg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-10-08T14:38:11.000Z","updated_at":"2025-03-23T18:26:02.000Z","dependencies_parsed_at":"2023-01-30T12:01:29.307Z","dependency_job_id":null,"html_url":"https://github.com/haozhg/lmd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/haozhg/lmd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haozhg%2Flmd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haozhg%2Flmd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haozhg%2Flmd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haozhg%2Flmd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/haozhg","download_url":"https://codeload.github.com/haozhg/lmd/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haozhg%2Flmd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28505553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","deep-learning","language-models","multilingual-bert","natural-language-processing","nlp","pretrained-models","python","pytorch","roberta","transformers","xlm-roberta"],"created_at":"2026-01-17T09:27:16.410Z","updated_at":"2026-01-17T09:27:16.507Z","avatar_url":"https://github.com/haozhg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lmd\nCode for paper titled \"Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models\" (accepted to EMNLP 2022). 
The arXiv version is here: https://arxiv.org/abs/2210.10289\n\n## Install\nCreate a virtual environment if needed\n```\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\nInstall with pip (https://pypi.org/project/nlp.lmd/)\n```\npip install nlp.lmd\n```\n\nInstall from source\n```\ngit clone git@github.com:haozhg/lmd.git\ncd lmd\npip install -e .\n```\n\nTo use the lmd CLI, run `lmd --help` or `python -m lmd.cli --help`\n\n```\n$ lmd --help\nusage: Language Model Decomposition [-h] [--target TARGET] [--basis BASIS]\n                                    [--tokenizer-name TOKENIZER_NAME]\n                                    [--max-seq-length MAX_SEQ_LENGTH]\n                                    [--batch-size BATCH_SIZE]\n                                    [--dataset-name DATASET_NAME]\n                                    [--dataset-config-name DATASET_CONFIG_NAME]\n                                    [--val-split-percentage VAL_SPLIT_PERCENTAGE]\n                                    [--test-split-percentage TEST_SPLIT_PERCENTAGE]\n                                    [--max-train-samples MAX_TRAIN_SAMPLES]\n                                    [--max-val-samples MAX_VAL_SAMPLES]\n                                    [--max-test-samples MAX_TEST_SAMPLES]\n                                    [--preprocessing-num-workers PREPROCESSING_NUM_WORKERS]\n                                    [--overwrite_cache OVERWRITE_CACHE]\n                                    [--preprocess-dir PREPROCESS_DIR]\n                                    [--embedding-dir EMBEDDING_DIR]\n                                    [--results-dir RESULTS_DIR]\n                                    [--models-dir MODELS_DIR] [--alpha ALPHA]\n                                    [--log-level LOG_LEVEL]\n                                    [--try-models TRY_MODELS]\n                                    [--pre-select-multiplier PRE_SELECT_MULTIPLIER]\n                                    [--seed SEED]\n```\n\n## Results\nTo 
reproduce the results in Appendix B of the paper, run `bash scripts/run.sh`. The results are also stored in [`results/128k`](./results/128k/)\n\n## Citation\nIf you find this paper/code useful, please cite us:\n```\n@misc{https://doi.org/10.48550/arxiv.2210.10289,\n  doi = {10.48550/ARXIV.2210.10289},\n  url = {https://arxiv.org/abs/2210.10289},\n  author = {Zhang, Hao},\n  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50 (Primary) 68T30, 68T07 (Secondary)},\n  title = {Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models},\n  publisher = {arXiv},\n  year = {2022},\n  copyright = {arXiv.org perpetual, non-exclusive license}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaozhg%2Flmd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhaozhg%2Flmd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaozhg%2Flmd/lists"}