{"id":28095169,"url":"https://github.com/minishlab/tokenlearn","last_synced_at":"2025-05-13T15:20:46.273Z","repository":{"id":260293560,"uuid":"871835352","full_name":"MinishLab/tokenlearn","owner":"MinishLab","description":"Pre-train Static Word Embeddings","archived":false,"fork":false,"pushed_at":"2025-04-12T13:05:16.000Z","size":50,"stargazers_count":60,"open_issues_count":5,"forks_count":5,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-11T22:40:39.359Z","etag":null,"topics":["ai","embeddings","machine-learning","model2vec","nlp","python","torch"],"latest_commit_sha":null,"homepage":"https://minishlab.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MinishLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-13T04:28:12.000Z","updated_at":"2025-05-11T15:03:16.000Z","dependencies_parsed_at":"2024-10-30T15:47:28.103Z","dependency_job_id":"bf11ae22-1146-4aa1-80d1-f340c8311d7f","html_url":"https://github.com/MinishLab/tokenlearn","commit_stats":null,"previous_names":["minishlab/tokenlearn"],"tags_count":2,"template":false,"template_full_name":"MinishLab/watertemplate","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MinishLab%2Ftokenlearn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MinishLab%2Ftokenlearn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MinishLab%2Ftokenlearn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MinishLab%2Ftokenlearn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MinishLab","download_url":"https://codeload.github.com/MinishLab/tokenlearn/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253969297,"owners_count":21992265,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","embeddings","machine-learning","model2vec","nlp","python","torch"],"created_at":"2025-05-13T15:18:44.262Z","updated_at":"2025-05-13T15:20:46.259Z","avatar_url":"https://github.com/MinishLab.png","language":"Python","readme":"# Tokenlearn\nTokenlearn is a method to pre-train [Model2Vec](https://github.com/MinishLab/model2vec).\n\nThe method is described in detail in our [Tokenlearn blogpost](https://minishlab.github.io/tokenlearn_blogpost/).\n\n## Quickstart\n\nInstall the package with:\n\n```bash\npip install tokenlearn\n```\n\nThe basic usage of Tokenlearn consists of two CLI scripts: `featurize` and `train`.\n\nTokenlearn is trained using means from a sentence transformer. 
To create means, the `tokenlearn-featurize` CLI can be used:

```bash
python3 -m tokenlearn.featurize --model-name "baai/bge-base-en-v1.5" --output-dir "data/c4_features"
```

NOTE: by default, featurization runs on the C4 dataset. If you want to use a different dataset, the dataset arguments can be passed explicitly:

```bash
python3 -m tokenlearn.featurize \
    --model-name "baai/bge-base-en-v1.5" \
    --output-dir "data/c4_features" \
    --dataset-path "allenai/c4" \
    --dataset-name "en" \
    --dataset-split "train"
```

To train a model on the featurized data, the `tokenlearn-train` CLI can be used:

```bash
python3 -m tokenlearn.train --model-name "baai/bge-base-en-v1.5" --data-path "data/c4_features" --save-path "<path-to-save-model>"
```

Training will create two models:
- The base trained model.
- The base model with weighting applied. This is the model that should be used for downstream tasks.

NOTE: the code assumes that the padding token ID in your tokenizer is 0. If this is not the case, you will need to modify the code.

### Evaluation

To evaluate a model, first install the optional evaluation dependencies:

```bash
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
```

Then run the evaluation:

```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load a trained model
model_name = "tokenlearn_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder="results")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)

# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
```

## License

MIT