{"id":28459237,"url":"https://github.com/bicycleman15/skim","last_synced_at":"2025-10-10T00:02:50.604Z","repository":{"id":253854991,"uuid":"843319805","full_name":"bicycleman15/skim","owner":"bicycleman15","description":"[KDD 2025] Code for the paper \"On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification\"","archived":false,"fork":false,"pushed_at":"2024-10-08T20:27:20.000Z","size":45,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-18T01:37:05.275Z","etag":null,"topics":["extreme-classification","information-retrieval","missing-labels","small-language-models"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2408.09585","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bicycleman15.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-08-16T08:56:30.000Z","updated_at":"2025-08-04T23:57:08.000Z","dependencies_parsed_at":"2025-08-18T01:36:34.356Z","dependency_job_id":"26c48cfd-de7e-4d1f-a463-58177b11abf2","html_url":"https://github.com/bicycleman15/skim","commit_stats":null,"previous_names":["bicycleman15/skim"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bicycleman15/skim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bicycleman15%2Fskim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bicycleman15%2Fskim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/bicycleman15%2Fskim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bicycleman15%2Fskim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bicycleman15","download_url":"https://codeload.github.com/bicycleman15/skim/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bicycleman15%2Fskim/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279002346,"owners_count":26083351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extreme-classification","information-retrieval","missing-labels","small-language-models"],"created_at":"2025-06-07T00:42:19.278Z","updated_at":"2025-10-10T00:02:50.577Z","avatar_url":"https://github.com/bicycleman15.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SKIM: Scalable Knowledge Infusion for Missing Labels\n\nThe repo contains the official code for the paper [On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification](https://arxiv.org/abs/2408.09585).\n\n## Artifacts\nWe provide the following artifacts for future research and reproducibility:\n\n1. SKIM augmented datasets that were used to train the XC models in the paper. 
One may directly use the provided `.npz` files to train their favourite XC models/architectures. Available at: `artifacts/skim_augmented_datasets/\u003cdataset\u003e/trn_X_Y_skim.npz`.\n\n2. Additionally, we provide all the required prompts, GPT4 responses, the `litgpt`-converted training dataset for finetuning SLMs, the finetuned SLMs, and the large-scale generated synthetic queries that can be used to obtain/reproduce the above `.npz` files. We provide detailed instructions and code on how one can obtain a SKIM augmented dataset for their own XC datasets. All of these can be found in the `artifacts/` directory.\n\n## Code\n\nIn order to obtain SKIM augmented datasets for your XC datasets, you can follow the steps outlined below.\n\nAt a high level, these are (i) obtaining an SLM that can generate synthetic queries (this requires distilling this specific task from a much larger LLM, e.g. GPT4, using a few finetuning examples), (ii) generating large-scale synthetic queries using this finetuned SLM, and (iii) mapping these synthetic queries to the train set queries using a pretrained XC encoder (e.g. NGAME in our case) and obtaining the final augmented dataset. See below for more details:\n\nStep 0: To perform task-specific distillation, refer to the directory `skim/task-specific-distillation`. (Note that we call this step 0 since it is the first thing one does when using SKIM. However, the paper does not discuss step 0 explicitly.)\n\nStep 1: To perform large-scale synthetic query generation, refer to the directory `skim/step-1`.\n\nStep 2: To map synthetic queries to train set queries, refer to the directory `skim/step-2`.\n\nYou should now have the SKIM augmented dataset in the form of `trn_X_Y_skim.npz`. 
Now train your favourite XC model/architecture on this SKIM augmented dataset.\n\n## Requirements\n\nUse the file `requirements.txt` to install the dependencies.\n\n## Acknowledgements\n\nWe heavily rely on the following to train the XC models in our paper:\n1. DEXML: https://github.com/nilesh2797/DEXML\n2. Renee: https://github.com/microsoft/renee\n\n## Issues\n\nIf you have any questions, feel free to open an issue on GitHub or contact the authors (Jatin Prakash (jatin.prakash@nyu.edu) or Anirudh Buvanesh (anirudh.buvanesh@mila.quebec)).\n\n## Reference\n\nIf you find this repo useful, please consider citing:\n\n```bibtex\n@article{prakash2024necessity,\n  title={On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification},\n  author={Prakash, Jatin and Buvanesh, Anirudh and Santra, Bishal and Saini, Deepak and Yadav, Sachin and Jiao, Jian and Prabhu, Yashoteja and Sharma, Amit and Varma, Manik},\n  journal={arXiv preprint arXiv:2408.09585},\n  year={2024}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbicycleman15%2Fskim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbicycleman15%2Fskim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbicycleman15%2Fskim/lists"}