{"id":13578171,"url":"https://github.com/microsoft/ASTRA","last_synced_at":"2025-04-05T16:31:58.555Z","repository":{"id":39723186,"uuid":"357023796","full_name":"microsoft/ASTRA","owner":"microsoft","description":"Self-training with Weak Supervision (NAACL 2021)","archived":false,"fork":false,"pushed_at":"2023-07-24T22:35:54.000Z","size":267,"stargazers_count":158,"open_issues_count":6,"forks_count":22,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-03-30T21:37:37.618Z","etag":null,"topics":["machine-learning","nlp","weak-supervision","weakly-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null}},"created_at":"2021-04-12T01:39:14.000Z","updated_at":"2025-03-23T11:14:10.000Z","dependencies_parsed_at":"2022-09-20T09:12:30.353Z","dependency_job_id":"2ecfd205-5544-4982-8fe8-d6e6d759828c","html_url":"https://github.com/microsoft/ASTRA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FASTRA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FASTRA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FASTRA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FASTRA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/micros
oft/ASTRA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247366376,"owners_count":20927499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","nlp","weak-supervision","weakly-supervised-learning"],"created_at":"2024-08-01T15:01:28.110Z","updated_at":"2025-04-05T16:31:58.091Z","avatar_url":"https://github.com/microsoft.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Self-Training with Weak Supervision\n\nThis repo holds the code for our weak supervision framework, ASTRA, described in our NAACL 2021 paper: \"[Self-Training with Weak Supervision](https://www.microsoft.com/en-us/research/publication/leaving-no-valuable-knowledge-behind-weak-supervision-with-self-training-and-domain-specific-rules/)\" \n\n\n## Overview of ASTRA\n\nASTRA is a weak supervision framework for training deep neural networks by automatically generating weakly-labeled data. Our framework can be used for tasks where it is expensive to manually collect large-scale labeled training data. \n\nASTRA leverages domain-specific **rules**, a large amount of **unlabeled data**, and a small amount of **labeled data**  through a **teacher-student** architecture:\n\n![alt text](https://github.com/microsoft/ASTRA/blob/main/astra.jpg?raw=true)\n\nMain components:\n* **Weak Rules**: domain-specific rules, expressed as Python labeling functions. 
Weak supervision usually considers multiple rules that rely on heuristics (e.g., regular expressions) for annotating text instances with weak labels.\n*  **Student**: a base model (e.g., a BERT-based classifier) that provides pseudo-labels as in standard self-training. In contrast to heuristic rules that cover a subset of the instances, the student can predict pseudo-labels for all instances.\n* **RAN Teacher**: our Rule Attention Teacher Network that aggregates the predictions of multiple weak sources (rules and student) with instance-specific weights to compute a single pseudo-label for each instance. \n\nThe following table reports classification results over 6 benchmark datasets averaged over multiple runs.\n\nMethod | TREC | SMS | YouTube | CENSUS | MIT-R | Spouse \n--- | --- | --- | --- |--- |--- |--- \nMajority Voting | 60.9 | 48.4 | 82.2 | 80.1 | 40.9 | 44.2\nSnorkel | 65.3 | 94.7 | 93.5 | 79.1 | 75.6 | 49.2\nClassic Self-training | 71.1 | 95.1 | 92.5 | 78.6 | 72.3 | 51.4\n**ASTRA** | **80.3** | **95.3** | **95.3** | **83.1** | **76.1** | **62.3**\n\nOur [NAACL'21 paper](https://www.microsoft.com/en-us/research/publication/leaving-no-valuable-knowledge-behind-weak-supervision-with-self-training-and-domain-specific-rules/) describes our ASTRA framework and more experimental results in detail. 
\n\n## Installation\n\nFirst, create a conda environment running Python 3.6: \n```\nconda create --name astra python=3.6\nconda activate astra\n```\n\nThen, install the required dependencies:\n```\npip install -r requirements.txt\n```\n\n## Download Data\nFor reproducibility, you can directly download our pre-processed data files (split into multiple unlabeled/train/dev sets): \n\n```\ncd data\nbash prepare_data.sh\n```\n\nThe original datasets are available [here](https://github.com/awasthiabhijeet/Learning-From-Rules).\n\n\n## Running ASTRA \n\n\nTo replicate our NAACL'21 experiments, you can directly run our bash script:\n```\ncd scripts\nbash run_experiments.sh\n```\nThe above script will run ASTRA and report results under a new \"experiments\" folder. \n\nAlternatively, you can run ASTRA with custom arguments: \n```\ncd astra\npython main.py --dataset \u003cDATASET\u003e --student_name \u003cSTUDENT\u003e --teacher_name \u003cTEACHER\u003e\n```\n\nSupported STUDENT models: \n1. **logreg**: Bag-of-words Logistic Regression classifier\n2. **elmo**: ELMo-based classifier\n3. **bert**: BERT-based classifier\n\nSupported TEACHER models: \n1. **ran**: our Rule Attention Network (RAN)\n\nWe will soon add instructions for supporting custom datasets as well as student and teacher components. \n\n\n\n\n## Citation \n\n```\n@InProceedings{karamanolakis2021self-training,\nauthor = {Karamanolakis, Giannis and Mukherjee, Subhabrata (Subho) and Zheng, Guoqing and Awadallah, Ahmed H.},\ntitle = {Self-training with Weak Supervision},\nbooktitle = {NAACL 2021},\nyear = {2021},\nmonth = {May},\npublisher = {NAACL 2021},\nurl = {https://www.microsoft.com/en-us/research/publication/self-training-weak-supervision-astra/},\n}\n```\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  
Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos is subject to those third parties' policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FASTRA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2FASTRA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2FASTRA/lists"}