{"id":15014107,"url":"https://github.com/norskregnesentral/skweak","last_synced_at":"2025-05-15T11:05:39.065Z","repository":{"id":37659715,"uuid":"348175166","full_name":"NorskRegnesentral/skweak","owner":"NorskRegnesentral","description":"skweak: A software toolkit for weak supervision applied to NLP tasks","archived":false,"fork":false,"pushed_at":"2024-09-02T12:48:05.000Z","size":29365,"stargazers_count":918,"open_issues_count":8,"forks_count":73,"subscribers_count":25,"default_branch":"main","last_synced_at":"2024-10-29T15:24:02.050Z","etag":null,"topics":["data-science","distant-supervision","natural-language-processing","nlp-library","nlp-machine-learning","python","spacy","training-data","weak-supervision"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NorskRegnesentral.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-16T01:28:46.000Z","updated_at":"2024-10-11T16:53:35.000Z","dependencies_parsed_at":"2022-07-14T09:22:15.721Z","dependency_job_id":"d65f82c8-7576-4c74-8942-5efd34cbaee6","html_url":"https://github.com/NorskRegnesentral/skweak","commit_stats":{"total_commits":160,"total_committers":12,"mean_commits":"13.333333333333334","dds":0.2875,"last_synced_commit":"2b6db15e8429dbda062b2cc9cc74e69f51a0a8b6"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NorskRegnesentral%2Fskweak","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/NorskRegnesentral%2Fskweak/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NorskRegnesentral%2Fskweak/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NorskRegnesentral%2Fskweak/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NorskRegnesentral","download_url":"https://codeload.github.com/NorskRegnesentral/skweak/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248441938,"owners_count":21104105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","distant-supervision","natural-language-processing","nlp-library","nlp-machine-learning","python","spacy","training-data","weak-supervision"],"created_at":"2024-09-24T19:45:12.021Z","updated_at":"2025-04-11T16:39:56.010Z","avatar_url":"https://github.com/NorskRegnesentral.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# skweak: Weak supervision for NLP\n\n[![GitHub license](https://img.shields.io/github/license/NorskRegnesentral/skweak)](https://github.com/NorskRegnesentral/skweak/blob/main/LICENSE.txt)\n[![GitHub stars](https://img.shields.io/github/stars/NorskRegnesentral/skweak)](https://github.com/NorskRegnesentral/skweak/stargazers)\n![PyPI](https://img.shields.io/pypi/v/skweak)\n![Testing](https://github.com/NorskRegnesentral/skweak/actions/workflows/testing.yml/badge.svg)\n\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n   \u003cimg alt=\"skweak logo\" 
src=\"https://raw.githubusercontent.com/NorskRegnesentral/skweak/main/data/skweak_logo.jpg\"/\u003e\n\u003c/p\u003e\u003cbr\u003e\n\n**Skweak is no longer actively maintained** (if you are interested in taking over the project, give us a shout). \n\nLabelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming. \n\n`skweak` (pronounced `/skwi:k/`) is a Python-based software toolkit that provides a concrete solution to this problem using weak supervision. `skweak` is built around a very simple idea: Instead of annotating texts by hand, we define a set of _labelling functions_ to automatically label our documents, and then _aggregate_ their results to obtain a labelled version of our corpus. \n\nThe labelling functions may take various forms, such as domain-specific heuristics (like pattern-matching rules), gazetteers (based on large dictionaries), machine learning models, or even annotations from crowd-workers. The aggregation is done using a statistical model that automatically estimates the relative accuracy (and confusions) of each labelling function by comparing their predictions with one another.\n\n`skweak` can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. The toolkit is also tightly integrated with [spaCy](http://www.spacy.io), which makes it easy to incorporate into existing NLP pipelines. 
Give it a try!\n\n\u003cbr\u003e\n\n**Full Paper**:\u003cbr\u003e\nPierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), \"[skweak: Weak Supervision Made Easy for NLP](https://aclanthology.org/2021.acl-demo.40/)\", *ACL 2021 (System demonstrations)*.\n\n**Documentation \u0026 API**: See the [Wiki](https://github.com/NorskRegnesentral/skweak/wiki) for details on how to use `skweak`. \n\n\u003cbr\u003e\n\n\nhttps://user-images.githubusercontent.com/11574012/114999146-e0995300-9ea1-11eb-8288-2bb54dc043e7.mp4\n\n\u003cbr\u003e\n\n\n\n## Dependencies\n\n- `spacy` \u003e= 3.0.0\n- `hmmlearn` \u003e= 0.3.0\n- `pandas` \u003e= 0.23\n- `numpy` \u003e= 1.18\n\nYou also need Python \u003e= 3.6. \n\n\n## Install\n\nThe easiest way to install `skweak` is through `pip`:\n\n```shell\npip install skweak\n```\n\nor if you want to install from the repo:\n\n```shell\npip install --user git+https://github.com/NorskRegnesentral/skweak\n```\n\nThe above installation only includes the core library (not the additional examples in `examples`).\n\nNote: some examples and tests may require trained spaCy pipelines. These can be downloaded with the following command (here for the pipeline `en_core_web_sm`):\n```shell\npython -m spacy download en_core_web_sm\n```\n\n\n## Basic Overview\n\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n   \u003cimg alt=\"Overview of skweak\" src=\"https://raw.githubusercontent.com/NorskRegnesentral/skweak/main/data/skweak_procedure.png\"/\u003e\n\u003c/p\u003e\u003cbr\u003e\n\nWeak supervision with `skweak` goes through the following steps:\n- **Start**: First, you need raw (unlabelled) data from your text domain. `skweak` is built on top of [spaCy](http://www.spacy.io), and operates on spaCy `Doc` objects, so you first need to convert your documents to `Doc` objects using spaCy.\n- **Step 1**: Then, we need to define a range of labelling functions that will take those documents and annotate spans with labels. 
Those labelling functions can come from heuristics, gazetteers, machine learning models, etc. See the [documentation](https://github.com/NorskRegnesentral/skweak/wiki) for more details. \n- **Step 2**: Once the labelling functions have been applied to your corpus, you need to _aggregate_ their results in order to obtain a single annotation layer (instead of the multiple, possibly conflicting annotations from the labelling functions). This is done in `skweak` using a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function. \n- **Step 3**: Finally, based on those aggregated labels, we can train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to train any kind of machine learning model. You are free to use whichever model/framework you prefer. \n\n## Quickstart\n\nHere is a minimal example with three labelling functions (LFs) applied to a single document:\n\n```python\nimport spacy, re\nfrom skweak import heuristics, gazetteers, generative, utils\n\n# LF 1: heuristic to detect occurrences of MONEY entities\ndef money_detector(doc):\n   for tok in doc[1:]:\n      if tok.text[0].isdigit() and tok.nbor(-1).is_currency:\n          yield tok.i-1, tok.i+1, \"MONEY\"\nlf1 = heuristics.FunctionAnnotator(\"money\", money_detector)\n\n# LF 2: detection of years with a regex\nlf2 = heuristics.TokenConstraintAnnotator(\"years\", lambda tok: re.match(r\"(19|20)\\d{2}$\", \n                                                  tok.text), \"DATE\")\n\n# LF 3: a gazetteer with a few names\nNAMES = [(\"Barack\", \"Obama\"), (\"Donald\", \"Trump\"), (\"Joe\", \"Biden\")]\ntrie = gazetteers.Trie(NAMES)\nlf3 = gazetteers.GazetteerAnnotator(\"presidents\", {\"PERSON\":trie})\n\n# We create a corpus (here with a single text)\nnlp = spacy.load(\"en_core_web_sm\")\ndoc = nlp(\"Donald Trump paid $750 in federal income 
taxes in 2016\")\n\n# apply the labelling functions\ndoc = lf3(lf2(lf1(doc)))\n\n# create and fit the HMM aggregation model\nhmm = generative.HMM(\"hmm\", [\"PERSON\", \"DATE\", \"MONEY\"])\nhmm.fit([doc]*10)\n\n# once fitted, we simply apply the model to aggregate all functions\ndoc = hmm(doc)\n\n# we can then visualise the final result (in Jupyter)\nutils.display_entities(doc, \"hmm\")\n```\n\nObviously, to get the most out of `skweak`, you will need more than three labelling functions. And, most importantly, you will need a larger corpus including as many documents as possible from your domain, so that the model can derive good estimates of the relative accuracy of each labelling function. \n\n## Documentation\n\nSee the [Wiki](https://github.com/NorskRegnesentral/skweak/wiki). \n\n\n## License\n\n`skweak` is released under an MIT License. \n\nThe MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve\nthe copyright and license notices (see file [License](https://github.com/NorskRegnesentral/skweak/blob/main/LICENSE.txt)). Licensed works, modifications, and larger works may be distributed under different terms and without source code.\n\n## Citation\n\nSee our paper describing the framework: \n\nPierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), \"[skweak: Weak Supervision Made Easy for NLP](https://aclanthology.org/2021.acl-demo.40/)\", *ACL 2021 (System demonstrations)*. 
\n\n```bibtex\n@inproceedings{lison-etal-2021-skweak,\n    title = \"skweak: Weak Supervision Made Easy for {NLP}\",\n    author = \"Lison, Pierre  and\n      Barnes, Jeremy  and\n      Hubin, Aliaksandr\",\n    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.acl-demo.40\",\n    doi = \"10.18653/v1/2021.acl-demo.40\",\n    pages = \"337--346\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnorskregnesentral%2Fskweak","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnorskregnesentral%2Fskweak","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnorskregnesentral%2Fskweak/lists"}