{"id":13467611,"url":"https://github.com/dataqa/nlp-labelling","last_synced_at":"2026-02-18T10:08:50.767Z","repository":{"id":39407406,"uuid":"403550076","full_name":"dataqa/nlp-labelling","owner":"dataqa","description":"Labelling platform for text using weak supervision.","archived":false,"fork":false,"pushed_at":"2022-06-24T08:33:00.000Z","size":5215,"stargazers_count":260,"open_issues_count":3,"forks_count":18,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-10-29T21:55:47.665Z","etag":null,"topics":["annotation-tool","data-labeling","data-science","learning-with-limited-labeled-data","learning-with-noisy-labels","natural-language-processing","ner","nlp","nlp-machine-learning","pseudo-labeling","search-engine","text-annotation-tool","text-classification","text-mining","weak-supervision"],"latest_commit_sha":null,"homepage":"https://dataqa.ai/docs/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataqa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-06T08:49:10.000Z","updated_at":"2024-10-05T20:06:05.000Z","dependencies_parsed_at":"2022-07-12T19:30:41.891Z","dependency_job_id":null,"html_url":"https://github.com/dataqa/nlp-labelling","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataqa%2Fnlp-labelling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataqa%2Fnlp-labelling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataqa%2Fnlp-labelling/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataqa%2Fnlp-labelling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataqa","download_url":"https://codeload.github.com/dataqa/nlp-labelling/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245584458,"owners_count":20639550,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation-tool","data-labeling","data-science","learning-with-limited-labeled-data","learning-with-noisy-labels","natural-language-processing","ner","nlp","nlp-machine-learning","pseudo-labeling","search-engine","text-annotation-tool","text-classification","text-mining","weak-supervision"],"created_at":"2024-07-31T15:00:58.447Z","updated_at":"2026-02-18T10:08:50.727Z","avatar_url":"https://github.com/dataqa.png","language":"JavaScript","readme":"\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"dataqa-ui/public/images/protractor.png?raw=true\" width=\"200\" height=\"200\"/\u003e\n    \u003ch1 align=\"center\"\u003eDataQA\u003c/h1\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/pyversions/dataqa\"/\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/dataqa/dataqa?color=success\"/\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/dataqa.svg?label=PyPI\u0026logo=PyPI\u0026logoColor=white\u0026color=success\"/\u003e\n    \u003cimg 
src=\"https://github.com/dataqa/dataqa/actions/workflows/github-actions.yml/badge.svg?\u0026color=success\"/\u003e\n\u003c/div\u003e\n\n\u0026nbsp;\n\nDataQA is a tool to label and explore unstructured documents. It uses rule-based weak supervision to significantly reduce the number of labels needed compared to other tools. Here are a few things you can do with it:\n- Search your documents using Elasticsearch's powerful text search engine,\n- Classify your documents,\n- Extract entities from your own data or from Wikipedia,\n- Link mentions of entities to your own ontology.\n\n... and it's all available with a simple pip command!\n\n\u0026nbsp;\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"github_images/merged.gif\" width=\"800\" align=\"center\"/\u003e\n\u003c/div\u003e\n\n\u0026nbsp;\n\n* [Installation](#installation)\n* [Usage](#usage)\n* [What is weak supervision and why does it work?](#what-is-weak-supervision-and-why-does-it-work)\n* [Tutorials](#documentation)\n* [Contact](#contact)\n\n# Installation\n\n## Pre-requisites:\n\n* Python 3.6, 3.7, 3.8 or 3.9\n* (Recommended) Start a new Python virtual environment\n* Update pip: `pip install -U pip`\n* Tested on backend: MacOSX, Ubuntu. Tested on browser: Chrome, Firefox.\n\n## Installing from PyPI\n\n* `pip install dataqa`\n\n## To run with Docker\n\n* The first time it is run: `docker run -d -p 5000:5000 dataqa/dataqa`\n* In order to keep the data between runs, use `docker start [container-id]` and `docker stop [container-id]`\n\n# Usage\n\nIn the terminal, type `dataqa run`. Allow a few minutes for everything to start up.\n\nThis runs a server locally and opens a browser window at port `5000`. If the application does not open the browser automatically, open `localhost:5000` in your browser. You need to keep the terminal open.\n\nTo quit the application, press `Ctrl-C` in the terminal. To resume the application, type `dataqa run`. 
Running the application creates a folder at `$HOME/.dataqa_data`.\n\n## Uploading data\n\nThe input must be a UTF-8 encoded CSV file of up to 30MB with a column named \"text\" that contains the main text. The other columns will be ignored.\n\nThe upload step runs some analysis on your text and might take up to 5 minutes.\n\n## Uninstall\n\nIn the terminal:\n\n* `dataqa uninstall`: this deletes your local application data, stored in the `.dataqa_data` folder in your home directory. It will prompt you before deleting.\n* `pip uninstall dataqa`\n\n### Does this tool need an internet connection?\n\nNope. **No data will ever leave your local machine.**\n\n## Troubleshooting\n\nIf the project data does not load, go to the homepage at `http://localhost:5000` and navigate to the project from there.\n\nTry running `dataqa test` to get more information about the error. Bug reports are very welcome!\n\nTo test the application, it is possible to upload a file that contains a column \"\\_\\_LABEL\\_\\_\". The ground-truth labels will then be displayed during labelling and the real performance will be shown in brackets in the performance table.\n\n# Documentation\n\nDocumentation at: [https://dataqa.ai/docs/](https://dataqa.ai/docs/).\n\n* To get started with a multi-class classification problem, go [here](https://dataqa.ai/docs/latest/tutorials/ecomm_product_categories/classification_product_categories/).\n* To get started with a named entity recognition problem, go [here](https://dataqa.ai/docs/latest/tutorials/medical_side_effects/ner_medical/).\n* To get started with a named entity linking problem, go [here](https://dataqa.ai/docs/latest/tutorials/medical_entity_disambiguation/ned_side_effects/).\n\n# What is weak supervision and why does it work?\n\nWeak supervision is a set of techniques to produce noisy labels for large quantities of data. It has gained popularity in recent years due to the large amounts of data typically needed for ML systems. 
The annotator can encode any prior domain knowledge they have in the form of rules. Even though these rules can be noisy, the algorithm learns how to weigh them and uses them as signals to extract patterns from the data.\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch4\u003eCreating a rule for classification\u003c/h4\u003e\n    \u003cimg src=\"github_images/rule_creation.gif\" width=\"800\" align=\"center\"/\u003e\n    \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n    \u003ch4\u003eCreating a rule for NER\u003c/h4\u003e\n    \u003cimg src=\"github_images/ner_rule.gif\" width=\"800\" align=\"center\"/\u003e\n\u003c/div\u003e\n\n# Contact\n\nFor any feedback, please contact us at contact@dataqa.ai. Also follow me on [![alt text][1.1]][1] for more updates and content around ML and labelling.\n\n[1.1]: https://i.imgur.com/wWzX9uB.png \n[1]: https://www.twitter.com/DataqaAi\n","funding_links":[],"categories":["JavaScript","search-engine"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataqa%2Fnlp-labelling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataqa%2Fnlp-labelling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataqa%2Fnlp-labelling/lists"}