{"id":19381591,"url":"https://github.com/megagonlabs/ruler","last_synced_at":"2025-07-07T16:33:51.570Z","repository":{"id":41551932,"uuid":"302129029","full_name":"megagonlabs/ruler","owner":"megagonlabs","description":"Data Programming by Demonstration (DPBD) for Document Classification","archived":false,"fork":false,"pushed_at":"2021-06-17T17:05:28.000Z","size":17846,"stargazers_count":35,"open_issues_count":5,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-02T19:46:58.988Z","etag":null,"topics":["data-labeling","data-programming","data-science","machine-learning","training-data","weak-supervision"],"latest_commit_sha":null,"homepage":"https://drive.google.com/file/d/1iOQt81VDg9sCPcbrMWG8CR_8dOCfpKP5/view","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/megagonlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-07T18:41:12.000Z","updated_at":"2024-09-27T05:39:45.000Z","dependencies_parsed_at":"2022-09-12T03:51:00.303Z","dependency_job_id":null,"html_url":"https://github.com/megagonlabs/ruler","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fruler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fruler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fruler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fruler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/megagonlabs","download_url":"https://codeload.github.com/megagonlabs/ruler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250509706,"owners_count":21442482,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-labeling","data-programming","data-science","machine-learning","training-data","weak-supervision"],"created_at":"2024-11-10T09:17:38.859Z","updated_at":"2025-04-23T20:31:56.956Z","avatar_url":"https://github.com/megagonlabs.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RULER: Data Programming by Demonstration for Text \n \nThis repo contains the source code and the user evaluation data and analysis scripts for Ruler, a data programming by demonstration system for document labeling. \n\n\u003ch3 align=\"center\"\u003e\n Ruler synthesizes labeling functions based on your span-level annotations, allowing you to quickly and easily generate large amounts of training data for text classification, without the need to program. 
\u003cbr/\u003e\n\u003cimg width=800px src=media/ruler_teaser.gif\u003e\n\u003c/h3\u003e\n\nCheck out our [demo video](https://drive.google.com/file/d/1iOQt81VDg9sCPcbrMWG8CR_8dOCfpKP5/view?usp=sharing) to see Ruler in action on a spam classification task, or [try it yourself](http://54.83.150.235:3000/) on a sentiment analysis task.\n\n\n\u003ch3 align=\"center\"\u003e\n News: We have recently released \u003ca href= https://github.com/megagonlabs/tagruler\u003eTagRuler\u003c/a\u003e, an extension  of Ruler that you can use to generate labelled data for span annotation tasks. Check it out! \u003cbr/\u003e\n\u003c/h3\u003e\n\n\n## Table of Contents\n1. [What is Ruler?](#ruler)\n2. [How to Run the Source Code in This Repo](#Use)\n   - [Engine](#Engine)\n   - [User Interface](#UI)\n3. [Using Ruler: the Basics](#Basics)\n4. [For Researchers](#research)\n5. [Contact](#contact)\n\n\n## \u003ca name='ruler'\u003e\u003c/a\u003eWhat is Ruler?\n\nThe success of machine learning has dramatically increased the demand for high-quality labeled data---but this data is \nexpensive to obtain, which inhibits broader utilization of machine learning models outside resource rich settings. \nThat's where data programming [[1](https://arxiv.org/pdf/1605.07723.pdf), [2](https://arxiv.org/pdf/1711.10160.pdf)] \ncomes in. Data programming aims to address the difficulty of collecting labeled data using a \nprogrammatic approach to weak supervision, where domain (subject-matter) experts are expected to provide functions\nincorporating their domain knowledge to label a subset of a large training dataset. \n\nThis approach has a few drawbacks, however. Many domain experts lack programming expertise, but it would still be useful to translate their knowledge into functions. For example, training models for the medical domain requires volumes of high-accuracy training data, but the medical experts' time is very valuable, limiting the amount of time they can spend labeling. 
Even for domain experts who are proficient programmers, it is often difficult to convert domain knowledge to a set of rules. \n\nIn short, the accessibility of writing labeling functions is a challenge to wider adoption of data programming. To address this challenge, we introduce a new framework, __Data Programming by Demonstration (DPBD)__, to synthesize labeling functions through user interactions.\n\n\u003ch3 align=\"center\"\u003e\n\u003cimg  align=\"center\" width=\"900\" src=\"media/overview.png\" /\u003e\u003cbr/\u003e\nOverview of the data programming by demonstration (DPBD) framework. Straight lines indicate the flow of domain\nknowledge, and dashed lines indicate the flow of data.\n\u003cbr/\u003e\n\u003c/h3\u003e\n\nDPBD aims to move the burden of writing labeling functions to an intelligent synthesizer while enabling users to steer this synthesis. Ruler is an interactive tool that operationalizes data programming by demonstration for document text.\n\n\u003ch3 align=\"center\"\u003e\n\u003cimg width=800px src=media/ruler_teaser_wide.png\u003e\n \u003cbr/\u003eAn overview of the Ruler workflow. The user iteratively annotates and labels text, selects functions from those Ruler generates, and gets feedback on the performance of the set of labeling functions they have selected.\u003cbr/\u003e\n\u003c/h3\u003e\n\nFor example, consider a sentiment classification task. A labeling function might look something like this Python code:\n```\ndef find_positive_adj(text):\n    if \"awesome\" in text or \"great\" in text:\n        return POSITIVE\n    else:\n        return NEGATIVE\n```\nInstead of formalizing this function as Python code, a user can use Ruler to annotate the words \"awesome\" and \"great\" to get the same function. This is the \"demonstration\" part of DPBD.  
Ruler functions can also make use of word co-occurrence, named entities, and more.\n\nOnce the user is satisfied with the functions they've created using Ruler, these functions are aggregated using [Snorkel](https://www.snorkel.org/), which denoises the resulting label model. With this model, the user can label as much training data as they would like, and use it to train a more sophisticated supervised model.\n\n\n\u003ch3 align=\"center\"\u003e\nBy limiting users' task to simple annotation and selection from suggested rules, \u003cbr/\u003e\nwe allow fast exploration over the space of labeling functions.\n \u003cbr/\u003e\n\u003cimg width=700px src=media/fast-exporation-thin.gif\u003e\n\u003c/h3\u003e\n\n\n**This allows users to focus on**\n\n  :white_check_mark: choosing the right generalization of observed instances\n\n  :white_check_mark: capturing the tail end of their data distribution\n\n**and avoid worrying about**\n\n  :x: implementation details in a programming language\n\n  :x: how to express rules in natural language\n\n  :x: how to formalize their intuition\n  \n  \n# \u003ca name='Use'\u003e\u003c/a\u003eHow to run the source code in this repo\n\nFollow these instructions to run the system on your own, where you can plug in your own data and save the resulting labels, models, and annotations.\n\n## \u003ca name='Engine'\u003e\u003c/a\u003eEngine\n\nThe server runs on [Flask](https://flask.palletsprojects.com/en/1.1.x/) and can be found in [`server/`](server/). \n\nIt is strongly recommended that you use Python version 3.6.\n\n### 1. Install Dependencies :wrench:\n\n```shell\ncd server\npip install -r requirements.txt\n```\n\n\n### 2. Run :runner:\n\n```\npython api/server.py\n```\n\nNow the engine is running. 
To use Ruler, you will also need to run the UI, as described below.\n\nYou can check out http://localhost:5000/api/ui to see the supported endpoints.\nThis will display a [Swagger UI](https://swagger.io/tools/swagger-ui/) page that allows you to interact directly with the API.\n\n\n## \u003ca name='UI'\u003e\u003c/a\u003eUser Interface\n\n\nThe user interface is implemented with the [React](https://reactjs.org) JavaScript library. The code can be found in [`ui/`](ui/).\n\n### 1. Install Node.js\n\n[You can download Node.js here.](https://nodejs.org/en/)\n\nTo confirm that you have Node.js installed, run `node -v`.\n\n### 2. Run\n\n```shell\ncd ui\nnpm install\nnpm start\n```\n\nBy default, the app will make calls to `localhost:5000`, assuming that you have the server running on your machine. (See the [instructions above](#Engine).)\n\nOnce you have both of these running, navigate to `localhost:3000`.\n\n\n# \u003ca name='Basics'\u003e\u003c/a\u003eUsing Ruler: the Basics\n\nCongrats, you've got Ruler running! 🎉\n\n### Create/Load a Project\n\nWhen you navigate to `localhost:3000`, you will be guided through the process of initializing your project.\n\n1. Upload data. \nThere is some example data under [server/datasets/spam_example/processed.csv](server/datasets/spam_example/processed.csv). \nYou can also upload your own data here; just make sure it's a valid CSV file and that your text column is named `text`. If you have labels you want to use for development, these should be in a column named `labels`. Ruler will automatically split your data into training (the data you interactively label), development (the data your functions are evaluated on), and test/validation (to evaluate the end model).\n\n2. Create/load a model. \nIf you're iterating on a model you've previously saved, you can load it here. Otherwise, enter a name for your new model, and you will define the label classes in the next step.\n\n3. 
Define Labels.\n__WARNING__: Your label classes need to match the data you've uploaded. If your dataset has labels `{0: NON-SPAM, 1: SPAM}`, then you need to add the labels in this order to make sure they're mapped correctly.\nIf you're loading a previous model, make sure these label classes match the dataset.\n\n4. Continue to Project.\nYou should automatically be redirected to `localhost:3000/project` once your data is pre-processed.\n\n\nNeed some ideas? Try sentiment classification on this [Amazon Review dataset](https://www.kaggle.com/bittlingmayer/amazonreviews).\nUpload this dataset, create a new model, define label classes that match it (for example, `NEGATIVE` and `POSITIVE`), and get labeling.\n\n\n### Get Labeling\n\nNow you're at `localhost:3000/project`, where the magic happens. \n\n\u003ch3 align=\"center\"\u003e\n\u003cimg width=800px src=media/ruler_ui.png\u003e\n\u003c/h3\u003e\n\n__A/B__  Highlight parts of the text, add links between them, or create concepts to annotate the data. \n\n__C__ Once you select a label class, Ruler will automatically suggest functions for you. Select and submit the ones you like.\n\n__D__ Your label model performance will update as you go, showing changes with each addition/deletion of a function.\n\n__E__ If you want to evaluate a model trained on your generated labels, click the refresh icon in this panel. This will train a logistic regression model on bag-of-words features and report the performance. You should use this sparingly to avoid overfitting to the test set. Note that this is a very simplistic model, which may not be suitable for evaluating labels for some tasks.\n\n__F__ Here, you can inspect individual functions' performance and deactivate them.\n\nSee our [demo video](https://drive.google.com/file/d/1iOQt81VDg9sCPcbrMWG8CR_8dOCfpKP5/view?usp=sharing) for some example interactions.\n\n### Finished?\n\nSave your model by clicking the icon at the top right. 
If you decide to iterate on it further, you can load it on the create/load project page.\n\n\n\n# \u003ca name='research'\u003e\u003c/a\u003eFor Researchers\n\n\u003ca href=https://github.com/megagonlabs/ruler/tree/master/user_study\u003eHere you can find the data from our user study\u003c/a\u003e, along with \u003ca href=https://github.com/megagonlabs/ruler/blob/master/user_study/ruler_user_study_figures.ipynb\u003ethe code to generate all of our figures and analysis\u003c/a\u003e. \n\nPlease see our [Findings of EMNLP'20 publication](media/Ruler_EMNLP2020.pdf) for details. \n\n# \u003ca name='contact'\u003e\u003c/a\u003eContact\nIf you have any problems, please feel free to create a GitHub issue. \n\nFor other inquiries, contact \u003csara@megagon.ai\u003e.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fruler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmegagonlabs%2Fruler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fruler/lists"}