{"id":13834765,"url":"https://github.com/IBM/low-resource-text-classification-framework","last_synced_at":"2025-07-10T07:30:52.237Z","repository":{"id":43293289,"uuid":"307936232","full_name":"IBM/low-resource-text-classification-framework","owner":"IBM","description":"Research framework for low resource text classification that allows the user to experiment with classification models and active learning strategies on a large number of sentence classification datasets, and to simulate real-world scenarios. The framework is easily expandable to new classification models, active learning strategies and datasets.","archived":true,"fork":false,"pushed_at":"2022-03-09T14:58:47.000Z","size":941,"stargazers_count":101,"open_issues_count":1,"forks_count":20,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-06-08T23:03:09.591Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IBM.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-28T07:20:02.000Z","updated_at":"2025-06-05T13:46:32.000Z","dependencies_parsed_at":"2022-09-23T11:50:30.423Z","dependency_job_id":null,"html_url":"https://github.com/IBM/low-resource-text-classification-framework","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/IBM/low-resource-text-classification-framework","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Flow-resource-text-classification-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Flow-re
source-text-classification-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Flow-resource-text-classification-framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Flow-resource-text-classification-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IBM","download_url":"https://codeload.github.com/IBM/low-resource-text-classification-framework/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Flow-resource-text-classification-framework/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264545157,"owners_count":23625403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-04T14:00:52.045Z","updated_at":"2025-07-10T07:30:51.078Z","avatar_url":"https://github.com/IBM.png","language":"Python","funding_links":[],"categories":["3.3 AL in AI Fields - 人工智能背景中的主动学习"],"sub_categories":["**Tutorials - 教程**"],"readme":"# Low-Resource Text Classification Framework\n\nIntroduced in [Ein-dor et al. 
(2020)](#reference), this is a framework for experimenting with text classification tasks.\nThe focus is on low-resource scenarios, and examining how active learning (AL) can be used in combination with\nclassification models.\n\nThe framework includes a selection of labeled datasets, machine learning models and active learning strategies (see \n[Built-in Implementations](#built-in-implementations) below), and can be easily adapted for additional setups and\nscenarios.\n\n\n**Table of contents**\n\n[Installation](#installation)\n\n[Running active learning experiments](#running-active-learning-experiments)\n\n[Adapting to additional scenarios](#adapting-to-additional-scenarios):\n* [Implementing a new machine learning model](#implementing-a-new-machine-learning-model)\n* [Implementing a new active learning strategy](#implementing-a-new-active-learning-strategy)\n* [Adding a new dataset](#adding-a-new-dataset)\n\n[Built-in Implementations](#built-in-implementations)\n\n[Reference](#reference)\n\n[License](#license)\n\n## Installation\nCurrently, the framework requires Python 3.7\n1. Clone the repository locally: \n\n   `git clone https://github.com/IBM/low-resource-text-classification-framework`\n2. Install the project dependencies: `pip install -r lrtc_lib/requirements.txt`\n\n   Windows users also need to download the latest [Microsoft Visual C++ Redistributable for Visual Studio](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads) in order to support tensorflow\n3. Run the shell script `lrtc_lib/download_and_prepare_datasets.sh`.\nThis script downloads the [datasets with built-in support](#built-in-implementations).\n\n\n## Running active learning experiments\nThe `ExperimentRunner` class enables running experiments in the vein of [Ein-dor et al. (2020)](#reference),\ni.e. an experimental flow where an initial seed of labeled instances is used to train a model, and then several\niterations of active learning are performed. 
In each active learning iteration, the set of labeled instances is \nexpanded with the batch examples selected by the active learning module, and a new model is trained on this larger set.\n\nImplementations of `ExperimentRunner` vary in terms of how the initial seed of labeled instances is selected.\nThe three scenarios described in the paper are implemented by:\n1. *ExperimentRunnerBalanced*\n2. *ExperimentRunnerImbalanced*\n3. *ExperimentRunnerImbalancedPractical*\n\nThe experiment flow can be performed on a custom combination of datasets, model types and active learning strategies.\n\nTo run an experiment from a terminal, go to the repository directory \n(usually `\u003cpath_to_python_projects\u003e/low-resource-text-classification-framework`) and run `python -m path.to.module`, \nfor example: \n```commandline\npython -m lrtc_lib.experiment_runners.experiment_runner_imbalanced_practical\n```\nAlternatively, an IDE such as PyCharm can be used.\n\nThe main function of each ExperimentRunner specifies all the experimental parameters. For information on all the \ndataset and category names available for running experiments, run `loaded_datasets_info.py`\nusing `python -m lrtc_lib.data_access.loaded_datasets_info`.\n\n\n## Adapting to additional scenarios\n\n### Implementing a new machine learning model\nThese are the steps for integrating a new classification model:\n1. Implement a new `TrainAndInferAPI`\n\n    Machine learning models are integrated by adding a new implementation of the TrainAndInferAPI.\n    The main functions are *train* and *infer*:\n    \n    **Train** a new model and return a unique model identifier that will be used for inference.\n    ```python    \n    def train(self, train_data: Sequence[Mapping], dev_data: Sequence[Mapping], test_data: Sequence[Mapping], \n    train_params: dict) -\u003e str\n   ```\n        \n    - train_data - a list of dictionaries with at least the \"text\" and \"label\" fields. 
Additional fields can be passed e.g.\n    *[{'text': 'text1', 'label': 1, 'additional_field': 'value1'}, {'text': 'text2', 'label': 0, 'additional_field': 'value2'}]*\n    - dev_data: can be None if not used by the implemented model\n    - test_data - can be None if not used by the implemented model\n    - train_params - dictionary for additional train parameters (can be None)\n\n    **Infer** a given sequence of elements and return the results.\n\n    ```python    \n    def infer(self, model_id, items_to_infer: Sequence[Mapping], infer_params: dict, use_cache=True) -\u003e dict:\n    ```    \n    - model_id\n    - items_to_infer: a list of dictionaries with at least the \"text\" field. Additional fields can be passed,\n    e.g. *[{'text': 'text1', 'additional_field': 'value1'}, {'text': 'text2', 'additional_field': 'value2'}]*\n    - infer_params: dictionary for additional inference parameters (can be None)\n    - use_cache: save the inference results to cache. Default is True\n    \n    Returns a dictionary with at least the \"labels\" key, where the value is a list of numeric labels for each element in\n    items_to_infer.\n    Additional keys (with list values of the same length) can be passed,\n    e.g. *{\"labels\": [1, 0], \"gradients\": [[0.24, -0.39, -0.66, 0.25], [0.14, 0.29, -0.26, 0.16]]}*\n\n2. Specify a new ModelType in `ModelTypes`\n3. Return the newly implemented TrainAndInferAPI in `TrainAndInferFactory`\n4. The system assumes that active learning strategies that require special inference outputs (e.g. text embeddings)\nare not supported by your new model. If your model does support this, add it to the appropriate category \nin `get_compatible_models` in `strategies.py`\n5. Set your ModelType in one of the ExperimentRunners, and run\n\n### Implementing a new active learning strategy\nThese are the steps for integrating a new active learning approach:\n1. 
Implement a new `ActiveLearner`\n   \n   Active learning modules inherit from the ActiveLearner API.\n   The main function to implement is *get_recommended_items_for_labeling*:\n   ```python\n   def get_recommended_items_for_labeling(self, workspace_id: str, model_id: str, dataset_name: str,\n                                           category_name: str, sample_size: int = 1) -\u003e Sequence[TextElement]:\n    \n   ```    \n   This function returns a batch of *sample_size* elements suggested by the active learning module for a given dataset\n   and category, based on the outputs of model *model_id*.\n   \n   Optionally, the ActiveLearner can also implement the function `get_per_element_score`, where the active learning \n   module does not just return a batch of selected elements, but can also assign each text element with a score.\n\n2. Specify a new ActiveLearningStrategy in `ActiveLearningStrategies`\n3. Return your new ActiveLearner in `ActiveLearningFactory`\n4. If the active learner requires particular outputs from the machine learning model, update `get_compatible_models` \naccordingly. For instance, if the strategy relies on model embeddings, add it to the set of embedding-based strategies.\n5. Set your ActiveLearningStrategy in one of the ExperimentRunners, and run\n\n### Adding a new dataset\nThese are the steps for adding a new dataset:\n\n1. Split your dataset into 3 csv files: `train.csv`, `dev.csv`, and `test.csv`. \n   1. Each line is a text element.\n   1. Each file should have at least two columns: `label` and `text`, and may have additional columns.\n   1. Files are placed under `lrtc_lib/data/available_datasets/\u003cnew_dataset_name\u003e`\n1. 
Create a processor for the new dataset by extending `CsvProcessor` (which implements `DataProcessorAPI`)\nand place it under `lrtc_lib/data_access/processors`.\n   The `__init__` function of `CsvProcessor` looks like this:\n   \n   ```python\n   def __init__(self, dataset_name: str, dataset_part: DatasetPart, text_col: str = 'text',\n                label_col: str = 'label', context_col: str = None,\n                doc_id_col: str = None,\n                encoding: str = 'utf-8'):\n   ```\n\n    - dataset_name: the name of the processed dataset\n    - dataset_part: the part of the dataset (train/dev/test)\n    - text_col: the name of the column which holds the text of the TextElement. Default is `text`.\n    - label_col: the name of the column which holds the label. Default is `label`.\n    - context_col: the name of the column which provides context for the text, None if no context is available.\n    Default is None.\n    - doc_id_col: the name of the column by which text elements should be grouped into documents.\n    If None, all text elements are placed in a single dummy document. Default is None.\n    - encoding: the encoding used to read the raw csv file(s). Default is `utf-8`.\n    \n    For example, here is the processor for DBPedia (which uses the default values of `CsvProcessor`):\n       \n   ```python\n   class DbpediaProcessor(CsvProcessor):\n\n       def __init__(self, dataset_part: DatasetPart):\n           super().__init__(dataset_name='dbpedia', dataset_part=dataset_part)\n   ```\n    \n   If more flexibility is needed, implement `DataProcessorAPI` directly. \n1. Add the new processor to `data_processor_factory`. Note that this step defines the name of the new dataset. \n1. 
Run `load_dataset` with the new dataset name (as defined in `data_processor_factory`) to generate dump files under \n`data/data_access_dumps` (for the documents and text elements of the dataset) and `data/oracle_access_dumps` \n(for the gold labels of the text elements).\n\n\n## Built-in Implementations\n### Datasets\n- [AG’s News](https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)\n- [CoLA](https://nyu-mll.github.io/CoLA/)\n- [ISEAR](https://www.unige.ch/cisa/research/materials-and-online-research/research-material/)*\n- [Polarity](https://www.cs.cornell.edu/people/pabo/movie-review-data/)\n- [Subjectivity](https://www.cs.cornell.edu/people/pabo/movie-review-data/)\n- [TREC](https://cogcomp.seas.upenn.edu/Data/QA/QC/)\n- [Wiki Attack](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release)\n\n\\* _Loading the ISEAR dataset requires installing additional dependencies before \n[running the installation script](#installation), and is only supported on Mac/Linux. Specifically, you will need to \ninstall [mdbtools](https://github.com/mdbtools/mdbtools) on your machine and then `pip install pandas_access`_.\n\n### Classification models\n- **ModelTypes.NB**: a Naive Bayes implementation from [scikit-learn](https://scikit-learn.org)\n- **ModelTypes.HF_BERT**: a TensorFlow implementation of BERT (Devlin et al. 
2018) that uses the [huggingface Transformers](https://github.com/huggingface/transformers) library \n\n### Active learning strategies\n- **RANDOM**: AL baseline, randomly sample from unlabeled data.\n- **RETROSPECTIVE**: select the instances that the model scores highest.\n- **HARD_MINING**: a.k.a. uncertainty sampling / least confidence; Lewis and Gale 1994\n- **GREEDY_CORE_SET**: the greedy method from Sener and Savarese 2017\n- **DAL**: Discriminative representation sampling; Gissin and Shalev-Shwartz 2019\n- **PERCEPTRON_ENSEMBLE**: lightweight ensemble version of uncertainty sampling; uncertainty is determined\nusing an ensemble of perceptrons trained on output embeddings from the original model.\n- **DROPOUT_PERCEPTRON**: similar to the above, but instead of an ensemble of perceptrons, uses a single perceptron\nwith Monte Carlo dropout (Gal and Ghahramani, 2016)\n\n\n## Reference\nLiat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz and Noam Slonim (2020). \n[Active Learning for BERT: An Empirical Study](https://www.aclweb.org/anthology/2020.emnlp-main.638/). EMNLP 2020\n\nPlease cite: \n```\n@inproceedings{ein-dor-etal-2020-active,\n    title = \"Active Learning for {BERT}: An Empirical Study\",\n    author = \"Ein-Dor, Liat  and\n      Halfon, Alon  and\n      Gera, Ariel  and\n      Shnarch, Eyal  and\n      Dankin, Lena  and\n      Choshen, Leshem  and\n      Danilevsky, Marina  and\n      Aharonov, Ranit  and\n      Katz, Yoav  and\n      Slonim, Noam\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)\",\n    month = nov,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.emnlp-main.638\",\n    pages = \"7949--7962\",\n}\n```\n\n## License\nThis work is released under the Apache 2.0 license. 
The full text of the license can be found in [LICENSE](LICENSE).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIBM%2Flow-resource-text-classification-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FIBM%2Flow-resource-text-classification-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FIBM%2Flow-resource-text-classification-framework/lists"}