{"id":19860859,"url":"https://github.com/asyml/forte","last_synced_at":"2025-04-04T09:09:55.468Z","repository":{"id":37318720,"uuid":"201518876","full_name":"asyml/forte","owner":"asyml","description":"Forte is a flexible and powerful ML workflow builder.  This is part of the CASL project: http://casl-project.ai/","archived":false,"fork":false,"pushed_at":"2024-02-05T17:54:48.000Z","size":18702,"stargazers_count":244,"open_issues_count":105,"forks_count":60,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-03-28T08:08:23.911Z","etag":null,"topics":["data-processing","deep-learning","information-retrieval","machine-learning","natural-language","natural-language-processing","pipeline","python","text-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asyml.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"citation","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-09T18:12:12.000Z","updated_at":"2025-03-13T00:46:44.000Z","dependencies_parsed_at":"2024-06-20T22:09:28.061Z","dependency_job_id":null,"html_url":"https://github.com/asyml/forte","commit_stats":{"total_commits":1028,"total_committers":53,"mean_commits":19.39622641509434,"dds":0.7247081712062257,"last_synced_commit":"13e50aebe2afd79a7a8b3c01f0bb2568addea54f"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asyml%2Fforte","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/asyml%2Fforte/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asyml%2Fforte/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asyml%2Fforte/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asyml","download_url":"https://codeload.github.com/asyml/forte/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247149502,"owners_count":20891954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-processing","deep-learning","information-retrieval","machine-learning","natural-language","natural-language-processing","pipeline","python","text-data"],"created_at":"2024-11-12T15:07:24.318Z","updated_at":"2025-04-04T09:09:55.451Z","avatar_url":"https://github.com/asyml.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n   \u003cimg src=\"https://raw.githubusercontent.com/asyml/forte/master/docs/_static/img/logo_h.png\"\u003e\u003cbr\u003e\u003cbr\u003e\n\u003c/div\u003e\n\n-----------------\n\u003cp align=\"center\"\u003e\n   \u003ca href=\"https://github.com/asyml/forte/actions/workflows/main.yml\"\u003e\u003cimg src=\"https://github.com/asyml/forte/actions/workflows/main.yml/badge.svg\" alt=\"build\"\u003e\u003c/a\u003e\n   \u003ca href=\"https://codecov.io/gh/asyml/forte\"\u003e\u003cimg src=\"https://codecov.io/gh/asyml/forte/branch/master/graph/badge.svg\" alt=\"test coverage\"\u003e\u003c/a\u003e\n   \u003ca 
href=\"https://asyml-forte.readthedocs.io/en/latest/\"\u003e\u003cimg src=\"https://readthedocs.org/projects/asyml-forte/badge/?version=latest\" alt=\"documentation\"\u003e\u003c/a\u003e\n   \u003ca href=\"https://github.com/asyml/forte/blob/master/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue.svg\" alt=\"apache license\"\u003e\u003c/a\u003e\n   \u003ca href=\"https://gitter.im/asyml/community\"\u003e\u003cimg src=\"http://img.shields.io/badge/gitter.im-asyml/forte-blue.svg\" alt=\"gitter\"\u003e\u003c/a\u003e\n   \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" alt=\"code style: black\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#installation\"\u003eDownload\u003c/a\u003e •\n  \u003ca href=\"#quick-start-guide\"\u003eQuick Start\u003c/a\u003e •\n  \u003ca href=\"#contributing\"\u003eContribution Guide\u003c/a\u003e •\n  \u003ca href=\"#license\"\u003eLicense\u003c/a\u003e •\n  \u003ca href=\"https://asyml-forte.readthedocs.io/en/latest\"\u003eDocumentation\u003c/a\u003e •\n  \u003ca href=\"https://aclanthology.org/2020.emnlp-demos.26/\"\u003ePublication\u003c/a\u003e\n\u003c/p\u003e\n\n**Bring good software engineering to your ML solutions, starting from Data!**\n\n**Forte** is a data-centric framework designed to engineer complex ML workflows. Forte allows practitioners to build ML components in a composable and modular way. 
Behind the scenes, it introduces [DataPack](https://asyml-forte.readthedocs.io/en/latest/notebook_tutorial/handling_structued_data.html), a standardized data structure for unstructured data, distilling\ngood software engineering practices such as reusability, extensibility, and flexibility into\nML solutions.\n\n![image](https://user-images.githubusercontent.com/1015991/165164897-e69fd9e7-278c-4e2b-80e4-5d1c389c1bfe.png)\n\nDataPacks are standard data packages in an ML workflow that can represent the source data (e.g. text, audio, images) and additional markups (e.g. entity mentions, bounding boxes). They are powered by a customizable data schema named \"Ontology\", allowing domain experts to inject their knowledge into ML engineering processes easily.\n\n## Installation\n\nTo install the released version from PyPI:\n\n```bash\npip install forte\n```\n\nTo install from source:\n\n```bash\ngit clone https://github.com/asyml/forte.git\ncd forte\npip install .\n```\n\nTo install Forte adapters for existing [libraries](https://github.com/asyml/forte-wrappers#libraries-and-tools-supported):\n\nInstall from PyPI:\n```bash\n# To install other tools. Check here https://github.com/asyml/forte-wrappers#libraries-and-tools-supported for available tools.\npip install forte.spacy\n```\n\nInstall from source:\n\n```bash\ngit clone https://github.com/asyml/forte-wrappers.git\ncd forte-wrappers\n# Change spacy to other tools. 
Check here https://github.com/asyml/forte-wrappers#libraries-and-tools-supported for available tools.\npip install src/spacy\n```\n\nSome components or modules in Forte may require [extra requirements](https://github.com/asyml/forte/blob/master/setup.py#L45):\n\n\n* `pip install forte[data_aug]`: Install packages required for [data augmentation modules](https://github.com/asyml/forte/tree/master/forte/processors/data_augment).\n* `pip install forte[ir]`: Install packages required for [Information Retrieval support](https://github.com/asyml/forte/tree/master/forte/processors/ir/).\n* `pip install forte[remote]`: Install packages required for pipeline serving functionalities, such as [Remote Processor](https://github.com/asyml/forte/processors/misc/remote_processor.py).\n* `pip install forte[audio_ext]`: Install packages required for Forte Audio support, such as [Audio Reader](https://github.com/asyml/forte/blob/master/forte/data/readers/audio_reader.py).\n* `pip install forte[stave]`: Install packages required for [Stave](https://github.com/asyml/forte/blob/master/forte/processors/stave/stave_processor.py) integration.\n* `pip install forte[models]`: Install packages required for [ner training](https://github.com/asyml/forte/blob/master/forte/trainer/ner_trainer.py), [srl](https://github.com/asyml/forte/tree/master/forte/models/srl), [srl with new training system](https://github.com/asyml/forte/tree/master/forte/models/srl_new), [srl_predictor](https://github.com/asyml/forte/tree/master/forte/processors/nlp/srl_predictor.py), and [ner_predictor](https://github.com/asyml/forte/tree/master/forte/processors/nlp/ner_predictor.py).\n* `pip install forte[test]`: Install packages required for running [unit tests](https://github.com/asyml/forte/tree/master/tests).\n* `pip install forte[wikipedia]`: Install packages required for reading [wikipedia datasets](https://github.com/asyml/forte/tree/master/forte/datasets/wikipedia).\n* `pip install forte[nlp]`: Install 
packages required for additional NLP support, such as [subword_tokenizer](https://github.com/asyml/forte/tree/master/forte/processors/nlp/subword_tokenizer.py) and [texar encoder](https://github.com/asyml/forte/tree/master/forte/processors/third_party/pretrained_encoder_processors.py).\n* `pip install forte[extractor]`: Install packages required for the extractor-based training system: [extractor](https://github.com/asyml/forte/blob/master/forte/data/extractors), [train_preprocessor](https://github.com/asyml/forte/tree/master/forte/train_preprocessor.py), [tagging trainer](https://github.com/asyml/forte/tree/master/examples/tagging/tagging_trainer.py), [DataPack dataset](https://github.com/asyml/forte/blob/master/forte/data/data_pack_dataset.py), [types](https://github.com/asyml/forte/blob/master/forte/data/types.py), and [converter](https://github.com/asyml/forte/blob/master/forte/data/converter).\n* `pip install forte[payload]`: Install packages required for payload.\n\n## Quick Start Guide\nWriting NLP pipelines with Forte is easy. 
The following example creates a simple pipeline that analyzes the sentences, tokens, and part-of-speech tags from a piece of text.\n\nBefore we start, make sure the SpaCy wrapper is installed.\n```bash\npip install forte.spacy\n```\n\nLet's start by writing a simple processor that adds POS tags to tokens using the good old NLTK library.\n```python\nimport nltk\nfrom forte.processors.base import PackProcessor\nfrom forte.data.data_pack import DataPack\nfrom ft.onto.base_ontology import Token\n\nclass NLTKPOSTagger(PackProcessor):\n    r\"\"\"A wrapper of NLTK pos tagger.\"\"\"\n\n    def initialize(self, resources, configs):\n        super().initialize(resources, configs)\n        # download the NLTK averaged perceptron tagger\n        nltk.download(\"averaged_perceptron_tagger\")\n\n    def _process(self, input_pack: DataPack):\n        # get a list of token data entries from `input_pack`\n        # using the `DataPack.get()` method\n        token_texts = [token.text for token in input_pack.get(Token)]\n\n        # use nltk pos tagging module to tag token texts\n        taggings = nltk.pos_tag(token_texts)\n\n        # assign nltk taggings to token attributes\n        for token, tag in zip(input_pack.get(Token), taggings):\n            token.pos = tag[1]\n```\nIf we break it down, we will notice there are two main functions.\nIn the `initialize` function, we download and prepare the model. Then, in the `_process`\nfunction, we actually process the `DataPack` object, take the tokens from it, and\nuse the NLTK tagger to create POS tags. 
The results are stored as the `pos` attribute of\nthe tokens.\n\nBefore we go into the details of the implementation, let's try it in\na full pipeline.\n\n```python\nfrom forte import Pipeline\n\nfrom forte.data.readers import StringReader\nfrom fortex.spacy import SpacyProcessor\n\npipeline: Pipeline = Pipeline[DataPack]()\npipeline.set_reader(StringReader())\npipeline.add(SpacyProcessor(), {\"processors\": [\"sentence\", \"tokenize\"]})\npipeline.add(NLTKPOSTagger())\n```\n\nHere we have successfully created a pipeline with a few components:\n* a `StringReader` that reads data from a string,\n* a `SpacyProcessor` that calls SpaCy to split the sentences and tokenize the text,\n* and finally the brand new `NLTKPOSTagger` we just implemented.\n\nLet's see it in action!\n\n```python\ninput_string = \"Forte is a data-centric ML framework\"\nfor pack in pipeline.initialize().process_dataset(input_string):\n    for sentence in pack.get(\"ft.onto.base_ontology.Sentence\"):\n        print(\"The sentence is: \", sentence.text)\n        print(\"The POS tags of the tokens are:\")\n        for token in pack.get(Token, sentence):\n            print(f\" {token.text}[{token.pos}]\", end=\" \")\n        print()\n```\nIt gives us output as follows:\n\n```\nForte[NNP]  is[VBZ]  a[DT]  data[NN]  -[:]  centric[JJ]  ML[NNP]  framework[NN]  .[.]\n```\n\nWe have successfully created a simple pipeline. In a nutshell, the `DataPack`s are\nthe standard packages \"flowing\" on the pipeline. They are created by the reader, and\nthen passed along the pipeline.\n\nEach processor, such as our `NLTKPOSTagger`,\ninterfaces directly with `DataPack`s and does not need to worry about the\nother parts of the pipeline, making the engineering process more modular. 
In this example\npipeline, `SpacyProcessor` creates the `Sentence` and `Token` entries, and the\n`NLTKPOSTagger` we implemented adds part-of-speech tags to the tokens.\n\nTo learn more about the details, check out our [documentation](https://asyml-forte.readthedocs.io/)!\nThe classes used in this guide can also be found in this repository or\n[the Forte Wrappers repository](https://github.com/asyml/forte-wrappers/tree/main/src/spacy).\n\n## And There's More\nThe data-centric abstraction of Forte opens the door to many other opportunities.\nNot only does Forte allow engineers to develop reusable components easily, it also provides a simple way to develop composable ML modules. For example, Forte allows us to:\n* create composable ML solutions with reusable models and processing logic\n* easily interface with a great collection of [3rd party toolkits](https://github.com/asyml/forte-wrappers) built by the community\n* build plug-and-play [data augmentation tools](https://asyml-forte.readthedocs.io/en/latest/code/data_aug.html)\n\n![image](https://user-images.githubusercontent.com/1015991/164107427-66a5c9bd-a3ae-4d75-bfe2-24246e574e07.png)\n\n\nTo learn more about these, you can visit:\n* [Examples](https://github.com/asyml/forte/tree/master/examples)\n* [Documentation](https://asyml-forte.readthedocs.io/)\n* We are currently working on some interesting [tutorials](https://asyml-forte.readthedocs.io/en/latest/index_toc.html); stay tuned for a full set of documentation on how to do NLP with Forte!\n\n\n## Contributing\nForte was originally developed at CMU and is actively contributed to by [Petuum](https://petuum.com/) in collaboration with other institutes. 
This project is part of the [CASL Open Source](http://casl-project.ai/) family.\n\nIf you are interested in making enhancements to Forte, please first go over our [Code of Conduct](https://github.com/asyml/forte/blob/master/CODE_OF_CONDUCT.md) and [Contribution Guideline](https://github.com/asyml/forte/blob/master/CONTRIBUTING.md).\n\n## About\n\n### Supported By\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://user-images.githubusercontent.com/28021889/165799232-2bb9f819-f394-4ade-98b0-c55c751ec8b1.png\" width=\"180\" align=\"top\"\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n   \u003cimg src=\"https://user-images.githubusercontent.com/28021889/165799272-9e51b864-04f6-432a-92e8-e0f84e091f72.png\" width=\"180\" align=\"top\"\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n   \u003cimg src=\"https://user-images.githubusercontent.com/28021889/165802470-f478de54-6c44-4ec8-8cab-ba74ed1f0163.png\" width=\"180\" align=\"top\"\u003e\n   \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003c/p\u003e\n\n![image](https://user-images.githubusercontent.com/28021889/165806563-1542aeac-9656-4ad4-bf9c-f9a2e083f5d8.png)\n\n### License\n\n[Apache License 2.0](https://github.com/asyml/forte/blob/master/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasyml%2Fforte","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasyml%2Fforte","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasyml%2Fforte/lists"}