{"id":13856985,"url":"https://github.com/tofunlp/lineflow","last_synced_at":"2026-03-12T15:23:29.001Z","repository":{"id":34234500,"uuid":"172405800","full_name":"tofunlp/lineflow","owner":"tofunlp","description":":zap:A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python","archived":false,"fork":false,"pushed_at":"2024-01-17T03:36:07.000Z","size":863,"stargazers_count":181,"open_issues_count":5,"forks_count":9,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-02-20T05:07:13.799Z","etag":null,"topics":["deep-learning","machine-learning","natural-language-processing","python"],"latest_commit_sha":null,"homepage":"https://towardsdatascience.com/lineflow-introduction-1caf7851125e","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tofunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-25T00:05:53.000Z","updated_at":"2025-02-05T00:37:58.000Z","dependencies_parsed_at":"2024-11-27T06:17:11.700Z","dependency_job_id":null,"html_url":"https://github.com/tofunlp/lineflow","commit_stats":null,"previous_names":["yasufumy/lineflow"],"tags_count":24,"template":false,"template_full_name":null,"purl":"pkg:github/tofunlp/lineflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tofunlp%2Flineflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tofunlp%2Flineflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tofunlp%2Flineflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/tofunlp%2Flineflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tofunlp","download_url":"https://codeload.github.com/tofunlp/lineflow/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tofunlp%2Flineflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30430200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T14:34:45.044Z","status":"ssl_error","status_checked_at":"2026-03-12T14:09:33.793Z","response_time":114,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","natural-language-processing","python"],"created_at":"2024-08-05T03:01:21.485Z","updated_at":"2026-03-12T15:23:28.940Z","avatar_url":"https://github.com/tofunlp.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# LineFlow: Framework-Agnostic NLP Data Loader in Python\n[![CI](https://github.com/tofunlp/lineflow/actions/workflows/ci.yml/badge.svg)](https://github.com/tofunlp/lineflow/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/tofunlp/lineflow/branch/master/graph/badge.svg)](https://codecov.io/gh/tofunlp/lineflow)\n\nLineFlow is a simple text dataset loader for NLP deep learning tasks.\n\n- LineFlow is designed to be used with any deep learning framework.\n- LineFlow enables you to build pipelines 
via functional APIs (`.map`, `.filter`, `.flat_map`).\n- LineFlow provides common NLP datasets.\n\nLineFlow is heavily inspired by [tensorflow.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and [chainer.dataset](https://docs.chainer.org/en/stable/reference/datasets.html).\n\n## Basic Usage\n\nlineflow.TextDataset expects line-oriented text files:\n\n```py\nimport lineflow as lf\n\n\n'''/path/to/text is expected to look as follows:\ni 'm a line 1 .\ni 'm a line 2 .\ni 'm a line 3 .\n'''\nds = lf.TextDataset('/path/to/text')\n\nds.first()  # \"i 'm a line 1 .\"\nds.all()  # [\"i 'm a line 1 .\", \"i 'm a line 2 .\", \"i 'm a line 3 .\"]\nlen(ds)  # 3\nds.map(lambda x: x.split()).first()  # [\"i\", \"'m\", \"a\", \"line\", \"1\", \".\"]\n```\n\n## Example\n\n- Please check out the [examples](https://github.com/yasufumy/lineflow/tree/master/examples) to see how to use LineFlow, especially for tokenization, building vocabulary, and indexing.\n\nLoads the Penn Treebank dataset:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.PennTreebank('train')\n\u003e\u003e\u003e train.first()\n' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '\n```\n\nSplits each sentence into words:\n\n```py\n\u003e\u003e\u003e # continuing from above\n\u003e\u003e\u003e train = train.map(str.split)\n\u003e\u003e\u003e train.first()\n['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']\n```\n\nObtains the words in the dataset:\n\n```py\n\u003e\u003e\u003e # continuing from above\n\u003e\u003e\u003e words = train.flat_map(lambda x: x)\n\u003e\u003e\u003e words.take(5)  # This is useful for building a vocabulary.\n['aer', 'banknote', 
'berlitz', 'calloway', 'centrust']\n```\n\nFurther reading:\n\n- [How to fine-tune BERT with pytorch-lightning](https://towardsdatascience.com/how-to-fine-tune-bert-with-pytorch-lightning-ba3ad2f928d2) by [@sobamchan](https://towardsdatascience.com/@sobamchan)\n\n## Requirements\n\n- Python 3.6+\n\n## Installation\n\nTo install LineFlow:\n\n```sh\npip install lineflow\n```\n\n## Datasets\n\nIs the dataset you want to use not supported? [Suggest a new dataset](https://github.com/tofunlp/lineflow/issues/new?template=dataset.md\u0026title=Add+support+for+\u003cdataset\u003e) :tada:\n\n- [Commonsense Reasoning](#commonsense-reasoning)\n- [Language Modeling](#language-modeling)\n- [Machine Translation](#machine-translation)\n- [Paraphrase](#paraphrase)\n- [Question Answering](#question-answering)\n- [Sentiment Analysis](#sentiment-analysis)\n- [Sequence Tagging](#sequence-tagging)\n- [Text Summarization](#text-summarization)\n\n\n### Commonsense Reasoning\n\n#### [CommonsenseQA](https://www.tau-nlp.org/commonsenseqa)\n\nLoads the CommonsenseQA dataset:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\n\u003e\u003e\u003e train = lfds.CommonsenseQA(\"train\")\n\u003e\u003e\u003e dev = lfds.CommonsenseQA(\"dev\")\n\u003e\u003e\u003e test = lfds.CommonsenseQA(\"test\")\n```\n\nThe items in this dataset look as follows:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\n\u003e\u003e\u003e train = lfds.CommonsenseQA(\"train\")\n\u003e\u003e\u003e train.first()\n{\"id\": \"075e483d21c29a511267ef62bedc0461\",\n \"answer_key\": \"A\",\n \"options\": {\"A\": \"ignore\",\n \"B\": \"enforce\",\n \"C\": \"authoritarian\",\n \"D\": \"yell at\",\n \"E\": \"avoid\"},\n \"stem\": \"The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?\"}\n```\n\n### Language Modeling\n\n\n#### [Penn Treebank](https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html)\n\nLoads the Penn Treebank dataset:\n\n```py\nimport 
lineflow.datasets as lfds\n\ntrain = lfds.PennTreebank('train')\ndev = lfds.PennTreebank('dev')\ntest = lfds.PennTreebank('test')\n```\n\n#### [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)\n\nLoads the WikiText-103 dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.WikiText103('train')\ndev = lfds.WikiText103('dev')\ntest = lfds.WikiText103('test')\n```\n\nThis dataset is preprocessed, so you can tokenize each line with `str.split`:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.WikiText103('train').flat_map(lambda x: x.split() + ['\u003ceos\u003e'])\n\u003e\u003e\u003e train.take(5)\n['\u003ceos\u003e', '=', 'Valkyria', 'Chronicles', 'III']\n```\n\n#### [WikiText-2](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) (Added by [@sobamchan](https://github.com/sobamchan), thanks.)\n\nLoads the WikiText-2 dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.WikiText2('train')\ndev = lfds.WikiText2('dev')\ntest = lfds.WikiText2('test')\n```\n\nThis dataset is preprocessed, so you can tokenize each line with `str.split`:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.WikiText2('train').flat_map(lambda x: x.split() + ['\u003ceos\u003e'])\n\u003e\u003e\u003e train.take(5)\n['\u003ceos\u003e', '=', 'Valkyria', 'Chronicles', 'III']\n```\n\n### Machine Translation\n\n#### [small_parallel_enja](https://github.com/odashi/small_parallel_enja):\n\nLoads the small_parallel_enja dataset, a small English-Japanese parallel corpus:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.SmallParallelEnJa('train')\ndev = lfds.SmallParallelEnJa('dev')\ntest = lfds.SmallParallelEnJa('test')\n```\n\nThis dataset is preprocessed, so you can tokenize each line with `str.split`:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = 
lfds.SmallParallelEnJa('train').map(lambda x: (x[0].split(), x[1].split()))\n\u003e\u003e\u003e train.first()\n(['i', 'can', \"'t\", 'tell', 'who', 'will', 'arrive', 'first', '.'], ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])\n```\n\n### Paraphrase\n\n#### [Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398):\n\nLoads the Microsoft Research Paraphrase Corpus:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.MsrParaphrase('train')\ntest = lfds.MsrParaphrase('test')\n```\n\nThe items in this dataset look as follows:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.MsrParaphrase('train')\n\u003e\u003e\u003e train.first()\n{'quality': '1',\n 'id1': '702876',\n 'id2': '702977',\n 'string1': 'Amrozi accused his brother, whom he called \"the witness\", of deliberately distorting his evidence.',\n 'string2': 'Referring to him as only \"the witness\", Amrozi accused his brother of deliberately distorting his evidence.'\n}\n```\n\n### Question Answering\n\n#### [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/):\n\nLoads the SQuAD dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.Squad('train')\ndev = lfds.Squad('dev')\n```\n\nThe items in this dataset look as follows:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.Squad('train')\n\u003e\u003e\u003e train.first()\n{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],\n 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',\n 'id': '5733be284776f41900661182',\n 'title': 'University_of_Notre_Dame',\n 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". 
Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}\n```\n\n### Sentiment Analysis\n\n#### [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/):\n\nLoads the IMDB dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.Imdb('train')\ntest = lfds.Imdb('test')\n```\n\nThe items in this dataset look as follows:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.Imdb('train')\n\u003e\u003e\u003e train.first()\n('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan \"The Skipper\" Hale jr. as a police Sgt.', 0)\n```\n\n### Sequence Tagging\n\n#### [CoNLL2000](https://www.clips.uantwerpen.be/conll2000/chunking/)\n\nLoads the CoNLL2000 dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.Conll2000('train')\ntest = lfds.Conll2000('test')\n```\n\n### Text Summarization\n\n#### [CNN / Daily Mail](https://github.com/harvardnlp/sent-summary):\n\nLoads the CNN / Daily Mail dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.CnnDailymail('train')\ndev = lfds.CnnDailymail('dev')\ntest = lfds.CnnDailymail('test')\n```\n\nThis dataset is preprocessed, so you can tokenize each line with `str.split`:\n\n```py\n\u003e\u003e\u003e import lineflow.datasets as lfds\n\u003e\u003e\u003e train = lfds.CnnDailymail('train').map(lambda x: (x[0].split(), x[1].split()))\n\u003e\u003e\u003e train.first()\n... 
# the output is omitted because it's too long to display here.\n```\n\n#### [SciTLDR](https://github.com/allenai/scitldr)\n\nLoads the SciTLDR dataset:\n\n```py\nimport lineflow.datasets as lfds\n\ntrain = lfds.SciTLDR('train')\ndev = lfds.SciTLDR('dev')\ntest = lfds.SciTLDR('test')\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftofunlp%2Flineflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftofunlp%2Flineflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftofunlp%2Flineflow/lists"}