{"id":18894257,"url":"https://github.com/takelab/podium","last_synced_at":"2025-07-27T02:38:00.109Z","repository":{"id":49540265,"uuid":"147528943","full_name":"TakeLab/podium","owner":"TakeLab","description":"Podium: a framework agnostic Python NLP library for data loading and preprocessing","archived":false,"fork":false,"pushed_at":"2022-12-12T21:35:40.000Z","size":2296,"stargazers_count":60,"open_issues_count":15,"forks_count":2,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-01-20T02:02:24.086Z","etag":null,"topics":["data-loading","datasets","natural-language-processing","nlp","preprocessing","python"],"latest_commit_sha":null,"homepage":"http://takelab.fer.hr/podium","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TakeLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-05T14:13:35.000Z","updated_at":"2024-10-05T04:05:58.000Z","dependencies_parsed_at":"2023-01-28T01:16:21.312Z","dependency_job_id":null,"html_url":"https://github.com/TakeLab/podium","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TakeLab%2Fpodium","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TakeLab%2Fpodium/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TakeLab%2Fpodium/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TakeLab%2Fpodium/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TakeLab","download_url":"https://
codeload.github.com/TakeLab/podium/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248984423,"owners_count":21193750,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-loading","datasets","natural-language-processing","nlp","preprocessing","python"],"created_at":"2024-11-08T08:20:29.075Z","updated_at":"2025-04-15T00:32:01.455Z","avatar_url":"https://github.com/TakeLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg alt=\"TakeLab Podium\" src=\"docs/source/_static/podium_logo.svg\" width=\"350\"/\u003e\n    \u003cp\u003e\n    A framework agnostic Python NLP library for data loading and preprocessing.\n    \u003cbr\u003e\n\u003c/div\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/TakeLab/podium/actions\"\u003e\n        \u003cimg alt=\"Continuous integration\" src=\"https://github.com/TakeLab/podium/actions/workflows/ci.yml/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/TakeLab/podium/blob/master/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/TakeLab/podium.svg?color=blue\u0026cachedrop\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"http://takelab.fer.hr/podium/\"\u003e\n        \u003cimg alt=\"Documentation\" src=\"https://img.shields.io/website?down_color=red\u0026down_message=offline\u0026up_message=online\u0026url=http%3A%2F%2Ftakelab.fer.hr%2Fpodium%2Findex.html\"\u003e\n    
\u003c/a\u003e\n    \u003ca href=\"https://github.com/TakeLab/podium/releases\"\u003e\n        \u003cimg alt=\"Release\" src=\"https://img.shields.io/github/release/TakeLab/podium.svg\"\u003e\n    \u003c/a\u003e\n    \u003cbr\u003e\n\u003c/p\u003e\n\n## What is Podium?\n\nPodium is a framework agnostic Python natural language processing library which standardizes data loading and preprocessing.\nOur goal is to accelerate users' development of NLP models whichever aspect of the library they decide to use. \n\nWe desire Podium to be **lightweight**, in terms of code and dependencies, **flexible**, to cover most common use-cases and easily adapt to more specific ones and **clearly defined**, so new users can quickly understand the sequence of operations and how to inject their custom functionality.\n\nCheck out our [documentation](http://takelab.fer.hr/podium/) for more details. \nThe main source of inspiration for Podium is an old version of [torchtext](https://github.com/pytorch/text).\n\n### Contents\n\n- [Installation](#installation)\n- [Usage examples](#usage)\n- [Contributing](#contributing)\n- [Versioning](#versioning)\n- [Authors](#authors)\n- [License](#license)\n\n## Installation\n\n### Installing from pip\n\nYou can install `podium` using pip\n\n```bash\npip install podium-nlp\n```\n\n### Installing from source\n\nCommands to install `podium` from source\n\n```bash\ngit clone git@github.com:TakeLab/podium.git \u0026\u0026 cd podium\npip install .\n```\n\nFor more detailed installation instructions, check the [installation page](https://takelab.fer.hr/podium/installation.html) in the documentation.\n\n## Usage\n\n### Loading datasets\n\nUse some of our pre-defined datasets:\n\n```python\n\u003e\u003e\u003e from podium.datasets import SST\n\u003e\u003e\u003e sst_train, sst_dev, sst_test = SST.get_dataset_splits()\n\u003e\u003e\u003e sst_train.finalize_fields() # Trigger vocab construction\n\u003e\u003e\u003e print(sst_train)\nSST({\n    size: 6920,\n    
fields: [\n        Field({\n            name: text,\n            keep_raw: False,\n            is_target: False,\n            vocab: Vocab({specials: ('\u003cUNK\u003e', '\u003cPAD\u003e'), eager: False, is_finalized: True, size: 16284})\n        }),\n        LabelField({\n            name: label,\n            keep_raw: False,\n            is_target: True,\n            vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 2})\n        })\n    ]\n})\n\u003e\u003e\u003e print(sst_train[222]) # A short example\nExample({\n    text: (None, ['A', 'slick', ',', 'engrossing', 'melodrama', '.']),\n    label: (None, 'positive')\n})\n```\n\nLoad datasets from [🤗 datasets](https://github.com/huggingface/datasets):\n\n```python\n\u003e\u003e\u003e from podium.datasets.hf import HFDatasetConverter as HF\n\u003e\u003e\u003e import datasets\n\u003e\u003e\u003e # Load the huggingface dataset\n\u003e\u003e\u003e imdb = datasets.load_dataset('imdb')\n\u003e\u003e\u003e print(imdb.keys())\ndict_keys(['train', 'test', 'unsupervised'])\n\u003e\u003e\u003e # Wrap it so it can be used in Podium (without being loaded in memory!)\n\u003e\u003e\u003e imdb_train, imdb_test, imdb_unsupervised = HF.from_dataset_dict(imdb).values()\n\u003e\u003e\u003e # We need to trigger Vocab construction\n\u003e\u003e\u003e imdb_train.finalize_fields()\n\u003e\u003e\u003e print(imdb_train)\nHFDatasetConverter({\n    dataset_name: imdb,\n    size: 25000,\n    fields: [\n        Field({\n            name: 'text',\n            keep_raw: False,\n            is_target: False,\n            vocab: Vocab({specials: ('\u003cUNK\u003e', '\u003cPAD\u003e'), eager: False, is_finalized: True, size: 280619})\n        }),\n        LabelField({\n            name: 'label',\n            keep_raw: False,\n            is_target: True\n        })\n    ]\n})\n```\n\nLoad your own dataset from a standardized tabular format (e.g. 
`csv`, `tsv`, `jsonl`, ...):\n\n```python\n\u003e\u003e\u003e from podium.datasets import TabularDataset\n\u003e\u003e\u003e from podium import Vocab, Field, LabelField\n\u003e\u003e\u003e fields = {'premise':   Field('premise', numericalizer=Vocab()),\n...           'hypothesis':Field('hypothesis', numericalizer=Vocab()),\n...           'label':     LabelField('label')}\n\u003e\u003e\u003e dataset = TabularDataset('my_dataset.csv', format='csv', fields=fields)\n\u003e\u003e\u003e dataset.finalize_fields() # Trigger vocab construction\n\u003e\u003e\u003e print(dataset)\nTabularDataset({\n    size: 1,\n    fields: [\n        Field({\n            name: 'premise',\n            keep_raw: False,\n            is_target: False, \n            vocab: Vocab({specials: ('\u003cUNK\u003e', '\u003cPAD\u003e'), eager: False, is_finalized: True, size: 15})\n        }),\n        Field({\n            name: 'hypothesis',\n            keep_raw: False,\n            is_target: False, \n            vocab: Vocab({specials: ('\u003cUNK\u003e', '\u003cPAD\u003e'), eager: False, is_finalized: True, size: 6})\n        }),\n        LabelField({\n            name: 'label',\n            keep_raw: False,\n            is_target: True, \n            vocab: Vocab({specials: (), eager: False, is_finalized: True, size: 1})\n        })\n    ]\n})\n```\n\nCheck our documentation to see how you can load a dataset from [Pandas](https://pandas.pydata.org/), the CoNLL format, or define your own `Dataset` subclass (tutorial coming soon).\n\n### Define your preprocessing\n\nWe wrap dataset pre-processing in customizable `Field` classes. 
Each `Field` has an optional `Vocab` instance which automatically handles token-to-index conversion.\n\n```python\n\u003e\u003e\u003e from podium import Vocab, Field, LabelField\n\u003e\u003e\u003e vocab = Vocab(max_size=5000, min_freq=2)\n\u003e\u003e\u003e text = Field(name='text', numericalizer=vocab)\n\u003e\u003e\u003e label = LabelField(name='label')\n\u003e\u003e\u003e fields = {'text': text, 'label': label}\n\u003e\u003e\u003e sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)\n\u003e\u003e\u003e sst_train.finalize_fields()\n\u003e\u003e\u003e print(vocab)\nVocab({specials: ('\u003cUNK\u003e', '\u003cPAD\u003e'), eager: False, finalized: True, size: 5000})\n```\n\nEach `Field` allows the user full flexibility to modify the data in multiple stages:\n- Prior to tokenization (by using pre-tokenization `hooks`)\n- During tokenization (by using your own `tokenizer`)\n- Post tokenization (by using post-tokenization `hooks`)\n\nYou can also completely disregard our preprocessing and define your own by setting your own `numericalizer`.\n\nYou could decide to lowercase all the characters and filter out all non-alphanumeric tokens:\n\n```python\n\u003e\u003e\u003e def lowercase(raw):\n...     return raw.lower()\n\u003e\u003e\u003e def filter_alnum(raw, tokenized):\n...     filtered_tokens = [token for token in tokenized if\n...                        any([char.isalnum() for char in token])]\n...     
return raw, filtered_tokens\n\u003e\u003e\u003e text.add_pretokenize_hook(lowercase)\n\u003e\u003e\u003e text.add_posttokenize_hook(filter_alnum)\n\u003e\u003e\u003e fields = {'text': text, 'label': label}\n\u003e\u003e\u003e sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)\n\u003e\u003e\u003e sst_train.finalize_fields()\n\u003e\u003e\u003e print(sst_train[222])\nExample({\n    text: (None, ['a', 'slick', 'engrossing', 'melodrama']),\n    label: (None, 'positive')\n})\n```\n\n**Pre-tokenization** hooks accept and modify only the `raw` data.\n**Post-tokenization** hooks accept and modify both the `raw` and `tokenized` data.\n\n### Use preprocessing from other libraries\n\nA common use-case is to incorporate existing components of pretrained language models, such as BERT. This is straightforward to do with our `Field`s. This snippet requires the `🤗 transformers` library (`pip install transformers`).\n\n```python\n\u003e\u003e\u003e from transformers import BertTokenizer\n\u003e\u003e\u003e # Load the tokenizer and fetch pad index\n\u003e\u003e\u003e tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\u003e\u003e\u003e pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)\n\u003e\u003e\u003e # Define a BERT subword Field\n\u003e\u003e\u003e subword_field = Field(name=\"subword\",\n...                       padding_token=pad_index,\n...                       tokenizer=tokenizer.tokenize,\n...                       
numericalizer=tokenizer.convert_tokens_to_ids)\n\u003e\u003e\u003e fields = {'text': subword_field, 'label': label}\n\u003e\u003e\u003e sst_train, sst_dev, sst_test = SST.get_dataset_splits(fields=fields)\n\u003e\u003e\u003e # No need to finalize since we're not using a vocab!\n\u003e\u003e\u003e print(sst_train[222])\nExample({\n    subword: (None, ['a', 'slick', ',', 'eng', '##ross', '##ing', 'mel', '##od', '##rama', '.']),\n    label: (None, 'positive')\n})\n```\n\nFor a more interactive introduction, check out the quickstart on Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/takelab/podium/blob/master/docs/source/notebooks/quickstart.ipynb)\n\nFull usage examples can be found in our [docs](https://takelab.fer.hr/podium/) under the **Examples** heading.\n\n## Contributing\n\nWe welcome contributions! To learn more about making a contribution to Podium, please see our [Contribution page](CONTRIBUTING.md) and our [Roadmap](ROADMAP.md).\n\n## Versioning\n\nWe use [SemVer](http://semver.org/) for versioning. For the versions available, see the [tags on this repository](../../tags). \n\n## Authors\n\n* Podium is currently maintained by [Ivan Smoković](https://github.com/ivansmokovic), [Mario Šaško](https://github.com/mariosasko), [Filip Boltužić](https://github.com/FilipBolt), and [Martin Tutek](https://github.com/mttk). 
A non-exhaustive but growing list of collaborators: [Silvije Skudar](https://github.com/sskudar), [Domagoj Pluščec](https://github.com/domi385), [Marin Kačan](https://github.com/mkacan), [Dunja Vesinger](https://github.com/dunja-v), [Mate Mijolović](https://github.com/matemijolovic).\n* Thanks to the amazing [Mihaela Bošnjak](https://github.com/Bmikaella) for the logo!\n* The project was developed as part of [TakeLab](https://takelab.fer.hr) at the Faculty of Electrical Engineering and Computing, University of Zagreb.\n\nSee also the list of [contributors](../../graphs/contributors) who participated in this project.\n\n## Citation\n\nIf you are using Podium, please cite the following entry in your work:\n\n```\n@misc{tutek-etal-2021-podium,\n  author = {Martin Tutek and Filip Boltužić and Ivan Smoković and Mario Šaško and Silvije Škudar and Domagoj Pluščec and Marin Kačan and Dunja Vesinger and Mate Mijolović and Jan Šnajder},\n  title = {Podium: a framework-agnostic NLP preprocessing toolkit},\n  year = {2021},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/TakeLab/podium}},\n  commit = {4fed78b8d8366768df10454b8368f416a3305cc4}\n}\n```\n\n## License\n\nThis project is licensed under the BSD 3-Clause License; see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftakelab%2Fpodium","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftakelab%2Fpodium","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftakelab%2Fpodium/lists"}