{"id":15175886,"url":"https://github.com/tma15/bunruija","last_synced_at":"2025-10-26T11:31:25.983Z","repository":{"id":45199820,"uuid":"302823282","full_name":"tma15/bunruija","owner":"tma15","description":"A text classification toolkit","archived":false,"fork":false,"pushed_at":"2024-02-24T08:05:08.000Z","size":6338,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-10-30T04:49:31.512Z","etag":null,"topics":["neural-networks","pytorch","scikit-learn","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tma15.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-10T05:34:32.000Z","updated_at":"2022-01-01T04:49:54.000Z","dependencies_parsed_at":"2024-02-03T04:22:39.235Z","dependency_job_id":"1dfe0187-f182-4721-882f-efde72ea6596","html_url":"https://github.com/tma15/bunruija","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tma15%2Fbunruija","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tma15%2Fbunruija/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tma15%2Fbunruija/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tma15%2Fbunruija/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tma15","download_url":"https://codeload.github.com/
tma15/bunruija/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238319452,"owners_count":19452340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["neural-networks","pytorch","scikit-learn","text-classification"],"created_at":"2024-09-27T12:43:31.205Z","updated_at":"2025-10-26T11:31:20.660Z","avatar_url":"https://github.com/tma15.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bunruija\n[![PyPI version](https://badge.fury.io/py/bunruija.svg)](https://badge.fury.io/py/bunruija)\n\nBunruija is a text classification toolkit.\nBunruija aims at enabling pre-processing, training and evaluation of text classification models with **minimum coding effort**.\nBunruija is mainly focusing on Japanese though it is also applicable to other languages.\n\nSee `example` for understanding how bunruija is easy to use.\n\n## Features\n- **Minimum requirements of coding**: bunruija enables users to train and evaluate their models through command lines. 
Because all experimental settings are stored in a YAML file, users do not have to write code.\n- **Easy to compare neural and non-neural models**: because bunruija supports models based on scikit-learn and PyTorch in the same framework, users can easily compare classification accuracies and prediction times of neural- and non-neural-based models.\n- **Easy to reproduce the training of a model**: because all hyperparameters of a model are stored in a YAML file, it is easy to reproduce the model.\n\n## Install\n```\npip install bunruija\n```\n\n## Example configs\nExample of `sklearn.svm.SVC`\n\n```yaml\ndata:\n  label_column: category\n  text_column: title\n  args:\n    path: data/jsonl\n\noutput_dir: models/svm-model\n\npipeline:\n  - type: sklearn.feature_extraction.text.TfidfVectorizer\n    args:\n      tokenizer:\n        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer\n        args:\n          lemmatize: true\n          exclude_pos:\n            - 助詞\n            - 助動詞\n      max_features: 10000\n      min_df: 3\n      ngram_range:\n        - 1\n        - 3\n  - type: sklearn.svm.SVC\n    args:\n      verbose: false\n      C: 10.\n```\n\nExample of BERT\n\n```yaml\ndata:\n  label_column: category\n  text_column: title\n  args:\n    path: data/jsonl\n\noutput_dir: models/transformer-model\n\npipeline:\n  - type: bunruija.feature_extraction.sequence.SequenceVectorizer\n    args:\n      tokenizer:\n        type: transformers.AutoTokenizer\n        args:\n          pretrained_model_name_or_path: cl-tohoku/bert-base-japanese\n  - type: bunruija.classifiers.transformer.TransformerClassifier\n    args:\n      device: cpu\n      pretrained_model_name_or_path: cl-tohoku/bert-base-japanese\n      optimizer:\n        type: torch.optim.AdamW\n        args:\n          lr: 3e-5\n          weight_decay: 0.01\n          betas:\n            - 0.9\n            - 0.999\n      max_epochs: 3\n```\n\n## CLI\n```sh\n# Training a classifier\nbunruija-train 
-y config.yaml\n\n# Evaluating the trained classifier\nbunruija-evaluate -y config.yaml\n```\n\n## Config\n### data\nYou can set data-related settings in `data`.\n\n```yaml\ndata:\n  label_column: category\n  text_column: title\n  args:\n    # Use local data in `data/jsonl`. This path is assumed to contain data files such as train.jsonl, validation.jsonl and test.jsonl\n    path: data/jsonl\n\n    # If you want to use data on Hugging Face Hub, use the following args instead.\n    # Data is from https://huggingface.co/datasets/shunk031/livedoor-news-corpus\n    # path: shunk031/livedoor-news-corpus\n    # random_state: 0\n    # shuffle: true\n\n```\n\nData is loaded via [datasets.load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset).\nSo, you can load local data as well as data on [Hugging Face Hub](https://huggingface.co/datasets).\nWhen loading data, `args` are passed to `load_dataset`.\n\n`label_column` and `text_column` are the field names of the label and the text.\n\nFormat of `csv`:\n\n```csv\ncategory,sentence\nsports,I like sports!\n…\n```\n\nFormat of `json`:\n\n```json\n[{\"category\": \"sports\", \"text\": \"I like sports!\"}]\n```\n\nFormat of `jsonl`:\n\n```json\n{\"category\": \"sports\", \"text\": \"I like sports!\"}\n```\n\n### pipeline\nYou can define the pipeline of your model in the `pipeline` section.\nIt is a list of components that are used in your model.\n\nFor each component, `type` is a module path and `args` holds the arguments for the module.\nFor instance, when you set the first component as follows, [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is instantiated with the given arguments and applied to the data as the first step of your model.\n\n```yaml\n  - type: sklearn.feature_extraction.text.TfidfVectorizer\n    args:\n      tokenizer:\n        type: bunruija.tokenizers.mecab_tokenizer.MeCabTokenizer\n        args:\n          
lemmatize: true\n          exclude_pos:\n            - 助詞\n            - 助動詞\n      max_features: 10000\n      min_df: 3\n      ngram_range:\n        - 1\n        - 3\n```\n\n## Prediction using the trained classifier in Python code\nAfter training a classification model, you can use it for prediction as follows:\n```python\nfrom bunruija import Predictor\n\npredictor = Predictor.from_pretrained(\"output_dir\")\nwhile True:\n    text = input(\"Input:\")\n    label: list[str] = predictor([text], return_label_type=\"str\")\n    print(label[0])\n```\n\n`output_dir` is the directory specified by `output_dir` in the config.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftma15%2Fbunruija","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftma15%2Fbunruija","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftma15%2Fbunruija/lists"}