{"id":13482410,"url":"https://github.com/chakki-works/chazutsu","last_synced_at":"2025-04-05T17:04:08.839Z","repository":{"id":46080397,"uuid":"90588866","full_name":"chakki-works/chazutsu","owner":"chakki-works","description":"The tool to make NLP datasets ready to use","archived":false,"fork":false,"pushed_at":"2022-10-20T22:08:19.000Z","size":1044,"stargazers_count":242,"open_issues_count":3,"forks_count":32,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-03-29T16:08:34.637Z","etag":null,"topics":["dataset","machine-learning","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://medium.com/chakki/how-to-load-text-datasets-before-youre-in-trouble-with-them-345cdb1f1b33","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chakki-works.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-08T05:14:43.000Z","updated_at":"2025-01-20T10:51:12.000Z","dependencies_parsed_at":"2022-08-30T08:00:39.302Z","dependency_job_id":null,"html_url":"https://github.com/chakki-works/chazutsu","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chakki-works%2Fchazutsu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chakki-works%2Fchazutsu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chakki-works%2Fchazutsu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chakki-works%2Fchazutsu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cha
kki-works","download_url":"https://codeload.github.com/chakki-works/chazutsu/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247369953,"owners_count":20927928,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","machine-learning","natural-language-processing"],"created_at":"2024-07-31T17:01:01.717Z","updated_at":"2025-04-05T17:04:08.816Z","avatar_url":"https://github.com/chakki-works.png","language":"Python","funding_links":[],"categories":["Libraries","函式庫"],"sub_categories":["Videos and Online Courses","書籍"],"readme":"# chazutsu\n\n![chazutsu_top.PNG](./docs/chazutsu_top.PNG)  \n*[photo from Kaikado, traditional Japanese chazutsu maker](http://www.kaikado.jp/english/goods/design.html)*\n\n[![PyPI version](https://badge.fury.io/py/chazutsu.svg)](https://badge.fury.io/py/chazutsu)\n[![Build Status](https://travis-ci.org/chakki-works/chazutsu.svg?branch=master)](https://travis-ci.org/chakki-works/chazutsu)\n[![codecov](https://codecov.io/gh/chakki-works/chazutsu/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/chazutsu)\n\nchazutsu is the dataset downloader for NLP.\n\n```py\n\u003e\u003e\u003e import chazutsu\n\u003e\u003e\u003e r = chazutsu.datasets.IMDB().download()\n\u003e\u003e\u003e r.train_data().head(5)\n```\nThen\n\n```\n   polarity  rating                                             review\n0         0       3  You'd think the first landing on the Moon woul...\n1         1       9  I took a flyer in renting this movie but I got...\n2         1      10  Sometimes I just want to 
laugh. Don't you? No ...\n3         0       2  I knew it wasn't gunna work out between me and...\n4         0       2  Sometimes I rest my head and think about the r...\n```\n\nYou can use chazutsu on Jupyter.\n\n## Install\n\n```\npip install chazutsu\n```\n\n## Supported datasets\n\nchazutsu supports various kinds of datasets!  \n**[Please see the details here!](https://github.com/chakki-works/chazutsu/tree/master/chazutsu)**\n\n* Sentiment Analysis\n  * Movie Review Data\n  * Customer Review Datasets\n  * Large Movie Review Dataset (IMDB)\n* Text classification\n  * 20 Newsgroups\n  * Reuters News Corpus (RCV1-v2)\n* Language Modeling\n  * Penn Tree Bank\n  * WikiText-2\n  * WikiText-103\n  * text8\n* Text Summarization\n  * DUC2003\n  * DUC2004\n  * Gigaword\n* Textual entailment\n  * The Multi-Genre Natural Language Inference (MultiNLI)\n* Question Answering\n  * The Stanford Question Answering Dataset (SQuAD)\n\n\n# How it works\n\nchazutsu not only downloads the dataset, but also expands the archive file, shuffles, splits, and picks samples (you can disable these steps with arguments if you don't need them).\n\n![chazutsu_process1.png](./docs/chazutsu_process1.png)\n\n```\nr = chazutsu.datasets.MovieReview.polarity(shuffle=False, test_size=0.3, sample_count=100).download()\n```\n\n* `shuffle`: Whether to shuffle the dataset (True/False).\n* `test_size`: The ratio of the test dataset (ignored if the dataset already provides train and test splits).\n* `sample_count`: The number of samples to pick from the dataset, which avoids freezing your editor on a heavy text file.\n* `force`: Ignore the cache and re-download the dataset.\n\nchazutsu supports the fundamental processing for tokenization.\n\n![chazutsu_process2.png](./docs/chazutsu_process2.png)\n\n```py\n\u003e\u003e\u003e import chazutsu\n\u003e\u003e\u003e r = chazutsu.datasets.MovieReview.subjectivity().download()\n\u003e\u003e\u003e r.train_data().head(3)\n```\n\nThen\n\n```\n    
subjectivity                                             review\n0             0  . . . works on some levels and is certainly wo...\n1             1  the hulk is an anger fueled monster with incre...\n2             1  when the skittish emma finds blood on her pill...\n```\n\nNow we want to convert this data into a form that can be used for training with various frameworks.\n\n```py\nfixed_len = 10\nr.make_vocab(vocab_size=1000)\nr.column(\"review\").as_word_seq(fixed_len=fixed_len)\nX, y = r.to_batch(\"train\")\nassert X.shape == (len(y), fixed_len, len(r.vocab))\nassert y.shape == (len(y), 1)\n```\n\n* `make_vocab`\n  * `vocab_resources`: Resources used to make the vocabulary (\"train\", \"valid\", \"test\")\n  * `columns_for_vocab`: The columns used to make the vocabulary\n  * `tokenizer`: Tokenizer\n  * `vocab_size`: Vocabulary size\n  * `min_word_freq`: Minimum word count for inclusion in the vocabulary\n  * `unknown`: The tag used for out-of-vocabulary words\n  * `padding`: The tag used to pad the sequence\n  * `end_of_sentence`: If you want to mark the end of a line with a specific tag, use this.\n  * `reserved_words`: Words that should be included in the vocabulary (e.g. the tag for padding)\n  * `force`: Ignore the cache and re-create the dataset.\n\nIf you don't want to load all the training data at once, you can use `to_batch_iter`.\n\n## Additional Feature\n\n### Use on Jupyter\n\nYou can use chazutsu on [Jupyter Notebook](http://jupyter.org/).  \n\n![on_jupyter.png](./docs/on_jupyter.png)\n\nBefore you execute chazutsu on Jupyter, you have to enable the widget extension with the command below.\n\n```\njupyter nbextension enable --py --sys-prefix widgetsnbextension\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchakki-works%2Fchazutsu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchakki-works%2Fchazutsu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchakki-works%2Fchazutsu/lists"}