Budou 🍇
===========

**Budou is in maintenance mode. The development team is focusing on developing its successor,** `BudouX <https://github.com/google/budoux>`_

.. image:: https://badge.fury.io/py/budou.svg
   :target: https://badge.fury.io/py/budou

English text has many clues, like spacing and hyphenation, that enable beautiful
and legible line breaks. Some CJK languages lack these clues, and so are
notoriously more difficult to process. Without a more careful approach,
breaks can occur randomly and usually in the middle of a word. This is a
long-standing issue with typography on the web and results in a degradation
of readability.

Budou automatically translates CJK sentences into HTML with
lexical chunks wrapped in non-breaking markup, so as to semantically control line
breaks. Budou uses word segmenters to analyze input sentences.
It can also
concatenate proper nouns to produce meaningful chunks utilizing
part-of-speech (POS) tagging and other syntactic information. Processed chunks are
wrapped with the :code:`SPAN` tag. These semantic units will no longer be split at
the end of a line if given a CSS :code:`display` property set to :code:`inline-block`.


Installation
--------------

The package is listed in the Python Package Index (PyPI), so you can install it
with pip:

.. code-block:: sh

   $ pip install budou


Output
--------------

Budou outputs an HTML snippet wrapping chunks with :code:`span` tags:

.. code-block:: html

   <span><span class="ww">常に</span><span class="ww">最新、</span>
   <span class="ww">最高の</span><span class="ww">モバイル。</span></span>

Semantic chunks in the output HTML will not be split at the end of a line when
each :code:`span` tag is given :code:`display: inline-block` in CSS.

.. code-block:: css

   .ww {
     display: inline-block;
   }

By using the output HTML from Budou and the CSS above, sentences
on your webpage will be rendered with legible line breaks:

.. image:: https://raw.githubusercontent.com/wiki/google/budou/images/nexus_example.jpeg


Using as a command-line app
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can process your text by running the :code:`budou` command:

.. code-block:: sh

   $ budou 渋谷のカレーを食べに行く。

The output is:

.. code-block:: html

   <span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
   <span class="ww">食べに</span><span class="ww">行く。</span></span>

You can also configure the command with optional parameters.
For example, you can change the backend segmenter to `MeCab <#mecab-segmenter>`_ and change the
class name to :code:`wordwrap` by running:

.. code-block:: sh

   $ budou 渋谷のカレーを食べに行く。 --segmenter=mecab --classname=wordwrap

The output is:

.. code-block:: html

   <span><span class="wordwrap">渋谷の</span><span class="wordwrap">カレーを</span>
   <span class="wordwrap">食べに</span><span class="wordwrap">行く。</span></span>

Run the help command :code:`budou -h` to see other available options.


Using programmatically
~~~~~~~~~~~~~~~~~~~~~~~~~

You can use the :code:`budou.parse` method in your Python scripts.

.. code-block:: python

   import budou
   results = budou.parse('渋谷のカレーを食べに行く。')
   print(results['html_code'])
   # <span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
   # <span class="ww">食べに</span><span class="ww">行く。</span></span>


You can also create a parser instance to reuse the segmenter backend with the same
configuration. If you want to integrate Budou into your web development
framework in the form of a custom filter or build process, this would be the way
to go.

.. code-block:: python

   import budou
   parser = budou.get_parser('mecab')
   results = parser.parse('渋谷のカレーを食べに行く。')
   print(results['html_code'])
   # <span><span class="ww">渋谷の</span><span class="ww">カレーを</span>
   # <span class="ww">食べに</span><span class="ww">行く。</span></span>

   for chunk in results['chunks']:
     print(chunk.word, chunk.pos)
   # 渋谷の 名詞
   # カレーを 名詞
   # 食べに 動詞
   # 行く。 動詞


(deprecated) :code:`authenticate` method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:code:`authenticate`, the method used to create a parser in previous releases,
is now deprecated. It is now a thin wrapper around the :code:`get_parser` method
that returns a parser with the
`Google Cloud Natural Language API <#google-cloud-natural-language-api-segmenter>`_
segmenter backend.
The method is still available, but it may be removed in a future release.

.. code-block:: python

   import budou
   parser = budou.authenticate('/path/to/credentials.json')

   # This is equivalent to:
   parser = budou.get_parser(
       'nlapi', credentials_path='/path/to/credentials.json')


Available segmenter backends
------------------------------

You can choose different segmenter backends depending on the needs of
your environment. Currently, the segmenters below are supported.
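If you do wire Budou into a template engine as a custom filter, the filter itself stays thin: segment the text, escape each chunk, and wrap it in non-breaking markup. The sketch below is illustrative only and not part of Budou's API: it swaps in a toy whitespace segmenter for a real backend, and the names :code:`wrap_chunks`, :code:`toy_segmenter`, and :code:`wordwrap_filter` are hypothetical.

```python
import html


def wrap_chunks(chunks, classname='ww'):
    """Wrap pre-segmented chunks in non-breaking SPAN tags, Budou-style."""
    spans = ''.join(
        '<span class="%s">%s</span>' % (classname, html.escape(chunk))
        for chunk in chunks)
    return '<span>%s</span>' % spans


def toy_segmenter(text):
    """Stand-in for a real segmenter backend: splits on whitespace."""
    return text.split()


def wordwrap_filter(text, classname='ww'):
    """A template-filter-shaped helper: segment, then wrap."""
    return wrap_chunks(toy_segmenter(text), classname)


print(wordwrap_filter('hello world'))
# <span><span class="ww">hello</span><span class="ww">world</span></span>
```

In a real setup you would replace :code:`toy_segmenter` with a call to :code:`parser.parse` and register :code:`wordwrap_filter` with your template engine.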

.. csv-table::
  :header: Name, Identifier, Supported Languages

  `Google Cloud Natural Language API <#google-cloud-natural-language-api-segmenter>`_, nlapi, "Chinese, Japanese, Korean"
  `MeCab <#mecab-segmenter>`_, mecab, "Japanese"
  `TinySegmenter <#tinysegmenter-based-segmenter>`_, tinysegmenter, "Japanese"


Specify the segmenter when you run the :code:`budou` command or load a parser.
For example, you can run the :code:`budou` command with the MeCab segmenter by
passing the :code:`--segmenter=mecab` parameter:

.. code-block:: sh

  $ budou 今日も元気です --segmenter=mecab

You can pass the :code:`segmenter` parameter when you load a parser:

.. code-block:: python

  import budou
  parser = budou.get_parser('mecab')
  parser.parse('今日も元気です')

If no segmenter is specified, the Google Cloud Natural Language API is used as
the default.


.. _nlapi-segmenter:

Google Cloud Natural Language API Segmenter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Google Cloud Natural Language API (https://cloud.google.com/natural-language/)
(NL API) analyzes input sentences using
machine learning technology. The API can extract not only syntax but also
entities included in the sentence, which can be used for better quality
segmentation (see more at `Entity mode <#entity-mode>`_). Since this is a simple
REST API, you don't need to maintain a dictionary. You can also support multiple
languages from a single source.

Supported languages
++++++++++++++++++++++

- Simplified Chinese (zh)
- Traditional Chinese (zh-Hant)
- Japanese (ja)
- Korean (ko)

For those considering using Budou for Korean sentences, please refer to
the `Korean support <#korean-support>`_ section.


Authentication
+++++++++++++++

The NL API requires authentication before use. First, create a Google Cloud Platform
project and enable the Cloud Natural Language API.
Billing also needs to be enabled
for the project. Then, download a credentials file for a service account by
accessing the `Google Cloud Console <https://console.cloud.google.com/>`_
and navigating through "API & Services" > "Credentials" > "Create credentials" >
"Service account key" > "JSON".

Budou will handle authentication once the path to the credentials file is set
in the :code:`GOOGLE_APPLICATION_CREDENTIALS` environment variable.

.. code-block:: sh

   $ export GOOGLE_APPLICATION_CREDENTIALS='/path/to/credentials.json'

You can also pass the path to the credentials file when you initialize the
parser.

.. code-block:: python

   parser = budou.get_parser(
       'nlapi', credentials_path='/path/to/credentials.json')

The NL API segmenter uses *Syntax Analysis* and incurs costs according to
monthly usage. The NL API has a free quota so you can start testing the feature
without charge. Please refer to https://cloud.google.com/natural-language/pricing
for more detailed pricing information.

Caching system
++++++++++++++++

Parsers using the NL API segmenter cache responses from the API in order to
prevent unnecessary requests to the API and to make processing faster. If you
want to bypass the cache and send a fresh request, set :code:`use_cache` to
:code:`False`.

.. code-block:: python

   parser = budou.get_parser(segmenter='nlapi', use_cache=False)
   result = parser.parse('明日は晴れるかな')

In the `Google App Engine Python 2.7 Standard Environment <https://cloud.google.com/appengine/docs/standard/python/>`_,
Budou tries to use the
`memcache <https://cloud.google.com/appengine/docs/standard/python/memcache/>`_
service to cache output efficiently across instances.
In other environments, Budou creates a cache file in the
`Python pickle <https://docs.python.org/3/library/pickle.html>`_ format in
your file system.
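As a rough mental model of the file-based fallback (a sketch under assumptions, not Budou's actual cache code), a pickle-backed memoizer keyed by a hash of the input might look like this; :code:`CACHE_PATH` and :code:`cached_parse` are hypothetical names:

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache location; Budou's real cache path differs.
CACHE_PATH = os.path.join(tempfile.gettempdir(), 'budou_cache_sketch.pickle')


def _load_cache():
    """Read the pickle file if present, else start with an empty dict."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, 'rb') as f:
            return pickle.load(f)
    return {}


def cached_parse(source, parse_func, use_cache=True):
    """Memoize parse results on disk, keyed by a hash of the input text."""
    if not use_cache:
        return parse_func(source)  # bypass the cache entirely
    key = hashlib.sha256(source.encode('utf-8')).hexdigest()
    cache = _load_cache()
    if key not in cache:
        cache[key] = parse_func(source)
        with open(CACHE_PATH, 'wb') as f:
            pickle.dump(cache, f)
    return cache[key]
```

Hashing the source text keeps the cache key stable across runs while avoiding storing long sentences as dictionary keys.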

.. _entity-mode:

Entity mode
++++++++++++++++++

The default parser only uses results from Syntactic Analysis for parsing, but you
can also utilize results from *Entity Analysis* by specifying :code:`use_entity=True`.
Entity Analysis will improve the accuracy of parsing for some phrases,
especially proper nouns, so it is recommended if your target sentences
include the names of individual people, places, organizations, and so on.

Please note that Entity Analysis incurs additional cost because it
requires additional requests to the NL API. For more details about API pricing,
please refer to https://cloud.google.com/natural-language/pricing.

.. code-block:: python

  import budou
  # Without Entity mode (default)
  result = budou.parse('六本木ヒルズでご飯を食べます。', use_entity=False)
  print(result['html_code'])
  # <span class="ww">六本木</span><span class="ww">ヒルズで</span>
  # <span class="ww">ご飯を</span><span class="ww">食べます。</span>

  # With Entity mode
  result = budou.parse('六本木ヒルズでご飯を食べます。', use_entity=True)
  print(result['html_code'])
  # <span class="ww">六本木ヒルズで</span>
  # <span class="ww">ご飯を</span><span class="ww">食べます。</span>


.. _mecab-segmenter:

MeCab Segmenter
~~~~~~~~~~~~~~~~~~~~~~~

MeCab (https://github.com/taku910/mecab) is an open source text segmentation
library for the Japanese language. Unlike the Google Cloud Natural Language API segmenter,
the MeCab segmenter does not require any billed API calls, so you can process
sentences for free and without an internet connection. You can also customize the
dictionary by building your own.

Supported languages
++++++++++++++++++++++

- Japanese

Installation
+++++++++++++++++

You need to have MeCab installed to use the MeCab segmenter in Budou.
You can install MeCab with an IPA dictionary by running

.. code-block:: sh

   $ make install-mecab

in the project's home directory after cloning this repository.


.. _tinysegmenter-based-segmenter:

TinySegmenter-based Segmenter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TinySegmenter (http://chasen.org/~taku/software/TinySegmenter/) is a compact
Japanese tokenizer originally created by Taku Kudo in 2008.
It tokenizes sentences by matching against a combination of patterns carefully
designed using machine learning. This means that **you can use this backend
without any additional setup!**

Supported languages
++++++++++++++++++++++

- Japanese


.. _korean:

Korean support
-------------------

Korean has spaces between chunks, so you can perform line breaking simply by
putting :code:`word-break: keep-all` in your CSS. We recommend that you use this
technique instead of using Budou.


Use cases
---------------

Budou is designed to be used mostly in eye-catching sentences such as titles
and headings, on the assumption that split chunks would stand out negatively
at larger font sizes.


Accessibility
-------------------

Some screen reader software packages read Budou's wrapped chunks one by one.
This may degrade the user experience for those who need audio support.
You can attach any attribute to the output chunks to enhance accessibility.
For example, you can make screen readers read undivided sentences by
combining the :code:`aria-describedby` and :code:`aria-label` attributes in the output.

.. code-block:: html

  <p id="description" aria-label="やりたいことのそばにいる">
    <span class="ww" aria-describedby="description">やりたい</span>
    <span class="ww" aria-describedby="description">ことの</span>
    <span class="ww" aria-describedby="description">そばに</span>
    <span class="ww" aria-describedby="description">いる</span>
  </p>

**This functionality is currently nonfunctional** due to the html5lib sanitizer's
behavior, which strips ARIA-related attributes from the output HTML. Progress on this
issue is tracked at https://github.com/google/budou/issues/74.


Author
----------

Shuhei Iitsuka

- Website: https://tushuhei.com
- Twitter: https://twitter.com/tushuhei


Disclaimer
-----------------

This library is authored by a Googler and copyrighted by Google, but is not an
official Google product.


License
-----------

Copyright 2018 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
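As an illustration of the markup above, a hypothetical post-processing helper (not part of Budou; :code:`wrap_with_aria` is an assumed name) could assemble the ARIA-annotated paragraph from a list of chunks:

```python
import html


def wrap_with_aria(chunks, classname='ww', description_id='description'):
    """Build a paragraph whose chunks point back, via aria-describedby,
    at an aria-label holding the full, undivided sentence."""
    sentence = html.escape(''.join(chunks))
    spans = '\n'.join(
        '  <span class="%s" aria-describedby="%s">%s</span>'
        % (classname, description_id, html.escape(chunk))
        for chunk in chunks)
    return '<p id="%s" aria-label="%s">\n%s\n</p>' % (
        description_id, sentence, spans)


print(wrap_with_aria(['やりたい', 'ことの', 'そばに', 'いる']))
```

Screen readers can then announce the :code:`aria-label` once instead of reading each chunk separately, which is the intent of the example above.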