{"id":15520722,"url":"https://github.com/proycon/python-ucto","last_synced_at":"2026-02-02T16:14:28.486Z","repository":{"id":17260786,"uuid":"20030361","full_name":"proycon/python-ucto","owner":"proycon","description":"This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first step in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is regular-expression based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).","archived":false,"fork":false,"pushed_at":"2024-12-17T11:56:01.000Z","size":90,"stargazers_count":30,"open_issues_count":5,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-18T06:25:21.703Z","etag":null,"topics":["computational-linguistics","folia","nlp","nlp-library","python","text-processing","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Cython","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/proycon.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-05-21T17:28:45.000Z","updated_at":"2025-10-14T11:19:49.000Z","dependencies_parsed_at":"2023-02-13T00:01:02.363Z","dependency_job_id":"a21a3ed7-bfcb-424a-baea-a710954ca8f5","html_url":"https://github.com/proycon/python-ucto","commit_stats":{"total_commits":138,"total_committers":1,"mean_commits":138.0,"dds":0.0,"last_synced_commit":"01d7ece7ea9dd55552ba59251f26360faf44eaf3"},"previous_names":[],"tags_count":22,"template":false,"template_fu
ll_name":null,"purl":"pkg:github/proycon/python-ucto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fpython-ucto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fpython-ucto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fpython-ucto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fpython-ucto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/proycon","download_url":"https://codeload.github.com/proycon/python-ucto/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/proycon%2Fpython-ucto/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29015145,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T14:58:54.169Z","status":"ssl_error","status_checked_at":"2026-02-02T14:58:51.285Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-linguistics","folia","nlp","nlp-library","python","text-processing","tokenizer"],"created_at":"2024-10-02T10:29:04.441Z","updated_at":"2026-02-02T16:14:28.435Z","avatar_url":"https://github.com/proycon.png","language":"Cython","readme":".. 
image:: http://applejack.science.ru.nl/lamabadge.php/python-ucto\n   :target: http://applejack.science.ru.nl/languagemachines/\n\n.. image:: https://zenodo.org/badge/20030361.svg\n   :target: https://zenodo.org/badge/latestdoi/20030361\n\n.. image:: https://www.repostatus.org/badges/latest/active.svg\n   :alt: Project Status: Active – The project has reached a stable, usable state and is being actively developed.\n   :target: https://www.repostatus.org/#active\n\nUcto for Python\n=================\n\nThis is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto).\n\nDemo\n------------------\n\n.. image:: https://raw.githubusercontent.com/CLARIAH/wp3-demos/master/python-ucto.gif \n\nInstallation\n----------------\n\nWe recommend you use a Python virtual environment and install using ``pip``::\n\n    pip install python-ucto\n\nWhen possible on your system, this will install the binary\nPython wheels *that include ucto and all necessary dependencies* **except for**\nuctodata. To download and install the data (in ``~/.config/ucto``) you then only need to\nrun the following once::\n\n    python -c \"import ucto; ucto.installdata()\"\n\nIf you want language detection support, ensure you have the ``libexttextcat``\npackage (if provided by your distribution) installed prior to executing the\nabove command.\n\nIf the binary wheels are not available for your system, you will need to first\ninstall `Ucto \u003chttps://github.com/LanguageMachines/ucto\u003e`_ yourself and then\nrun ``pip install python-ucto`` to install this Python binding; it will then be\ncompiled from source. 
The following instructions apply in that case:\n\nOn Arch Linux, you can alternatively use the `AUR package \u003chttps://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=python-ucto-git\u003e`_ .\n\nOn macOS, use `homebrew \u003chttps://brew.sh/\u003e`_ to install `Ucto \u003chttps://languagemachines.github.io/ucto\u003e`_::\n\n    brew tap fbkarsdorp/homebrew-lamachine\n    brew install ucto\n\nOn Alpine Linux, run: ``apk add cython ucto ucto-dev``\n\nWindows is not supported natively, but you should be able to use the ucto Python binding through WSL or in a Docker container (see below).\n\nDocker/OCI Containers\n~~~~~~~~~~~~~~~~~~~~~~~\n\nA Docker/OCI container image is available containing Python, ucto, and python-ucto::\n\n    docker pull proycon/python-ucto\n    docker run -t -i proycon/python-ucto\n\nYou can also build the container from scratch from this repository with the included ``Dockerfile``.\n\nUsage\n---------------------\n\nImport and instantiate the ``Tokenizer`` class with a configuration file.\n\n.. code:: python\n\n    import ucto\n    configurationfile = \"tokconfig-eng\"\n    tokenizer = ucto.Tokenizer(configurationfile)\n\nThe configuration files supplied with ucto are named ``tokconfig-xxx``, where\n``xxx`` corresponds to a three-letter ISO 639-3 language code. There is also a\n``tokconfig-generic`` configuration that has no language-specific rules. Alternatively,\nyou can make and supply your own configuration file. Note that for older\nversions of ucto you may need to provide the absolute path, but the latest\nversions will find the configurations supplied with ucto automatically. 
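\n\nGiven this naming scheme, a configuration name can also be derived programmatically from a language code. A trivial sketch (the ``config_name`` helper below is hypothetical, not part of python-ucto):\n\n.. code:: python\n\n    def config_name(iso639_3=None):\n        \"\"\"Derive a ucto configuration name from a three-letter ISO 639-3 code,\n        falling back to the language-independent generic configuration.\"\"\"\n        return f\"tokconfig-{iso639_3}\" if iso639_3 else \"tokconfig-generic\"\n\n    print(config_name(\"nld\"))  # tokconfig-nld\n    print(config_name())       # tokconfig-generic\n\n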
See\n`here \u003chttps://github.com/LanguageMachines/uctodata/tree/master/config\u003e`_ for a\nlist of available configurations in the latest version.\n\nThe constructor for the ``Tokenizer`` class takes the following keyword\narguments:\n\n * ``lowercase`` (defaults to ``False``) -- Lowercase all text\n * ``uppercase`` (defaults to ``False``) -- Uppercase all text\n * ``sentenceperlineinput`` (defaults to ``False``) -- Set this to ``True`` if each\n   sentence in your input is on one line already and you do not require further\n   sentence boundary detection from ucto.\n * ``sentenceperlineoutput`` (defaults to ``False``) -- Set this if you want\n   each sentence to be output on one line. This has little effect within the\n   context of Python.\n * ``paragraphdetection`` (defaults to ``True``) -- Do paragraph detection.\n   Paragraphs are simply delimited by an empty line.\n * ``quotedetection`` (defaults to ``False``) -- Set this if you want to enable\n   the experimental quote detection, which detects quoted text (enclosed within some\n   sort of single or double quotes).\n * ``debug`` (defaults to ``False``) -- Enable verbose debug output\n\nText is passed to the tokeniser using the ``process()`` method. This method\nreturns the number of tokens rather than the tokens themselves, and it may be called\nmultiple times in sequence. The tokens\nthemselves will be buffered in the ``Tokenizer`` instance and can be\nobtained by iterating over it, after which the buffer will be cleared:\n\n.. 
code:: python\n\n    # pass the text (a str); process() may be called multiple times\n    tokenizer.process(text)\n\n    # read the tokenised data\n    for token in tokenizer:\n        # token is an instance of ucto.Token, serialise to string using str()\n        print(str(token))\n\n        # tokens remember whether they are followed by a space\n        if token.isendofsentence():\n            print()\n        elif not token.nospace():\n            print(\" \", end=\"\")\n\nThe ``process()`` method takes a single string (``str``) as its parameter. The string may contain newlines, and newlines\nare not necessarily sentence boundaries unless you instantiated the tokenizer with ``sentenceperlineinput=True``.\n\nEach token is an instance of ``ucto.Token``. It can be serialised to string\nusing ``str()`` as shown in the example above.\n\nThe following methods are available on ``ucto.Token`` instances:\n\n* ``isendofsentence()`` -- Returns a boolean indicating whether this is the last token of a sentence.\n* ``nospace()`` -- Returns a boolean; if ``True``, there is no space following this token in the original input text.\n* ``isnewparagraph()`` -- Returns ``True`` if this token is the start of a new paragraph.\n* ``isbeginofquote()`` -- Returns ``True`` if this token opens a quote.\n* ``isendofquote()`` -- Returns ``True`` if this token closes a quote.\n* ``tokentype`` -- This is an attribute, not a method. It contains the type or class of the token (e.g. a string like WORD, ABBREVIATION, PUNCTUATION, URL, EMAIL, SMILEY, etc.)\n\nIn addition to the low-level ``process()`` method, the tokenizer can also read\nan input file and produce an output file, in the same fashion as ucto itself\ndoes when invoked from the command line. This is achieved using the\n``tokenize(inputfilename, outputfilename)`` method:\n\n.. code:: python\n\n    tokenizer.tokenize(\"input.txt\", \"output.txt\")\n\nInput and output files may\nbe either plain text, or in the `FoLiA XML format \u003chttps://proycon.github.io/folia\u003e`_.  
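\n\nThe iteration pattern shown earlier can be used to rebuild sentences from the token stream. A minimal sketch using a stand-in token class (``MockToken`` is illustrative only, mimicking the ``ucto.Token`` methods listed above, so it runs without ucto installed):\n\n.. code:: python\n\n    class MockToken:\n        \"\"\"Stand-in mimicking the ucto.Token interface described above.\"\"\"\n\n        def __init__(self, text, endofsentence=False, nospace=False):\n            self.text = text\n            self._eos = endofsentence\n            self._nospace = nospace\n\n        def __str__(self):\n            return self.text\n\n        def isendofsentence(self):\n            return self._eos\n\n        def nospace(self):\n            return self._nospace\n\n    def rebuild_sentences(tokens):\n        \"\"\"Group a token stream into sentence strings, honouring nospace().\"\"\"\n        sentences, current = [], \"\"\n        for token in tokens:\n            current += str(token)\n            if token.isendofsentence():\n                sentences.append(current)\n                current = \"\"\n            elif not token.nospace():\n                current += \" \"\n        if current:\n            sentences.append(current.strip())\n        return sentences\n\n    tokens = [MockToken(\"Hello\", nospace=True), MockToken(\",\"),\n              MockToken(\"world\", nospace=True), MockToken(\".\", endofsentence=True)]\n    print(rebuild_sentences(tokens))  # ['Hello, world.']\n\n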
Upon instantiation of the ``Tokenizer`` class, there\nare two keyword arguments to indicate these formats:\n\n* ``xmlinput`` or ``foliainput`` -- A boolean that indicates whether the input is FoLiA XML (``True``) or plain text (``False``). Defaults to ``False``.\n* ``xmloutput`` or ``foliaoutput`` -- A boolean that indicates whether the output is FoLiA XML (``True``) or plain text (``False``). Defaults to ``False``. If this option is enabled, you can set an additional keyword parameter ``docid`` (string) to set the document ID.\n\nAn example for plain text input and FoLiA output:\n\n.. code:: python\n\n    tokenizer = ucto.Tokenizer(configurationfile, foliaoutput=True)\n    tokenizer.tokenize(\"input.txt\", \"ucto_output.folia.xml\")\n\nFoLiA documents retain all the information ucto can output, unlike the plain\ntext representation. These documents can be read and manipulated from Python using the\n`FoLiaPy library \u003chttps://github.com/proycon/foliapy\u003e`_. FoLiA is especially recommended if\nyou intend to further enrich the document with linguistic annotation. A small\nexample of reading ucto's FoLiA output using this library follows, but consult the `documentation \u003chttps://folia.readthedocs.io/en/latest/\u003e`_ for more:\n\n.. 
code:: python\n\n    import folia.main as folia\n\n    doc = folia.Document(file=\"ucto_output.folia.xml\")\n    for paragraph in doc.paragraphs():\n        for sentence in paragraph.sentences():\n            for word in sentence.words():\n                print(word.text(), end=\"\")\n                if word.space:\n                    print(\" \", end=\"\")\n            print()\n        print()\n\nTest and Example\n~~~~~~~~~~~~~~~~~~~\n\nRun and inspect ``example.py``.\n","funding_links":[],"categories":["Python","Packages"],"sub_categories":["General-Purpose Machine Learning","Libraries"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Fpython-ucto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fproycon%2Fpython-ucto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fproycon%2Fpython-ucto/lists"}