{"id":18507888,"url":"https://github.com/educationaltestingservice/match","last_synced_at":"2025-04-09T03:31:34.961Z","repository":{"id":14008806,"uuid":"16710290","full_name":"EducationalTestingService/match","owner":"EducationalTestingService","description":"Match tokenized words and phrases within the original, untokenized, often messy, text.","archived":false,"fork":false,"pushed_at":"2023-04-11T15:48:17.000Z","size":250,"stargazers_count":19,"open_issues_count":3,"forks_count":5,"subscribers_count":19,"default_branch":"develop","last_synced_at":"2025-04-07T06:48:19.467Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"aprendegit/fork","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EducationalTestingService.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-02-10T21:39:15.000Z","updated_at":"2025-01-23T05:43:23.000Z","dependencies_parsed_at":"2024-06-21T03:51:41.195Z","dependency_job_id":"93c44f06-7f75-4088-b7c3-85ed93f0f72e","html_url":"https://github.com/EducationalTestingService/match","commit_stats":{"total_commits":101,"total_committers":8,"mean_commits":12.625,"dds":0.594059405940594,"last_synced_commit":"bea64266e4bcd091e3ce16adbb26ac657b3f5648"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EducationalTestingService%2Fmatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EducationalTestingService%2Fmatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EducationalTestingService%2Fmatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EducationalTestingService%2Fmatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EducationalTestingService","download_url":"https://codeload.github.com/EducationalTestingService/match/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247973647,"owners_count":21026707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T15:12:44.926Z","updated_at":"2025-04-09T03:31:34.715Z","avatar_url":"https://github.com/EducationalTestingService.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"match\n=====\n\n|Build Status|\n\n|Latest Conda Version|\\ |Latest PyPI Version|\\ |Python Versions|\n\nThe purpose of the module ``Match`` is to get the offsets (as well as\nthe string between those offsets, for debugging) of a cleaned-up,\ntokenized string from its original, untokenized source. “Big deal,” you\nmight say, but this is actually a pretty difficult task if the original\ntext is sufficiently messy, not to mention rife with Unicode characters.\n\nConsider some text, stored in a variable ``original_text``, like:\n\n::\n\n   I   am writing a letter !  Sometimes,I forget to put spaces (and do weird stuff with punctuation)  ?  J'aurai une pomme, s'il vous plâit !\n\nThis will/should/might be properly tokenized as:\n\n.. code:: python\n\n   [['I', 'am', 'writing', 'a', 'letter', '!'],\n    ['Sometimes', ',', 'I', 'forget', 'to', 'put', 'spaces', '-LRB-', 'and', 'do', 'weird', 'stuff', 'with', 'punctuation', '-RRB-', '?'],\n    [\"J'aurai\", 'une', 'pomme', ',', \"s'il\", 'vous', 'plâit', '!']]\n\nNow:\n\n.. code:: python\n\n   In [2]: import match\n\n   In [3]: match.match(original_text, ['-LRB-', 'and', 'do', 'weird', 'stuff', 'with', 'punctuation', '-RRB-'])\n   Out[3]: [(60, 97, '(and do weird stuff with punctuation)')]\n\n   In [4]: match.match(original_text, ['I', 'am', 'writing', 'a', 'letter', '!'])\n   Out[4]: [(0, 25, 'I   am writing a letter !')]\n\n   In [5]: match.match(original_text, [\"s'il\", 'vous', 'plâit', '!'])\n   Out[5]: [(121, 138, \"s'il vous plâit !\")]\n\nThe return type from ``match()`` is a ``list`` because it will return\n*all* occurrences of the argument, be it a ``list`` of tokens or a\nsingle ``string`` (word):\n\n.. code:: python\n\n   In [6]: match.match(original_text, \"I\")\n   Out[6]: [(0, 1, 'I'), (37, 38, 'I')]\n\nWhen passing in a single ``string``, ``match()`` is expecting that\n``string`` to be a single word or token. Thus:\n\n.. code:: python\n\n   In [7]: match.match(\"****because,the****\", \"because , the\")\n   Out[7]: []\n\nTry passing in ``\"because , the\".split(' ')`` instead, or better yet,\nthe output from a proper tokenizer.\n\nFor convenience, a function called ``match_lines()`` is provided:\n\n.. code:: python\n\n   In [8]: match.match_lines(original_text, [\n      ...: ['-LRB-', 'and', 'do', 'weird', 'stuff', 'with', 'punctuation', '-RRB-'],\n      ...: ['I', 'am', 'writing', 'a', 'letter', '!'],\n      ...: \"I\"\n      ...: ])\n   Out[8]:\n   [(0, 1, 'I'),\n    (0, 25, 'I   am writing a letter !'),\n    (37, 38, 'I'),\n    (60, 97, '(and do weird stuff with punctuation)')]\n\nThe values returned will always be sorted by their offsets.\n\nInstallation\n------------\n\n``pip install match`` or ``conda install -c ets match``\n\nRequirements\n------------\n\n-  Python \u003e= 3.8\n-  `nltk \u003chttp://www.nltk.org\u003e`__\n-  `regex \u003chttps://pypi.python.org/pypi/regex\u003e`__\n\nDocumentation\n-------------\n\n`Here! \u003cmatch\u003e`__.\n\n.. |Build Status| image:: https://github.com/EducationalTestingService/match/actions/workflows/python-test.yml/badge.svg\n   :target: https://github.com/EducationalTestingService/match/actions/workflows/python-test.yml/\n.. |Latest Conda Version| image:: https://img.shields.io/conda/v/ets/match\n   :target: https://anaconda.org/ets/match\n.. |Latest PyPI Version| image:: https://img.shields.io/pypi/v/match\n   :target: https://pypi.org/project/match/\n.. |Python Versions| image:: https://img.shields.io/pypi/pyversions/match\n   :target: https://pypi.python.org/pypi/match/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feducationaltestingservice%2Fmatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feducationaltestingservice%2Fmatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feducationaltestingservice%2Fmatch/lists"}