{"id":22020639,"url":"https://github.com/uudigitalhumanitieslab/perfectextractor","last_synced_at":"2025-07-25T13:41:44.957Z","repository":{"id":36719546,"uuid":"41026113","full_name":"UUDigitalHumanitieslab/perfectextractor","owner":"UUDigitalHumanitieslab","description":"Extracting present perfects (and related forms) from parallel corpora","archived":false,"fork":false,"pushed_at":"2022-12-18T20:09:28.000Z","size":1317,"stargazers_count":7,"open_issues_count":3,"forks_count":2,"subscribers_count":7,"default_branch":"develop","last_synced_at":"2025-05-07T06:05:37.267Z","etag":null,"topics":["extraction","parallel-corpus","xpath"],"latest_commit_sha":null,"homepage":"https://time-in-translation.hum.uu.nl","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UUDigitalHumanitieslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-19T09:44:35.000Z","updated_at":"2024-12-29T19:07:08.000Z","dependencies_parsed_at":"2023-01-17T04:17:53.842Z","dependency_job_id":null,"html_url":"https://github.com/UUDigitalHumanitieslab/perfectextractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2Fperfectextractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2Fperfectextractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2Fperfectextractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2Fperfectextracto
r/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UUDigitalHumanitieslab","download_url":"https://codeload.github.com/UUDigitalHumanitieslab/perfectextractor/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252823918,"owners_count":21809713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extraction","parallel-corpus","xpath"],"created_at":"2024-11-30T06:07:24.887Z","updated_at":"2025-05-07T06:05:46.287Z","avatar_url":"https://github.com/UUDigitalHumanitieslab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PerfectExtractor\n\n![GitHub](https://img.shields.io/github/license/UUDigitalHumanitieslab/perfectextractor?style=plastic)\n![Travis (.org)](https://img.shields.io/travis/UUDigitalHumanitieslab/perfectextractor?style=plastic)\n![PyPI](https://img.shields.io/pypi/v/perfectextractor?style=plastic)\n\n*Extracting Perfects (and related forms) from parallel corpora*\n\nThis command-line application allows for extraction of Perfects (and related forms, like the Recent Past construction in French and Spanish) from part-of-speech-tagged, lemmatized and sentence-aligned parallel corpora encoded in XML.\n \n## Installation\n\nFirst, create a [virtual environment](https://docs.python.org/3/library/venv.html) and activate it:\n\n    $ python -m venv venv\n    $ source venv/bin/activate\n\nThen, install the requirements in this virtual environment via:\n\n    $ pip install -r requirements.txt\n\nFinally, create the executable `extract` via:\n\n    
$ pip install --editable .\n\n## Recognizing Perfects \n\nIn English, a *present perfect* is easily recognizable as a present form of *to have* plus a past participle, as in (1):\n\n    (1) I have seen that movie twenty times.\n\nHowever, one difficulty in finding Perfects in most languages is that there might be words between the auxiliary and the past participle, as in (2):\n\n    (2) Nobody has ever climbed that mountain.\n\nFurthermore, languages have passive forms that generally require the past participle of *to be* to be inserted, as in (3):\n\n    (3) The bill has been paid by John.\n     \nIn English, there is the additional issue of the *present perfect continuous*, which in form shares the first part of the construction with the *present perfect*, as in (4):\n\n    (4) He has been waiting here for two hours.\n    \nIn some languages (e.g. French, German, and Dutch), the Perfect can be formed with either Have or Be. \nThe past participle governs which auxiliary verb is used, as (5) and (6) show.\n\n    (5) J'ai vu quelque chose [lit. I have seen some thing]\n    (6) Elle est arrivée [lit. She is arrived]\n    \nFor French, the verbs that take Be form a closed list \n([DR and MRS P. VANDERTRAMP](https://en.wikipedia.org/wiki/Pass%C3%A9_compos%C3%A9#Auxiliary_.22.C3.8Atre.22)), \nbut for other languages, this might be a more open class.\n\nThe last common issue with finding Perfects is that in e.g. Dutch and German, the past participle might appear before the auxiliary verb in subordinate clauses, as in (7):\n\n    (7) Dat is de stad waar hij gewoond heeft. [lit. That is the city where he lived has]\n    \nThe extraction script provided here takes care of all these issues, and supports language-specific settings. \n\n### Implementation \n\nThe extraction script (`perfectextractor/apps/extractor/perfectextractor.py`) is implemented using the [lxml XML toolkit](http://lxml.de/). 
\n\nThe script looks for auxiliary verbs (using an [XPath expression](https://en.wikipedia.org/wiki/XPath)), and for each of these, \nit tries to find a past participle to the right of the auxiliary (or to its left in Dutch/German), allowing for words between the verbs, \nthough this lookup stops at the occurrence of other verbs, punctuation and coordinating conjunctions.\n\nThe script also allows for extraction of *present perfect continuous* forms. \n\nFor languages in which the Perfect can be formed with Be, the script relies on a list of verbs that use Be as auxiliary. \nThe function *get_ergative_verbs* in `perfectextractor/apps/extractor/wiktionary.py` extracts these verbs from [Wiktionary](https://en.wiktionary.org) for Dutch.\nThis function uses the [Requests: HTTP for Humans](http://docs.python-requests.org/) package.\nFor German, the list is compiled from [this list](https://deutsch.lingolia.com/en/grammar/verbs/sein-haben).\n\n## Recognizing Recent Pasts\n\nMost Romance languages share a grammaticalized construction to refer to events in the recent past, e.g. the *passé récent* in French and the *pasado reciente* in Spanish.\nIn English, typically a *present perfect* alongside the adverb *just* is used to convey this meaning, commonly referred to as *perfect of recent past* (Comrie 1985) or *hot news perfect* (McCawley 1971).\n\nThe French *passé récent* is formed with a present tense of *venir* 'come' followed by the particle *de* and an infinitive, as in (8) below.\n \n    (8) Je viens de voir Marie. [lit. I come DE see Mary] \n    \nThe Spanish *pasado reciente* is (quite similarly) formed with a present tense of *acabar* 'finish' followed by the particle *de* and an infinitive, as in (9) below.\n\n    (9) Acabo de ver a María. [lit. I finish DE see Mary]\n\nThe extraction script (`perfectextractor/apps/extractor/recentpastextractor.py`) provided here allows export of these constructions from parallel corpora.  
\n\n## SINCE + duration\n\nIn most languages, the adverb SINCE can be followed by a durational adverbial, such as *seit drei Jahren* 'since three years' in (10) below.\n\n    (10) Marie ist seit drei Jahren glücklich mit Jan zusammen.\n\nWe allow extraction of such phrases in German and Dutch.\n\n## Present Continuous\n\nThis application also allows extraction of the English *Present Continuous* form, such as *is reading* in (11) below.\n\n    (11) Mary is reading Middlemarch.\n\n## Other extractors\n\nThis application also allows extraction from parallel corpora based on part-of-speech tags or regexes. \n\n## Corpora\n\n### Dutch Parallel Corpus\n\nThe extraction was first tested with the [Dutch Parallel Corpus](http://www.kuleuven-kulak.be/DPC).\nThis corpus (which uses the [TEI format](http://www.tei-c.org/)) covers three languages: Dutch, French and English. \nThe configuration for this corpus can be found in `perfectextractor/corpora/dpc/base.cfg` and `perfectextractor/corpora/dpc/perfect.cfg`.\nExample documents from this corpus are included in the `perfectextractor/tests/data/dpc` directory.\nThe data for this corpus is **closed source**: to retrieve the corpus, you'll have to contact the authors via the cited website.\nAfter you've obtained the data, you can run the extraction script with:\n\n    extract \u003cfolder\u003e en fr nl --corpus=dpc --extractor=perfect\n\n### OPUS Corpora\n\nThe extraction has also been implemented for the open parallel corpus collection [OPUS](http://opus.nlpl.eu/), which contains most notably the [Europarl Corpus](http://opus.nlpl.eu/Europarl.php) and the [OpenSubtitles Corpus](http://opus.nlpl.eu/OpenSubtitles.php).\nThis corpus (which uses the [XCES format](http://www.tei-c.org/) for alignment) covers a wide variety of languages. 
\nThe configuration for this corpus can be found in `perfectextractor/corpora/opus/base.cfg` and `perfectextractor/corpora/opus/perfect.cfg`: implementations have been made for Dutch, English, French, German and Spanish. \nExample documents from this corpus are included in the `perfectextractor/tests/data/europarl` directory.\nThe data for this corpus is **open source**: you can download the corpus and the alignment files from the cited website.\nAfter you've obtained the data, you can run the extraction script with:\n\n    extract \u003cfolder\u003e en de es --corpus=opus --extractor=perfect\n\n### British National Corpus (BNC)\n\nThe extraction has also been implemented for the monolingual [British National Corpus](http://www.natcorp.ox.ac.uk/).\nThe data for this corpus is **open source**: you can download the corpus from the linked website.\nAfter you've obtained the data, you can run the extraction script with:\n\n    extract \u003cfolder\u003e en --corpus=bnc --extractor=perfect\n\n### Implementing your own corpus\n\nIf you want to implement the extraction for another corpus, you'll have to create: \n\n  * An implementation of the corpus in the `perfectextractor/corpora` directory (see `perfectextractor/corpora/opus` for an example).\n  * A configuration file in this directory (see `perfectextractor/corpora/opus/base.cfg` for an example).\n  * An entry in the main script (see `perfectextractor/extract.py`)\n\n## Other options to the extraction script\n\nYou can view all options of the extraction script by typing:\n\n    extract --help\n\nDo note that at this point in time, not all options are available in all corpora.\nFeel free to send a pull request once you have implemented an option, or to request one by creating an issue. 
\n\n## Other scripts\n\nThese scripts can be found in `perfectextractor/scripts`.\n\n### pick_alignments\n\nThis script filters the alignment file based on (for example) a file prefix.\nThis is helpful for large alignment files, such as those of the Europarl corpus.\nExample usage:\n\n    python pick_alignments.py \n\n### merge_results\n\nThis script merges results from various files.\nExample usage:\n\n    python merge_results.py \n\n### splitter\n\nThis script splits a large corpus into subparts and then runs the extractors.\nExample usage:\n\n    python splitter.py \n\n## Tests\n\nThe unit tests can be run using: \n\n    python -m unittest discover -b\n\nA coverage report can be generated (after installing [coverage.py](https://coverage.readthedocs.io/en/coverage-4.2/)) using:\n\n    coverage run --source . -m unittest discover -b\n    coverage html\n\n## Citing\n\nIf you have used (parts of) this project for your research, please cite this paper:\n\n[van der Klis, M., Le Bruyn, B., de Swart, H. (2017)](http://www.aclweb.org/anthology/E17-2080). Mapping the Perfect via Translation Mining. *Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers* 2017, 497-502.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuudigitalhumanitieslab%2Fperfectextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuudigitalhumanitieslab%2Fperfectextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuudigitalhumanitieslab%2Fperfectextractor/lists"}