{"id":25180510,"url":"https://github.com/rsdc2/pyepidoc","last_synced_at":"2025-10-25T21:45:08.414Z","repository":{"id":153356204,"uuid":"617596832","full_name":"rsdc2/PyEpiDoc","owner":"rsdc2","description":"Python library for handling TEI EpiDoc files","archived":false,"fork":false,"pushed_at":"2025-09-30T20:23:23.000Z","size":1826,"stargazers_count":3,"open_issues_count":5,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-09T09:14:04.589Z","etag":null,"topics":["epidoc","epidoc-xml-markup","python","tei-xml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rsdc2.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSES/LICENSE-lxml","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-03-22T18:03:20.000Z","updated_at":"2025-10-01T06:30:54.000Z","dependencies_parsed_at":"2023-09-23T13:35:47.786Z","dependency_job_id":"d2442733-fb67-4a57-a18c-0a9dc0d20da9","html_url":"https://github.com/rsdc2/PyEpiDoc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rsdc2/PyEpiDoc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsdc2%2FPyEpiDoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsdc2%2FPyEpiDoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsdc2%2FPyEpiDoc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsd
c2%2FPyEpiDoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rsdc2","download_url":"https://codeload.github.com/rsdc2/PyEpiDoc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rsdc2%2FPyEpiDoc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001114,"owners_count":26083021,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["epidoc","epidoc-xml-markup","python","tei-xml"],"created_at":"2025-02-09T16:18:38.504Z","updated_at":"2025-10-25T21:45:08.401Z","avatar_url":"https://github.com/rsdc2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv\u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/ISicily.jpg?raw=true\" alt=\"isicily logo\" height=\"80\" \u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/oxford.png?raw=true\" alt=\"oxford logo\" height=\"80\"  style=\"padding-top: 80px\" \u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/EU_ERC.jpg?raw=true\" alt=\"erc logo\" height=\"80\" \u003e\n\u003c/div\u003e\n\u003cbr clear=\"all\"\u003e\n\n# PyEpiDoc\n\n\nPyEpiDoc is a Python (\u003e=3.10) library for parsing and interacting with [TEI](https://tei-c.org/) 
XML\n[EpiDoc](https://epidoc.stoa.org/) files. It has been tested on Python on Linux (Ubuntu) and Windows.\n\nPyEpiDoc has been designed for use, in the first instance, \nwith the [I.Sicily](http://sicily.classics.ox.ac.uk/) corpus.\nFor information on the encoding of I.Sicily texts in TEI EpiDoc, see\nthe [I.Sicily GitHub wiki](https://github.com/ISicily/ISicily/wiki).\n\n**NB: PyEpiDoc is currently under active development.**\n\n\n## Install (no dev dependencies)\n\n### Locally\n\nTo install PyEpiDoc along with its dependencies (```lxml```):\n\n1. Clone or download the repository.\n\n2. Navigate into the cloned / downloaded repository.\n\n3. From within the cloned repository, install at the ```user``` level with:\n\n```bash\npip install . --user\n```\n\n### In a virtual environment\n\nIf you are using a ```venv``` virtual environment:\n\n1. Make sure the virtual environment has been activated, e.g. on Linux:\n\n```bash\nsource env/bin/activate\n```\n\n2. Install with ```pip```:\n\n```bash\npip install .\n```\n\n## Uninstall\n```bash\npip uninstall pyepidoc\n```\n\n## Install for development\n\nTo install PyEpiDoc along with its dependencies (```lxml```) and development dependencies (```pytest```, ```mypy```), e.g. in a virtual environment:\n\n1. Clone or download the repository;\n\n2. Navigate into the cloned / downloaded repository.\n\n3. From within the cloned repository, install with:\n\n    ```\n    pip install .[dev]\n    ```\n\n## Running the Jupyter Notebooks\n\nJupyter notebooks are included in the repository under `notebooks/` to provide example usage:\n\n- `getting_started.ipynb`\n- `abbreviations.ipynb`\n- `setting_ids.ipynb`\n\nFor instructions on installing Jupyter notebook, see https://docs.jupyter.org/en/latest/install/notebook-classic.html. 
Alternatively, see also https://jupyter.org/install.\n\nOnce Jupyter notebook is installed, to run `getting_started.ipynb`, type:\n\n```\njupyter notebook getting_started.ipynb\n```\n\n## Example usage\n\nThe examples below assume a tokenized EpiDoc file ```ISic000001_tokenized.xml``` in an ```examples/``` folder in the current working directory.\n\n### Load the EpiDoc file\n\n```python\nfrom pyepidoc import EpiDoc\n\ndoc = EpiDoc(\"examples/ISic000001_tokenized.xml\")\n```\n\n\n### Print the text of the edition\n\n```python\nprint(doc.edition_text)\n```\n\n### Print all tokens in an edition (e.g. ```\u003cw\u003e```, ```\u003cname\u003e``` etc.)\n\n```python\ntokens = doc.tokens\nprint(' '.join([str(token) for token in tokens]))\n```\n\n### Produce a tokenized version of a given EpiDoc file\n\nGiven an untokenized EpiDoc file ```ISic000032_untokenized.xml``` in an ```examples/``` folder in the current working directory:\n\n```python\nfrom pyepidoc import EpiDoc\n\n# Load the EpiDoc file\ndoc = EpiDoc(\"examples/ISic000032_untokenized.xml\")\n\n# Tokenize the edition with default settings\ndoc.tokenize()\n\n# Print list of tokens\nprint('Tokens: ', doc.tokens_list_str)\n\n# Save the results to a new XML file\ndoc.to_xml_file(\"examples/ISic000032_tokenized.xml\")\n```\n\n### Corpus level analysis\n\nGiven a corpus of EpiDoc XML files in a folder ```corpus/``` in the current working directory, the following code filters the corpus and writes a text file containing the ids of all Latin funerary inscriptions from Catania / Catina:\n\n```python\nfrom pyepidoc import EpiDocCorpus\nfrom pyepidoc.epidoc.enums import TextClass\nfrom pyepidoc.file.funcs import str_to_file\n\n# Load the corpus\ncorpus = EpiDocCorpus('corpus')\n\n# Filter the corpus to find the funerary inscriptions\nfunerary_corpus = corpus.filter_by_textclass([TextClass.Funerary.value])\n\n# Within the funerary corpus, find all the Latin inscriptions from Catania / Catina:\ncatina_funerary_corpus = (\n    funerary_corpus\n        
.filter_by_orig_place(['Catina'])\n        .filter_by_languages(['la'])\n)\n\n# Output the ids of this set of documents to a file ```catina_funerary_ids_la.txt``` \n# in the current working directory.\ncatina_funerary_ids = '\\n'.join(catina_funerary_corpus.ids)\nstr_to_file(catina_funerary_ids, 'catina_funerary_ids_la.txt')\n\n```\n\n### Validate EpiDoc XML\n\nThere are two ways to validate an EpiDoc XML file: \n\n1. Validate on load, e.g.:\n\n```python\nfrom pyepidoc import EpiDoc\n\ndoc = EpiDoc('examples/ISic000001_tokenized.xml', validate_on_load=True)\n```\n\n- This validates according to the RelaxNG schema `tei-epidoc.rng` \nin the `pyepidoc` root directory.\n- By default `validate_on_load` is set to `False`.\n\n2. Validate against a custom RelaxNG schema:\n\n```python\nfrom pyepidoc import EpiDoc\ndoc = EpiDoc('examples/ISic000001_tokenized.xml')\n\ndoc.validate_by_relaxng(fp='path/to/relaxngschema.rng')\n```\n\n# Code organisation\n\n## Package structure\n\nThe PyEpiDoc package has four subpackages:\n\n- `xml` containing modules with base classes for XML handling;\n- `epidoc` containing modules for EpiDoc-specific XML handling, e.g. ```\u003cab\u003e```, ```\u003cw\u003e``` etc.;\n- `analysis` containing modules for analysing EpiDoc files and corpora, e.g. 
of abbreviations;\n- `shared` containing modules and classes for use generally in the project.\n\nProbably the most useful subpackage in the first instance will be `epidoc`, and in particular \n`epidoc.py` and `corpus.py`, which, via the classes `EpiDoc` and `EpiDocCorpus`, provide\nAPIs to EpiDoc files and corpora respectively.\n\n\n## Modifying tokenizer behaviour\n\nThe treatment of a given token by the tokenizer may be affected by one or more of the following:\n\n- Its status in ```pyepidoc/epidoc/epidoctypes.py```\n- Its presence in ```SubsumableRels``` in ```pyepidoc/constants.py```\n\nA token that is not separated by whitespace from a neighbouring ```\u003cw\u003e``` token will be subsumed into that token if:\n- it is listed as a ```dep``` of e.g. ```\u003cw\u003e``` in ```SubsumableRels```\n\nA token will be subsumed into a neighbouring ```\u003cw\u003e``` token regardless of the presence of intervening whitespace if:\n- it is listed as a ```dep``` of e.g. ```\u003cw\u003e``` in ```SubsumableRels``` and\n- it is a member of ```AlwaysSubsumableType``` in ```epidoctypes.py```\n\n# Code integrity\n\n## Run the tests\n\nWith ```pytest``` installed (the dev installation will do this for you):\n\n\n1. 
To run all the tests, in the project root directory, type:\n\n    ```\n    pytest\n    ```\n\nIf ```pytest``` is not available to the currently active version of Python, \nit may be necessary to specify the Python executable with ```pytest``` \ninstalled, e.g.:\n\n    ```\n    python3.10 -m pytest\n    ```\n\n## Check the types\n\nTo check the integrity of the type annotations, \nwith ```mypy``` installed (the dev installation will\ndo this for you):\n\n```\nmypy src/pyepidoc\n```\n\nIf ```mypy``` is not available to the currently active version of Python, \nit may be necessary to specify the Python executable with ```mypy``` \ninstalled, e.g.:\n\n    ```\n    python3.10 -m mypy src/pyepidoc\n    ```\n\n## Features to be included in future\n\n### XML comments\n\nXML comments should now be handled correctly, and reproduced in new files.\n\n## Dependencies\n\nPyEpiDoc depends on [lxml](https://lxml.de/) ([BSD 3](https://github.com/lxml/lxml/blob/master/LICENSE.txt)). \nDevelopment dependencies are [mypy](https://mypy.readthedocs.io/en/stable/) ([MIT](https://github.com/python/mypy/blob/master/LICENSE)), [pytest](https://docs.pytest.org/en/7.4.x/) ([MIT](https://github.com/pytest-dev/pytest/blob/main/LICENSE)) and [pytest-cov](https://pytest-cov.readthedocs.io/en/latest/) ([MIT](https://github.com/pytest-dev/pytest-cov?tab=MIT-1-ov-file#readme)). Licenses for these dependencies are included in the `LICENSES` directory.\n\n\n# Licencing\n- The software for PyEpiDoc ([src/pyepidoc](src/pyepidoc) ) was written by Robert Crellin as part of the Crossreads project at the Faculty of Classics, University of Oxford, and is licensed under MIT (see [LICENSES/LICENSE-pyepidoc](LICENSES/LICENSE-pyepidoc)). 
\n\n- Example and test ```.xml``` files, contained in the ```examples/```, ```example_corpus/``` and ```tests/``` subfolders are either directly from, or derived from, the [I.Sicily corpus](https://github.com/ISicily/ISicily), which are made available under the [CC-BY-4.0 licence](https://creativecommons.org/licenses/by/4.0/) (see [LICENSES/LICENSE-texts](LICENSES/LICENSE-texts) and [https://github.com/ISicily/ISicily/blob/master/licence.txt](https://github.com/ISicily/ISicily/blob/master/licence.txt)).\n\n- The [TEI EpiDoc schema](src/pyepidoc_data/schemas/tei-epidoc.rng) is licensed under the GNU General Public license (see the license on the [EpiDoc repository](https://github.com/EpiDoc/Source/blob/main/schema/LICENSE.txt)) (see [LICENSES/LICENSE-EpiDoc-schema](LICENSES/LICENSE-EpiDoc-schema) and [LICENSES/gpl-3.0.txt](LICENSES/gpl-3.0.txt)).\n\n- The repository as a whole is licensed under the [GNU GPL v 3 license](LICENSES/gpl-3.0.txt). My understanding is that this license is one-way compatible with the CC-BY-4.0 licence, MIT and BSD-3 licenses, such that it is possible for the requirements of those licenses to be fulfilled under GPL (see [https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-way-compatible-with-gplv3/](https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-way-compatible-with-gplv3/)).\n\n\n## Funding\n\nThis project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 885040, “Crossreads”).\n\n\u003cdiv\u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/ISicily.jpg?raw=true\" alt=\"isicily logo\" height=\"80\" \u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/oxford.png?raw=true\" alt=\"oxford logo\" height=\"80\"  style=\"padding-top: 80px\" \u003e\n  \u003cimg align=\"left\" valign=\"center\" src=\"assets/EU_ERC.jpg?raw=true\" alt=\"erc logo\" height=\"80\" 
\u003e\n\u003c/div\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsdc2%2Fpyepidoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frsdc2%2Fpyepidoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frsdc2%2Fpyepidoc/lists"}