{"id":16753642,"url":"https://github.com/rillian/2019-gsoc","last_synced_at":"2026-03-18T22:02:05.991Z","repository":{"id":137154889,"uuid":"195283488","full_name":"rillian/2019-gsoc","owner":"rillian","description":"Work log from a CDLI summer project","archived":false,"fork":false,"pushed_at":"2019-09-15T18:35:12.000Z","size":4580,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-11T16:51:14.057Z","etag":null,"topics":["atf","cuneiform","tei-xml","text-processing"],"latest_commit_sha":null,"homepage":"https://cdli.thaumas.net/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rillian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-04T18:01:43.000Z","updated_at":"2019-09-15T18:35:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"a30fb8b4-b99f-4d60-8778-9007d612faad","html_url":"https://github.com/rillian/2019-gsoc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rillian/2019-gsoc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rillian%2F2019-gsoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rillian%2F2019-gsoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rillian%2F2019-gsoc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rillian%2F2019-gsoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rillian","
download_url":"https://codeload.github.com/rillian/2019-gsoc/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rillian%2F2019-gsoc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29157508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T07:18:23.844Z","status":"ssl_error","status_checked_at":"2026-02-06T07:13:32.659Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atf","cuneiform","tei-xml","text-processing"],"created_at":"2024-10-13T02:50:44.506Z","updated_at":"2026-02-06T10:08:57.120Z","avatar_url":"https://github.com/rillian.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# 2019 Google Summer of Code Project\n\nFrom June through August of 2019 I wrote a\n[Text Encoding Initiative](https://tei-c.org/)\nexporter for the [Cuneiform Digital Library Initiative](https://cdli.ucla.edu),\nto make data from cuneiform tablets and other inscriptions more accessible,\nin particular to the [Scaife](https://scaife-viewer.org) reading environment.\nThe work was funded as a\n[Google Summer of Code Project](https://summerofcode.withgoogle.com/projects/#5983146665836544).\n\nThis is a summary of what I accomplished.\n\n## Demonstration\n\nThe project itself didn't have a public staging 
server, so I set\nup a temporary one on my own domain.\n\n - Visit the test server at https://cdli.thaumas.net/scaife/\n   (If necessary, click the gear and enable the 'CDLI Reader'\n   and 'Suggested Documents' components.)\n - Click on one of the suggested documents.\n - Click **Translation** to show/hide the parallel translation.\n\n## Milestones\n\n- [x] Convert a document from ATF and display it in Scaife.\n- [x] Publish a repo with a subset of convertible records.\n- [x] Set up automated export from CDLI data to a CTS repo.\n- [x] Demonstrate a scaife instance running somewhere.\n\n# Code\n\nThe following repositories are new code I wrote as part of the project:\n\n - https://github.com/cdli-gh/atf2tei (document converter)\n - https://github.com/cdli-gh/cdli-cts (tei export target repo)\n - https://github.com/cdli-gh/cdli-cts-server (capitains server for cdli-cts)\n - https://github.com/cdli-gh/cdli-search (Catalogue search experiment)\n\nI made contributions to two more repositories which are important\ncomponents of the project:\n\n - https://github.com/cdli-gh/scaife (fork of the reading environment)\n - https://github.com/oracc/pyoracc (parser atf2tei is using)\n\nSee the [cdli branch](https://github.com/cdli-gh/scaife/commits/cdli)\nfor the changes I made for the demo running above.\nChanges to the upstream scaife repo were not accepted, so this\nrepo reflects our project's customized version.\n\nSee below for my contributions to the pyoracc parser.\n\n## Contributions to upstream projects\n\n### Pull requests\n\n - https://github.com/Capitains/MyCapytain/pull/192 (merged)\n - https://github.com/Capitains/flask-capitains-nemo/pull/125 (merged)\n - https://github.com/Capitains/HookTest/pull/146 (merged)\n - https://github.com/Capitains/HookTest/pull/144 (merged)\n - https://github.com/Capitains/HookTest/pull/143 (merged)\n - https://github.com/oracc/pyoracc/pull/85 (merged)\n - https://github.com/oracc/pyoracc/pull/84 (merged)\n - 
https://github.com/oracc/pyoracc/pull/81 (merged)\n - https://github.com/oracc/pyoracc/pull/80 (merged)\n - https://github.com/oracc/pyoracc/pull/79 (merged)\n - https://github.com/oracc/pyoracc/pull/77 (merged)\n - https://github.com/oracc/pyoracc/pull/76 (merged)\n - https://github.com/cdli-gh/pyoracc/pull/42 (merged)\n - https://github.com/cdli-gh/pyoracc/pull/41 (merged)\n - https://github.com/pallets/jinja/pull/1030 (merged)\n - https://github.com/pytest-dev/pytest/pull/5416 (merged)\n - https://github.com/scaife-viewer/readhomer/pull/30\n - https://github.com/scaife-viewer/readhomer/pull/31\n - https://github.com/scaife-viewer/readhomer/pull/33\n - https://github.com/scaife-viewer/scaife-basic/pull/6\n - https://github.com/scaife-viewer/scaife-basic/pull/5\n - https://github.com/vim/vim/pull/4619 (merged)\n - https://gitlab.com/cdli/framework/merge_requests/11 (merged)\n\n### Issues\n\n - https://github.com/Capitains/Nautilus/issues/85\n - https://github.com/Capitains/HookTest/issues/145\n - https://github.com/scaife-viewer/scaife-viewer/issues/370\n - https://github.com/scaife-viewer/readhomer/issues/34\n - https://github.com/oracc/pyoracc/issues/78\n - https://github.com/oracc/pyoracc/issues/82\n - https://github.com/oracc/pyoracc/issues/83\n\n### Data\n\n - [Reported](https://github.com/cdli-gh/data/issues?q=is:issue+label:%22atf+syntax%22)\n   various atf syntax inconsistencies I found to @epp who\n   corrected the master database.\n\n## Future work\n\nContinuation points for the project, which I didn't have\ntime to pursue. Hopefully these can be developed over time.\n\n### Improve ATF conversion\n\nATF line markup isn't fully converted. The export xml files\nshould use tei markup to represent damage, restorations,\nsmallcap logograms, and superscript determinatives.\nThis would need to be supported both in `atf2tei` and in the\ntei parser in Scaife. 
Greek and Latin layout works well enough\nwith plain unicode text, but cuneiform transliteration requires\nextra typographic features.\n\nThere are also some unhandled annotations, like comments\nand cross-references, which should be supported.\n\nExport data should maintain correctness according to the\n[HookTest](https://github.com/Capitains/HookTest) suite.\n\n### Add catalogue metadata\n\nIn addition to the ATF-format transcription data, CDLI publishes\na csv-format catalog of each record. The conversion tool should\nread this and represent relevant fields in the teiHeader, so\nthe xml documents are a more complete representation of the\ntexts. At a minimum there should be publication references\nand urls for the hand-drawn copies and photographs of the\nsource object. This will provide viewer software with everything\nit needs to present the same data as the main CDLI website.\n\nI wrote a [python wrapper](https://github.com/cdli-gh/cdli-cts/blob/1813415/update/cdli.py)\nfor the catalog metadata. This should be packaged so it can be\nshared between the various applications.\n\nThere are other data sources which can also be added,\neither to the xml or the viewer. There are lemma and\npart of speech annotation data for many tablets from\nmtaac, oracc, and other projects.\n\n### Develop a Scaife parallel reader\n\nScaife right now expects text and translation as separate, long\ndocuments. That makes some sense given the history of the scholarship\nthey're supporting. We have short documents where photo, copy,\ntransliteration, normalization and translation are usually thought\nof line-by-line together. 
Ideally all of these could be shown/hidden\nindependently, and transition between parallel and interlinear\npresentations depending on screen size.\n\nI would like to see my proof-of-concept *CDLI Reader* component\ndeveloped into a proper modular set of files which could be easily\nadded to any Scaife instance to support cuneiform documents.\n\n### Develop full-text search\n\nI wrote a quick script to upload the catalog metadata to\nElasticsearch where it could be searched as full text.\nIt didn't do better than the general search on the current\nCDLI website, but with some tuning it should be possible\nto improve things.\n\nFor example, searching for a tablet reference like 'K 162'\nshould find P345482, the primary example of the Akkadian\n*Descent of Ishtar* text, without the user having to know\nto search for 'K 00162' in the accession number field.\n\nIf the atf data is also uploaded and indexed appropriately,\nthe service could provide easy programmatic access to the\nwhole corpus from a very small codebase, loaded directly\nfrom the published data set.\n\n## Open problems\n\n### Find a js library to sanitize html code.\n\nTo get good typography for transliteration lines we need markup for\ndeterminatives, logograms and damage. Those should be represented\nby tei elements in the document served by CTS and converted to html\nelements like `\u003csup\u003e` and `\u003cspan\u003e` with custom classes for display.\nThat's not too hard, but to protect against cross-site scripting\nthe entire xml tree must be checked and cleaned of elements we don't\nwant, which isn't something we should write ourselves.\n\nSuggestions welcome. 
Scaife avoids this issue by serving plain unicode\ntext, which works ok for Greek and Latin, but not for us.\n\n### pyoracc doesn't handle enough of the CDLI atf files.\n\nThe parser is strict, but what's in the library hasn't been carefully\nvalidated, so it rejects many entries.\n\nI did a lot of cleanup work on the library, but it wasn't\npossible to get it handling the whole CDLI corpus within the term.\nOver the long term, syntax errors with the ATF in the database\nshould be corrected, ingest should do more validation to reduce\nnew errors, and pyoracc should be extended to support common features\nof the CDLI corpus.\n\nFor future work, I'd also want to revisit my ad-hoc, line-based parser\nand see if that can get more documents available. ATF is a simple\nformat and a permissive parser might work better for an application\nlike this where we're just trying to present the corpus as it is.\n\n### Localization.\n\nAn Arabic interface translation is something we'd like to do for the\nreader. Scaife-viewer is using django to do this; we can try one of the\nvue.js packages on the readhomer re-write.\n\n### CTS GetCapabilities doesn't scale.\n\nReturning the whole CDLI corpus in a single query is too much data.\nCapitains is also quite slow indexing a large corpus.\nI opened an issue with scaife-viewer to figure out a shared way to\naddress this.\n[DTS](https://distributed-text-services.github.io/specifications/) (ld-json)\nor ATLAS (graphql) are options.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frillian%2F2019-gsoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frillian%2F2019-gsoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frillian%2F2019-gsoc/lists"}