{"id":24808680,"url":"https://github.com/veldhub/veld_chain__automatic_tei-ification_of_gutenberg","last_synced_at":"2026-02-06T08:38:32.314Z","repository":{"id":269398597,"uuid":"906357362","full_name":"veldhub/veld_chain__automatic_tei-ification_of_gutenberg","owner":"veldhub","description":"Chain velds encapsulating automatic tei conversion on gutenberg data:","archived":false,"fork":false,"pushed_at":"2025-03-02T18:08:09.000Z","size":76,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T11:15:00.345Z","etag":null,"topics":["data-science","nlp","project-gutenberg","rdf","tei","tei-xml","xml"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/veldhub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-20T18:06:18.000Z","updated_at":"2025-03-02T18:08:12.000Z","dependencies_parsed_at":"2024-12-23T09:40:59.617Z","dependency_job_id":"2847bb69-57ff-4795-bb2b-eb21f50c7821","html_url":"https://github.com/veldhub/veld_chain__automatic_tei-ification_of_gutenberg","commit_stats":null,"previous_names":["veldhub/veld_chain__automatic_tei-ification_of_gutenberg"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/veldhub/veld_chain__automatic_tei-ification_of_gutenberg","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veldh
ub%2Fveld_chain__automatic_tei-ification_of_gutenberg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/veldhub","download_url":"https://codeload.github.com/veldhub/veld_chain__automatic_tei-ification_of_gutenberg/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268010095,"owners_count":24180459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-31T02:00:08.723Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","nlp","project-gutenberg","rdf","tei","tei-xml","xml"],"created_at":"2025-01-30T10:18:31.123Z","updated_at":"2026-02-06T08:38:32.248Z","avatar_url":"https://github.com/veldhub.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# ![veld chain](https://raw.githubusercontent.com/veldhub/.github/refs/heads/main/images/symbol_V_letter.png) veld_chain__automatic_tei-ification_of_gutenberg \n\nThis repo contains [chain 
velds](https://zenodo.org/records/13322913) encapsulating the entire process \nof automatic TEI conversion of Gutenberg books.\n\nThe individual processing workflows are:\n- download of the entire [project gutenberg](https://www.gutenberg.org/) metadata\n- ingestion into a local triplestore for complex sparql queries\n- query and download of all German books from gutenberg that have a txt representation but no TEI\n  one\n- running a veldified version of [teitok tools](https://github.com/ufal/teitok-tools) + \n  [udpipe](https://lindat.mff.cuni.cz/services/udpipe/) to automatically generate TEI files of them.\n\n## requirements\n\n- git\n- docker compose (note: older docker compose versions require running `docker-compose` instead of \n  `docker compose`)\n\nClone this repo with all its submodules:\n```\ngit clone --recurse-submodules https://github.com/veldhub/veld_chain__automatic_tei-ification_of_gutenberg.git\n```\n\n## how to reproduce\n\nThere are two ways to reproduce all chains: \n- individually: go through each of the following steps in sequence. See \n[individual chains](#individual-chains)\n- multichain: aggregates all individual chains into one multichain. See [multichain](#multichain)\n\n### individual chains\n\nAll details of the following chains can be inspected within their respective veld_* yaml file:\n\n**[./veld_step_1_download_gutenberg_metadata.yaml](./veld_step_1_download_gutenberg_metadata.yaml)**\n\nSince project gutenberg doesn't offer an API, its data cannot be queried programmatically.\nIt does however offer a download of its entire metadata as rdf-xml ( \nhttps://gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2 ), which will be used for querying. 
This \nmetadata is downloaded into [./data/gutenberg_rdf/](./data/gutenberg_rdf/)\n\n```\ndocker compose -f veld_step_1_download_gutenberg_metadata.yaml up\n```\n\n**[./veld_step_2_run_server.yaml](./veld_step_2_run_server.yaml)**\n\nQuerying the data is done via an Apache Fuseki triplestore, since sparql can adapt to the data's \ncomplexity and also allows for high flexibility should query requirements change at some point. For \nthis, a triplestore is started in this step. Configuration for the server can be found in\n[./data/fuseki_config/](./data/fuseki_config/). The server can be reached at \n[http://localhost:3030](http://localhost:3030). Note: this service needs to be kept running for steps \n3 (ingestion) and 4 (querying). After step 4, it can be shut down.\n\n```\ndocker compose -f veld_step_2_run_server.yaml up\n```\n\n**[./veld_step_3_import_rdf.yaml](./veld_step_3_import_rdf.yaml)**\n\nOnce the server is running via step 2, the metadata downloaded by step 1 can be ingested. Note that \nthis step can take a long time (on an AMD Ryzen 7 4800H it took 11 hours). \n\n```\ndocker compose -f veld_step_3_import_rdf.yaml up\n```\n\n**[./veld_step_4_query_books_urls.yaml](./veld_step_4_query_books_urls.yaml)**\n\nAfter the metadata is ingested, the triplestore can be queried for German books that have txt\nfiles but no TEI files. The query for this can be found at \n[./data/queries/german_books_txt_no_tei.rq](./data/queries/german_books_txt_no_tei.rq), and the \noutput is saved as a csv file in [./data/fuseki_export/](./data/fuseki_export/)\n\n```\ndocker compose -f veld_step_4_query_books_urls.yaml up\n```\n\n**[./veld_step_5_download_gutenberg_books.yaml](./veld_step_5_download_gutenberg_books.yaml)**\n\nThe csv file from step 4 contains book download links and their designated file names. This csv is\nused as input for downloading the books' txt files within this step. 
The books can be found at \n[./data/gutenberg_books/](./data/gutenberg_books/).\n\n```\ndocker compose -f veld_step_5_download_gutenberg_books.yaml up\n```\n\n**[./veld_step_6_convert_books_to_teitok.yaml](./veld_step_6_convert_books_to_teitok.yaml)**\n\nTODO: implement.\n\n## multichain\n\n[./veld_multichain_all.yaml](./veld_multichain_all.yaml)\n\nAll of the individual chains above are also aggregated into one multichain which keeps the order of\nthe steps above. Note that the multichain simply references the individual chains by loading them\nfrom their respective veld_* yaml file. This means that any change to any such file will also be\nreflected in this multichain. For more details, see \n[./veld_multichain_all.yaml](./veld_multichain_all.yaml).\n\n```\ndocker compose -f veld_multichain_all.yaml up\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fveldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveldhub%2Fveld_chain__automatic_tei-ification_of_gutenberg/lists"}