{"id":19162870,"url":"https://github.com/centre-for-humanities-computing/computing-antiquity","last_synced_at":"2026-06-15T02:35:00.392Z","repository":{"id":117644436,"uuid":"610201332","full_name":"centre-for-humanities-computing/computing-antiquity","owner":"centre-for-humanities-computing","description":"Code and analyses for the computing antiquity project","archived":false,"fork":false,"pushed_at":"2023-12-08T13:49:55.000Z","size":521,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-09T23:59:45.105Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/centre-for-humanities-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-06T09:50:28.000Z","updated_at":"2025-08-24T15:40:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"6b892f16-963e-4e44-8ad9-28c2e8ab8ea1","html_url":"https://github.com/centre-for-humanities-computing/computing-antiquity","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/centre-for-humanities-computing/computing-antiquity","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fcomputing-antiquity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fcomputing-antiquity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
centre-for-humanities-computing%2Fcomputing-antiquity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fcomputing-antiquity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/centre-for-humanities-computing","download_url":"https://codeload.github.com/centre-for-humanities-computing/computing-antiquity/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fcomputing-antiquity/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34345577,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T09:13:23.513Z","updated_at":"2026-06-15T02:35:00.376Z","avatar_url":"https://github.com/centre-for-humanities-computing.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Computing Antiquity\n\nThis repository contains the scripts for fetching and preprocessing data, as well as all analysis scripts, for the\n\"Computing Antiquity\" project.\n\n\u003e Please note that everything is designed for use on Debian-based systems with a venv-compatible\n\u003e Python distribution and a bash shell.\n\n## Usage\n\n### Data 
fetching\n\nTo fetch the corpus, run:\n\n```bash\nbash src/scripts/get_corpus.py\n```\n\nThis command will put all texts from the corpus in raw XML format in `dat/greek/raw_data/`.\n\n\u003e Note that some Septuagint texts are not fetched by the scripts,\n\u003e as they are closed source and cannot be disclosed to outside parties.\n\u003e You have to manually place `SEPA.zip` in `dat/greek/` before running the script.\n\n### Parsing\n\nTo parse the XML files into raw text, run:\n\n```bash\nbash src/scripts/parse_corpus.py\n```\n\nThis command will put all texts from the corpus in raw txt format in `dat/greek/parsed_data/`.\nAll files follow this naming convention: `\u003csource_corpus\u003e/\u003ccorpus_specific_id\u003e.txt`.\nAn index of destinations, document IDs and source files can be found in `dat/greek/parsed_data/index.csv`.\n\n### Processing\n\nThis step processes all documents with odyCy and saves them as DocBins.\nThe code requires an environment where an NVIDIA GPU can be used.\nIt is specifically designed for usage in the Ubuntu(CUDA+Jupyter) Virtual Machine on AAU in UCloud.\n\nTo install the spaCy GPU dependencies, run:\n\n```bash\nbash src/scripts/init_gpu_server.sh\nsudo reboot\n\n# After reboot\nbash src/install_processing_env.sh\n```\n\nTo preprocess the texts and save them as spaCy DocBins, run:\n\n```bash\nbash src/scripts/spacy_process_corpus.py\n```\n\nThis will save everything in `dat/greek/processed_data/`. 
Each file is named after its document ID and has a `.spacy` extension.\nAn index of the files is available at `dat/greek/processed_data/index.csv`.\n\n### Cleaning\n\nThis step cleans texts by normalizing them and removing non-Greek tokens.\nIt produces the following corpora (under `dat/greek/cleaned_data`):\n  - Normalized with stopwords (`with_stopwords.csv`)\n  - Normalized without stopwords (`without_stopwords.csv`)\n  - Lemmatized with stopwords (`lemmatized_with_stopwords.csv`)\n  - Lemmatized without stopwords (`lemmatized_without_stopwords.csv`)\n\nEach file has a `document_id` and a `text` column.\nTexts are represented as follows:\n  - Sentences are separated by newlines\n  - Tokens in sentences are separated by spaces\n\n```bash\nbash src/scripts/clean_corpus.py\n```\n\n \u003e It is recommended that you run these steps consecutively on the same server, as they use the same environment.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fcomputing-antiquity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcentre-for-humanities-computing%2Fcomputing-antiquity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fcomputing-antiquity/lists"}