{"id":17336580,"url":"https://github.com/restioson/isixhosa-crawler","last_synced_at":"2025-10-04T03:33:30.301Z","repository":{"id":78428018,"uuid":"542802284","full_name":"Restioson/isixhosa-crawler","owner":"Restioson","description":"(Undergrad independent study project) Focused web crawler aiming to discover documents written in isiXhosa on the web","archived":false,"fork":false,"pushed_at":"2024-05-30T07:58:24.000Z","size":127,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-07T05:34:47.025Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Restioson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-28T21:37:01.000Z","updated_at":"2024-10-15T21:19:56.000Z","dependencies_parsed_at":"2023-03-05T02:15:32.518Z","dependency_job_id":"b018cd07-0e93-46cf-ac02-5c8a319299b5","html_url":"https://github.com/Restioson/isixhosa-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Restioson/isixhosa-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Restioson%2Fisixhosa-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Restioson%2Fisixhosa-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Restioson%2Fisixhosa-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Restio
son%2Fisixhosa-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Restioson","download_url":"https://codeload.github.com/Restioson/isixhosa-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Restioson%2Fisixhosa-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278260182,"owners_count":25957614,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T15:31:30.168Z","updated_at":"2025-10-04T03:33:30.263Z","avatar_url":"https://github.com/Restioson.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# isixhosa-crawler\n\nSimple focused web crawler for discovering documents written in isiXhosa.\nThis was produced as part of an undergraduate independent research project under the supervision of Professor Hussein Suleman during my B.Sc Computer Science \u0026 Xhosa Communication at the University of Cape Town.\n\n# Disclosure\n\nThis research was partially funded by the National Research Foundation of South Africa (Grant number: 129253)\nand University of Cape Town. 
The authors acknowledge that opinions, findings and conclusions or\nrecommendations expressed in this publication are those of the authors, and that\nthe NRF accepts no liability whatsoever in this regard.\n\n# Publication\n\nResults associated with the crawler were published at the SAICSIT 2023 conference.\n\nThe final paper is available from [SpringerLink](https://link.springer.com/chapter/10.1007/978-3-031-39652-6_2), and a pre-print version is available for free from [UCT CS's publications archive](https://pubs.cs.uct.ac.za/id/eprint/1551/).\n\nThe dataset itself is available [here](https://zivahub.uct.ac.za/articles/dataset/Focused_Crawling_for_Automated_IsiXhosa_Corpus_Building_-_Final_Crawl_Data_Analysis/25125359/1).\n\n# Citation\n\nPlease cite as follows:\n```bibtex\n@InProceedings{10.1007/978-3-031-39652-6_2,\nauthor=\"Marquard, Cael\nand Suleman, Hussein\",\neditor=\"Gerber, Aurona\nand Coetzee, Marijke\",\ntitle=\"Focused Crawling for Automated IsiXhosa Corpus Building\",\nbooktitle=\"South African Institute of Computer Scientists and Information Technologists\",\nyear=\"2023\",\npublisher=\"Springer Nature Switzerland\",\naddress=\"Cham\",\npages=\"19--31\",\nabstract=\"IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corpora. This makes it difficult to perform many kinds of research with the language. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth has been found to fit Heaps' Law, and its word frequency has been found to be heavy-tailed. 
In addition, as expected, the corpus statistics did not match expectations from non-agglutinative languages.\",\nisbn=\"978-3-031-39652-6\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frestioson%2Fisixhosa-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frestioson%2Fisixhosa-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frestioson%2Fisixhosa-crawler/lists"}