{"id":16104161,"url":"https://github.com/dpriskorn/odsc","last_synced_at":"2026-01-04T12:41:32.362Z","repository":{"id":208238098,"uuid":"721145941","full_name":"dpriskorn/odsc","owner":"dpriskorn","description":"Project that aims to sentenize all the open data of Riksdagen and other sources to create an easily linkable dataset of sentences that can be refered to from Wikidata lexemes and other resources","archived":false,"fork":false,"pushed_at":"2024-07-15T19:08:28.000Z","size":2561,"stargazers_count":0,"open_issues_count":32,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-26T04:31:32.875Z","etag":null,"topics":["civic-tech","entity-linking","folketinget","named-entity-recognition","nlp","part-of-speech-tagging","riksdagen","riksdagensoppnadata","wikidata","wikidata-lexemes"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpriskorn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-20T12:59:07.000Z","updated_at":"2024-05-04T07:27:39.000Z","dependencies_parsed_at":"2024-01-18T04:51:08.761Z","dependency_job_id":"4a695a79-72fe-42b2-b853-4a42aedf3009","html_url":"https://github.com/dpriskorn/odsc","commit_stats":{"total_commits":136,"total_committers":2,"mean_commits":68.0,"dds":"0.13970588235294112","last_synced_commit":"3280eb194fc7a0f8f0c12c7b580fe9a950b3012d"},"previous_names":["dpriskorn/riskdagen_sentences","dpriskorn/riksdagen_sentences","dpriskorn/odsc"],"tags_count":0,"template":false,"template_full_name":null,"reposi
tory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2Fodsc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2Fodsc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2Fodsc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2Fodsc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpriskorn","download_url":"https://codeload.github.com/dpriskorn/odsc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244759920,"owners_count":20505715,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["civic-tech","entity-linking","folketinget","named-entity-recognition","nlp","part-of-speech-tagging","riksdagen","riksdagensoppnadata","wikidata","wikidata-lexemes"],"created_at":"2024-10-09T18:59:46.875Z","updated_at":"2026-01-04T12:41:32.334Z","avatar_url":"https://github.com/dpriskorn.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Open Data Sentence Corpora\nThis civic science project aims to analyze and sentenize all the open \ndata of Riksdagen and other sources using spaCy \nto create an easily linkable \ndataset of sentences that can be referred to from \n[Wikidata lexemes](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) and other resources. \n\nThe advantage of such a dataset is huge from a language perspective. \nThe sentences contain valuable information about what is going on in society. 
\nThey contain a lot of words, phrases and idioms which are highly valuable to anyone interested in the language.\nThe 600k documents to be analyzed contain a lot of political dialogue and written documents from institutions in the Swedish state.\n\nKeywords: NLP, data science, open data, swedish, \nopen government data, riksdagen, sweden, API\n\n# Author\n[Dennis Priskorn](https://www.wikidata.org/wiki/Q111016131).\n\n## Idea\nUse spaCy to create the first version.\nAll sentences are language detected and given a \nUUID which is unique for each release. \n\nAs better sentenizing becomes available or Riksdagen improves its \ndata over time, the hashes and UUIDs will change, but all released \nversions will be locked in time and can always be referred to \nconsistently and reliably.\n\nThe resulting dataset is planned to be released in Zenodo \nand is expected to be around 1TB.\n\n## Features\n* Reliability\n* Locked in time\n* Referenceable\n* Language detected (using Fasttext langdetect)\n* Uniquely identifiable\n* Linkable (the individual sentences are not planned to be \nlinkable at this stage, but the release is and line numbers \nor UUIDs can be used to link with no ambiguity)\n* Named Entity Recognition entities for each sentence and document\n* An [evolvable](https://levelup.gitconnected.com/to-create-an-evolvable-api-stop-thinking-about-urls-2ad8b4cc208e) API\n  * /lookup endpoint to get sentences to use as usage examples for [Wikidata lexemes](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data) (based on the needs of [Luthor](https://luthor.toolforge.org/))\n\n## Scope\nThis way of chopping up open data can be applied to any open data, provided that it is in a machine readable form like TEXT, XML, JSON or HTML.\n\nRiksdagen has about 600k documents that can be downloaded as open data.\n\nThis project is a stepping stone to an even larger database of sentences and tokens that we can use to enrich the lexicographic data in Wikidata.\n\n## 
Statistics\nSee [STATISTICS.md](/STATISTICS.md)\n\n## Design\n\n### API design inspired by\n* https://medium.com/@jccguimaraes/designing-an-api-6609eb771b18\n* https://levelup.gitconnected.com/to-create-an-evolvable-api-stop-thinking-about-urls-2ad8b4cc208e\n* https://en.wikipedia.org/wiki/Don%27t_Make_Me_Think\n* https://jsonapi.org/format/#document-structure\n\n### Data model\n![Datamodel](/diagrams/datamodel.svg)\n\n[UML source](/diagrams/datamodel.puml)\n\n## Installation\nClone the repo\n\nRun\n\n`$ pip install poetry \u0026\u0026 poetry install`\n\nAlso download the model needed\n\n`$ python -m spacy download sv_core_news_lg`\n(250 MB)\n\nNow download some of the source datasets from Riksdagen and put them in a data/sv/ folder hierarchy.\n\n## Use\n`$ python riksdagen_analyzer --analyze`\n\n## Sources\n### Mostly unilingual\n* (sv) Riksdagen open data: ~600k machine readable HTML/TEXT documents ~1TB database size in total https://www.riksdagen.se/sv/dokument-och-lagar/riksdagens-oppna-data/dokument/\n* (da) Folketinget open data: ~500k programmatically generated PDF documents https://www.ft.dk/da/dokumenter/aabne_data#276BF4DB3854444286D8F71F742FD018\n\n## Related corpora\n* Digital Corpus of the European Parliament https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html (EU languages)\n* europarl corpus https://www.statmt.org/europarl/ (EU languages)\n* wikisentences https://analytics.wikimedia.org/published/datasets/one-off/santhosh/wikisentences/ (all Wikipedia languages)\n* The European Parliamentary Comparable and Parallel Corpora https://www.islrn.org/resources/036-939-425-010-1/ (en, es)\n* Corrected \u0026 Structured Europarl Corpus https://pub.cl.uzh.ch/wiki/public/costep/start (EU languages)\n\n## Inspiration\nAlice Zhao https://www.youtube.com/watch?v=8Fw1nh8lR54\n\n## Thanks\nThanks to Nicolas Vigneron and Asof Bartov for discussions about the needs of Luthor and how to make this project most suitable as a source of sentences used in 
usage examples on Wikidata lexemes.\n\n## License\nGPLv3+\n\n## What I learned\n* the default sentenizer for Swedish in spaCy is not ideal\n* fasttext langdetect cannot reliably detect the language of sentences with only one token/word\n* ChatGPT can write good code, but it still outputs wonky code sometimes\n* ChatGPT is very good at creating SQL queries!\n* working on millions of sentences with NLP takes time even on a fast machine \nlike my 8th gen 8-core i5 laptop\n* Python langdetect was too slow and only utilized 1 CPU; switching to fasttext langdetect was a bit challenging because I had to fix the Python module\n* it's so nice to work with classes and small methods and \ncombine them in ways that make sense. KISS!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpriskorn%2Fodsc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpriskorn%2Fodsc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpriskorn%2Fodsc/lists"}