{"id":16104174,"url":"https://github.com/dpriskorn/swepub2python","last_synced_at":"2025-10-14T21:38:04.811Z","repository":{"id":75337167,"uuid":"457726430","full_name":"dpriskorn/SwePub2Python","owner":"dpriskorn","description":"This project converts and cleans the scientific metadata from Swepub into Python objects","archived":false,"fork":false,"pushed_at":"2024-06-17T22:48:04.000Z","size":225,"stargazers_count":1,"open_issues_count":6,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-07T16:52:27.241Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpriskorn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-10T10:09:26.000Z","updated_at":"2023-05-16T08:08:56.000Z","dependencies_parsed_at":"2024-10-27T17:27:03.468Z","dependency_job_id":"1c872c0a-592a-4761-9d91-7261341ee0c5","html_url":"https://github.com/dpriskorn/SwePub2Python","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dpriskorn/SwePub2Python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2FSwePub2Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2FSwePub2Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2FSwePub2Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2FSwePub2Python/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpriskorn","download_url":"https://codeload.github.com/dpriskorn/SwePub2Python/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpriskorn%2FSwePub2Python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278638122,"owners_count":26019947,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-09T18:59:48.961Z","updated_at":"2025-10-06T16:05:28.071Z","avatar_url":"https://github.com/dpriskorn.png","language":"Python","readme":"# SwePub2Python\n\nThis project extracts SwePub records into Python objects so the data can be analyzed with Pandas.\n\n## Downloading SwePub\nUse an FTP client to fetch the deduplicated file from ftp://ftp.libris.kb.se/pub/spa/\nThe whole file is about 2 GB.\n\n## Issues in SwePub\n\nThere is a lot of bloat in the chosen specification.\nFor example, the titles of all the UKÄ codes could have been left out,\nstored in a Wikibase graph database instead, and simply linked.\nThat would have saved a lot of space and hassle for consumers.\nThe same could be done with all the language handling. 
\nHere SwePub could simply link to Wikidata, because all the world's languages \nare already modeled there and many data consumers have \nalready added support for Wikidata (WD) to their workflows.\n\nIf this were done, the SwePub file size would probably shrink considerably.\n\nThis would also be good for the environment, as it keeps processing time to a minimum for consumers.\n\nDiva example article: http://www.diva-portal.org/smash/record.jsf?dswid=6479\u0026pid=diva2%3A1216612\u0026c=11\u0026searchType=SIMPLE\u0026language=sv\u0026query=anv%C3%A4ndarv%C3%A4nlig\u0026af=%5B%5D\u0026aq=%5B%5B%5D%5D\u0026aq2=%5B%5B%5D%5D\u0026aqe=%5B%5D\u0026noOfRows=50\u0026sortOrder=author_sort_asc\u0026sortOrder2=title_sort_asc\u0026onlyFullText=false\u0026sf=all\n\nSuggestions for improving the dataset:\n1) Add language codes to titles, just as is done for summaries.\n2) Publish a specification of the format and validate that it is followed.\n3) The publication date seems to be completely missing from the data. Why? In DiVA there are 3 dates, e.g. \"Available from: 2018-06-12 Created: 2018-06-12 Last updated: 2018-06-12\" (Diva example article).\n4) Subjects could be matched to concepts in OpenAlex, but they are just text strings.\n5) In DiVA, some publications have \"keywords\" in multiple languages (Diva example article).\n\n## What I learned from this project\n* Good data starts with a clear and well-thought-out specification.\n  It results in a minimum of guesswork for the consumer, as opposed to\n  a hairy mess of objects with unclear relations, as in this case.\n* I ran into Kubernetes errors with this script. It is still unknown\n  what causes them. I really prefer to do my own computing whenever\n  possible so I can control the whole environment. 
Kubernetes introduces\n  complexity.\n\n## License \nGPLv3+","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpriskorn%2Fswepub2python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpriskorn%2Fswepub2python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpriskorn%2Fswepub2python/lists"}