Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dpriskorn/swepub2python
This project converts and cleans the scientific metadata from Swepub into Python objects
- Host: GitHub
- URL: https://github.com/dpriskorn/swepub2python
- Owner: dpriskorn
- License: gpl-3.0
- Created: 2022-02-10T10:09:26.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-06-17T22:48:04.000Z (7 months ago)
- Last Synced: 2024-12-21T06:43:05.122Z (15 days ago)
- Language: Python
- Size: 220 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# SwePub2Python
This project extracts SwePub metadata into Python objects so the data can be worked on with Pandas.
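As a minimal sketch of that idea, assuming the dump is line-delimited JSON (one record per line) and using hypothetical field names (`record_id`, `title`, `language` are illustrative, not the real SwePub schema):

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterator, Optional

import pandas as pd


@dataclass
class Publication:
    # Hypothetical minimal fields; real SwePub records hold far more.
    record_id: str
    title: Optional[str]
    language: Optional[str]


def iter_publications(path: str) -> Iterator[Publication]:
    """Yield one cleaned Publication per line of a line-delimited JSON dump."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            raw = json.loads(line)
            yield Publication(
                record_id=raw.get("id", ""),
                title=raw.get("title"),
                language=raw.get("language"),
            )


def to_dataframe(path: str) -> pd.DataFrame:
    """Flatten the cleaned objects into a Pandas DataFrame for analysis."""
    return pd.DataFrame([asdict(p) for p in iter_publications(path)])
```

With the dump saved locally, `to_dataframe(...)` gives a table ready for the usual Pandas filtering and grouping.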
## Downloading SwePub
Use an FTP program and fetch the deduplicated file from ftp://ftp.libris.kb.se/pub/spa/
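The fetch can also be scripted with Python's standard `ftplib`; this is a sketch, and the exact filename inside `/pub/spa/` is an assumption you should check against the directory listing:

```python
from ftplib import FTP
from pathlib import Path

FTP_HOST = "ftp.libris.kb.se"  # host from the URL above
FTP_DIR = "/pub/spa/"          # directory holding the deduplicated dump


def download_dump(filename: str, dest_dir: str = ".") -> Path:
    """Download one file from the SwePub FTP directory (~2 GB, so be patient)."""
    target = Path(dest_dir) / filename
    with FTP(FTP_HOST) as ftp:
        ftp.login()            # anonymous login
        ftp.cwd(FTP_DIR)
        with open(target, "wb") as fh:
            ftp.retrbinary(f"RETR {filename}", fh.write)
    return target
```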
The whole file is about 2 GB.

## Issues in SwePub
There is a lot of bloat in their choice of specification.
For example, the titles of all the UKÄ codes could have been left out,
stored in a Wikibase graph database instead, and simply linked.
That would have saved a lot of space and hassle for consumers.
The same could be done with all the language handling.
Here SwePub could simply link to Wikidata, because all of the world's
languages are already modeled there and many data consumers have
already added support for Wikidata to their workflows. If this were done for SwePub, the file size would probably shrink considerably.
This would also be good for the environment, since it keeps processing time to a minimum for consumers.
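To illustrate the point with a toy example (not the actual SwePub record layout): embedding labels repeats the same bytes in every record, while a single Wikidata link keeps the record small and lets consumers resolve labels once. Q9027 is the Wikidata item for Swedish.

```python
import json

# Embedded style: the same labels repeated inside every record.
embedded = {
    "language": {
        "code": "swe",
        "labels": {"en": "Swedish", "sv": "svenska", "de": "Schwedisch"},
    }
}

# Linked style: one Wikidata item per language, resolved by the consumer.
linked = {"language": "http://www.wikidata.org/entity/Q9027"}

# The linked form serializes to far fewer bytes per record.
print(len(json.dumps(embedded)), len(json.dumps(linked)))
```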
DiVA example article: http://www.diva-portal.org/smash/record.jsf?dswid=6479&pid=diva2%3A1216612&c=11&searchType=SIMPLE&language=sv&query=anv%C3%A4ndarv%C3%A4nlig&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%5D%5D&aqe=%5B%5D&noOfRows=50&sortOrder=author_sort_asc&sortOrder2=title_sort_asc&onlyFullText=false&sf=all
Suggestions for improvements of the dataset:
1) Add language codes to titles just as you do for summaries.
2) Publish a specification of the format and validate that the data conforms to it.
3) The publication date seems to be completely missing from the data. Why? In DiVA there are three dates, e.g. "Available from: 2018-06-12 Created: 2018-06-12 Last updated: 2018-06-12" (DiVA example article).
4) Subjects could be matched to concepts in OpenAlex, but they are currently just text strings.
5) In DiVA, some publications have "keywords" in multiple languages (DiVA example article).

## What I learned from this project
* Good data starts with a clear and well-thought-out specification.
It leaves a minimum of guesswork for the consumer, as opposed to
a hairy mess of objects with unclear relations, as in this case.
* I ran into Kubernetes errors with this script. It is still unknown
what causes them. I really prefer to do my own computing whenever
possible, so I can control the whole environment. Kubernetes introduces
complexity.

## License
GPLv3+