Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dpriskorn/swepub2python
This project converts and cleans the scientific metadata from Swepub into Python objects
- Host: GitHub
- URL: https://github.com/dpriskorn/swepub2python
- Owner: dpriskorn
- License: gpl-3.0
- Created: 2022-02-10T10:09:26.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-06-17T22:48:04.000Z (7 months ago)
- Last Synced: 2024-12-21T06:43:05.122Z (15 days ago)
- Language: Python
- Size: 220 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# SwePub2Python
This project extracts SwePub metadata into Python objects so the data can be worked on with Pandas.
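As a minimal sketch of that idea, assuming the dump is line-delimited JSON (one record per line) and using hypothetical field names (`record_id`, `title`, `language` are illustrative, not the real SwePub schema):

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterator, Optional

import pandas as pd


@dataclass
class Publication:
    # Hypothetical minimal fields; real SwePub records hold far more.
    record_id: str
    title: Optional[str]
    language: Optional[str]


def iter_publications(path: str) -> Iterator[Publication]:
    """Yield one cleaned Publication per line of a line-delimited JSON dump."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            raw = json.loads(line)
            yield Publication(
                record_id=raw.get("id", ""),
                title=raw.get("title"),
                language=raw.get("language"),
            )


def to_dataframe(path: str) -> pd.DataFrame:
    """Flatten the cleaned objects into a Pandas DataFrame for analysis."""
    return pd.DataFrame([asdict(p) for p in iter_publications(path)])
```

With the dump saved locally, `to_dataframe(...)` gives a table ready for the usual Pandas filtering and grouping.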
## Downloading SwePub
Use an FTP program and fetch the deduplicated file from ftp://ftp.libris.kb.se/pub/spa/
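The fetch can also be scripted with Python's standard `ftplib`; this is a sketch, and the exact filename inside `/pub/spa/` is an assumption you should check against the directory listing:

```python
from ftplib import FTP
from pathlib import Path

FTP_HOST = "ftp.libris.kb.se"  # host from the URL above
FTP_DIR = "/pub/spa/"          # directory holding the deduplicated dump


def download_dump(filename: str, dest_dir: str = ".") -> Path:
    """Download one file from the SwePub FTP directory (~2 GB, so be patient)."""
    target = Path(dest_dir) / filename
    with FTP(FTP_HOST) as ftp:
        ftp.login()            # anonymous login
        ftp.cwd(FTP_DIR)
        with open(target, "wb") as fh:
            ftp.retrbinary(f"RETR {filename}", fh.write)
    return target
```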
The whole file is about 2 GB.

## Issues in SwePub
There is a lot of bloat in their choice of specification.
For example, the titles of all the UKÄ codes could have been left out,
stored in a Wikibase graph database instead, and simply linked.
That would have saved a lot of space and hassle for consumers.
The same could be done with all the language handling.
Here SwePub could simply link to Wikidata, because all of the world's
languages are already modeled there and many data consumers have
already added support for Wikidata to their workflows. If this were done for SwePub, the file size would probably shrink considerably.
This would also be good for the environment, since it keeps processing time to a minimum for consumers.
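To illustrate the point with a toy example (not the actual SwePub record layout): embedding labels repeats the same bytes in every record, while a single Wikidata link keeps the record small and lets consumers resolve labels once. Q9027 is the Wikidata item for Swedish.

```python
import json

# Embedded style: the same labels repeated inside every record.
embedded = {
    "language": {
        "code": "swe",
        "labels": {"en": "Swedish", "sv": "svenska", "de": "Schwedisch"},
    }
}

# Linked style: one Wikidata item per language, resolved by the consumer.
linked = {"language": "http://www.wikidata.org/entity/Q9027"}

# The linked form serializes to far fewer bytes per record.
print(len(json.dumps(embedded)), len(json.dumps(linked)))
```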
DiVA example article: http://www.diva-portal.org/smash/record.jsf?dswid=6479&pid=diva2%3A1216612&c=11&searchType=SIMPLE&language=sv&query=anv%C3%A4ndarv%C3%A4nlig&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%5D%5D&aqe=%5B%5D&noOfRows=50&sortOrder=author_sort_asc&sortOrder2=title_sort_asc&onlyFullText=false&sf=all
Suggestions for improvements of the dataset:
1) Add language codes to titles just as you do for summaries.
2) Publish a specification of the format and validate that the data conforms to it.
3) The publication date seems to be completely missing from the data. Why? In DiVA there are three dates, e.g. "Available from: 2018-06-12 Created: 2018-06-12 Last updated: 2018-06-12" (DiVA example article).
4) Subjects could be matched to concepts in OpenAlex, but they are currently just text strings.
5) In DiVA, some publications have "keywords" in multiple languages (DiVA example article).

## What I learned from this project
* Good data starts with a clear and well-thought-out specification.
It leaves a minimum of guesswork for the consumer, as opposed to
a hairy mess of objects with unclear relations, as in this case.
* I ran into Kubernetes errors with this script. It is still unknown
what causes them. I really prefer to do my own computing whenever
possible, so I can control the whole environment. Kubernetes introduces
complexity.

## License
GPLv3+