https://github.com/droher/diachronic
Get daily historical snapshots of every article on any Wiki, formatted as Parquet files
apache-arrow google-cloud terraform wikimedia wikipedia
- Host: GitHub
- URL: https://github.com/droher/diachronic
- Owner: droher
- License: apache-2.0
- Created: 2017-10-09T05:52:35.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-09-23T20:53:57.000Z (over 3 years ago)
- Last Synced: 2023-04-08T23:35:41.727Z (almost 3 years ago)
- Topics: apache-arrow, google-cloud, terraform, wikimedia, wikipedia
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# diachronic
A parser that turns the revision history dump for a set of wiki sites
(e.g. Wikipedia, Wiktionary) into Parquet files of daily snapshots.
It uses Apache Arrow for serialization and uploads the resulting
files to a specified Google Cloud bucket.
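
As a rough illustration of what a "daily snapshot" could mean here, the sketch below keeps the last revision of each article on each UTC day and hands the result to Arrow for Parquet output. This is not the project's actual code: the function names `daily_snapshots` and `write_parquet`, the record fields (`title`, `timestamp`, `text`), and the choice to emit only days that have a revision are all assumptions made for the example. The upload to a bucket is left out.

```python
from datetime import datetime, timezone


def daily_snapshots(revisions):
    """Keep the last revision of each article on each UTC day.

    Hypothetical helper: `revisions` is an iterable of dicts with
    "title", "timestamp" (ISO 8601 with offset) and "text" keys.
    """
    latest = {}
    for rev in revisions:
        ts = datetime.fromisoformat(rev["timestamp"]).astimezone(timezone.utc)
        key = (rev["title"], ts.date().isoformat())
        # A later revision on the same day replaces the earlier one.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, rev["text"])
    return [
        {"title": title, "date": day, "text": text}
        for (title, day), (_, text) in sorted(latest.items())
    ]


def write_parquet(rows, path):
    """Serialize snapshot rows to a Parquet file via Apache Arrow.

    Requires the third-party `pyarrow` package, imported lazily so the
    snapshot logic above can be used without it.
    """
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Made-up sample data for illustration only.
revisions = [
    {"title": "A", "timestamp": "2020-01-01T08:00:00+00:00", "text": "v1"},
    {"title": "A", "timestamp": "2020-01-01T20:00:00+00:00", "text": "v2"},
    {"title": "A", "timestamp": "2020-01-02T09:00:00+00:00", "text": "v3"},
    {"title": "B", "timestamp": "2020-01-01T12:00:00+00:00", "text": "b1"},
]
snapshots = daily_snapshots(revisions)
```

With the sample data, article `A` yields one row per day (the later `v2` wins on 2020-01-01), and calling `write_parquet(snapshots, "snapshots.parquet")` would write those rows to a local file.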