Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/defgsus/frontpage-archive
archive of online press frontpages
archive historic media news press
- Host: GitHub
- URL: https://github.com/defgsus/frontpage-archive
- Owner: defgsus
- Created: 2022-01-28T18:32:08.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2024-10-01T01:57:52.000Z (about 2 months ago)
- Last Synced: 2024-10-10T15:16:13.394Z (about 1 month ago)
- Topics: archive, historic, media, news, press
- Language: Python
- Homepage:
- Size: 233 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Archive of news front pages
[![Scraper](https://github.com/defgsus/frontpage-archive/actions/workflows/scraper.yml/badge.svg)](https://github.com/defgsus/frontpage-archive/actions/workflows/scraper.yml)
Collector of raw index.html files and the like.
Should have started this 20 years ago!

## Scraped sites:

| id | since | files | url |
|:------------------------------------------------------------|:-----------|--------:|:---------------------------------|
| [bild.de](docs/snapshots/bild.de) | 2022-01-28 | 26 | https://www.bild.de |
| [compact-online.de](docs/snapshots/compact-online.de) | 2022-01-29 | 5 | https://www.compact-online.de/ |
| [faz.net](docs/snapshots/faz.net) | 2022-01-29 | 14 | https://www.faz.net/ |
| [fr.de](docs/snapshots/fr.de) | 2022-01-28 | 8 | https://www.fr.de/ |
| [gmx.net](docs/snapshots/gmx.net) | 2022-01-29 | 7 | https://www.gmx.net/ |
| [heise.de](docs/snapshots/heise.de) | 2022-01-28 | 12 | https://www.heise.de/ |
| [spiegel.de](docs/snapshots/spiegel.de) | 2022-01-28 | 21 | https://www.spiegel.de/ |
| [spiegeldaily.de](docs/snapshots/spiegeldaily.de) | 2022-01-28 | 5 | https://www.spiegeldaily.de/ |
| [sueddeutsche.de](docs/snapshots/sueddeutsche.de) | 2022-01-29 | 7 | https://www.sueddeutsche.de/ |
| [t-online.de](docs/snapshots/t-online.de) | 2022-01-29 | 8 | https://www.t-online.de/ |
| [volksstimme.de](docs/snapshots/volksstimme.de) | 2022-01-29 | 8 | https://www.volksstimme.de/ |
| [web.de](docs/snapshots/web.de) | 2022-01-29 | 15 | https://web.de |
| [welt.de](docs/snapshots/welt.de) | 2022-01-29 | 16 | https://www.welt.de |
| [zeit.de](docs/snapshots/zeit.de) | 2022-01-29 | 16 | https://www.zeit.de/ |
| [zeitfuerdieschule.de](docs/snapshots/zeitfuerdieschule.de) | 2022-01-29 | 5 | https://www.zeitfuerdieschule.de |

Well, let's see how far this goes with a free GitHub account.
Many websites embed click IDs and random UUIDs in their
documents, so every file changes in each snapshot.

Anyway, each snapshot currently adds about 10 MB to the
repository size (size of the `.git` directory). That's not going
to work for long :-(

## UPDATE

Okay, raw data is just too much. The snapshot rate is now set
to **once a month**. I'll try to scrape just the article
headlines and archive them in another repository.

### TODO

- https://www.n-tv.de/
- https://www.handelsblatt.com/
- https://www.taz.de/
- https://www.wa.de/
- https://www.rnd.de/
- https://www.nzz.ch/
- https://www.bazonline.ch/
- https://www.focus.de/
- https://www.tagesschau.de/
- https://www.heise.de/tp/
- https://www.golem.de/
- https://www.kicker.de/
- https://www.achgut.com/
- https://www.stern.de/
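
As noted further up, random UUIDs in the scraped documents make every file differ between snapshots. A minimal sketch of one possible workaround follows: mask anything shaped like a canonical UUID before comparing two snapshots. This is not the scraper used in this repository. The names `normalize_snapshot`, `snapshot_digest` and `has_changed` are hypothetical, and click-ids that are not UUID-shaped would need their own patterns.

```python
import hashlib
import re

# Assumption: the volatile tokens look like canonical UUIDs
# (8-4-4-4-12 hexadecimal digits). Other click-id formats are not covered.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def normalize_snapshot(html: str) -> str:
    """Replace every UUID-shaped token with a fixed placeholder."""
    return UUID_RE.sub("<uuid>", html)


def snapshot_digest(html: str) -> str:
    """Hash the normalized document so that pages differing only in UUIDs match."""
    normalized = normalize_snapshot(html)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_html: str, current_html: str) -> bool:
    """Return True if the two snapshots differ after UUIDs are masked."""
    return snapshot_digest(previous_html) != snapshot_digest(current_html)
```

A snapshot would then only need to be committed when `has_changed` returns `True` for the previous and the freshly downloaded file. Whether that saves much space depends on how often the front pages really stay the same, which is probably rare for news sites.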