https://github.com/defgsus/frontpage-archive

archive of online press frontpages

Host: GitHub
URL: https://github.com/defgsus/frontpage-archive
Owner: defgsus
Created: 2022-01-28T18:32:08.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2024-10-01T01:57:52.000Z (about 2 months ago)
Last Synced: 2024-10-10T15:16:13.394Z (about 1 month ago)
Topics: archive, historic, media, news, press
Language: Python
Homepage:
Size: 233 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Archive of news front pages

[![Scraper](https://github.com/defgsus/frontpage-archive/actions/workflows/scraper.yml/badge.svg)](https://github.com/defgsus/frontpage-archive/actions/workflows/scraper.yml)

Collector of raw index.html files and the like.

Should have started this 20 years ago! 

## Scraped sites:

| id                                                          | since      |   files | url                              |
|:------------------------------------------------------------|:-----------|--------:|:---------------------------------|
| [bild.de](docs/snapshots/bild.de)                           | 2022-01-28 |      26 | https://www.bild.de              |
| [compact-online.de](docs/snapshots/compact-online.de)       | 2022-01-29 |       5 | https://www.compact-online.de/   |
| [faz.net](docs/snapshots/faz.net)                           | 2022-01-29 |      14 | https://www.faz.net/             |
| [fr.de](docs/snapshots/fr.de)                               | 2022-01-28 |       8 | https://www.fr.de/               |
| [gmx.net](docs/snapshots/gmx.net)                           | 2022-01-29 |       7 | https://www.gmx.net/             |
| [heise.de](docs/snapshots/heise.de)                         | 2022-01-28 |      12 | https://www.heise.de/            |
| [spiegel.de](docs/snapshots/spiegel.de)                     | 2022-01-28 |      21 | https://www.spiegel.de/          |
| [spiegeldaily.de](docs/snapshots/spiegeldaily.de)           | 2022-01-28 |       5 | https://www.spiegeldaily.de/     |
| [sueddeutsche.de](docs/snapshots/sueddeutsche.de)           | 2022-01-29 |       7 | https://www.sueddeutsche.de/     |
| [t-online.de](docs/snapshots/t-online.de)                   | 2022-01-29 |       8 | https://www.t-online.de/         |
| [volksstimme.de](docs/snapshots/volksstimme.de)             | 2022-01-29 |       8 | https://www.volksstimme.de/      |
| [web.de](docs/snapshots/web.de)                             | 2022-01-29 |      15 | https://web.de                   |
| [welt.de](docs/snapshots/welt.de)                           | 2022-01-29 |      16 | https://www.welt.de              |
| [zeit.de](docs/snapshots/zeit.de)                           | 2022-01-29 |      16 | https://www.zeit.de/             |
| [zeitfuerdieschule.de](docs/snapshots/zeitfuerdieschule.de) | 2022-01-29 |       5 | https://www.zeitfuerdieschule.de |

Well, let's see how far this goes with a free GitHub account.

Many websites transmit click IDs and random UUIDs in their
documents, so every file changes in every snapshot.
Anyway, each snapshot currently adds about 10 MB to the
repository size (size of the `.git` directory). That's not going
to work for long :-(
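To illustrate the problem, here is a minimal sketch of how one could tell whether a page really changed by masking UUID-shaped tokens before hashing. This is not what the scraper in this repository does; the function names are made up for the example, and site-specific click IDs would need their own patterns.

```python
import hashlib
import re

# Matches the canonical 8-4-4-4-12 hexadecimal UUID form.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def normalize_html(html: str) -> str:
    """Replace every UUID-shaped token with a fixed placeholder."""
    return UUID_RE.sub("<uuid>", html)


def content_fingerprint(html: str) -> str:
    """Return the SHA-256 hex digest of the normalized document."""
    return hashlib.sha256(normalize_html(html).encode("utf-8")).hexdigest()


def is_unchanged(old_html: str, new_html: str) -> bool:
    """True if both documents are identical once UUIDs are masked."""
    return content_fingerprint(old_html) == content_fingerprint(new_html)
```

Two snapshots that differ only in their random UUIDs then get the same fingerprint, so a hypothetical scraper could skip committing them.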

## UPDATE

Okay, raw data is just too much. The snapshot rate is now set
to **once a month**. I'll try to scrape just the article
headlines and archive them in another repository.
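A rough sketch of what headline-only scraping could look like, using just the standard library. It assumes headlines sit in `h1` to `h3` elements, which will not hold for every site; `HeadlineParser` and `extract_headlines` are hypothetical names, not code from this or the other repository.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text content of h1, h2 and h3 elements."""

    HEADLINE_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headlines = []
        self._depth = 0   # > 0 while inside a headline element
        self._parts = []  # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADLINE_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADLINE_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.headlines.append(text)
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_headlines(html: str) -> list:
    """Return the headline texts found in an HTML document, in order."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

Storing only these short strings per snapshot, instead of the full `index.html`, would keep the repository growth far below the raw-data approach.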

### TODO

- https://www.n-tv.de/
- https://www.handelsblatt.com/
- https://www.taz.de/
- https://www.wa.de/
- https://www.rnd.de/
- https://www.nzz.ch/
- https://www.bazonline.ch/
- https://www.focus.de/
- https://www.tagesschau.de/
- https://www.heise.de/tp/
- https://www.golem.de/
- https://www.kicker.de/
- https://www.achgut.com/
- https://www.stern.de/