# Lanz Mining

## What's it about?

Crawl talk show guests, descriptions, political party memberships and all other
available information from ARD and ZDF public media's web presence.
Crawlers and parsers support *Markus Lanz*, *Maischberger*, *Maybrit Illner*,
*Caren Miosga* and *Hart aber fair*. Public media data should be publicly
available (public money, public data).

## Requires

* Python 3.11
* PostgreSQL installed for development
* (Optional) Node.js installed

## Installation

* Fork and clone the repo.
* Install the Python dependencies with your favourite package management tool.
* (Optionally) use [pdm](https://pdm-project.org/latest/), as shown in the sketch below.
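
A minimal setup with pdm could look like this (a sketch; the fork URL is a placeholder, and any package manager that understands the project's `pyproject.toml` should work just as well):

```shell
# clone your fork and install the dependencies into a pdm-managed virtualenv
git clone https://github.com/<your-fork>/lanz-mining.git
cd lanz-mining
pdm install
```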

## Crawl and extract data

### Crawling commands

To get the data locally, run `pdm run python -m src.lanz_mining.crawl -t <targetShow>`
(see the example after this list). This project currently supports the following `targetShow`s:
* `markuslanz`, `maybritillner`, `carenmiosga`, `maischberger`, `hartaberfair`
* If you're running it from a cronjob, use the `--lates-only` flag.
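
For example, a one-off crawl and a cronjob variant could look like this (a sketch based on the command above; the cron schedule and the repo path are placeholders):

```shell
# one-off crawl of a single target show
pdm run python -m src.lanz_mining.crawl -t markuslanz

# nightly cronjob that only picks up the latest episodes
0 3 * * * cd /path/to/lanz-mining && pdm run python -m src.lanz_mining.crawl -t markuslanz --lates-only
```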

There's another option for ZDF `targetShow`s. Visit the [ZDF-Mediathek](https://www.zdf.de),
enter the name of your `targetShow` into the search field, tick the checkbox
for 'ganze Sendungen' and load as many results as possible. Then save the HTML page and
run `pdm run python -m src.lanz_mining.crawl -t <targetShow> --file <saved-page.html>`.

This visits all episode URLs found in the saved page and stores each episode's HTML file.
Any of the combinations above write the collected HTML files to `outputs/html`.
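
A concrete run of the file-based variant might look like this (the saved file name is a placeholder for wherever you stored the search results):

```shell
# crawl all episodes linked from a saved ZDF-Mediathek search result page
pdm run python -m src.lanz_mining.crawl -t maybritillner --file zdf-maybritillner-search.html
```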

### Extract data

Once you have some HTML files ready, run
`pdm run python -m src.lanz_mining.scrape_local --input-dir outputs/html --output-file data-processed.csv`.

Information extraction is done with regexes that match certain indicators, e.g. for roles.
Where information is missing, `scrape_local` tries to recover it from other
formats using the guest's name.
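
To illustrate the regex idea (a hypothetical sketch; the actual patterns and module layout in this repo differ), matching role and party indicators in a guest description could look roughly like this:

```python
import re

# Hypothetical indicator patterns; the real project keeps far more of these
# in its parser modules and maps them onto genres and party memberships.
ROLE_PATTERNS = {
    "Journalist": re.compile(r"\bJournalist(in)?\b"),
    "Politiker": re.compile(r"\bPolitiker(in)?\b"),
    "CDU": re.compile(r"\bCDU\b"),
    "SPD": re.compile(r"\bSPD\b"),
}


def extract_indicators(description: str) -> list[str]:
    """Return all role/party labels whose pattern matches the guest description."""
    return [label for label, pattern in ROLE_PATTERNS.items() if pattern.search(description)]


print(extract_indicators("Die SPD-Politikerin spricht über den Haushalt."))
# -> ['Politiker', 'SPD']
```

When a description contains no such indicator, the name-based fallback described above kicks in instead.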

## Contributions

Currently I'm looking to reduce the manual tasks further, so that ideally everything runs
automatically. To make this reliable, I'd be thankful for any hints on public
APIs or other methods to map genres, identify party memberships and the like.
I'd also really be happy if you let me know what you think: DM me on
[chaos.social/@arrrrrmin](https://chaos.social/@arrrrrmin), or open an issue to
further improve things.