https://github.com/schbenedikt/datamining

Heise (https://heise.de) News Crawler

data data-science heise postgresql web-crawler

# 🌍 Purpose & Functionality
The **Heise News Crawler** is designed to automatically extract and store news articles from Heise's archive. The primary goals are:

- 📡 **Data Collection:** Gather historical news articles from Heise.de.
- 🏛 **Structured Storage:** Store articles in a PostgreSQL database for easy querying and analysis.
- 🔍 **Metadata Extraction:** Retrieve key information such as title, author, category, keywords, and word count.
- 🔄 **Incremental crawling:** Detect duplicate articles and save only new articles of the current day.
- 🔔 **Notifications:** Send an email if an error occurs during the crawling process.
- 🎨 **Enhanced Terminal Output:** Uses PyFiglet for improved readability.
- 📤 **Data export:** Export articles as .csv, .json, or .xlsx files, or display the data in a stats.html file.
- 🖥 **API:** Provision of statistics and complete data sets.

An API endpoint is also provided that displays the crawled data and statistics.
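
To illustrate the metadata extraction goal, here is a minimal sketch using only the standard library. It assumes that author and keywords sit in `<meta name="...">` tags and that the article text lives in `<p>` elements; the actual Heise markup and the parser used by this project may differ, and `ArticleMetaParser` and `extract_metadata` are hypothetical names. Category is left out because its location in the page is not described here.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collects <title>, selected <meta> tags, and paragraph text (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False
        self._in_paragraph = False
        self._paragraph_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
        elif tag == "meta" and attrs.get("name") in ("author", "keywords"):
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self._paragraph_text.append(data)

    @property
    def word_count(self):
        # Approximate: whitespace-separated tokens inside <p> elements.
        return len(" ".join(self._paragraph_text).split())


def extract_metadata(html):
    """Return a dict with title, author, keywords, and word_count."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "author": parser.meta.get("author", ""),
        "keywords": parser.meta.get("keywords", ""),
        "word_count": parser.word_count,
    }
```

The returned keys match four of the columns in the database schema, so the dictionary could be passed on to a storage step with little extra work.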

---

## 🚀 Installation & Setup

### 1️⃣ Requirements

🔹 Python 3

🔹 PostgreSQL

🔹 Required Python libraries (listed in [requirements.txt](requirements.txt))

### 2️⃣ Install Dependencies

Install required Python libraries:

```sh
pip3 install -r requirements.txt
```

### 3️⃣ Create `.env` File

Set up your database and email credentials by creating a `.env` file:

```env
EMAIL_USER=...
EMAIL_PASSWORD=...
SMTP_SERVER=...
SMTP_PORT=...
ALERT_EMAIL=...
DB_NAME=...
DB_USER=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DISCORD_TOKEN=...
CHANNEL_ID=...
```
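
As a rough sketch of how these values could be read at startup, the snippet below parses a `.env` file with the standard library and groups the `DB_*` keys. The helper names `load_env_file` and `database_settings` are hypothetical; the project itself may rely on a dedicated loader instead.

```python
def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def database_settings(env):
    """Collect the DB_* keys from the .env file into one dictionary."""
    return {
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
    }
```

The dictionary keys are chosen to match the keyword arguments of `psycopg2.connect`, assuming psycopg2 is the PostgreSQL driver in use.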

---

## 🛠 Usage

### 1️⃣ Start the first Crawler (into the past)

```sh
python3 main.py
```

#### Example Terminal Output

```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Verarbeite 16 Artikel für den Tag xxxx-xx-xx
xxxx-xx-xx xx:xx:xx [INFO] 2025-03-01T20:00:00 - article-name
(⬆️ date)
```
If fewer than 10 articles are found for a day, an email notification is sent.
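
The backward crawl and the low-count check can be sketched as follows. This assumes the `xxxx/xx` part of the archive URL is a four-digit year and a zero-padded month; the function names are hypothetical and not taken from `main.py`.

```python
from datetime import date

ARCHIVE_BASE = "https://www.heise.de/newsticker/archiv"
MIN_ARTICLES_PER_DAY = 10  # threshold mentioned above


def archive_url(year, month):
    """Build the archive URL for one month (assumed format: YYYY/MM)."""
    return f"{ARCHIVE_BASE}/{year:04d}/{month:02d}"


def months_backwards(start, count):
    """Yield (year, month) pairs, newest first, beginning with start's month."""
    year, month = start.year, start.month
    for _ in range(count):
        yield year, month
        month -= 1
        if month == 0:
            year, month = year - 1, 12


def days_below_threshold(articles_per_day, minimum=MIN_ARTICLES_PER_DAY):
    """Return the days whose article count is below `minimum`, sorted."""
    return sorted(day for day, n in articles_per_day.items() if n < minimum)
```

A crawler could iterate over `months_backwards(date.today(), 12)`, fetch each `archive_url(...)`, and pass the per-day counts to `days_below_threshold` to decide whether a notification is due.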

### 2️⃣ Start the second Crawler (for current articles in the present)

```sh
python3 current_crawler.py
```

#### Example Terminal Output

```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Aktueller Crawl-Durchlauf abgeschlossen.
xxxx-xx-xx xx:xx:xx [INFO] Warte 300 Sekunden bis zum nächsten Crawl.
(⬆️ date)
```
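
The output above shows a 300-second pause between crawl runs. A minimal sketch of such a polling loop is given below; `run_forever` and its parameters are hypothetical, and `current_crawler.py` may be structured differently.

```python
import time

CRAWL_INTERVAL_SECONDS = 300  # taken from the log line above


def run_forever(crawl_once, interval=CRAWL_INTERVAL_SECONDS,
                sleep=time.sleep, max_runs=None):
    """Call crawl_once repeatedly, pausing `interval` seconds after each run.

    `max_runs` and `sleep` exist only to make the loop testable;
    with the defaults the loop never ends.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            crawl_once()
        except Exception as exc:
            # Keep the loop alive; a real crawler would also send a notification.
            print(f"[ERROR] crawl failed: {exc}")
        runs += 1
        sleep(interval)
    return runs
```

Catching the exception inside the loop means a single failed run does not stop later runs.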

### 3️⃣ Use API

The API server starts automatically. You can call up the statistics here:
```
http://127.0.0.1:6600/stats
```
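
For scripted access, the endpoint can be fetched with the standard library. This is a sketch that assumes the server is running locally on the address above; the response format (HTML or JSON) is not specified here, so the body is returned as plain text. Both function names are hypothetical.

```python
from urllib.request import urlopen


def stats_url(host="127.0.0.1", port=6600):
    """Build the URL of the statistics endpoint."""
    return f"http://{host}:{port}/stats"


def fetch_stats(url=None, timeout=10):
    """Return the response body as text; raises URLError if the server is down."""
    with urlopen(url or stats_url(), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```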

### 4️⃣ Export articles

You can export the data of all articles to a CSV, JSON, or XLSX file.
```sh
python3 export_articles.py
```
Exported articles are saved in the current directory.
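
The following sketch shows what a CSV and JSON export could look like using only the standard library. It assumes each article is a dictionary keyed by the columns of the database schema; XLSX is left out because it needs a third-party library. `export_csv` and `export_json` are hypothetical names, not the functions in `export_articles.py`.

```python
import csv
import json

# Column names as listed in the database schema.
COLUMNS = ["id", "title", "url", "date", "author", "category",
           "keywords", "word_count", "editor_abbr", "site_name"]


def export_csv(rows, path="articles_export.csv"):
    """Write the rows as CSV with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def export_json(rows, path="articles_export.json"):
    """Write the rows as a JSON array, keeping umlauts readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(list(rows), handle, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` keeps German characters such as "ü" as-is in the JSON file instead of escaping them.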

---

## 🏗 Database Schema

| Column | Type | Description |
| ------------ | ------ | -------------------- |
| id | SERIAL | Unique ID |
| title | TEXT | Article title |
| url | TEXT | Article URL (unique) |
| date | TEXT | Publication date |
| author | TEXT | Author(s) |
| category | TEXT | Category |
| keywords | TEXT | Keywords |
| word\_count | INT | Word count |
| editor\_abbr | TEXT | Editor abbreviation |
| site\_name | TEXT | Website name |
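
The table above could be created and filled roughly as follows. This is a sketch: the table name `articles` is an assumption, the `%(name)s` placeholders follow the psycopg2 parameter style, and `save_article` is a hypothetical helper. The `UNIQUE` constraint on `url` together with `ON CONFLICT (url) DO NOTHING` is one way to skip duplicate articles.

```python
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    title       TEXT,
    url         TEXT UNIQUE,
    date        TEXT,
    author      TEXT,
    category    TEXT,
    keywords    TEXT,
    word_count  INT,
    editor_abbr TEXT,
    site_name   TEXT
);
"""

INSERT_SQL = """
INSERT INTO articles (title, url, date, author, category,
                      keywords, word_count, editor_abbr, site_name)
VALUES (%(title)s, %(url)s, %(date)s, %(author)s, %(category)s,
        %(keywords)s, %(word_count)s, %(editor_abbr)s, %(site_name)s)
ON CONFLICT (url) DO NOTHING;
"""


def save_article(cursor, article):
    """Insert one article via a DB-API cursor; True if a new row was written."""
    cursor.execute(INSERT_SQL, article)
    # PostgreSQL reports 0 affected rows when the URL already exists.
    return cursor.rowcount == 1
```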

---

## 📩 Error Notifications

If any errors occur, an email notification will be sent.
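
A notification of this kind can be sketched with `smtplib` and the variables from the `.env` file. The code below is an assumption-laden example, not the content of `notification.py`: it sends plain text, uses STARTTLS, and the names `build_alert` and `send_alert` are hypothetical.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, env=os.environ):
    """Create a plain-text alert addressed to ALERT_EMAIL."""
    message = EmailMessage()
    message["From"] = env["EMAIL_USER"]
    message["To"] = env["ALERT_EMAIL"]
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(message, env=os.environ):
    """Send the message through the configured SMTP server."""
    # STARTTLS is an assumption; servers on port 465 need smtplib.SMTP_SSL instead.
    with smtplib.SMTP(env["SMTP_SERVER"], int(env["SMTP_PORT"])) as smtp:
        smtp.starttls()
        smtp.login(env["EMAIL_USER"], env["EMAIL_PASSWORD"])
        smtp.send_message(message)
```

Keeping message construction separate from sending makes the alert text easy to check without a mail server.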

---

## 📂 Project Structure

(old)
```
📂 Heise-News-Crawler
├── 📄 .gitignore # Git ignore file
├── 📄 .env # Environment variables (email & database config, you have to create this file manually)
├── 📄 main.py # Main crawler script
├── 📄 api.py # API functionalities
├── 📄 notification.py # Email notification handler
├── 📄 test_notifications.py # Testing email notifications
├── 📄 README.md
├── 📄 current_crawler.py # Crawler for newer articles
├── 📄 export_articles.py # Function to export the data
├── 📄 requirements.txt
└── 📂 templates/ # HTML email templates
├── 📄 stats.html # API functionalities
└── 📂 data/ # Export data (as of 03/03/2025)
├── 📄 .gitattributes
├── 📄 README.md
├── 📄 api.py
├── 📄 articles_export.csv
├── 📄 articles_export.json
├── 📄 articles_export.xlsx
└── 📄 LICENCE
```

## ❗Troubleshooting

### 🌐 Start API manually

```sh
python3 api.py
```

### 📧 Testing Notifications

```sh
python3 test_notification.py
```

### ⚠️ Found an error?
Please create a pull request or contact us via server@schächner.de.

---

## 🗂️ Examples

(with Tableau and Deepnote, as of March 2025)
![image](https://github.com/user-attachments/assets/ce6ceae0-bdf4-499c-9577-973017bb1eff)

![image](https://github.com/user-attachments/assets/3affd472-8475-4534-99e6-54500493418c)

![image](https://github.com/user-attachments/assets/984babc4-d264-44be-8534-17fdae1f8d5f)

![image](https://github.com/user-attachments/assets/0c1d7a13-0f28-497c-afb3-048ee0a309e7)

![image](https://github.com/user-attachments/assets/ba9a3180-4ae8-4ab3-b4ae-3e81f4621c23)

![image](https://github.com/user-attachments/assets/85ecd8a3-1f31-49d0-ae3a-efdfd98bef21)

![image](https://github.com/user-attachments/assets/1d5c57f7-72be-4aca-8f03-d4fba8bfba9d)

![image](https://github.com/user-attachments/assets/cde65d2c-2b22-481d-9ba4-1c4086eb3f23)

![image](https://github.com/user-attachments/assets/10c87c9c-d444-487c-992f-73d3d4b4a185)

### Deepnote:
We have also generated some graphs with [Deepnote](https://deepnote.com/app/schachner/Web-Crawler-d5025a36-3829-4c12-ad2d-b81aa84bd217?utm_source=app-settings&utm_medium=product-embed&utm_campaign=data-app&utm_content=d5025a36-3829-4c12-ad2d-b81aa84bd217&__embedded=true) (❗ based on a random sample of 10,000 rows only ❗)

![image](https://github.com/user-attachments/assets/ea99ead8-0b48-47d0-8ddc-7c8ce3bd6b53)

Also check out the [data/Datamining_Heise web crawler-3.twb](https://github.com/SchBenedikt/datamining/blob/3f3fe413aeff25a1ae024215745ed6fa82fc2add/data/Datamining_Heise%20web%20crawler-3.twb) file, which contains an excerpt of the analyses.

---

## 📜 License
This program is licensed under the **GNU General Public License**.

## 🙋 About us

The two of us built this project within a few days and continue to develop it:
- https://github.com/schBenedikt
- https://github.com/schVinzenz

### 📬 Contact

Feel free to reach out if you have any questions, feedback, or just want to say hi!

📧 Email: [server@schächner.de](mailto:server@schächner.de)

🌐 Website:
- https://technik.schächner.de
- https://benedikt.schächner.de
- https://vinzenz.schächner.de

### 💖 Special Thanks

The idea for our Heise News Crawler comes from David Kriesel and his presentation “Spiegel Mining” at 33c3.

---

Happy Crawling! 🎉