https://github.com/schbenedikt/datamining
Heise (https://heise.de) News Crawler
data data-science heise postgresql web-crawler
- Host: GitHub
- URL: https://github.com/schbenedikt/datamining
- Owner: SchBenedikt
- License: gpl-3.0
- Created: 2025-03-01T13:37:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T07:05:28.000Z (over 1 year ago)
- Last Synced: 2025-03-24T09:38:47.715Z (over 1 year ago)
- Topics: data, data-science, heise, postgresql, web-crawler
- Language: Python
- Homepage: https://discord.gg/Q6Nn2z3tUP
- Size: 3.92 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
# Purpose & Functionality
The **Heise News Crawler** is designed to automatically extract and store news articles from Heise's archive. The primary goals are:
- **Data Collection:** Gather historical news articles from Heise.de.
- **Structured Storage:** Store articles in a PostgreSQL database for easy querying and analysis.
- **Metadata Extraction:** Retrieve key information such as title, author, category, keywords, and word count.
- **Incremental Crawling:** Detect duplicate articles and save only the new articles of the current day.
- **Notifications:** Send an email if an error occurs during the crawling process.
- **Enhanced Terminal Output:** Use PyFiglet for more readable terminal output.
- **Data Export:** Export articles as `.csv`, `.json`, or `.xlsx` files, or display the data in a `stats.html` page.
- **API:** Provide the crawled data and statistics through an API endpoint.
---
## Installation & Setup
### 1️⃣ Requirements
- Python 3
- PostgreSQL
- Required Python libraries (listed in [requirements.txt](requirements.txt))
### 2️⃣ Install Dependencies
Install required Python libraries:
```sh
pip3 install -r requirements.txt
```
### 3️⃣ Create `.env` File
Set up your database, email, and Discord credentials by creating a `.env` file:
```env
EMAIL_USER=...
EMAIL_PASSWORD=...
SMTP_SERVER=...
SMTP_PORT=...
ALERT_EMAIL=...
DB_NAME=...
DB_USER=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DISCORD_TOKEN=...
CHANNEL_ID=...
```
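As a minimal sketch of how these settings might be consumed, the function below collects the database variables and fails early if one is missing. The variable names come from the template above; the function name `load_db_config` is hypothetical, and the sketch assumes the `.env` file has already been loaded into the process environment (for example by a dotenv loader).

```python
import os
from typing import Mapping

# Variable names are taken from the .env template; the grouping is illustrative.
DB_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def load_db_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the database settings and fail early if any is missing."""
    missing = [key for key in DB_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing settings in .env: {', '.join(missing)}")
    config = {key: env[key] for key in DB_KEYS}
    config["DB_PORT"] = int(config["DB_PORT"])  # the port is stored as text in .env
    return config
```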
---
## Usage
### 1️⃣ Start the First Crawler (into the past)
```sh
python3 main.py
```
#### Example Terminal Output
```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Verarbeite 16 Artikel für den Tag xxxx-xx-xx
xxxx-xx-xx xx:xx:xx [INFO] 2025-03-01T20:00:00 - article-name
(⬆️ date)
```
If fewer than 10 articles are found for a day, an email notification is sent.
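The threshold check can be illustrated with a short sketch. It assumes article timestamps in the ISO format shown in the terminal output (`2025-03-01T20:00:00`); the function name `days_below_threshold` is hypothetical and not taken from the project.

```python
from collections import Counter

MIN_ARTICLES_PER_DAY = 10  # threshold stated in this README


def days_below_threshold(article_dates, minimum=MIN_ARTICLES_PER_DAY):
    """Return {day: count} for every crawled day with fewer than `minimum` articles."""
    # The first 10 characters of an ISO timestamp are the 'YYYY-MM-DD' day.
    counts = Counter(date[:10] for date in article_dates)
    return {day: n for day, n in sorted(counts.items()) if n < minimum}
```

Days with no articles at all do not appear in the input, so a real check would also need to compare against the list of days that were crawled.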
### 2️⃣ Start the Second Crawler (for current articles)
```sh
python3 current_crawler.py
```
#### Example Terminal Output
```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Aktueller Crawl-Durchlauf abgeschlossen.
xxxx-xx-xx xx:xx:xx [INFO] Warte 300 Sekunden bis zum nächsten Crawl.
(⬆️ date)
```
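The output above suggests a loop that crawls, then waits 300 seconds before the next run. A minimal sketch of such a polling loop follows; `run_periodically` and its parameters are hypothetical names, and the injectable `sleep` argument exists only to make the loop testable.

```python
import time


def run_periodically(crawl_once, interval_seconds=300, max_runs=None, sleep=time.sleep):
    """Call `crawl_once` repeatedly, pausing `interval_seconds` between runs.

    With max_runs=None the loop runs until the process is stopped.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        crawl_once()
        runs += 1
        # Skip the pause after the final run.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```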
### 3️⃣ Use the API
The API server starts automatically. You can view the statistics at:
```
http://127.0.0.1:6600/stats
```
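The endpoint can also be queried from a script. The sketch below uses only the standard library and makes no assumption about the response format, which this README does not specify; it simply returns the content type and the body text. The function names are hypothetical.

```python
from urllib.request import urlopen


def stats_url(host="127.0.0.1", port=6600):
    """Build the URL of the stats endpoint (host and port as shown above)."""
    return f"http://{host}:{port}/stats"


def fetch_stats(url, timeout=10):
    """Fetch the stats endpoint and return (content_type, body_text)."""
    with urlopen(url, timeout=timeout) as response:
        content_type = response.headers.get_content_type()
        charset = response.headers.get_content_charset() or "utf-8"
        return content_type, response.read().decode(charset)
```

With the API server running, `fetch_stats(stats_url())` returns the raw response for further processing.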
### 4️⃣ Export Articles
You can export the article data to a CSV, JSON, or XLSX file:
```sh
python3 export_articles.py
```
Exported articles are saved in the current directory.
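To illustrate the idea, here is a minimal sketch of a CSV and JSON export using only the standard library. It is not the code of `export_articles.py`; the function names are hypothetical, and the field list mirrors the columns in the Database Schema table. XLSX is omitted because it requires a third-party library.

```python
import csv
import json

# Column names as listed in the Database Schema table of this README.
FIELDS = ["id", "title", "url", "date", "author", "category",
          "keywords", "word_count", "editor_abbr", "site_name"]


def export_csv(articles, path):
    """Write a list of article dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(articles)


def export_json(articles, path):
    """Write a list of article dicts to a JSON file, keeping umlauts readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(articles, handle, ensure_ascii=False, indent=2)
```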
---
## Database Schema
| Column | Type | Description |
| ------------ | ------ | -------------------- |
| id | SERIAL | Unique ID |
| title | TEXT | Article title |
| url | TEXT | Article URL (unique) |
| date | TEXT | Publication date |
| author | TEXT | Author(s) |
| category | TEXT | Category |
| keywords | TEXT | Keywords |
| word\_count | INT | Word count |
| editor\_abbr | TEXT | Editor abbreviation |
| site\_name | TEXT | Website name |
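
The schema above could be expressed in PostgreSQL roughly as follows. This is a sketch under assumptions: the table name `articles`, the `PRIMARY KEY` constraint, and the `save_article` helper are hypothetical, and the `%(name)s` placeholders assume a DB-API driver with pyformat parameters such as psycopg2. `ON CONFLICT (url) DO NOTHING` is one way to skip duplicates using the unique URL column.

```python
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    title       TEXT,
    url         TEXT UNIQUE,
    date        TEXT,
    author      TEXT,
    category    TEXT,
    keywords    TEXT,
    word_count  INT,
    editor_abbr TEXT,
    site_name   TEXT
);
"""

INSERT_SQL = """
INSERT INTO articles
    (title, url, date, author, category, keywords, word_count, editor_abbr, site_name)
VALUES
    (%(title)s, %(url)s, %(date)s, %(author)s, %(category)s,
     %(keywords)s, %(word_count)s, %(editor_abbr)s, %(site_name)s)
ON CONFLICT (url) DO NOTHING;
"""


def save_article(connection, article):
    """Insert one article dict; rows with an already known URL are skipped."""
    with connection.cursor() as cursor:
        cursor.execute(INSERT_SQL, article)
    connection.commit()
```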
---
## Error Notifications
If an error occurs during crawling, an email notification is sent.
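A notification of this kind could be built and sent as sketched below. This is not the code of `notification.py`: the function names are hypothetical, the mapping of `EMAIL_USER` to the sender and `ALERT_EMAIL` to the recipient is an assumption based on the variable names in `.env`, and STARTTLS is assumed (some servers require an SSL connection instead).

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, env=os.environ):
    """Create the alert email from the settings defined in .env."""
    message = EmailMessage()
    message["From"] = env["EMAIL_USER"]   # assumed to be the sender address
    message["To"] = env["ALERT_EMAIL"]    # assumed to be the recipient address
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(message, env=os.environ):
    """Send the message over SMTP with STARTTLS (assumed transport)."""
    with smtplib.SMTP(env["SMTP_SERVER"], int(env["SMTP_PORT"])) as smtp:
        smtp.starttls()
        smtp.login(env["EMAIL_USER"], env["EMAIL_PASSWORD"])
        smtp.send_message(message)
```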
---
## Project Structure
(old)
```
Heise-News-Crawler
├── .gitignore               # Git ignore file
├── .env                     # Environment variables (email & database config; you have to create this file manually)
├── main.py                  # Main crawler script
├── api.py                   # API functionalities
├── notification.py          # Email notification handler
├── test_notifications.py    # Testing email notifications
├── README.md
├── current_crawler.py       # Crawler for newer articles
├── export_articles.py       # Function to export the data
├── requirements.txt
├── templates/               # HTML email templates
│   └── stats.html           # API functionalities
├── data/                    # Export data (as of 03/03/2025)
│   ├── .gitattributes
│   ├── README.md
│   ├── api.py
│   ├── articles_export.csv
│   ├── articles_export.json
│   └── articles_export.xlsx
└── LICENCE
```
## Troubleshooting
### Start the API Manually
```sh
python3 api.py
```
### Testing Notifications
```sh
python3 test_notification.py
```
### ⚠️ Found an Error?
Please create a pull request or contact us via server@schächner.de.
---
## Examples
(with Tableau and Deepnote, as of March 2025)
### Deepnote
We have also generated some graphs with [Deepnote](https://deepnote.com/app/schachner/Web-Crawler-d5025a36-3829-4c12-ad2d-b81aa84bd217?utm_source=app-settings&utm_medium=product-embed&utm_campaign=data-app&utm_content=d5025a36-3829-4c12-ad2d-b81aa84bd217&__embedded=true) (based on a random sample of 10,000 rows only).

Also check out the [data/Datamining_Heise web crawler-3.twb](https://github.com/SchBenedikt/datamining/blob/3f3fe413aeff25a1ae024215745ed6fa82fc2add/data/Datamining_Heise%20web%20crawler-3.twb) file, which contains an excerpt of the analyses.
---
## License
This program is licensed under the **GNU General Public License v3.0**.
## About Us
We built this project together within a few days and continue to develop it:
- https://github.com/schBenedikt
- https://github.com/schVinzenz
### Contact
Feel free to reach out if you have any questions, feedback, or just want to say hi!

Email: [server@schächner.de](mailto:server@schächner.de)

Websites:
- https://technik.schächner.de
- https://benedikt.schächner.de
- https://vinzenz.schächner.de
### Special Thanks
The idea for our Heise News Crawler comes from David Kriesel and his presentation “Spiegel Mining” at 33c3.
---
Happy Crawling!