{"id":26045169,"url":"https://github.com/schbenedikt/datamining","last_synced_at":"2025-04-10T10:50:36.344Z","repository":{"id":280440325,"uuid":"941098187","full_name":"SchBenedikt/datamining","owner":"SchBenedikt","description":"Heise (https://heise.de) News Crawler","archived":false,"fork":false,"pushed_at":"2025-03-11T07:05:28.000Z","size":4114,"stargazers_count":2,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T09:38:47.715Z","etag":null,"topics":["data","data-science","heise","postgresql","web-crawler"],"latest_commit_sha":null,"homepage":"https://discord.gg/Q6Nn2z3tUP","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SchBenedikt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-01T13:37:27.000Z","updated_at":"2025-03-11T07:05:26.000Z","dependencies_parsed_at":"2025-03-09T20:45:16.565Z","dependency_job_id":null,"html_url":"https://github.com/SchBenedikt/datamining","commit_stats":null,"previous_names":["schbenedikt/datamining"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fdatamining","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fdatamining/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fdatamining/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SchBenedikt%2Fdatamining/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SchBenedikt","download_url":"https://codeload.github.com/SchBenedikt/datamining/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248202993,"owners_count":21064486,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-science","heise","postgresql","web-crawler"],"created_at":"2025-03-07T19:32:17.026Z","updated_at":"2025-04-10T10:50:36.321Z","avatar_url":"https://github.com/SchBenedikt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"https://discord.com/invite/Q6Nn2z3tUP\"\u003e\n  \u003cimg src=\"https://discord.com/api/guilds/1346160903304773703/widget.png?style=banner2\" width=\"250\"/\u003e\n\u003c/a\u003e\n\u003ca href=\"https://deepnote.com/app/schachner/Web-Crawler-d5025a36-3829-4c12-ad2d-b81aa84bd217?utm_source=app-settings\u0026utm_medium=product-embed\u0026utm_campaign=data-app\u0026utm_content=d5025a36-3829-4c12-ad2d-b81aa84bd217\u0026__embedded=true\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Open%20in-Deepnote-blue?style=for-the-badge\u0026logo=deepnote\" width=\"250\"/\u003e\n\u003c/a\u003e\n\n# 🌍 Purpose \u0026 Functionality\nThe **Heise News Crawler** is designed to automatically extract and store news articles from Heise's archive. 
The primary goals are:\n\n- 📡 **Data Collection:** Gather historical news articles from Heise.de.\n- 🏛 **Structured Storage:** Store articles in a PostgreSQL database for easy querying and analysis.\n- 🔍 **Metadata Extraction:** Retrieve key information such as title, author, category, keywords, and word count.\n- 🔄 **Incremental Crawling:** Detect duplicate articles and save only new articles of the current day.\n- 🔔 **Notifications:** Send an email if an error occurs during the crawling process.\n- 🎨 **Enhanced Terminal Output:** Uses PyFiglet for improved readability.\n- 📤 **Data Export:** Export articles as a .csv, .json, or .xlsx file, or display the data in a stats.html file.\n- 🖥 **API:** Provides statistics and complete data sets.\n\nAn API endpoint is also provided that displays the crawled data and statistics.\n\n---\n\n## 🚀 Installation \u0026 Setup\n\n### 1️⃣ Requirements\n\n🔹 Python 3\n\n🔹 PostgreSQL\n\n🔹 Required Python Libraries (Dependencies in [requirements.txt](requirements.txt))\n\n### 2️⃣ Install Dependencies\n\nInstall required Python libraries:\n\n```sh\npip3 install -r requirements.txt\n```\n\n### 3️⃣ Create `.env` File\n\nSet up your database, email, and Discord credentials by creating a `.env` file:\n\n```env\nEMAIL_USER=...\nEMAIL_PASSWORD=...\nSMTP_SERVER=...\nSMTP_PORT=...\nALERT_EMAIL=...\nDB_NAME=...\nDB_USER=...\nDB_PASSWORD=...\nDB_HOST=...\nDB_PORT=...\nDISCORD_TOKEN=...\nCHANNEL_ID=...\n```\n\n\n---\n\n## 🛠 Usage\n\n### 1️⃣ Start the first Crawler (into the past)\n\n```sh\npython3 main.py\n```\n\n#### Example Terminal Output\n\n```\n[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx\n[INFO] Gefundene Artikel (insgesamt): 55\nxxxx-xx-xx xx:xx:xx [INFO] Verarbeite 16 Artikel für den Tag xxxx-xx-xx\nxxxx-xx-xx xx:xx:xx [INFO] 2025-03-01T20:00:00 - article-name\n(⬆️ date)\n```\nIf fewer than 10 articles are found for a day, an email is sent.\n\n\n### 2️⃣ Start the second Crawler (for current articles in the 
present)\n\n```sh\npython3 current_crawler.py\n```\n\n#### Example Terminal Output\n\n```\n[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx\n[INFO] Gefundene Artikel (insgesamt): 55\nxxxx-xx-xx xx:xx:xx [INFO] Aktueller Crawl-Durchlauf abgeschlossen.\nxxxx-xx-xx xx:xx:xx [INFO] Warte 300 Sekunden bis zum nächsten Crawl.\n(⬆️ date)\n```\n\n### 3️⃣ Use the API\n\nThe API server starts automatically. The statistics are available at:\n```\nhttp://127.0.0.1:6600/stats\n```\n\n\n### 4️⃣ Export Articles\n\nYou can export the articles to a CSV, JSON, or XLSX file.\n```sh\npython3 export_articles.py\n```\nExported articles are saved in the current directory.\n\n---\n\n## 🏗 Database Schema\n\n| Column       | Type   | Description          |\n| ------------ | ------ | -------------------- |\n| id           | SERIAL | Unique ID            |\n| title        | TEXT   | Article title        |\n| url          | TEXT   | Article URL (unique) |\n| date         | TEXT   | Publication date     |\n| author       | TEXT   | Author(s)            |\n| category     | TEXT   | Category             |\n| keywords     | TEXT   | Keywords             |\n| word\\_count  | INT    | Word count           |\n| editor\\_abbr | TEXT   | Editor abbreviation  |\n| site\\_name   | TEXT   | Website name         |\n\n---\n\n\n\n## 📩 Error Notifications\n\nIf an error occurs during crawling, an email notification is sent.\n\n---\n\n## 📂 Project Structure\n\n(old)\n```\n📂 Heise-News-Crawler\n├── 📄 .gitignore                 # Git ignore file\n├── 📄 .env                       # Environment variables (email \u0026 database config; create this file manually)\n├── 📄 main.py                    # Main crawler script\n├── 📄 api.py                     # API functionalities\n├── 📄 notification.py            # Email notification handler\n├── 📄 test_notifications.py      # Testing email notifications\n├── 📄 README.md                  \n├── 📄 current_crawler.py         # Crawler 
for newer articles\n├── 📄 export_articles.py         # Function to export the data\n├── 📄 requirements.txt\n├── 📂 templates/                 # HTML email templates\n│   └── 📄 stats.html             # API functionalities\n├── 📂 data/                      # Export data (as of 03/03/2025)\n│   ├── 📄 .gitattributes\n│   ├── 📄 README.md\n│   ├── 📄 api.py\n│   ├── 📄 articles_export.csv\n│   ├── 📄 articles_export.json\n│   └── 📄 articles_export.xlsx\n└── 📄 LICENSE\n```\n\n## ❗ Troubleshooting\n\n### 🌐 Start the API manually\n\n```sh\npython3 api.py\n```\n\n### 📧 Testing Notifications\n\n```sh\npython3 test_notification.py\n```\n\n### ⚠️ Found an error?\nPlease create a pull request or contact us via server@schächner.de.\n\n---\n\n## 🗂️ Examples\n\n(with Tableau and Deepnote, as of March 2025)\n![image](https://github.com/user-attachments/assets/ce6ceae0-bdf4-499c-9577-973017bb1eff)\n\n\n![image](https://github.com/user-attachments/assets/3affd472-8475-4534-99e6-54500493418c)\n\n![image](https://github.com/user-attachments/assets/984babc4-d264-44be-8534-17fdae1f8d5f)\n\n![image](https://github.com/user-attachments/assets/0c1d7a13-0f28-497c-afb3-048ee0a309e7)\n\n![image](https://github.com/user-attachments/assets/ba9a3180-4ae8-4ab3-b4ae-3e81f4621c23)\n\n![image](https://github.com/user-attachments/assets/85ecd8a3-1f31-49d0-ae3a-efdfd98bef21)\n\n![image](https://github.com/user-attachments/assets/1d5c57f7-72be-4aca-8f03-d4fba8bfba9d)\n\n![image](https://github.com/user-attachments/assets/cde65d2c-2b22-481d-9ba4-1c4086eb3f23)\n\n![image](https://github.com/user-attachments/assets/10c87c9c-d444-487c-992f-73d3d4b4a185)\n\n### Deepnote:\nWe have also generated some graphs with [Deepnote](https://deepnote.com/app/schachner/Web-Crawler-d5025a36-3829-4c12-ad2d-b81aa84bd217?utm_source=app-settings\u0026utm_medium=product-embed\u0026utm_campaign=data-app\u0026utm_content=d5025a36-3829-4c12-ad2d-b81aa84bd217\u0026__embedded=true) (❗ only with 
10,000 random rows ❗)\n\n![image](https://github.com/user-attachments/assets/ea99ead8-0b48-47d0-8ddc-7c8ce3bd6b53)\n\n\nAlso check out the [data/Datamining_Heise web crawler-3.twb](https://github.com/SchBenedikt/datamining/blob/3f3fe413aeff25a1ae024215745ed6fa82fc2add/data/Datamining_Heise%20web%20crawler-3.twb) file, which contains an excerpt of the analyses.\n\n---\n\n## 📜 License\nThis program is licensed under the **GNU General Public License v3.0**.\n\n\n\n## 🙋 About us\n\nThe two of us programmed this project within a few days and continue to develop it:\n- https://github.com/schBenedikt\n- https://github.com/schVinzenz\n\n### 📬 Contact\n\nFeel free to reach out if you have any questions, feedback, or just want to say hi!\n\n📧 Email: [server@schächner.de](mailto:server@schächner.de)\n\n🌐 Website:\n- https://technik.schächner.de\n- https://benedikt.schächner.de\n- https://vinzenz.schächner.de\n\n\n### 💖 Special Thanks\n\nThe idea for our Heise News Crawler comes from David Kriesel and his presentation “Spiegel Mining” at 33C3.\n\n\n---\n\nHappy Crawling! 🎉\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschbenedikt%2Fdatamining","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fschbenedikt%2Fdatamining","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fschbenedikt%2Fdatamining/lists"}