{"id":29230604,"url":"https://github.com/omar-maalej/devarticles-bigdata","last_synced_at":"2026-04-13T06:02:50.912Z","repository":{"id":298213687,"uuid":"976043911","full_name":"Omar-Maalej/DevArticles-BigData","owner":"Omar-Maalej","description":"DevArticles-BigData is a big data processing system built to analyze technical articles published on DEV.to, using both real-time and batch processing pipelines.","archived":false,"fork":false,"pushed_at":"2025-07-01T11:45:17.000Z","size":271,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-05T13:41:41.842Z","etag":null,"topics":["big-data","docker","docker-compose","fastapi","java","kafka","python","spark","spark-streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Omar-Maalej.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-01T11:37:40.000Z","updated_at":"2025-07-01T11:49:39.000Z","dependencies_parsed_at":"2025-06-10T01:42:44.164Z","dependency_job_id":null,"html_url":"https://github.com/Omar-Maalej/DevArticles-BigData","commit_stats":null,"previous_names":["omar-maalej/devarticles-bigdata"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Omar-Maalej/DevArticles-BigData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Omar-Maalej%2FDevArticles-BigData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Omar-Maalej%2FDevArticles-BigData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Omar-Maalej%2FDevArticles-BigData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Omar-Maalej%2FDevArticles-BigData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Omar-Maalej","download_url":"https://codeload.github.com/Omar-Maalej/DevArticles-BigData/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Omar-Maalej%2FDevArticles-BigData/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31741541,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T05:13:27.074Z","status":"ssl_error","status_checked_at":"2026-04-13T05:13:25.150Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","docker","docker-compose","fastapi","java","kafka","python","spark","spark-streaming"],"created_at":"2025-07-03T14:01:00.970Z","updated_at":"2026-04-13T06:02:50.876Z","avatar_url":"https://github.com/Omar-Maalej.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DevArticles-BigData\n\nDevArticles-BigData is a big data processing system built to analyze technical articles published on [DEV.to](https://dev.to/), using both real-time and batch processing pipelines. This project was developed as part of the Big Data course at the **National Institute of Applied Sciences and Technology (INSAT)**, under the supervision of **Ms. Lilia Sfaxi**.\n\n---\n\n## Objectives\n\n- Fetch and ingest articles from the DEV.to public API.\n- Apply **real-time processing** to identify trending tags as articles are published.\n- Perform **batch analytics** to capture historical tag usage patterns.\n- Store both raw and processed data in a flexible and queryable data store.\n- Provide interactive dashboards for data exploration and visualization.\n\n---\n\n## Technologies Used\n\n- **Python** – For data ingestion, batch jobs, and dashboards.\n- **Java** – For Spark Structured Streaming due to better support and stability.\n- **Apache Kafka** – For real-time article ingestion and stream communication.\n- **Apache Spark** – For both streaming and batch data processing.\n- **MongoDB** – NoSQL database for storing article metadata and analytics.\n- **FastAPI** – To expose REST APIs and serve real-time data.\n- **Streamlit** – For visualizing batch results through an interactive dashboard.\n- **Docker \u0026 Docker Compose** – For containerized, reproducible deployment.\n\n---\n\n## Pipeline Architecture\n\nThe system is designed around three key components: ingestion, processing, and visualization.\n\n### Ingestion\n\n- Articles are fetched regularly from the DEV.to API using `fetch_articles.py`.\n- Fetched articles are pushed to the `articles` topic in **Kafka**.\n\n### Real-Time Processing\n\n- A **Java-based Spark Structured Streaming** job consumes new articles from Kafka.\n- It extracts and counts tags from each article.\n- Results are sent to the `tag_counts` topic and also stored in MongoDB.\n- **FastAPI** exposes real-time tag trends via a REST API.\n\n### Batch Processing\n\n- A Python-based **Spark batch job** (`analyse_articles.py`) is run periodically.\n- It fetches a larger set of articles and computes global tag statistics.\n- Aggregated results are stored in MongoDB under `article_analytics`.\n- **Streamlit** is used to visualize these trends interactively.\n\n---\n\n### Pipeline Architecture Diagram\n\n![System Architecture](images/architecture.png)\n\n---\n\n## Example Dashboards\n\n### Batch Analytics (Streamlit)\n\nAggregated tag frequency across a historical window.\n\n![Batch Processing Dashboard](images/batch-tag-trends.png)\n\n---\n\n### Real-Time Tag Tracking (FastAPI)\n\nLive updates of trending tags based on the latest articles.\n\n![Real-Time Dashboard](images/stream-dashboard.png)\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fomar-maalej%2Fdevarticles-bigdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fomar-maalej%2Fdevarticles-bigdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fomar-maalej%2Fdevarticles-bigdata/lists"}