https://github.com/mmerlyn/sensor-stream
End-to-end real-time sensor data pipeline using Python, Spark, PostgreSQL, Docker, and React for live visualization.
- Host: GitHub
- URL: https://github.com/mmerlyn/sensor-stream
- Owner: mmerlyn
- License: MIT
- Created: 2025-05-04T20:06:19.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-05-07T09:13:57.000Z (12 months ago)
- Last Synced: 2025-05-12T10:07:15.120Z (12 months ago)
- Topics: apache-kafka, apache-spark, fast-api, postgresql-database, python, react, spark-streaming, streaming
- Language: TypeScript
- Homepage:
- Size: 1.01 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SensorStream
A real-time data pipeline that streams sensor readings through Kafka and Spark to a live React dashboard. I built this to gain hands-on experience with **distributed streaming systems**, **big data processing**, and **end-to-end data engineering**.
[Demo](https://drive.google.com/file/d/10sotn4D0T8xfHV6UxCW88erpHJdvhuak/view?usp=sharing) | [Blog](https://medium.com/@merlynmercylona/building-a-live-sensor-monitoring-system-with-kafka-spark-postgresql-fastapi-react-e66a2aa10550)
## Why I Built This
I wanted to understand how real-time data systems work in production environments - the kind used for IoT telemetry, financial market data, and application monitoring. Rather than just reading about Kafka and Spark, I built a complete pipeline that ingests, processes, stores, and visualizes streaming data in real time. This forced me to solve real integration challenges: connecting distributed services, handling data serialization, managing stream processing checkpoints, and building responsive UIs that update continuously.
## Skills Demonstrated
| Area | Technologies & Concepts |
| ------------------------ | -------------------------------------------------------------------- |
| **Stream Processing** | Apache Spark Structured Streaming, PySpark, micro-batch processing |
| **Message Queues** | Apache Kafka, ZooKeeper, producer/consumer patterns, topic design |
| **Backend Development** | Python, FastAPI, REST API design, CORS configuration |
| **Database Design** | PostgreSQL, JDBC integration, indexing strategies, schema design |
| **Frontend Development** | React 19, TypeScript, Recharts, Tailwind CSS, polling-based updates |
| **Data Serialization** | JSON schema validation, Spark StructType definitions |
| **Containerization** | Docker, Docker Compose, multi-service orchestration, networking |
| **System Integration** | Service dependencies, environment configuration, cross-service comms |
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ React Dashboard (:5173) │
│ • Live temperature/humidity charts • 5-second polling │
│ • Rolling data window • Recharts visualization │
└─────────────────────────────┬───────────────────────────────────────┘
│ HTTP GET /latest, /history
↓
┌─────────────────────────────────────────────────────────────────────┐
│ FastAPI Backend (:8000) │
│ • REST endpoints • PostgreSQL queries │
│ • CORS middleware • JSON response formatting │
└─────────────────────────────┬───────────────────────────────────────┘
│ SQL queries
↓
┌─────────────────────────────────────────────────────────────────────┐
│ PostgreSQL (:5432) │
│ • sensor_data table • Timestamp DESC index │
│ • Persistent storage • Optimized for latest-first queries │
└─────────────────────────────┬───────────────────────────────────────┘
↑ JDBC batch writes
│
┌─────────────────────────────────────────────────────────────────────┐
│ Spark Structured Streaming │
│ • Kafka consumer • JSON parsing with schema validation │
│ • Micro-batch processing • foreachBatch sink to PostgreSQL │
│ • Null filtering • Error handling per batch │
└─────────────────────────────┬───────────────────────────────────────┘
↑ Consumes from topic
│
┌─────────────────────────────────────────────────────────────────────┐
│ Apache Kafka (:9092) │
│ • Topic: sensor-data • Distributed message queue │
│ • ZooKeeper coordination • Decouples producer from consumer │
└─────────────────────────────┬───────────────────────────────────────┘
↑ Publishes sensor readings
│
┌─────────────────────────────────────────────────────────────────────┐
│ Python Producer │
│ • Simulates sensor-001 • 1-second intervals │
│ • JSON serialization • Auto-retry on connection failure │
└─────────────────────────────────────────────────────────────────────┘
```
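The diagram above mentions a `sensor_data` table with a timestamp-DESC index. A plausible DDL for that table, assuming column names that match the producer's JSON (the repo's actual schema may differ):

```sql
-- Illustrative schema; column names are assumptions, not copied from the repo.
CREATE TABLE IF NOT EXISTS sensor_data (
    id          SERIAL PRIMARY KEY,
    sensor_id   TEXT NOT NULL,
    temperature DOUBLE PRECISION NOT NULL,
    humidity    DOUBLE PRECISION NOT NULL,
    timestamp   TIMESTAMPTZ NOT NULL
);

-- Latest-first queries (/latest, /history) scan this index in order,
-- so "ORDER BY timestamp DESC LIMIT n" never sorts the whole table.
CREATE INDEX IF NOT EXISTS idx_sensor_data_ts_desc
    ON sensor_data (timestamp DESC);
```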
## Components Implemented
**Data Producer**
- Simulates environmental sensor generating temperature (20-30°C) and humidity (40-60%)
- Publishes JSON messages to Kafka every second
- Auto-retry logic with exponential backoff for broker connectivity
- Clean shutdown handling
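A minimal sketch of what such a producer could look like with `kafka-python`. The topic name (`sensor-data`), sensor id (`sensor-001`), one-second cadence, and value ranges come from this README; the JSON field names and broker address are assumptions:

```python
# Hypothetical producer sketch; not the repo's actual code.
import json
import random
import time
from datetime import datetime, timezone


def make_reading(sensor_id: str = "sensor-001") -> dict:
    """Build one simulated reading: 20-30 degC and 40-60 % humidity."""
    return {
        "sensor_id": sensor_id,
        "temperature": round(random.uniform(20.0, 30.0), 2),
        "humidity": round(random.uniform(40.0, 60.0), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def main() -> None:
    # Requires kafka-python and a reachable broker.
    from kafka import KafkaProducer

    delay = 1.0
    while True:  # auto-retry with exponential backoff until the broker is up
        try:
            producer = KafkaProducer(
                bootstrap_servers="kafka:9092",
                value_serializer=lambda v: json.dumps(v).encode("utf-8"),
            )
            break
        except Exception:
            time.sleep(delay)
            delay = min(delay * 2, 30.0)  # cap the backoff

    try:
        while True:
            producer.send("sensor-data", make_reading())
            time.sleep(1)  # one reading per second
    except KeyboardInterrupt:
        producer.flush()  # clean shutdown: drain buffered messages
        producer.close()


if __name__ == "__main__":
    main()
```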
**Stream Processor (Spark)**
- Reads from Kafka with "earliest" offset for full data capture
- Strongly typed schema validation using Spark StructType
- Filters malformed records before database insertion
- Batch-writes to PostgreSQL via JDBC with error isolation
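A sketch of the shape such a Spark job could take in PySpark. The schema mirrors the producer's JSON; the JDBC URL, credentials, and table name are placeholders. `is_complete` spells out the null-filtering rule in a form testable without a cluster; the streaming job itself applies the same rule via `na.drop()`:

```python
# Illustrative PySpark job; names and options are assumptions.

def is_complete(record: dict) -> bool:
    """Mirrors the na.drop() rule below: keep a record only if no field is null."""
    fields = ("sensor_id", "temperature", "humidity", "timestamp")
    return all(record.get(f) is not None for f in fields)


def main() -> None:
    # Requires pyspark plus the spark-sql-kafka package.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (DoubleType, StringType, StructField,
                                   StructType)

    schema = StructType([
        StructField("sensor_id", StringType()),
        StructField("temperature", DoubleType()),
        StructField("humidity", DoubleType()),
        StructField("timestamp", StringType()),
    ])

    spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

    readings = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "sensor-data")
        .option("startingOffsets", "earliest")  # full data capture
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("r"))
        .select("r.*")
        .na.drop()  # drop malformed/partial records before the sink
    )

    def write_batch(df, batch_id):
        # Error isolation: a failing batch is logged, the stream continues.
        try:
            (df.write.format("jdbc")
               .option("url", "jdbc:postgresql://postgres:5432/sensors")
               .option("dbtable", "sensor_data")
               .option("user", "postgres")
               .option("password", "postgres")
               .mode("append")
               .save())
        except Exception as exc:
            print(f"batch {batch_id} failed: {exc}")

    readings.writeStream.foreachBatch(write_batch).start().awaitTermination()


if __name__ == "__main__":
    main()
```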
**REST API (FastAPI)**
- `GET /latest` - Returns most recent sensor reading
- `GET /history?limit=N` - Returns N most recent readings in chronological order
- CORS-enabled for cross-origin frontend requests
- Connection pooling with proper error handling
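The two endpoints might be sketched as follows; column names, connection details, and the default `limit` are assumptions. One design point worth showing: `/history` queries newest-first (so the timestamp-DESC index is used) and then reverses the rows so the dashboard receives them in chronological order:

```python
# Illustrative FastAPI sketch; not the repo's actual handlers.

def to_chronological(rows_desc: list) -> list:
    """The SQL fetches newest-first for index efficiency; charts want
    oldest-first, so reverse before returning."""
    return list(reversed(rows_desc))


def create_app():
    # Requires fastapi and psycopg2.
    from fastapi import FastAPI, HTTPException
    from fastapi.middleware.cors import CORSMiddleware
    import psycopg2
    import psycopg2.extras

    app = FastAPI()
    app.add_middleware(  # allow the dashboard's dev-server origin
        CORSMiddleware,
        allow_origins=["http://localhost:5173"],
        allow_methods=["*"],
        allow_headers=["*"],
    )

    def query(sql, params=()):
        conn = psycopg2.connect(host="postgres", dbname="sensors",
                                user="postgres", password="postgres")
        try:
            with conn.cursor(
                    cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        finally:
            conn.close()

    @app.get("/latest")
    def latest():
        rows = query(
            "SELECT * FROM sensor_data ORDER BY timestamp DESC LIMIT 1")
        if not rows:
            raise HTTPException(status_code=404, detail="no readings yet")
        return rows[0]

    @app.get("/history")
    def history(limit: int = 50):
        rows = query(
            "SELECT * FROM sensor_data ORDER BY timestamp DESC LIMIT %s",
            (limit,))
        return to_chronological(rows)

    return app
```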
**Dashboard (React)**
- Live-updating display with current temperature and humidity
- Three interactive Recharts visualizations:
- Temperature trend line chart
- Humidity trend line chart
- Combined dual-axis chart for correlation analysis
- 10-point rolling window to show recent trends
- Loading states and error handling for API failures
## What I Learned
- **Distributed systems design**: Understanding how loosely coupled services communicate through message queues and why decoupling producers from consumers improves system resilience
- **Data pipeline architecture**: Designing end-to-end data flow from ingestion to storage to presentation, with clear boundaries between processing stages
- **Schema enforcement**: Implementing strongly typed data contracts at system boundaries to catch malformed data early and prevent downstream failures
- **Database optimization**: Choosing appropriate indexing strategies based on query patterns rather than generic best practices
- **Error handling in distributed systems**: Isolating failures to individual batches so one bad record doesn't halt the entire pipeline
- **API design**: Building RESTful endpoints with proper CORS configuration, connection management, and meaningful error responses
- **Frontend state management**: Separating concerns between data fetching, caching, and UI rendering in reactive applications
- **Container orchestration**: Managing multi-service dependencies, networking, and environment configuration with Docker Compose
- **Debugging across service boundaries**: Tracing data flow through multiple systems to identify where issues originate
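The container-orchestration point above can be sketched as a minimal `docker-compose.yml` for the service graph in the architecture diagram; image tags, build paths, and credentials here are placeholders, not the repo's actual compose file:

```yaml
# Illustrative compose sketch; services match the architecture diagram.
services:
  zookeeper:
    image: bitnami/zookeeper:latest
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
  kafka:
    image: bitnami/kafka:latest
    depends_on: [zookeeper]
    ports: ["9092:9092"]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: sensors
      POSTGRES_PASSWORD: postgres
    ports: ["5432:5432"]
  producer:
    build: ./producer
    depends_on: [kafka]
  spark:
    build: ./spark
    depends_on: [kafka, postgres]
  api:
    build: ./api
    depends_on: [postgres]
    ports: ["8000:8000"]
  dashboard:
    build: ./dashboard
    depends_on: [api]
    ports: ["5173:5173"]
```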
## License
MIT