{"id":31565498,"url":"https://github.com/devarshpatel1506/real-time-student-attendance-system","last_synced_at":"2026-05-17T10:37:20.443Z","repository":{"id":316738786,"uuid":"1064625518","full_name":"devarshpatel1506/Real-Time-Student-Attendance-System","owner":"devarshpatel1506","description":"A production-style, real-time attendance pipeline using Apache Pulsar, Redis Bloom and HyperLogLog, Apache Cassandra, and Python.","archived":false,"fork":false,"pushed_at":"2025-09-26T10:55:37.000Z","size":13,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-26T12:29:20.134Z","etag":null,"topics":["apache-cassandra","apache-pulsar","big-data","data-engineering","data-pipeline","hyperloglog","python","redis-bloomfilter"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devarshpatel1506.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-26T10:06:21.000Z","updated_at":"2025-09-26T10:58:40.000Z","dependencies_parsed_at":"2025-09-26T12:29:22.734Z","dependency_job_id":null,"html_url":"https://github.com/devarshpatel1506/Real-Time-Student-Attendance-System","commit_stats":null,"previous_names":["devarshpatel1506/real-time-student-attendance-system"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/devarshpatel1506/Real-Time-Student-Attendance-System","repository_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/devarshpatel1506%2FReal-Time-Student-Attendance-System","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devarshpatel1506%2FReal-Time-Student-Attendance-System/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devarshpatel1506%2FReal-Time-Student-Attendance-System/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devarshpatel1506%2FReal-Time-Student-Attendance-System/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devarshpatel1506","download_url":"https://codeload.github.com/devarshpatel1506/Real-Time-Student-Attendance-System/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devarshpatel1506%2FReal-Time-Student-Attendance-System/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278420510,"owners_count":25983868,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-cassandra","apache-pulsar","big-data","data-engineering","data-pipeline","hyperloglog","python","redis-bloomfilter"],"created_at":"2025-10-05T07:08:38.903Z","updated_at":"2025-10-05T07:08:39.855Z","avatar_url":"https://github.com/devarshpatel1506.png","language":"Python","funding_links"
:[],"categories":[],"sub_categories":[],"readme":"# Real-Time Student Attendance System\n\n\u003e **A production-style, real-time attendance pipeline using Apache Pulsar, Redis Bloom and HyperLogLog, Apache Cassandra, and Python.**\n\n---\n\n### 1) Executive Summary\n\nThis project implements a **real-time student attendance tracking system** that simulates RFID swipes, validates them at scale, persists normalized records, and generates analytics such as **unique attendees per lecture**, **latecomer detection**, and **attendance patterns**.\n\n**Why it matters:** Traditional attendance capture is brittle and batch-oriented. This system demonstrates how to design a **streaming, fault-tolerant, and scalable** pipeline suitable for campuses or enterprises.\n\n**Core pipeline:**\n1. **Producer (Python):** Simulates RFID swipe events and publishes to **Apache Pulsar**.\n2. **Processor (Python):** Consumes events, validates **student_id** with **Redis Bloom**, updates **HyperLogLog** for unique counts, and writes canonical records to **Cassandra**.\n3. 
**Analytics (Python):** Reads Cassandra and Redis to derive insights (latecomers, patterns, rankings, consistency, invalid attempts).\n\n---\n\n### 2) Architecture Diagram\n\n```mermaid\nflowchart LR\n  subgraph SRC [RFID Simulation]\n    SIM[RFID Swipe Generator]\n  end\n\n  subgraph PULSAR [Apache Pulsar]\n    TOPIC[attendance-events topic]\n  end\n\n  subgraph PROC [Attendance Processor]\n    VAL[Validate student ID via Redis Bloom]\n    HLL[Update Redis HyperLogLog for unique counts]\n    CASS[Insert canonical rows into Cassandra]\n  end\n\n  subgraph DB [Data Stores]\n    BLOOM[Redis Bloom Filter]\n    HYPER[Redis HyperLogLog]\n    CQL[Cassandra Tables]\n  end\n\n  subgraph ANA [Analytics Jobs]\n    LATE[Latecomer Detection]\n    PATS[Attendance Patterns]\n    RANK[Lecture Rankings]\n    CONS[Consistency Analysis]\n    INV[Invalid Attempt Tracking]\n  end\n\n  SIM --\u003e TOPIC\n  TOPIC --\u003e VAL\n  VAL --\u003e|valid| HLL --\u003e CASS\n  VAL --\u003e|invalid| INV\n  BLOOM --- VAL\n  HYPER --- HLL\n  CQL --- CASS\n\n  CQL --\u003e LATE\n  CQL --\u003e PATS\n  CQL --\u003e CONS\n  CQL --\u003e RANK\n  HYPER --\u003e RANK\n```\n\n### Key Roles\n\n- **Apache Pulsar** – durable, horizontally scalable pub/sub for event ingress; supports shared subscriptions, acknowledgements, and back pressure.  \n- **Redis** – Bloom Filter validates student existence, HyperLogLog tracks unique attendees per lecture and date.  \n- **Cassandra** – write-optimized, partitioned storage for time-series attendance events and queries like *by lecture* or *by date*.  
\n\n---\n\n### 3) Tech Stack \u0026 Justification\n\n| Component        | Choice              | Why it fits                                                                 |\n|------------------|---------------------|-----------------------------------------------------------------------------|\n| **Ingest**       | Apache Pulsar       | Segregated storage \u0026 compute via BookKeeper, multi-tenancy, flexible subs   |\n| **Fast validation** | Redis Bloom      | Low-memory membership test with small false positive rate                   |\n| **Unique counts** | Redis HyperLogLog  | Tiny memory footprint for approximate distinct per lecture per day          |\n| **Storage**      | Apache Cassandra    | High write throughput, linear scalability, query-first data modeling        |\n| **Language**     | Python              | Mature clients for Pulsar, Redis, Cassandra; fast prototyping environment   |\n\n\n### 4) Event Model \u0026 Keys\n\n**Event schema JSON**\n\n```json\n{\n  \"event_id\": \"d82e3a4e-2c21-4a5a-a6bb-70e8c12f66c5\",\n  \"student_id\": \"S123456\",\n  \"lecture_id\": \"CS101-L1\",\n  \"gate_id\": \"GATE-02\",\n  \"timestamp\": \"2025-03-19T09:05:12Z\",\n  \"action\": \"enter\"\n}\n```\n\n\n**4.2 Redis Keys**\n\n- **Bloom Filter:** `bf:students` (capacity = 100000, error_rate = 0.01)  \n- **HyperLogLog:** `hll:unique:\u003clecture_id\u003e:\u003cYYYY-MM-DD\u003e`  \n  - Example: `hll:unique:CS101-L1:2025-03-19`\n\n---\n\n### 5) Cassandra Data Modeling\n\n**5.1 Keyspace**\n\n```sql\nCREATE KEYSPACE IF NOT EXISTS attendance\nWITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};\n```\n\n**5.2 Tables**\n\nCanonical events, partitioned by lecture and day\n\n```sql\nCREATE TABLE IF NOT EXISTS attendance.events_by_lecture_day (\n  lecture_id   text,\n  day          date,\n  ts           timestamp,\n  student_id   text,\n  gate_id      text,\n  action       text,\n  event_id     uuid,\n  PRIMARY KEY ((lecture_id, day), ts, student_id)\n) WITH 
CLUSTERING ORDER BY (ts ASC);\n```\n\nAlternate query by student and day\n\n```sql\nCREATE TABLE IF NOT EXISTS attendance.events_by_student_day (\n  student_id   text,\n  day          date,\n  ts           timestamp,\n  lecture_id   text,\n  gate_id      text,\n  action       text,\n  event_id     uuid,\n  PRIMARY KEY ((student_id, day), ts, lecture_id)\n) WITH CLUSTERING ORDER BY (ts ASC);\n```\n\n### 6) Component Design\n\n**6.1 Data Generator**: `data_generator.py`\n- Creates synthetic IDs and timestamps.  \n- Publishes events to Pulsar topic **`attendance-events`** with batching enabled.  \n- Configurable parameters: emit rate (events/sec), student population size, and late percentage to simulate tardiness.  \n- Can preload **Redis Bloom** with valid student IDs.  \n\n**6.2 Attendance Processor**: `attendance_processor.py`\n- Pulsar consumer on **`attendance-events`** with subscription type = **shared**.  \n- **Validation Path:**  \n  - Check membership with `BF.EXISTS bf:students \u003cstudent_id\u003e`.  \n  - If **false** → mark as invalid and optionally publish to **`attendance-invalid`**.  \n  - If **true** → proceed to counting \u0026 persistence.  \n- **Counting Path:**  \n  - Add student to HLL: `PFADD hll:unique:\u003clecture_id\u003e:\u003cday\u003e \u003cstudent_id\u003e`.  \n- **Persistence Path:**  \n  - Insert canonical rows into Cassandra:  \n    - `events_by_lecture_day`  \n    - `events_by_student_day` (optional for queries by student).  \n- **Reliability:**  \n  - Ack only after successful Redis + Cassandra writes.  \n  - Negative ack on failure triggers redelivery.  \n  - Optional idempotency via `event_id`.  \n\n**6.3 Analytics**: `attendance_analysis.py`\n- **Latecomer Detection:** flag students with `ts \u003e lecture_start_time + grace_period`.  \n- **Patterns:** aggregate by day-of-week; compute mean/median counts.  \n- **Top Lectures:** rank by unique daily counts (from HLL) or by raw event counts.  
\n- **Consistency:** find students with attendance rate ≥ threshold across N sessions.  \n- **Invalid Attempts:** report invalid swipes grouped by gate or time.  \n\n\n### 7) Streaming Joins and Watermarks (Spark)\n\n- **Micro-batch trigger:** e.g., `processingTime=30s` for near real-time updates.  \n- **Event-time windows:** e.g., 30-minute sliding windows with 5-minute slide to align heterogeneous streams.  \n- **Watermarking:** e.g., `withWatermark(\"event_time\", \"20 minutes\")` to bound state and tolerate late data.  \n- **Stateful processing \u0026 checkpointing:** ensures fault tolerance with **exactly-once semantics** when coupled with Kafka offsets.  \n\n---\n\n### 8) Online Feature Engineering\n\n- **Rolling statistics:** mean, median, max of flow, speed, occupancy within the window.  \n- **Lags \u0026 deltas:** \\( flow_t - flow_{t-1} \\), capturing trend and acceleration.  \n- **Weather impact:** features like `precip_avg`, `wind_avg`, plus interaction terms (e.g., \\( occ\\_avg \\times precip\\_avg \\)).  \n- **Incident features:** severity, blocked lanes, time since incident.  
\n- **Edge features:** map station pairs → road segments for routing.\n\n\n### 9) Running the System\n\n**9.1 Prerequisites**\n- Python **3.8+**  \n- **Docker** (recommended)  \n- Services: **Pulsar**, **Redis Stack**, **Cassandra**  \n\n---\n\n**9.2 Start Services (Docker)**\n\n```bash\n# Pulsar standalone\ndocker run -d --name pulsar -p 6650:6650 -p 8080:8080 apachepulsar/pulsar:latest bin/pulsar standalone\n\n# Redis Stack\ndocker run -d --name redis -p 6379:6379 redis/redis-stack-server\n\n# Cassandra\ndocker run -d --name cassandra -p 9042:9042 cassandra:latest\n```\n\n**9.3 Python Environment**\n\n```bash\npython -m venv .venv \u0026\u0026 source .venv/bin/activate\npip install -r requirements.txt\n```\n\n**9.4 Configuration Example: `attendance_system/config/config.py`**\n\n```python\nPULSAR_SERVICE_URL = \"pulsar://localhost:6650\"\nPULSAR_TOPIC = \"attendance-events\"\nPULSAR_SUBSCRIPTION = \"attendance-sub\"\n\nREDIS_URL = \"redis://localhost:6379/0\"\nBLOOM_KEY = \"bf:students\"\nBLOOM_ERROR_RATE = 0.01\nBLOOM_CAPACITY = 100_000\n\nCASSANDRA_CONTACT_POINTS = [\"127.0.0.1\"]\nCASSANDRA_KEYSPACE = \"attendance\"\n```\n\n**9.5 Run**\n\n```bash\n# 1) Generate simulated events\npython attendance_system/src/data_generator.py\n\n# 2) Start processor in another terminal\npython attendance_system/src/attendance_processor.py\n\n# 3) Run analytics\npython attendance_system/src/attendance_analysis.py\n```\n\n### 10) Operations and Observability\n\n- **Structured logging** for sends, receives, Redis and Cassandra actions, acknowledgements, and redeliveries.  \n- **Back pressure** handled via Pulsar consumer flow control and producer batching.  \n- **DLQ (Dead Letter Queue):** optional for invalid or poison events.  
\n- **Throughput knobs:**  \n  - Producer batch size and linger ms  \n  - Consumer prefetch size  \n  - Redis pipelines  \n  - Cassandra batch writes (used sparingly)  \n\n---\n\n### 11) Trade-offs and Alternatives\n\n- **Pulsar vs Kafka:** Pulsar’s multi-tenancy and BookKeeper separation vs Kafka’s simpler operations in single-tenant mode. Choose based on org expertise and tenancy needs.  \n- **Bloom vs Set:** Bloom provides **O(1)** membership checks in memory that is fixed by the configured capacity and error rate, at the cost of occasional false positives; a Redis Set is exact but its memory grows with every member.  \n- **HyperLogLog vs Exact Counting:** a Redis HyperLogLog uses at most about 12 KB per key with a standard error of roughly 0.81%; exact per-lecture distinct sets grow with the number of attendees.  \n- **Cassandra vs Relational DB:** Cassandra excels at write-heavy, time-series, query-first design; relational DBs may bottleneck at scale unless sharded.  \n\n---\n\n### 12) Security, Reliability, Data Quality\n\n- **Security:** Use Pulsar auth (JWT), Redis auth, and Cassandra credentials via environment variables.  \n- **Idempotency:** Deduplicate by `event_id`; for example, Redis `SET event:\u003cuuid\u003e 1 NX EX \u003cttl\u003e`, which sets the key only if it is absent and expires it after the TTL.  \n- **Schema Evolution:** Add a `schema_version` field in the event schema to support forward compatibility.  \n- **Time Normalization:** Emit **UTC timestamps** and normalize to `day` within the processor.  
\n\n### 13) Sample Queries (Cassandra)\n\n```sql\n-- By lecture and day ordered by time\nSELECT * FROM attendance.events_by_lecture_day\n WHERE lecture_id='CS101-L1' AND day='2025-03-19';\n\n-- Latest 50 events for a student on a given day\nSELECT * FROM attendance.events_by_student_day\n WHERE student_id='S123456' AND day='2025-03-19'\n ORDER BY ts DESC LIMIT 50;\n```\n\n### 14) Project Structure\n\n```text\nattendance_system/\n├── config/\n│   └── config.py\n├── src/\n│   ├── data_generator.py\n│   ├── attendance_processor.py\n│   └── attendance_analysis.py\n├── requirements.txt\n└── README.md\n```\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevarshpatel1506%2Freal-time-student-attendance-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevarshpatel1506%2Freal-time-student-attendance-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevarshpatel1506%2Freal-time-student-attendance-system/lists"}