Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fatiharslan-cmd/datagenerator-kafka-
Generate random data based on employees in PostgreSQL and send them to Kafka
https://github.com/fatiharslan-cmd/datagenerator-kafka-
datagenerator kafka postgre python sbt scala
Last synced: about 20 hours ago
JSON representation
Generate random data based on employees in PostgreSQL and send them to Kafka
- Host: GitHub
- URL: https://github.com/fatiharslan-cmd/datagenerator-kafka-
- Owner: FatihArslan-cmd
- License: mit
- Created: 2024-12-12T14:25:09.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-15T18:42:38.000Z (about 2 months ago)
- Last Synced: 2025-02-08T22:47:02.960Z (4 days ago)
- Topics: datagenerator, kafka, postgre, python, sbt, scala
- Language: Scala
- Homepage:
- Size: 114 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Kafka-Spark-Cassandra Expense Tracker
This project demonstrates an end-to-end streaming application to generate, process, and store expense data using Kafka, Spark, and Cassandra. It also integrates with PostgreSQL for user data management.
## Overview
1. **Data Generator**:
- Generates random expense data for users fetched from a PostgreSQL database.
- Publishes the expense data to Kafka topics.2. **Stream Handler**:
- Consumes Kafka streams.
- Processes and writes the expense data to a Cassandra database.
![]()
## Project Setup### Prerequisites
Ensure the following are installed and properly configured on your Ubuntu system:
- Java 8
- Apache Kafka
- Apache Spark (version 2.4.5)
- Apache Cassandra
- PostgreSQL
- sbt (Scala Build Tool)## Getting Started
### insert emp.csv and dept.csv files in your postgre db
https://github.com/ozmen54/SWE307-2023/tree/main/Pro1/data
[CSV files](https://github.com/ozmen54/SWE307-2023/tree/main/Pro1/data)
### 1. Start PostgreSQLEnsure PostgreSQL is running and contains a table named `emp` with user data.
```bash
sudo service postgresql start
```### 2. Start Cassandra
Start the Cassandra server:
```bash
./bin/cassandra -f
```### 3. Start Kafka and Zookeeper
Start Zookeeper:
```bash
bin/zookeeper-server-start.sh config/zookeeper.properties
```Start Kafka:
```bash
bin/kafka-server-start.sh config/server.properties
```### 4. Create Kafka Topic
Create a topic to publish expense data:
```bash
bin/kafka-topics.sh --create --topic user-expenses --bootstrap-server localhost:9092
```### 5. Run Data Generator
Use the `data_generator.py` script to generate and publish expense data:
```bash
python3 data_generator.py
```### 6. Verify Kafka Data
Consume messages from Kafka to verify the data:
```bash
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --include "user_.*_expenses" --from-beginning
```### 7. Build and Run Stream Handler
1. Package the project using sbt:
```bash
sbt package
```2. Submit the Spark job:
```bash
spark-submit --class StreamHandler \
--master local[*] \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,\
com.datastax.cassandra:cassandra-driver-core:4.0.0,\
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
target/scala-2.11/stream-handler_2.11-1.0.jar
```## Key Commands Summary
```bash
# Start PostgreSQL
sudo service postgresql start# Start Cassandra
./bin/cassandra -f# Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties# Start Kafka
bin/kafka-server-start.sh config/server.properties# Create Kafka Topic
bin/kafka-topics.sh --create --topic user-expenses --bootstrap-server localhost:9092# Consume Kafka Messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --include "user_.*_expenses" --from-beginning# Package with sbt
sbt package# Submit Spark Job
spark-submit --class StreamHandler \
--master local[*] \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,\
com.datastax.cassandra:cassandra-driver-core:4.0.0,\
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 \
target/scala-2.11/stream-handler_2.11-1.0.jar
```## Project Repository
This project is part of the [Kafka-Spark-Cassandra-Expense-Tracker](https://github.com/FatihArslan-cmd/Kafka-Spark-Cassandra-Expense-Tracker) repository.
## License
This project is open-source and available under the MIT License.