# EmployeesTracker
https://github.com/shayartt/employeestracker

Real-time ingestion and monitoring of employee information (activity + location).

Project idea: stream employee information (activity + location), apply ETL, and finally run some analytics pipelines.

NOTE: Don't forget to update the `.env` file with your own configuration.
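
A minimal sketch of how code might load that configuration, using `python-dotenv`. The variable names here are assumptions for illustration; the repo's actual `.env` keys may differ:

```python
# Hypothetical .env loading; the variable names are illustrative, not the repo's actual keys.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

AWS_REGION = os.environ["AWS_REGION"]            # e.g. "eu-central-1"
S3_BUCKET = os.environ["S3_BUCKET"]              # e.g. "iceberg-track"
KAFKA_BOOTSTRAP = os.environ["KAFKA_BOOTSTRAP"]  # e.g. "localhost:9092"
```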

## Technologies used

- Apache Iceberg (AWS Glue + Athena) with S3 as the storage layer.
- Kafka Connect for data ingestion.
- Spark (EMR) for the pipelines.
- OpenSearch for logs.
- Kibana for visualization.
- AWS Lambda for scheduled (cron) jobs (see the sketch below).
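
As a rough sketch of the Lambda cron idea, a scheduled function could trigger Iceberg file compaction through Athena. This is not the repo's actual function; the database, table, and output location are placeholders:

```python
# Hypothetical scheduled Lambda: compacts small Iceberg data files via Athena.
# Database, table, and output location are placeholders, not the repo's values.
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

def lambda_handler(event, context):
    # Athena's OPTIMIZE statement rewrites small files into larger ones (bin packing).
    response = athena.start_query_execution(
        QueryString="OPTIMIZE employees_db.employees REWRITE DATA USING BIN_PACK",
        QueryExecutionContext={"Database": "employees_db"},
        ResultConfiguration={"OutputLocation": "s3://iceberg-track/athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```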

## High-Level Diagram

![High-Level Diagram](docs/diagrams/EmployeesTracker.drawio.png?raw=true "High-Level")

### Credits

Apache Kafka & Kafka Connect installation: https://www.youtube.com/watch?v=_RdMCc4HGPY

## TODO: switch to AWS Glue Streaming or EMR instead of Kafka Connect later (partitioning limitations, etc.)

#### Setup (Kafka)

```bash
docker-compose up -d
```

Connect to the Docker container running ZooKeeper and run these commands:

```bash
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic employees
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic employees_activity
```
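
Once the topics exist, a quick way to sanity-check them is to produce a test record. A minimal sketch using `kafka-python`; the field names here are made up for illustration, not the project's actual schema:

```python
# Hypothetical test producer; field names are illustrative, not the real schema.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("employees", {
    "employee_id": 42,
    "lat": 52.52,
    "lon": 13.40,
    "ts": datetime.now(timezone.utc).isoformat(),
})
producer.flush()  # block until the record is actually delivered
```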

#### Create the Kafka connectors

The flush size is set to 100 for now; we may change it later, since we don't need to be very precise for this project.

```bash
curl -i -X PUT -H "Accept:application/json" \
  -H "Content-Type:application/json" http://localhost:8083/connectors/employees-s3-conn/config \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "tasks.max": "1",
    "topics": "employees",
    "s3.region": "eu-central-1",
    "s3.bucket.name": "iceberg-track",
    "flush.size": "100",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "schema.compatibility": "NONE",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "86400000",
    "path.format": "yyyy/MM/dd",
    "locale": "en",
    "timezone": "UTC"
  }'
```

```bash
curl -i -X PUT -H "Accept:application/json" \
  -H "Content-Type:application/json" http://localhost:8083/connectors/employees_activity-s3-conn/config \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "tasks.max": "1",
    "topics": "employees_activity",
    "s3.region": "eu-central-1",
    "s3.bucket.name": "iceberg-track",
    "flush.size": "100",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "schema.compatibility": "NONE",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "86400000",
    "path.format": "yyyy",
    "locale": "en",
    "timezone": "UTC"
  }'
```
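
To confirm both connectors came up, Kafka Connect's REST API exposes a status endpoint per connector; a small check script (plain `curl` against the same URLs works just as well):

```python
# Check connector state via the Kafka Connect REST API.
import requests  # pip install requests

for name in ("employees-s3-conn", "employees_activity-s3-conn"):
    status = requests.get(f"http://localhost:8083/connectors/{name}/status").json()
    print(name, status["connector"]["state"])  # expect "RUNNING"
```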

## To run the Spark analyzer job

```bash
spark-submit --jars tools/opensearch-spark-30_2.12-1.0.0.jar analyzer.py
```
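
For orientation, here is a stripped-down sketch of what a job like `analyzer.py` might do: read the Parquet files the S3 sink wrote and push an aggregate to OpenSearch through the bundled `opensearch-hadoop` Spark connector. The bucket path, index name, and column names are assumptions, not the repo's actual values:

```python
# Hypothetical Spark job: aggregate employee activity and index it in OpenSearch.
# Bucket path, index name, and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("employees-analyzer").getOrCreate()

# Read the Parquet files written by the S3 sink connector.
events = spark.read.parquet("s3a://iceberg-track/topics/employees_activity/")

# Example aggregate: activity counts per employee per day.
daily = (events
         .groupBy("employee_id", F.to_date("ts").alias("day"))
         .agg(F.count("*").alias("events")))

# Write to OpenSearch through the connector jar passed to spark-submit above.
(daily.write
      .format("org.opensearch.spark.sql")
      .option("opensearch.nodes", "localhost")
      .option("opensearch.port", "9200")
      .mode("append")
      .save("employees-daily-activity"))
```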

## Conclusion

The technology choices worked out very well for this project, and I really enjoyed learning them. I did hit some limitations using AWS Athena with Iceberg, such as not being able to use the `sorted_by` table property to further optimize my IO and file sizes. However, I was able to experiment with partitioning, compression, and file formats, and I used Iceberg's excellent statistics to monitor the effect of each change. This helped me understand how much storage configuration matters for both cost and performance.
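
As an illustration of that kind of monitoring, Athena can query an Iceberg table's `$files` metadata table to show per-file record counts and sizes, which makes the effect of partitioning and compression changes visible. A hypothetical check (the database and table names, and the use of `awswrangler`, are assumptions):

```python
# Hypothetical inspection of Iceberg file-level statistics via Athena's metadata tables.
# Database and table names are placeholders. Uses awswrangler (aws-sdk-pandas).
import awswrangler as wr

# "$files" exposes per-file stats (record counts, sizes) for an Iceberg table.
df = wr.athena.read_sql_query(
    sql='SELECT file_path, record_count, file_size_in_bytes '
        'FROM "employees_db"."employees$files"',
    database="employees_db",
)
print(df.sort_values("file_size_in_bytes", ascending=False).head())
```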