https://github.com/daniel-elston/real-time-reddit-scalable-processing
Scaling NLP processing pipelines with Dask and PySpark, utilising Apache Kafka real-time data streaming, for optimal LLM training
- Host: GitHub
- URL: https://github.com/daniel-elston/real-time-reddit-scalable-processing
- Owner: Daniel-Elston
- Created: 2025-01-22T14:21:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-24T15:35:31.000Z (over 1 year ago)
- Last Synced: 2025-02-24T16:40:34.423Z (over 1 year ago)
- Topics: apache-kafka, dask-distributed, embeddings, llm, llm-training, nlp, pyspark, scalability
- Language: Python
- Homepage:
- Size: 3.07 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# README
### Goal
Build scalable text data processing pipelines for efficient model training with Dask and PySpark, utilising Apache Kafka for real-time data streaming.
---
### Results
1. Results can be found in the ``reports/result.xlsx`` file.
2. The file contains partition 2 of the 4 partitions produced by the Dask processing step.
---
### **QRG:**
1. Start the Docker application.
2. Open two separate WSL terminals (T1 and T2).
3. In T1, run ``docker-compose up --build``.
4. Open the file ``config/settings`` and set the Config to one of 'extract', 'transform' or 'results'.
5. Once all containers are running, run ``python main.py`` in T2.
6. Data is streamed to the terminal and also saved to ``data/temp/reddit_comments.json``.
7. Sample results of the PySpark and Dask processing can be found as SDOs in ``data/results/*.xlsx``.
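
Step 4 picks which stage runs when ``python main.py`` is called. As a rough illustration only, the sketch below shows one way a config value could be mapped to a stage function. The names ``STAGES`` and ``run_stage`` and the three placeholder functions are hypothetical and are not taken from this repository; the project's actual ``config/settings`` and ``main.py`` may be organised differently.

```python
from typing import Callable, Dict


# Hypothetical placeholders: the real stage implementations are not shown in this README.
def extract() -> str:
    return "extract stage ran"


def transform() -> str:
    return "transform stage ran"


def results() -> str:
    return "results stage ran"


# Map each allowed config value to the function that runs that stage.
STAGES: Dict[str, Callable[[], str]] = {
    "extract": extract,
    "transform": transform,
    "results": results,
}


def run_stage(name: str) -> str:
    """Run the stage selected by the config value, rejecting unknown names."""
    if name not in STAGES:
        raise ValueError(f"Unknown stage {name!r}; expected one of {sorted(STAGES)}")
    return STAGES[name]()


if __name__ == "__main__":
    print(run_stage("extract"))
```

Rejecting unknown values early means a typo in the config fails with a clear message before any containers or streams are touched.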
---
### **Requirements:**
- WSL2
- Ubuntu 24.04
- Python 3.12.*
---
### Ensure Java Runtime Env:
``sudo apt update``
``sudo apt install openjdk-11-jdk``
``readlink -f $(which java)``
``export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64``
**Add lines to ~/.bashrc (use ~/.zshrc instead if your shell is zsh):**
``echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc``
``echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc``
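
PySpark needs a working Java installation, so it can help to confirm the environment before running the pipeline. The helper below is a minimal sketch of such a check, using only the Python standard library. ``check_java_env`` is a hypothetical name and is not part of this project.

```python
import os
import shutil
from typing import List, Mapping


def check_java_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems found with the Java setup (empty if none)."""
    problems: List[str] = []

    # JAVA_HOME should be set and point at an existing directory.
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")

    # The java executable should be discoverable on PATH.
    if shutil.which("java", path=env.get("PATH")) is None:
        problems.append("java executable not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_java_env()
    print("Java environment looks OK" if not issues else "\n".join(issues))
```

If the check reports a missing directory, compare ``JAVA_HOME`` with the output of ``readlink -f $(which java)``, since the install path can differ between systems.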
---
### Apologies & Disclaimer
This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.
I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.
Thank you for your understanding.