https://github.com/daniel-elston/real-time-reddit-scalable-processing

Scaling NLP processing pipelines with Dask and PySpark, utilising Apache Kafka real-time data streaming, for optimal LLM training

apache-kafka dask-distributed embeddings llm llm-training nlp pyspark scalability

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/daniel-elston/real-time-reddit-scalable-processing
Owner: Daniel-Elston
Created: 2025-01-22T14:21:57.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-24T15:35:31.000Z (over 1 year ago)
Last Synced: 2025-02-24T16:40:34.423Z (over 1 year ago)
Topics: apache-kafka, dask-distributed, embeddings, llm, llm-training, nlp, pyspark, scalability
Language: Python
Homepage:
Size: 3.07 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# README

### Goal
Build scalable text-processing pipelines with Dask and PySpark, fed by real-time data streamed through Apache Kafka, to prepare data for efficient model training.

---

### Results
1. Results can be found in the ``reports/result.xlsx`` file.
2. This file holds partition number 2 of the 4 partitions produced by the Dask processing.
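
To make "one partition out of four" concrete, here is a minimal pure-Python sketch of splitting records into contiguous, near-equal partitions. It is an illustration only: it is not Dask's API or this project's code, ``split_into_partitions`` and the sample comments are hypothetical, and the README does not state whether "partition number 2" is counted from zero or from one.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_partitions(records: Sequence[T], n_partitions: int) -> List[List[T]]:
    """Split records into n contiguous partitions of near-equal size."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be >= 1")
    # Earlier partitions absorb the remainder, one extra record each.
    base, extra = divmod(len(records), n_partitions)
    partitions: List[List[T]] = []
    start = 0
    for i in range(n_partitions):
        size = base + (1 if i < extra else 0)
        partitions.append(list(records[start:start + size]))
        start += size
    return partitions


# Hypothetical sample data standing in for streamed comments.
comments = [f"comment {i}" for i in range(10)]
parts = split_into_partitions(comments, 4)
second = parts[1]  # the second of the four partitions (one-based counting assumed)
```

Each partition can be processed independently, which is why a single partition is enough to show a representative sample of the output.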

---

### **Quick Reference Guide (QRG):**

1. Start the Docker app.
2. Open two separate WSL terminals (T1 and T2).
3. In T1, run ``docker-compose up --build``.
4. Open the ``config/settings`` file and set the Config to one of 'extract', 'transform' or 'results'.
5. Once all containers are running, in T2, run ``python main.py``.
6. Data is streamed to the terminal and also saved to ``data/temp/reddit_comments.json``.
7. Sample results of the PySpark and Dask processing can be found as SDOs in ``data/results/*.xlsx``.
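
Step 4 can be pictured as a small mode switch. The sketch below is an assumption about how such a setting might be validated and dispatched, not the project's actual code: the ``Config`` class shown here, its ``mode`` field, and ``run`` are hypothetical names, and the real contents of ``config/settings`` may differ.

```python
from dataclasses import dataclass

# The three stages named in the guide above.
VALID_MODES = ("extract", "transform", "results")


@dataclass(frozen=True)
class Config:
    # Hypothetical field name; the real settings file may use another.
    mode: str = "extract"

    def __post_init__(self) -> None:
        # Fail early on a typo rather than partway through a run.
        if self.mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {self.mode!r}")


def run(config: Config) -> str:
    """Dispatch to a placeholder handler for the selected stage."""
    handlers = {mode: (lambda m=mode: f"running {m} stage") for mode in VALID_MODES}
    return handlers[config.mode]()


message = run(Config(mode="transform"))
```

Validating the mode when the config is constructed means a misspelled value is rejected before any containers or streams are touched.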

---

### **Requirements:**

- WSL2
- Ubuntu 24.04
- Python 3.12.*

---

### Ensure Java Runtime Environment:

``sudo apt update``

``sudo apt install openjdk-11-jdk``

``readlink -f $(which java)``

``export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64``

**Add lines to ~/.bashrc (use ~/.zshrc instead if your shell is zsh):**

``echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc``

``echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc``
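
PySpark needs a Java runtime, so it can help to confirm the setup above from Python before running ``python main.py``. The following is a minimal sketch using only the standard library; ``find_java`` is a hypothetical helper, not part of this project.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def find_java(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the java binary path, preferring $JAVA_HOME/bin/java, else PATH."""
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = Path(java_home) / "bin" / "java"
        if candidate.is_file():
            return candidate
    # Fall back to searching the PATH taken from the same environment mapping.
    found = shutil.which("java", path=env.get("PATH"))
    return Path(found) if found else None


if __name__ == "__main__":
    java = find_java()
    print(f"java found at {java}" if java else "java not found; check JAVA_HOME and PATH")
```

Passing the environment as a parameter keeps the check easy to exercise with a custom mapping instead of the real shell environment.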

---

### Apologies & Disclaimer

This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.

I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.

Thank you for your understanding.
