https://github.com/daniel-elston/real-time-reddit-scalable-processing

Scaling NLP processing pipelines with Dask and PySpark, utilising Apache Kafka real-time data streaming, for optimal LLM training

apache-kafka dask-distributed embeddings llm llm-training nlp pyspark scalability


# README

### Goal
Build scalable text data processing pipelines for efficient model training with Dask and PySpark, utilising Apache Kafka for real-time data streaming.

---

### Results
1. Results can be found in the ``reports/result.xlsx`` file.
2. This file contains partition 2 of the 4 partitions produced by the Dask processing.

---

### **Quick Reference Guide (QRG):**

1. Start the Docker app
2. Open two separate WSL terminals (T1 and T2)
3. In T1, run ``docker-compose up --build``
4. Open the ``config/settings`` file and set the Config to one of 'extract', 'transform' or 'results'
5. Once all containers are running, in T2, run ``python main.py``
6. Data is streamed in the terminal and also saved to ``data/temp/reddit_comments.json``
7. Sample results of the PySpark and Dask processing can be found as SDOs in ``data/results/*.xlsx``
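
To inspect the saved stream outside the pipeline, the JSON file from step 6 can be read with the Python standard library. This is a minimal sketch and not part of the project: ``load_comments`` is a hypothetical helper. The file layout is not documented here, so the sketch assumes either a single JSON document or one JSON object per line.

```python
import json
from pathlib import Path


def load_comments(path):
    """Return a list of comment records from a JSON or JSON Lines file.

    Hypothetical helper: the file layout is an assumption, so both a single
    JSON document and a one-object-per-line file are accepted.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse each non-empty line separately.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single top-level value is wrapped so callers always get a list.
    return data if isinstance(data, list) else [data]
```

After a run, ``load_comments("data/temp/reddit_comments.json")`` should return the saved records as a list, provided the file matches one of the two assumed layouts.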

---

### **Requirements:**

- WSL2
- Ubuntu 24.04
- Python 3.12.*

---

### Ensure Java Runtime Env:

``sudo apt update``

``sudo apt install openjdk-11-jdk``

``readlink -f $(which java)``

``export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64``

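**Persist the variables in your shell startup file (the commands below write to ``~/.bashrc``; use ``~/.zshrc`` instead if your shell is zsh):**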

``echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc``

``echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc``
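
PySpark needs a working Java runtime, so it can help to confirm the setup above before starting the pipeline. The following is a minimal sketch using only the Python standard library; ``check_java_env`` is a hypothetical helper and not part of the project.

```python
import os
import shutil


def check_java_env(environ=None):
    """Report whether JAVA_HOME is set and a java executable is reachable.

    Hypothetical helper; pass a mapping to check something other than
    the current process environment.
    """
    env = os.environ if environ is None else environ
    java_home = env.get("JAVA_HOME")
    # Look for java on the PATH from the same mapping; if the mapping has
    # no PATH entry, shutil.which falls back to the process PATH.
    java_path = shutil.which("java", path=env.get("PATH"))
    return {
        "java_home": java_home,
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        "java_on_path": java_path,
    }
```

With the exports above in effect, ``check_java_env()["java_home"]`` would be expected to report ``/usr/lib/jvm/java-11-openjdk-amd64``. A ``java_on_path`` value of ``None`` suggests the ``PATH`` line has not been applied to the current shell.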

---

### Apologies & Disclaimer

This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.

I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.

Thank you for your understanding.