https://github.com/webdevcaptain/distributed-word-counter
Distributed Word Counter using Apache Spark (PySpark)
- Host: GitHub
- URL: https://github.com/webdevcaptain/distributed-word-counter
- Owner: WebDevCaptain
- License: mit
- Created: 2025-02-01T17:20:45.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-01T17:40:59.000Z (5 months ago)
- Last Synced: 2025-02-01T18:29:18.269Z (5 months ago)
- Topics: distributed-computing, pyspark
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Distributed Word Counter with Apache Spark
## Introduction
This is a simple example of a distributed word counter built with Apache Spark.
It counts the occurrences of individual words in a small sample dataset of text lines.

---
## Setup
This setup targets local development with Docker Compose. It runs three Spark containers: one Spark master and two Spark worker nodes.
The [word counter Python script](./word_count.py) is mounted into the master node, and we connect to the master to execute it. `PySpark` is used to create an RDD from the text lines and perform a distributed word count.

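
To make the counting steps concrete, here is a minimal plain-Python sketch of the pattern an RDD word count typically follows: split each line into words, count them per partition, then merge the partial counts by key (in PySpark these steps are commonly written with `flatMap`, `map`, and `reduceByKey`). This is not the repository's `word_count.py`, which is not reproduced here. The helper names `split_into_partitions`, `count_partition`, and `merge_counts` are hypothetical, and the choice of two partitions is an assumption made only to mirror the two workers. It runs without Spark; the input is the sample dataset from this README, and the order in which words are printed may differ from a real Spark run.

```python
from collections import defaultdict
from functools import reduce

# Sample lines copied from the "Sample Input" section of this README.
text_lines = [
    "Spark is fast",
    "Spark is scalable",
    "Distributed computing with Spark",
    "Python and Spark go well together",
    "Python provides a simple interface to Spark",
    "Other options include Java, Scala, and R, but Python is the most beginner-friendly",
]


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into lists (hypothetical stand-in for RDD partitions)."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def count_partition(lines):
    """Count words within one partition (roughly the split and local-count step)."""
    counts = defaultdict(int)
    for line in lines:
        # Whitespace splitting keeps punctuation attached, e.g. "Java,".
        for word in line.split():
            counts[word] += 1
    return dict(counts)


def merge_counts(left, right):
    """Add up counts for matching words (roughly the reduce-by-key step)."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


# Assumption: two partitions, chosen only to mirror the two worker nodes.
partitions = split_into_partitions(text_lines, 2)
partial_counts = [count_partition(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, {})

print("Distributed Word Count Results:")
for word, n in word_counts.items():
    print(f"{word}: {n}")
```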
---
## Usage
1. Start the Spark containers using Docker Compose:
```bash
docker-compose up -d
```
2. Run the distributed word count example:
```bash
bash submit.sh
```
---
## Sample Input
```python
text_lines = [
"Spark is fast",
"Spark is scalable",
"Distributed computing with Spark",
"Python and Spark go well together",
"Python provides a simple interface to Spark",
"Other options include Java, Scala, and R, but Python is the most beginner-friendly",
]
```
## Sample Output
```text
Distributed Word Count Results:
computing: 1
and: 2
well: 1
include: 1
Scala,: 1
most: 1
is: 3
Other: 1
Java,: 1
the: 1
options: 1
beginner-friendly: 1
together: 1
provides: 1
a: 1
simple: 1
fast: 1
interface: 1
to: 1
go: 1
R,: 1
Distributed: 1
with: 1
Python: 3
but: 1
Spark: 5
scalable: 1
```
---
## License Information
This project is released under the [MIT License](LICENSE).