https://github.com/ctadeodev/spark-word-counter

A Dockerized PySpark application for counting word frequencies in an input PDF document
docker pdf-document-processor pyspark python spark


# PDF word counter with Spark

A Dockerized PySpark application for counting word frequencies in a PDF document

In this basic configuration, we will create the following:

1. A word_count.py script that lists the 10 most frequent words in a PDF document, along with the number of occurrences of each
2. A 3-node (1 master, 2 workers) Spark cluster that runs in Docker

I ran both on my local machine, but in theory the driver and the cluster can run on separate machines.
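
To make the counting step concrete, here is a minimal sketch of how a top-10 word count could look in PySpark. It is not the actual word_count.py from this repo: the function names (`tokenize`, `top_words_local`, `top_words_spark`), the regex-based tokenizer, and the tie-breaking rule are all assumptions for illustration. It also assumes the PDF text has already been extracted into a list of strings, since the extraction library is not covered here. A pure-Python version is included so the result can be checked without a cluster.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", line.lower())


def top_words_local(lines, n=10):
    """Pure-Python reference: the n most frequent words, ties broken alphabetically."""
    counts = Counter(word for line in lines for word in tokenize(line))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def top_words_spark(lines, n=10, master="local[*]"):
    """Same computation as top_words_local, expressed with Spark RDD operations.

    `master` is an assumption: "local[*]" runs Spark in-process, while a
    standalone cluster is addressed as "spark://<host>:7077".
    """
    # Imported here so the local helpers work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)
        .appName("word-count-sketch")  # hypothetical app name
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.parallelize(lines)
            .flatMap(tokenize)                 # one record per word
            .map(lambda word: (word, 1))       # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)   # sum the counts per word
        )
        # Order by descending count, then alphabetically, and keep the first n.
        return counts.takeOrdered(n, key=lambda kv: (-kv[1], kv[0]))
    finally:
        spark.stop()


if __name__ == "__main__":
    sample = ["Spark counts words.", "Words, words and Spark!"]
    print(top_words_local(sample))
```

Passing a different `master` value is what would let the driver talk to a cluster on another machine; with the Docker setup described above, that would presumably be the address of the master container.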