{"id":20162676,"url":"https://github.com/airscholar/sparkingflow","last_synced_at":"2025-04-10T00:36:05.283Z","repository":{"id":234048237,"uuid":"714294968","full_name":"airscholar/SparkingFlow","owner":"airscholar","description":"This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples. ","archived":false,"fork":false,"pushed_at":"2024-03-14T22:16:44.000Z","size":97,"stargazers_count":40,"open_issues_count":5,"forks_count":25,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T02:21:56.823Z","etag":null,"topics":["apache-airflow","dataengineering","docker","java","pyspark","scala","spark"],"latest_commit_sha":null,"homepage":"https://www.datamasterylab.com/home/course/apache-airflow-on-steriods-for-data-engineers/9","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airscholar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-11-04T13:47:02.000Z","updated_at":"2025-03-17T08:30:43.000Z","dependencies_parsed_at":"2024-04-18T02:57:21.333Z","dependency_job_id":"7dccffde-8649-43d5-a706-f5cbbd86af5a","html_url":"https://github.com/airscholar/SparkingFlow","commit_stats":null,"previous_names":["airscholar/sparkingflow"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FSparkingFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FSparkingFlow/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/airscholar%2FSparkingFlow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FSparkingFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airscholar","download_url":"https://codeload.github.com/airscholar/SparkingFlow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137914,"owners_count":21053771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","dataengineering","docker","java","pyspark","scala","spark"],"created_at":"2024-11-14T00:26:26.010Z","updated_at":"2025-04-10T00:36:05.268Z","avatar_url":"https://github.com/airscholar.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Airflow on Steroids with Java, Scala and Python Spark Jobs\n\nThis project orchestrates Spark jobs written in different programming languages using Apache Airflow, all within a Dockerized environment. 
The DAG `sparking_flow` submits Spark jobs written in Python, Scala, and Java on a daily schedule.\n\n## Project Structure\n\nThe DAG `sparking_flow` includes the following tasks:\n\n- `start`: A PythonOperator that prints \"Jobs started\".\n- `python_job`: A SparkSubmitOperator that submits a Python Spark job.\n- `scala_job`: A SparkSubmitOperator that submits a Scala Spark job.\n- `java_job`: A SparkSubmitOperator that submits a Java Spark job.\n- `end`: A PythonOperator that prints \"Jobs completed successfully\".\n\nThe `start` task runs first and triggers the three Spark jobs, which run in parallel; once all of them have completed, the `end` task runs.\n\n## Prerequisites\n\nBefore setting up the project, ensure you have the following:\n\n- Docker and Docker Compose installed on your system.\n- An Apache Airflow Docker image, or a custom image with Airflow installed.\n- An Apache Spark Docker image, or a custom image with Spark installed and configured to work with Airflow.\n- Docker volumes for Airflow DAGs, logs, and Spark jobs, properly set up.\n\n## Docker Setup\n\nTo run this project using Docker, follow these steps:\n\n1. Clone this repository to your local machine.\n2. Navigate to the directory containing the `docker-compose.yml` file.\n3. 
Build and run the containers using Docker Compose:\n\n```bash\ndocker-compose up -d --build\n```\nThis command starts the services defined in `docker-compose.yml`, such as the Airflow webserver, scheduler, Spark master, and worker containers.\n\n## Directory Structure for Jobs\nEnsure your Spark job files are placed in the following directories and are accessible to the Airflow container:\n\n* Python job: jobs/python/wordcountjob.py\n* Scala job: jobs/scala/target/scala-2.12/word-count_2.12-0.1.jar\n* Java job: jobs/java/spark-job/target/spark-job-1.0-SNAPSHOT.jar\n\nThese paths should be relative to the mounted Docker volume for Airflow DAGs.\n\n## Usage\nAfter the Docker environment is set up, the `sparking_flow` DAG is available in the Airflow web UI at [localhost:8080](http://localhost:8080), where it can be triggered manually or left to run on its daily schedule.\n\n### The DAG will execute the following steps:\n* Print \"Jobs started\" in the Airflow logs.\n* Submit the Python Spark job to the Spark cluster.\n* Submit the Scala Spark job to the Spark cluster.\n* Submit the Java Spark job to the Spark cluster.\n* Print \"Jobs completed successfully\" in the Airflow logs after all jobs have finished.\n\n### Note:\nYou must add the Spark cluster URL to the Spark connection configuration in the Airflow UI.\n\n### Full Course\n[![Sparking Flow](https://img.youtube.com/vi/o_pne3aLW2w/0.jpg)](https://www.youtube.com/watch?v=o_pne3aLW2w)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Fsparkingflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairscholar%2Fsparkingflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Fsparkingflow/lists"}