Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/anant/example-cassandra-spark-elasticsearch
- Host: GitHub
- URL: https://github.com/anant/example-cassandra-spark-elasticsearch
- Owner: Anant
- Created: 2021-06-24T16:06:49.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-06-24T17:26:20.000Z (over 3 years ago)
- Last Synced: 2024-11-18T14:46:12.037Z (2 months ago)
- Topics: cassandra, datastax, docker, elasticsearch, scala, spark, spark-sql
- Language: Scala
- Homepage:
- Size: 4.88 KB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Cassandra, Spark and Elasticsearch
This project is primarily a Spark job written in Scala, built with sbt as a fat jar via sbt-assembly. It uses DSE Cassandra with Analytics (for Spark) and Elasticsearch, each running in its own Docker container on the same docker-compose network, and performs three tasks implemented in three Scala classes.
Once the fat jar is built, it is submitted to Spark (with spark-submit) with one of three class names, each corresponding to a Scala class in the code that performs a different task:
1. Reading a .CSV file into a Spark SQL DataFrame and saving it to Cassandra
2. Loading data from a Cassandra table into a Spark SQL DataFrame and saving that data into Elasticsearch
3. Loading data from Elasticsearch into a Spark SQL DataFrame
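All three jobs follow the same basic pattern: build a SparkSession wired up to both connectors, then move a DataFrame between sources. A minimal sketch of that shared setup is shown below; the host names and connector options are assumptions based on the docker-compose network described in this README, not code taken from the repo:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shared setup; connection values are assumptions based on
// the docker-compose service names used later in this README.
object SparkSetup {
  def session(appName: String): SparkSession =
    SparkSession.builder()
      .appName(appName)
      // spark-cassandra-connector: where to find the DSE node
      .config("spark.cassandra.connection.host", "dse1")
      // elasticsearch-hadoop (elasticsearch-spark) connection options
      .config("es.nodes", "elasticsearch")
      .config("es.port", "9200")
      // needed when Elasticsearch is only reachable at one published
      // address, e.g. inside a Docker network
      .config("es.nodes.wan.only", "true")
      .getOrCreate()
}
```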
## Software involved
- docker (docker-compose)
- Elasticsearch (7.13.0)
- Scala (2.11.12)
- DSE Server (6.7.7)
- Apache Spark, Spark SQL (2.2.3)

## Requirements
- docker, docker-compose
- sbt

## Table of Contents
1. [Run containers with docker-compose](#1-run-docker-containers)
2. [Setup Cassandra Table](#2-setup-cassandra-table)
3. [Perform first job (Read CSV, save to Cassandra)](#3-run-first-job)
4. [Perform second job (Read from Cassandra, save to ES)](#4-run-second-job)
5. [Perform third job (Read from ES)](#5-run-third-job)

## 1. Run Docker Containers
Make sure you are in the root folder of the repository. Run the following command:
```bash
docker-compose up -d
```
After a minute or two, run the following command to confirm that both containers (Elasticsearch and the DSE server) are up:
```bash
docker ps -a
```

## 2. Setup Cassandra Table
Use the following command to set up the test Cassandra table:
```bash
docker-compose exec dse1 cqlsh -f /app/test-data/keyspace.cql
```
Additionally, the fat jar needs to be built. Execute the following command in the root directory of the project:
```bash
sbt assembly
```
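For reference, the build definition for a project like this typically combines the sbt-assembly plugin with the two connector libraries. The following build.sbt is only a sketch; the artifact versions (and the use of explicit connector dependencies rather than DSE's bundled ones) are assumptions, not taken from the repo:

```scala
// build.sbt (sketch). Requires sbt-assembly in project/plugins.sbt, e.g.:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
name := "test-project-name"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark itself is provided by the DSE runtime at spark-submit time
  "org.apache.spark" %% "spark-sql" % "2.2.3" % "provided",
  // DataFrame <-> Cassandra
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.13",
  // DataFrame <-> Elasticsearch (elasticsearch-hadoop for Spark 2.x)
  "org.elasticsearch" %% "elasticsearch-spark-20" % "7.13.0"
)

// keep duplicate META-INF entries from breaking the fat jar
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```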
## 3. Run First Job

This first job reads data.csv (located in /test-data/) into a Spark SQL DataFrame and then saves it to DSE Cassandra.
```bash
docker-compose exec dse1 dse spark-submit \
--master dse://dse1 \
--jars /jars/test-project-name-assembly-0.1.jar \
--class "cassandraScalaEs.LoadIntoCass" \
--conf "spark.driver.extraJavaOptions=-Dlogback.configurationFile=/app/src/test/resources/logback.xml" \
--files /app/test-data/data.csv \
/jars/test-project-name-assembly-0.1.jar
```

To test that the data is saved into Cassandra, see the Second Job.
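In code, a LoadIntoCass-style job can be sketched roughly as follows; the keyspace, table, and CSV options are placeholders, since the actual schema lives in keyspace.cql:

```scala
package cassandraScalaEs

import org.apache.spark.sql.SparkSession

// Sketch of a CSV -> Cassandra job; keyspace/table names are placeholders.
object LoadIntoCass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadIntoCass").getOrCreate()

    // data.csv is shipped alongside the job via --files in the
    // spark-submit command above; path handling is simplified here
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data.csv")

    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testkeyspace", "table" -> "users"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```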
## 4. Run Second Job
This second job reads the data that was inserted into DSE Cassandra by the first job into a Spark SQL DataFrame, then saves that data to Elasticsearch.
If the first job worked properly, the data read from DSE Cassandra will be displayed in the console.
Right after the unfiltered data is shown, a filtered version of the DataFrame (only users with id > 1) is displayed as well.
```bash
docker-compose exec dse1 dse spark-submit \
--master dse://dse1 \
--jars /jars/test-project-name-assembly-0.1.jar \
--class "cassandraScalaEs.CassToEs" \
--conf "spark.driver.extraJavaOptions=-Dlogback.configurationFile=/app/src/test/resources/logback.xml" \
/jars/test-project-name-assembly-0.1.jar
```

To test that data was written to Elasticsearch, open a browser and navigate to the following URL:
```bash
http://localhost:9200/usertestindex/_search
```
This should show all of the data from the original data.csv file written into the "usertestindex" index in Elasticsearch; each row from the CSV becomes an individual document.
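Roughly, a CassToEs-style job looks like the sketch below; the keyspace and table names are placeholders, while the index name usertestindex matches the URL above:

```scala
package cassandraScalaEs

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

// Sketch of a Cassandra -> Elasticsearch job; schema names are placeholders.
object CassToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CassToEs").getOrCreate()

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testkeyspace", "table" -> "users"))
      .load()

    df.show()                   // unfiltered data
    df.filter("id > 1").show()  // filtered version (users with id > 1)

    // saveToEs comes from the org.elasticsearch.spark.sql implicits
    df.saveToEs("usertestindex")

    spark.stop()
  }
}
```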
## 5. Run Third Job

The third job reads from the Elasticsearch index created in the previous job (usertestindex) into a Spark SQL DataFrame, then displays the data in the console.
```bash
docker-compose exec dse1 dse spark-submit \
--master dse://dse1 \
--jars /jars/test-project-name-assembly-0.1.jar \
--class "cassandraScalaEs.LoadFromEs" \
--conf "spark.driver.extraJavaOptions=-Dlogback.configurationFile=/app/src/test/resources/logback.xml" \
/jars/test-project-name-assembly-0.1.jar
```

This data is not filtered, but it can be filtered with push-down operations: a filter condition on the DataFrame is automatically translated into a Query DSL query, which the elasticsearch-spark connector sends to Elasticsearch so that only matching data is returned.
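A LoadFromEs-style job can be sketched as follows, including a filter of the kind described above that the connector pushes down to Elasticsearch (field names are placeholders):

```scala
package cassandraScalaEs

import org.apache.spark.sql.SparkSession

// Sketch of an Elasticsearch -> console job; field names are placeholders.
object LoadFromEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadFromEs").getOrCreate()

    val df = spark.read
      .format("org.elasticsearch.spark.sql")  // or the short alias "es"
      .load("usertestindex")

    df.show()

    // This filter is translated to Query DSL and executed by Elasticsearch,
    // so only matching documents are shipped back to Spark.
    df.filter("id > 1").show()

    spark.stop()
  }
}
```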
See the following document for more information (under the Spark SQL Support section):
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html