https://github.com/nfo94/infraestrutura-cassandra-pd

cassandra jupyter python spark

Last synced: 3 months ago
JSON representation

Host: GitHub
URL: https://github.com/nfo94/infraestrutura-cassandra-pd
Owner: nfo94
Created: 2024-11-05T23:39:57.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-11-06T00:00:34.000Z (7 months ago)
Last Synced: 2024-12-26T10:43:05.919Z (5 months ago)
Topics: cassandra, jupyter, python, spark
Language: Jupyter Notebook
Homepage:
Size: 382 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        ## Cassandra infrastructure project

This is a data engineering and data analysis project to showcase Cassandra. To run this

project you need:

- [Docker](https://www.docker.com/) installed and configured

- [Poetry](https://python-poetry.org/) installed and configured

- Python `3.11` (if you need to change Python versions it's recommended to use [pyenv](https://github.com/pyenv/pyenv))

- Download Spark 3.5.1 ([spark-3.5.1-bin-hadoop3](https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz)), extract it and save it in `./jupyter/`

Run this command to generate the proper data:

```bash

make etl

```

This will generate two `.csv` files in the `./data/` folder. Now setup Docker:

```bash

make setup

```

Then run services (Cassandra and Jupyter):

```bash

make up

```

Enter the `cassandra1` container:

```bash

make cas

```

**Wait a bit for the Cassandra containers to finish setting up**. If you try to enter the

`cqlsh` tool too soon it will show this error message:

```bash

Connection error: ('Unable to connect to any servers', {'127.0.0.1:9042': ConnectionRefusedError(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})

```

After waiting a bit, in the Cassandra container, enter `cqlsh`:

```bash

cqlsh

```

Then create the keyspace and use it:

```bash

CREATE KEYSPACE IF NOT EXISTS analises_covid WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 2 };

```

```bash

USE analises_covid;

```

Create the necessary tables:

```bash

CREATE TABLE IF NOT EXISTS total_population_by_city_in_ba (city_name TEXT, uuid UUID, population_40_59 INT, PRIMARY KEY (uuid, city_name));

```

```bash

CREATE TABLE IF NOT EXISTS vaccinated_people_by_city_in_ba (city_name TEXT, vaccinated_people_d1_40_59 INT, uuid UUID, PRIMARY KEY (uuid, city_name));

```

Copy the data generated by the Python script to the created tables:

```bash

COPY analises_covid.total_population_by_city_in_ba (city_name,uuid,population_40_59) FROM './data/people_40_59_by_city_ba.csv' WITH HEADER=true;

```

```bash

COPY analises_covid.vaccinated_people_by_city_in_ba (city_name,vaccinated_people_d1_40_59,uuid) FROM './data/people_40_59_covid_d1_by_city_ba.csv' WITH HEADER=true;

```

To check the tables with the copyied data, run:

```bash

SELECT * FROM total_population_by_city_in_ba;

```

You'll hopefully see something like this:

![test-query](./imgs/test-query.png)

Access Jupyter on your browser in http://localhost:8888/. Then get the Jupyter token:

```bash

make jup

```

Put the token in the Jupyter input:

![jupyter](./imgs/jupyter-login.png)

Open the notebook `/jupyter/analises_covid.ipynb` inside the Jupyter lab and run the

commands in each cell to see the analysis on the created database.

![jupyter-analysis](./imgs/jupyter-analysis.png)

If you want to stop all services:

```bash

make down

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nfo94/infraestrutura-cassandra-pd

Awesome Lists containing this project

README