Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lealre/one-billion-rows-challenge
Project 01 from Python + SQL BOOTCAMP
https://github.com/lealre/one-billion-rows-challenge
Last synced: 10 days ago
JSON representation
Project 01 from Python + SQL BOOTCAMP
- Host: GitHub
- URL: https://github.com/lealre/one-billion-rows-challenge
- Owner: lealre
- Created: 2024-02-28T19:08:06.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-03-03T14:55:05.000Z (9 months ago)
- Last Synced: 2024-03-04T00:20:29.987Z (9 months ago)
- Language: Python
- Homepage:
- Size: 510 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## One Billion Rows Challenge
This repository represents Project 01 from the Data Engineering BOOTCAMP, where the objective is to compare different ways of reading one billion rows in Python. Upon cloning the original repository, the objective is to implement script modifications based on the concepts learned in previous bootcamp classes, such as type hints and functions.
From the original repository, a decorator function was implemented to measure the execution time of each script function responsible for reading the file. Minor structural adjustments, such as introducing a type hint in the path variable, were made to the scripts. Additionally, the Dask script was intentionally excluded, narrowing the comparison to Python (native structure), Pandas, Polars, and DuckDB.
## Results
Machine specifications:
* 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz
* 16G RAM| Enviroment | Execution Time |
|-------------------|------------------:|
| Python | 46 min 37 s |
| Python + Pandas | 6 min 36 s |
| Python + Polars | 38 s |
| Python + Duckdb | 33 s |## How To Run
How to execute this repository in a bash terminal using `pyenv` and `poetry`.
Before you begin, ensure that *measurements.txt* is inside `.gitignore`
1. Clone the repository:
```bash
git clone https://github.com/lealre/one-billion-rows-challenge.git
```
2. Enter the folder:
```bash
cd one-billion-rows-challenge
```3. Set Python version using pyenv:
```bash
pyenv local 3.12.1
```4. Set poetry to use Python 3.12.1:
```bash
poetry env use 3.12.1
```5. Install dependencies and activate the virtual environment:
```bash
poetry install --no-root
```
```bash
poetry lock --no-update
```
```bash
poetry shell
```6. Create the *measurements.txt* file (takes some time to create):
```bash
python src/create_measurements.py
```7. Read the file using specific script:
```bash
python src/using_python.py
```
```bash
python src/using_pandas.py
```
```bash
python src/using_polars.py
```
```bash
python src/using_duckdb.py
```--------------
THIS IS A CLONE FROM [THIS REPO](https://github.com/lvgalvao/One-Billion-Row-Challenge-Python)! The scripts designed to read the file are not authored by me.