https://github.com/mdh266/polarsduckdbplayground
Playing Around With Polars On AWS
https://github.com/mdh266/polarsduckdbplayground
aws-ec2 data-science dataframe-library docker duckdb pandas polars s3 sql
Last synced: 2 months ago
JSON representation
Playing Around With Polars On AWS
- Host: GitHub
- URL: https://github.com/mdh266/polarsduckdbplayground
- Owner: mdh266
- License: mit
- Created: 2023-07-05T00:57:41.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-08-27T17:56:27.000Z (over 1 year ago)
- Last Synced: 2025-01-31T14:43:15.969Z (4 months ago)
- Topics: aws-ec2, data-science, dataframe-library, docker, duckdb, pandas, polars, s3, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 4.37 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Polars DataFrame PlayGround
-------------------In this post I quickly covered what I view as the limitations in [Pandas](https://pandas.pydata.org/) library:
1. High memory usage
2. Limited multi-core algorithms
3. No ability to execute SQL statements (like [SparkSQL & DataFrame](https://spark.apache.org/sql/))
4. No query planning/lazy-execution
5. [NULL values only exist for floats not ints](https://pandas.pydata.org/docs/user_guide/integer_na.html) (this changed in Pandas 1.0+)
6. Using [strings is inefficient](https://pandas.pydata.org/docs/user_guide/text.html) (this too changed in Pandas 1.0+)
Many of these issues have been addressed by the [Pandas 2.0 release](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html), but **I still feel the API is awkward!**So in this post I go over two alternatives:
* [Polars](https://www.pola.rs/) for dataframes
* [DuckDB](https://duckdb.org/) for SQL queriesI cover how to get set up in with
Juptyer lab using [Docker](https://www.docker.com/) on [AWS](https://aws.amazon.com/) as well as some basics of [Polars](https://www.pola.rs/), [DuckDB](https://duckdb.org/) and how to use the two in combination. The benefits of Polars is that,* It allows for fast parallel querying on dataframes.
* It uses [Apache Arrow](https://arrow.apache.org/) for backend datatypes making it efficient for memory.
* It has both lazy and eager execution mode.
* It allows for SQL queries direcly on dataframes.
* Its API is similar to Spark's API and allows for highly readable queries using method chaining.DuckDB is a blazingly fast [OLAP](https://aws.amazon.com/what-is/olap/) SQL query engine. In the context of this blog post I cover how to use it to run SQL queries against Pandas/Polars dataframes and even local Parquet files!
## Using The Notebook
----------
You can install the dependencies and access the notebook using Docker by building the Docker image with the following:docker build -t polars_nb .
Followed by running the command container:
docker run -ip 8888:8888 -v `pwd`:/home/jovyan -t polars_nb
See here for more info.
Otherwise without Docker, make sure to use Python 3.10 and install the libraries listed in
requirements.txt
. These can be installed with the command,pip install -r requirements.txt