https://github.com/borys25ol/pg2pyrquet
Python CLI tool designed to export PostgreSQL tables into Parquet files
https://github.com/borys25ol/pg2pyrquet
cli dumps exports parquet postgresql python
Last synced: 5 months ago
JSON representation
Python CLI tool designed to export PostgreSQL tables into Parquet files
- Host: GitHub
- URL: https://github.com/borys25ol/pg2pyrquet
- Owner: borys25ol
- License: mit
- Created: 2024-08-23T09:59:17.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-27T20:21:31.000Z (almost 2 years ago)
- Last Synced: 2024-08-28T21:25:37.739Z (almost 2 years ago)
- Topics: cli, dumps, exports, parquet, postgresql, python
- Language: Python
- Homepage:
- Size: 69.3 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Postgres Exports to Apache Parquet with Python (pg2pyrquet)
====================
[](https://www.python.org/)
[](https://github.com/psf/black)
[](http://mypy-lang.org/)
[](https://pycqa.github.io/isort/)
[](https://github.com/pre-commit/pre-commit)
## Description
`pg2pyrquet` is a Python CLI tool designed to export PostgreSQL tables into Parquet files.
This tool is particularly useful for data engineers and analysts who want to efficiently convert PostgreSQL data into a columnar storage format that is optimal for analytical workloads.
## Features
- **Efficient Data Export**: Export PostgreSQL tables directly to Parquet files.
- **Batch Processing**: Specify batch size to handle large datasets efficiently.
- **Customizable Output**: Define output folder and file name for the Parquet file.
## Installation
Setup and activate a python3 virtualenv via your preferred method. e.g. and install production requirements:
```shell
make ve
```
To use `pg2pyrquet`, you need to have Python installed. You can install the necessary dependencies using `pip`:
```sh
pip install -r requirements.txt
```
Configuration
--------------
If your database has a password, you can set the `POSTGRES_USER` and `POSTGRES_PASSWORD` environment variables to avoid entering them every time you run the tool.
For security reasons, it is recommended to use environment variables to store sensitive information.
```shell
export POSTGRES_USER=
export POSTGRES_PASSWORD=
````
Or using more secure way:
```shell
read -s -p "Enter POSTGRES User: " POSTGRES_USER || export POSTGRES_USER
read -s -p "Enter POSTGRES Password: " POSTGRES_PASSWORD || export POSTGRES_PASSWORD
```
Usage
-----
The primary command provided by this tool is `export-table`, `export-database` and `export-query` commands,
which allows you to export a PostgreSQL table or all PostgreSQL database tables to a Parquet files.
### Export a Single Table
To export a single table from a PostgreSQL database to a Parquet file, use the `export-table` command.
This command allows you to specify the database, table, output folder, output file name, and the batch size for processing.
#### Example Command:
```shell
python -m pg2pyrquet export-table \
--host \
--port \
--database \
--table \
--folder \
--output-file \
--batch-size
```
#### Command Options
- `--host`: The hostname of the PostgreSQL server.
- `--port`: The port number of the PostgreSQL server.
- `--database`: The name of the PostgreSQL database you want to export data from.
- `--table`: The specific table within the database to export.
- `--folder`: The directory where the Parquet file will be saved.
- `--output-file`: The name of the output Parquet file.
- `--batch-size`: The number of rows to process in each batch. This helps in managing memory usage for large tables.
### Export All Database Tables
To export all tables from a PostgreSQL database to Parquet files, use the `export-database` command.
This command exports each table into a separate Parquet file in the specified output folder.
#### Example Command
```shell
python -m pg2pyrquet export-database \
--host \
--port \
--database \
--folder \
--batch-size
```
#### Command Options
- `--host`: The hostname of the PostgreSQL server.
- `--port`: The port number of the PostgreSQL server.
- `--database`: The name of the PostgreSQL database you want to export data from.
- `--folder`: The directory where the Parquet file will be saved.
- `--batch-size`: The number of rows to process in each batch. This helps in managing memory usage for large tables.
#### Note on File Naming
When using the `export-database` command, each Parquet file will be named according to the table name, following the format `{table_name}.parquet`.
### Export a Custom Query
To export the result of a custom SQL query from a PostgreSQL database to a Parquet file, use the `export_query` command.
This command allows you to specify the database, the file containing the SQL query, the output folder, the output file name, and the batch size for processing.
#### Example Command:
```shell
python -m pg2pyrquet export-query \
--host \
--port \
--database \
--query-file \
--folder \
--output-file \
--batch-size
```
#### Command Options
- `--host`: The hostname of the PostgreSQL server.
- `--port`: The port number of the PostgreSQL server.
- `--database`: The name of the PostgreSQL database you want to export data from.
- `--query-file`: The path to the file containing the SQL query (like `custom-query.sql`).
- `--folder`: The directory where the Parquet file will be saved.
- `--output-file`: The name of the output Parquet file.
- `--batch-size`: The number of rows to process in each batch. This helps in managing memory usage for large tables.
Example SQL query file (`custom-query.sql`):
```sql
SELECT *
FROM my_table
WHERE column_name = 'value'
GROUP BY column_name
LIMIT 1000;
```
### Running from Python
Also, you have the ability execute all available commands as Python functions:
Create a new file `export.py` with the following content:
```python
import os
from pg2pyrquet import export_database, export_query, export_table
def set_postgres_auth_env(username: str, password: str) -> None:
"""
Set the PostgreSQL username and password as environment variables
Args:
username (str): The username for the PostgreSQL database.
password (str): The password for the PostgreSQL database.
"""
os.environ["POSTGRES_USER"] = username
os.environ["POSTGRES_PASSWORD"] = password
def run_export_database() -> None:
export_database(
host="localhost",
port="5432",
database="test_database",
output_path="./data",
batch_size=5000,
)
def run_export_table() -> None:
export_table(
host="localhost",
port="5432",
database="test_database",
table="test_table",
output_path="./data",
output_file="test_table.parquet",
batch_size=5000,
)
def run_export_query() -> None:
export_query(
host="localhost",
port="5432",
database="test_database",
query_file="./custom_query.sql",
output_path="./data",
output_file="query.parquet",
batch_size=5000,
)
def main() -> None:
# If your database has a password,
# you can set the POSTGRES_USER and POSTGRES_PASSWORD environment variables.
set_postgres_auth_env(username="username", password="password")
run_export_database()
run_export_table()
run_export_query()
if __name__ == "__main__":
main()
```
And run it:
```shell
python export.py
```
Contributing
------------
Contributions are welcome!
If you find a bug or have a feature request, please open an issue or submit a pull request on GitHub.