https://github.com/cody-scott/dagster-mssql-bcp
- Host: GitHub
- URL: https://github.com/cody-scott/dagster-mssql-bcp
- Owner: cody-scott
- License: mit
- Created: 2024-09-16T15:16:28.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-10-31T20:01:23.000Z (18 days ago)
- Last Synced: 2024-10-31T21:17:14.423Z (18 days ago)
- Language: Python
- Size: 91.8 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# dagster-mssql-bcp
![Unit tests](https://github.com/cody-scott/dagster-mssql-bcp/actions/workflows/python-test.yml/badge.svg)
ODBC is slow 🐢 bcp is fast! 🐰
This is a custom dagster IO manager for loading data into SQL Server using the `bcp` utility.
## What you need to run it
### PyPI
![PyPI](https://img.shields.io/pypi/v/dagster-mssql-bcp?label=latest%20stable&logo=pypi)
`pip install dagster-mssql-bcp`
### BCP Utility
The `bcp` utility must be installed on the machine that is running the dagster pipeline.
See [Microsoft's documentation](https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-ver16&tabs=windows) for more information.
Ideally, `bcp` should be on your `PATH`. If it is not, set its location with `bcp_path` in the IO manager configuration.
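To confirm that `bcp` can be found before running a pipeline, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; `find_bcp` is a hypothetical helper for illustration and is not part of `dagster-mssql-bcp`.

```python
import shutil
from typing import Optional


def find_bcp(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a path to an executable bcp, or None if none is found.

    Hypothetical helper: not provided by dagster-mssql-bcp.
    """
    if explicit_path:
        # A path with a directory component is checked directly,
        # without searching PATH.
        return shutil.which(explicit_path)
    # Otherwise search the directories listed in PATH.
    return shutil.which("bcp")


location = find_bcp()
if location is None:
    print("bcp not found on PATH; pass its location via bcp_path instead")
else:
    print(f"bcp found at {location}")
```

If the check reports that `bcp` is missing, pass the full path (for example the `/opt/mssql-tools18/bin/bcp` location used in the examples below) as `bcp_path`.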
### ODBC Drivers
You need the ODBC drivers installed on the machine that is running the dagster pipeline.
See [Microsoft's documentation](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16) for more information.
### Permissions
The user running the dagster pipeline must have the necessary permissions to load data into the SQL Server database:
* `CREATE SCHEMA`
* `CREATE/ALTER TABLES`

## Basic Usage
## Polars
The Polars IO manager processes data as a `LazyFrame`. Your asset can return either a `DataFrame` or a `LazyFrame`; a `DataFrame` is automatically converted to a `LazyFrame` before loading.
```python
from dagster import asset, Definitions
from dagster_mssql_bcp import PolarsBCPIOManager
import polars as pl

# Connection settings and bcp options for the target SQL Server
io_manager = PolarsBCPIOManager(
    host="my_mssql_server",
    database="my_database",
    user="username",
    password="password",
    query_props={
        "TrustServerCertificate": "yes",
    },
    bcp_arguments={"-u": ""},
    bcp_path="/opt/mssql-tools18/bin/bcp",
)


# asset_schema describes the table columns; schema is the target schema
@asset(
    metadata={
        "asset_schema": [
            {"name": "id", "type": "INT"},
        ],
        "schema": "my_schema",
    }
)
def my_polars_asset(context):
    return pl.DataFrame({"id": [1, 2, 3]})


# The same asset, returning a LazyFrame instead of a DataFrame
@asset(
    metadata={
        "asset_schema": [
            {"name": "id", "type": "INT"},
        ],
        "schema": "my_schema",
    }
)
def my_polars_asset_lazy(context):
    return pl.LazyFrame({"id": [1, 2, 3]})


defs = Definitions(
    assets=[my_polars_asset, my_polars_asset_lazy],
    io_managers={
        "io_manager": io_manager,
    },
)
```
## Pandas
```python
from dagster import asset, Definitions
from dagster_mssql_bcp import PandasBCPIOManager
import pandas as pd

# Connection settings and bcp options for the target SQL Server
io_manager = PandasBCPIOManager(
    host="my_mssql_server",
    database="my_database",
    user="username",
    password="password",
    query_props={
        "TrustServerCertificate": "yes",
    },
    bcp_arguments={"-u": ""},
    bcp_path="/opt/mssql-tools18/bin/bcp",
)


# asset_schema describes the table columns; schema is the target schema
@asset(
    metadata={
        "asset_schema": [
            {"name": "id", "type": "INT"},
        ],
        "schema": "my_schema",
    }
)
def my_pandas_asset(context):
    return pd.DataFrame({"id": [1, 2, 3]})


defs = Definitions(
    assets=[my_pandas_asset],
    io_managers={
        "io_manager": io_manager,
    },
)
```
The `asset_schema` metadata entry defines the table structure, and the asset's return value is the data to load.
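To illustrate how an `asset_schema` list describes a table, the sketch below renders one as a `CREATE TABLE` statement. It is an illustration only, not how `dagster-mssql-bcp` builds its tables: `render_create_table` and the table name `my_table` are hypothetical, and the library may handle naming, types, and extra columns differently.

```python
from typing import Dict, List


def render_create_table(
    schema: str, table: str, asset_schema: List[Dict[str, str]]
) -> str:
    """Render a CREATE TABLE statement from an asset_schema-style list.

    Hypothetical helper for illustration; not part of dagster-mssql-bcp.
    """
    # One "[name] TYPE" line per column entry, in the order given
    columns = ",\n".join(
        f"    [{col['name']}] {col['type']}" for col in asset_schema
    )
    return f"CREATE TABLE [{schema}].[{table}] (\n{columns}\n)"


ddl = render_create_table(
    "my_schema",
    "my_table",  # assumed table name, chosen for this example
    [{"name": "id", "type": "INT"}],
)
print(ddl)
```

Each dictionary in the list corresponds to one column, so adding an entry adds a column to the rendered statement.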
## Docs
For more details, see the [assets doc](https://github.com/cody-scott/dagster-mssql-bcp/blob/main/docs/assets.md) and the [io manager doc](https://github.com/cody-scott/dagster-mssql-bcp/blob/main/docs/io_manager.md). For how it's implemented, see the [dev doc](https://github.com/cody-scott/dagster-mssql-bcp/blob/main/docs/dev.md).