https://github.com/slingdata-io/sling-python

Python wrapper for the Sling CLI tool
https://github.com/slingdata-io/sling-python
Last synced: 2 months ago
JSON representation
Python wrapper for the Sling CLI tool
Host: GitHub
URL: https://github.com/slingdata-io/sling-python
Owner: slingdata-io
License: mit
Created: 2022-04-04T12:27:09.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2026-02-07T02:58:33.000Z (4 months ago)
Last Synced: 2026-02-22T22:57:55.752Z (4 months ago)
Language: Python
Homepage: https://docs.slingdata.io/
Size: 135 KB
Stars: 64
Watchers: 4
Forks: 12
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          


Slings from a data source to a data target.


## Installation

`pip install sling` or `pip install sling[arrow]` for streaming.

Then you should be able to run `sling --help` from command line.

## Running a Extract-Load Task

### CLI

```shell

sling run --src-conn MY_PG --src-stream myschema.mytable \

  --tgt-conn YOUR_SNOWFLAKE --tgt-object yourschema.yourtable \

  --mode full-refresh

```

Or passing a yaml/json string or file

```shell

cat '

source: MY_POSTGRES

target: MY_SNOWFLAKE

# default config options which apply to all streams

defaults:

  mode: full-refresh

  object: new_schema.{stream_schema}_{stream_table}

streams:

  my_schema.*:

' > /path/to/replication.yaml

sling run -r /path/to/replication.yaml

```

### Using the `Replication` class

Run a replication from file:

```python

import yaml

from sling import Replication

# From a YAML file

replication = Replication(file_path="path/to/replication.yaml")

replication.run()

# Or load into object

with open('path/to/replication.yaml') as file:

  config = yaml.load(file, Loader=yaml.FullLoader)

replication = Replication(**config)

replication.run()

```

Build a replication dynamically:

```python

from sling import Replication, ReplicationStream, Mode

# build sling replication

streams = {}

for (folder, table_name) in list(folders):

  streams[folder] = ReplicationStream(

    mode=Mode.FULL_REFRESH, object=table_name, primary_key='_hash_id')

replication = Replication(

  source='aws_s3',

  target='snowflake',

  streams=streams,

  env=dict(SLING_STREAM_URL_COLUMN='true', SLING_LOADED_AT_COLUMN='true'),

  debug=True,

)

replication.run()

```

### Using the `Sling` Class

For more direct control and streaming capabilities, you can use the `Sling` class, which mirrors the CLI interface.

#### Basic Usage with `run()` method

```python

import os

from sling import Sling, Mode

# Set postgres & snowflake connection

# see https://docs.slingdata.io/connections/database-connections

os.environ["POSTGRES"] = 'postgres://...'

os.environ["SNOWFLAKE"] = 'snowflake://...'

# Database to database transfer

Sling(

    src_conn="postgres",

    src_stream="public.users",

    tgt_conn="snowflake",

    tgt_object="public.users_copy",

    mode=Mode.FULL_REFRESH

).run()

# Database to file

Sling(

    src_conn="postgres", 

    src_stream="select * from users where active = true",

    tgt_object="file:///tmp/active_users.csv"

).run()

# File to database

Sling(

    src_stream="file:///path/to/data.csv",

    tgt_conn="snowflake",

    tgt_object="public.imported_data"

).run()

```

#### Input Streaming - Python Data to Target

> **💡 Tip:** Install `pip install sling[arrow]` for better streaming performance and improved data type handling.

> **📊 DataFrame Support:** The `input` parameter accepts lists of dictionaries, pandas DataFrames, or polars DataFrames. DataFrame support preserves data types when using Arrow format.

> **⚠️ Note:** Be careful with large numbers of `Sling` invocations using `input` or `stream()` methods when working with external systems (databases, file systems). Each call re-opens the connection since it invokes the underlying sling binary. For better performance and connection reuse, consider using the `Replication` class instead, which maintains open connections across multiple operations.

```python

import os

from sling import Sling, Format

# Set postgres connection

# see https://docs.slingdata.io/connections/database-connections

os.environ["POSTGRES"] = 'postgres://...'

# Stream Python data to CSV file

data = [

    {"id": 1, "name": "John", "age": 30},

    {"id": 2, "name": "Jane", "age": 25},

    {"id": 3, "name": "Bob", "age": 35}

]

Sling(

    input=data,

    tgt_object="file:///tmp/output.csv"

).run()

# Stream Python data to database

Sling(

    input=data,

    tgt_conn="postgres",

    tgt_object="public.users"

).run()

# Stream Python data to JSON Lines file

Sling(

    input=data,

    tgt_object="file:///tmp/output.jsonl",

    tgt_options={"format": Format.JSONLINES}

).run()

# Stream from generator (memory efficient for large datasets)

def data_generator():

    for i in range(10000):

        yield {"id": i, "value": f"item_{i}", "timestamp": "2023-01-01"}

Sling(input=data_generator(), tgt_object="file:///tmp/large_dataset.csv").run()

# Stream pandas DataFrame to database

import pandas as pd

df = pd.DataFrame({

    "id": [1, 2, 3, 4],

    "name": ["Alice", "Bob", "Charlie", "Diana"],

    "age": [25, 30, 35, 28],

    "salary": [50000, 60000, 70000, 55000]

})

Sling(

    input=df,

    tgt_conn="postgres",

    tgt_object="public.employees"

).run()

# Stream polars DataFrame to CSV file

import polars as pl

df = pl.DataFrame({

    "product_id": [101, 102, 103],

    "product_name": ["Laptop", "Mouse", "Keyboard"],

    "price": [999.99, 25.50, 75.00],

    "in_stock": [True, False, True]

})

Sling(

    input=df,

    tgt_object="file:///tmp/products.csv"

).run()

# DataFrame with column selection

Sling(

    input=df,

    select=["product_name", "price"],  # Only export specific columns

    tgt_object="file:///tmp/product_prices.csv"

).run()

```

#### Output Streaming with `stream()`

```python

import os

from sling import Sling

# Set postgres connection

# see https://docs.slingdata.io/connections/database-connections

os.environ["POSTGRES"] = 'postgres://...'

# Stream data from database

sling = Sling(

    src_conn="postgres",

    src_stream="public.users",

    limit=1000

)

for record in sling.stream():

    print(f"User: {record['name']}, Age: {record['age']}")

# Stream data from file

sling = Sling(

    src_stream="file:///path/to/data.csv"

)

# Process records one by one (memory efficient)

for record in sling.stream():

    # Process each record

    processed_data = transform_record(record)

    # Could save to another system, send to API, etc.

# Stream with parameters

sling = Sling(

    src_conn="postgres",

    src_stream="public.orders",

    select=["order_id", "customer_name", "total"],

    where="total > 100",

    limit=500

)

records = list(sling.stream())

print(f"Found {len(records)} high-value orders")

```

#### High-Performance Streaming with `stream_arrow()`

> **🚀 Performance:** The `stream_arrow()` method provides the highest performance streaming with full data type preservation by using Apache Arrow's columnar format. Requires `pip install sling[arrow]`.

> **📊 Type Safety:** Unlike `stream()` which may convert data types during CSV serialization, `stream_arrow()` preserves exact data types including integers, floats, timestamps, and more.

```python

import os

from sling import Sling

# Set postgres connection  

# see https://docs.slingdata.io/connections/database-connections

os.environ["POSTGRES"] = 'postgres://...'

# Basic Arrow streaming from database

sling = Sling(src_conn="postgres", src_stream="public.users", limit=1000)

# Get Arrow RecordBatchStreamReader for maximum performance

reader = sling.stream_arrow()

# Convert to Arrow Table for analysis

table = reader.read_all()

print(f"Received {table.num_rows} rows with {table.num_columns} columns")

print(f"Column names: {table.column_names}")

print(f"Schema: {table.schema}")

# Convert to pandas DataFrame with preserved types

if table.num_rows > 0:

    df = table.to_pandas()

    print(df.dtypes)  # Shows preserved data types

# Stream Arrow file with type preservation

sling = Sling(

    src_stream="file:///path/to/data.arrow",

    src_options={"format": "arrow"}

)

reader = sling.stream_arrow()

table = reader.read_all()

# Access columnar data directly (very efficient)

for column_name in table.column_names:

    column = table.column(column_name)

    print(f"{column_name}: {column.type}")

# Process Arrow batches for large datasets (memory efficient)

sling = Sling(

    src_conn="postgres", 

    src_stream="select * from large_table"

)

reader = sling.stream_arrow()

for batch in reader:

    # Process each batch separately to manage memory

    print(f"Processing batch with {batch.num_rows} rows")

    # Convert batch to pandas if needed

    batch_df = batch.to_pandas()

    # Process batch_df...

# Round-trip with Arrow format preservation

import pandas as pd

# Write DataFrame to Arrow file with type preservation

df = pd.DataFrame({

    "id": [1, 2, 3],

    "amount": [100.50, 250.75, 75.25],

    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),

    "active": [True, False, True]

})

Sling(

    input=df,

    tgt_object="file:///tmp/data.arrow",

    tgt_options={"format": "arrow"}

).run()

# Read back with full type preservation

sling = Sling(

    src_stream="file:///tmp/data.arrow",

    src_options={"format": "arrow"}

)

reader = sling.stream_arrow()

restored_table = reader.read_all()

restored_df = restored_table.to_pandas()

# Types are exactly preserved (no string conversion)

print(restored_df.dtypes)

assert restored_df['active'].dtype == 'bool'

assert 'datetime64' in str(restored_df['timestamp'].dtype)

```

**Notes:**

- `stream_arrow()` requires PyArrow: `pip install sling[arrow]`

- Cannot be used with a target object (use `run()` instead)

- Provides the best performance for large datasets

- Preserves exact data types including timestamps, decimals, and booleans

- Ideal for analytics workloads and data science applications

#### Round-trip Examples

```python

import os

from sling import Sling

# Set postgres connection

# see https://docs.slingdata.io/connections/database-connections

os.environ["POSTGRES"] = 'postgres://...'

# Python → File → Python

original_data = [

    {"id": 1, "name": "Alice", "score": 95.5},

    {"id": 2, "name": "Bob", "score": 87.2}

]

# Step 1: Python data to file

sling_write = Sling(

    input=original_data,

    tgt_object="file:///tmp/scores.csv"

)

sling_write.run()

# Step 2: File back to Python

sling_read = Sling(

    src_stream="file:///tmp/scores.csv"

)

loaded_data = list(sling_read.stream())

# Python → Database → Python (with transformations)

sling_to_db = Sling(

    input=original_data,

    tgt_conn="postgres",

    tgt_object="public.temp_scores"

)

sling_to_db.run()

sling_from_db = Sling(

    src_conn="postgres", 

    src_stream="select *, score * 1.1 as boosted_score from public.temp_scores",

)

transformed_data = list(sling_from_db.stream())

# DataFrame → Database → DataFrame (with pandas/polars)

import pandas as pd

# Start with pandas DataFrame

df = pd.DataFrame({

    "user_id": [1, 2, 3],

    "purchase_amount": [100.50, 250.75, 75.25],

    "category": ["electronics", "clothing", "books"]

})

# Write DataFrame to database

Sling(

    input=df,

    tgt_conn="postgres",

    tgt_object="public.purchases"

).run()

# Read back with SQL transformations as pandas DataFrame

sling_query = Sling(

    src_conn="postgres",

    src_stream="""

        SELECT category, 

               COUNT(*) as purchase_count,

               AVG(purchase_amount) as avg_amount

        FROM public.purchases 

        GROUP BY category

    """

)

summary_data = list(sling_query.stream())

summary_df = pd.DataFrame(summary_data)

print(summary_df)

```

### Using the `Pipeline` class

Run a [Pipeline](https://docs.slingdata.io/concepts/pipeline):

```python

from sling import Pipeline

from sling.hooks import StepLog, StepCopy, StepReplication, StepHTTP, StepCommand

# From a YAML file

pipeline = Pipeline(file_path="path/to/pipeline.yaml")

pipeline.run()

# Or using Hook objects for type safety

pipeline = Pipeline(

    steps=[

        StepLog(message="Hello world"),

        StepCopy(from_="sftp//path/to/file", to="aws_s3/path/to/file"),

        StepReplication(path="path/to/replication.yaml"),

        StepHTTP(url="https://trigger.webhook.com"),

        StepCommand(command=["ls", "-l"], print_output=True)

    ],

    env={"MY_VAR": "value"}

)

pipeline.run()

# Or programmatically using dictionaries

pipeline = Pipeline(

    steps=[

        {"type": "log", "message": "Hello world"},

        {"type": "copy", "from": "sftp//path/to/file", "to": "aws_s3/path/to/file"},

        {"type": "replication", "path": "path/to/replication.yaml"},

        {"type": "http", "url": "https://trigger.webhook.com"},

        {"type": "command", "command": ["ls", "-l"], "print": True}

    ],

    env={"MY_VAR": "value"}

)

pipeline.run()

```

### Building API Specs with `ApiSpec`

Build [API Spec](https://docs.slingdata.io/concepts/api-specs) YAML files programmatically with type checking and validation. API specs define how Sling extracts data from REST APIs.

```python

from sling.api_spec import (

    ApiSpec, Endpoint, Request, Pagination, Response, Records,

    Processor, Rule, Iterate, Call, DynamicEndpoint,

    AuthType, HTTPMethod, RuleAction, AggregationType, BackoffType, ResponseFormat,

)

spec = ApiSpec(

    name="My API",

    description="Extract data from My API",

    queues=["user_ids"],

    defaults=Endpoint(

        state={"base_url": "https://api.example.com/v1", "limit": 100},

        request=Request(

            headers={

                "Authorization": 'Bearer {require(secrets.api_key, "api_key required")}',

                "Accept": "application/json",

            },

            rate=5,

            concurrency=3,

        ),

        response=Response(

            records=Records(jmespath="data[]", primary_key=["id"]),

            rules=[

                Rule(

                    action=RuleAction.RETRY,

                    condition="response.status == 429",

                    max_attempts=5,

                    backoff=BackoffType.EXPONENTIAL,

                    backoff_base=2,

                ),

            ],

        ),

        pagination=Pagination(

            next_state={"offset": "{state.offset + state.limit}"},

            stop_condition="length(response.records) < state.limit",

        ),

    ),

    endpoints={

        "users": Endpoint(

            description="List all users",

            state={"offset": 0},

            request=Request(

                url="{state.base_url}/users",

                parameters={"limit": "{state.limit}", "offset": "{state.offset}"},

            ),

            response=Response(

                processors=[

                    Processor(expression="record.id", output="queue.user_ids"),

                ],

            ),

        ),

        "user_orders": Endpoint(

            description="Get orders for each user",

            iterate=Iterate(over="queue.user_ids", into="state.user_id", concurrency=5),

            request=Request(url="{state.base_url}/users/{state.user_id}/orders"),

            response=Response(

                processors=[

                    Processor(expression="state.user_id", output="record.user_id"),

                ],

            ),

        ),

        "metrics": Endpoint(

            description="Daily metrics (incremental)",

            state={

                "offset": 0,

                "since": '{coalesce(sync.last_date, date_format(date_add(now(), -30, "day"), "%Y-%m-%d"))}',

            },

            sync=["last_date"],

            request=Request(

                url="{state.base_url}/metrics",

                parameters={"since": "{state.since}"},

            ),

            response=Response(

                records=Records(primary_key=["id"], update_key="date"),

                processors=[

                    Processor(

                        expression="record.date",

                        output="state.last_date",

                        aggregation=AggregationType.MAXIMUM,

                    ),

                ],

            ),

        ),

    },

)

# Validate

errors = spec.validate()

assert errors == [], errors

# Write to file

spec.to_yaml_file("my_api.yaml")

# Or get as string

print(spec.to_yaml())

print(spec.to_json())

```

Parse an existing spec:

```python

from sling.api_spec import ApiSpec, Endpoint, Request, Response, Records

spec = ApiSpec.parse_file("path/to/spec.yaml")

print(spec.name)

print(list(spec.endpoints.keys()))

# Modify and re-export

spec.endpoints["new_endpoint"] = Endpoint(

    request=Request(url="{state.base_url}/new"),

    response=Response(records=Records(primary_key=["id"])),

)

spec.to_yaml_file("updated_spec.yaml")

```

Use `+rules`/`+processors` modifiers to append to defaults without replacing them:

```python

from sling.api_spec import Endpoint, Request, Response, Rule, RuleAction

endpoint = Endpoint(

    request=Request(url="{state.base_url}/fragile"),

    response=Response(

        # append_rules serializes as "rules+" in YAML, keeping default rules intact

        append_rules=[Rule(action=RuleAction.SKIP, condition="response.status == 404")],

    ),

)

```

## Testing

```bash

pytest sling/tests/tests.py -v

pytest sling/tests/test_sling_class.py -v

```

## MCP

To Login:

```

mcp-publisher login dns --domain slingdata.io --private-key $(openssl pkey -in mcp-key.pem  -noout -text | grep -A3 "priv:" | tail -n +2 | tr -d ' :\n')`

```

To Publish:

```bash

# to publish, adjust the version first in server.json

mcp-publisher publish

# check

curl "https://registry.modelcontextprotocol.io/v0/servers?search=io.slingdata/sling-cli"

```

mcp-name: io.slingdata/sling-cli
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/slingdata-io/sling-python

Awesome Lists containing this project

README