{"id":16032741,"url":"https://github.com/voutilad/rp-connect-python","last_synced_at":"2025-09-02T17:32:34.100Z","repository":{"id":250951491,"uuid":"835932135","full_name":"voutilad/rp-connect-python","owner":"voutilad","description":"A Python interpreter embedded in Redpanda Connect","archived":false,"fork":false,"pushed_at":"2024-09-25T18:01:09.000Z","size":344,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-02T10:11:16.276Z","etag":null,"topics":["benthos-plugin","redpanda-connect"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voutilad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T20:15:24.000Z","updated_at":"2024-10-03T16:25:18.000Z","dependencies_parsed_at":"2024-10-08T21:40:39.490Z","dependency_job_id":"720314dd-5419-4e1f-a9ae-0563d14778ff","html_url":"https://github.com/voutilad/rp-connect-python","commit_stats":null,"previous_names":["voutilad/rp-connect-python"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/voutilad/rp-connect-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voutilad%2Frp-connect-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voutilad%2Frp-connect-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voutilad%2Frp-connect-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vo
utilad%2Frp-connect-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voutilad","download_url":"https://codeload.github.com/voutilad/rp-connect-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voutilad%2Frp-connect-python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273322092,"owners_count":25085019,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benthos-plugin","redpanda-connect"],"created_at":"2024-10-08T21:40:28.031Z","updated_at":"2025-09-02T17:32:33.656Z","avatar_url":"https://github.com/voutilad.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Redpanda Connect + Python\n[![Build without CGO_ENABLED](https://github.com/voutilad/rp-connect-python/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/voutilad/rp-connect-python/actions/workflows/build.yml)\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./rpcn_and_python.jpg\" width=\"45%\" alt=\"A redpanda \u0026 a python sipping tea together as friends.\"\u003e\n\u003c/div\u003e\n\nAdds an embedded Python interpreter to Redpanda Connect, so you can\nwrite your integration and transformation logic in pure Python:\n\n```yaml\n# 
rot13.yaml\ninput:\n  stdin: {}\n\npipeline:\n  processors:\n    - python:\n        exe: python3\n        script: |\n          import codecs\n          msg = content().decode()\n          root.original = msg\n          root.encoded = codecs.encode(msg, \"rot_13\")\n\noutput:\n  stdout: {}\n\nlogger:\n  level: OFF\n```\n\n```\n$ echo My voice is my passport | ./rp-connect-python run examples/rot13.yaml\n{\"original\": \"My voice is my passport\", \"encoded\": \"Zl ibvpr vf zl cnffcbeg\"}\n```\n\n\n## Requirements\n- Python 3.12 (hard requirement, currently!)\n- `setuptools` (makes it so much easier to find `libpython`, just `pip install`\n  it.)\n  - On macOS, if you used `brew` to install Python, it can fall back to using\n    `otool` to find the dynamic library.\n  - On Linux...sorry! You must use `setuptools`.\n- Go 1.22 or newer\n\n## Building\nBuilding `rp-connect-python` is simple because it uses pure Go code:\n\n```bash\nCGO_ENABLED=0 go build\n```\n\nThat's it! A variety of tests are provided, and the current GitHub\naction [file](./.github/workflows/build.yml) shows some examples.\n\n\n## Component Types\nThis project provides the following new Python component types:\n\n1. [Input](#input) -- for generating data using Python\n2. [Processor](#processor) -- for transforming data with Python\n3. [Output](#output) -- for sinking data with Python\n\n\n## Input\nThe `python` input allows you to generate or acquire data using Python. 
Your\nscript can provide one of the following data generation approaches based on\nthe type of object you target when setting the `name` configuration property:\n\n- `object`\n  - If you provide a single Python object, it can be passed as a single input.\n- `list` or `tuple`\n  - A list or tuple will have each item extracted and provided to the pipeline.\n- `generator`\n  - Items will be produced from the generator until it's exhausted.\n- `function`\n  - Any function provided will be called repeatedly until it returns `None`.\n  - Functions may take an optional kwarg `state`, a `dict`, and use it\n    to keep state between invocations.\n\n### Input Serialization\nBy default, the input will serialize data either as native Go values (in the\ncase of `string`, `number`, `bytes`) or as JSON (in the case of the\nPython container types `dict`, `list`, and `tuple`).\n\nSerialization via `pickle` can be done manually, but if you set `pickle: true`\nthe input will convert the produced Python object using `pickle.dumps()`\nautomatically, storing the output as raw bytes on the Redpanda Connect\n`Message`.\n\n### Input Configuration\nCommon configuration with defaults for a Python `input`:\n\n```yaml\ninput:\n  label: \"\"\n  python:\n    pickle: false     # Enable pickle serializer\n    batch_size: 1     # How many messages to include in a single message batch.\n    mode: global      # Interpreter mode (one of \"global\", \"isolated\", \"isolated_legacy\")\n    exe: \"python3\"    # Name of python binary to use.\n    name:             # No default (required), name of generating local object.\n    script:           # No default (required), Python code to execute.\n```\n\nAn example that uses a Python generator to emit 10 records, one every second:\n\n```yaml\ninput:\n  python:\n    name: g\n    script: |\n      import time\n      def producer():\n        for i in range(10):\n          time.sleep(1)\n          yield { \"now\": time.ctime(), \"i\": i }\n      g = 
producer()\n```\n\n### Input Caveats\nCurrently, a single interpreter is used for executing the input script. If you\nchange the [mode](#interpreter-modes), it will use different interpreter\nsettings, which could affect the [Python compatibility](#python-compatability) of\nyour script. Keep this in mind.\n\n\n## Processor\nThe `python` processor provides a similar experience to the `mapping` bloblang\nprocessor, but in pure Python. The interpreter that runs your code provides\nlazy hooks back into Redpanda Connect, to mimic bloblang behavior:\n\n- `content()` -- similar to the bloblang function, it returns the `bytes` of\n  a message. This performs a lazy copy of raw bytes into the interpreter.\n\n- `metadata(key)` -- similar to the bloblang function, it provides access to\n  the metadata of a message using the provided `key`.\n\n- `root` -- this is a `dict`-like object in scope by default providing three\n  operating modes simultaneously:\n  - Assign key/values like a Python `dict`, e.g. `root[\"name\"] = \"Dave\"`\n  - Use bloblang-like assignment by attribute, e.g. `root.name.first = \"Dave\"`\n  - Reassign it to a new object, e.g. `root = (1, 2)`. (Note: if you reassign\n    `root`, it loses its magic properties!)\n\n\u003e Heads up!\n\u003e\n\u003e If using the bloblang-like assignment, it will create the hierarchy of keys,\n\u003e as bloblang does. 
`root.name.first = \"Dave\"` will work even if \"name\"\n\u003e hasn't been assigned yet, producing a dict like:\n\u003e ```python\n\u003e root = { \"name\": { \"first\": \"Dave\" } }\n\u003e ```\n\nFor the details of how `root` works, see the `Root` Python\n[class](./processor/globals.py).\n\nAdditionally, the following helper functions and objects improve\ninteroperability:\n\n- `unpickle()` -- will use `pickle.loads()` to deserialize the Redpanda Connect\n  `Message` into a Python object.\n\n- `meta` -- a `dict` that allows you to assign new metadata values to a message\n  or delete values (if you set the value to `None` for a given key).\n\nAn example using `unpickle()`:\n\n```yaml\npipeline:\n  processors:\n    - python:\n        script: |\n          # these are logically equivalent\n          import pickle\n          this = pickle.loads(content())\n\n          this = unpickle()\n\n          root = this.call_some_method()\n\n          # if relying on Redpanda Connect structured data, use JSON.\n          import json\n          this = json.loads(content().decode())\n\n          root = this[\"a_key\"]\n```\n\n\u003e The processor does not currently support automatic deserialization of\n\u003e incoming data. This keeps the expensive hooks back into Go as lazy as\n\u003e possible, so you only pay for what you use.\n\n## Processor Configuration\nCommon configuration with defaults for a Python `processor`:\n\n```yaml\npipeline:\n  processors:\n    - python:\n        exe: \"python3\"  # Name of python binary to use.\n        mode: \"global\"  # Interpreter mode (one of \"global\", \"isolated\", \"isolated_legacy\")\n        script:         # No default (required), Python script to execute\n```\n\n## Processor Demo\nA simple demo using [requests](./examples/requests.yaml), which enriches a\nmessage with a callout to an external web service, illustrates many of the prior\nconcepts of using a Python processor:\n\n```yaml\ninput:\n  generate:\n    
count: 3\n    interval: 1s\n    mapping: |\n      root.title = \"this is a test\"\n      root.uuid = uuid_v4()\n      root.i = counter()\n\npipeline:\n  processors:\n    - python:\n        exe: ./venv/bin/python3\n        script: |\n          import json\n          import requests\n          import time\n\n          data = content()\n          try:\n            msg = json.loads(data)[\"title\"]\n          except:\n            msg = \"nothing :(\"\n          root.msg = f\"You said: '{msg}'\"\n          root.at = time.ctime()\n          try:\n            root.ip = requests.get(\"https://api.ipify.org\").text\n          except:\n            root.ip = \"no internet?\"\n\noutput:\n  stdout: {}\n```\n\nTo run the demo, you need a Python environment with the `requests` module\ninstalled. This is easy to do with a virtual environment:\n\n```shell\n# Create a new virtual environment.\npython3 -m venv venv\n\n# Update pip, install setuptools, and install requests into the virtual env.\n./venv/bin/pip install --quiet -U pip setuptools requests\n\n# Run the example.\n./rp-connect-python run --log.level=off examples/requests.yaml\n```\n\nYou should get output similar to:\n\n```\n{\"msg\": \"You said: 'this is a test'\", \"at\": \"Fri Aug  9 19:07:29 2024\", \"ip\": \"192.0.1.210\"}\n{\"msg\": \"You said: 'this is a test'\", \"at\": \"Fri Aug  9 19:07:30 2024\", \"ip\": \"192.0.1.210\"}\n{\"msg\": \"You said: 'this is a test'\", \"at\": \"Fri Aug  9 19:07:31 2024\", \"ip\": \"192.0.1.210\"}\n```\n\n\n## Output\nPresently, the Python `output` is a bit of a hack and really just a Python\n`processor` configured to use a single interpreter instance.\n\nThis means all the configuration and behavior is the same as in the\n[processor configuration](#processor-configuration).\n\nWhen the `output` improves and warrants further discussion, check this space!\n\nFor now, a simple [example](./examples/output.yaml) that simply writes the\nprovided message to `stdout`:\n\n```yaml\ninput:\n  
generate:\n    count: 5\n    interval:\n    mapping: |\n      root = \"hello world\"\n\noutput:\n  python:\n    script: |\n      msg = content().decode()\n      print(f\"you said: '{msg}'\")\n\nhttp:\n  enabled: false\n```\n\n\n## Interpreter Modes\n`rp-connect-python` now supports multiple interpreter modes that may be set\nseparately on each `input`, `processor`, and `output` instance.\n\n- `global` (the default)\n  - Uses a global interpreter (i.e. no sub-interpreters) for all execution.\n  - Allows passing pointers to Python objects between components, avoiding\n    costly serialization/deserialization.\n  - Provides the most compatibility at the expense of throughput, as your code will\n    rely on the global main interpreter for memory management and the GIL.\n\n- `isolated`\n  - Uses multiple isolated sub-interpreters with their own memory allocators\n    and GILs.\n  - Provides the best throughput performance for pure-Python use cases that\n    don't leverage Python modules that use native code (e.g. `numpy`).\n  - Requires serializing/deserializing data as it leaves the context of the\n    interpreter.\n\n- `isolated_legacy`\n  - Same as `isolated`, but instead of distinct GILs and memory allocators, uses\n    a shared GIL and allocator.\n  - Balances compatibility with performance. Some Python modules might not\n    support full isolation, but _will_ work in a shared GIL mode.\n\n\nA more detailed discussion for the nerds follows.\n\n### Isolated \u0026 Isolated Legacy Modes\nMost pure Python code should \"just work\" with `isolated` mode and\n`isolated_legacy` mode. Some older Python extensions, written in C or the\nlike, may not work in `isolated` mode and require `isolated_legacy` mode.\n\nIf you see issues using `isolated` (e.g. crashes), switch to\n`isolated_legacy`.\n\n\u003e In general, crashes should _not_ happen. The most common causes are bugs\n\u003e in `rp-connect-python` related to _use-after-free_'s in the Python\n\u003e integration layer. 
If it's not that, it's an interpreter state issue,\n\u003e which is also most likely a bug in `rp-connect-python`. However, given the\n\u003e immaturity of multi-interpreter support in Python, if the issue \"goes away\"\n\u003e by switching modes (e.g. to \"legacy\"), it's possible it's deeper than just\n\u003e `rp-connect-python`.\n\nIn some cases, `isolated_legacy` can perform as well as or _slightly better_ than\n`isolated` even though it uses a shared GIL. It's very workload dependent, so\nit's worth experimenting.\n\n### Global Mode\nUsing `global` mode for a runtime will execute the Python code in the\ncontext of the \"main\" interpreter. (In `isolated` and `isolated_legacy` modes,\nsub-interpreters derive from the \"main\" interpreter.) This is the traditional\nmethod of embedding Python into an application.\n\nWhile you may scale out your `global` mode components, only a single\ncomponent instance may utilize the \"main\" interpreter at a time. This is\nirrespective of the GIL, as Python's C implementation relies heavily on\nthread-local storage for interpreter state.\n\n\u003e Go was designed by people who think programmers can't handle managing\n\u003e threads. (Multi-threading is hard, but that's why we're paid the big\n\u003e bucks, right?) As a result, the Go runtime does its own scheduling of\n\u003e goroutines onto some number of OS threads to achieve parallelism and\n\u003e concurrency. Python does not jibe with this and the vibes are off, so a\n\u003e lot of the `rp-connect-python` internals are for managing how to\n\u003e couple Python's thread-oriented approach with Go's goroutine world.\n\nA lot of scientific software that uses external non-Python native code\nmay run best in `global` mode. This includes, but is not limited to:\n\n- `numpy`\n- `pandas`\n- `pyarrow`\n\nA benefit of `global` mode is that there's one interpreter state across all components,\nso you can create a Python object in one component (e.g. 
an `input`) and\neasily use it in a `processor` stage without mucking about with serialization.\nThis is great for workloads that create large, in-memory objects, like Pandas\nDataFrames or PyArrow Tables. In these cases, avoiding serialization may mean\n`global` mode is more efficient even if there's fighting over the interpreter\nlock.\n\n\u003e The current design assumes arbitrary goroutines will need to acquire\n\u003e ownership of the global (\"main\") interpreter and fight over a mutex. It's\n\u003e entirely possible the mutex is held at points where the GIL is actually\n\u003e released or releasable, meaning other Python code _could_ run safely. It's\n\u003e future work to figure out how to orchestrate this efficiently.\n\n## Python Compatability\nThis is an evolving list of notes/tips related to using certain\npopular Python modules:\n\n### `requests`\nWorks best in `isolated_legacy` mode. Currently, it can panic in `isolated` mode on\nsome systems.\n\n\u003e While `requests` is pure Python, it does hook into some modules that\n\u003e are not. Still identifying a race condition causing memory corruption\n\u003e in `isolated` mode.\n\n### `numpy`\n`global` mode is recommended, as `numpy` explicitly does not support Python\nsub-interpreters. May work in `isolated_legacy`, but be careful.\n\n### `pandas`\nDepends on `numpy`, so might be best used in `global` mode if stability is a\nconcern. 
Works fine with the `pickle` support for passing DataFrames, but might\nnot be the most efficient way for passing data around a long pipeline, so\n`global` might be preferable to isolated modes.\n\nAn [example](./examples/pandas.yaml) that shows filtering a DataFrame and using\n`pickle` to pass it from the `input` to the `processor`:\n\n```yaml\ninput:\n  python:\n    mode: global\n    name: df\n    pickle: true\n    script: |\n      import pandas as pd\n      df = pd.DataFrame.from_dict({\"name\": [\"Maple\", \"Moxie\"], \"age\": [8, 3]})\n\npipeline:\n  processors:\n    - python:\n        mode: global\n        script: |\n          import pickle\n          df = unpickle()\n          root = df.to_dict(\"list\")\n\noutput:\n  stdout: {}\n```\n\n\u003e Note the use of `mode: global`!\n\n### `pyarrow`\nWorks fine in `global` mode.\n\nAn example follows using the `pyarrow.dataset` capabilities to read from GCS.\n\n\u003e Note the use of `serializer: none` as it prevents data copying/duplication.\n\n```yaml\ninput:\n  python:\n    name: batches\n    serializer: none\n    script: |\n      import pyarrow as pa\n      from pyarrow import fs\n      import pyarrow.dataset as ds\n      \n      gcs = fs.GcsFileSystem()\n      dataset = ds.dataset(\"my-bucket/\", format=\"parquet\", filesystem=gcs)\n      \n      # need special handling to raw cython generators, so for now\n      # wrap with a pure python generator\n      def take_all():\n        for batch in dataset.to_batches():\n          yield batch\n      batches = take_all()\n\npipeline:\n  processors:\n    - python:\n        script: |\n          # 'this' is now a PyArrow RecordBatch\n          root.nbytes = this.nbytes\n          root.num_rows = this.num_rows\n\noutput:\n  stdout: {}\n```\n\n### `pillow`\nSeems to work ok in `isolated_legacy` mode, but doesn't support\nsub-interpreters, so recommended to run in `global` mode.\n\nAn example of a directory scanner that identifies types of JPEGs:\n\n```yaml\ninput:\n  
file:\n    paths: [ ./*.jpg ]\n    scanner:\n      to_the_end: {}\n\npipeline:\n  processors:\n    - python:\n        exe: ./venv/bin/python3\n        mode: global\n        script: |\n          from PIL import Image\n          from io import BytesIO\n\n          infile = BytesIO(content())\n          try:\n            with Image.open(infile) as im:\n              root.format = im.format\n              root.size = im.size\n              root.mode = im.mode\n              root.path = metadata(\"path\")\n          except OSError:\n            pass\n\noutput:\n  stdout: {}\n```\n\nAssuming you `pip install` the dependencies of `setuptools` and `pillow`:\n\n```\n$  python3 -m venv venv\n$  ./venv/bin/pip install --quiet -U pip setuptools pillow\n$  ./rp-connect-python run --log.level=off examples/pillow.yaml\n{\"format\":\"JPEG\",\"mode\":\"RGB\",\"path\":\"rpcn_and_python.jpg\",\"size\":[1024,1024]}\n```\n\n\n## Known Issues / Limitations\n- Tested on macOS/arm64 and Linux/{arm64,amd64}.\n    - Not expected to work on Windows. Requires `gogopython` updates.\n- You can only use one Python binary across all Python processors.\n- Hardcoded still for Python 3.12. Should be portable to 3.13 and,\n  in cases of `global` mode, earlier versions. Requires changes to\n  `gogopython` I haven't made yet.\n\n\n## License and Supportability\nSource code in this project is licensed under the Apache v2 license unless\nnoted otherwise.\n\nThis software is provided without warranty or support. It is not part of\nRedpanda Data's enterprise offering and not supported by Redpanda Data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoutilad%2Frp-connect-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoutilad%2Frp-connect-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoutilad%2Frp-connect-python/lists"}