{"id":15015345,"url":"https://github.com/getindata/streaming-jupyter-integrations","last_synced_at":"2025-04-09T19:24:21.469Z","repository":{"id":37095990,"uuid":"490262469","full_name":"getindata/streaming-jupyter-integrations","owner":"getindata","description":null,"archived":false,"fork":false,"pushed_at":"2023-11-09T10:03:26.000Z","size":229,"stargazers_count":16,"open_issues_count":4,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-03-31T20:23:36.874Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getindata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-05-09T11:55:29.000Z","updated_at":"2024-04-15T09:43:47.357Z","dependencies_parsed_at":"2024-04-15T09:43:44.564Z","dependency_job_id":"09bb65f3-e48c-43c6-9961-ab43ee67d24a","html_url":"https://github.com/getindata/streaming-jupyter-integrations","commit_stats":null,"previous_names":[],"tags_count":36,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-jupyter-integrations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-jupyter-integrations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-jupyter-integrations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fstreaming-jupyter-integrations/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/getindata","download_url":"https://codeload.github.com/getindata/streaming-jupyter-integrations/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248096354,"owners_count":21047039,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:46:53.328Z","updated_at":"2025-04-09T19:24:21.448Z","avatar_url":"https://github.com/getindata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Python Version](https://img.shields.io/badge/python-3.8-blue.svg)](https://github.com/getindata/streaming_jupyter_integrations)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![SemVer](https://img.shields.io/badge/semver-2.0.0-green)](https://semver.org/)\n[![PyPI version](https://badge.fury.io/py/streaming-jupyter-integrations.svg)](https://pypi.org/project/streaming-jupyter-integrations/)\n[![Downloads](https://pepy.tech/badge/streaming_jupyter_integrations)](https://pepy.tech/badge/streaming_jupyter_integrations)\n\n# Streaming Jupyter Integrations\n\nThe Streaming Jupyter Integrations project includes a set of magics for interactively running _Flink SQL_ jobs in [Jupyter](https://jupyter.org/) notebooks.\n\n## Installation\n\nTo use these magics, install our pip package along with `jupyterlab-lsp`:\n\n```shell\npython3 -m pip install jupyterlab-lsp streaming-jupyter-integrations\n```\n\n## Usage\n\nRegister in Jupyter with a running IPython in the first 
cell:\n\n```python\n%load_ext streaming_jupyter_integrations.magics\n```\n\nThen choose an _execution mode_ and an _execution target_:\n\n```python\n%flink_connect --execution-mode [mode] --execution-target [target]\n```\n\nBy default, the `streaming` execution mode and `local` execution target are used.\n\n```python\n%flink_connect\n```\n\n### Execution mode\n\nCurrently, Flink supports two execution modes: _batch_ and _streaming_. Please see the\n[Flink documentation](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/)\nfor more details.\n\nTo specify the execution mode, add the `--execution-mode` parameter, for instance:\n```python\n%flink_connect --execution-mode batch\n```\n\n### Execution target\n\nStreaming Jupyter Integrations supports 3 execution targets:\n* Local\n* Remote\n* YARN Session\n\n#### Local execution target\n\nRunning Flink in `local` mode will start a MiniCluster in a local JVM with parallelism 1.\n\nTo run Flink locally, use:\n```python\n%flink_connect --execution-target local\n```\n\nAlternatively, since the execution target is `local` by default, use:\n```python\n%flink_connect\n```\n\nYou can specify the port of the local JobManager (8099 by default). This is especially useful if you run multiple\nnotebooks in a single JupyterLab.\n\n```python\n%flink_connect --execution-target local --local-port 8123\n```\n\n\n#### Remote execution target\n\nRunning Flink in `remote` mode will connect to an existing Flink session cluster. Besides specifying `--execution-target`\nto be `remote`, you also need to specify `--remote-hostname` and `--remote-port` pointing to the Flink JobManager's\nREST API address.\n\n```python\n%flink_connect \\\n    --execution-target remote \\\n    --remote-hostname example.com \\\n    --remote-port 8888\n```\n\n#### YARN session execution target\n\nRunning Flink in `yarn-session` mode will connect to an existing Flink session cluster running on YARN. 
You may specify\nthe hostname and port of the YARN Resource Manager (`--resource-manager-hostname` and `--resource-manager-port`).\nIf the Resource Manager address is not provided, the notebook is assumed to run on the same node as the Resource Manager.\nYou can also specify the YARN application ID (`--yarn-application-id`) to which the notebook will connect.\nIf `--yarn-application-id` is not specified and there is one YARN application running on the cluster, the notebook will\ntry to connect to it. Otherwise, it will fail.\n\nConnecting to a remote Flink session cluster running on a remote YARN cluster:\n```python\n%flink_connect \\\n    --execution-target yarn-session \\\n    --resource-manager-hostname example.com \\\n    --resource-manager-port 8888 \\\n    --yarn-application-id application_1666172784500_0001\n```\n\nConnecting to a Flink session cluster running on a YARN cluster:\n```python\n%flink_connect \\\n    --execution-target yarn-session \\\n    --yarn-application-id application_1666172784500_0001\n```\n\nConnecting to a Flink session cluster running on a dedicated YARN cluster:\n```python\n%flink_connect --execution-target yarn-session\n```\n\n## Variables\nMagics allow for dynamic variable substitution in _Flink SQL_ cells.\n```python\nmy_variable = 1\n```\n```sql\nSELECT * FROM some_table WHERE product_id = {my_variable}\n```\n\nMoreover, you can mark sensitive variables, such as passwords, so that they are read from environment variables or user input every time the cell is run:\n```sql\nCREATE TABLE MyUserTable (\n  id BIGINT,\n  name STRING,\n  age INT,\n  status BOOLEAN,\n  PRIMARY KEY (id) NOT ENFORCED\n) WITH (\n   'connector' = 'jdbc',\n   'url' = 'jdbc:mysql://localhost:3306/mydatabase',\n   'table-name' = 'users',\n   'username' = '${my_username}',\n   'password' = '${my_password}'\n);\n```\n\n### `%%flink_execute` command\n\nThe command lets you use the Python DataStream API and Table API. 
One handle is exposed for each API:\n`stream_env` and `table_env`, respectively.\n\nTable API example:\n```python\n%%flink_execute\nquery = \"\"\"\n    SELECT   user_id, COUNT(*)\n    FROM     orders\n    GROUP BY user_id\n\"\"\"\nexecution_output = table_env.execute_sql(query)\n```\n\nWhen the Table API is used, the final result has to be assigned to the `execution_output` variable.\n\nDataStream API example:\n```python\n%%flink_execute\nfrom pyflink.common.typeinfo import Types\n\nexecution_output = stream_env.from_collection(\n    collection=[(1, 'aaa'), (2, 'bb'), (3, 'cccc')],\n    type_info=Types.ROW([Types.INT(), Types.STRING()])\n)\n```\n\nWhen the DataStream API is used, the final result has to be assigned to the `execution_output` variable. Please note that\nthe pipeline does not end with `.execute()`; the execution is triggered by the Jupyter magics under the hood.\n\n---\n\n## Local development\n\nThere are currently 2 options for running `streaming_jupyter_integrations` for development. We can either\nuse a Docker image or install it on our machine.\n\n### Docker image\n\nYou can build a `Docker` image of `Jupyter Notebooks` by running the command below.\nIt will contain the functionality that was developed in this project.\n```bash\ndocker build --tag streaming_jupyter_integrations_image .\n```\n\nAfter the image is built, we can run it using this command:\n```bash\ndocker run --name streaming_jupyter_integrations -p 8888:8888 streaming_jupyter_integrations_image\n```\n\nAfter that we should be able to reach our Jupyterhub running on Docker under:\nhttp://127.0.0.1:8888/\n\n### Local installation\n\nNote: You will need NodeJS to build the extension package.\n\nThe `jlpm` command is JupyterLab's pinned version of\n[yarn](https://yarnpkg.com/) that is installed with JupyterLab. You may use\n`yarn` or `npm` in lieu of `jlpm` below. 
In order to use `jlpm`, you have to\nhave `jupyterlab` installed (e.g., by `brew install jupyterlab`, if you use\nHomebrew as your package manager).\n\n```bash\n# Clone the repo to your local environment\n# Change directory to the flink_sql_lsp_extension directory\n# Install package in development mode\npip install -e .\n# Link your development version of the extension with JupyterLab\njupyter labextension develop . --overwrite\n# Rebuild extension TypeScript source after making changes\njlpm build\n```\n\n### pre-commit\n\nThe project uses [pre-commit](https://pre-commit.com/) hooks to ensure code quality, mostly by linting.\nTo use it, [install pre-commit](https://pre-commit.com/#install) and then run:\n```shell\npre-commit install --install-hooks\n```\nFrom then on, it will lint the files you have modified on every commit attempt.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fstreaming-jupyter-integrations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetindata%2Fstreaming-jupyter-integrations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fstreaming-jupyter-integrations/lists"}