https://github.com/googleclouddataproc/dataproc-spark-connect-python
- Host: GitHub
- URL: https://github.com/googleclouddataproc/dataproc-spark-connect-python
- Owner: GoogleCloudDataproc
- License: apache-2.0
- Created: 2024-09-13T20:04:28.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-11T01:14:39.000Z (about 1 year ago)
- Last Synced: 2025-04-12T15:07:17.364Z (about 1 year ago)
- Language: Python
- Size: 178 KB
- Stars: 1
- Watchers: 17
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: contributing.md
- License: LICENSE
# Google Spark Connect Client
A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client that
adds functionality so applications can communicate with a remote Dataproc
Spark cluster over the Spark Connect protocol without additional setup steps.
## Install
.. code-block:: console

   pip install google_spark_connect
## Uninstall
.. code-block:: console

   pip uninstall google_spark_connect
## Setup
This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam).
If you are running the client outside of Google Cloud, you must set the following environment variables:
* GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
* GOOGLE_CLOUD_REGION - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
* GOOGLE_APPLICATION_CREDENTIALS - The path to your [Application Default Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc) file.
* DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The path to the session config file, such as `tests/integration/resources/session.textproto`.
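Before creating a session from outside Google Cloud, it can help to confirm that the required variables are set. The following is a minimal sketch using only the Python standard library. The helper `missing_env_vars` and the value `my-project` are hypothetical and are not part of this package; only the variable names come from the list above.

```python
import os

# Names taken from the list above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_REGION",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars({"GOOGLE_CLOUD_PROJECT": "my-project"}))
# prints ['GOOGLE_CLOUD_REGION', 'GOOGLE_APPLICATION_CREDENTIALS']
```

Calling `missing_env_vars()` with no argument checks the current process environment.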
## Usage
1. Install the latest versions of the Dataproc Python client and Google Spark Connect modules:

   .. code-block:: console

      pip install google_cloud_dataproc --force-reinstall
      pip install google_spark_connect --force-reinstall
2. Add the required import to your PySpark application or notebook:

   .. code-block:: python

      from google.cloud.spark_connect import GoogleSparkSession
3. There are two ways to create a Spark session:

   1. Start a Spark session using the properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:

      .. code-block:: python

         spark = GoogleSparkSession.builder.getOrCreate()
   2. Start a Spark session with the following code instead of using a config file:

      .. code-block:: python

         from google.cloud.dataproc_v1 import Session
         from google.cloud.dataproc_v1 import SparkConnectConfig

         # Build the session configuration in code.
         google_session_config = Session()
         google_session_config.spark_connect_session = SparkConnectConfig()
         # Subnetwork for the session; the value is empty in this example.
         google_session_config.environment_config.execution_config.subnetwork_uri = ""
         # Runtime version for the session.
         google_session_config.runtime_config.version = '3.0'

         spark = GoogleSparkSession.builder.googleSessionConfig(google_session_config).getOrCreate()
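Once a session exists, a trivial job is a quick way to confirm that the connection works. The sketch below assumes `spark` is a session created by either method above and uses only the standard PySpark calls `range` and `count`. The helper name `smoke_test` is hypothetical. Running it against a real session starts work on Dataproc, so the Billing section applies.

```python
def smoke_test(spark, n=5):
    """Run a trivial job: build an n-row DataFrame and count its rows."""
    # SparkSession.range(n) yields a single-column DataFrame with n rows;
    # DataFrame.count() triggers execution and returns the row count.
    return spark.range(n).count() == n


# Usage, assuming a session created as shown above:
#   spark = GoogleSparkSession.builder.getOrCreate()
#   assert smoke_test(spark)
```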
## Billing
Because this client runs Spark workloads on Dataproc, your project is billed according to [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing).
Billing applies even if you run the client from a machine that is not a Compute Engine instance.
## Contributing
### Building and Deploying SDK
1. Install the requirements in a virtual environment.

   .. code-block:: console

      pip install -r requirements.txt
2. Build the code.

   .. code-block:: console

      python setup.py sdist bdist_wheel
3. Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file.

   .. code-block:: console

      # Set VERSION to the version specified in setup.py.
      VERSION=
      # Append the destination bucket after gs://.
      gsutil cp dist/google_spark_connect-${VERSION}-py2.py3-none-any.whl gs://
4. Download the new SDK on Vertex, then uninstall the old version and install the new one.

   .. code-block:: console

      %%bash
      # Set VERSION to the version specified in setup.py.
      export VERSION=
      # Insert the bucket name after gs:// before running.
      gsutil cp gs:///google_spark_connect-${VERSION}-py2.py3-none-any.whl .
      # -y skips the confirmation prompt.
      pip uninstall -y google_spark_connect
      pip install google_spark_connect-${VERSION}-py2.py3-none-any.whl
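The wheel file name in steps 3 and 4 follows a fixed pattern built from the version. As a small sketch, the naming can be expressed in Python. The helpers `wheel_filename` and `wheel_gcs_uri`, the version `0.1.0`, and the bucket `example-bucket` are hypothetical placeholders for illustration, not values from this repository.

```python
def wheel_filename(version):
    """Wheel file name used in the gsutil commands above for a version."""
    if not version:
        raise ValueError("version must be a non-empty string")
    return f"google_spark_connect-{version}-py2.py3-none-any.whl"


def wheel_gcs_uri(bucket, version):
    """Cloud Storage URI of the wheel stored at the top level of bucket."""
    return f"gs://{bucket}/{wheel_filename(version)}"


# Hypothetical version and bucket, for illustration only.
print(wheel_filename("0.1.0"))
# prints google_spark_connect-0.1.0-py2.py3-none-any.whl
print(wheel_gcs_uri("example-bucket", "0.1.0"))
# prints gs://example-bucket/google_spark_connect-0.1.0-py2.py3-none-any.whl
```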