{"id":39698213,"url":"https://github.com/databricks-solutions/salesforce-zerobus","last_synced_at":"2026-01-18T10:19:11.341Z","repository":{"id":315114427,"uuid":"1052852136","full_name":"databricks-solutions/salesforce-zerobus","owner":"databricks-solutions","description":null,"archived":false,"fork":false,"pushed_at":"2025-10-15T18:17:09.000Z","size":464,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-16T17:14:12.109Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks-solutions.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-08T16:30:26.000Z","updated_at":"2025-10-15T17:58:03.000Z","dependencies_parsed_at":"2025-09-16T21:08:43.513Z","dependency_job_id":"7995cbdd-2ea0-4325-a02e-f7f08f59ae31","html_url":"https://github.com/databricks-solutions/salesforce-zerobus","commit_stats":null,"previous_names":["databricks-solutions/salesforce-zerobus"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/databricks-solutions/salesforce-zerobus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Fsalesforce-zerobus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Fsalesforce-zerobus/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/databricks-solutions%2Fsalesforce-zerobus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Fsalesforce-zerobus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks-solutions","download_url":"https://codeload.github.com/databricks-solutions/salesforce-zerobus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks-solutions%2Fsalesforce-zerobus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28534316,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T10:13:46.436Z","status":"ssl_error","status_checked_at":"2026-01-18T10:13:11.045Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-18T10:19:10.937Z","updated_at":"2026-01-18T10:19:11.314Z","avatar_url":"https://github.com/databricks-solutions.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SalesforceZerobus 🚀\n\n![Zerobus Architecture](zerobus_graphic.png)\n\nA simple, production-ready Python library for streaming Salesforce Change Data Capture (CDC) events to Databricks Delta tables in real-time using the Salesforce Pub/Sub API and Databricks Zerobus 
API.\n\n[![Python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n![Salesforce](https://img.shields.io/badge/Salesforce-00A1E0?logo=salesforce\u0026logoColor=white)\n![GitHub stars](https://img.shields.io/github/stars/databricks-solutions/salesforce-zerobus?style=social)\n![GitHub forks](https://img.shields.io/github/forks/databricks-solutions/salesforce-zerobus?style=social)\n![GitHub issues](https://img.shields.io/github/issues/databricks-solutions/salesforce-zerobus)\n![GitHub last commit](https://img.shields.io/github/last-commit/databricks-solutions/salesforce-zerobus)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n## Databricks Zerobus is now in Public Preview and available to all customers!\n\n## Features\n\n- **🚀 Simple API** - Single class interface with just 4 required parameters\n- **⚡ Real-time Streaming** - Sub-second event forwarding to Databricks \n- **⛔️ Eliminates Message Buses** - With Databricks Zerobus, you no longer need message buses to sink data to your lake.  
\n- **🔄 Zero Data Loss** - Automatic replay recovery ensures no missed events during outages\n- **🛡️ Production Ready** - Comprehensive error handling, health monitoring, and timeout protection\n- **🔐 OAuth Security** - Uses Service Principal authentication for enhanced security (API tokens deprecated)\n- **📦 Self-contained** - Bundles all dependencies (no external wheel dependencies)\n- **🔧 Flexible Configuration** - Support for all Salesforce objects (Account, Lead, Contact, Custom Objects)\n- **⚙️ Both Sync \u0026 Async** - Use blocking calls or async context manager patterns\n- **📊 Built-in Logging** - Detailed event processing logs for monitoring\n- **🧱 Databricks Asset Bundle** - Includes a Databricks Asset Bundle to get you up and running in minutes\n\n## 🚀 Local Quick Start\n### Installation\n\n**Prerequisites:**\n- Python 3.10 or higher\n- All dependencies are available via PyPI!\n\n**Install dependencies:**\n```bash\nuv add databricks-zerobus-ingest-sdk\nuv sync\n```\n\n### Minimal Working Example\n\n```python\n#!/usr/bin/env python3\nimport logging\nfrom salesforce_zerobus import SalesforceZerobus\n\n# Configure logging to see event processing\nlogging.basicConfig(\n    level=logging.INFO,\n    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'\n)\n\n# Initialize the streamer\nstreamer = SalesforceZerobus(\n    # What Salesforce CDC channel to monitor  \n    sf_object_channel=\"ChangeEvents\",\n    \n    # Where to send the data in Databricks\n    databricks_table=\"your_catalog.your_schema.all_change_events\", # If the table doesn't exist, the service creates it for you.\n    \n    # Salesforce credentials\n    salesforce_auth={\n        \"username\": \"your@email.com\",\n        \"password\": \"yourpassword+securitytoken\",  # Password + security token (no spaces)\n        \"instance_url\": \"https://your-instance.salesforce.com\"\n    },\n    \n    # Databricks credentials (OAuth Service Principal)\n    databricks_auth={\n        
\"workspace_url\": \"https://your-workspace.cloud.databricks.com\",\n        \"client_id\": \"your-service-principal-client-id\",\n        \"client_secret\": \"your-service-principal-client-secret\",\n        \"ingest_endpoint\": \"12345.zerobus.region.cloud.databricks.com\"\n    }\n)\n\n\nprint(\"Starting Salesforce to Databricks streaming...\")\nprint(\n    \"Monitoring channel: ChangeEvents → Databricks table: your_catalog.your_schema.all_change_events\"\n)\n\nif __name__ == \"__main__\":\n    streamer.start()\n```\n\n### Expected Log Output\n\nWhen running, you'll see logs like the following (this sample is from an `AccountChangeEvent` subscription):\n\n```bash\n2025-08-26 08:20:12 - salesforce_zerobus.core.Account - INFO - Authenticating with Salesforce...\n2025-08-26 08:20:13 - salesforce_zerobus.core.Account - INFO - Authentication successful!\n2025-08-26 08:20:13 - salesforce_zerobus.core.Account - INFO - Resuming from previous session with replay_id: 00000000000408760000\n2025-08-26 08:20:13 - salesforce_zerobus.core.Account - INFO - Starting subscription to /data/AccountChangeEvent\n2025-08-26 08:20:13 - salesforce_zerobus.core.Account - INFO - Batch size: 10, Mode: CUSTOM\n2025-08-26 08:20:15 - salesforce_zerobus.core.Account - INFO - Received Account UPDATE 001abc123def456\n2025-08-26 08:20:15 - databricks_forwarder - INFO - Written to Databricks: your_catalog.your_schema.account_events - Account UPDATE 001abc123def456\n```\n\n### Async Usage\n\n```python\nimport asyncio\nimport logging\nfrom salesforce_zerobus import SalesforceZerobus\n\nasync def main():\n    logging.basicConfig(level=logging.INFO)\n    \n    streamer = SalesforceZerobus(\n        sf_object_channel=\"LeadChangeEvent\",\n        databricks_table=\"catalog.schema.lead_events\",\n        salesforce_auth={...},\n        databricks_auth={...}\n    )\n    \n    # Use async context manager\n    async with streamer:\n        print(\"🚀 Async streaming started...\")\n        await streamer.stream_forever()\n\n# Run the async streamer\nasyncio.run(main())\n```\n\n## 🌐 
Multi-Object Streaming with ChangeEvents\n\n### Stream All Salesforce Objects at Once\n\nInstead of subscribing to individual object channels like `AccountChangeEvent` or `LeadChangeEvent`, you can subscribe to **all change events** across your entire Salesforce org using the `ChangeEvents` channel:\n\n```python\nfrom salesforce_zerobus import SalesforceZerobus\n\nstreamer = SalesforceZerobus(\n    # Subscribe to ALL object changes in your Salesforce org\n    sf_object_channel=\"ChangeEvents\",\n\n    databricks_table=\"catalog.schema.all_salesforce_events\",\n    salesforce_auth={...},\n    databricks_auth={...}\n)\n\nstreamer.start()\n```\n\n### Key Benefits\n\n- **🎯 Single Stream**: Capture Account, Contact, Lead, Opportunity, Custom Objects, etc. in one subscription\n- **🚀 Automatic Schema Handling**: Library automatically manages different schemas for each object type\n- **📊 Unified Table**: All events go to one Delta table with `entity_name` identifying the object type\n- **⚡ Efficient Caching**: Schemas are cached per object type for optimal performance\n\n### Understanding Multi-Object Data\n\nWith `ChangeEvents`, each event includes an `entity_name` field identifying the object:\n\n```bash\nINFO - Received Account UPDATE 001abc123def456\nINFO - Received Contact CREATE 003xyz789ghi012\nINFO - Received CustomObject__c DELETE 001def456abc789\n```\n\n### When to Use ChangeEvents vs Specific Objects\n\n**Use `ChangeEvents` when:**\n- You need a comprehensive view of all Salesforce activity\n- Building data lake ingestion for entire org\n- Creating audit trails or compliance monitoring\n- Prototyping or exploring data patterns\n\n**Use specific objects (e.g., `AccountChangeEvent`) when:**\n- You only care about specific object types\n- Building targeted integrations\n- Need to minimize data volume and processing\n- Want separate tables per object type\n\n## ⚡ Spark Structured Streaming Data Source\n\n### Bidirectional Streaming with Salesforce\n\nIn addition to 
the Databricks Zerobus integration, this project includes a **Spark Data Source** for bidirectional streaming with Salesforce Platform Events and Change Data Capture (CDC).\n\n### Key Capabilities\n\n**📖 Reader (Subscription)**\n- Real-time streaming from Salesforce Platform Events and CDC\n- Automatic bitmap field decoding for change events\n- Configurable replay with exactly-once processing\n- Automatic schema management with Avro decoding\n\n**✍️ Writer (Publishing)**\n- Publish streaming data to Salesforce Platform Events\n- Event forwarding between Salesforce topics with transformations\n- Custom data publishing from any Spark streaming source\n- Batch optimization for high-volume scenarios\n\n### Quick Example\n\n```python\nfrom spark_datasource import register_data_source\nfrom pyspark.sql.functions import col, current_timestamp\n\n# Register the data source\nregister_data_source(spark)\n\n# Read from Salesforce\ndf = spark.readStream.format(\"salesforce_pubsub\") \\\n    .option(\"username\", USERNAME) \\\n    .option(\"password\", PASSWORD) \\\n    .option(\"topic\", \"/data/AccountChangeEvent\") \\\n    .option(\"replayPreset\", \"EARLIEST\") \\\n    .load()\n\n# Stream to Delta table\ndf.writeStream \\\n    .format(\"delta\") \\\n    .option(\"checkpointLocation\", \"/path/to/checkpoints/\") \\\n    .toTable(\"catalog.schema.salesforce_events\")\n\n# Write back to Salesforce\nyour_stream.writeStream \\\n    .format(\"salesforce_pubsub\") \\\n    .option(\"username\", USERNAME) \\\n    .option(\"password\", PASSWORD) \\\n    .option(\"topic\", \"/data/CustomEvent__e\") \\\n    .start()\n```\n\n### When to Use Spark Data Source vs. 
Zerobus Library\n\n**Use Spark Data Source when:**\n- You need bidirectional streaming (read AND write to Salesforce)\n- Working with existing Spark Structured Streaming pipelines\n- Require complex transformations using Spark SQL/DataFrame APIs\n- Need to integrate with other Spark data sources\n\n**Use Zerobus Library when:**\n- You need simple, one-way streaming to Databricks Delta tables\n- Want minimal configuration and setup\n- Prefer lightweight Python applications\n- Need automatic table creation and replay recovery\n\n📖 **[View Full Spark Data Source Documentation →](spark_datasource/README.md)**\n\n## 📋 Prerequisites \u0026 Local Setup\n\n### 1. Salesforce Setup\n\n#### Enable Change Data Capture\n1. **Log into Salesforce** → **Setup** → **Integrations** → **Change Data Capture**\n2. **Select objects** to monitor (Account, Lead, Contact, etc.)\n3. **Click Save** and wait 2-3 minutes for topics to become available\n\n#### Get Security Token\n1. **Setup** → **My Personal Information** → **Reset My Security Token**\n2. **Check your email** for the security token\n3. **Append to password**: `yourpassword` + `SECURITYTOKEN` (no spaces)\n\n#### Verify API Access\n- Ensure your user profile has **API Enabled** permission\n- Check that your org allows Pub/Sub API access\n\n### 2. Databricks Setup\n\n#### Create OAuth Service Principal\n1. **Go to your Databricks workspace**\n2. **Navigate to Settings → Identity and Access → Service Principals**\n3. **Create a new Service Principal**\n4. **Generate OAuth credentials**:\n   - Generate and save the **client ID** and **client secret**\n   - These replace traditional API tokens for better security\n5. **Grant permissions** to your Service Principal:\n   - Table access permissions for your target Delta table\n   - Workspace permissions for SQL operations\n\n#### 🔐 OAuth vs API Tokens (Migration Guide)\n\n**For users migrating from API tokens:**\n- **Enhanced Security**: OAuth tokens auto-refresh every hour vs. 
static API tokens\n- **Unified Authentication**: Single Service Principal for all Databricks operations\n- **Simplified Configuration**: No separate SQL API tokens needed\n- **Future-Proof**: Databricks recommends OAuth for 2025+ (API tokens being deprecated)\n\n**Migration steps:**\n1. Create Service Principal (steps above)\n2. Update environment variables:\n   ```bash\n   # Replace these:\n   # DATABRICKS_API_TOKEN=dapi...\n   # DATABRICKS_SQL_API_TOKEN=dapi...\n\n   # With these:\n   DATABRICKS_CLIENT_ID=your-service-principal-client-id\n   DATABRICKS_CLIENT_SECRET=your-service-principal-client-secret\n   ```\n3. Update ingest endpoint format:\n   ```bash\n   # Change from: workspace-id.ingest.region.cloud.databricks.com\n   # To:         workspace-id.zerobus.region.cloud.databricks.com\n   ```\n\n#### 🚨 Table Configuration\nIf you do not create the table before running the service, a table will be made for you using the name specified in main.py. This step is optional.\n\n🚨 **Important**: `'delta.enableRowTracking' = 'false'` must be set for all Zerobus target tables.\n#### (Optional) Create Delta Table\nRun this SQL in your Databricks workspace:\n\n```sql\nCREATE TABLE IF NOT EXISTS your_catalog.your_schema.account_events (\n  event_id STRING COMMENT 'Unique Salesforce event identifier',\n  schema_id STRING COMMENT 'Event schema version from Salesforce',\n  replay_id STRING COMMENT 'Event position for replay functionality', \n  timestamp BIGINT COMMENT 'Event timestamp in milliseconds',\n  change_type STRING COMMENT 'Type of change: CREATE, UPDATE, DELETE, UNDELETE',\n  entity_name STRING COMMENT 'Salesforce object name (Account, Contact, etc.)',\n  change_origin STRING COMMENT 'Source of the change (API, UI, etc.)',\n  record_ids ARRAY\u003cSTRING\u003e COMMENT 'List of affected Salesforce record IDs',\n  changed_fields ARRAY\u003cSTRING\u003e COMMENT 'List of field names that were modified',\n  nulled_fields ARRAY\u003cSTRING\u003e COMMENT 'List of field 
names that were set to null',\n  diff_fields ARRAY\u003cSTRING\u003e COMMENT 'List of field names with differences',\n  record_data_json STRING COMMENT 'Complete record data serialized as JSON',\n  payload_binary BINARY COMMENT 'Raw Avro binary payload for schema-based parsing',\n  schema_json STRING COMMENT 'Avro schema JSON string for parsing binary payload',\n  org_id STRING COMMENT 'Salesforce organization ID',\n  processed_timestamp BIGINT COMMENT 'When this event was processed by our pipeline'\n)\nUSING DELTA\nTBLPROPERTIES (\n  'delta.enableRowTracking' = 'false',\n  'delta.autoOptimize.optimizeWrite' = 'true',\n  'delta.autoOptimize.autoCompact' = 'true'\n)\nCOMMENT 'Real-time Salesforce Change Data Capture events';\n```\n\n#### Get Databricks Credentials\n- **Client ID and Client Secret**: Generated for the OAuth Service Principal created above (these replace API tokens)\n- **Workspace URL**: Your Databricks workspace URL (e.g., `https://workspace.cloud.databricks.com`)\n- **Ingest Endpoint**: Found in workspace settings (format: `workspace-id.zerobus.region.cloud.databricks.com`)\n\n## ⚙️ Configuration Options\n\n### Complete Configuration Example\n\n```python\nstreamer = SalesforceZerobus(\n    # Required parameters\n    sf_object_channel=\"AccountChangeEvent\", # Salesforce CDC channel (AccountChangeEvent, CustomObject__ChangeEvent, or ChangeEvents)\n    databricks_table=\"catalog.schema.table\",   # Target Databricks table\n    salesforce_auth={                      # Salesforce credentials dict\n        \"username\": \"user@company.com\",\n        \"password\": \"password+token\", \n        \"instance_url\": \"https://company.salesforce.com\"\n    },\n    databricks_auth={                      # Databricks OAuth Service Principal\n        \"workspace_url\": \"https://workspace.cloud.databricks.com\",\n        \"client_id\": \"your-service-principal-client-id\",\n        \"client_secret\": \"your-service-principal-client-secret\",\n        \"ingest_endpoint\": 
\"workspace-id.zerobus.region.cloud.databricks.com\"\n    },\n    \n    # Optional parameters with defaults\n    batch_size=10,                         # Events per fetch request (default: 10)\n    enable_replay_recovery=True,           # Zero-loss recovery (default: True)\n    auto_create_table=True,                # Auto-create Databricks table if missing (default: True)\n    backfill_historical=True,              # Start from EARLIEST for new tables (default: True)\n    timeout_seconds=300.0,                 # Semaphore timeout (default: 300s)\n    max_timeouts=3,                        # Max consecutive timeouts (default: 3)\n    grpc_host=\"api.pubsub.salesforce.com\", # Salesforce gRPC host\n    grpc_port=7443,                        # Salesforce gRPC port  \n    api_version=\"57.0\"                     # Salesforce API version\n)\n```\n\n## 🧱 Run as Databricks App\nRunning this service as a Databricks app and subscribing to ChangeEvents is a great way to stream all Salesforce changes with low cost, simplified CI/CD, and a rich governance model. \nSee databricks.yml for how the .whl is built. \n1. View the contents of resources/app.yml\n2. Configure the app.yaml file variables\n3. Deploy the Databricks Asset Bundle: \n    1. Comment out the job/pipeline .yml contents if you do not wish to deploy a job or pipeline \n    2. In a terminal, run `databricks bundle deploy -t dev`\n    3. In a terminal, run `databricks sync --full . /Workspace/Users/{user}/.bundle/{bundle_name}/dev/files`\n4. View the app in the Databricks UI and deploy it. \n5. Deploy the Lakeflow Declarative Pipeline resource to flatten and parse the streamed data\n\n## 🧱 Run as Databricks Job\n### Running the service as a Databricks Job\nRunning this service as a Databricks job leverages the For each task type to ingest several Salesforce objects in parallel. \n\nRunning the following commands in the terminal will deploy a serverless job, the packaged .whl file, and the notebook_task.py. 
To see what is being built, view databricks.yml.\n\n1. In the notebook_task.py file, edit the variables salesforce_auth, databricks_auth, and secret_scope_name before deploying the job\n2. Run the following commands: \n    ```bash\n    brew tap databricks/tap\n    brew install databricks\n    databricks bundle deploy -t dev \n    ```\n3. To edit the objects being ingested, extend the list at the bottom of the databricks.yml\n\n\n### Supported Salesforce Objects\n\nWorks with any Salesforce object that has Change Data Capture enabled:\n#### Read Every Object Change\n- `ChangeEvents`\n\n#### Standard Objects Examples\n- `Account`, `Contact`, `Lead`, `Opportunity`, `Case`\n\n#### Custom Objects\n- Any custom object with CDC enabled (e.g., `CustomObject__c`)\n\n## 🏗️ Automatic Table Creation \u0026 Historical Backfill\n\n### Smart Table Management\n\nThe library automatically handles Databricks table creation and historical data backfill:\n\n**New Deployment (Table Doesn't Exist)**:\n- ✅ **Auto-creates** Delta table with optimized CDC schema\n- 🕰️ **Historical Backfill**: Starts from `EARLIEST` to capture all retained historical events\n- 📊 **Optimized Schema**: Includes partitioning and auto-compaction\n\n**Existing Deployment (Table Exists)**:\n- 🔄 **Resume**: Continues from last processed `replay_id` using zero-loss recovery\n- ⚡ **Fast Startup**: Uses cached replay position for immediate streaming\n\n**Empty Table (Created but No Data)**:\n- 🕰️ **Backfill Mode**: Starts from `EARLIEST` to capture historical events\n- 📈 **Progressive Load**: Processes events chronologically from the beginning\n\n### Configuration Options\n\n```python\nstreamer = SalesforceZerobus(\n    # ... required parameters omitted for brevity ...\n\n    # Auto-creation behavior  \n    auto_create_table=True,     # Create table if missing (default: True)\n    backfill_historical=True,   # Start from EARLIEST for new/empty tables (default: True)\n    \n    # Alternative configurations (use in place of the values above)\n    # auto_create_table=False,    # Require table to exist, fail if missing\n   
 # backfill_historical=False,  # Start from LATEST even for new tables (real-time only)\n)\n```\n\n### Example Scenarios\n\n**Scenario 1: Fresh Deployment**\n```bash\nINFO - Table catalog.schema.account_events doesn't exist - creating and configuring for historical backfill\nINFO - Successfully created table: catalog.schema.account_events  \nINFO - Starting historical backfill from EARLIEST (this may take time for large orgs)\n```\n\n**Scenario 2: Service Restart**\n```bash\nINFO - Found latest replay_id: 00000000000408760000\nINFO - Resuming from replay_id: 00000000000408760000\n```\n\n**Scenario 3: Real-time Only Mode (`backfill_historical=False`)**\n```bash\nINFO - Table created - starting from LATEST\nINFO - Starting fresh subscription from LATEST\n```\n\n**Scenario 4: Successful Auto-Creation \u0026 Backfill**\n```bash\nINFO - Table catalog.schema.account_events doesn't exist - creating and configuring for historical backfill\nINFO - Creating Databricks table: catalog.schema.account_events\nINFO - Successfully created table: catalog.schema.account_events\nINFO - Starting historical backfill from EARLIEST (this may take time for large orgs)\nINFO - Stream created. Stream ID: 787040db-804a-40b4-a721-941f9220853a\nINFO - Initialized stream to table: catalog.schema.account_events\nINFO - Received Account DELETE 001abc123...\nINFO - Written to Databricks: catalog.schema.account_events - Account DELETE 001abc123...\n```\n\n\n## 🔄 Zero Data Loss Recovery\n\n### How It Works\n\nThe library automatically handles service restarts with zero data loss:\n\n1. **On Startup**: Queries your Delta table for the latest `replay_id` for the specific object\n2. **Resume Subscription**: Continues from the exact last processed event using the `CUSTOM` replay preset\n3. **Fallback Safety**: Falls back to `LATEST` if no previous state is found (fresh start)\n4. **Per-Object Recovery**: Each object type recovers independently\n\n### ❗️ Salesforce events are only retained for 72 hours (3 Days). 
If the service is down for 3 days or more, change events will be missed. Learn more [here](https://developer.salesforce.com/docs/platform/pub-sub-api/guide/event-message-durability.html)\n### Example Recovery Behavior\n\n```bash\n# First time running - no previous events\nINFO - Starting fresh subscription from LATEST\n\n# After processing some events, then restarting\nINFO - Found latest replay_id: 00000000000408760000\nINFO - Resuming from previous session with replay_id: 00000000000408760000\nINFO - Subscription mode: CUSTOM\n```\n\n## 📊 Monitoring \u0026 Health\n\n### Built-in Health Monitoring\n\nThe service includes comprehensive monitoring:\n\n```python\n# Get current statistics\nstats = streamer.get_stats()\nprint(f\"Running: {stats['running']}\")\nprint(f\"Queue size: {stats['queue_size']}\")\nprint(f\"Org ID: {stats['org_id']}\")\nprint(f\"Healthy: {stats['is_healthy']}\")\n```\n\n### Automatic Health Reports\n\nThe service logs health reports every 5 minutes:\n\n```bash\nINFO - Flow Controller Health Report: Acquires: 150, Releases: 150, Timeouts: 0, Healthy: True\nINFO - Queue status: 2 events pending\n```\n\n### Key Metrics Tracked\n\n- **Event throughput**: Events processed per minute\n- **Queue depth**: Number of events waiting for processing  \n- **Semaphore statistics**: Acquire/release counts, timeout rates\n- **Replay lag**: How far behind real-time we are\n- **Error rates**: Failed event processing attempts\n\n## 🎯 Data Schema \u0026 Output\n\n### Delta Table Schema\n\nEvents are stored in Databricks with this schema:\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `event_id` | STRING | Unique Salesforce event identifier |\n| `schema_id` | STRING | Event schema version |\n| `replay_id` | STRING | Event position for replay (used for recovery) |\n| `timestamp` | BIGINT | Event timestamp (milliseconds since epoch) |\n| `change_type` | STRING | `CREATE`, `UPDATE`, `DELETE`, `UNDELETE` |\n| `entity_name` | STRING | Salesforce 
object name (`Account`, `Contact`, etc.) |\n| `change_origin` | STRING | Source of change (`com/salesforce/api/rest/64.0`, etc.) |\n| `record_ids` | ARRAY\u003cSTRING\u003e | List of affected Salesforce record IDs |\n| `changed_fields` | ARRAY\u003cSTRING\u003e | Names of fields that were modified |\n| `nulled_fields` | ARRAY\u003cSTRING\u003e | Names of fields that were set to null |\n| `diff_fields` | ARRAY\u003cSTRING\u003e | Names of fields with differences (alternative to changed_fields) |\n| `record_data_json` | STRING | Complete record data as JSON string |\n| `payload_binary` | BINARY | **NEW**: Raw Avro binary payload for schema-based parsing |\n| `schema_json` | STRING | **NEW**: Avro schema JSON string for parsing binary payload |\n| `org_id` | STRING | Salesforce organization ID |\n| `processed_timestamp` | BIGINT | When our pipeline processed this event |\n\n### Lakeflow Declarative Pipeline Ingestion\n**Deploy** the DAB with the Lakeflow declarative pipeline to ingest and flatten your Salesforce data.\n**NEW**: Use `payload_binary` and `schema_json` for individual field extraction with automatic schema evolution support:\n\n**Schema Evolution**: When the object schema changes restart the pipeline (not a full refresh) to get the latest schema\n\n```python\nfrom pyspark import pipelines as dp\nfrom pyspark.sql.avro.functions import from_avro\nfrom pyspark.sql.functions import col, desc\n\ndef create_pipeline(salesforce_object):\n    @dp.table(name=f\"salesforce_parsed_{salesforce_object}\")\n    def parse_salesforce_stream():\n        df = dp.readStream(zerobus_table).filter(\n            col(\"entity_name\") == salesforce_object\n        )\n\n        latest_schema = (\n            dp.read(zerobus_table)\n            .filter(\n                (col(\"entity_name\") == salesforce_object)\n                \u0026 (col(\"payload_binary\").isNotNull())\n                \u0026 (col(\"schema_json\").isNotNull())\n            )\n            
.orderBy(desc(\"timestamp\"))\n            .select(\"schema_json\")\n            .first()[0]\n        )\n\n        df = df.select(\n            \"*\",\n            from_avro(\n                col(\"payload_binary\"), latest_schema, {\"mode\": \"PERMISSIVE\"}\n            ).alias(\"parsed_data\"),\n        )\n        return df.select(\"*\", \"parsed_data.*\").drop(\"parsed_data\")\n\n\nzerobus_table = \"\u003cyour_zerobus_table_name\u003e\"\nsalesforce_objects = [\n    sf_object.entity_name\n    for sf_object in dp.read(zerobus_table).select(\"entity_name\").distinct().collect()\n]\nfor salesforce_object in salesforce_objects:\n    create_pipeline(salesforce_object)\n```\n\n### Benefits of Schema-Based Parsing\n\n- ✅ **Automatic Schema Evolution**: Handles new fields added to Salesforce objects\n- ✅ **Type Safety**: Preserves Avro data types vs. JSON string conversion\n- ✅ **Performance**: More efficient than JSON parsing for large datasets\n- ✅ **Field-Level Access**: Direct access to individual Salesforce fields as columns\n\n\n### Regenerating Protocol Buffer Files\n\nIf you need to regenerate the Protocol Buffer files (e.g., after modifying `.proto` files), run:\n\n```bash\n# Install protoc dependencies (quoted so the shell does not treat \u003e as a redirect)\nuv pip install \"grpcio-tools\u003e=1.50.0\"\n\n# Navigate to the proto directory\ncd salesforce_zerobus/pubsub/proto/\n\n# Compile protobuf files\npython -m grpc_tools.protoc \\\n    --proto_path=. \\\n    --python_out=. \\\n    --grpc_python_out=. 
\\\n    *.proto\n\n# This generates:\n# - pubsub_api_pb2.py (protobuf classes)\n# - pubsub_api_pb2_grpc.py (gRPC service stubs)  \n# - salesforce_events_pb2.py (event definitions)\n# - salesforce_events_pb2_grpc.py (event service stubs)\n```\n\n## 🔍 Troubleshooting\n\n### Common Issues \u0026 Solutions\n\n#### ❌ Authentication Error: \"SOAP request failed with status 500\"\n\n**Causes \u0026 Fixes:**\n- **Expired Security Token**: Reset token in Salesforce Setup → My Personal Information → Reset Security Token\n- **Wrong Password Format**: Ensure password is `yourpassword+SECURITYTOKEN` with no spaces\n- **Wrong Instance URL**: Use the exact URL from your browser after logging into Salesforce\n- **API Access Disabled**: Check user profile has \"API Enabled\" permission\n\n#### ❌ Permission Denied: \"INSUFFICIENT_ACCESS_ON_CROSS_REFERENCE_ENTITY\"\n\n**Fixes:**\n- Enable Pub/Sub API access for your user\n- Verify CDC is enabled for the target object\n- Check your user has read access to the object\n\n#### ❌ Table Not Found: \"Table 'catalog.schema.table' doesn't exist\"\n\n**With Auto-Creation Enabled (Default):**\nThe library should automatically create tables. 
If this error persists:\n- Check Databricks permissions for CREATE TABLE in the catalog/schema\n- Verify `auto_create_table=True` (default) in your configuration\n- Ensure SQL endpoint has sufficient permissions\n\n**With Auto-Creation Disabled:**\n- Create the Delta table in Databricks first (see setup instructions)\n- Verify table name format: `catalog.schema.table_name`\n- Check your Databricks permissions for the catalog/schema\n\n**Troubleshooting Auto-Creation:**\n```bash\n# Check if auto-creation is working\nINFO - Table catalog.schema.table doesn't exist - creating and configuring for historical backfill\nINFO - Creating Databricks table: catalog.schema.table\nINFO - Successfully created table: catalog.schema.table\n\n# If you see this instead:\nERROR - Failed to create table: catalog.schema.table - table does not exist after creation attempt\n# Check your Databricks SQL endpoint permissions and catalog access\n```\n\n#### ❌ Databricks Stream Error: \"Failed to open table for write (Error code 1022)\"\n\n**Root Cause:**\nThis error can occur when the table schema is incompatible with the Databricks Zerobus API.\n\n**Fixes:**\n- **Generated Columns**: Avoid `GENERATED ALWAYS AS` columns in your table schema\n- **Complex Partitioning**: Use simple table schemas without complex computed partitions\n- **Row Tracking**: Ensure `delta.enableRowTracking = false` (automatically set by auto-creation)\n- **Schema Compatibility**: Let the library auto-create tables for best compatibility\n\n## Contributing \u0026 Support\n\n### Contributing\n**To contribute:**\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests if applicable  \n5. Submit a pull request\n\n\n## How to get help\n\nDatabricks support doesn't cover this content. For questions or bugs, please open a GitHub issue and the team will help on a best effort basis.\n\n\n## License\n\n\u0026copy; 2025 Databricks, Inc. All rights reserved. 
The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source].  All included or referenced third party libraries are subject to the licenses set forth below.\n\n| library                                | description             | license    | source                                              |\n|----------------------------------------|-------------------------|------------|-----------------------------------------------------|\n| Salesforce Pub/Sub API | gRPC API framework | Creative Commons Zero v1.0 Universal | [GitHub](https://github.com/forcedotcom/pub-sub-api/blob/main/LICENSE) | \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks-solutions%2Fsalesforce-zerobus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks-solutions%2Fsalesforce-zerobus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks-solutions%2Fsalesforce-zerobus/lists"}