{"id":19083192,"url":"https://github.com/psingh12354/databricks","last_synced_at":"2026-01-30T13:04:56.059Z","repository":{"id":231694963,"uuid":"782459526","full_name":"Psingh12354/databricks","owner":"Psingh12354","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-09T12:01:16.000Z","size":133,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-12T11:54:56.221Z","etag":null,"topics":["certification","certification-exam","certification-prep","databricks","databricks-assocate-data-engineer","notes"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Psingh12354.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-05T10:50:14.000Z","updated_at":"2024-06-18T04:19:23.000Z","dependencies_parsed_at":"2024-05-02T10:39:40.220Z","dependency_job_id":"bc8df0fb-174d-48a1-912e-23178dda9953","html_url":"https://github.com/Psingh12354/databricks","commit_stats":null,"previous_names":["psingh12354/databricks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Psingh12354/databricks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Psingh12354%2Fdatabricks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Psingh12354%2Fdatabricks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Psingh12354%2Fdatabricks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Psingh1235
4%2Fdatabricks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Psingh12354","download_url":"https://codeload.github.com/Psingh12354/databricks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Psingh12354%2Fdatabricks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28913353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-30T12:13:43.263Z","status":"ssl_error","status_checked_at":"2026-01-30T12:13:22.389Z","response_time":66,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["certification","certification-exam","certification-prep","databricks","databricks-assocate-data-engineer","notes"],"created_at":"2024-11-09T02:46:25.721Z","updated_at":"2026-01-30T13:04:56.041Z","avatar_url":"https://github.com/Psingh12354.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Databricks Overview\n\n## Cluster Types\n\n### All-purpose Clusters\n- Designed for collaborative use, such as ad hoc analysis, data exploration, and development.\n- Multiple users can share these clusters.\n- Cost-effective for tasks that benefit from resource sharing.\n- May not guarantee immediate availability due to resource sharing.\n\n### Job Clusters\n- Specifically for running automated jobs.\n- Terminate once the job is completed, reducing resource usage and 
cost.\n- Dedicated to a specific task and optimized for performance.\n- More suitable for workflows with strict Service Level Agreements (SLAs).\n\n| Aspect          | All-purpose Clusters                                         | Job Clusters                                               |\n|-----------------|--------------------------------------------------------------|------------------------------------------------------------|\n| SLA Requirements| Versatile but less predictable for strict SLAs               | Dedicated and optimized for performance                    |\n| Resource Usage  | Cost-effective for exploration and development tasks         | Efficient for time-sensitive tasks, releasing resources promptly |\n| Trade-offs      | Flexibility but potential delay due to resource sharing      | Prioritizes specific jobs but higher resource costs        |\n\n![Cluster Types](https://github.com/Psingh12354/databricks/assets/55645997/cd21c411-5fdc-4399-8c6b-b111b2120964)\n\n## Databricks Lakehouse\n- Combines the advantages of Data Lakes and Data Warehouses.\n- Stores data in Parquet format and transaction logs in JSON format.\n\n## Utilities\n- `dbutils`: Module for interacting with Databricks.\n  - `dbutils.help()`: Lists available modules.\n  - `dbutils.fs.help()`: Describes the file system utilities, e.g., `dbutils.fs.ls('path')`.\n\n## Database Commands\n- `describe database db_name;`: Retrieves the location of a database.\n- `spark.table`: Returns a registered table as a DataFrame through the SparkSession.\n- `DESCRIBE DATABASE` or `DESCRIBE SCHEMA`: Returns metadata of a database.\n\n![Edit History](https://github.com/Psingh12354/databricks/assets/55645997/952d2198-8a0f-4fbf-8148-c022ae6114cf)\n\n## Delta Lake\n- Builds upon standard data formats.\n- VACUUM command: Deletes unused data files older than a specified retention period.\n- Supports ACID transactions and scalable metadata handling.\n- Delta tables store data in Parquet format with transaction logs in JSON format.\n\n## 
Databricks Repos\n- Supports Git operations such as pull, fetch, and branch management.\n- Useful for managing development work and versioning.\n\n## Data Exploration\n- `dbfs:/user/hive/warehouse/db_hr.db`: Default location for a database named `db_hr`.\n- PIVOT: Transforms rows into columns for better readability.\n- `CREATE TABLE USING`: Creates external tables from external data sources (e.g., CSV).\n- `CREATE SCHEMA`: Alias for `CREATE DATABASE`.\n\n## Spark Structured Streaming\n- `trigger(availableNow=True)`: Runs the stream in batch mode and stops after processing available data.\n- Auto Loader: Tracks discovered files using checkpointing for exactly-once ingestion guarantees.\n- Default processing interval: `trigger(processingTime=\"500ms\")`.\n\n| Aspect                | Triggered Execution      | Continuous Execution    | Development         | Production          |\n|-----------------------|--------------------------|-------------------------|---------------------|---------------------|\n| Execution Trigger     | Event-driven             | Continuous data arrival | N/A                 | N/A                 |\n| Processing Characteristics | Batch or near-real-time | Real-time processing    | Iterative development \u0026 testing | Stable, optimized execution |\n| Environment           | N/A                      | N/A                     | Interactive notebooks or development environments | Dedicated clusters optimized for performance |\n| Resource Allocation   | N/A                      | N/A                     | Limited resources for experimentation and testing | Dedicated resources for stable and scalable execution |\n\n## Delta Live Tables (DLT)\n- Enables creating streaming live tables using the `STREAM()` function.\n- Use `COMMENT \"Contains PII\"` to indicate that a new table includes personally identifiable information (PII).\n\n## Data Ingestion\n- COPY INTO: Suitable for ingesting thousands of files.\n- Auto Loader: Preferred for ingesting millions of files or more 
and handles schema evolution.\n\n## Databricks Jobs\n- Orchestrates data processing tasks in a Directed Acyclic Graph (DAG).\n- Allows repairing failed multi-task jobs by rerunning only the unsuccessful tasks and their dependents.\n- `MERGE` command: Writes data into Delta tables while avoiding duplicate records.\n\n## User-Defined Functions (UDF)\n```sql\nCREATE FUNCTION blue() RETURNS STRING COMMENT 'Blue color code' LANGUAGE SQL RETURN '0000FF';\n```\n\n## Querying Data\n- `spark.readStream`: Reads streaming data.\n- ``SELECT * FROM file_format.`path` ``: Queries files directly by file format and path; the path is wrapped in backticks, not single quotes.\n- `CREATE TABLE table_name DEEP CLONE source_table`: Creates a deep clone of a table (copies both data files and metadata).\n- `CREATE TABLE table_name SHALLOW CLONE source_table`: Creates a shallow clone of a table (copies metadata only and references the source data files).\n\n## Databricks SQL\n- `Data Explorer`: Manages data object permissions.\n- View types:\n  - Standard view: `CREATE VIEW view_name AS query`\n  - Temporary view: `CREATE TEMP VIEW view_name AS query`\n  - Global temporary view: `CREATE GLOBAL TEMP VIEW view_name AS query`\n\n## Medallion Architecture\n- **Gold tables**: Refined data suitable for business reporting (e.g., BI dashboards).\n- **Silver tables**: Cleansed and conformed data.\n- **Bronze tables**: Raw data with minimal transformation.\n- **Raw data**: Unprocessed data directly ingested from source systems.\n\n![Medallion Architecture](https://github.com/Psingh12354/databricks/assets/55645997/168f3c24-4555-4c99-98c9-498c5c0aec20)\n\n## Sample Query\n- The query below belongs to the Bronze layer: it uses Auto Loader to read JSON files from cloud storage into an uncleaned orders table.\n\n```python\n(spark.readStream\n        .format(\"cloudFiles\")\n        .option(\"cloudFiles.format\", \"json\")\n        .load(ordersLocation)\n     .writeStream\n        .option(\"checkpointLocation\", checkpointPath)\n        .table(\"uncleanedOrders\")\n)\n```\n\n## Connecting to GitHub\n- Follow the steps outlined in [Databricks 
documentation](https://docs.databricks.com/en/repos/get-access-tokens-from-git-provider.html).\n\n## Parquet File Format\n- Parquet is a columnar storage file format, preferred for its efficiency in data storage and retrieval.\n- [Learn more about Parquet](https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705).\n\n## Time Travel\n- `DESCRIBE HISTORY employees`: Retrieves history and version details of a table.\n- Query nested data (e.g., JSON): `SELECT customer_id, profile:name FROM customer`.\n\n![Time Travel](https://github.com/Psingh12354/databricks/assets/55645997/556df0d5-16f0-4c99-8451-a0ac7f8929a3)\n\n![Garbage Collection](https://github.com/Psingh12354/databricks/assets/55645997/380852bc-a8a8-42d4-8634-a7e86e88216f)\n\n## Querying Table Details\n- `DESCRIBE DETAIL tablename`: Retrieves metadata about the given table, such as its format, location, and size.\n\n## Triggers in Spark Structured Streaming\n- `.trigger(once=True)`: Processes all available data in a single batch, then stops.\n- `.trigger(availableNow=True)`: Processes all available data, possibly across multiple batches, then stops.\n- `.trigger(processingTime='60 minutes')`: Sets a 1-hour processing interval.\n\n## Managing Data Quality with DLT\n- Expectations apply data quality checks on each record passing through a query.\n\n![DLT Data Quality](https://github.com/Psingh12354/databricks/assets/55645997/52d5c201-34d7-460a-b591-f2bd5b82ebd6)\n\n## Reading Data from DLT\n```python\nspark.readStream.table(\"table_name\")\n```\n\n## Auto Loader\n- Incrementally processes new data files as they arrive in cloud storage.\n- Provides a Structured Streaming source called `cloudFiles`.\n\n## Creating Tables in Databricks\n```sql\n-- Creates a Delta table\nCREATE TABLE student (id INT, name STRING, age INT);\n\n-- Use data from another table\nCREATE TABLE student_copy AS SELECT * FROM student;\n\n-- Creates a CSV table from an external directory\nCREATE TABLE student USING CSV LOCATION '/mnt/csv_files';\n\n-- Specify table comment and properties\nCREATE TABLE student (id INT, name STRING, age 
INT)\n    COMMENT 'this is a comment'\n    TBLPROPERTIES ('foo'='bar');\n\n-- Create partitioned table\nCREATE TABLE student (id INT, name STRING, age INT)\n    PARTITIONED BY (age);\n\n-- Create a table with a generated column\nCREATE TABLE rectangles(a INT, b INT, area INT GENERATED ALWAYS AS (a * b));\n```\n\n## UDF Example\n```sql\nCREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)\nRETURNS DOUBLE\nRETURN CASE\n  WHEN unit = \"F\" THEN (temp - 32) * (5/9)\n  ELSE temp\nEND;\n\nSELECT convert_f_to_c(unit, temp) AS c_temp\nFROM tv_temp;\n```\n\n## Set Operators\n- Demonstrating set operators using `number1` and `number2` tables.\n\n```sql\nCREATE TEMPORARY VIEW number1(c) AS VALUES (3), (1), (2), (2), (3), (4);\nCREATE TEMPORARY VIEW number2(c) AS VALUES (5), (1), (1), (2);\n\nSELECT c FROM number1 EXCEPT SELECT c FROM number2;\nSELECT c FROM number1 MINUS SELECT c FROM number2;\nSELECT c FROM number1 EXCEPT ALL SELECT c FROM number2;\nSELECT c FROM number1 MINUS ALL SELECT c FROM number2;\nSELECT c FROM number1 INTERSECT SELECT c FROM number2;\nSELECT c FROM number1 UNION SELECT c FROM number2;\n```\n\n## Array Manipulation Functions\n\n### Filter Function\n```sql\nSELECT filter(array(1, 2, 3, 4), i -\u003e i % 2 == 0);\n```\nThis SQL-like function filters an array to include only elements that satisfy a given condition. For example, the above query returns an array containing only the even numbers from the input array.\n\n### Transform Function\n```sql\nSELECT transform(array(1, 2, 3, 4), i -\u003e i * 2);\n```\nThis function transforms each element of an array according to a specified transformation rule. 
For instance, the above query doubles each element of the input array.\n\n---\n\n## Alert Destinations\n\nSupported alert destinations include:\n- Email\n- Slack\n- Webhook\n- MS Teams\n- PagerDuty\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsingh12354%2Fdatabricks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpsingh12354%2Fdatabricks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpsingh12354%2Fdatabricks/lists"}