{"id":23189409,"url":"https://github.com/mrintern/my_cheatsheets","last_synced_at":"2025-04-05T06:12:39.836Z","repository":{"id":197984689,"uuid":"699810822","full_name":"mrintern/my_cheatsheets","owner":"mrintern","description":null,"archived":false,"fork":false,"pushed_at":"2023-10-06T14:36:33.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-10T13:43:50.333Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrintern.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-03T11:45:28.000Z","updated_at":"2023-10-03T11:45:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"268e90ad-f6fb-4f11-91de-301f0201e6d5","html_url":"https://github.com/mrintern/my_cheatsheets","commit_stats":null,"previous_names":["mrintern/my_cheatsheets"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrintern%2Fmy_cheatsheets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrintern%2Fmy_cheatsheets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrintern%2Fmy_cheatsheets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrintern%2Fmy_cheatsheets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrintern","download_url":"https://codeload.github.com/mrintern/my_cheatsheets/tar.gz
/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247294550,"owners_count":20915340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-18T11:19:08.966Z","updated_at":"2025-04-05T06:12:39.815Z","avatar_url":"https://github.com/mrintern.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# my_cheatsheets\n### examples\n## sql\nCASE statement\n```\nALTER TABLE your_table ADD COLUMN triangle_type VARCHAR(50);\n\nUPDATE your_table\nSET triangle_type = CASE\n    WHEN side1 = side2 AND side2 = side3 THEN 'Equilateral'\n    WHEN side1 = side2 OR side2 = side3 OR side1 = side3 THEN 'Isosceles'\n    ELSE 'Scalene'\nEND;\n```\ninner join\n```\nSELECT employees.name, departments.department_name\nFROM employees\nINNER JOIN departments ON employees.department_id = departments.id;\n```\nleft join\n```\nSELECT customers.name, orders.order_date\nFROM customers\nLEFT JOIN orders ON customers.id = orders.customer_id;\n```\n\nwindow function example\n```\n%sql\nSELECT trip_distance, fare_amount, dropoff_zip, COUNT(dropoff_zip) OVER(PARTITION BY dropoff_zip) AS dropoff_zip_count FROM samples.nyctaxi.trips ORDER BY dropoff_zip ASC;\n```\nscd type 2\nCreate a temporary view for the existing table and new table:\n```\ndf.createOrReplaceTempView(\"existing_table\")\nnew_data.createOrReplaceTempView(\"new_data\")\n\n```\nUpdate the date_of_birth for the records that will be expired:\n```\nspark.sql(\"\"\"\n    UPDATE existing_table \n    SET date_of_birth = (SELECT date_of_birth FROM 
new_data WHERE new_data.message_id = existing_table.message_id)\n    WHERE message_id IN (SELECT message_id FROM new_data)\n\"\"\")\n```\nInsert the new data into the table and set date_of_birth as NULL to mark them as the latest data changes:\n```\nspark.sql(\"\"\"\n    INSERT INTO existing_table (city, date_of_birth, email, message_id, name)\n    SELECT city, NULL, email, message_id, name FROM new_data\n\"\"\")\n```\n## pyspark\naccessing nested fields (json data)\n+ \nexplode\n```\nfrom pyspark.sql.functions import explode\n\ndf = spark.read.json(\"/databricks-datasets/COVID/CORD-19/2020-06-04/document_parses/pdf_json/d33a044bbb52673f1eeeb792b0376b0987fe02f6.json\")\n\n# access nested field\ndf2 = df.select(\"abstract.text\")\n# explode text column\ndf2 = df2.withColumn(\"text\", explode(\"text\"))\ndf2.show()\n```\n\nexplode\n```\nfrom pyspark.sql.functions import explode\n\n# show() returns None, so keep the DataFrame assignment separate from the display call\ndf = df.withColumn(\"Number\", explode(\"Numbers\")).drop(\"Numbers\")\ndf.show()\n```\n\nsql query on dataframe\n```\nfrom pyspark.sql import SparkSession\ndf.createOrReplaceTempView(\"temp_table\")\ntemp_df = spark.sql(\"SELECT * FROM temp_table\")\ntemp_df.show()\n```\n\nleft join\n```\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder.getOrCreate()\nquery = \"SELECT * FROM samples.tpch.orders LEFT JOIN samples.tpch.customer ON samples.tpch.orders.o_custkey = samples.tpch.customer.c_custkey\"\ndf = spark.sql(query)\ndf.show()\n```\n\nbroadcast join\n```\nfrom pyspark.sql.functions import broadcast\nfrom pyspark.sql import SparkSession\n\ndf = spark.sql(\"SELECT * FROM samples.tpch.orders\")\ndf2 = spark.sql(\"SELECT * FROM samples.tpch.customer\")\n# left join customer on orders\nresult = df.join(broadcast(df2),df.o_custkey == df2.c_custkey,'left')\nresult.show()\n```\n\nautoloader to delta lake example\n```\n# import functions\nfrom pyspark.sql.functions import col, current_timestamp\n\n# initialize variables\nfile_path = \"/databricks-datasets/structured-streaming/events\"\nusername = spark.sql(\"SELECT 
regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')\").first()[0]\ntable_name = f\"{username}_etl_quickstart\"\ncheckpoint_path = f\"/tmp/{username}/_checkpoint/etl_quickstart\"\n\n# Clear out data from previous demo execution\nspark.sql(f\"DROP TABLE IF EXISTS {table_name}\")\ndbutils.fs.rm(checkpoint_path, True)\n\n# Configure Auto Loader to ingest JSON data to a Delta table\n(spark.readStream\n  .format(\"cloudFiles\")\n  .option(\"cloudFiles.format\", \"json\")\n  .option(\"cloudFiles.schemaLocation\", checkpoint_path)\n  .load(file_path)\n  .select(\"*\", col(\"_metadata.file_path\").alias(\"source_file\"), current_timestamp().alias(\"processing_time\"))\n  .writeStream\n  .option(\"checkpointLocation\", checkpoint_path)\n  .trigger(availableNow=True)\n  .toTable(table_name))\n\ndf = spark.read.table(table_name)\ndisplay(df)\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrintern%2Fmy_cheatsheets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrintern%2Fmy_cheatsheets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrintern%2Fmy_cheatsheets/lists"}