{"id":19084387,"url":"https://github.com/pawsanie/pyspark_universal_dq_report","last_synced_at":"2026-04-25T11:32:50.333Z","repository":{"id":158648153,"uuid":"505120892","full_name":"Pawsanie/PySpark_universal_dq_report","owner":"Pawsanie","description":"The script reads the dataset along the path and selects the columns in it received from the argument for the specified dates. Then it saves the report to the specified path of HDFS.","archived":false,"fork":false,"pushed_at":"2022-12-22T12:13:24.000Z","size":26,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-12T10:09:53.284Z","etag":null,"topics":["data-quality","data-quality-checks","data-quality-monitoring","dq","hadoop","hadoop-hdfs","hdfs","pyspark","python","python-3","python-script","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"0bsd","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pawsanie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-19T13:41:32.000Z","updated_at":"2023-01-25T05:40:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"de0181c4-b7a8-4c08-9fd3-7436657e1e44","html_url":"https://github.com/Pawsanie/PySpark_universal_dq_report","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Pawsanie/PySpark_universal_dq_report","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FPySpark_universal_dq_report","tags_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/Pawsanie%2FPySpark_universal_dq_report/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FPySpark_universal_dq_report/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FPySpark_universal_dq_report/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pawsanie","download_url":"https://codeload.github.com/Pawsanie/PySpark_universal_dq_report/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FPySpark_universal_dq_report/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32261110,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-25T09:15:33.318Z","status":"ssl_error","status_checked_at":"2026-04-25T09:15:31.997Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-quality","data-quality-checks","data-quality-monitoring","dq","hadoop","hadoop-hdfs","hdfs","pyspark","python","python-3","python-script","python3"],"created_at":"2024-11-09T02:51:09.225Z","updated_at":"2026-04-25T11:32:50.318Z","avatar_url":"https://github.com/Pawsanie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PySpark universal dq report\n\n## Disclaimer:\n:warning:**Using** some or all of the elements of this code, **You** assume 
**responsibility for any consequences!**\u003cbr/\u003e\n\n:warning:The **licenses** for the technologies on which the code **depends** are subject to **change by their authors**.\n\n## Description of the report:\nThe script reads the dataset at the given path and selects from it \u003cbr/\u003e\nthe columns received from the arguments for the specified dates.\u003cbr/\u003e\nThen it saves the report to the specified path on HDFS.\n\nThis example is an elementary report which, in theory,\u003cbr/\u003e\nshould create a DataFrame with many rows that meet the requirements of 3 filters:\n* A value in the 'identifier' column is in the 'interest_ids' list.\n* A value in the 'response' column contains the text 'Success' or 'Not_full_data'.\n* A value in the 'response' column contains the text 'Failure'.\n    \nAs a result, a '.csv' table with values from columns 'identifier',\u003cbr/\u003e\n'column_1', 'column_2' and 'column_3' will be saved on HDFS.\n* Where identifier contains the id.\n* Where column_1_all contains the count of all results.\n* Where column_2_ok_more_3sec contains the count of trace_with_success when the latency is more than 3 seconds.\n* Where column_3_fail_low_3sec contains the count of trace_with_success when the latency is less than 3 seconds.\n\nFor a practical result, substitute the real column names and filter data into the get_report variable.\n****\n\n## Required:\nThe application code is written in Python and depends on it.\u003cbr\u003e\n**Python** version 3.6 [Python Software Foundation License / (with) Zero-Clause BSD license (after Python version 3.8.6)]:\n* :octocat:[Python GitHub](https://github.com/python)\n* :bookmark_tabs:[Python internet page](https://www.python.org/)\n\n**PySpark** [Apache License 2.0/ (with) separate licenses for specific items]:\n* :octocat:[PySpark GitHub](https://github.com/apache/spark)\n* :bookmark_tabs:[PySpark internet page](https://spark.apache.org/)\n\n## Installing the Required Packages:\n```bash\npip 
install pyspark\n```\n## Launch:\nIf your OS has a bash shell, the ETL pipeline can be started using the bash script:\n```bash\n./start_universal_dq_report.sh\n```\nThe script contains an example of all the arguments needed to run.\u003cbr/\u003e\nTo launch the pipeline through this script, do not forget to make it executable.\n```bash\nchmod +x ./start_universal_dq_report.sh\n```\nThe script can also be run directly with spark-submit.\n```bash\nspark-submit --queue uat --num-executors 5 --executor-cores 16 --executor-memory 15G --driver-memory 4G universal_dq_report.py \\\n-id '1234561,123452,123453' \\\n-n 'Name' \\\n-p '/example_warehouse/example_root/example_catalog/' \\\n-t 'daily' \\\n-df 'YYYY-MM-DD' \\\n-dt 'YYYY-MM-DD' \\\n-pts ''\n```\nThe following spark-submit arguments are optional; set them as needed:\n* --queue - The name of the queue in which the YARN application will run.\n* --num-executors - The number of executors that will carry out the task.\n* --executor-cores - The number of CPU cores for each executor.\n* --executor-memory - The amount of RAM for each executor.\n* --driver-memory - The amount of RAM for the driver process, which manages the rest.\n\nAbout script arguments:\n* -id - One id or a list of ids, as a string separated by ',' without spaces.\n* -n - The dataset's name, as a string.\n* -p - Partition path on HDFS, as a string.\n* -t - Daily or hourly dataset type (daily/hourly).\n* -df - The date you plan to receive the report from (format YYYY-MM-DD).\n* -dt - The date you plan to receive the report to. If not specified, it defaults to today (format YYYY-MM-DD).\n* -pts - The path to save the csv report on HDFS. 
If not specified, it will be the user's home directory.\n***\n\nThank you for showing interest in my work.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpawsanie%2Fpyspark_universal_dq_report","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpawsanie%2Fpyspark_universal_dq_report","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpawsanie%2Fpyspark_universal_dq_report/lists"}