{"id":41340002,"url":"https://github.com/mananabbasi/data-science-complete-project-using-big-data-tools-techniques-","last_synced_at":"2026-01-23T06:46:47.314Z","repository":{"id":239677523,"uuid":"800246513","full_name":"mananabbasi/Data-Science-Complete-Project-using-Big-Data-Tools-Techniques-","owner":"mananabbasi","description":"This repository contains Databricks projects utilizing RDDs, DataFrames, and SQL to process and analyze various real-world datasets. Data cleaning and analysis have been performed using PySpark functions to handle challenges such as inconsistent formats, missing values, and complex data structures. The project ensures efficient data transformation ","archived":false,"fork":false,"pushed_at":"2024-05-14T01:29:51.000Z","size":3893,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-17T11:11:07.601Z","etag":null,"topics":["azure","databricks","databricks-industry-solutions","databricks-notebooks","dataframe","pyspark-mllib","pyspark-notebook","pyspark-python","python-script","rdd"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mananabbasi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-14T01:17:00.000Z","updated_at":"2025-02-01T23:37:03.000Z","dependencies_parsed_at":"2024-05-14T01:49:17.700Z","dependency_job_id":null,"html_url":"https://github.com/mananabbasi/Data-Science-Complete-Project-using-Big-Data-Tools-Techniques-","commit_stats":null,"previous_names":["mananab
basi/big-data-tools-techniques-complete-project-01","mananabbasi/big-data-tools-techniques-project-01-"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mananabbasi/Data-Science-Complete-Project-using-Big-Data-Tools-Techniques-","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mananabbasi%2FData-Science-Complete-Project-using-Big-Data-Tools-Techniques-","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mananabbasi%2FData-Science-Complete-Project-using-Big-Data-Tools-Techniques-/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mananabbasi%2FData-Science-Complete-Project-using-Big-Data-Tools-Techniques-/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mananabbasi%2FData-Science-Complete-Project-using-Big-Data-Tools-Techniques-/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mananabbasi","download_url":"https://codeload.github.com/mananabbasi/Data-Science-Complete-Project-using-Big-Data-Tools-Techniques-/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mananabbasi%2FData-Science-Complete-Project-using-Big-Data-Tools-Techniques-/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28682261,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-23T05:48:07.525Z","status":"ssl_error","status_checked_at":"2026-01-23T05:48:07.129Z","response_time":59,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","databricks","databricks-industry-solutions","databricks-notebooks","dataframe","pyspark-mllib","pyspark-notebook","pyspark-python","python-script","rdd"],"created_at":"2026-01-23T06:46:45.928Z","updated_at":"2026-01-23T06:46:47.307Z","avatar_url":"https://github.com/mananabbasi.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Big-Data-Tools-Techniques-Project-01-\nEvery row in the dataset corresponds to an individual clinical trial and is identified by different variables. It's important to note that the first column contains a mixture of various variables separated by a delimiter, and the date columns exhibit various formats. Please consider these issues and ensure that the dataset is appropriately .\n(Source: ClinicalTrials.gov)\n2. pharma.csv:\nThe file contains a small subset of a publicly available list of pharmaceutical \nviolations. For the purposes of this work, we are interested in the second column, \nParent Company, which contains the name of the pharmaceutical company in \nquestion. \n(Source: https://violationtracker.goodjobsfirst.org/industry/pharmaceuticals)\n\nYou are a data scientist / AI engineer whose client wishes to gain further insight into \nclinical trials. You are tasked with answering these questions, using visualisations where \nthese would support your conclusions.\nYou should address the following questions. \n1. The number of studies in the dataset. 
You must ensure that you explicitly check \ndistinct studies.\n2. You should list all the types (as contained in the Type column) of studies in the \ndataset along with the frequencies of each type. These should be ordered from \nmost frequent to least frequent.\n3. The top 5 conditions (from Conditions) with their frequencies.\n4. Find the 10 most common sponsors that are not pharmaceutical companies, along \nwith the number of clinical trials they have sponsored. Hint: For a basic \nimplementation, you can assume that the Parent Company column contains all \npossible pharmaceutical companies.\n5. Plot the number of completed studies for each month in 2023. You need to include your \nvisualization as well as a table of all the values you have plotted for each month.\nYou are to implement all 5 tasks 3 times: once in Spark SQL and twice in PySpark (once\nwith RDDs and once with DataFrames).\nFor the visualisation of the results, you are free to use any tool that fulfils the requirements, \nwhich can be tools such as Python’s matplotlib, Excel, Power BI, Tableau, or any other free \nopen-source tool you may find suitable. Using built-in visualizations directly is permitted, \nbut it will not yield a high number of marks. Your report needs to state the software \nused to generate the visualization; otherwise a built-in visualization will be assumed.\n\nExtra Features\n➢ Unzipping the data inside the Databricks system (You can unzip the file on your \ncomputer before uploading it to Databricks. However, to earn extra marks, you \nshould be able to successfully unzip it within the Databricks environment. \nAdditionally, your code should be reusable for us, meaning it needs to include \nproper cleanup procedures to remove any unnecessary files and folders from the \nfilesystem. 
This ensures our ability to run your code without errors.).\n➢ Maximum 3 further analyses of the data, motivated by the questions asked (new \nproblem statements other than the above 5 problems)\n➢ Writing general and reusable code, for example for different versions of the data. We have \nprovided the clinicaltrial_2020 and clinicaltrial_2021 datasets only for this purpose, \nshould you wish to use them (the main dataset is clinicaltrial_2023; the 2020 and 2021\nversions are for extra marks only and are not compulsory).\n➢ Using more advanced methods to solve the problems, such as defining and using\nuser-defined functions.\n➢ Successfully implementing Spark functions that you have not used in the workshop.\n➢ Creation of additional visualizations presenting useful information based on your \nown exploration which is not covered by the problem statements\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmananabbasi%2Fdata-science-complete-project-using-big-data-tools-techniques-","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmananabbasi%2Fdata-science-complete-project-using-big-data-tools-techniques-","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmananabbasi%2Fdata-science-complete-project-using-big-data-tools-techniques-/lists"}