{"id":14062900,"url":"https://github.com/suryateja0153/Database-Design-And-Performance-Tuning","last_synced_at":"2025-07-29T14:32:03.177Z","repository":{"id":218501623,"uuid":"366261696","full_name":"suryateja0153/Database-Design-And-Performance-Tuning","owner":"suryateja0153","description":"Designing a database, querying, analytics and reporting.","archived":false,"fork":false,"pushed_at":"2022-01-29T00:09:39.000Z","size":6080,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-04T00:32:30.695Z","etag":null,"topics":["mssqlserver","oracle-database","queries","rstudio","sqlserver"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/suryateja0153.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-05-11T05:07:26.000Z","updated_at":"2022-01-29T00:09:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"92415697-3cbf-44b2-82b3-fa7d24ddca50","html_url":"https://github.com/suryateja0153/Database-Design-And-Performance-Tuning","commit_stats":null,"previous_names":["suryateja0153/database-design-and-performance-tuning"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/suryateja0153/Database-Design-And-Performance-Tuning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryateja0153%2FDatabase-Design-And-Performance-Tuning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryateja0153%2FDatabase-Design-And-Performance-Tuning/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/suryateja0153%2FDatabase-Design-And-Performance-Tuning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryateja0153%2FDatabase-Design-And-Performance-Tuning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/suryateja0153","download_url":"https://codeload.github.com/suryateja0153/Database-Design-And-Performance-Tuning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/suryateja0153%2FDatabase-Design-And-Performance-Tuning/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267703041,"owners_count":24130463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mssqlserver","oracle-database","queries","rstudio","sqlserver"],"created_at":"2024-08-13T07:02:52.057Z","updated_at":"2025-07-29T14:32:00.683Z","avatar_url":"https://github.com/suryateja0153.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# Database Design And Performance Tuning\n\n## Executive Summary\nThe objective of this project is to design and implement the Entity Relation Diagram for the IMDb\ndata set for movies and T.V. shows. This design can be used to explore and view information\nabout movies and T.V. shows. 
The motivation for this project came from working on a movies\ndatabase for class assignments. Anyone who wants details about a movie or show currently has\nto look in several places to get complete information. Data for this project was fetched from the IMDb\nwebsite and organized as a one-stop destination where users can access the following information:\n1. Users can access a list of movies and shows that are coming soon.\n2. Movies and shows can be found by region and by language.\n3. The display type of a movie/show can also be checked, for example whether the display is\nthe original, just the trailer, a 3-D version, etc.\n4. Movies/shows are listed by label based on their runtime in minutes: video, short, movie, tv series,\ntv mini series.\n5. Users can get the total number of seasons and episodes for tv shows.\n6. Users can search for the movies and tv shows of their favorite actors/actresses, directors,\nproducers, writers and other roles.\n7. Users can look at the average rating given to the movie/show of their choice.\n\n## ERD Design\nThe SQL tables hold all of the data in the IMDb database. Before creating the tables, an Entity Relationship\nDiagram (ERD) was designed to show the relationships between the entities and their attributes. The records\nare stored in a relational database, and the data dictionary is one of its crucial components. The data\ndictionary describes the contents, format and structure of the database. It consists of\nthe column name, description, data type, size, identity, and table constraints such as keys,\nunique, index, nulls etc. The relationships between the columns of the database can be used to access\nthe information and manipulate the database.\u003cbr\u003e\n\n![](ERD_Diagram.jpg)\n\n## Scripting\nThe data set was in tab-separated format, so the pandas library was used to read it. The original data set\nwas huge, and each table consisted of 7 million rows. 
The number of rows in TITLE BASICS and\nNAME BASICS was limited to 15000, and only data after 2019 was inserted into the tables; all the other\ntables were built on that subset. TITLE EPISODES contains the total number of episodes and the total number\nof seasons. One table had roles and the other had names with no roles, so we stored relationships both\nwith and without roles in the TITLE PRINCIPALS table. We exported all tables to csv for syncing\npurposes.\n\n```\nimport cx_Oracle\n\n# Connect to the Oracle database through the Instant Client\ncx_Oracle.init_oracle_client(lib_dir=\"instantclient_19_8\")\ndsn_tns = cx_Oracle.makedsn('reade.forest.usf.edu', '1521','cdb9')\nconn = cx_Oracle.connect(user='DB372', password='\u003cpassword\u003e', dsn=dsn_tns)\ncursor = conn.cursor()\n```\n\n```\nimport pandas as pd\nimport sys\n\nsql='insert into TITLE_BASICS values(:1,:2,:3,:4,:5,:6,:7)'\nn=0  # rows read so far\nm=0  # rows inserted so far\ndft = pd.read_csv('title.episode.tsv', sep='\\t')\n\n# Read the large TSV in chunks to keep memory use bounded\nfor df in pd.read_csv('title.basics.tsv', sep='\\t', chunksize=5000):\n\tdf = df.replace('\\\\N', '')\n\tprint(n)\n\tfor index, row in df.iterrows():\n\t\t# Skip rows with no start year, rows before 2015, and titles listed in title.episode.tsv\n\t\tif (row['startYear'] == '' or int(row['startYear']) \u003c 2015 or\n\t\t\t\tlen(dft.loc[dft['tconst']==row['tconst']]) \u003e 0):\n\t\t\tcontinue\n\t\tli = [row['tconst'], row['originalTitle'], row['isAdult'],\n\t\t\trow['startYear'], row['endYear'],\n\t\t\trow['runtimeMinutes'], row['titleType']]\n\t\ttry:\n\t\t\tcursor.execute(sql,li)\n\t\t\tm += 1\n\t\t\tprint(\"inserted: \", m)\n\t\t\t# Stop once 10000 rows have been inserted\n\t\t\tif m \u003e= 10000:\n\t\t\t\tconn.commit()\n\t\t\t\tconn.close()\n\t\t\t\tsys.exit(1)\n\t\texcept Exception as e:\n\t\t\tprint(row)\n\t\t\tprint(str(e))\n\tn+=5000\n\tconn.commit()\n\nconn.close()\n```\n\n## Data Exploration and Query Writing\nIn this section, the database is investigated to understand its structure. 
To do this, we write some queries and check their output to verify that the initial setup was successful.\n\n**Table 1 : title basics**\u003cbr\u003e\nAccording to the ERD shown in Fig 1, the title basics table is the center to which the other tables connect.\nInitially, this table did not have a primary key, as shown in Fig 2; Fig 3 shows the\nprimary key, tconst, after the table was altered.\n\n![](Media/Image1.jpg)\n\n```\nALTER TABLE title_basics\nADD CONSTRAINT title_basics_pk\nPRIMARY KEY (tconst);\n```\n\n![](Media/Image2.jpg)\n\n**Table 2 : title episode**\u003cbr\u003e\nThe same pattern is applied to this table. However, this table contains duplicated data, shown\nin Fig 4, which prevents the primary key from being created. After the duplicates are deleted, Fig 5 shows the primary key, which\nis ep id. Surprisingly, parent tconst is a subset of tconst in title basics.\n\n```\nSELECT *\nFROM title_episode\nWHERE ep_id = 'epid000000000';\n```\n\n![](Media/Image3.jpg)\n\n```\nDELETE FROM title_episode\nWHERE rowid NOT IN (\nSELECT MAX(rowid)\nFROM title_episode\nGROUP BY ep_id);\n```\n\n![](Media/Image4.jpg)\n\n**Table 3 : show attributes**\u003cbr\u003e\nIn contrast to the title episode table, this table has no such problem. Fig 6 shows the primary key of this\ntable.\n\n```\nALTER TABLE show_attributes\nADD CONSTRAINT show_attributes_pk\nPRIMARY KEY (att_id);\n```\n\n![](Media/Image5.jpg)\n\n**Interesting Queries**\u003cbr\u003e\nThis section shows a code snippet from the full report about writing some interesting queries.\n\n1. 
Display the movie name, the people involved along with their roles, the year of the movie,\nand the age restriction.\n\n```\nSELECT\n\ttb.title,\n\tnb.first_name,\n\tnb.last_name,\n\ttp.role,\n\ttb.start_year,\n\ttb.is_adult\nFROM\n\ttitle_basics tb\n\tINNER JOIN title_principals tp\n\tUSING (tconst)\n\tINNER JOIN name_basics nb\n\tUSING (nconst)\nORDER BY tb.start_year DESC, nb.first_name;\n```\n\n![](Media/Image6.jpg)\n\n## Performance Tuning\nThis section explores various performance tuning methods that can increase query speed and efficiency.\nThe main purpose of database tuning is to increase query speed. Database tuning is a\nbroader term that includes optimization, database management system applications and database\nenvironment configuration, which covers the OS optimizer, CPU single- and multi-threading, memory,\netc. In this project, we look specifically at database performance tuning techniques like\nindexing, partitioning, parallel execution etc.\n\n**Indexing**\u003cbr\u003e\nLet’s start with indexing. Indexes are widely used to search through the data quickly without\nhaving to scan every single row in a table when a database is accessed. Indexes can be created\non a single column or on multiple columns of a database table, which provides both rapid random lookups\nand efficient access to ordered records.\n\n**Function-Based Indexing**\u003cbr\u003e\nWhen a query with a regular index still performs a full scan, it can be further optimized by using Function-Based\nIndexing. With Function-Based Indexing, the query uses a range scan instead of a full\nscan, which makes the query faster in cases where the optimizer does complex querying.\n\n**Parallel Execution**\u003cbr\u003e\nIn this section, the CUSTOMER table is used to show how parallelism works. We begin\nby creating a table and executing a simple query. 
We then enable parallelism on the\nsame query and execute it again.\n\n**Transitions in More Complex Parallel Queries**\u003cbr\u003e\nIn this section, we use the MOVIES table to show various transitions like parallel-to-parallel and\nparallel-to-serial using complex queries. In this example we see the best transition, which is\nparallel-to-parallel, showing that there were no bottlenecks in this processing.\n\n**Partitioned Tables**\u003cbr\u003e\nIn this section, we show how partitioning works. We write some interesting queries to test\nhow the range partition works and to see the partitioned result.\n\n## Data Visualization with R\nBecause of the difficulty of syncing the database to RStudio, we decided to export the query result\nshown below to an Excel file. We then ran a linear regression test in RStudio on that\ndata.\n\n```\nSELECT\n\ttb.title,\n\ttr.avarage_raing,\n\ttr.num_votes\nFROM\n\ttitle_basics tb\n\tINNER JOIN title_ratings tr\n\tUSING (tconst)\nWHERE tr.num_votes \u003e 100\nORDER BY tr.avarage_raing DESC;\n```\n\n![](Media/Image7.jpg)\n\nThe data show the IMDb score, the number of voters, the age restriction of the movies and the number of locations\nto which the movie goes abroad. After that, we test the regression models by comparing\nthe hypothesis tests of 6 different models. Denote Y = IMDb ranking, X1 = number of voters.\n\n![](Media/Image8.jpg)\n\n\u003e Y = B0 + B1X1\n\n![](Media/Image9.jpg)\n\nJudging by the P-value and R-squared, both values are surprisingly poor. We cannot assume\nor identify any relationship from them. Therefore, we decided to try nonlinear models: the ln(x)\nmodel and the inverse x model.\n\n\u003e ln(x) model\n\n![](Media/Image10.jpg)\n\n\u003e Inverse X model\n\n![](Media/Image11.jpg)\n\n**Analysis Results**\u003cbr\u003e\nAccording to the results, both of these functions still give poor values. This means the relationship between\nX1 and Y is very hard to determine. 
The relationship could be much more complex, or the two variables may not be\nrelated. However, we could infer that movie ratings might not be explained by a numeric\nfunction because human emotion is too biased.\n\n## Conclusion\nThis was one of the more interesting projects that I did. It helped me understand the entire database design and management pipeline, from the ERD, scripting and querying to more advanced aspects of the design like performance tuning. We also got a chance to do some data analytics with R, where we built ML models and drew new insights.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuryateja0153%2FDatabase-Design-And-Performance-Tuning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsuryateja0153%2FDatabase-Design-And-Performance-Tuning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsuryateja0153%2FDatabase-Design-And-Performance-Tuning/lists"}