{"id":13732790,"url":"https://github.com/ekagra-ranjan/GS-Quantify-17","last_synced_at":"2025-05-08T08:32:14.136Z","repository":{"id":105082295,"uuid":"142410184","full_name":"ekagra-ranjan/GS-Quantify-17","owner":"ekagra-ranjan","description":"GS-Quantify' 17, Goldman Sachs Data Science Competition","archived":false,"fork":false,"pushed_at":"2020-05-21T18:19:09.000Z","size":912,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-30T18:02:17.579Z","etag":null,"topics":["boosting-algorithm","cross-validation","data-science","data-science-challenges","data-science-competition","ensemble","feature-engineering","feature-extraction","feature-selection","goldmann-sachs","gradient-boosting-machine","gs-quantify","gsquantify","linear-regression","sklearn","xgb-classifier","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ekagra-ranjan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-26T08:12:21.000Z","updated_at":"2023-01-06T04:46:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"536fe1d4-038d-4742-bd06-00b2a25475fa","html_url":"https://github.com/ekagra-ranjan/GS-Quantify-17","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekagra-ranjan%2FGS-Quantify-17","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekagra-ranjan%2FGS-Quan
tify-17/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekagra-ranjan%2FGS-Quantify-17/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ekagra-ranjan%2FGS-Quantify-17/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ekagra-ranjan","download_url":"https://codeload.github.com/ekagra-ranjan/GS-Quantify-17/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252798838,"owners_count":21805884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boosting-algorithm","cross-validation","data-science","data-science-challenges","data-science-competition","ensemble","feature-engineering","feature-extraction","feature-selection","goldmann-sachs","gradient-boosting-machine","gs-quantify","gsquantify","linear-regression","sklearn","xgb-classifier","xgboost"],"created_at":"2024-08-03T03:00:33.887Z","updated_at":"2025-05-08T08:32:13.718Z","avatar_url":"https://github.com/ekagra-ranjan.png","language":"Jupyter Notebook","funding_links":[],"categories":["Goldman Sachs"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\nGS-Quantify-17\n\u003c/h1\u003e\n\n\u003ch2 align=\"center\"\u003e\n(Goldman Sachs Flagship Data Science Competition)\n\u003c/h2\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/ekagra-ranjan/GS-Quantify-17/raw/master/Method-Presentation-yoKnockers.pptx\"\u003e\u003cimg src=\"http://img.shields.io/badge/Slides-ppt-orange.svg\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://github.com/ekagra-ranjan/GS-Quantify-17/\"\u003e\u003cimg src=\"http://img.shields.io/badge/IITG Rank (ML)-3-blue.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ekagra-ranjan/GS-Quantify-17/\"\u003e\u003cimg src=\"http://img.shields.io/badge/National Rank (ML)-32-blue.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ekagra-ranjan/GS-Quantify-17/raw/master/Method-Presentation-yoKnockers.pptx\"\u003e\u003cimg src=\"http://img.shields.io/badge/Team Name-Yo Knockers-purple.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n\n\u003cbr\u003e\n\u003cbr\u003e\n\n## ML Problem Statement - Predicting Garbage Collector Invocation\n\n## Data Visualisation\n\nHere `gc` stands for Garbage Collector.\n\n\u003cp  align=\"center\"\u003e\n\u003cimg src=\"./plots/1.png\"\u003e\n  \u003cbr\u003e\n  \u003cb\u003e initial-Used-Memory (y-axis) vs gc-Initial-Memory (x-axis)\u003c/b\u003e\n\u003c/p\u003e\n\n\nThe plot shows us that there is a linear relationship between the 2 variables.\n\n\n\n\u003cp  align=\"center\"\u003e\n\u003cimg src=\"./plots/2.png\"\u003e\n  \u003cbr\u003e\n  \u003cb\u003e Final-Used-Memory vs gc-Final-Memory\u003c/b\u003e\n\u003c/p\u003e\n\n\nThe plot shows us that there is a linear relationship between the 2 variables.\n\n\n\n\n\u003cp  align=\"center\"\u003e\n\u003cimg src=\"./plots/3.png\"\u003e\n  \u003cbr\u003e\n  \u003cb\u003e initial-Used-Memory + initial-Free-Memory  vs gc-Total-Memory\u003c/b\u003e\n\u003c/p\u003e\n\n\nThe plot shows us that there is a linear relationship between the 2 variables.\n\n\n\u003cp  align=\"center\"\u003e\n\u003cimg src=\"./plots/4.png\"\u003e\n  \u003cbr\u003e\n  \u003cb\u003e initial-Used-Memory + initial-Free-Memory  vs final-Used-Memory + final-Free-Memory \u003c/b\u003e\n\u003c/p\u003e\n\n\nThe plot shows us that there is a linear relationship between the 2 variables. 
We observe 3 outliers in this plot, which we remove before proceeding.\n\n\n\n\u003cp  align=\"center\"\u003e\n\u003cimg src=\"./plots/5.png\"\u003e\n  \u003cbr\u003e\n  \u003cb\u003e initial-Used-Memory + initial-Free-Memory  vs final-Used-Memory + final-Free-Memory \u003c/b\u003e\n\u003c/p\u003e\n\n\nThe plot shows us that there is a linear relationship between the 2 variables.\n\n\n## Approximations Used\nThe following approximations were made:\n* gcInitialMemory = initialUsedMemory\n* GcFinalMemory = finalUsedMemory\n* GcTotalMemory = finalUsedMemory+finalFreeMemory = initialUsedMemory + initialFreeMemory\n\nWe were required to output the free memory after each query is served, but the heading of that column was given as initialFreeMemory, which we interpret as finalFreeMemory.\n\n## Models Used\n\n* **Linear Regression**\nFollowing the plots and approximations, we predicted:\n  * gcInitialMemory using linear regression with initialUsedMemory\n \n  * finalUsedMemory using linear regression with resources+initialUsedMemory\n \n  * gcTotalMemory using linear regression with initialUsedMemory+initialFreeMemory\n \n  * FinalFreeMemory using linear regression with initialFreeMemory+initialUsedMemory-finalUsedMemory\n \n \n\n\n* **XGBOOST**\n\n  XGBoost was used to predict **gcRun**. We gave it the following features: **resources**, **initialMemoryUsed**, **initialMemoryFree**, **cpuTimeTaken**.\n\n  We chose this model because the output was not linearly related to the features. We confirmed this by creating a cross-validation set and checking the accuracy of linear models such as logistic regression and linear SVM (both hard-margin and soft-margin). The results were very poor. We also tried an SVM with the ‘rbf’ kernel, which wasn’t much of an improvement over the linear models. \n\n  So we applied XGBoost, which performed best among the models we tried because of the nonlinear relationship between the target and the features. 
Being an ensemble method, XGBoost has the added advantage of being less prone to overfitting while preserving accuracy.  \n\n## Strategy for deciding the results\nTo predict **gcRun**:\n\nWe used XGBoost to predict ‘gcRun’. We supplied the **resources** feature to XGBoost using the saved value of resources that we obtained from the training set. E.g., token_53 had ‘resources’ of 0.047545312750000325, which was obtained from the training set.\n\nTo predict **initialFreeMemory**:\n\n* We computed initialFreeMemory as the previous query’s finalFreeMemory\n* We computed initialUsedMemory as the previous query’s finalUsedMemory\n* We computed gcInitialMemory as initialFreeMemory of the same query\n* We computed gcTotalMemory as initialFreeMemory+initialUsedMemory of the same query\n* We computed finalUsedMemory as resources+initialUsedMemory of the same query\n* We computed finalFreeMemory as initialFreeMemory+initialUsedMemory-finalUsedMemory of the same query\n\nThis finalFreeMemory is then reported as the output for that query under the initialFreeMemory heading.\n\n\n\u003cbr\u003e\n\u003cbr\u003e\n\n## GitHub repos of similar Data Science Competitions:\n\n* [Analyze-This-18](https://github.com/ekagra-ranjan/Analyze-This-18)\n* [Analyze-This-17](https://github.com/ekagra-ranjan/Analyze-This-17)\n* [Inter-IIT-Techmeet-17](https://github.com/ekagra-ranjan/Optimal-Bidding/)\n* [awesome-undergrad-hackathons](https://github.com/ekagra-ranjan/awesome-undergrad-hackathons)\n\n\u003cp align=\"center\"\u003e\n\tPlease star the repo if you found the materials in the repo useful :)\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekagra-ranjan%2FGS-Quantify-17","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fekagra-ranjan%2FGS-Quantify-17","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fekagra-ranjan%2FGS-Quantify-17/lists"}