{"id":18535343,"url":"https://github.com/sanjinkurelic/casebasedreasoning","last_synced_at":"2025-06-20T15:13:25.087Z","repository":{"id":200595673,"uuid":"138355988","full_name":"SanjinKurelic/CaseBasedReasoning","owner":"SanjinKurelic","description":"Find missing values in data set using Euclid distance, normalization and calculating information value, weight of evidence","archived":false,"fork":false,"pushed_at":"2018-10-15T18:58:02.000Z","size":47,"stargazers_count":25,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-24T08:47:38.122Z","etag":null,"topics":["case-based-reasoning","csv","data-science","influence","information-value","machine-learning","numpy","pandas","python3","weight-of-evidence"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SanjinKurelic.png","metadata":{"files":{"readme":"ReadMe.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-06-22T23:18:49.000Z","updated_at":"2024-09-20T11:49:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"cf911a7f-7cd7-4711-9f46-b8ea11089c53","html_url":"https://github.com/SanjinKurelic/CaseBasedReasoning","commit_stats":null,"previous_names":["sanjinkurelic/casebasedreasoning"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SanjinKurelic%2FCaseBasedReasoning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SanjinKurelic%2FCaseBasedReasoning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SanjinKurelic%2FCa
seBasedReasoning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SanjinKurelic%2FCaseBasedReasoning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SanjinKurelic","download_url":"https://codeload.github.com/SanjinKurelic/CaseBasedReasoning/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248058086,"owners_count":21040693,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["case-based-reasoning","csv","data-science","influence","information-value","machine-learning","numpy","pandas","python3","weight-of-evidence"],"created_at":"2024-11-06T19:22:33.299Z","updated_at":"2025-04-09T15:32:07.107Z","avatar_url":"https://github.com/SanjinKurelic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Missing Values\n\nCase-based reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems \u003csup\u003e[1]\u003c/sup\u003e. In data science, CBR is a technique that allows us to find missing values in a given set of data, where each variable has its own set of characteristics. If a variable is missing one or more characteristics, we can find a similar variable using Euclidean distance and use it to predict the missing characteristics.\n\nIn this repository there are 2 Python scripts: one for finding missing values and the other for calculating which characteristics (columns) influence/define the missing value the most. 
There is also a pre-generated file containing data (including missing values) and statistics collected from users of a telecommunication company. Using that data we can train a model and check the correctness of the results.\n\n## Getting started\n\n### Requirements\n\nTo run both scripts you should have **Python version 3.x** and the following modules installed:\n\n- pandas\n- numpy\n\nAll modules are available through pip:\n\n```bash\npip install pandas\n```\n\nThis command also installs the numpy module, since pandas depends on it.\n\n### Running\n\nThis git repository consists of the following files:\n\n- *FindMissingValues.py*\n- *CalculateIV.py*\n- *telecom.csv*\n- *telecomStats.txt*\n\nThe file *telecom.csv* contains all data from the telecommunication company in *csv* format with the following columns:\n\n- **customerId** : int\n- **customerAge** : int\n- **customerPlansChanged** : int\n- **smsCountPerMonth** : int\n- **callMinutePerMonth** : int\n- **dataMBPerMonth** : int\n- **netflixStream** : boolean\n- **pickboxStream** : boolean\n- **youtubeStream** : boolean\n- **hboGoStream** : boolean\n- **viberFree** : boolean\n- **whatsappFree** : boolean\n\nThere is a total of 4 000 rows and 12 columns. Out of the 4 000 rows, 6 have missing values: 3 rows with **customerAge** missing and 3 rows with **customerPlansChanged** missing.\n\nIf we run the script *FindMissingValues.py* with the following command:\n\n```bash\npython3 FindMissingValues.py\n```\n\nwe get the following results:\n\ncustomerId | predictedValue | realValue\n:---: | :---: | :---:\n3998 | 30 | 33\n3999 | \u003cspan style=\"color:red\"\u003e**70**\u003c/span\u003e | 48\n4000 | 25 | 28\n3995 | 0 | 0\n3996 | 3 | 3\n3997 | 4 | 5\n\nAs we can see, the prediction model gives values close to the real ones (within 3 of the real value in five of the six rows), except for people older than 40. Those people usually do not use any kind of data/SMS/call plans, and for that reason they are not our target population for presenting new tariffs. 
It should also be mentioned that the data used for prediction is quite random, so edge cases (e.g. people older than 40) are not well covered.\n\nIf we run *CalculateIV.py* we will get information about which columns influence the missing value (in this case customer age is the column that contains missing values):\n\n```bash\n(14, 18):\n#Positive influence:\ncustomerPlansChanged = (2.0, 3.0)\n#Opposite influence:\ncustomerPlansChanged = (3.0, 4.0)\nnetflixStream = True\npickboxStream = True\nyoutubeStream = True\nhboGoStream = True\n\n(18, 28):\n#Positive influence:\nyoutubeStream = True\n#There is no opposite influence\n...\n```\n\nFrom this partial output you can see which variables influence the missing value: if a user changes his/her tariff 2 or 3 times (**positive influence**), and does not have a Netflix, Pickbox, YouTube or HBO GO stream (**opposite influence**), then he/she is probably between 14 and 18 years old.\n\n## Notice\n\nThis is a simple algorithm for finding missing values and it has not been tested on real-world data/applications. Do not use it in production before you double-check that everything works as expected.\n\n## License\n\nThis project is licensed under the MIT License - see the *LICENSE* file for details.\n\n## References\n\n\u003csup\u003e[1]\u003c/sup\u003e https://en.wikipedia.org/wiki/Case-based_reasoning","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanjinkurelic%2Fcasebasedreasoning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsanjinkurelic%2Fcasebasedreasoning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanjinkurelic%2Fcasebasedreasoning/lists"}