{"id":18689693,"url":"https://github.com/mrseanryan/data-type-predictor","last_synced_at":"2025-11-08T07:30:30.003Z","repository":{"id":96809641,"uuid":"581187059","full_name":"mrseanryan/data-type-predictor","owner":"mrseanryan","description":"Given the name of a property or attribute like 'BrandName' or 'AmountReceived', try to predict a data type like String, Boolean, Integer...","archived":false,"fork":false,"pushed_at":"2022-12-23T15:55:36.000Z","size":30,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-28T01:49:28.125Z","etag":null,"topics":["ai","data-classification","data-types","nlp","stemming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrseanryan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-22T14:00:12.000Z","updated_at":"2023-11-23T02:32:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"e81b5191-005a-41bd-8bb8-730eac2c4b59","html_url":"https://github.com/mrseanryan/data-type-predictor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrseanryan%2Fdata-type-predictor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrseanryan%2Fdata-type-predictor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrseanryan%2Fdata-type-predictor/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/mrseanryan%2Fdata-type-predictor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrseanryan","download_url":"https://codeload.github.com/mrseanryan/data-type-predictor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239550286,"owners_count":19657541,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-classification","data-types","nlp","stemming"],"created_at":"2024-11-07T10:44:47.096Z","updated_at":"2025-11-08T07:30:29.958Z","avatar_url":"https://github.com/mrseanryan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# data-type-predictor README\n\nGiven the name of a property or attribute like 'BrandName' or 'AmountReceived', try to predict a data type like String, Boolean, Integer...\n\n# Dependencies\n\n```\npython3 -m pip install --upgrade parameterized==0.7.5 levenshtein==0.20.8 Flask==2.2.2\n```\n\n# Usage - Prediction\n\n```\npython3 ./src/predict-type-from-name.py \u003cname\u003e [--help --fuzzy]\n```\n\n## Example - passing a property name on the command line\n\n```\npython3 ./src/predict-type-from-name.py Actve --fuzzy\n```\n\nOutput:\n\n```\nActve=Boolean\n```\n\n## Example - REPL to try it out\n\n```\npython3 ./src/predict-type-from-name-repl.py\n```\n\nOutput:\n\n```\nEnter a property name like 'Color' or 'BrandName' or 'CreatedOn'\n(just press ENTER to exit) -\u003eExportedOn\nExportedOn=Date\n(just press ENTER to exit) -\u003eItemWidth\nItemWidth=Integer\n```\n\n## Example - running 
as a REST API\n\n```\n./go.api.sh\n```\n\nOpen a URL with a property-name at the end:\n\n```\nhttp://127.0.0.1:5000/predict_type/Branded\n```\n\nOutput:\n```\nproperty_name: Branded -\u003e predicted type=Boolean\n```\n\n\n# Usage - Evaluation\n\n```\npython3 ./src/evaluate.py \u003cpath or glob to JSON data file(s)\u003e [--help --fuzzy]\n```\n\n## Example\n\n```\npython3 ./src/evaluate.py ./data/names-and-types.small.1.json\n```\n\nOutput:\n\n```\n# Accuracy:\n\n45% correctly predicted\n5% incorrectly predicted\n50% not predicted\nData set size: 66 words\n```\n\n# Usage - Training\n\nA small element of Machine Learning is used to optimize the parameters used to predict, for a given data set.\n\nThe Accuracy measure used is TP/(TP+FP): the share of the predictions made that are correct (usually called precision). The Cost function is defined simply to maximise this accuracy.\n\n## Example\n\n```\npython3 ./src/train.py ./data/ip-xxx-big.json\nTraining...\n[done]\nOptimal config:\nis_fuzzy=False, max_distance=0, min_length=2, cost=29, accuracy=71\n```\n\nUnfortunately, Machine Learning indicates that the optimal configuration can be achieved WITHOUT fuzzy matching!\nHowever, for UX reasons, fuzzy matching still seems useful, given that the accuracy against the data is the same.\n\n# Approach\n\n1. The property name (the word) is stemmed into smaller tokens, assuming camelCase or PascalCase\n2. Heuristics are run to try and recognise the first or last token. Example: `is` or `can` indicates `Boolean`. If a match is found, exit.\n3. 
[If fuzzy matching is enabled] Levenshtein distance is then allowed on the longer tokens, to try to get a fuzzy match.\n\n# Evaluation (Validation)\n\n## Data set: 66 words\n\n| Approach                                                    | Accuracy | Correctly predicted | Incorrectly predicted  | Not predicted | Data set                                                                                | Comment                                                                                           |\n| ----------------------------------------------------------- | -------- | ------------------- | ---------------------- | ------------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |\n| Heuristics, no fuzzy match                                  | -        | 45%                 | 5%                     | 50%           | 66 words                                                                                | 'Safe' predictions                                                                                |\n| Heuristics, with fuzzy match (min length 3, max distance 5) | -        | 47%                 | 48%                    | 5%            | 66 words                                                                                | 'Unsafe' fuzzy predictions: small gain in true positives at the cost of many more false positives. |\n| Heuristics, with fuzzy match (min length 2, max distance 2) | -        | 50%                 | 14%                    | 36%           | 66 words                                                                                | 'Safer' fuzzy predictions.                                                                        
|\n| Heuristics, with fuzzy match (min length 5, max distance 2) | 91%      | 47%                 | 5%                     | 48%           | 66 words                                                                                | 'Safer' fuzzy predictions.                                                                        |\n| *ML Optimized* Heuristics, with NO fuzzy match (min length 2)              | 91%      | 45%                 | 5%                     | 50%           | 66 words | _Machine Learning optimized the 5640 item data set_ -\u003e Fuzzy is OFF. |\n\n## Data set: 5640 words\n\n| Approach                                                    | Accuracy | Correctly predicted | Incorrectly predicted  | Not predicted | Data set                                                                       | Comment              |\n| ----------------------------------------------------------- | -------- | ------------------- | ---------------------- | ------------- | ------------------------------------------------------------------------------ | -------------------- |\n| Heuristics, no fuzzy match                                  | -        | 16%                 | 7%                     | 77%           | 5640 words                                                                     | 'Safe' predictions.  |\n| Heuristics, with fuzzy match (min length 2, max distance 2) | -        | 24%                 | 30%                    | 46%           | 5640 words                                                                     | Fuzzy predictions.   
|\n| Heuristics, with fuzzy match (min length 5, max distance 2) | 68%      | 17%                 | 8%                     | 75%           | 5640 words                                                                     | Fuzzy predictions.   |\n| *ML Optimized* Heuristics, with forced fuzzy match (min length 6, max distance 1)              | 71%      | 16%                 | 7%                     | 77%     | 5640 words | _Machine Learning optimized THIS data set_ Fuzzy is forced ON, learned optimal token length. |\n| *ML Optimized* Heuristics, with NO fuzzy match               | 71%      | 16%                 | 7%                     | 77%     | 5640 words | _Machine Learning optimized THIS data set_ -\u003e Fuzzy is OFF. |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrseanryan%2Fdata-type-predictor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrseanryan%2Fdata-type-predictor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrseanryan%2Fdata-type-predictor/lists"}