https://github.com/mrseanryan/data-type-predictor
Given the name of a property or attribute like 'BrandName' or 'AmountReceived', try to predict a data type like String, Boolean, Integer...
- Host: GitHub
- URL: https://github.com/mrseanryan/data-type-predictor
- Owner: mrseanryan
- License: mit
- Created: 2022-12-22T14:00:12.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2022-12-23T15:55:36.000Z (about 2 years ago)
- Last Synced: 2024-11-07T10:52:41.310Z (about 2 months ago)
- Topics: ai, data-classification, data-types, nlp, stemming
- Language: Python
- Homepage:
- Size: 29.3 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# data-type-predictor README
Given the name of a property or attribute like 'BrandName' or 'AmountReceived', try to predict a data type like String, Boolean, Integer...
# Dependencies
```
python3 -m pip install --upgrade parameterized==0.7.5 levenshtein==0.20.8 Flask==2.2.2
```

# Usage - Prediction
```
python3 ./src/predict-type-from-name.py [--help --fuzzy]
```

## Example - passing a property name on the command line
```
python3 ./src/predict-type-from-name.py Actve --fuzzy
```

Output:
```
Actve=Boolean
```

## Example - REPL to try it out
```
python3 ./src/predict-type-from-name-repl.py
```

Output:
```
Enter a property name like 'Color' or 'BrandName' or 'CreatedOn'
(just press ENTER to exit) ->ExportedOn
ExportedOn=Date
(just press ENTER to exit) ->ItemWidth
ItemWidth=Integer
```

## Example - running as a REST API
```
./go.api.sh
```

Open a URL with a property name at the end:
```
http://127.0.0.1:5000/predict_type/Branded
```

Output:
```
property_name: Branded -> predicted type=Boolean
```

# Usage - Evaluation
```
python3 ./src/evaluate.py [--help --fuzzy]
```

## Example
```
python3 ./src/evaluate.py ./data/names-and-types.small.1.json
```

Output:
```
# Accuracy:
45% correctly predicted
5% incorrectly predicted
50% not predicted
Data set size: 66 words
```

# Usage - Training
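A small element of Machine Learning is used to optimize the prediction parameters for a given data set.
The accuracy measure used is TP/(TP+FP): the share of attempted predictions that are correct (this ratio is more commonly called precision). The cost function is defined simply so that minimising it maximises the accuracy.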
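The measure and cost described above could be sketched roughly as follows. This is only an illustration, not the project's `train.py`: the function names, the `evaluate` callback, and the candidate parameter ranges are all hypothetical, and defining the cost as `100 - accuracy` is an assumption (it happens to be consistent with a result such as `cost=29, accuracy=71`). The exhaustive grid search is likewise an assumed optimisation strategy.

```python
from itertools import product


def accuracy(true_positives: int, false_positives: int) -> float:
    """Share of attempted predictions that were correct, as a percentage."""
    attempted = true_positives + false_positives
    if attempted == 0:
        return 0.0  # nothing was predicted, so treat accuracy as zero
    return 100.0 * true_positives / attempted


def cost(true_positives: int, false_positives: int) -> float:
    """Assumed cost: 100 minus accuracy, so lower cost means higher accuracy."""
    return 100.0 - accuracy(true_positives, false_positives)


def grid_search(evaluate, max_distances=(0, 1, 2), min_lengths=(2, 3, 4, 5, 6)):
    """Try every (max_distance, min_length) pair and keep the cheapest one.

    `evaluate` is a hypothetical callback that returns
    (true_positives, false_positives) for a given configuration.
    Ties on cost go to the smaller max_distance, then the smaller min_length.
    """
    best = None
    for max_distance, min_length in product(max_distances, min_lengths):
        tp, fp = evaluate(max_distance, min_length)
        candidate = (cost(tp, fp), max_distance, min_length)
        if best is None or candidate < best:
            best = candidate
    return best  # (cost, max_distance, min_length)
```

With this tie-breaking rule, a configuration with `max_distance=0` (effectively no fuzzy matching) wins whenever fuzzy matching does not strictly lower the cost.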
## Example
```
python3 ./src/train.py ./data/ip-xxx-big.json
Training...
[done]
Optimal config:
is_fuzzy=False, max_distance=0, min_length=2, cost=29, accuracy=71
```

Unfortunately, Machine Learning indicates that the optimal configuration can be achieved WITHOUT fuzzy matching!
However, for UX reasons, fuzzy matching still seems useful, given that a fuzzy configuration can reach the same accuracy against the data.

# Approach
1. The property name (the word) is stemmed into smaller tokens, assuming camelCase or PascalCase
2. Heuristics are run to try to recognise the first or last token. Example: `is` or `can` indicates `Boolean`. If a match is found, exit.
3. [If fuzzy matching is enabled] Levenshtein distance is then allowed on the longer tokens, to try to get a fuzzy match.

# Evaluation (Validation)
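To make the approach being evaluated more concrete, the three steps listed under Approach can be sketched as below. This is a minimal sketch under stated assumptions, not the project's actual code: the rule tables, function names, and default parameters are hypothetical, and the edit distance is written in pure Python here so the example has no dependencies (the project itself lists the `levenshtein` package).

```python
import re

# Hypothetical rule tables: the project's real heuristics are not shown here.
PREFIX_RULES = {"is": "Boolean", "can": "Boolean", "has": "Boolean"}
SUFFIX_RULES = {"name": "String", "on": "Date", "width": "Integer", "active": "Boolean"}


def split_tokens(name):
    """Step 1: split a camelCase or PascalCase name into lower-case tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def predict_type(name, fuzzy=False, max_distance=1, min_length=5):
    """Return a predicted type name, or None when nothing matches."""
    tokens = split_tokens(name)
    if not tokens:
        return None
    first, last = tokens[0], tokens[-1]
    # Step 2: exact heuristics on the first or last token.
    if first in PREFIX_RULES:
        return PREFIX_RULES[first]
    if last in SUFFIX_RULES:
        return SUFFIX_RULES[last]
    # Step 3: optional fuzzy match, restricted to longer tokens.
    if fuzzy and len(last) >= min_length:
        for known, data_type in SUFFIX_RULES.items():
            if levenshtein(last, known) <= max_distance:
                return data_type
    return None
```

With these illustrative rules, `predict_type("Actve")` finds nothing, while `predict_type("Actve", fuzzy=True)` matches `active` at distance 1. Restricting fuzzy matching to longer tokens is what keeps short tokens such as `on` from matching almost anything.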
## Data set: 66 words
| Approach | Accuracy | Correctly predicted | Incorrectly predicted | Not predicted | Data set | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Heuristics, no fuzzy match | - | 45% | 5% | 50% | 66 words | 'Safe' predictions. |
| Heuristics, with fuzzy match (min length 3, max distance 5) | - | 47% | 48% | 5% | 66 words | 'Unsafe' fuzzy predictions: small gain in true positives at the cost of many more false positives. |
| Heuristics, with fuzzy match (min length 2, max distance 2) | - | 50% | 14% | 36% | 66 words | 'Safer' fuzzy predictions. |
| Heuristics, with fuzzy match (min length 5, max distance 2) | 91% | 47% | 5% | 48% | 66 words | 'Safer' fuzzy predictions. |
| *ML Optimized* Heuristics, with NO fuzzy match (min length 2) | 91% | 45% | 5% | 50% | 66 words | _Machine Learning optimized the 5600 item data set_ -> Fuzzy is OFF. |

## Data set: 5640 words
| Approach | Accuracy | Correctly predicted | Incorrectly predicted | Not predicted | Data set | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| Heuristics, no fuzzy match | - | 16% | 7% | 77% | 5640 words | 'Safe' predictions. |
| Heuristics, with fuzzy match (min length 2, max distance 2) | - | 24% | 30% | 46% | 5640 words | Fuzzy predictions. |
| Heuristics, with fuzzy match (min length 5, max distance 2) | 68% | 17% | 8% | 75% | 5640 words | Fuzzy predictions. |
| *ML Optimized* Heuristics, with forced fuzzy match (min length 6, max distance 1) | 71% | 16% | 7% | 77% | 5640 words | _Machine Learning optimized THIS data set_ Fuzzy is forced ON, learned optimal token length. |
| *ML Optimized* Heuristics, with NO fuzzy match | 71% | 16% | 7% | 77% | 5640 words | _Machine Learning optimized THIS data set_ -> Fuzzy is OFF. |
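
The three percentage columns in these tables could be produced by a tally along the following lines. This is a hedged sketch, not the project's `evaluate.py`: the `evaluate` function, its arguments, and the result keys are hypothetical names chosen for illustration, and rounding to whole percentages is an assumption.

```python
def evaluate(predict, labelled_names):
    """Tally predictions into correct / incorrect / not predicted percentages.

    `predict` is any callable returning a type name or None, and
    `labelled_names` maps each property name to its expected type.
    """
    correct = incorrect = not_predicted = 0
    for name, expected in labelled_names.items():
        predicted = predict(name)
        if predicted is None:
            not_predicted += 1   # the predictor declined to guess
        elif predicted == expected:
            correct += 1
        else:
            incorrect += 1

    total = len(labelled_names)
    if total == 0:
        return {"correct_pct": 0, "incorrect_pct": 0, "not_predicted_pct": 0}
    return {
        "correct_pct": round(100 * correct / total),
        "incorrect_pct": round(100 * incorrect / total),
        "not_predicted_pct": round(100 * not_predicted / total),
    }
```

The Accuracy column then relates the first two buckets only: for example, 17% correct and 8% incorrect gives 17 / (17 + 8) = 68%.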