https://github.com/grindelfp/datasets-analysis
The Machine Learning and Data Analysis course task dedicated to training skills of data normalizing and preprocessing.
- Host: GitHub
- URL: https://github.com/grindelfp/datasets-analysis
- Owner: GrindelfP
- License: mit
- Created: 2024-02-12T12:49:49.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-02-23T19:45:17.000Z (11 months ago)
- Last Synced: 2025-01-12T17:09:00.910Z (8 days ago)
- Topics: data-analysis, datasets, ipynb, mlda
- Language: Jupyter Notebook
- Homepage:
- Size: 3.14 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.adoc
- License: LICENSE
= Datasets analysis practice =
== Task ==
1. Pick 3 datasets from kaggle.com
2. Describe the type of each feature from each dataset
3. Analyze the data from the dataset that has the greatest variety of feature types
4. Formalize and normalize the data
5. Write conclusions about the work done

The work should be completed as a Jupyter notebook and submitted as a PDF file.

== Datasets ==
1. Chess Game Dataset (Lichess) - https://www.kaggle.com/datasnaek/chess
2. Google Play Store Apps - https://www.kaggle.com/lava18/google-play-store-apps
3. The books of Skyrim - https://www.kaggle.com/datasets/aadamg/skyrim-books-from-uesp

== Formalization ways ==
The data can be formalized in several ways, depending on the type of each feature in the dataset.

=== Binary features ===
Decide which of the two values maps to 0 and which maps to 1, then replace the values accordingly.

=== Nominal features ===
Create one column for each unique value of the feature. In each row, put 1 in the column that matches the row's value and 0 in all the other columns.
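
The one-column-per-value encoding can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's actual code: the helper name `one_hot` is hypothetical, and the sample values are the eye colours used in the example in this section.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per unique value."""
    # Unique values in first-seen order; each becomes a column.
    categories = list(dict.fromkeys(values))
    # For every row, mark the matching column with 1 and the rest with 0.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, rows = one_hot(["blue", "green", "brown", "grey", "green"])
print(categories)  # ['blue', 'green', 'brown', 'grey']
for row in rows:
    print(row)
```

In a notebook that already uses pandas, `pandas.get_dummies` produces the same kind of indicator columns for a DataFrame column.
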
Example:
Color of eye: blue, green, brown, grey.

[options="header"]
.Table 1. Representation of the nominal feature as columns of binary values
|===
| blue | green | brown | grey
| 1 | 0 | 0 | 0
| 0 | 1 | 0 | 0
| 0 | 0 | 1 | 0
| 0 | 0 | 0 | 1
| ... | ... | ... | ...
|===

=== Ranged features ===
Divide each value by the greatest value of the feature, so that all values fall in the range from 0 to 1 (provided the feature has no negative values).

=== Features of quantity ===
Apply the following algorithm:

1. Round the values to the nearest integer
2. Sort the values in ascending order
3. Divide the sorted values into two groups: the lower half (0%..50%) and the upper half (above 50%)
4. Assign 0 to the first group and 1 to the second group
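
The binary, ranged, and quantity rules can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: all three function names are hypothetical, values are assumed to be non-negative for scaling, and for an odd number of values the lower group is assumed to take the extra item.

```python
def encode_binary(values, one_value):
    """Binary feature: `one_value` becomes 1, the other value becomes 0."""
    if len(set(values)) > 2:
        raise ValueError("a binary feature has at most two distinct values")
    return [1 if value == one_value else 0 for value in values]


def scale_by_max(values):
    """Ranged feature: divide by the greatest value (assumes values >= 0)."""
    greatest = max(values)
    if greatest <= 0:
        raise ValueError("the greatest value must be positive")
    return [value / greatest for value in values]


def binarize_quantity(values):
    """Quantity feature: round, sort, split at the midpoint, assign 0/1."""
    # Step 1: round (Python's round() sends exact .5 ties to the even integer).
    rounded = [round(value) for value in values]
    # Step 2: original indices ordered by ascending rounded value.
    order = sorted(range(len(rounded)), key=rounded.__getitem__)
    # Step 3: size of the lower group (assumption: it takes the extra item).
    half = (len(rounded) + 1) // 2
    # Step 4: 0 for the lower group, 1 for the upper group.
    labels = [0] * len(rounded)
    for position, index in enumerate(order):
        if position >= half:
            labels[index] = 1
    return labels


print(encode_binary(["white", "black", "white"], one_value="white"))  # [1, 0, 1]
print(scale_by_max([2, 5, 10]))                                       # [0.2, 0.5, 1.0]
print(binarize_quantity([3.2, 10.7, 1.1, 8.4]))                       # [0, 1, 0, 1]
```

Because the split is made by position in the sorted order, equal values that straddle the midpoint can land in different groups; splitting on a threshold value instead would be one way to avoid that.
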