Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ahammadmejbah/different-types-of-data-splitting-methods
Data splitting is essential in machine learning to prevent overfitting and to ensure that a model generalizes to new data.
- Host: GitHub
- URL: https://github.com/ahammadmejbah/different-types-of-data-splitting-methods
- Owner: ahammadmejbah
- License: gpl-3.0
- Created: 2023-10-18T10:19:53.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-19T11:26:16.000Z (about 1 year ago)
- Last Synced: 2024-08-28T20:40:50.106Z (4 months ago)
- Topics: data, data-analysis, data-engineering, data-mining, data-science, data-visualization
- Homepage: https://blogs.ahammedmejbah.com/
- Size: 95.7 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Different types of data splitting methods
Data splitting is essential in machine learning to prevent overfitting and to ensure that a model generalizes to new data. Let's examine a few common data splitting techniques:
![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F5658374%2F1104d9898cd56b7c9c7d70141c23a3fd%2F1_train-test-split_0.jpg?generation=1697623539235388&alt=media)
Image Credit: Train test split procedure. | Image: Michael Galarnyk | Built In
1. **Train/Test Split**
This is the simplest method. We split our data into a training set and a testing set.

```python
from sklearn.model_selection import train_test_split

X, y = [...]  # Your data and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
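To see the resulting split sizes, here is a minimal runnable sketch; the synthetic dataset from `make_classification` is an illustrative assumption, not part of the original example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration: 100 samples, 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)  # (67, 5) (33, 5): roughly a 2:1 split
```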
2. **K-Fold Cross Validation**
This method involves splitting the data into 'k' subsets, or folds. The model is trained on k-1 of these folds and tested on the remaining one. This process is repeated k times, each time with a different fold as the test set.

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
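In practice, the manual loop above is often replaced by `cross_val_score`, which runs the same fold-by-fold procedure and returns one score per fold. A minimal sketch, where the `LogisticRegression` estimator and the synthetic dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold; the mean summarizes generalization performance
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores, scores.mean())
```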
3. **Stratified K-Fold Cross Validation**
Like K-Fold, but each fold maintains approximately the same class distribution as the entire dataset.

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
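To verify the stratification, the sketch below counts the classes in each test fold; the imbalanced toy labels (90 of class 0, 10 of class 1) are an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # Imbalanced toy labels
X = np.zeros((100, 1))             # Features do not affect the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    # Every test fold keeps the 90/10 ratio: 18 samples of class 0, 2 of class 1
    print(np.bincount(y[test_index]))
```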
4. **Time Series Split**
Useful for time series data. In each split, the test set consists of the next 'n' points in time, which prevents the model from "looking into the future" during training.

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
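The sketch below prints the indices produced by `TimeSeriesSplit` to show the expanding training window; the 12-point toy series is an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in time order
tscv = TimeSeriesSplit(n_splits=5)

# The training window grows; each test window is the block that follows it in time
for train_index, test_index in tscv.split(X):
    print("train:", train_index, "test:", test_index)
```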
5. **Leave One Out Cross Validation (LOOCV)**
This involves training on all data points except one and testing on that single left-out point, repeated once for every data point. It is computationally intensive but can be useful for small datasets.

```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
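LOOCV fits one model per sample, so its cost grows linearly with dataset size. A minimal sketch using `cross_val_score`; the iris dataset and the `LogisticRegression` estimator are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# 150 samples -> 150 separate model fits, each tested on a single point
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```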
6. **Stratified Sampling**
Stratified sampling ensures that the training and test sets have approximately the same percentage of samples of each target class as the complete set.
```python
from sklearn.model_selection import train_test_split

# Assume X is your feature matrix and y is your labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
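To confirm that `stratify=y` preserves the class ratio in both splits, here is a minimal sketch; the 80/20 toy labels are an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)  # Imbalanced toy labels
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Both splits keep the 80/20 class ratio
print(np.bincount(y_train) / len(y_train))  # [0.8 0.2]
print(np.bincount(y_test) / len(y_test))    # [0.8 0.2]
```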
7. **Group K-Fold Cross-Validation**
Group K-Fold cross-validation is a variation of k-fold cross-validation that ensures the same group is not represented in both the training and test sets.

```python
from sklearn.model_selection import GroupKFold

groups = [...]  # A list of group identifiers, one per observation in X
gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```

**In the above examples, `X` and `y` are your feature matrix and label vector, respectively. You also need scikit-learn (`sklearn`) installed in your environment.**
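To make the role of `groups` concrete, the sketch below uses toy group labels (an illustrative assumption, e.g. one group per subject) and shows that each test fold contains whole groups only:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])  # Toy group labels

gkf = GroupKFold(n_splits=4)
for train_index, test_index in gkf.split(X, y, groups):
    # Each test fold contains exactly one whole group, never a partial one
    print("test groups:", np.unique(groups[test_index]))
```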
Always make sure that no information from the test set leaks into the training set. This is especially critical in settings like time series forecasting, where letting the model see future data during training can produce misleadingly optimistic results.
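A common and easy-to-miss leak is fitting preprocessing on the full dataset before splitting. A minimal sketch of the safe pattern, where the `StandardScaler` and the synthetic dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: StandardScaler().fit(X) would let test-set statistics influence training
# Safe: fit preprocessing on the training set only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```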
## Credit
BytesOfIntelligence