## Different Types of Data Splitting Methods


Data splitting is essential in machine learning: it helps prevent overfitting and ensures that a model can generalize to unseen data. Let's examine a few common data splitting techniques:

![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F5658374%2F1104d9898cd56b7c9c7d70141c23a3fd%2F1_train-test-split_0.jpg?generation=1697623539235388&alt=media)

Image credit: Train/test split procedure | Michael Galarnyk | Built In

1. **Train/Test Split**
This is the simplest method: the data is divided once into a training set, used to fit the model, and a held-out testing set, used to evaluate it.

```python
from sklearn.model_selection import train_test_split

X, y = [...] # Your data and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
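
For a concrete end-to-end picture, here is a minimal sketch of this split in action; the iris dataset and the logistic regression model are illustrative choices, not part of the original snippet.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # illustrative built-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LogisticRegression(max_iter=1000)  # illustrative model choice
model.fit(X_train, y_train)  # fit only on the training split
print("Test accuracy:", model.score(X_test, y_test))  # evaluate on held-out data
```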

2. **K-Fold Cross Validation**
This method involves splitting the data into 'k' subsets, called folds. The model is trained on k-1 of these folds and tested on the remaining one. This process is repeated k times, each time with a different fold serving as the test set.

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
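
When you only need the per-fold scores rather than the raw indices, `cross_val_score` can run the same loop in one call. A brief sketch (the iris data and classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score runs the fit/predict/score loop over all five folds for us
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```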

3. **Stratified K-Fold Cross Validation**
Like K-Fold, but it ensures that each fold maintains the same distribution of classes as the entire dataset.

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
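
One way to see the stratification at work is to count the classes in each test fold; a small sketch on the iris dataset (an illustrative choice), where every fold should mirror the full dataset's 1:1:1 class balance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each test fold of 30 samples should contain 10 samples of each class
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {fold}: test class counts = {np.bincount(y[test_index])}")
```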

4. **Time Series Split**
Useful for time series data. In each split, the test set consists of the next 'n' points in time, and the training set contains only points that come before them. This avoids "looking into the future" during training.

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
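
Printing the indices makes the "no looking into the future" property visible: every training window ends before its test window begins. A toy sketch on ten synthetic, time-ordered points:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # ten time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # Training indices always precede the test indices in time
    print("train:", train_index, "test:", test_index)
```
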
5. **Leave One Out Cross Validation (LOOCV)**
This involves training on all data points except one and testing on that single left-out point, repeated once for every data point. It is computationally expensive but can be useful for small datasets.

```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
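
LOOCV is equivalent to `KFold(n_splits=n)` with `n` equal to the number of samples, so the number of model fits grows linearly with the dataset. As a sketch of the cost, here it is driven through `cross_val_score` on the iris dataset (an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One model fit per sample: 150 fits for the 150-row iris dataset
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())  # each fold's score is 0 or 1
```
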
6. **Stratified Sampling**

Stratified sampling ensures that the training and test sets have approximately the same percentage of samples of each target class as the complete set.

```python
from sklearn.model_selection import train_test_split

# Assume X is your feature matrix and y is your labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
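
To confirm that the split preserved the class ratios, you can compare the label proportions before and after; a quick sketch, again using the iris dataset as an illustrative stand-in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The class proportions should be (approximately) identical in all three
print("full :", np.bincount(y) / len(y))
print("train:", np.bincount(y_train) / len(y_train))
print("test :", np.bincount(y_test) / len(y_test))
```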

7. **Group K-Fold Cross-Validation**

Group K-Fold cross-validation is a variation of k-fold cross-validation that ensures the same group is never represented in both the training and test sets. This matters when observations are not independent, for example multiple measurements from the same patient.

```python
from sklearn.model_selection import GroupKFold

groups = [...]  # a list of group identifiers, one per observation in X
gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test your model here
```
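
A toy example makes the guarantee concrete: with three groups, no group identifier ever appears on both sides of a split. The data below is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)        # 12 observations
y = np.array([0, 1] * 6)                # illustrative labels
groups = np.repeat(["a", "b", "c"], 4)  # three groups of four observations

gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(X, y, groups):
    # No group ever appears on both sides of a split
    print("test groups:", set(groups[test_index]),
          "train groups:", set(groups[train_index]))
```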

**In the examples above, `X` and `y` are your feature matrix and label vector, respectively; they are assumed to be NumPy arrays so they can be indexed with the index arrays the splitters return. You also need scikit-learn (`sklearn`) installed in your environment.**

Always make sure that no information from the test set leaks into the training set. This is especially important in settings like time series forecasting, where a model that has effectively "seen the future" during training will report misleadingly optimistic results. One common safeguard is sketched below.
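
A frequent source of such leakage is fitting preprocessing, such as a scaler, on the full dataset before splitting. A minimal sketch of one safeguard, assuming scikit-learn: bundle the preprocessing and the model in a `Pipeline`, so the preprocessing is re-fit on the training portion of every fold (the dataset and model are illustrative choices).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is re-fit on each fold's training data only, never on its test data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("Leak-free CV accuracy:", scores.mean())
```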

## Credit: BytesOfIntelligence