{"id":13642620,"url":"https://github.com/jfilter/split-folders","last_synced_at":"2026-01-28T14:09:38.652Z","repository":{"id":33095589,"uuid":"151568644","full_name":"jfilter/split-folders","owner":"jfilter","description":"🗂 Split folders with files (i.e. images) into training, validation and test (dataset) folders","archived":false,"fork":false,"pushed_at":"2023-03-08T05:05:34.000Z","size":91,"stargazers_count":421,"open_issues_count":15,"forks_count":69,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-01T14:51:48.944Z","etag":null,"topics":["dataset","deep-learning","machine-learning","oversampling","python","python-package","splitting","test","training","validation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfilter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-10-04T12:33:35.000Z","updated_at":"2025-03-17T13:27:28.000Z","dependencies_parsed_at":"2022-06-27T05:12:57.922Z","dependency_job_id":null,"html_url":"https://github.com/jfilter/split-folders","commit_stats":{"total_commits":62,"total_committers":8,"mean_commits":7.75,"dds":"0.11290322580645162","last_synced_commit":"c566dbd56a1097e1ddba2de5dfb93bd67eade54f"},"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fsplit-folders","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fsplit-folders/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fsplit-folders/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fsplit-folders/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jfilter","download_url":"https://codeload.github.com/jfilter/split-folders/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247888559,"owners_count":21013001,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","deep-learning","machine-learning","oversampling","python","python-package","splitting","test","training","validation"],"created_at":"2024-08-02T01:01:34.009Z","updated_at":"2026-01-28T14:09:38.643Z","avatar_url":"https://github.com/jfilter.png","language":"Python","readme":"# `split-folders` [![Build Status](https://img.shields.io/github/actions/workflow/status/jfilter/split-folders/test.yml)](https://github.com/jfilter/split-folders/actions/workflows/test.yml) [![PyPI](https://img.shields.io/pypi/v/split-folders.svg)](https://pypi.org/project/split-folders/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/split-folders.svg)](https://pypi.org/project/split-folders/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/split-folders)](https://pypistats.org/packages/split-folders)\n\nSplit folders with files (e.g. 
images) into **train**, **validation** and **test** (dataset) folders.\n\nThe input folder should have the following format (with class subdirectories):\n\n```\ninput/\n    class1/\n        img1.jpg\n        img2.jpg\n        ...\n    class2/\n        imgWhatever.jpg\n        ...\n    ...\n```\n\nOr a **flat directory** without class subdirectories:\n\n```\ninput/\n    file1.jpg\n    file2.jpg\n    ...\n```\n\nIn order to give you this:\n\n```\noutput/\n    train/\n        class1/\n            img1.jpg\n            ...\n        class2/\n            imga.jpg\n            ...\n    val/\n        class1/\n            img2.jpg\n            ...\n        class2/\n            imgb.jpg\n            ...\n    test/\n        class1/\n            img3.jpg\n            ...\n        class2/\n            imgc.jpg\n            ...\n```\n\nThis should get you started to do some serious deep learning on your data. [Read here](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set) why it's a good idea to split your data into three different sets.\n\n-   Split files into a training set and a validation set (and optionally a test set).\n-   Works on any file types.\n-   Supports both class-based directory structures and flat directories.\n-   The files get shuffled (can be disabled for time series data).\n-   A [seed](https://docs.python.org/3/library/random.html#random.seed) makes splits reproducible.\n-   Allows randomized [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) for imbalanced datasets.\n-   Optionally group files by prefix or by stem.\n-   Optionally split files by file format(s).\n-   Split parallel directories (e.g. 
`images/` + `annotations/`) in lockstep.\n-   Custom grouping via callable.\n-   (Should) work on all operating systems.\n\n## Install\n\nThis package is Python only and there are no external dependencies.\n\n```bash\npip install split-folders\n```\n\nOptionally, you may install [tqdm](https://github.com/tqdm/tqdm) to get a progress bar when moving files.\n\n```bash\npip install split-folders[full]\n```\n\n## Usage\n\nYou can use `split-folders` as a Python module or as a Command Line Interface (CLI).\n\nIf your dataset is balanced (each class has the same number of samples), choose `ratio`, otherwise `fixed`.\nNB: oversampling is turned off by default.\nOversampling is only applied to the _train_ folder since having duplicates in _val_ or _test_ would be considered cheating.\n\n### Module\n\n```python\nimport splitfolders\n\n# Split with a ratio.\n# To only split into training and validation set, set a tuple to `ratio`, i.e., `(.8, .2)`.\nsplitfolders.ratio(\"input_folder\", output=\"output\",\n    seed=1337, ratio=(.8, .1, .1), group_prefix=None, group=None,\n    formats=None, move=False, shuffle=True) # default values\n\n# Split val/test with a fixed number of items, e.g. `(100, 100)`, for each set.\n# To only split into training and validation set, use a single number to `fixed`, i.e., `10`.\n# Set 3 values, e.g. 
`(300, 100, 100)`, to limit the number of training values.\nsplitfolders.fixed(\"input_folder\", output=\"output\",\n    seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, group=None,\n    formats=None, move=False, shuffle=True) # default values\n\n# Use `fixed=\"auto\"` with oversampling to auto-compute the val size from the smallest class.\n# Allocates ~20% of the smallest class to validation, rest to training.\nsplitfolders.fixed(\"input_folder\", output=\"output\",\n    seed=1337, fixed=\"auto\", oversample=True)\n\n# Split into k folds for cross-validation.\n# Each fold directory contains train/ and val/ subdirectories.\n# Uses symlinks by default to avoid k× disk usage.\nsplitfolders.kfold(\"input_folder\", output=\"output\",\n    seed=1337, k=5, group_prefix=None, group=None,\n    formats=None, move=\"symlink\", shuffle=True) # default values\n\n# Split without shuffling (e.g. for time series data).\nsplitfolders.ratio(\"input_folder\", output=\"output\",\n    ratio=(.8, .1, .1), shuffle=False)\n```\n\n### Flat directories\n\nIf your input folder contains files directly (no class subdirectories), `splitfolders` auto-detects this and splits files into `train/`, `val/`, `test/` without creating class subfolders:\n\n```python\n# input_folder/ contains file1.jpg, file2.jpg, ... (no subdirs)\nsplitfolders.ratio(\"input_folder\", output=\"output\", ratio=(.8, .1, .1))\n```\n\nOutput:\n```\noutput/\n    train/\n        file1.jpg\n        ...\n    val/\n        file5.jpg\n        ...\n```\n\n\u003e **Note:** Oversampling is not available with flat directories (there are no classes to balance).\n\n### Grouping files\n\nWhen your data has multiple files per sample (e.g. an image and its annotation), you need to keep them together during the split. There are several ways to do this.\n\n#### Group by prefix (`group_prefix`)\n\nThe legacy approach. Set `group_prefix` to the number of files per group (e.g. `2` for image + annotation pairs). 
Files are grouped by their filename stem (the part before the extension). All stems must have exactly `group_prefix` files.\n\n```\ninput/cats/\n    img1.jpg   img1.txt\n    img2.jpg   img2.txt\n```\n\n```python\nsplitfolders.ratio(\"input\", output=\"output\", group_prefix=2)\n```\n\n#### Group by stem (`group=\"stem\"`)\n\nA simpler alternative to `group_prefix`. Automatically groups files that share the same stem and discovers the group size. No need to specify how many files per group — it just requires every stem to have the same count.\n\n```python\nsplitfolders.ratio(\"input\", output=\"output\", group=\"stem\")\n```\n\nIf every stem has only one file (e.g. a folder of just `.jpg` files), `group=\"stem\"` behaves identically to no grouping.\n\n#### Group by sibling directories (`group=\"sibling\"`)\n\nFor datasets where file types live in **parallel directories** rather than alongside each other:\n\n```\ndata/\n    images/\n        im_1.jpg\n        im_2.jpg\n    annotations/\n        im_1.xml\n        im_2.xml\n```\n\nUse `group=\"sibling\"` to split all directories in lockstep, matching files across directories by stem:\n\n```python\nsplitfolders.ratio(\"data\", output=\"output\", group=\"sibling\")\n```\n\nThis produces:\n\n```\noutput/\n    train/\n        images/im_1.jpg\n        annotations/im_1.xml\n    val/\n        images/im_2.jpg\n        annotations/im_2.xml\n```\n\nRequirements:\n- The input must have at least 2 subdirectories.\n- Every stem must exist in every subdirectory.\n- Cannot be combined with `oversample=True`.\n\n#### Custom grouping (`group=callable`)\n\nFor advanced use cases, pass any callable that takes a list of `Path` objects and returns a list of tuples:\n\n```python\ndef my_grouping(files):\n    # Custom logic to group files\n    # Return: list of tuples of Path objects\n    ...\n\nsplitfolders.ratio(\"input\", output=\"output\", group=my_grouping)\n```\n\nThis also covers **manifest-based splitting** (#41). 
For example, if you have a CSV that defines train/test assignments:\n\n```python\ndef group_from_manifest(files):\n    manifest = load_my_csv(\"split_manifest.csv\")\n    # return list of tuples grouped according to manifest\n    ...\n\nsplitfolders.ratio(\"input\", output=\"output\", group=group_from_manifest)\n```\n\n\u003e **Note:** `group_prefix` and `group` are mutually exclusive — setting both raises a `ValueError`.\n\n### File formats\n\nThere might be some instances when you have multiple file formats in these folders. Provide one or multiple extension(s) to `formats` for splitting only certain files (e.g. `formats=['.jpeg', '.png']`).\n\n### Move options\n\nSet\n- `move=True` or `move='move'` if you want to move the files instead of copying.\n- `move=False` or `move='copy'` if you want to copy the files. (default behavior)\n- `move='symlink'` if you want to symlink (i.e. create shortcuts `ln -s`) instead of copying.\n\n### CLI\n\n```\nUsage:\n    splitfolders [--output] [--ratio] [--fixed] [--kfold] [--seed] [--oversample] [--group_prefix] [--group] [--formats] [--move] [--no-shuffle] folder_with_images\nOptions:\n    --output        path to the output folder. defaults to `output`. Get created if non-existent.\n    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.\n    --fixed         set the absolute number of items per validation/test set. The remaining items constitute\n                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.\n                    Set 3 values, e.g. `300 100 100`, to limit the number of training values.\n                    Use `auto` to auto-compute from the smallest class (requires --oversample).\n    --kfold         split into k folds for cross-validation. e.g. `5` for 5-fold CV. Uses symlinks by default.\n    --seed          set seed value for shuffling the items. 
defaults to 1337.\n    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.\n    --group_prefix  split files into equally-sized groups based on their prefix\n    --group         grouping strategy: 'stem' or 'sibling' (mutually exclusive with --group_prefix)\n    --formats       split the files based on specified extension(s)\n    --move          move the files instead of copying\n    --symlink       symlink(create shortcut) the files instead of copying\n    --no-shuffle    do not shuffle files before splitting (useful for time series data)\nExample:\n    splitfolders --ratio .8 .1 .1 -- folder_with_images\n    splitfolders --kfold 5 folder_with_images\n    splitfolders --group stem --ratio .8 .1 .1 -- folder_with_images\n    splitfolders --group sibling --ratio .8 .1 .1 -- data_with_parallel_dirs\n```\n\nBecause of some [Python quirks](https://github.com/jfilter/split-folders/issues/19) you have to prepend ` --` after using `--ratio`.\n\nInstead of the command `splitfolders` you can also use `split_folders` or `split-folders`.\n\n## Development\n\nInstall and use [poetry](https://python-poetry.org/).\n\n## Contributing\n\nIf you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/split-folders/issues).\n\n**Pull requests** are especially welcomed when they fix bugs or improve the code quality.\n\n## License\n\nMIT\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fsplit-folders","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfilter%2Fsplit-folders","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fsplit-folders/lists"}