{"id":15066211,"url":"https://github.com/serezd/ffcv_pytorch_lightning","last_synced_at":"2025-04-10T13:42:43.174Z","repository":{"id":65458120,"uuid":"574991336","full_name":"SerezD/ffcv_pytorch_lightning","owner":"SerezD","description":"[FFCV-PL] manage fast data loading with ffcv and pytorch lightning","archived":false,"fork":false,"pushed_at":"2023-07-17T10:31:08.000Z","size":66,"stargazers_count":15,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-04T05:03:25.764Z","etag":null,"topics":["dataloader","ffcv","pytorch","pytorch-lightning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SerezD.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-06T14:15:40.000Z","updated_at":"2025-01-18T12:53:19.000Z","dependencies_parsed_at":"2025-02-17T10:41:40.757Z","dependency_job_id":null,"html_url":"https://github.com/SerezD/ffcv_pytorch_lightning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SerezD%2Fffcv_pytorch_lightning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SerezD%2Fffcv_pytorch_lightning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SerezD%2Fffcv_pytorch_lightning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SerezD%2Fffcv_pytorch_lightning/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SerezD","download_url":"https://codeload.github.com/SerezD/ffcv_pytorch_lightning/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248226247,"owners_count":21068168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataloader","ffcv","pytorch","pytorch-lightning"],"created_at":"2024-09-25T01:03:46.933Z","updated_at":"2025-04-10T13:42:43.154Z","avatar_url":"https://github.com/SerezD.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FFCV Dataloader with Pytorch Lightning\n\nFFCV is a fast dataloader for neural network training: https://github.com/libffcv/ffcv  \n\nThis repository presents all the steps to install and configure it with pytorch-lightning.  \nThe idea is to provide very generic methods and utils, while letting the user decide and configure everything.\n\n## Installation\n\nTested with: \n```\nUbuntu 22.04.2 LTS\npython 3.11\nffcv==1.0.2\npytorch==2.0.1\npytorch-lightning==2.0.4\n```\n\n### Dependencies\n\nYou can install dependencies (FFCV, Pytorch) with the provided `environment.yml` file:  \n```\nconda env create --file environment.yml\nconda activate ffcv-pl\n```\nThis should correctly create a conda environment named `ffcv-pl`.  \n\n**Note:** Modify the pytorch-cuda version to the one compatible with your system.\n\n**Note:** Solving the environment can take quite a long time. 
\nI suggest using the [libmamba solver](https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community) \nto speed up the process.\n\n**If the above does not work**, then another option is manual installation: \n\n1. create conda environment\n    ```\n    conda create --name ffcv-pl\n    conda activate ffcv-pl\n    ```\n\n2. install pytorch according to the [official website](https://pytorch.org/get-started/locally/) \n\n    ```\n    # in my environment the command is the following \n    conda install pytorch torchvision torchaudio pytorch-cuda=[your-version] -c pytorch -c nvidia\n    ```\n\n3. install ffcv dependencies and pytorch-lightning\n    ```\n    # can take some time for solving, but should not create conflicts\n    conda install cupy pkg-config libjpeg-turbo\"\u003e=2.1.4\" opencv numba pytorch-lightning\"\u003e=2.0.0\" -c pytorch -c conda-forge\n    ```\n\n4. install ffcv\n    ```\n    pip install ffcv\n    ```\n\nFor further help, check out the FFCV installation guidelines: [ffcv official page](https://github.com/libffcv/ffcv)\n\n### Package\n\nOnce dependencies are installed, it is safe to install the package: \n```\npip install ffcv_pl\n```\n\n## Dataset Creation\n\nYou need to save your dataset in ffcv format (`.beton`).  \nOfficial FFCV [docs](https://docs.ffcv.io/writing_datasets.html).\n\nThis package provides the `create_beton_wrapper` function, which lets you easily create\na `.beton` dataset from a `torch` dataset.  
\n\nExample from the `dataset_creation.py` script:\n\n```\nfrom ffcv.fields import RGBImageField\n\nfrom ffcv_pl.generate_dataset import create_beton_wrapper\nfrom torch.utils.data.dataset import Dataset\nimport numpy as np\nfrom PIL import Image\n\n\nclass ToyImageLabelDataset(Dataset):\n\n    def __init__(self, n_samples: int):\n        self.samples = [Image.fromarray((np.random.rand(32, 32, 3) * 255).astype('uint8')).convert('RGB')\n                        for _ in range(n_samples)]\n\n    def __len__(self):\n        return len(self.samples)\n\n    def __getitem__(self, idx):\n        return (self.samples[idx], int(idx))\n\n\ndef main():\n\n    # 1. Instantiate the torch dataset that you want to convert\n    # Important: the dataset's __getitem__ must return tuples! (This is required by the FFCV library)\n    image_label_dataset = ToyImageLabelDataset(n_samples=256)\n    \n    # 2. Optional: create Field objects.\n    # Here only the RGBImageField is overridden; None keeps the default IntField.\n    fields = (RGBImageField(write_mode='jpg', max_resolution=32), None)\n    \n    # 3. Call the function, and it will automatically create the .beton dataset for you.\n    create_beton_wrapper(image_label_dataset, \"./data/image_label.beton\", fields)\n\n\nif __name__ == '__main__':\n\n    main()\n\n```\n\n## Dataloader and Datamodule\n\n`FFCVDataModule` merges the PL DataModule with the FFCV Loader object.  \nOfficial FFCV Loader [docs](https://docs.ffcv.io/making_dataloaders.html).   \nOfficial Pytorch-Lightning DataModule [docs](https://lightning.ai/docs/pytorch/stable/data/datamodule.html).\n\n`main.py` gives a complete example of how to use the `FFCVDataModule` class and train a \nLightning model.\n\nThe main steps to follow are:\n1. Create an `FFCVPipelineManager` object, which needs the path to a previously created `.beton` file, \n   a list of operations to perform on each item returned by your dataset, and an ordering option for loading.\n2. 
Create the `FFCVDataModule` object, which is a Lightning DataModule built on the FFCV Loader.\n3. Pass the data module to the Pytorch Lightning trainer, and run!\n\n**Suggestion**: read the FFCV [performance guide](https://docs.ffcv.io/performance_guide.html) to better\n   understand which options fit your needs.\n\nComplete Example from the `main.py` script:\n\n```\nimport pytorch_lightning as pl\nimport torch\nfrom ffcv.fields.basics import IntDecoder\nfrom ffcv.fields.rgb_image import RandomResizedCropRGBImageDecoder, CenterCropRGBImageDecoder\nfrom ffcv.loader import OrderOption\nfrom ffcv.transforms import ToTensor, ToTorchImage\nfrom pytorch_lightning.strategies.ddp import DDPStrategy\n\nfrom torch import nn\nfrom torch.optim import Adam\nfrom torchvision.transforms import RandomHorizontalFlip\n\nfrom ffcv_pl.data_loading import FFCVDataModule\nfrom ffcv_pl.ffcv_utils.augmentations import DivideImage255\n\nfrom ffcv_pl.ffcv_utils.utils import FFCVPipelineManager\n\n\n# define the LightningModule\nclass LitAutoEncoder(pl.LightningModule):\n\n    def __init__(self):\n        super().__init__()\n        self.encoder = nn.Sequential(nn.Linear(32 * 32 * 3, 64), nn.ReLU(), nn.Linear(64, 3))\n        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 32 * 32 * 3))\n\n    def training_step(self, batch, batch_idx):\n\n        x = batch[0]\n\n        b, c, h, w = x.shape\n        x = x.reshape(b, -1)\n        z = self.encoder(x)\n        x_hat = self.decoder(z)\n        loss = nn.functional.mse_loss(x_hat, x)\n\n        # Logging to TensorBoard by default\n        self.log(\"train_loss\", loss)\n        return loss\n\n    def validation_step(self, batch, batch_idx):\n        pass\n\n    def configure_optimizers(self):\n        optimizer = Adam(self.parameters(), lr=1e-3)\n        return optimizer\n\n\ndef main():\n\n    seed = 1234\n\n    pl.seed_everything(seed, workers=True)\n\n    batch_size = 16\n    gpus = 2\n    nodes = 1\n    workers = 8\n\n    # image label 
dataset\n    train_manager = FFCVPipelineManager(\"./data/image_label.beton\",  # previously defined using dataset_creation.py\n                                        pipeline_transforms=[\n\n                                            # image pipeline\n                                            [RandomResizedCropRGBImageDecoder((32, 32)),\n                                             ToTensor(),\n                                             ToTorchImage(),\n                                             DivideImage255(dtype=torch.float32),\n                                             RandomHorizontalFlip(p=0.5)],\n\n                                            # label (int) pipeline\n                                            [IntDecoder(),\n                                             ToTensor()\n                                             ]\n                                        ],\n                                        ordering=OrderOption.RANDOM)  # random ordering for training\n\n    val_manager = FFCVPipelineManager(\"./data/image_label.beton\",\n                                      pipeline_transforms=[\n\n                                          # image pipeline (different from train)\n                                          [CenterCropRGBImageDecoder((32, 32), ratio=1.),\n                                           ToTensor(),\n                                           ToTorchImage(),\n                                           DivideImage255(dtype=torch.float32)],\n\n                                          # label (int) pipeline\n                                          None  # if None, uses default\n                                      ],\n                                      ordering=OrderOption.SEQUENTIAL)  # sequential ordering for validation\n\n    # datamodule creation\n    # ignore test and predict steps, since managers are not defined.\n    data_module = FFCVDataModule(batch_size, workers, train_manager=train_manager, 
val_manager=val_manager,\n                                 is_dist=True, seed=seed)\n\n    # define model\n    model = LitAutoEncoder()\n\n    # trainer\n    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,\n                         accelerator='gpu', devices=gpus, num_nodes=nodes, max_epochs=5, logger=False)\n\n    # start training!\n    trainer.fit(model, data_module)\n\n\nif __name__ == '__main__':\n\n    main()\n\n```\n\n## Code Citations\n\n1. Pytorch-Lightning:\n    ```\n   @software{Falcon_PyTorch_Lightning_2019,\n    author = {Falcon, William and {The PyTorch Lightning team}},\n    doi = {10.5281/zenodo.3828935},\n    license = {Apache-2.0},\n    month = mar,\n    title = {{PyTorch Lightning}},\n    url = {https://github.com/Lightning-AI/lightning},\n    version = {1.4},\n    year = {2019}\n    }\n   ```\n\n2. FFCV: \n    ```\n    @misc{leclerc2022ffcv,\n        author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},\n        title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},\n        year = {2022},\n        howpublished = {\\url{https://github.com/libffcv/ffcv/}},\n        note = {commit 2544abdcc9ce77db12fecfcf9135496c648a7cd5}\n    }\n    ```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fserezd%2Fffcv_pytorch_lightning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fserezd%2Fffcv_pytorch_lightning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fserezd%2Fffcv_pytorch_lightning/lists"}