{"id":20655402,"url":"https://github.com/bhimrazy/secure-data-processing-with-litdata","last_synced_at":"2025-03-09T22:12:06.740Z","repository":{"id":249719889,"uuid":"831345994","full_name":"bhimrazy/secure-data-processing-with-litdata","owner":"bhimrazy","description":"Unlock Secure Data Processing at Scale with LitData's Advanced Encryption Features","archived":false,"fork":false,"pushed_at":"2024-09-14T13:29:12.000Z","size":49,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-17T11:13:33.086Z","etag":null,"topics":["data-processing","data-security","encryption"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bhimrazy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-20T09:32:42.000Z","updated_at":"2024-08-09T11:01:31.000Z","dependencies_parsed_at":"2024-11-16T18:11:01.167Z","dependency_job_id":"74ef905f-a6e3-4906-8d34-cd0d7ae435ba","html_url":"https://github.com/bhimrazy/secure-data-processing-with-litdata","commit_stats":null,"previous_names":["bhimrazy/secure-data-processing-with-litdata"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhimrazy%2Fsecure-data-processing-with-litdata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhimrazy%2Fsecure-data-processing-with-litdata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhimrazy%2Fsecure
-data-processing-with-litdata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bhimrazy%2Fsecure-data-processing-with-litdata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bhimrazy","download_url":"https://codeload.github.com/bhimrazy/secure-data-processing-with-litdata/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242756870,"owners_count":20180206,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-processing","data-security","encryption"],"created_at":"2024-11-16T18:10:51.369Z","updated_at":"2025-03-09T22:12:06.704Z","avatar_url":"https://github.com/bhimrazy.png","language":"Python","readme":"# Unlock Secure Data Processing at Scale with LitData's Advanced Encryption Features\n\nIn today's data-driven world, processing large datasets securely and efficiently is more crucial than ever. Enter [LitData](https://github.com/Lightning-AI/litdata), a powerful Python library that's revolutionizing how we handle and optimize big data. 
Let's dive into how LitData's advanced encryption features can help you process data at scale while maintaining top-notch security.\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca target=\"_blank\" href=\"https://lightning.ai/bhimrajyadav/studios/unlock-secure-data-processing-at-scale-with-litdata-s-advanced-encryption-features\"\u003e\n      \u003cimg src=\"https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/app-2/studio-badge.svg\" alt=\"Open In Studio\"/\u003e\n    \u003c/a\u003e\u003cbr/\u003e\u003cbr/\u003e\n    \u003cimg src=\"https://github.com/user-attachments/assets/6bb2474d-2c0c-42af-8c6e-43c495034341\" alt=\"Cover Image\" width=\"640\" height=\"360\" style=\"object-fit: cover;\"\u003e\n\u003c/div\u003e\n\n## 🔒 Why Encryption Matters in Data Processing\n\nBefore we jump into the technical details, let's consider why encryption is so important:\n\n1. **Data Privacy**: Protect sensitive information from unauthorized access.\n2. **Compliance**: Meet regulatory requirements like GDPR or HIPAA.\n3. **Intellectual Property**: Safeguard valuable datasets and algorithms.\n4. **Trust**: Build confidence with clients and stakeholders.\n\n## 🚀 Introducing LitData's Advanced Encryption Capabilities\n\nLitData takes data security to the next level by offering powerful and flexible encryption options designed for secure, efficient data processing at scale:\n\n1. **Flexible Encryption Levels ✅**\n\n   - Sample-level: Encrypt each data point individually\n   - Chunk-level: Encrypt groups of samples for balanced security and performance\n\n2. **Always-Encrypted Data ✅**\n\n   - Data remains encrypted at rest and in transit\n   - On-the-fly decryption only when data is actively used\n\n3. 
**Key Advantages ✅**\n\n   - Process parts of your dataset without decrypting everything\n   - Minimize attack surface with data encrypted most of the time\n   - Optimize performance with customizable encryption granularity\n   - Seamless integration with existing data pipelines\n   - Cloud-ready security for protected data wherever it resides\n\n4. **Efficient Resource Use ✅**\n   - Decrypt only what's needed, when it's needed\n   - Decrypted data doesn't persist in memory, enhancing security\n\nWith these features, LitData empowers you to work with sensitive data at scale, maintaining high security standards without compromising on performance or ease of use.\n\n## 💡 How to Use LitData's Encryption Features\n\nIn this guide, we'll walk through a practical example of using LitData's encryption features with the `dermamnist` dataset from the [`medmnist`](https://medmnist.com/) library.\n\n### Prerequisites\n\nFirst, install the required packages and download the dataset:\n\n```bash\n# clone the repo\ngit clone https://github.com/bhimrazy/secure-data-processing-with-litdata \u0026\u0026 cd secure-data-processing-with-litdata\n\npip install -r requirements.txt\npython -m medmnist save --flag=dermamnist --folder=data/ --postfix=png --download=True --size=28\n```\n\n### Optimizing and Encrypting Data\n\nCreate a Python script named `optimize_medmnist.py`:\n\n```python\n# optimize_medmnist.py\nimport logging\nfrom typing import Dict, List\n\nimport pandas as pd\nfrom litdata import optimize\nfrom litdata.utilities.encryption import FernetEncryption\nfrom PIL import Image\n\n# Set up logging\nlogging.basicConfig(\n    level=logging.INFO, format=\"%(asctime)s - %(levelname)s - %(message)s\"\n)\n\n\ndef load_data(file_path: str) -\u003e pd.DataFrame:\n    \"\"\"Load and preprocess the CSV file.\"\"\"\n    logging.info(f\"Loading data from {file_path}\")\n    df = pd.read_csv(file_path, header=None)\n    df.columns = [\"split\", \"image\", \"class\"]\n    return df\n\n\ndef 
split_data(df: pd.DataFrame) -\u003e Dict[str, List[Dict]]:\n    \"\"\"Split the data into training, validation, and test sets.\"\"\"\n    logging.info(\"Splitting data into training, validation, and test sets\")\n    return {\n        \"train\": df[df[\"split\"] == \"TRAIN\"].to_dict(orient=\"records\"),\n        \"validation\": df[df[\"split\"] == \"VALIDATION\"].to_dict(orient=\"records\"),\n        \"test\": df[df[\"split\"] == \"TEST\"].to_dict(orient=\"records\"),\n    }\n\n\ndef read_data(record: Dict) -\u003e Dict:\n    \"\"\"Read and preprocess an image record.\"\"\"\n    try:\n        image = Image.open(f\"data/dermamnist/{record['image']}\")\n        return {\"image\": image, \"class\": record[\"class\"], \"split\": record[\"split\"]}\n    except Exception as e:\n        logging.error(f\"Error reading image {record['image']}: {e}\")\n        return {}\n\n\ndef optimize_data(records: List[Dict], output_dir: str, encryption: FernetEncryption):\n    \"\"\"Optimize data using the provided records and encryption.\"\"\"\n    logging.info(f\"Optimizing data and saving to {output_dir}\")\n    optimize(\n        fn=read_data,\n        inputs=records,\n        output_dir=output_dir,\n        chunk_bytes=\"60MB\",\n        num_workers=2,\n        encryption=encryption,\n    )\n\n\ndef prepare():\n    \"\"\"Main preparation function.\"\"\"\n    # Load the data\n    df = load_data(\"data/dermamnist.csv\")\n    data_splits = split_data(df)\n\n    # Set up encryption\n    fernet = FernetEncryption(password=\"your_super_secret_password\", level=\"chunk\")\n\n    # Optimize data\n    optimize_data(data_splits[\"train\"], \"data/dermamnist_optimized/train\", fernet)\n    optimize_data(\n        data_splits[\"validation\"], \"data/dermamnist_optimized/validation\", fernet\n    )\n    optimize_data(data_splits[\"test\"], \"data/dermamnist_optimized/test\", fernet)\n\n    # Save the encryption key\n    logging.info(\"Saving encryption key to fernet.pem\")\n    
fernet.save(\"fernet.pem\")\n\n\nif __name__ == \"__main__\":\n    prepare()\n```\n\nRun the script:\n\n```bash\npython optimize_medmnist.py\n```\n\n## 🔍 Decrypting and Using the Data\n\nWhen it's time to use your encrypted data, LitData makes it just as easy. To test it, create a script named `test_dataset.py`:\n\n```python\n# test_dataset.py\nimport torch\nimport numpy as np\nfrom litdata import StreamingDataset, StreamingDataLoader\nfrom litdata.utilities.encryption import FernetEncryption\n\nfernet = FernetEncryption.load(\"fernet.pem\", password=\"your_super_secret_password\")\n\n\ndef collate_fn(batch):\n    \"\"\"Collate function for the streaming data loader.\"\"\"\n    images = np.array([np.array(item[\"image\"]) for item in batch])\n    classes = [item[\"class\"] for item in batch]\n\n    images_tensor = torch.tensor(images, dtype=torch.float32)\n    classes_tensor = torch.tensor(classes, dtype=torch.long)\n\n    return {\"image\": images_tensor, \"class\": classes_tensor}\n\n\ndef get_dataloader(\n    dataset: StreamingDataset, batch_size: int, shuffle: bool\n) -\u003e StreamingDataLoader:\n    \"\"\"Helper function to create a dataloader.\"\"\"\n    return StreamingDataLoader(\n        dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn\n    )\n\n\ndef main():\n    # Create streaming datasets\n    train_dataset = StreamingDataset(\n        input_dir=\"data/dermamnist_optimized/train\", encryption=fernet\n    )\n    valid_dataset = StreamingDataset(\n        input_dir=\"data/dermamnist_optimized/validation\", encryption=fernet\n    )\n    test_dataset = StreamingDataset(\n        input_dir=\"data/dermamnist_optimized/test\", encryption=fernet\n    )\n\n    # Create dataloaders\n    train_loader = get_dataloader(train_dataset, batch_size=32, shuffle=True)\n    valid_loader = get_dataloader(valid_dataset, batch_size=32, shuffle=False)\n    test_loader = get_dataloader(test_dataset, batch_size=32, shuffle=False)\n\n    # Log dataset 
sizes\n    print(\"Number of training images:\", len(train_dataset))\n    print(\"Number of validation images:\", len(valid_dataset))\n    print(\"Number of test images:\", len(test_dataset))\n\n    # Log example batches\n    train_batch = next(iter(train_loader))\n    print(\"Train Batch classes:\", train_batch[\"class\"])\n\n    valid_batch = next(iter(valid_loader))\n    print(\"Validation Batch classes:\", valid_batch[\"class\"])\n\n    test_batch = next(iter(test_loader))\n    print(\"Test Batch classes:\", test_batch[\"class\"])\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\nRun the script:\n\n```bash\npython test_dataset.py\n\n# Number of training images: 7007\n# Number of validation images: 1003\n# Number of test images: 2005\n# ...\n```\n\nAnd that's it! You've successfully encrypted, optimized, and used your data with LitData's advanced encryption features.\n\nYou can further test by training a model using the PyTorch Lightning library:\n\n```bash\npython train.py\n```\n\n## 🌟 Benefits of LitData's Approach\n\n1. **Scalability**: Process enormous datasets without compromising on security.\n2. **Cloud-Ready**: Works seamlessly with cloud storage solutions like AWS S3.\n3. **Customizable**: Implement your own encryption methods by subclassing `Encryption`.\n4. **Efficient**: Optimized for fast data loading and processing.\n\n## 🚧 Best Practices for Secure Data Processing\n\n1. **Key Management**: Store encryption keys securely, separate from the data.\n2. **Regular Rotation**: Change encryption keys periodically for enhanced security.\n3. **Access Control**: Limit who can decrypt and access the sensitive data.\n4. **Audit Trails**: Keep logs of data access and processing for compliance.\n\n## 🔮 The Future of Secure Data Processing\n\nAs datasets grow larger and privacy concerns intensify, tools like LitData will become increasingly crucial. 
By combining efficient data processing with robust security features, LitData is at the forefront of this evolution.\n\n## 🎓 Conclusion\n\nLitData's advanced encryption features offer a powerful solution for organizations looking to process large datasets securely and efficiently. By leveraging sample- or chunk-level encryption and cloud-ready streaming, you can unlock new possibilities in data science and machine learning while maintaining the highest standards of data protection.\n\nReady to supercharge your data processing with top-tier security? Give LitData a try today!\n\n---\n\nRemember, security is an ongoing process. Stay updated with the latest best practices and keep your data safe!\n\nFor more information on LitData and its features, check out the official documentation and the [GitHub](https://github.com/Lightning-AI/litdata) repository.\n\n## Author\n\n- [Bhimraj Yadav](https://github.com/bhimrazy)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhimrazy%2Fsecure-data-processing-with-litdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbhimrazy%2Fsecure-data-processing-with-litdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbhimrazy%2Fsecure-data-processing-with-litdata/lists"}