https://github.com/bhimrazy/secure-data-processing-with-litdata

Unlock Secure Data Processing at Scale with LitData's Advanced Encryption Features
https://github.com/bhimrazy/secure-data-processing-with-litdata
data-processing data-security encryption
Last synced: 3 months ago
JSON representation
Unlock Secure Data Processing at Scale with LitData's Advanced Encryption Features
Host: GitHub
URL: https://github.com/bhimrazy/secure-data-processing-with-litdata
Owner: bhimrazy
License: mit
Created: 2024-07-20T09:32:42.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-14T13:29:12.000Z (almost 2 years ago)
Last Synced: 2025-01-17T11:13:33.086Z (over 1 year ago)
Topics: data-processing, data-security, encryption
Language: Python
Homepage:
Size: 47.9 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # Unlock Secure Data Processing at Scale with LitData's Advanced Encryption Features

In today's data-driven world, processing large datasets securely and efficiently is more crucial than ever. Enter [LitData](https://github.com/Lightning-AI/litdata), a powerful Python library that's revolutionizing how we handle and optimize big data. Let's dive into how LitData's advanced encryption features can help you process data at scale while maintaining top-notch security.



    

      

    



    



## 🔒 Why Encryption Matters in Data Processing

Before we jump into the technical details, let's consider why encryption is so important:

1. **Data Privacy**: Protect sensitive information from unauthorized access.

2. **Compliance**: Meet regulatory requirements like GDPR or HIPAA.

3. **Intellectual Property**: Safeguard valuable datasets and algorithms.

4. **Trust**: Build confidence with clients and stakeholders.

## 🚀 Introducing LitData's Advanced Encryption Capabilities

LitData takes data security to the next level by offering powerful and flexible encryption options designed for secure, efficient data processing at scale:

1. **Flexible Encryption Levels ✅**

   - Sample-level: Encrypt each data point individually

   - Chunk-level: Encrypt groups of samples for balanced security and performance

2. **Always-Encrypted Data ✅**

   - Data remains encrypted at rest and in transit

   - On-the-fly decryption only when data is actively used

3. **Key Advantages ✅**

   - Process parts of your dataset without decrypting everything

   - Minimize attack surface with data encrypted most of the time

   - Optimize performance with customizable encryption granularity

   - Seamless integration with existing data pipelines

   - Cloud-ready security for protected data wherever it resides

4. **Efficient Resource Use ✅**

   - Decrypt only what's needed, when it's needed

   - Decrypted data doesn't persist in memory, enhancing security

With these features, LitData empowers you to work with sensitive data at scale, maintaining high security standards without compromising on performance or ease of use.

## 💡 How to Use LitData's Encryption Features

In this guide, we'll walk through a practical example of using LitData's encryption features with the `dermamnist` dataset from the [`medmnist`](https://medmnist.com/) library.

### Prerequisites

First, install the required packages and download the dataset:

```bash

# clone the repo

git clone https://github.com/bhimrazy/secure-data-processing-with-litdata && cd secure-data-processing-with-litdata

pip install -r requirements.txt

python -m medmnist save --flag=dermamnist --folder=data/ --postfix=png --download=True --size=28

```

### Optimizing and Encrypting Data

Create a Python script named `optimize_medmnist.py`:

```python

# optimize_medmnist.py

import logging

from typing import Dict, List

import pandas as pd

from litdata import optimize

from litdata.utilities.encryption import FernetEncryption

from PIL import Image

# Set up logging

logging.basicConfig(

    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"

)

def load_data(file_path: str) -> pd.DataFrame:

    """Load and preprocess the CSV file."""

    logging.info(f"Loading data from {file_path}")

    df = pd.read_csv(file_path, header=None)

    df.columns = ["split", "image", "class"]

    return df

def split_data(df: pd.DataFrame) -> Dict[str, List[Dict]]:

    """Split the data into training, validation, and test sets."""

    logging.info("Splitting data into training, validation, and test sets")

    return {

        "train": df[df["split"] == "TRAIN"].to_dict(orient="records"),

        "validation": df[df["split"] == "VALIDATION"].to_dict(orient="records"),

        "test": df[df["split"] == "TEST"].to_dict(orient="records"),

    }

def read_data(record: Dict) -> Dict:

    """Read and preprocess an image record."""

    try:

        image = Image.open(f"data/dermamnist/{record['image']}")

        return {"image": image, "class": record["class"], "split": record["split"]}

    except Exception as e:

        logging.error(f"Error reading image {record['image']}: {e}")

        return {}

def optimize_data(records: List[Dict], output_dir: str, encryption: FernetEncryption):

    """Optimize data using the provided records and encryption."""

    logging.info(f"Optimizing data and saving to {output_dir}")

    optimize(

        fn=read_data,

        inputs=records,

        output_dir=output_dir,

        chunk_bytes="60MB",

        num_workers=2,

        encryption=encryption,

    )

def prepare():

    """Main preparation function."""

    # Load the data

    df = load_data("data/dermamnist.csv")

    data_splits = split_data(df)

    # Set up encryption

    fernet = FernetEncryption(password="your_super_secret_password", level="chunk")

    # Optimize data

    optimize_data(data_splits["train"], "data/dermamnist_optimized/train", fernet)

    optimize_data(

        data_splits["validation"], "data/dermamnist_optimized/validation", fernet

    )

    optimize_data(data_splits["test"], "data/dermamnist_optimized/test", fernet)

    # Save the encryption key

    logging.info("Saving encryption key to fernet.pem")

    fernet.save("fernet.pem")

if __name__ == "__main__":

    prepare()

```

Run the script:

```bash

python optimize_medmnist.py

```

## 🔍 Decrypting and Using the Data

When it's time to use your encrypted data, LitData makes it just as easy:

To test it, create a script named `test_dataset.py`:

```python

# test_dataset.py

import torch

import numpy as np

from litdata import StreamingDataset, StreamingDataLoader

from litdata.utilities.encryption import FernetEncryption

fernet = FernetEncryption.load("fernet.pem", password="your_super_secret_password")

def collate_fn(batch):

    """Collate function for the streaming data loader."""

    images = np.array([np.array(item["image"]) for item in batch])

    classes = [item["class"] for item in batch]

    images_tensor = torch.tensor(images, dtype=torch.float32)

    classes_tensor = torch.tensor(classes, dtype=torch.long)

    return {"image": images_tensor, "class": classes_tensor}

def get_dataloader(

    dataset: StreamingDataset, batch_size: int, shuffle: bool

) -> StreamingDataLoader:

    """Helper function to create a dataloader."""

    return StreamingDataLoader(

        dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn

    )

def main():

    # Create streaming datasets

    train_dataset = StreamingDataset(

        input_dir="data/dermamnist_optimized/train", encryption=fernet

    )

    valid_dataset = StreamingDataset(

        input_dir="data/dermamnist_optimized/validation", encryption=fernet

    )

    test_dataset = StreamingDataset(

        input_dir="data/dermamnist_optimized/test", encryption=fernet

    )

    # Create dataloaders

    train_loader = get_dataloader(train_dataset, batch_size=32, shuffle=True)

    valid_loader = get_dataloader(valid_dataset, batch_size=32, shuffle=False)

    test_loader = get_dataloader(test_dataset, batch_size=32, shuffle=False)

    # Log dataset sizes

    print("Number of training images:", len(train_dataset))

    print("Number of validation images:", len(valid_dataset))

    print("Number of test images:", len(test_dataset))

    # Log example batches

    train_batch = next(iter(train_loader))

    print("Train Batch classes:", train_batch["class"])

    valid_batch = next(iter(valid_loader))

    print("Validation Batch classes:", valid_batch["class"])

    test_batch = next(iter(test_loader))

    print("Test Batch classes:", test_batch["class"])

if __name__ == "__main__":

    main()

```

Run the script:

```bash

python test_dataset.py

# Number of training images: 7007

# Number of validation images: 1003

# Number of test images: 2005

# ...

```

And that's it! You've successfully encrypted, optimized, and used your data with LitData's advanced encryption features.

You can further test by training a model using the PyTorch Lightning library:

```bash

python train.py

```

## 🌟 Benefits of LitData's Approach

1. **Scalability**: Process enormous datasets without compromising on security.

2. **Cloud-Ready**: Works seamlessly with cloud storage solutions like AWS S3.

3. **Customizable**: Implement your own encryption methods by subclassing `Encryption`.

4. **Efficient**: Optimized for fast data loading and processing.

## 🚧 Best Practices for Secure Data Processing

1. **Key Management**: Store encryption keys securely, separate from the data.

2. **Regular Rotation**: Change encryption keys periodically for enhanced security.

3. **Access Control**: Limit who can decrypt and access the sensitive data.

4. **Audit Trails**: Keep logs of data access and processing for compliance.

## 🔮 The Future of Secure Data Processing

As datasets grow larger and privacy concerns intensify, tools like LitData will become increasingly crucial. By combining efficient data processing with robust security features, LitData is at the forefront of this evolution.

## 🎓 Conclusion

LitData's advanced encryption features offer a powerful solution for organizations looking to process large datasets securely and efficiently. By leveraging sample-level encryption and cloud-ready streaming, you can unlock new possibilities in data science and machine learning while maintaining the highest standards of data protection.

Ready to supercharge your data processing with top-tier security? Give LitData a try today!

---

Remember, security is an ongoing process. Stay updated with the latest best practices and keep your data safe!

For more information on LitData and its features, check out the official documentation and [GitHub](https://github.com/Lightning-AI/litdata) repository.

## Author

- [Bhimraj Yadav](https://github.com/bhimrazy)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bhimrazy/secure-data-processing-with-litdata

Awesome Lists containing this project

README