Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mithril-security/bastionlab
A simple framework for privacy-friendly data science collaboration
deep-learning eda multi-party privacy pytorch secure-enclave
- Host: GitHub
- URL: https://github.com/mithril-security/bastionlab
- Owner: mithril-security
- License: apache-2.0
- Created: 2022-07-11T09:31:54.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-09-29T14:08:07.000Z (over 1 year ago)
- Last Synced: 2025-01-08T07:08:22.043Z (13 days ago)
- Topics: deep-learning, eda, multi-party, privacy, pytorch, secure-enclave
- Language: Rust
- Homepage:
- Size: 18.3 MB
- Stars: 170
- Watchers: 5
- Forks: 12
- Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-SGX-Open-Source - https://github.com/mithril-security/bastionlab
- awesome-production-machine-learning - BastionLab - BastionLab is a framework for confidential data science collaboration. It uses Confidential Computing, Access control data science, and Differential Privacy to enable data scientists to remotely perform data exploration, statistics, and training on confidential data while ensuring maximal privacy for data owners. (Privacy and Security)
README
Mithril Security - BastionLab
Website | Documentation | Discord | Blog | LinkedIn
# Welcome to BastionLab!
Where data owners and data scientists can securely collaborate without exposing data - opening the way to projects that were too risky to consider.
## What is BastionLab?
**BastionLab is a simple privacy framework for data science collaboration, covering data exploration and AI training.**
It acts like an **access control solution**, for data owners to protect the privacy of their datasets, and **stands as a guard**, to enforce that only privacy-friendly operations are allowed on the data and anonymized outputs are shown to the data scientist.
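
To make the "guard" idea concrete, here is a minimal sketch in plain Python of a minimum-aggregation-size check. It is a conceptual illustration only, not BastionLab's implementation: the function name `guarded_group_mean`, the `MIN_AGG_SIZE` constant, and the toy rows are all hypothetical, and returning `None` stands in for whatever a real server would do with an unsafe request (for example, asking the data owner to review it).

```python
from collections import defaultdict
from statistics import mean

MIN_AGG_SIZE = 10  # hypothetical threshold chosen by the data owner


def guarded_group_mean(rows, key, value, min_agg_size=MIN_AGG_SIZE):
    """Return per-group means, or None when any group is too small to release."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    # Refuse the whole query if any group aggregates fewer rows than allowed.
    if any(len(values) < min_agg_size for values in groups.values()):
        return None  # a real system could route this to the data owner for review

    return {group: mean(values) for group, values in groups.items()}


# Toy data: 12 rows in class 1, only 3 rows in class 3.
rows = [{"Pclass": 1, "Survived": i % 2} for i in range(12)]
rows += [{"Pclass": 3, "Survived": 0} for _ in range(3)]

print(guarded_group_mean(rows, "Pclass", "Survived"))       # None: class 3 has only 3 rows
print(guarded_group_mean(rows[:12], "Pclass", "Survived"))  # {1: 0.5}
```

The point of the sketch is that the caller only ever receives aggregated values, and only when every group is large enough; individual rows never leave the function.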
- Data owners can let external or internal data scientists **explore and extract values from their datasets, according to a strict privacy policy** they'll define in BastionLab.
- Data scientists can **remotely run queries on data frames and train their models without seeing the original data or intermediary results**.

**BastionLab is an open-source project.**
Our solution is coded in Rust, uses Polars, a pandas-like library for data exploration, and Torch, a popular library for AI training.
We also have an option to set up confidential computing, a hardware-based technology that ensures no one but the processor of the machine can see the data or the model.

## Quick tour
You can go try out our [Quick tour](https://bastionlab.readthedocs.io/en/latest/docs/quick-tour/quick-tour/) in the documentation to discover BastionLab with a hands-on example using the famous Titanic dataset.
But here's a taste of what using BastionLab could look like:
### Data exploration
#### Data owner's side
```py
# Load your dataset using polars.
>>> import polars as pl
>>> df = pl.read_csv("titanic.csv")

# Define a custom policy for your data.
# In this example, requests that aggregate at least 10 rows are safe.
# Other requests will be reviewed by the data owner.
>>> from bastionlab.polars.policy import Policy, Aggregation, Review
>>> policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Review())

# Upload your dataset to the server.
# Optionally anonymize sensitive columns.
# The server returns a remote object that can be used to query the dataset.
>>> from bastionlab import Connection
>>> with Connection("bastionlab.example.com") as client:
...     rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
...     rdf
...
FetchableLazyFrame(identifier=3a2d15c5-9f9d-4ced-9234-d9465050edb1)
```

#### Data scientist's side
```py
# List the datasets made available by the data owner, select one and get a remote object.
>>> from bastionlab import Connection
>>> connection = Connection("localhost")
>>> all_remote_dfs = connection.client.polars.list_dfs()
>>> remote_df = all_remote_dfs[0]

# Run unsafe queries such as displaying the first five rows.
# According to the policy, unsafe queries require the data owner's approval.
>>> remote_df.head(5).collect().fetch()
Warning: non privacy-preserving queries necessitate data owner's approval.
Reason: Only 1 subrules matched but at least 2 are required.
Failed sub rules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.
shape: (5, 12)
┌─────────────┬──────────┬────────┬──────┬─────┬──────────────────┬─────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name ┆ ... ┆ Ticket           ┆ Fare    ┆ Cabin ┆ Embarked │
│ ---         ┆ ---      ┆ ---    ┆ ---  ┆     ┆ ---              ┆ ---     ┆ ---   ┆ ---      │
│ i64         ┆ i64      ┆ i64    ┆ str  ┆     ┆ str              ┆ f64     ┆ str   ┆ str      │
╞═════════════╪══════════╪════════╪══════╪═════╪══════════════════╪═════════╪═══════╪══════════╡
│ 1           ┆ 0        ┆ 3      ┆ null ┆ ... ┆ A/5 21171        ┆ 7.25    ┆ null  ┆ S        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2           ┆ 1        ┆ 1      ┆ null ┆ ... ┆ PC 17599         ┆ 71.2833 ┆ C85   ┆ C        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 1        ┆ 3      ┆ null ┆ ... ┆ STON/O2. 3101282 ┆ 7.925   ┆ null  ┆ S        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4           ┆ 1        ┆ 1      ┆ null ┆ ... ┆ 113803           ┆ 53.1    ┆ C123  ┆ S        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5           ┆ 0        ┆ 3      ┆ null ┆ ... ┆ 373450           ┆ 8.05    ┆ null  ┆ S        │
└─────────────┴──────────┴────────┴──────┴─────┴──────────────────┴─────────┴───────┴──────────┘

# Run safe queries and get the result right away.
>>> (
...     remote_df
...     .select([pl.col("Pclass"), pl.col("Survived")])
...     .groupby(pl.col("Pclass"))
...     .agg(pl.col("Survived").mean())
...     .sort("Survived", reverse=True)
...     .collect()
...     .fetch()
... )
shape: (3, 2)
┌────────┬──────────┐
│ Pclass ┆ Survived │
│ ---    ┆ ---      │
│ i64    ┆ f64      │
╞════════╪══════════╡
│ 1      ┆ 0.62963  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.472826 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.242363 │
└────────┴──────────┘
```

### AI training
#### Data owner's side
```py
>>> from torchvision.datasets import CIFAR100
>>> from torchvision.transforms import ToTensor, Normalize, Compose
>>> from bastionlab.client import Connection

# Define a transformation pipeline for the CIFAR dataset.
# The last step is there for shape compatibility reasons.
>>> transform = Compose([
...     ToTensor(),
...     Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
...     lambda x: [x.squeeze(0)],
... ])

# Define train and test datasets
>>> train_dataset = CIFAR100("data", train=True, transform=transform, download=True)
Files already downloaded and verified
>>> test_dataset = CIFAR100("data", train=False, transform=transform, download=True)
Files already downloaded and verified

# Send them to the server by instantiating a RemoteDataset.
>>> with Connection("localhost") as client:
...     client.torch.RemoteDataset(train_dataset, test_dataset, name="CIFAR100")
...
Sending CIFAR100: 100%|████████████████████| 615M/615M [00:04<00:00, 150MB/s]
Sending CIFAR100 (test): 100%|████████████████████| 123M/123M [00:00<00:00, 150MB/s]
```
#### Data scientist's side
```py
>>> from torchvision.models import efficientnet_b0
>>> from bastionlab.client import Connection

# Define the model
>>> model = efficientnet_b0()

# List the datasets made available by the data owner, select one and get a remote object.
>>> connection = Connection("localhost")
>>> remote_datasets = connection.client.torch.list_remote_datasets()
>>> remote_dataset = remote_datasets[0]

# Send the model to the server by instantiating a RemoteLearner.
# The RemoteLearner object references the RemoteDataset.
>>> remote_learner = connection.client.torch.RemoteLearner(
...     model,
...     remote_dataset,
...     max_batch_size=64,
...     loss="cross_entropy",
...     model_name="EfficientNet-B0",
...     device="cpu",
... )
Sending EfficientNet-B0: 100%|████████████████████| 21.7M/21.7M [00:00<00:00, 531MB/s]

# Train the remote model for a given number of epochs
>>> remote_learner.fit(nb_epochs=1)
Epoch 1/1 - train: 100%|████████████████████| 781/781 [04:06<00:00, 3.17batch/s, cross_entropy=4.1798 (+/- 0.0000)]

# Test the remote model
>>> remote_learner.test(metric="accuracy")
Epoch 1/1 - test: 100%|████████████████████| 156/156 [00:14<00:00, 10.62batch/s, accuracy=0.1123 (+/- 0.0000)]
```

## Key features
- **Access control**: data owners can define an interactive privacy policy that filters the data scientist's queries. They no longer have to open unrestricted access to their datasets.
- **Limited expressivity**: BastionLab limits the type of operations that can be executed by the data scientists to avoid arbitrary code execution.
- **Transparent remote access**: the data scientists never access the dataset directly. They only manipulate a local object that contains metadata to interact with a remotely hosted dataset. Calls can always be seen by data owners.

## Getting help
- Go to our [Discord](https://discord.com/invite/TxEHagpWd4) #support channel
- Report bugs by [opening an issue on our BastionLab GitHub](https://github.com/mithril-security/bastionlab/issues)
- [Book a meeting](https://calendly.com/contact-mithril-security/15mins?month=2022-11) with us

## Disclaimer
BastionLab is still in development. **Do not use it yet in a production workload.** We will audit our solution in the future to attest that it enforces the security standards of the market.
## License
BastionLab is licensed under the Apache License, Version 2.0.
*Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.*
*[See the License](http://www.apache.org/licenses/LICENSE-2.0) for the specific language governing permissions and limitations under the License.*