Official implementation of the paper "Most Influential Subset Selection: Challenges, Promises, and Beyond" (NeurIPS 2024)
- Host: GitHub
- URL: https://github.com/sleepymalc/miss
- Owner: sleepymalc
- Created: 2023-09-07T22:30:48.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-17T03:34:55.000Z (28 days ago)
- Last Synced: 2024-10-19T05:56:41.314Z (26 days ago)
- Topics: data-attribution, machine-unlearning, subset-selection
- Language: Jupyter Notebook
- Homepage:
- Size: 78.8 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# MISS
This is the official implementation of [Most Influential Subset Selection: Challenges, Promises, and Beyond](https://arxiv.org/abs/2409.18153).
## Setup Guide
To use this framework, you need a working installation of Python 3.8 or newer. The only uncommon package we use is [pyDVL](https://pydvl.org/devel/); please follow the official installation guide on their website.
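A minimal environment setup might look like the following sketch (the virtual environment and the extra packages listed here, such as Jupyter and PyTorch, are assumptions; the official pyDVL guide is the authoritative reference):

```bash
# Sketch of an environment setup; consult the official pyDVL guide for the
# authoritative installation steps and any extras you may need.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install pyDVL                        # see https://pydvl.org for details
pip install jupyter torch torchvision    # assumed dependencies for the experiments
```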
## Quick Start
Make sure you have followed the [Setup Guide](#setup-guide) before running the code.
### Linear Regression
The [linear_regression](linear_regression) directory contains the key MISS algorithm (`LAGS.py`) and the Python notebooks for the real-world and synthetic-data experiments. To reproduce the results, simply run the notebooks.
### Logistic Regression
The [logistic_regression](logistic_regression) directory contains the key MISS algorithm (`IF.py`) and the Python notebooks for the real-world and synthetic-data experiments. To reproduce the results, simply run the notebooks; one way to do so is sketched below.
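For either directory, the notebooks can be run interactively with Jupyter or executed headlessly, for example as follows (the notebook filename below is a placeholder; substitute an actual notebook from the directory):

```bash
# Interactive: open the directory in Jupyter and run the notebooks there.
jupyter notebook linear_regression/

# Headless: execute a notebook in place (placeholder filename).
jupyter nbconvert --to notebook --execute --inplace linear_regression/<notebook>.ipynb
```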
### Multi-Layer Perceptron
The [MLP](MLP) directory mainly contains the key MISS algorithm (`IF.py`), a wrapper around the entire experiment (`MISS.py`) that produces the results, and a Python notebook for the evaluation (`evaluation_MNIST.ipynb`). Since this experiment is somewhat time-consuming, we divide the workflow into several steps, detailed below.
>Before running the script, you will need to manually create the following directories: `./MLP/checkpoint`, `./MLP/checkpoint/adaptive_tmp`, `./MLP/results/Eval`, and `./MLP/results/IF`.
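These directories can be created in one command, mirroring the list above:

```bash
# Create the directories expected by the MLP scripts (-p also creates ./MLP/checkpoint).
mkdir -p ./MLP/checkpoint/adaptive_tmp ./MLP/results/Eval ./MLP/results/IF
```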
1. Train a number of models specified by `--ensemble`, and save them to `./MLP/checkpoint`.
```bash
python model_train.py --seed 0 --train_size 5000 --test_size 500 --ensemble 5
```

Note that the training set and the test set are constructed deterministically: in the above example, it'll take the first 5000 training samples and 500 test samples.
>The test dataset here is only used to show the accuracy of the model; we do not use it for selecting the model (e.g., cross-validation). In other words, it won't affect the next step in any way.
2. Solve the MISS and save the result to `./MLP/results/IF`. For the naive greedy:
```bash
python MISS.py --seed 0 --train_size 5000 --test_range 0:49 --test_start_idx 0 --ensemble 5 --k 50
```

For the (stepped) adaptive greedy:
```bash
python MISS.py --seed 0 --train_size 5000 --test_range 0:49 --test_start_idx 0 --ensemble 5 --k 50 --adaptive --warm_start --step 5
```

Several notes on the flags:
- `seed`: The seed used for the previous (step 1) experiment.
>Note that this step is deterministic (the training involved in this step is always controlled by fixed seeds to avoid confusion).
- `adaptive`: If specified, then the adaptive greedy will be used.
- `warm_start` and `step`: These two flags only take effect when `adaptive` is specified.
- `test_range`: Construct the test dataset from the indices in the specified range, given in the format `start:end` (inclusive).
>This allows batched processing when memory is insufficient: initialization alone takes around 40 GB of CUDA memory, and the memory allocation grows by a non-negligible amount after each processed test point, which can eventually cause a CUDA out-of-memory error.
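For example, the adaptive-greedy run above can be split into two batches of test points; this is purely a convenience sketch and mirrors the two explicit adaptive-greedy commands in the Examples section below:

```bash
# Convenience sketch: split the 50 test points into two inclusive ranges so
# each run stays within GPU memory.
for range in 0:24 25:49; do
    python MISS.py --seed 0 --train_size 5000 --test_range "$range" \
        --ensemble 5 --k 50 --adaptive --warm_start --step 5
done
```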
3. Run `evaluation_MNIST.ipynb` to evaluate the performance and generate plots. The evaluation result will be saved to `./MLP/results/Eval` if `load_eval` is set to `False` (which you will need to do the first time).
>The evaluation script will aggregate all batches in the second step together.

#### Examples
A sample script for the first two steps:
```bash
# Step 1
python3 model_train.py --seed 0 --train_size 5000 --test_size 500 --ensemble 5

# Step 2
## Greedy
python3 MISS.py --seed 0 --train_size 5000 --test_range 0:49 --ensemble 5 --k 50

## Adaptive Greedy
python3 MISS.py --seed 0 --train_size 5000 --test_range 0:24 --ensemble 5 --k 50 --adaptive --warm_start --step 5
python3 MISS.py --seed 0 --train_size 5000 --test_range 25:49 --ensemble 5 --k 50 --adaptive --warm_start --step 5
```

## Citation
If you find this repository valuable, please give it a star! Got any questions or feedback? Feel free to open an issue. Using this in your work? Please reference us using the provided citation:
```bibtex
@inproceedings{hu2024most,
  author    = {Yuzheng Hu and Pingbang Hu and Han Zhao and Jiaqi W. Ma},
  title     = {Most Influential Subset Selection: Challenges, Promises, and Beyond},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  year      = {2024}
}
```