https://github.com/priorlabs/nanotabpfn

nanoTabPFN: A Playground for Tabular Foundation Models

Host: GitHub
URL: https://github.com/priorlabs/nanotabpfn
Owner: PriorLabs
License: apache-2.0
Created: 2025-04-07T08:02:20.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-23T23:02:29.000Z (11 months ago)
Last Synced: 2025-08-11T20:46:11.319Z (11 months ago)
Language: Python
Homepage:
Size: 10.7 KB
Stars: 18
Watchers: 1
Forks: 3
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# nanoTabPFN

This repository provides a fully open-source playground for tabular foundation models.

It contains a much smaller and simpler implementation of the TabPFNv2 architecture, together with a training loop and code for loading data that was pre-generated by a prior. We plan to rapidly extend the repository with more features (e.g. regression, missing values, categorical features), prior interfaces and architectures.

It is intended as a starting point for students and researchers who want to learn how TabPFN works under the hood.

Clone the repository, then install the dependencies with:

```
pip install -e .
```

We offer the same interface as TabPFN:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from nanotabpfn import NanoTabPFNClassifier

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = NanoTabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))
```

### Our Code

`nanotabpfn/model.py` contains the implementation of the architecture in fewer than 250 lines of code. `nanotabpfn/train.py` implements a simple training loop in under 100 lines, and `nanotabpfn/priors.py` implements a dataloader that loads a dump pre-generated from a prior.

We will release multiple dumps of different scales soon. We also offer an interface where you can provide your own `get_batch` function.
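To make the idea of a custom prior more concrete, here is a minimal sketch of a function that generates a batch of synthetic classification datasets. The actual `get_batch` signature and return format expected by the repository are not shown in this README, so the name `toy_get_batch`, its parameters and the returned dictionary layout are all assumptions made for illustration. It uses plain Python lists rather than tensors.

```python
import random


def toy_get_batch(batch_size, num_datapoints, num_features, seed=None):
    """Generate `batch_size` synthetic binary classification datasets.

    Hypothetical helper: the signature and return layout are illustrative
    and are not taken from the nanoTabPFN code base.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(batch_size):
        # Each dataset gets its own random linear decision boundary.
        weights = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        rows = [
            [rng.gauss(0.0, 1.0) for _ in range(num_features)]
            for _ in range(num_datapoints)
        ]
        # Label a row 1 if it lies on the positive side of the boundary.
        labels = [
            int(sum(w * v for w, v in zip(weights, row)) > 0.0)
            for row in rows
        ]
        xs.append(rows)
        ys.append(labels)
    return {"x": xs, "y": ys}


batch = toy_get_batch(batch_size=4, num_datapoints=50, num_features=3, seed=0)
# Shape of the feature data: datasets, datapoints per dataset, features
print(len(batch["x"]), len(batch["x"][0]), len(batch["x"][0][0]))
```

A real prior would typically produce much more varied datasets; the linear boundary here only keeps the sketch short.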

### Pretrain your own small nanoTabPFN

First, download 100k pre-generated datasets, each with 50 datapoints, 3 features and up to 3 classes, from [here](https://ml.informatik.uni-freiburg.de/research-artifacts/pfefferle/nanoTabPFN/50x3_3_100k_classification.h5).
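If you prefer to fetch the file from Python rather than a browser, the download could be scripted roughly as follows. This is a sketch using only the standard library; the helper names `local_filename` and `download_dump` are hypothetical and not part of nanoTabPFN.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

DUMP_URL = (
    "https://ml.informatik.uni-freiburg.de/research-artifacts/"
    "pfefferle/nanoTabPFN/50x3_3_100k_classification.h5"
)


def local_filename(url):
    """Return the last path component of the URL as the local file name."""
    return Path(urlparse(url).path).name


def download_dump(url=DUMP_URL, dest_dir="."):
    """Download the dump into dest_dir unless the file already exists."""
    target = Path(dest_dir) / local_filename(url)
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# The name the file would get on disk (no download happens here).
print(local_filename(DUMP_URL))
```

Calling `download_dump()` from the directory that contains `pretrain_classification.py` would place the file where the command in this section expects it.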

Then you can run:

```
python pretrain_classification.py -epochs 80 -steps 25 -batchsize 50 -priordump 50x3_3_100k_classification.h5
```

This should take less than 5 minutes on a modern NVIDIA GPU (around 10 minutes on a MacBook M4 Pro GPU and around 40 minutes on an M4 Pro CPU).
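The flag values in the command line up with the size of the dump. The short calculation below assumes that every training step consumes `batchsize` datasets from the dump and that no dataset is reused, which the README does not state explicitly. The variable names are chosen for this example only.

```python
epochs = 80
steps_per_epoch = 25
batch_size = 50
datasets_in_dump = 100_000

# Datasets consumed per epoch and over the whole run,
# assuming one step reads `batch_size` datasets.
datasets_per_epoch = steps_per_epoch * batch_size
datasets_total = epochs * datasets_per_epoch

print(datasets_per_epoch)                 # 1250
print(datasets_total)                     # 100000
print(datasets_total / datasets_in_dump)  # 1.0
```

Under that assumption, the run corresponds to a single pass over the 100k datasets in the dump.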

#### Step by Step Explanation

First, we import the architecture, the prior interface, the training loop and a few helpers:

```python
from nanotabpfn.model import NanoTabPFNModel
from nanotabpfn.priors import PriorDumpDataLoader
from nanotabpfn.train import train
from nanotabpfn.utils import get_default_device
from nanotabpfn.interface import NanoTabPFNClassifier
from torch.nn import CrossEntropyLoss
```

Then we instantiate our model and loss criterion:

```python
model = NanoTabPFNModel(
    num_attention_heads=6,
    embedding_size=192,
    mlp_hidden_size=768,
    num_layers=6,
    num_outputs=10,
)
criterion = CrossEntropyLoss()
```
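As a quick sanity check on these hyperparameters, the snippet below derives the per-head dimension and the MLP expansion factor. It assumes the usual multi-head attention convention of splitting the embedding evenly across heads; whether `NanoTabPFNModel` enforces this is not stated here.

```python
embedding_size = 192
num_attention_heads = 6
mlp_hidden_size = 768

# With the usual convention, the embedding must divide evenly across heads.
assert embedding_size % num_attention_heads == 0

head_dim = embedding_size // num_attention_heads
mlp_expansion = mlp_hidden_size / embedding_size

print(head_dim)       # 32
print(mlp_expansion)  # 4.0
```

If you change `embedding_size` or `num_attention_heads`, keeping the first divisible by the second is a safe default.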

Then we instantiate our prior:

```python
device = get_default_device()
prior = PriorDumpDataLoader(filename='50x3_3_100k_classification.h5', num_steps=25, batch_size=50, device=device)
```

Finally, we train our model:

```python
def epoch_callback(epoch, epoch_time, mean_loss, model):
    classifier = NanoTabPFNClassifier(model, device)
    # you can add your own eval code here that runs after every epoch
    print(f'epoch {epoch:5d} | time {epoch_time:5.2f}s | mean loss {mean_loss:5.2f}', flush=True)

trained_model, loss = train(
    model=model,
    prior=prior,
    criterion=criterion,
    epochs=80,
    device=device,
    epoch_callback=epoch_callback
)
```
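The comment in `epoch_callback` leaves room for your own evaluation code. One possible shape for it is a small helper that fits any object with `fit` and `predict` methods on a training split and reports its accuracy on a held-out split. The helper `holdout_accuracy` and the stand-in `MajorityClassifier` are hypothetical names introduced for this sketch and are not part of nanoTabPFN.

```python
def holdout_accuracy(classifier, X_train, y_train, X_test, y_test):
    """Fit on the training split and return the accuracy on the test split."""
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    correct = sum(int(p == t) for p, t in zip(predictions, y_test))
    return correct / len(y_test)


class MajorityClassifier:
    """Stand-in with fit/predict, used only to make the sketch runnable."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


acc = holdout_accuracy(
    MajorityClassifier(),
    X_train=[[0.0], [1.0], [2.0]], y_train=[1, 1, 0],
    X_test=[[3.0], [4.0]], y_test=[1, 0],
)
print(acc)  # 0.5
```

Inside `epoch_callback`, the `classifier` built from the current model could be passed in place of `MajorityClassifier()`, assuming it follows the same `fit`/`predict` interface as in the usage example at the top of this README.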
