
# Comparison of Bandit Algorithms

## Overview
This repository includes implementations and performance reports of several bandit algorithms. The study covers:

[1. Explore-then-Commit](#1-explore-then-commit)

[2. Upper Confidence Bound (UCB)](#2-upper-confidence-bound-ucb)

[3. Thompson Sampling](#3-thompson-sampling)

[4. Linear UCB (LinUCB)](#4-linear-ucb-linucb)

[5. Linear Thompson Sampling (LinTS)](#5-linear-thompson-sampling-lints)

[6. Generalized Linear Model Bandit (GLM)](#6-generalized-linear-model-glm-bandit-non-linear-bandit)

## Implementation
To simulate a specific algorithm, edit the `Simulation.py` script by uncommenting the desired algorithm and commenting out the others.

For example, to run the UCB algorithm with $\alpha = 0.5$, update the code as follows:
```python
## Initiate Bandit Algorithms ##
algorithms = {}

#algorithms['EpsilonGreedyLinearBandit'] = EpsilonGreedyLinearBandit(dimension=context_dimension, lambda_=0.1, epsilon=None)
#algorithms['EpsilonGreedyMultiArmedBandit'] = EpsilonGreedyMultiArmedBandit(num_arm=n_articles, epsilon=0.1)
#algorithms['ExplorethenCommit'] = ExplorethenCommit(num_arm=n_articles, m=30)
algorithms['UCBBandit'] = UCBBandit(num_arm=n_articles, alpha=0.5)
#algorithms['ThompsonSamplingGaussianMAB'] = ThompsonSamplingGaussianMAB(num_arm=n_articles)
#algorithms['LinearUCBBandit'] = LinearUCBBandit(dimension=context_dimension, lambda_=0.1, alpha=0.5) #delta=0.05, alpha=2.358
#algorithms['LinearThompsonSamplingMAB'] = LinearThompsonSamplingMAB(dimension=context_dimension, lambda_=0.1)
```

After selecting your algorithm, run the `Simulation.py` script.

## 1. Explore-then-Commit
### Result



| Hyperparameter (m) | Cumulative Regret |
|:------------------:|:-----------------:|
| 10 | 1001.40 |
| 20 | 214.90 |
| 30 | 334.02 |



**Figure 1**: Explore-then-Commit accumulated regret for (a) m = 10, (b) m = 20, (c) m = 30
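
For reference, the following is a minimal sketch of the explore-then-commit loop on a simulated Gaussian bandit. It is not the repository's `ExplorethenCommit` class; the arm means, noise model, and horizon are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(means, m, horizon, rng=None):
    """Sketch of Explore-then-Commit: pull every arm m times, then commit
    to the empirically best arm for the rest of the horizon."""
    rng = np.random.default_rng() if rng is None else rng
    n_arms = len(means)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    regret = 0.0
    best_mean = max(means)

    for t in range(horizon):
        if t < m * n_arms:
            arm = t % n_arms                      # exploration phase: round-robin
        else:
            arm = int(np.argmax(sums / counts))   # commit to the best empirical arm
        reward = rng.normal(means[arm], 1.0)      # assumed unit-variance Gaussian rewards
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    return regret

# Illustrative run with made-up arm means; regret depends strongly on the choice of m.
print(explore_then_commit([0.1, 0.3, 0.5], m=20, horizon=5000))
```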

## 2. Upper Confidence Bound (UCB)
### Reward Estimation + Confidence Bound
$$
\text{UCB}_{t,i} = \hat u_{t-1,i} + \alpha \sqrt{\frac{2 \ln t}{S_{t-1,i}}}
$$

where $\hat u_{t-1,i}$ is the empirical mean reward of arm $i$, $S_{t-1,i}$ is the number of times arm $i$ has been pulled up to round $t-1$, and $\alpha$ scales the exploration bonus.
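
A minimal sketch of how this index drives arm selection, assuming `sums` and `counts` hold per-arm reward totals and pull counts (illustrative only, not the repository's `UCBBandit` class):

```python
import numpy as np

def ucb_select(sums, counts, t, alpha):
    """Pick the arm maximizing the UCB index above.

    sums[i]   -- total reward observed from arm i
    counts[i] -- number of pulls of arm i (assumed > 0, e.g. each arm pulled once first)
    t         -- current round (1-indexed)
    alpha     -- scale of the exploration bonus
    """
    means = sums / counts
    bonus = alpha * np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(means + bonus))
```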

### Result



| Hyperparameter (α) | Cumulative Regret |
|:------------------:|:-----------------:|
| 0.1 | 256.50 |
| 0.5 | 977.03 |
| 1.0 | 1906.65 |



**Figure 2**: UCB Bandit accumulated regret for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.0

## 3. Thompson Sampling
### Posterior Distribution

$$
\tilde u_{t,i} \sim N\left( \hat u_{t-1,i}, \frac{1}{S_{t-1,i} + 1} \right)
$$
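
A minimal sampling step matching this posterior: each arm's posterior mean is its empirical mean, one sample is drawn per arm, and the arm with the largest draw is played (a sketch, not the repository's `ThompsonSamplingGaussianMAB` class):

```python
import numpy as np

def thompson_select(sums, counts, rng):
    """Gaussian Thompson Sampling step: sample each arm's mean from
    N(empirical mean, 1 / (pulls + 1)) and play the arm with the largest draw."""
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    samples = rng.normal(means, 1.0 / np.sqrt(counts + 1))
    return int(np.argmax(samples))

# Example usage with made-up statistics for three arms.
rng = np.random.default_rng(0)
print(thompson_select(np.array([3.0, 5.0, 1.0]), np.array([10, 12, 4]), rng))
```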

### Result

| Cumulative Regret |
|:------------------:|
| 100 |

**Figure 3**: Thompson Sampling accumulated regret

## 4. Linear UCB (LinUCB)
### Parameter Estimation

$$
\hat \theta_{t+1} = A^{-1}_ {t+1} b_{t+1}
$$

### Reward Estimation + Confidence Bound

$$
\text{UCB} = x^T \hat \theta_t + \alpha \sqrt{x^T A^{-1} x}
$$
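
Below is a compact sketch of the LinUCB loop under the usual assumptions, i.e. a ridge-regularized design matrix $A = \lambda I + \sum x x^T$ and reward vector $b = \sum r x$ (illustrative only, not the repository's `LinearUCBBandit` class):

```python
import numpy as np

class LinUCBSketch:
    """Minimal LinUCB sketch following the estimator and index above."""

    def __init__(self, dimension, lambda_=0.1, alpha=0.5):
        self.A = lambda_ * np.eye(dimension)   # ridge-regularized design matrix
        self.b = np.zeros(dimension)
        self.alpha = alpha

    def select(self, contexts):
        """contexts: (n_arms, dimension) array of arm feature vectors."""
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b             # theta_hat = A^{-1} b
        scores = contexts @ theta_hat + self.alpha * np.sqrt(
            np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))  # x^T theta + alpha sqrt(x^T A^{-1} x)
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)               # A <- A + x x^T
        self.b += reward * x                   # b <- b + r x
```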

### Result

| Hyperparameter (α) | Cumulative Regret |
|:------------------:|:-----------------:|
| 0.5 | 24.43 |
| 1.5 | 177.89 |
| 2.5 | 487.73 |



**Figure 4**: Linear UCB accumulated regret for (a) $\alpha$ = 0.5, (b) $\alpha$ = 1.5, (c) $\alpha$ = 2.5




**Figure 5**: Linear UCB estimation error for (a) $\alpha$ = 0.5, (b) $\alpha$ = 1.5, (c) $\alpha$ = 2.5

## 5. Linear Thompson Sampling (LinTS)
### Posterior Distribution
$$
\tilde \theta_t \sim N(\hat{\theta}_t, A^{-1}_t)
$$

### Result

| Cumulative Regret |
|:------------------:|
| 1098.24 |


**Figure 6**: Linear Thompson Sampling accumulated regret and estimation error

## 6. Generalized Linear Model (GLM) Bandit: Non-linear Bandit
### Modified Non-Linear Reward Function for Testing

$$
R = (x^T \theta)^2 + \epsilon, \text{ where } \epsilon \sim N(\mu, \sigma^2)
$$

### GLM Parameter Estimation (MLE)

$$
\hat \theta_{t+1} = \arg\max_{\theta} P(r \mid \theta) = A^{-1}_ {t+1} b_{t+1}
$$

### GLM UCB

$$
\text{UCB}_{\text{GLM}} = f(x^T \hat \theta_t) + \alpha \sqrt{x^T A^{-1} x} = (x^T \hat \theta_t)^2 + \alpha \sqrt{x^T A^{-1} x}
$$
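
A sketch of computing this GLM-UCB index with the squared link $f(z) = z^2$, taking the stated estimator $\hat \theta = A^{-1} b$ at face value (illustrative only; the repository's GLM implementation may differ):

```python
import numpy as np

def glm_ucb_select(A, b, contexts, alpha):
    """GLM-UCB index for the squared link: (x^T theta_hat)^2 + alpha sqrt(x^T A^{-1} x)."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                              # theta_hat = A^{-1} b, as stated above
    mean_part = (contexts @ theta_hat) ** 2            # f(x^T theta_hat) with f(z) = z^2
    bonus = alpha * np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
    return int(np.argmax(mean_part + bonus))
```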

### Result

| Hyperparameter (α) | Cumulative Regret |
|:------------------:|:-----------------:|
| 0.1 | 62.16 |
| 0.5 | 727.63 |
| 1.5 | 5948.48 |



**Figure 7**: GLM-UCB accumulated regret for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.5




**Figure 8**: GLM-UCB estimation error for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.5