https://github.com/mnitin-reddy/anomaly-detection-in-server-networks

This project implements an anomaly detection algorithm to identify failing servers in a network. The model is trained on a dataset with throughput and latency features, estimating parameters for anomaly detection. The approach is extended to a high-dimensional dataset, showcasing its effectiveness in detecting outliers and anomalies.

Host: GitHub
URL: https://github.com/mnitin-reddy/anomaly-detection-in-server-networks
Owner: MNitin-Reddy
Created: 2024-10-29T05:03:06.000Z (6 months ago)
Default Branch: main
Last Pushed: 2024-12-02T07:09:51.000Z (5 months ago)
Last Synced: 2025-02-08T11:44:10.252Z (3 months ago)
Topics: anamoly, data-science, machine-learning, matplotlib, numpy, pyhton, seaborn
Language: Jupyter Notebook
Homepage:
Size: 587 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Anomaly Detection in Server Networks

We will implement the anomaly detection algorithm and apply it to detect failing servers on a network.

## Table of Contents

1. [Packages](#1-packages)
2. [Anomaly Detection](#2-anomaly-detection)
   - [Problem Statement](#21-problem-statement)
   - [Dataset](#22-dataset)
   - [Gaussian Distribution](#23-gaussian-distribution)
   - [High Dimensional Dataset](#24-high-dimensional-dataset)

## 1. Packages

Ensure you have the required libraries installed to run the code.

```bash
pip install numpy matplotlib
```

## 2. Anomaly Detection

### 2.1 Problem Statement

The task involves implementing an anomaly detection algorithm to detect anomalous behaviour in server computers. The dataset contains two features:

- Throughput (mb/s)

- Latency (ms)

The dataset consists of 307 examples representing the servers' behaviour. The majority of examples are "normal," while some may represent anomalies. The goal is to use a Gaussian distribution model to detect these anomalies.

### 2.2 Dataset

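We begin by loading the dataset, which is split into a training set and a labeled cross-validation set:

```python
import numpy as np

def load_data():
    """Load the training set and the labeled cross-validation set."""
    X = np.load("data/X_part1.npy")          # unlabeled training examples
    X_val = np.load("data/X_val_part1.npy")  # validation examples
    y_val = np.load("data/y_val_part1.npy")  # validation labels (1 = anomaly)
    return X, X_val, y_val

X_train, X_val, y_val = load_data()
```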

- **X_train**: Used to fit a Gaussian distribution.

- **X_val, y_val**: Used as a cross-validation set to determine a threshold for anomaly detection.

### 2.3 Gaussian Distribution

To perform anomaly detection, the algorithm needs to fit a Gaussian distribution to the data's features and then calculate the probability of each example under this distribution. Two parameters characterize the Gaussian distribution: mean ($\mu$) and variance ($\sigma^2$).

For each feature in the dataset, the algorithm estimates the mean and variance, which are then used to evaluate the probability density of each data point.
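Written out, and assuming the features are modeled as independent (which is what a per-feature mean and variance imply), the estimates and the resulting density take the standard form below. The notation $x_j^{(i)}$ for feature $j$ of example $i$, with $m$ examples and $n$ features, is introduced here for illustration.

$$
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2
$$

$$
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)
$$

The $1/m$ normalization corresponds to the default behaviour of `np.var`, which divides by $m$ rather than $m - 1$.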

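```python
def estimate_gaussian(X):
    """Estimate the per-feature mean and variance of X, shape (m, n)."""
    mu = np.mean(X, axis=0)  # shape (n,)
    var = np.var(X, axis=0)  # population variance (divides by m), shape (n,)
    return mu, var
```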
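As a quick sanity check, the estimator can be run on a tiny array whose statistics are easy to verify by hand. This is only a sketch: `X_demo` is a made-up stand-in for the server data, and the function is restated so the snippet runs on its own.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and population variance of X, shape (m, n)."""
    return np.mean(X, axis=0), np.var(X, axis=0)

# Hypothetical data: 3 examples, 2 features (not taken from the real dataset)
X_demo = np.array([[10.0, 1.0],
                   [12.0, 3.0],
                   [14.0, 5.0]])

mu_demo, var_demo = estimate_gaussian(X_demo)
print(mu_demo)   # per-feature means: 12 and 3
print(var_demo)  # per-feature variances: 8/3 each (divides by m, not m - 1)
```

Both columns have deviations of -2, 0 and 2 from their mean, so each variance is (4 + 0 + 4) / 3.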

### Visualize the Data

Before analyzing further, it's important to visualize the data using a scatter plot that shows the relationship between Throughput and Latency.

```python
import matplotlib.pyplot as plt

# Scatter plot of the training examples, one marker per server
plt.scatter(X_train[:, 0], X_train[:, 1], c='blue', marker='x')
plt.xlabel('Throughput (mb/s)')
plt.ylabel('Latency (ms)')
plt.title('Server Behavior Visualization')
plt.show()
```

### Multivariate Gaussian

The algorithm evaluates the probability density of each data point under a multivariate Gaussian with the estimated mean and variance. When `var` is a vector of per-feature variances, it is treated as the diagonal of the covariance matrix, so the features are modeled as independent.

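```python
def multivariate_gaussian(X, mu, var):
    """Evaluate the multivariate Gaussian density at each row of X.

    If var is a vector, it is treated as the diagonal of the covariance matrix.
    """
    k = len(mu)
    if var.ndim == 1:
        var = np.diag(var)
    X = X - mu  # center each example on the mean
    p = (2 * np.pi) ** (-k / 2) * np.linalg.det(var) ** (-0.5) * \
        np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))
    return p
```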
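With a diagonal covariance matrix, this density should agree with the product of one-dimensional Gaussian densities, one per feature. The sketch below checks that numerically on random data; `product_of_univariate`, `X_demo` and the chosen means and scales are hypothetical and exist only for this comparison.

```python
import numpy as np

def multivariate_gaussian(X, mu, var):
    """Same function as above, restated so this snippet is self-contained."""
    k = len(mu)
    if var.ndim == 1:
        var = np.diag(var)
    X = X - mu
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(var) ** (-0.5) * \
        np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))

def product_of_univariate(X, mu, var):
    """Hypothetical helper: multiply the 1-D Gaussian densities of each feature."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.prod(per_feature, axis=1)

# Made-up data: 5 examples, 2 features with different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[14.0, 15.0], scale=[1.0, 2.0], size=(5, 2))
mu_demo = X_demo.mean(axis=0)
var_demo = X_demo.var(axis=0)

p_full = multivariate_gaussian(X_demo, mu_demo, var_demo)
p_prod = product_of_univariate(X_demo, mu_demo, var_demo)
print(np.allclose(p_full, p_prod))
```

The general matrix form only adds something when a full covariance matrix, with off-diagonal terms, is passed in.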

### Selecting the Threshold ($\epsilon$)

The threshold for detecting anomalies is chosen based on cross-validation. The goal is to find the best threshold using the $F_1$ score, which balances precision and recall. The algorithm evaluates various thresholds and selects the one that maximizes the $F_1$ score.
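For reference, these are the standard definitions, where an example is flagged as an anomaly when $p(x) < \epsilon$ and $tp$, $fp$ and $fn$ count the true positives, false positives and false negatives on the cross-validation set:

$$
\text{precision} = \frac{tp}{tp + fp}, \qquad
\text{recall} = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$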

```python
def select_threshold(y_val, p_val):
    """Return the threshold with the highest F1 score on the validation set."""
    best_epsilon = 0
    best_F1 = 0
    step_size = (p_val.max() - p_val.min()) / 1000

    for epsilon in np.arange(p_val.min(), p_val.max(), step_size):
        predictions = (p_val < epsilon).astype(int)  # 1 = flagged as anomaly
        tp = np.sum((predictions == 1) & (y_val == 1))  # True positives
        fp = np.sum((predictions == 1) & (y_val == 0))  # False positives
        fn = np.sum((predictions == 0) & (y_val == 1))  # False negatives
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0

        # Recompute F1 for every threshold so no stale value carries over
        F1 = 0
        if precision + recall > 0:
            F1 = (2 * precision * recall) / (precision + recall)
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
    return best_epsilon, best_F1
```
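To see how the score reacts to the threshold, it can help to evaluate a few values of $\epsilon$ by hand on a toy validation set. The snippet below is a minimal sketch: `f1_at`, `p_val_demo` and `y_val_demo` are hypothetical names and values, not part of the project's data.

```python
import numpy as np

def f1_at(epsilon, y_val, p_val):
    """F1 score when examples with p_val < epsilon are flagged as anomalies."""
    predictions = p_val < epsilon
    tp = np.sum(predictions & (y_val == 1))
    fp = np.sum(predictions & (y_val == 0))
    fn = np.sum(~predictions & (y_val == 1))
    if tp == 0:
        return 0.0  # precision * recall is zero, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up densities and labels (1 = anomaly); low density should mean anomaly
p_val_demo = np.array([0.001, 0.004, 0.020, 0.050, 0.080, 0.090])
y_val_demo = np.array([1, 1, 0, 1, 0, 0])

for eps in (0.003, 0.010, 0.060):
    print(eps, round(f1_at(eps, y_val_demo, p_val_demo), 3))
```

On this toy set the smallest threshold catches one of three anomalies (F1 = 0.5), the middle one catches two with no false alarms (F1 = 0.8), and the largest catches all three at the cost of one false positive (F1 of about 0.857), so it would be the one selected.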

### 2.4 High Dimensional Dataset

Next, we apply the anomaly detection algorithm to a more realistic dataset with 11 features, capturing more properties of the servers. This step involves:

- Estimating Gaussian parameters for the new dataset.

- Evaluating probabilities for both the training and validation sets.

- Using the cross-validation set to find the best threshold.

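```python
# load_data_high() is not shown in this README; it is assumed to return the
# 11-feature training and validation arrays in the same order as load_data().
X_train_high, X_val_high, y_val_high = load_data_high()

# Fit the Gaussian parameters on the training set
mu_high, var_high = estimate_gaussian(X_train_high)

# Evaluate densities for the training and validation sets
p_high = multivariate_gaussian(X_train_high, mu_high, var_high)
p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)

# Pick the threshold on the cross-validation set
epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)

print('Best epsilon found using cross-validation: %e' % epsilon_high)
print('Best F1 on Cross Validation Set: %f' % F1_high)
print('# Anomalies found: %d' % np.sum(p_high < epsilon_high))
```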

![Anomaly Detection Results](Anamolies.png)
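The threshold is printed with `%e` because density values shrink quickly as features are added. As a rough illustration, and assuming unit variance in every feature (an assumption made only for this sketch, not a property of the dataset), the highest density a Gaussian can reach is $(2\pi)^{-k/2}$ for $k$ features. `peak_density` is a hypothetical helper.

```python
import numpy as np

def peak_density(k, var=1.0):
    """Density of a k-dimensional Gaussian at its mean, equal variance per feature."""
    return (2 * np.pi * var) ** (-k / 2)

# Compare the 2-feature case with the 11-feature case
for k in (2, 11):
    print('k = %2d  peak density = %e' % (k, peak_density(k)))
```

Under this assumption the peak is roughly 0.16 for 2 features and about 4e-5 for 11 features, so a useful $\epsilon$ in the high-dimensional case is expected to be a very small number.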

## Conclusion

In this project, we successfully implemented an anomaly detection algorithm to identify failing servers in a network. By using a Gaussian distribution, we estimated the parameters for each feature and applied these to detect anomalies in the dataset.

The approach began with a 2D dataset, where we visualized the anomalies, and extended to a higher-dimensional dataset with 11 features, reflecting a more realistic scenario. After training the model, we selected an optimal threshold using cross-validation and calculated the F1 score for the best results.

### Key takeaways:

- The anomaly detection algorithm reached an F1 score of 0.615 on the cross-validation set, the best balance between precision and recall among the thresholds tried.

- The model flagged 117 examples in the high-dimensional training set as anomalies, which suggests the approach could be applied to detecting server failures or other abnormal behaviour in monitoring systems.

- The project showcased important machine learning concepts like Gaussian distribution estimation, multivariate anomaly detection, and model evaluation using the F1 score.

This project illustrates the power of anomaly detection in monitoring systems and networks, offering valuable insights into the health and performance of computing infrastructures.
