https://github.com/mmsaki/clustering-crypto

Using k-Means algorithm and a Principal Component Analysis (PCA) to cluster cryptocurrencies.
https://github.com/mmsaki/clustering-crypto
elbow-curves pca-analysis
Last synced: 4 months ago
JSON representation
Using k-Means algorithm and a Principal Component Analysis (PCA) to cluster cryptocurrencies.
Host: GitHub
URL: https://github.com/mmsaki/clustering-crypto
Owner: mmsaki
Created: 2022-06-15T23:55:02.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-06-25T17:33:14.000Z (over 3 years ago)
Last Synced: 2025-04-06T23:47:02.597Z (9 months ago)
Topics: elbow-curves, pca-analysis
Language: Jupyter Notebook
Homepage:
Size: 21 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # The Power of the Cloud and Unsupervised Learning

## Table of Contents

 





Crypto Clustering Overview





Data Preprocessing





Reducing Data Dimentions Using PCA





Clustering Cryptocurrencies Using K-Means





Visualizing Results





Optional Challenge





**File:** [Clustering Crypto](./ClusteringCrypto/crypto_clustering.ipynb)

**File:** [Optional Challenge](./ClusteringCrypto/crypto_clustering_sm.ipynb)

## Crypto Clustering Overview

- In this assignment I run the k-Means algorithm and a Principal Component Analysis (PCA) to cluster cryptocurrencies. 

- Assuming I am a Senior Manager at the Advisory Services team on a [Big Four firm](https://en.wikipedia.org/wiki/Big_Four_accounting_firms).

- One of my most important clients, a prominent investment bank, is interested in offering a new cryptocurrencies investment portfolio for its customers, however, they are lost in the immense universe of cryptocurrencies. 

- They ask me to help them make sense of it all by generating a report of what cryptocurrencies are available on the trading market and how they can be grouped using classification.

- I will put my new unsupervivsed learning and Amazon SageMaker skills into action by clustering cryptocurrencies and creating plots to present my results.

- I am asked to accomplish the following main tasks:

    - **[Data Preprocessing](#data-processing):** Prepare data for dimension reduction with PCA and clustering using K-Means.

    - **[Reducing Data Dimensions Using PCA](#reducing-data-dimentions-using-pca):** Reduce data dimension using the `PCA` algorithm from `sklearn`.

    - **[Clustering Cryptocurrencies Using K-Means](#clustering-cryptocurrencies-using-k-means):** Predict clusters using the cryptocurrencies data using the `KMeans` algorithm from `sklearn`.

    - **[Visualizing Results](#visualizing-results):** Create some plots and data tables to present my results.

    - **[Optional Challenge](#optional-challenge):** Deploy my notebook to Amazon SageMaker.

## Data Processing

- [x] Using the following `requests` library, retreive the necessary data from the following API endpoint from _CryptoCompare_ - `https://min-api.cryptocompare.com/data/all/coinlist`. **HINT:** I will need to use the 'Data' key from the json response, then transpose the DataFrame. Name my DataFrame `crypto_df`.

    ```python

    # Use the following endpoint to fetch json data

    url = "https://min-api.cryptocompare.com/data/all/coinlist"

    response = requests.get(url).json()

    # Create a DataFrame 

    crypto_df = pd.DataFrame(response["Data"]).T

    ```

    - With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.

    - [x] Keep only the necessary columns: `'CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','CirculatingSupply'`

    ```python

    # Keep only necessary columns

    crypto_df = crypto_df[['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','CirculatingSupply']]

    ```

    - [x] Keep only the cryptocurrencies that are trading.

    ```python

    # Keep only cryptocurrencies that are trading

    crypto_df = crypto_df[crypto_df["IsTrading"] == True]

    ```

    - [x] Keep only the cryptocurrencies with a working algorithm.

    ```python

    crypto_df = crypto_df[crypto_df["Algorithm"] != "N/A"]

    ```

    - [x] Remove the `IsTrading` column.

    ```python

    crypto_df = crypto_df.drop(columns = ["IsTrading"])

    ```

    - [x] Remove all cryptocurrencies with at least one null value.

    ```python

    crypto_df = crypto_df.dropna()

    ```

    - [x] Remove all cryptocurrencies that have no coins mined.

    ```python

    crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

    ```

    - [x] Drop all rows where there are 'N/A' text values.

    ```python

    crypto_df = crypto_df[crypto_df.iloc[:] != "N/A"].dropna()

    ```

    - [x] Store the names of all cryptocurrencies in a DataFrame named `coins_name`, use the `crypto_df.index` as the index for this new DataFrame.

    ```python

    coins_name = crypto_df.index

    ```

    - [x] Remove the `CoinName` column.

    ```python

    crypto_df = crypto_df.drop("CoinName", axis=1)

    ```

    - [x] Create dummy variables for all the text features, and store the resulting data in a DataFrame named `X`.

    ```python

    X = pd.get_dummies(data = crypto_df, columns = ["Algorithm", "ProofType"])

    ```

    - [x] Use the [`StandardScaler` from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to standardize all the data of the `X` DataFrame. Remember, this is important prior to using PCA and K-Means algorithms.

    ```python

    X = StandardScaler().fit_transform(X)

    ```

    ## Reducing Data Dimentions Using PCA

    - [x] Use the [`PCA` algorithm from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the dimensions of the `X` DataFrame down to three principal components.

    ```python

    pca = PCA(n_components=3)

    crypto_pca = pca.fit_transform(X)

    ```

    - [x] Once I have reduced the data dimensions, create a DataFrame named `pcs_df` using as columns names `"PC 1", "PC 2"` and `"PC 3"`; use the `crypto_df.index` as the index for this new DataFrame.

    ```python

    pcs_df = pd.DataFrame(

        crypto_pca,

        columns = ["PC 1", "PC 2", "PC 3"],

        index = coins_name

    )

    pcs_df.head(10)

    ```

## Clustering Cryptocurrencies Using k-means

- [x] Create an Elbow Curve to find the best value for `k` using the `pcs_df` DataFrame.

```python

inertia = []

k = list(range(1, 11))

# Calculate the inertia for the range of k values

for i in k:

    k_model = KMeans(n_clusters=i, random_state=1)

    k_model.fit(pcs_df)

    inertia.append(k_model.inertia_)

# Create the Elbow Curve using hvPlot

elbow_data = {"k": k, "inertia": inertia}

df_elbow = pd.DataFrame(elbow_data)

# Create Elbow plot

df_elbow.hvplot.line(

    x="k", 

    y="inertia", 

    title="Elbow Curve", 

    xticks=k

)

```

!["Elbow Plot"](./plots/elbow_curve.png)

- [x] Once I define the best value for `k`, run the `Kmeans` algorithm to predict the `k` clusters for the cryptocurrencies data. Use the `pcs_df` to run the `KMeans` algorithm.

```python

# Initialize the K-Means model

model = KMeans(n_clusters = 10, random_state=0)

# Fit the model

model.fit(pcs_df)

# Predict clusters

k_10 = model.predict(pcs_df)

```

- [x] Create a new DataFrame named `clustered_df`, that includes the following columns `"Algorithm", "ProofType", "TotalCoinsMined", "TotalCoinSupply", "PC 1", "PC 2", "PC 3", "CoinName", "Class"`. I should maintain the index of the `crypto_df` DataFrames as is shown bellow.

```python

clustered_df = pd.concat([crypto_df, pcs_df], axis=1)

clustered_df["Class"] = k_10

clustered_df["CoinName"] = coins_name

clustered_df.head(20)

```

## Visualizing Results

- In this section, I will create some data visualization to present the final results. 

- [x] Create a scatter plot using `hvplot.scatter`, to present the clustered data about cryptocurrencies having `x="TotalCoinsMined"` and `y="TotalCoinSupply"` to contrast the number of available coins versus the total number of mined coins. Use the `hover_cols=["CoinName"]` parameter to include the cryptocurrency name on each data point.

```python

# Plot Scatter plot

clustered_df.hvplot.scatter(

    x= "TotalCoinsMined", 

    y= "CirculatingSupply",

    hover_cols=["CoinName"]

)

```

!["Hvplot Cluster"](./plots/cluster_plot.png)

- [x] Use `hvplot.table` to create a data table with all the current tradable cryptocurrencies. The table should have the following columns: `"CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class"`

```python

clustered_df.hvplot.table(columns=["CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class"], sortable=True, selectable=True)

```

![table](./plots/table_of_tradable_coins.png)

## Optional Challenge

- For the challenge section, I have to upload my Jupyter notebook to Amazon SageMaker and deploy it.

- The `hvplot` library is not included in the built-in anaconda environments, so for this challenge section, I should use the `altair` library instead.

- [x] Upload my Jupyter notebook and rename it as `crypto_clustering_sm.ipynb`

- [x] Select the `conda_python3` environment.

- [x] Install the `altair` library by running the following code before the initial imports.

   ```python

   !pip install -U altair

   ```

- [x] Use the `altair` scatter plot to create the Elbow Curve.

```python

inertia = []

k = list(range(1, 11))

# Calculate the inertia for the range of k values

for i in k:

    k_model = KMeans(n_clusters=i, random_state=1)

    k_model.fit(pcs_df)

    inertia.append(k_model.inertia_)

# Create the Elbow Curve using altair

elbow_data = {"k": k, "inertia": inertia}

df_elbow = pd.DataFrame(elbow_data)

# Create Elbow plot

alt.Chart(df_elbow).mark_line().encode(

    x="k", 

    y="inertia"

)

```

![Elbow Curve Visualization](./plots/sagemaker_elbow_curve_visualization.png)

- [x] Use the `altair` scatter plot to visualize the clusters. Since this is a 2D-Scatter, use `x="PC 1"` and `y="PC 2"` for the axes, and add the following columns as tool tips: `"CoinName", "Algorithm", "TotalCoinsMined", "TotalCoinSupply"`.

```python

# Plot the scatter with x="PC 1" and y="PC 2"

# Plot the clusters

alt.Chart(clustered_df).mark_circle(size=60).encode(

    x="PC 1",

    y="PC 2",

    color='Class',

    tooltip=['CoinName', 'Algorithm', 'TotalCoinsMined', 'CirculatingSupply']

).interactive()

```

![Altair Cluster plot](./plots/cluster_altair_visualization.png)

_ _ _ _
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mmsaki/clustering-crypto

Awesome Lists containing this project

README