{"id":23305867,"url":"https://github.com/mmsaki/clustering-crypto","last_synced_at":"2025-09-02T02:32:57.299Z","repository":{"id":37495104,"uuid":"503955302","full_name":"mmsaki/clustering-crypto","owner":"mmsaki","description":"Using k-Means algorithm and a Principal Component Analysis (PCA) to cluster cryptocurrencies.","archived":false,"fork":false,"pushed_at":"2022-06-25T17:33:14.000Z","size":22072,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-06T23:47:02.597Z","etag":null,"topics":["elbow-curves","pca-analysis"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mmsaki.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-15T23:55:02.000Z","updated_at":"2022-07-03T10:06:43.000Z","dependencies_parsed_at":"2022-09-07T14:31:57.211Z","dependency_job_id":null,"html_url":"https://github.com/mmsaki/clustering-crypto","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mmsaki/clustering-crypto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmsaki%2Fclustering-crypto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmsaki%2Fclustering-crypto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmsaki%2Fclustering-crypto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmsaki%2Fclustering-crypto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mmsaki","download_url":"https://
codeload.github.com/mmsaki/clustering-crypto/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmsaki%2Fclustering-crypto/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273220134,"owners_count":25066319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elbow-curves","pca-analysis"],"created_at":"2024-12-20T12:14:27.625Z","updated_at":"2025-09-02T02:32:57.257Z","avatar_url":"https://github.com/mmsaki.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Power of the Cloud and Unsupervised Learning\n\n\n## Table of Contents\n\u003cdetails\u003e \n\u003col\u003e\n\u003cli\u003e\nCrypto Clustering Overview\n\u003c/li\u003e\n\u003cli\u003e\nData Preprocessing\n\u003c/li\u003e\n\u003cli\u003e\nReducing Data Dimensions Using PCA\n\u003c/li\u003e\n\u003cli\u003e\nClustering Cryptocurrencies Using K-Means\n\u003c/li\u003e\n\u003cli\u003e\nVisualizing Results\n\u003c/li\u003e\n\u003cli\u003e\nOptional Challenge\n\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/details\u003e\n\n**File:** [Clustering Crypto](./ClusteringCrypto/crypto_clustering.ipynb)\n**File:** [Optional Challenge](./ClusteringCrypto/crypto_clustering_sm.ipynb)\n\n## Crypto Clustering Overview\n\n- In this 
assignment, I use the k-Means algorithm and Principal Component Analysis (PCA) to cluster cryptocurrencies.\n\n- Assume I am a Senior Manager on the Advisory Services team at a [Big Four firm](https://en.wikipedia.org/wiki/Big_Four_accounting_firms).\n- One of my most important clients, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio to its customers. However, they are lost in the immense universe of cryptocurrencies.\n- They ask me to help them make sense of it all by generating a report of which cryptocurrencies are available on the trading market and how they can be grouped using clustering.\n- I will put my new unsupervised learning and Amazon SageMaker skills into action by clustering cryptocurrencies and creating plots to present my results.\n\n- I am asked to accomplish the following main tasks:\n\n    - **[Data Preprocessing](#data-processing):** Prepare data for dimension reduction with PCA and clustering using K-Means.\n\n    - **[Reducing Data Dimensions Using PCA](#reducing-data-dimentions-using-pca):** Reduce the data dimensions using the `PCA` algorithm from `sklearn`.\n\n    - **[Clustering Cryptocurrencies Using K-Means](#clustering-cryptocurrencies-using-k-means):** Predict clusters for the cryptocurrency data using the `KMeans` algorithm from `sklearn`.\n\n    - **[Visualizing Results](#visualizing-results):** Create some plots and data tables to present my results.\n\n    - **[Optional Challenge](#optional-challenge):** Deploy my notebook to Amazon SageMaker.\n\n## Data Processing\n\n- [x] Using the `requests` library, retrieve the necessary data from the following _CryptoCompare_ API endpoint: `https://min-api.cryptocompare.com/data/all/coinlist`. **HINT:** I will need to use the 'Data' key from the JSON response, then transpose the DataFrame. 
Name my DataFrame `crypto_df`.\n\n    ```python\n    import pandas as pd\n    import requests\n\n    # Use the following endpoint to fetch JSON data\n    url = \"https://min-api.cryptocompare.com/data/all/coinlist\"\n    response = requests.get(url).json()\n\n    # Create a DataFrame and transpose it so that each coin is a row\n    crypto_df = pd.DataFrame(response[\"Data\"]).T\n    ```\n\n    - With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.\n\n    - [x] Keep only the necessary columns: `'CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','CirculatingSupply'`\n\n    ```python\n    # Keep only necessary columns\n    crypto_df = crypto_df[['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','CirculatingSupply']]\n    ```\n\n    - [x] Keep only the cryptocurrencies that are trading.\n\n    ```python\n    # Keep only cryptocurrencies that are trading\n    crypto_df = crypto_df[crypto_df[\"IsTrading\"] == True]\n    ```\n\n    - [x] Keep only the cryptocurrencies with a working algorithm.\n\n    ```python\n    crypto_df = crypto_df[crypto_df[\"Algorithm\"] != \"N/A\"]\n    ```\n\n    - [x] Remove the `IsTrading` column.\n\n    ```python\n    crypto_df = crypto_df.drop(columns = [\"IsTrading\"])\n    ```\n\n    - [x] Remove all cryptocurrencies with at least one null value.\n\n    ```python\n    crypto_df = crypto_df.dropna()\n    ```\n\n    - [x] Remove all cryptocurrencies that have no coins mined.\n\n    ```python\n    crypto_df = crypto_df[crypto_df[\"TotalCoinsMined\"] \u003e 0]\n    ```\n\n    - [x] Drop all rows where there are 'N/A' text values.\n\n    ```python\n    crypto_df = crypto_df[crypto_df.iloc[:] != \"N/A\"].dropna()\n    ```\n\n    - [x] Store the names of all cryptocurrencies in a DataFrame named `coins_name`; use the `crypto_df.index` as the index for this new DataFrame.\n\n    ```python\n    # Store the index (coin symbols) for reuse as the index of later DataFrames\n    coins_name = crypto_df.index\n    ```\n\n    - [x] Remove the `CoinName` column.\n\n    ```python\n    crypto_df = crypto_df.drop(\"CoinName\", axis=1)\n    ```\n\n    - 
[x] Create dummy variables for all the text features, and store the resulting data in a DataFrame named `X`.\n\n    ```python\n    X = pd.get_dummies(data = crypto_df, columns = [\"Algorithm\", \"ProofType\"])\n    ```\n\n    - [x] Use the [`StandardScaler` from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to standardize all the data of the `X` DataFrame. Remember, this is important prior to using the PCA and K-Means algorithms.\n\n    ```python\n    X = StandardScaler().fit_transform(X)\n    ```\n\n    ## Reducing Data Dimentions Using PCA\n\n    - [x] Use the [`PCA` algorithm from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the dimensions of the `X` DataFrame down to three principal components.\n\n    ```python\n    pca = PCA(n_components=3)\n    crypto_pca = pca.fit_transform(X)\n    ```\n\n    - [x] Once I have reduced the data dimensions, create a DataFrame named `pcs_df` with the column names `\"PC 1\"`, `\"PC 2\"` and `\"PC 3\"`; use the `crypto_df.index` as the index for this new DataFrame.\n\n    ```python\n    pcs_df = pd.DataFrame(\n        crypto_pca,\n        columns = [\"PC 1\", \"PC 2\", \"PC 3\"],\n        index = coins_name\n    )\n    pcs_df.head(10)\n    ```\n\n## Clustering Cryptocurrencies Using k-means\n\n- [x] Create an Elbow Curve to find the best value for `k` using the `pcs_df` DataFrame.\n\n```python\ninertia = []\nk = list(range(1, 11))\n\n# Calculate the inertia for the range of k values\nfor i in k:\n    k_model = KMeans(n_clusters=i, random_state=1)\n    k_model.fit(pcs_df)\n    inertia.append(k_model.inertia_)\n\n# Create the Elbow Curve using hvPlot\nelbow_data = {\"k\": k, \"inertia\": inertia}\ndf_elbow = pd.DataFrame(elbow_data)\n\n# Create Elbow plot\ndf_elbow.hvplot.line(\n    x=\"k\", \n    y=\"inertia\", \n    title=\"Elbow Curve\", \n    xticks=k\n)\n```\n\n![\"Elbow Plot\"](./plots/elbow_curve.png)\n\n- [x] Once 
I define the best value for `k`, run the `KMeans` algorithm to predict the `k` clusters for the cryptocurrency data. Use the `pcs_df` DataFrame to run the `KMeans` algorithm.\n\n```python\n# Initialize the K-Means model\nmodel = KMeans(n_clusters = 10, random_state=0)\n\n# Fit the model\nmodel.fit(pcs_df)\n\n# Predict clusters\nk_10 = model.predict(pcs_df)\n```\n\n- [x] Create a new DataFrame named `clustered_df` that includes the following columns: `\"Algorithm\", \"ProofType\", \"TotalCoinsMined\", \"CirculatingSupply\", \"PC 1\", \"PC 2\", \"PC 3\", \"CoinName\", \"Class\"`. I should maintain the index of the `crypto_df` DataFrame, as shown below.\n\n```python\nclustered_df = pd.concat([crypto_df, pcs_df], axis=1)\nclustered_df[\"Class\"] = k_10\nclustered_df[\"CoinName\"] = coins_name\nclustered_df.head(20)\n```\n\n## Visualizing Results\n\n- In this section, I will create some data visualizations to present the final results. \n- [x] Create a scatter plot using `hvplot.scatter` to present the clustered cryptocurrency data, with `x=\"TotalCoinsMined\"` and `y=\"CirculatingSupply\"`, to contrast the number of available coins with the total number of mined coins. Use the `hover_cols=[\"CoinName\"]` parameter to include the cryptocurrency name on each data point.\n\n```python\n# Plot Scatter plot\nclustered_df.hvplot.scatter(\n    x= \"TotalCoinsMined\", \n    y= \"CirculatingSupply\",\n    hover_cols=[\"CoinName\"]\n)\n```\n\n![\"Hvplot Cluster\"](./plots/cluster_plot.png)\n\n- [x] Use `hvplot.table` to create a data table with all the currently tradable cryptocurrencies. 
The table should have the following columns: `\"CoinName\", \"Algorithm\", \"ProofType\", \"CirculatingSupply\", \"TotalCoinsMined\", \"Class\"`\n\n```python\nclustered_df.hvplot.table(columns=[\"CoinName\", \"Algorithm\", \"ProofType\", \"CirculatingSupply\", \"TotalCoinsMined\", \"Class\"], sortable=True, selectable=True)\n```\n\n![table](./plots/table_of_tradable_coins.png)\n\n## Optional Challenge\n\n- For the challenge section, I have to upload my Jupyter notebook to Amazon SageMaker and deploy it.\n\n- The `hvplot` library is not included in the built-in Anaconda environments, so for this challenge section, I should use the `altair` library instead.\n\n- [x] Upload my Jupyter notebook and rename it as `crypto_clustering_sm.ipynb`\n\n- [x] Select the `conda_python3` environment.\n- [x] Install the `altair` library by running the following code before the initial imports.\n   ```python\n   !pip install -U altair\n   ```\n- [x] Use an `altair` line chart to create the Elbow Curve.\n\n```python\ninertia = []\nk = list(range(1, 11))\n\n# Calculate the inertia for the range of k values\nfor i in k:\n    k_model = KMeans(n_clusters=i, random_state=1)\n    k_model.fit(pcs_df)\n    inertia.append(k_model.inertia_)\n\n# Create the Elbow Curve using altair\nelbow_data = {\"k\": k, \"inertia\": inertia}\ndf_elbow = pd.DataFrame(elbow_data)\n\n# Create Elbow plot\nalt.Chart(df_elbow).mark_line().encode(\n    x=\"k\", \n    y=\"inertia\"\n)\n```\n\n![Elbow Curve Visualization](./plots/sagemaker_elbow_curve_visualization.png)\n\n- [x] Use the `altair` scatter plot to visualize the clusters. 
Since this is a 2D scatter plot, use `x=\"PC 1\"` and `y=\"PC 2\"` for the axes, and add the following columns as tooltips: `\"CoinName\", \"Algorithm\", \"TotalCoinsMined\", \"CirculatingSupply\"`.\n\n```python\n# Plot the scatter with x=\"PC 1\" and y=\"PC 2\"\n# Plot the clusters\nalt.Chart(clustered_df).mark_circle(size=60).encode(\n    x=\"PC 1\",\n    y=\"PC 2\",\n    color='Class',\n    tooltip=['CoinName', 'Algorithm', 'TotalCoinsMined', 'CirculatingSupply']\n).interactive()\n```\n\n![Altair Cluster plot](./plots/cluster_altair_visualization.png)\n\n_ _ _ _","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmmsaki%2Fclustering-crypto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmmsaki%2Fclustering-crypto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmmsaki%2Fclustering-crypto/lists"}