{"id":18925550,"url":"https://github.com/nightmachinery/soal_playground","last_synced_at":"2025-09-02T12:43:59.740Z","repository":{"id":40428935,"uuid":"446938239","full_name":"NightMachinery/soal_playground","owner":"NightMachinery","description":null,"archived":false,"fork":false,"pushed_at":"2023-12-15T11:40:18.000Z","size":25627,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-05-25T00:18:09.321Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NightMachinery.png","metadata":{"files":{"readme":"readme.org","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-11T18:33:09.000Z","updated_at":"2022-01-11T18:52:48.000Z","dependencies_parsed_at":"2024-11-08T11:12:24.353Z","dependency_job_id":"c7dcfdfb-ad5f-4714-a18e-73554569cbfd","html_url":"https://github.com/NightMachinery/soal_playground","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NightMachinery/soal_playground","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fsoal_playground","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fsoal_playground/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fsoal_playground/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fsoal_playground/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NightMachinery","download_url":"https://codeload.github.com/NightMachinery/soal_playground/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fsoal_playground/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273287471,"owners_count":25078569,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-02T02:00:09.530Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T11:12:18.821Z","updated_at":"2025-09-02T12:43:59.717Z","avatar_url":"https://github.com/NightMachinery.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"#+TITLE: soal_playground\n\nThis file is authored in org-mode markup, and it reads better [[https://github.com/NightMachinary/soal_playground/raw/master/readme.org][raw]] than in GitHub's default rendering.\n\n* project todos\n** peripheral\n*** Investigate why cuML is consuming so much memory.\n**** [[id:f8dc1a3d-afa6-4f5c-98c2-7b0a836f30ab][memleak/gen:rapidsai/cudf#10107 {BUG} Creating a DataFrame from a numpy array consumes too much RAM]]\n\n*** Rebenchmark =python run_one.py kmeans_mb2e10_sklearn_iter10e4_dask fcps_dietary_survey_IBS= on Colab; its score is normal on my laptop, but it is too low on Colab.\n\n*** Create a 
=conda= constructor.\n- @alt Compress the whole =conda= directory and persist it.\n\n**** [[https://colab.research.google.com/drive/1HjikV9AS7X4eklbPtauTG_N6XNGIwOHG#scrollTo=xor-KoTA1dYX]]\n\n**** [[https://github.com/conda/constructor/issues/488][conda/constructor#488 Weird conflict errors]]\n\n**** [[https://github.com/conda-incubator/condacolab/issues/22][conda-incubator/condacolab#22 Weird conflict errors]]\n\n*** DONE =hdbscan= has a numpy incompatibility problem in the GPU mode.\n:PROPERTIES:\n:visibility: folded\n:END:\n- Update: I think adding =hdbscan= and =numpy= as explicit deps to =conda= solved this.\n\n#+begin_example python\nTraceback (most recent call last):\n  File \"run_one.py\", line 8, in \u003cmodule\u003e\n    from soalpy.runners import *\n  File \"/usr/local/lib/python3.8/site-packages/soalpy/runners.py\", line 9, in \u003cmodule\u003e\n    from hdbscan import HDBSCAN\n  File \"/usr/local/lib/python3.8/site-packages/hdbscan/__init__.py\", line 1, in \u003cmodule\u003e\n    from .hdbscan_ import HDBSCAN, hdbscan\n  File \"/usr/local/lib/python3.8/site-packages/hdbscan/hdbscan_.py\", line 21, in \u003cmodule\u003e\n    from ._hdbscan_linkage import (single_linkage,\n  File \"hdbscan/_hdbscan_linkage.pyx\", line 1, in init hdbscan._hdbscan_linkage\nValueError: numpy.ndarray size changed, may indicate binary incompatibility. 
Expected 96 from C header, got 88 from PyObject\n#+end_example\n\n*** DONE =hdbscan= has problems with the FCPS data.\n:PROPERTIES:\n:visibility: folded\n:END:\n- Update: I think updating =xarray= solved this.\n\n- works fine on my laptop though?!\n\n#+begin_example python\ncmd: python 'run_one.py' 'hdbscan_sklearn_best' 'fcps_leukemia'\nERROR: command failed with 1\n#### stats:\nCommand exited with non-zero status 1\n1379144,4.45\n#### out:\n#### err:\n##\nRAPIDS not installed\nINFO: metric switched to precomputed.\nTraceback (most recent call last):\n  File \"run_one.py\", line 223, in \u003cmodule\u003e\n    res = algo(dataset)\n  File \"/usr/local/lib/python3.7/dist-packages/soalpy/hdbscan_runners.py\", line 8, in hdbscan_sklearn_best\n    return run(dataset, mode=\"HDBSCAN\", algorithm='best', **kwargs,)\n  File \"/usr/local/lib/python3.7/dist-packages/soalpy/runners.py\", line 159, in run\n    preds = clf.fit_predict(input_data)\n  File \"/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py\", line 1227, in fit_predict\n    self.fit(X)\n  File \"/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py\", line 1173, in fit\n    preds = clf.fit_predict(input_data)\n  File \"/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py\", line 1227, in fit_predict\n    self.fit(X)\n  File \"/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py\", line 1173, in fit\n    check_precomputed_distance_matrix(X)\n  File \"/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py\", line 393, in check_precomputed_distance_matrix\n    tmp[np.isinf(tmp)] = 1\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py\", line 715, in __setitem__\n    obj = self[key]\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py\", line 706, in __getitem__\n    return self.isel(indexers=self._item_key_to_dict(key))\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py\", line 1140, in isel\n    indexers, drop=drop, 
missing_dims=missing_dims\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/dataset.py\", line 2275, in _isel_fancy\n    name, var, self.xindexes[name], var_indexers\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/indexes.py\", line 295, in isel_variable_and_index\n    new_variable = variable.isel(indexers)\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py\", line 1135, in isel\n    return self[key]\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py\", line 779, in __getitem__\n    dims, indexer, new_order = self._broadcast_indexes(key)\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py\", line 622, in _broadcast_indexes\n    self._validate_indexers(key)\n  File \"/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py\", line 670, in _validate_indexers\n    \"not supported. \".format(k.ndim)\nIndexError: 2-dimensional boolean indexing is not supported.\n#+end_example\n\n*** IGNORE @upstreamBug? 
=hdbscan_cuml= has problems with =fcps_leukemia=\n#+begin_example\n##### Algorithm: hdbscan_cuml\ncmd: python 'run_one.py' 'hdbscan_cuml' 'fcps_leukemia'\nERROR: command failed with 1\n#### stats:\nCommand exited with non-zero status 1\n2115364,6.80\n#### out:\n#### err:\n##\nINFO: metric switched to precomputed.\nTraceback (most recent call last):\n  File \"run_one.py\", line 223, in \u003cmodule\u003e\n    res = algo(dataset)\n  File \"/root/miniconda3/lib/python3.8/site-packages/soalpy/hdbscan_runners.py\", line 5, in hdbscan_cuml\n    return run(dataset, mode=\"cuHDBSCAN\")\n  File \"/root/miniconda3/lib/python3.8/site-packages/soalpy/runners.py\", line 159, in run\n    preds = clf.fit_predict(input_data)\n  File \"/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py\", line 586, in inner_get\n    ret_val = func(*args, **kwargs)\n  File \"cuml/cluster/hdbscan.pyx\", line 671, in cuml.cluster.hdbscan.HDBSCAN.fit_predict\n  File \"/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py\", line 409, in inner_with_setters\n    return func(*args, **kwargs)\n  File \"cuml/cluster/hdbscan.pyx\", line 638, in cuml.cluster.hdbscan.HDBSCAN.fit\n  File \"cuml/common/base.pyx\", line 270, in cuml.common.base.Base.__getattr__\nAttributeError\n####\nERROR: exit_code=1. deleted: /content/drive/MyDrive/soalpy/benchmarks/fcps_leukemia/hdbscan/hdbscan_cuml\n#+end_example\n\n*** DONE Save the generated datasets in =run_one.py= to avoid the upstream memory issues.\n\n*** DONE @upstreamBug Jupyter memory leak\n**** [[https://colab.research.google.com/drive/1UpqpMbb6fpCZFDXNZ-Q5i72aAqn8R2cI?usp=sharing][reproduction steps]]\n\n**** [[https://github.com/ipython/ipython/issues/3452#thread-subscription-status][ipython/ipython#3452 Memory leak even when cache_size = 0 and history_length = 0 or history_length = 1]]\n\n*** @toread\n**** Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. 
MIT Press.\n***** chapter 21 (clustering)\n\n*** preprocessing\n**** [[file:./dimension reduction.org]]\n\n**** normalization\n#+begin_example python\nfrom sklearn import pipeline\nfrom sklearn.preprocessing import MinMaxScaler, Normalizer\nfrom sklearn.model_selection import train_test_split\n\nfrom sklearn.datasets import load_breast_cancer\nX, y = load_breast_cancer(return_X_y=True)\n\ndata_train, data_test, targets_train, targets_test = train_test_split(X, y, random_state=17)\n\nmm = pipeline.make_pipeline(MinMaxScaler(), Normalizer())\ndata_train = mm.fit_transform(data_train)\n#+end_example\n\n*** @? sparsity support\n\n** phase I\n*** [[./data/datasets.org][Find good datasets.]]\n\n*** benchmark a clustering algorithm (e.g., k-means) on:\n**** scalability\n***** feature size (10k needed)\n#+begin_quote\n\nOverall, the data size limit is a few hundred GB, below 1 TB.\nHowever, this limit can also be taken as the product of the dimensionality and the number of samples.\n\n#+end_quote\n\n#+begin_src bsh.dash :results verbatim :exports both :wrap results\nec $((10**(4+6)*8)) | numfmt-bytes\n#: float64 is 8 bytes\n#+end_src\n\n#+RESULTS:\n#+begin_results\n75GiB\n#+end_results\n\n**** time\n\n**** memory\n\n**** parallelism on CPUs\n\n**** GPU/TPU support\n\n**** How well can it saturate the computing device?\n\n**** correctness\n***** internal clustering metrics?\n\n***** completeness score\n\n***** homogeneity score\n\n**** flexibility of the implementation\n***** hyperparameters\n\n*** Find other clustering algorithms and repeat.\n**** DBSCAN\n***** HDBSCAN (expected to be the best algorithm for the job)\n****** [[https://github.com/scikit-learn-contrib/hdbscan/issues/521][scikit-learn-contrib/hdbscan#521 Does HDBSCAN support out-of-core (incremental) training?]]\n\n**** spectral clustering\n\n**** Gaussian mixture model (GMM)\n***** Since we already have k-means, are GMMs useful?\n\n**** @? latent Dirichlet allocation (LDA)\n\n**** @? 
power iteration clustering (PIC)\n\n*** export CSV, HDF5\n**** time of exporting and loading and size\n**** try =gz=\n\n**** results\n***** =parquet=\n****** no compression\n#+begin_example\ntotal 55G\n-rw-r--r-- 1 root root 555M Feb 20 12:42 part.98.parquet\n-rw-r--r-- 1 root root 555M Feb 20 12:50 part.99.parquet\n-rw-r--r-- 1 root root 555M Feb 20 12:43 part.9.parquet\n...\n55G\t/d.parquet.none\n\nic| dur_write: 918.9540417194366\nic| dur_read: 125.95909476280212\nic| dur_conv: 0.6665265560150146\nic| dur_avg: 395.527277469635\nic| avg: -0.020827701\n#+end_example\n\n\n****** =compression=gzip=\n#+begin_example\n-rw-r--r-- 1 root root 520M Feb 20 13:43 part.0.parquet\n-rw-r--r-- 1 root root 520M Feb 20 14:03 part.10.parquet\n-rw-r--r-- 1 root root 520M Feb 20 13:46 part.11.parquet\n-rw-r--r-- 1 root root 520M Feb 20 14:04 part.12.parquet\n...\nparquet compression=gzip\ntotal 51G\n\nic| dur_write: 2132.026951789856\nic| dur_read: 113.30188322067261\nic| dur_conv: 0.6828622817993164\nic| dur_avg: 389.4808497428894\nic| avg: -0.020827701\n#+end_example\n\n\n****** =compression=snappy=\n#+begin_example\n-rw-r--r-- 1 root root 555M Feb 20 13:07 part.98.parquet\n-rw-r--r-- 1 root root 555M Feb 20 13:06 part.99.parquet\n-rw-r--r-- 1 root root 555M Feb 20 13:12 part.9.parquet\n...\n55G\t/d.parquet.snappy\n\nic| dur_write: 975.4363565444946\nic| dur_read: 125.97352576255798\nic| dur_conv: 0.6695859432220459\nic| dur_avg: 402.8658866882324\nic| avg: -0.020827701\n#+end_example\n\n****** =compression=brotli=\n#+begin_example\n50G\t/.d.parquet.brotli\nic| dur_write: 3271.8567810058594\nic| dur_read: 115.65357375144958\nic| dur_conv: 0.6975142955780029\nic| dur_avg: 399.2433009147644\n#+end_example\n\n***** CSV\n****** gzip\n#+begin_example\n\u003e du -h d-00.csv\n439M    d-00.csv\n\n\u003e du -h =(zcat d-00.csv)\n965M    /tmp/zsh2ilH9S\n#+end_example\n\n#+begin_example\n43G\t/d_csv\nic| dur_write: ~ 3 hours\nic| dur_read: 45.56965947151184\nic| dur_conv: 
0.7060840129852295\nic| dur_avg: 9023.461018323898\nic| avg: 50.03045087413395\n#+end_example\n\n****** no compression\n#+begin_example\n-rw-r--r-- 1 root root 965M Feb 20 17:15 d-97.csv\n-rw-r--r-- 1 root root 965M Feb 20 15:19 d-98.csv\n-rw-r--r-- 1 root root 965M Feb 20 15:51 d-99.csv\n...\n95G\t/d_csv\n\nic| dur_write: 11639.161382436752\nic| dur_read: 732.3326630592346\nic| dur_conv: 0.6718065738677979\nic| dur_avg: 11983.071362257004\nic| avg: 49.9741248861226\n#+end_example\n\n***** zarr\n#+begin_example\n-rw-r--r-- 1 root root 356M Feb 21 09:36 95.0\n-rw-r--r-- 1 root root 356M Feb 21 09:36 96.0\n-rw-r--r-- 1 root root 356M Feb 21 09:36 97.0\n-rw-r--r-- 1 root root 356M Feb 21 09:36 98.0\n-rw-r--r-- 1 root root 356M Feb 21 09:36 99.0\n...\n35G\t/d_zarr\n\nic| dur_write: 298.37498664855957\nic| dur_read: 0.018457412719726562\nic| dur_conv: 0\nic| dur_avg: 220.6890745162964\nic| avg: 0.0023640413\n#+end_example\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightmachinery%2Fsoal_playground","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnightmachinery%2Fsoal_playground","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightmachinery%2Fsoal_playground/lists"}