https://github.com/nightmachinery/soal_playground
https://github.com/nightmachinery/soal_playground
Last synced: 10 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/nightmachinery/soal_playground
- Owner: NightMachinery
- License: mit
- Created: 2022-01-11T18:33:09.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-12-15T11:40:18.000Z (over 2 years ago)
- Last Synced: 2025-05-25T00:18:09.321Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 24.4 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: readme.org
- License: LICENSE
Awesome Lists containing this project
README
#+TITLE: soal_playground
This file is authored in org-mode markup, and it is better viewed [[https://github.com/NightMachinary/soal_playground/raw/master/readme.org][raw]] than the default Github rendering view.
* project todos
** periphal
*** Investigate why cuML is consuming so much memory.
**** [[id:f8dc1a3d-afa6-4f5c-98c2-7b0a836f30ab][memleak/gen:rapidsai/cudf#10107 {BUG} Creating a DataFrame from a numpy array consumes too much RAM]]
*** Rebenchmark =python run_one.py kmeans_mb2e10_sklearn_iter10e4_dask fcps_dietary_survey_IBS= on Colab; its score is normal on my laptop, but it is too low on Colab.
*** Create a =conda= constructor.
- @alt Compress the whole =conda= directory and persist it.
**** [[https://colab.research.google.com/drive/1HjikV9AS7X4eklbPtauTG_N6XNGIwOHG#scrollTo=xor-KoTA1dYX]]
**** [[https://github.com/conda/constructor/issues/488][conda/constructor#488 Weird conflict errors]]
**** [[https://github.com/conda-incubator/condacolab/issues/22][conda-incubator/condacolab#22 Weird conflict errors]]
*** DONE =hdbscan= has a numpy incompatibility problem in the GPU mode.
:PROPERTIES:
:visibility: folded
:END:
- Update: I think adding =hdbscan= and =numpy= as explicit deps to =conda= solved this.
#+begin_example python
Traceback (most recent call last):
File "run_one.py", line 8, in
from soalpy.runners import *
File "/usr/local/lib/python3.8/site-packages/soalpy/runners.py", line 9, in
from hdbscan import HDBSCAN
File "/usr/local/lib/python3.8/site-packages/hdbscan/__init__.py", line 1, in
from .hdbscan_ import HDBSCAN, hdbscan
File "/usr/local/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 21, in
from ._hdbscan_linkage import (single_linkage,
File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
#+end_example
*** DONE =hdbscan= has problems with the FCPS data.
:PROPERTIES:
:visibility: folded
:END:
- Update: I think updating =xarray= solved this.
- works fine on my laptop though?!
#+begin_example python
cmd: python 'run_one.py' 'hdbscan_sklearn_best' 'fcps_leukemia'
ERROR: command failed with 1
#### stats:
Command exited with non-zero status 1
1379144,4.45
#### out:
#### err:
##
RAPIDS not installed
INFO: metric switched to precomputed.
Traceback (most recent call last):
File "run_one.py", line 223, in
res = algo(dataset)
File "/usr/local/lib/python3.7/dist-packages/soalpy/hdbscan_runners.py", line 8, in hdbscan_sklearn_best
return run(dataset, mode="HDBSCAN", algorithm='best', **kwargs,)
File "/usr/local/lib/python3.7/dist-packages/soalpy/runners.py", line 159, in run
preds = clf.fit_predict(input_data)
File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1227, in fit_predict
self.fit(X)
File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1173, in fit
preds = clf.fit_predict(input_data)
File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1227, in fit_predict
self.fit(X)
File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1173, in fit
check_precomputed_distance_matrix(X)
File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 393, in check_precomputed_distance_matrix
tmp[np.isinf(tmp)] = 1
File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 715, in __setitem__
obj = self[key]
File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 706, in __getitem__
return self.isel(indexers=self._item_key_to_dict(key))
File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 1140, in isel
indexers, drop=drop, missing_dims=missing_dims
File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataset.py", line 2275, in _isel_fancy
name, var, self.xindexes[name], var_indexers
File "/usr/local/lib/python3.7/dist-packages/xarray/core/indexes.py", line 295, in isel_variable_and_index
new_variable = variable.isel(indexers)
File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 1135, in isel
return self[key]
File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 779, in __getitem__
dims, indexer, new_order = self._broadcast_indexes(key)
File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 622, in _broadcast_indexes
self._validate_indexers(key)
File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 670, in _validate_indexers
"not supported. ".format(k.ndim)
IndexError: 2-dimensional boolean indexing is not supported.
#+end_example
*** IGNORE @upstreamBug? =hdbscan_cuml= has problems with =fcps_leukemia=
#+begin_example
##### Algorithm: hdbscan_cuml
cmd: python 'run_one.py' 'hdbscan_cuml' 'fcps_leukemia'
ERROR: command failed with 1
#### stats:
Command exited with non-zero status 1
2115364,6.80
#### out:
#### err:
##
INFO: metric switched to precomputed.
Traceback (most recent call last):
File "run_one.py", line 223, in
res = algo(dataset)
File "/root/miniconda3/lib/python3.8/site-packages/soalpy/hdbscan_runners.py", line 5, in hdbscan_cuml
return run(dataset, mode="cuHDBSCAN")
File "/root/miniconda3/lib/python3.8/site-packages/soalpy/runners.py", line 159, in run
preds = clf.fit_predict(input_data)
File "/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
ret_val = func(*args, **kwargs)
File "cuml/cluster/hdbscan.pyx", line 671, in cuml.cluster.hdbscan.HDBSCAN.fit_predict
File "/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
return func(*args, **kwargs)
File "cuml/cluster/hdbscan.pyx", line 638, in cuml.cluster.hdbscan.HDBSCAN.fit
File "cuml/common/base.pyx", line 270, in cuml.common.base.Base.__getattr__
AttributeError
####
ERROR: exit_code=1. deleted: /content/drive/MyDrive/soalpy/benchmarks/fcps_leukemia/hdbscan/hdbscan_cuml
#+end_example
*** DONE Save the generated datasets in =run_one.py= to avoid the upstream memory issues.
*** DONE @upstreamBug Jupyter memory leak
**** [[https://colab.research.google.com/drive/1UpqpMbb6fpCZFDXNZ-Q5i72aAqn8R2cI?usp=sharing][reproduction steps]]
**** [[https://github.com/ipython/ipython/issues/3452#thread-subscription-status][ipython/ipython#3452 Memory leak even when cache_size = 0 and history_length = 0 or history_length = 1]]
*** @toread
**** Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
***** chapter 21 (clustering)
*** preprocessing
**** [[file:./dimension reduction.org]]
**** normalization
#+begin_example python
from sklearn import pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data_train, data_test, targets_train, targets_test = train_test_split(X, y, random_state=17)
mm = pipeline.make_pipeline(MinMaxScaler(), Normalizer())
data_train = mm.fit_transform(data_train)
#+end_example
*** @? sparsity support
** phase I
*** [[./data/datasets.org][Find good datasets.]]
*** benchmark a clustering algorithm (e.g., k-means) on:
**** scalability
***** feature size (10k needed)
#+begin_quote
کلا داده تا حد چند 100 گیگ و زیر یک ترا مرز است
ولی این میتواند ضرب بعد در تعداد هم فرض شود
#+end_quote
#+begin_src bsh.dash :results verbatim :exports both :wrap results
ec $((10**(4+6)*8)) | numfmt-bytes
#: float64 is 8 bytes
#+end_src
#+RESULTS:
#+begin_results
75GiB
#+end_results
**** time
**** memory
**** parallelism on CPUs
**** GPU/TPU support
**** How much can it saturate the computing device?
**** correctness
***** internal clustering metrics?
***** completeness score
***** homogeneity score
**** flexibility of the implementation
***** hyperparameters
*** Find other clustering algorithms and repeat.
**** DBSCAN
***** HDBSCAN (expected to be the best algorithm for the job)
****** [[https://github.com/scikit-learn-contrib/hdbscan/issues/521][scikit-learn-contrib/hdbscan#521 Does HDBSCAN support out-of-core (incremental) training?]]
**** spectral clustering
**** gaussian mixture model (GMM)
***** Since we already have k-means, are GMMs useful?
**** @? latent lirichlet allocation (LDA)
**** @? power iteration clustering (PIC)
*** export CSV, HDF5
**** time of exporting and loading and size
**** try =gz=
**** results
***** =parquet=
****** no compression
#+begin_example
total 55G
-rw-r--r-- 1 root root 555M Feb 20 12:42 part.98.parquet
-rw-r--r-- 1 root root 555M Feb 20 12:50 part.99.parquet
-rw-r--r-- 1 root root 555M Feb 20 12:43 part.9.parquet
...
55G /d.parquet.none
ic| dur_write: 918.9540417194366
ic| dur_read: 125.95909476280212
ic| dur_conv: 0.6665265560150146
ic| dur_avg: 395.527277469635
ic| avg: -0.020827701
#+end_example
****** =compression=gzip=
#+begin_example
-rw-r--r-- 1 root root 520M Feb 20 13:43 part.0.parquet
-rw-r--r-- 1 root root 520M Feb 20 14:03 part.10.parquet
-rw-r--r-- 1 root root 520M Feb 20 13:46 part.11.parquet
-rw-r--r-- 1 root root 520M Feb 20 14:04 part.12.parquet
...
parquet compression=gzip
total 51G
ic| dur_write: 2132.026951789856
ic| dur_read: 113.30188322067261
ic| dur_conv: 0.6828622817993164
ic| dur_avg: 389.4808497428894
ic| avg: -0.020827701
#+end_example
****** =compression=snappy=
#+begin_example
-rw-r--r-- 1 root root 555M Feb 20 13:07 part.98.parquet
-rw-r--r-- 1 root root 555M Feb 20 13:06 part.99.parquet
-rw-r--r-- 1 root root 555M Feb 20 13:12 part.9.parquet
...
55G /d.parquet.snappy
ic| dur_write: 975.4363565444946
ic| dur_read: 125.97352576255798
ic| dur_conv: 0.6695859432220459
ic| dur_avg: 402.8658866882324
ic| avg: -0.020827701
#+end_example
****** =compression=brotli=
#+begin_example
50G /.d.parquet.brotli
ic| dur_write: 3271.8567810058594
ic| dur_read: 115.65357375144958
ic| dur_conv: 0.6975142955780029
ic| dur_avg: 399.2433009147644
#+end_example
***** CSV
****** gzip
#+begin_example
> du -h d-00.csv
439M d-00.csv
> du -h =(zcat d-00.csv)
965M /tmp/zsh2ilH9S
#+end_example
#+begin_example
43G /d_csv
ic| dur_write: ~ 3 hours
ic| dur_read: 45.56965947151184
ic| dur_conv: 0.7060840129852295
ic| dur_avg: 9023.461018323898
ic| avg: 50.03045087413395
#+end_example
****** no compression
#+begin_example
-rw-r--r-- 1 root root 965M Feb 20 17:15 d-97.csv
-rw-r--r-- 1 root root 965M Feb 20 15:19 d-98.csv
-rw-r--r-- 1 root root 965M Feb 20 15:51 d-99.csv
...
95G /d_csv
ic| dur_write: 11639.161382436752
ic| dur_read: 732.3326630592346
ic| dur_conv: 0.6718065738677979
ic| dur_avg: 11983.071362257004
ic| avg: 49.9741248861226
#+end_example
***** zarr
#+begin_example
-rw-r--r-- 1 root root 356M Feb 21 09:36 95.0
-rw-r--r-- 1 root root 356M Feb 21 09:36 96.0
-rw-r--r-- 1 root root 356M Feb 21 09:36 97.0
-rw-r--r-- 1 root root 356M Feb 21 09:36 98.0
-rw-r--r-- 1 root root 356M Feb 21 09:36 99.0
...
35G /d_zarr
ic| dur_write: 298.37498664855957
ic| dur_read: 0.018457412719726562
ic| dur_conv: 0
ic| dur_avg: 220.6890745162964
ic| avg: 0.0023640413
#+end_example