https://github.com/nightmachinery/soal_playground

Host: GitHub
URL: https://github.com/nightmachinery/soal_playground
Owner: NightMachinery
License: mit
Created: 2022-01-11T18:33:09.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2023-12-15T11:40:18.000Z (over 2 years ago)
Last Synced: 2025-05-25T00:18:09.321Z (about 1 year ago)
Language: Jupyter Notebook
Size: 24.4 MB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: readme.org
- License: LICENSE
README

#+TITLE: soal_playground

This file is written in org-mode markup and reads better [[https://github.com/NightMachinery/soal_playground/raw/master/readme.org][raw]] than in GitHub's default rendering.

* project todos

** peripheral

*** Investigate why cuML is consuming so much memory.

**** [[id:f8dc1a3d-afa6-4f5c-98c2-7b0a836f30ab][memleak/gen:rapidsai/cudf#10107 {BUG} Creating a DataFrame from a numpy array consumes too much RAM]]

*** Rebenchmark =python run_one.py kmeans_mb2e10_sklearn_iter10e4_dask fcps_dietary_survey_IBS= on Colab; its score is normal on my laptop, but it is too low on Colab.

*** Create a =conda= constructor.

- @alt Compress the whole =conda= directory and persist it.

**** [[https://colab.research.google.com/drive/1HjikV9AS7X4eklbPtauTG_N6XNGIwOHG#scrollTo=xor-KoTA1dYX]]

**** [[https://github.com/conda/constructor/issues/488][conda/constructor#488 Weird conflict errors]]

**** [[https://github.com/conda-incubator/condacolab/issues/22][conda-incubator/condacolab#22 Weird conflict errors]]

*** DONE =hdbscan= has a numpy incompatibility problem in the GPU mode.

:PROPERTIES:

:visibility: folded

:END:

- Update: I think adding =hdbscan= and =numpy= as explicit deps to =conda= solved this.

#+begin_example python

Traceback (most recent call last):

  File "run_one.py", line 8, in <module>

    from soalpy.runners import *

  File "/usr/local/lib/python3.8/site-packages/soalpy/runners.py", line 9, in <module>

    from hdbscan import HDBSCAN

  File "/usr/local/lib/python3.8/site-packages/hdbscan/__init__.py", line 1, in <module>

    from .hdbscan_ import HDBSCAN, hdbscan

  File "/usr/local/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 21, in <module>

    from ._hdbscan_linkage import (single_linkage,

  File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

#+end_example
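
An error like the one above typically appears when a compiled extension (here, the Cython modules of =hdbscan=) was built against a different NumPy version than the one installed. A minimal sketch for recording which versions are actually present in the environment, using only the standard library; =installed_versions= is a hypothetical helper name, not part of =soalpy=:

#+begin_src python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Compare these against the versions pinned in the conda environment.
print(installed_versions(["numpy", "hdbscan"]))
#+end_src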

*** DONE =hdbscan= has problems with the FCPS data.

:PROPERTIES:

:visibility: folded

:END:

- Update: I think updating =xarray= solved this.

- works fine on my laptop though?!

#+begin_example python

cmd: python 'run_one.py' 'hdbscan_sklearn_best' 'fcps_leukemia'

ERROR: command failed with 1

#### stats:

Command exited with non-zero status 1

1379144,4.45

#### out:

#### err:

##

RAPIDS not installed

INFO: metric switched to precomputed.

Traceback (most recent call last):

  File "run_one.py", line 223, in <module>

    res = algo(dataset)

  File "/usr/local/lib/python3.7/dist-packages/soalpy/hdbscan_runners.py", line 8, in hdbscan_sklearn_best

    return run(dataset, mode="HDBSCAN", algorithm='best', **kwargs,)

  File "/usr/local/lib/python3.7/dist-packages/soalpy/runners.py", line 159, in run

    preds = clf.fit_predict(input_data)

  File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1227, in fit_predict

    self.fit(X)

  File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 1173, in fit

    check_precomputed_distance_matrix(X)

  File "/usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py", line 393, in check_precomputed_distance_matrix

    tmp[np.isinf(tmp)] = 1

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 715, in __setitem__

    obj = self[key]

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 706, in __getitem__

    return self.isel(indexers=self._item_key_to_dict(key))

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataarray.py", line 1140, in isel

    indexers, drop=drop, missing_dims=missing_dims

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/dataset.py", line 2275, in _isel_fancy

    name, var, self.xindexes[name], var_indexers

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/indexes.py", line 295, in isel_variable_and_index

    new_variable = variable.isel(indexers)

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 1135, in isel

    return self[key]

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 779, in __getitem__

    dims, indexer, new_order = self._broadcast_indexes(key)

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 622, in _broadcast_indexes

    self._validate_indexers(key)

  File "/usr/local/lib/python3.7/dist-packages/xarray/core/variable.py", line 670, in _validate_indexers

    "not supported. ".format(k.ndim)

IndexError: 2-dimensional boolean indexing is not supported.

#+end_example

*** IGNORE @upstreamBug? =hdbscan_cuml= has problems with =fcps_leukemia=

#+begin_example

##### Algorithm: hdbscan_cuml

cmd: python 'run_one.py' 'hdbscan_cuml' 'fcps_leukemia'

ERROR: command failed with 1

#### stats:

Command exited with non-zero status 1

2115364,6.80

#### out:

#### err:

##

INFO: metric switched to precomputed.

Traceback (most recent call last):

  File "run_one.py", line 223, in <module>

    res = algo(dataset)

  File "/root/miniconda3/lib/python3.8/site-packages/soalpy/hdbscan_runners.py", line 5, in hdbscan_cuml

    return run(dataset, mode="cuHDBSCAN")

  File "/root/miniconda3/lib/python3.8/site-packages/soalpy/runners.py", line 159, in run

    preds = clf.fit_predict(input_data)

  File "/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get

    ret_val = func(*args, **kwargs)

  File "cuml/cluster/hdbscan.pyx", line 671, in cuml.cluster.hdbscan.HDBSCAN.fit_predict

  File "/root/miniconda3/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters

    return func(*args, **kwargs)

  File "cuml/cluster/hdbscan.pyx", line 638, in cuml.cluster.hdbscan.HDBSCAN.fit

  File "cuml/common/base.pyx", line 270, in cuml.common.base.Base.__getattr__

AttributeError

####

ERROR: exit_code=1. deleted: /content/drive/MyDrive/soalpy/benchmarks/fcps_leukemia/hdbscan/hdbscan_cuml

#+end_example

*** DONE Save the generated datasets in =run_one.py= to avoid the upstream memory issues.

*** DONE @upstreamBug Jupyter memory leak

**** [[https://colab.research.google.com/drive/1UpqpMbb6fpCZFDXNZ-Q5i72aAqn8R2cI?usp=sharing][reproduction steps]]

**** [[https://github.com/ipython/ipython/issues/3452#thread-subscription-status][ipython/ipython#3452 Memory leak even when cache_size = 0 and history_length = 0 or history_length = 1]]

*** @toread

**** Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

***** chapter 21 (clustering)

*** preprocessing

**** [[file:./dimension reduction.org]]

**** normalization

#+begin_example python

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer

X, y = load_breast_cancer(return_X_y=True)
data_train, data_test, targets_train, targets_test = train_test_split(X, y, random_state=17)

# Scale each feature to [0, 1], then rescale each sample to unit L2 norm.
mm = make_pipeline(MinMaxScaler(), Normalizer())

# Fit the scaler on the training split only, then reuse it on the test split.
data_train = mm.fit_transform(data_train)
data_test = mm.transform(data_test)

#+end_example

*** @? sparsity support

** phase I

*** [[./data/datasets.org][Find good datasets.]]

*** benchmark a clustering algorithm (e.g., k-means) on:

**** scalability

***** feature size (10k needed)

#+begin_quote

Overall, the data size limit is a few hundred gigabytes, under one terabyte.

However, this can also be taken as the product of the dimensionality and the number of samples.

#+end_quote

#+begin_src bsh.dash :results verbatim :exports both :wrap results

ec $((10**(4+6)*8)) | numfmt-bytes

#: float64 is 8 bytes

#+end_src

#+RESULTS:

#+begin_results

75GiB

#+end_results
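
The same estimate can be reproduced in plain Python. This is a minimal sketch: splitting the 10^10 values into 10^6 samples of 10^4 features each is an assumption (the calculation above only fixes the product), and =dense_nbytes= is a hypothetical helper name.

#+begin_src python
def dense_nbytes(n_samples: int, n_features: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_samples x n_features array (float64 by default)."""
    return n_samples * n_features * itemsize


# Assumed split of the 10**(4+6) values: 10**6 samples with 10**4 features each.
nbytes = dense_nbytes(10**6, 10**4)
print(f"{nbytes / 2**30:.1f} GiB")  # roughly 74.5 GiB, in line with the ~75GiB above
#+end_src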

**** time

**** memory
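
A minimal sketch of how time and memory could be measured for a single run, assuming a scikit-learn style estimator with =fit_predict=. The helper =benchmark_fit_predict= and the synthetic data are illustrative assumptions, not the code in =run_one.py=. Note that =tracemalloc= only sees allocations made through Python's allocators (NumPy reports its buffers there), so it does not account for GPU memory or memory allocated directly by native libraries.

#+begin_src python
import time
import tracemalloc

import numpy as np
from sklearn.cluster import KMeans


def benchmark_fit_predict(clf, X):
    """Return (labels, elapsed seconds, peak traced bytes) for one fit_predict call."""
    tracemalloc.start()
    start = time.perf_counter()
    labels = clf.fit_predict(X)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return labels, elapsed, peak


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))  # small synthetic stand-in for a real dataset
clf = KMeans(n_clusters=3, n_init=10, random_state=0)
labels, elapsed, peak = benchmark_fit_predict(clf, X)
print(f"time: {elapsed:.3f} s, peak traced memory: {peak / 2**20:.1f} MiB")
#+end_src

Measuring the whole process from the outside (for example, maximum resident set size) gives a fuller picture for native and GPU code.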

**** parallelism on CPUs

**** GPU/TPU support

**** How much can it saturate the computing device?

**** correctness

***** internal clustering metrics?

***** completeness score

***** homogeneity score
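
Both scores are available in =sklearn.metrics= and require ground-truth labels. A small sketch on toy labels (the label vectors are made up for illustration) showing how over-segmentation affects the two scores differently:

#+begin_src python
from sklearn.metrics import completeness_score, homogeneity_score

labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]  # every point placed in its own cluster

# Homogeneity: each cluster contains members of a single class only.
h = homogeneity_score(labels_true, labels_pred)
# Completeness: all members of a class end up in the same cluster.
c = completeness_score(labels_true, labels_pred)
print(f"homogeneity={h:.2f}, completeness={c:.2f}")
#+end_src

Here every cluster is pure, so homogeneity stays at its maximum, while completeness drops because each class is split across two clusters.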

**** flexibility of the implementation

***** hyperparameters

*** Find other clustering algorithms and repeat.

**** DBSCAN

***** HDBSCAN (expected to be the best algorithm for the job)

****** [[https://github.com/scikit-learn-contrib/hdbscan/issues/521][scikit-learn-contrib/hdbscan#521 Does HDBSCAN support out-of-core (incremental) training?]]

**** spectral clustering

**** Gaussian mixture model (GMM)

***** Since we already have k-means, are GMMs useful?

**** @? latent Dirichlet allocation (LDA)

**** @? power iteration clustering (PIC)

*** export CSV, HDF5

**** export time, load time, and on-disk size

**** try =gz=
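
A minimal standard-library sketch of comparing plain and gzip-compressed CSV export on write time and file size. The data is synthetic and tiny, and the helper =write_csv= and the file names are illustrative assumptions; the measurements under "results" come from much larger files written with different tooling.

#+begin_src python
import csv
import gzip
import os
import random
import tempfile
import time


def write_csv(path, rows, compressed):
    """Write rows to path as CSV, gzip-compressed if requested; return seconds taken."""
    opener = gzip.open if compressed else open
    start = time.perf_counter()
    with opener(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)
    return time.perf_counter() - start


random.seed(0)
rows = [[random.random() for _ in range(20)] for _ in range(5000)]  # synthetic data

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, compressed in [("d.csv", False), ("d.csv.gz", True)]:
        path = os.path.join(tmp, name)
        dur_write = write_csv(path, rows, compressed)
        results[name] = (dur_write, os.path.getsize(path))
        print(f"{name}: dur_write={dur_write:.3f} s, size={results[name][1]} bytes")
#+end_src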

**** results

***** =parquet=

****** no compression

#+begin_example

total 55G

-rw-r--r-- 1 root root 555M Feb 20 12:42 part.98.parquet

-rw-r--r-- 1 root root 555M Feb 20 12:50 part.99.parquet

-rw-r--r-- 1 root root 555M Feb 20 12:43 part.9.parquet

...

55G	/d.parquet.none

ic| dur_write: 918.9540417194366

ic| dur_read: 125.95909476280212

ic| dur_conv: 0.6665265560150146

ic| dur_avg: 395.527277469635

ic| avg: -0.020827701

#+end_example

****** =compression=gzip=

#+begin_example

-rw-r--r-- 1 root root 520M Feb 20 13:43 part.0.parquet

-rw-r--r-- 1 root root 520M Feb 20 14:03 part.10.parquet

-rw-r--r-- 1 root root 520M Feb 20 13:46 part.11.parquet

-rw-r--r-- 1 root root 520M Feb 20 14:04 part.12.parquet

...

parquet compression=gzip

total 51G

ic| dur_write: 2132.026951789856

ic| dur_read: 113.30188322067261

ic| dur_conv: 0.6828622817993164

ic| dur_avg: 389.4808497428894

ic| avg: -0.020827701

#+end_example

****** =compression=snappy=

#+begin_example

-rw-r--r-- 1 root root 555M Feb 20 13:07 part.98.parquet

-rw-r--r-- 1 root root 555M Feb 20 13:06 part.99.parquet

-rw-r--r-- 1 root root 555M Feb 20 13:12 part.9.parquet

...

55G	/d.parquet.snappy

ic| dur_write: 975.4363565444946

ic| dur_read: 125.97352576255798

ic| dur_conv: 0.6695859432220459

ic| dur_avg: 402.8658866882324

ic| avg: -0.020827701

#+end_example

****** =compression=brotli=

#+begin_example

50G	/.d.parquet.brotli

ic| dur_write: 3271.8567810058594

ic| dur_read: 115.65357375144958

ic| dur_conv: 0.6975142955780029

ic| dur_avg: 399.2433009147644

#+end_example

***** CSV

****** gzip

#+begin_example

> du -h d-00.csv

439M    d-00.csv

> du -h =(zcat d-00.csv)

965M    /tmp/zsh2ilH9S

#+end_example

#+begin_example

43G	/d_csv

ic| dur_write: ~ 3 hours

ic| dur_read: 45.56965947151184

ic| dur_conv: 0.7060840129852295

ic| dur_avg: 9023.461018323898

ic| avg: 50.03045087413395

#+end_example

****** no compression

#+begin_example

-rw-r--r-- 1 root root 965M Feb 20 17:15 d-97.csv

-rw-r--r-- 1 root root 965M Feb 20 15:19 d-98.csv

-rw-r--r-- 1 root root 965M Feb 20 15:51 d-99.csv

...

95G	/d_csv

ic| dur_write: 11639.161382436752

ic| dur_read: 732.3326630592346

ic| dur_conv: 0.6718065738677979

ic| dur_avg: 11983.071362257004

ic| avg: 49.9741248861226

#+end_example

***** zarr

#+begin_example

-rw-r--r-- 1 root root 356M Feb 21 09:36 95.0

-rw-r--r-- 1 root root 356M Feb 21 09:36 96.0

-rw-r--r-- 1 root root 356M Feb 21 09:36 97.0

-rw-r--r-- 1 root root 356M Feb 21 09:36 98.0

-rw-r--r-- 1 root root 356M Feb 21 09:36 99.0

...

35G	/d_zarr

ic| dur_write: 298.37498664855957

ic| dur_read: 0.018457412719726562

ic| dur_conv: 0

ic| dur_avg: 220.6890745162964

ic| avg: 0.0023640413

#+end_example