{"id":20735143,"url":"https://github.com/djsutherland/py-sdm","last_synced_at":"2025-04-23T23:41:47.805Z","repository":{"id":5978346,"uuid":"7200365","full_name":"djsutherland/py-sdm","owner":"djsutherland","description":"Python implementation of nonparametric nearest-neighbor-based estimators for divergences between distributions.","archived":false,"fork":false,"pushed_at":"2017-03-13T23:20:19.000Z","size":8447,"stargazers_count":48,"open_issues_count":31,"forks_count":8,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-04-16T04:01:44.184Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://cs.cmu.edu/~dsutherl/sdm/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/djsutherland.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-12-17T06:23:00.000Z","updated_at":"2023-07-25T07:05:06.000Z","dependencies_parsed_at":"2022-08-06T06:15:39.430Z","dependency_job_id":null,"html_url":"https://github.com/djsutherland/py-sdm","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djsutherland%2Fpy-sdm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djsutherland%2Fpy-sdm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djsutherland%2Fpy-sdm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djsutherland%2Fpy-sdm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/djsutherland","download_url":"https://codeload.github.com/djsutherla
nd/py-sdm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225004730,"owners_count":17405659,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T05:34:34.098Z","updated_at":"2024-11-17T05:34:34.956Z","avatar_url":"https://github.com/djsutherland.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is a Python implementation of nonparametric divergence estimators.\n\nFor an introduction to the method and why you'd want to use it,\nsee http://cs.cmu.edu/~dsutherl/sdm/.\n\nCode homepage: https://github.com/dougalsutherland/py-sdm/.\n\nCode by Dougal J. Sutherland \u003cdsutherl@cs.cmu.edu\u003e\nbased partially on code by Liang Xiong \u003clxiong@cs.cmu.edu\u003e.\n\n**NOTE:** this package will soon be superseded by the more-generic and\nbetter-integrated-with-scikit-learn package\n`skl-groups \u003chttps://github.com/dougalsutherland/skl-groups/\u003e`_.\n\n\nInstallation\n------------\n\nThis code is written for Python 2.7, with 3.2+ compatibility in mind (but not\ntested). It is known not to work for 2.6, though adding support would not be\noverly difficult; let me know if you want that.\n\nIt is also only tested on Unix-like operating systems (in particular, on OS X,\nCentOS, and Ubuntu). All of the code except for the actual SVM wrappers\n*should* work on Windows, but it's untested. 
The SVM wrappers *should* work\nif you use ``n_proc=1``; if you try to use multiprocessing there it will complain\nand crash.\n\nIf you want to run with more than about a thousand objects, make sure that your\nnumpy and scipy are linked to a fast BLAS/LAPACK implementation like MKL, ACML,\nor OpenBLAS.\n\nThe easiest way to accomplish that is to use a pre-packaged distribution. I use\n`Anaconda \u003chttps://store.continuum.io/cshop/anaconda/\u003e`_. If you're affiliated\nwith an academic institution, you can get the MKL Optimizations add-on for free,\nwhich links numpy to Intel's fast MKL library; if so,\n``conda install accelerate`` installs them all. Anaconda (or EPD) also lets you\navoid having to compile scipy (which takes a long time) and install non-Python\nlibraries like hdf5.\n\nIt's also easiest to install py-sdm through binaries with the conda package\nmanager (part of Anaconda). There are currently only builds for 64-bit OS X and\n64-bit Linux, with Python 2.7. To do so::\n\n    conda install -c http://conda.binstar.org/dougal py-sdm\n\nIf you don't want to use binaries, there are various complications. 
See\n``INSTALL.rst`` for details.\n\n\nQuick Start Guide for Images\n----------------------------\n\nThis shows you the basics of how to do classification or regression on images.\n\n\nData Format\n===========\n\nIf you're doing classification, it's easiest if your images are in a single\ndirectory containing one directory per class, and images for that class in the\ndirectory: ``root-dir/class-name/image-name.jpg``\n\nIf you're doing regression, it's easiest to have your images all in a single\ndirectory, and a CSV file $target_name.csv with labels of the form::\n\n    image1.jpg,2.4\n\nfor each image in the directory (no header).\n\n\nExtracting Features\n===================\n\nThis step extracts SIFT features for a collection of images.\n\nThe basic command is something like::\n\n    extract_image_features --root-dir path-to-root-dir --color hsv feats_raw.h5\n\nfor classification, or::\n\n    extract_image_features --dirs path-to-root-dir --color hsv feats_raw.h5\n\nfor regression.\n\nThis by default spawns one process per core to extract features (each of which\nuses only one thread); this can be controlled with the ``--n-proc`` argument.\n\nYou're likely to want to use the ``--resize`` option if your images are large\nand/or of widely varying sizes. 
We typically resize them to be about 100px wide\nor so.\n\nSee ``--help`` for more options.\n\n\nPost-Processing Features\n========================\n\nThis step handles \"blanks,\" does dimensionality reduction via PCA, adds\nspatial information, and standardizes features.\n\nThe basic command is::\n\n    proc_image_features --pca-varfrac 0.7 feats_raw.h5 feats_pca.h5\n\nThis by default does a dense PCA; if you have a lot of images and/or the images\nare large, it'll take a lot of memory.\nYou can reduce memory requirements a lot by replacing the ``--pca-varfrac 0.7``\nwith something like ``--pca-k 50 --pca-random``, which will do a randomized SVD\nto reduce dimensionality to 50; you have to specify a specific dimension rather\nthan a fraction of variance, though.\n\nIf you have a numpy linked to MKL or another fast BLAS library, it will\nprobably try to eat all your cores during the PCA; the ``OMP_NUM_THREADS``\nenvironment variable can limit that.\n\nAgain, other options are available via ``--help``.\n\n\nClassifying/Regressing\n======================\n\nOnce you have this, to calculate divergences and run the SVMs in one step you\ncan use a command like::\n\n    sdm cv --div-func renyi:.9 -K 5 --cv-folds 10 \\\n        feats_pca.h5 --div-cache-file feats_pca.divs.h5 \\\n        --output-file feats_pca.cv.npz\n\nfor cross-validation. This will cache the calculated divergences in\n``feats_pca.divs.h5``, and print out accuracy information as well as saving\npredictions and some other info in ``feats_pca.cv.npz``.\nThis can take a long time, especially when computing the divergences.\n\nFor regression, the command would look like::\n\n    sdm cv --nu-svr --div-func renyi:.9 -K 5 --cv-folds 10 \\\n        --labels-name target_name \\\n        feats_pca.h5 --div-cache-file feats_pca.divs.h5 \\\n        --output-file feats_pca.cv.npz\n\nThis uses ``--n-proc`` to specify the number of SVMs to run in parallel during\nparameter tuning. 
During the projection phase (which happens in serial), an\nMKL-linked numpy is likely to spawn many threads;\n``OMP_NUM_THREADS`` will again control this.\n\nMany more options are available via ``sdm cv --help``.\n\n``sdm`` also supports predicting using a training / test set through\n``sdm predict`` rather than ``sdm cv``, but there isn't currently code to\nproduce the input files it assumes. If this would be useful for you, let me\nknow and I'll write it.\n\n\nPrecomputing Divergences\n========================\n\nIf you'd like to try several divergence functions (e.g. different values of\nalpha or K), it's much more efficient to compute them all at once than to\nlet ``sdm`` do them all separately.\n\n(This will hopefully no longer be true once ``sdm`` crossvalidates among\ndivergence functions and Ks:\n`issue #12 \u003chttps://github.com/dougalsutherland/py-sdm/issues/12\u003e`_.)\n\nThe ``estimate_divs`` command does this; run it along the lines of::\n\n    estimate_divs --div-funcs kl renyi:.8 renyi:.9 renyi:.99 -K 1 3 5 10 -- \\\n        feats_pca.h5 feats_pca.divs.h5\n\n(where the ``--`` indicates that the ``-K`` arguments are done and it's time for\npositional args.)\n\n\n\nQuick Start Guide For General Features\n--------------------------------------\n\nIf you don't want to use the image feature extraction code above, you have two\nmain options for using SDMs.\n\n\nMaking Compatible Files\n=======================\n\nOne option is to make an hdf5 file compatible with the output of\n``extract_image_features`` and ``proc_image_features``, e.g. 
with ``h5py``.\nThe structure that you want to make is::\n\n    /cat1          # the name of a category\n      /bag1        # the name of each data sample\n        /features  # a row-instance feature matrix\n        /label-1   # a scalar dataset with the value of label-1\n        /label-2   # scalar dataset with a second label type\n      /bag2\n        ...\n    /cat2\n      ...\n\nSome notes:\n\n* All of the names except ``features`` can be replaced with whatever you like.\n* If you have a single \"natural\" classification label, it can be convenient to\n  use that for the category, but you can put them all in the same category if\n  you like.\n* The feature matrices can have any number of rows but must all have the same\n  number of columns.\n* Different bags need not have the same labels available, unless you want to use\n  them for training / cross-validating in ``sdm``. Each bag can have any number\n  of labels.\n\nAlternatively, you can use the \"per-bag\" format, where you make a ``.npz``\nfile (with ``np.savez``) at ``root-path/cat-name/bag-name.npz`` with a\n``features`` matrix and any labels (as above).\n\nDepending on the nature of your features, you may want to run PCA on them,\nstandardize the dimensions, or perform other normalizations. You can do PCA and\nstandardization with ``proc_image_features``, as long as you make sure to pass\n``--blank-handler none --no-add-x --no-add-y`` so it doesn't try to do\nimage-specific processing.\n\nYou can then use ``sdm`` as above.\n\n\nUsing the API\n=============\n\nYou can also use the API directly. 
The following shows basic usage in the\nsituation where test data is not available at training time::\n\n    import numpy as np\n    import sdm\n\n    # train_features is a list of row-instance data matrices\n    # train_labels is a numpy vector of integer categories\n\n    # PCA and standardize the features\n    train_feats = sdm.Features(train_features)\n    pca = train_feats.pca(varfrac=0.7, ret_pca=True, inplace=True)\n    scaler = train_feats.standardize(ret_scaler=True, inplace=True)\n\n    clf = sdm.SDC()\n    clf.fit(train_feats, train_labels)\n    # ^ gets divergences and does parameter tuning. See the docstrings for\n    # more information about options, divergence caches, etc. Caching\n    # divergences is highly recommended.\n\n    # get test_features: another list of row-instance data matrices\n    # and then process them consistently with the training samples,\n    # reusing the PCA and scaler fitted on the training set\n    test_feats = sdm.Features(test_features, default_category='test')\n    test_feats.pca(pca=pca, inplace=True)\n    test_feats.standardize(scaler=scaler, inplace=True)\n\n    # get test predictions\n    preds = clf.predict(test_feats)\n\n    # test_labels is a numpy vector of the true test categories\n    accuracy = np.mean(preds == test_labels)\n\nTo do regression, use ``clf = sdm.NuSDR()`` and a real-valued ``train_labels``;\nthe rest of the usage is the same.\n\nIf you're running on a nontrivial amount of data, it may be nice to pass\n``status_fn=True`` and ``progressbar=True`` to the constructor to get status\ninformation out along the way (like in the CLI).\n\nIf test data is available at training time, it's preferable to use\n``.transduct()`` instead. There's also a ``.crossvalidate()`` method.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjsutherland%2Fpy-sdm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdjsutherland%2Fpy-sdm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjsutherland%2Fpy-sdm/lists"}