{"id":13415326,"url":"https://github.com/kLabUM/rrcf","last_synced_at":"2025-03-14T22:33:19.037Z","repository":{"id":47123185,"uuid":"153873481","full_name":"kLabUM/rrcf","owner":"kLabUM","description":"🌲 Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams","archived":false,"fork":false,"pushed_at":"2024-02-24T12:21:01.000Z","size":4662,"stargazers_count":488,"open_issues_count":29,"forks_count":112,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-07-31T21:53:39.779Z","etag":null,"topics":["anomaly-detection","detect-outliers","machine-learning","outliers","python","random-forest","robust-random-cut-forest","streaming-data","tree"],"latest_commit_sha":null,"homepage":"https://klabum.github.io/rrcf/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kLabUM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-20T05:39:05.000Z","updated_at":"2024-07-26T12:23:04.000Z","dependencies_parsed_at":"2023-02-17T09:01:48.275Z","dependency_job_id":"ca3b3de8-cd80-4341-9221-62dee5634d95","html_url":"https://github.com/kLabUM/rrcf","commit_stats":{"total_commits":227,"total_committers":7,"mean_commits":32.42857142857143,"dds":0.1541850220264317,"last_synced_commit":"1795a1b4dd39a3ffb196be7e251b177b3d1ab489"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kLabUM%2Frrcf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kLabUM%2Frrcf/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kLabUM%2Frrcf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kLabUM%2Frrcf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kLabUM","download_url":"https://codeload.github.com/kLabUM/rrcf/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243658057,"owners_count":20326459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anomaly-detection","detect-outliers","machine-learning","outliers","python","random-forest","robust-random-cut-forest","streaming-data","tree"],"created_at":"2024-07-30T21:00:47.177Z","updated_at":"2025-03-14T22:33:19.030Z","avatar_url":"https://github.com/kLabUM.png","language":"Python","readme":"# rrcf 🌲🌲🌲\n[![Build Status](https://travis-ci.org/kLabUM/rrcf.svg?branch=master)](https://travis-ci.org/kLabUM/rrcf) [![Coverage Status](https://coveralls.io/repos/github/kLabUM/rrcf/badge.svg?branch=master)](https://coveralls.io/github/kLabUM/rrcf?branch=master) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/) ![GitHub](https://img.shields.io/github/license/kLabUM/rrcf.svg) [![status](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d/status.svg)](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d)\n\nImplementation of the *Robust Random Cut Forest Algorithm* for anomaly detection by [Guha et al. (2016)](http://proceedings.mlr.press/v48/guha16.pdf).\n\n\u003e S. Guha, N. 
Mishra, G. Roy, \u0026 O. Schrijvers, *Robust random cut forest based anomaly\n\u003e detection on streams*, in Proceedings of the 33rd International conference on machine\n\u003e learning, New York, NY, 2016 (pp. 2712-2721).\n\n## About\n\nThe *Robust Random Cut Forest* (RRCF) algorithm is an ensemble method for detecting outliers in streaming data. RRCF offers a number of features that many competing anomaly detection algorithms lack. Specifically, RRCF:\n\n- Is designed to handle streaming data.\n- Performs well on high-dimensional data.\n- Reduces the influence of irrelevant dimensions.\n- Gracefully handles duplicates and near-duplicates that could otherwise mask the presence of outliers.\n- Features an anomaly-scoring algorithm with a clear underlying statistical meaning.\n\nThis repository provides an open-source implementation of the RRCF algorithm and its core data structures for the purposes of facilitating experimentation and enabling future extensions of the RRCF algorithm.\n\n## Documentation\n\nRead the docs [here 📖](https://klabum.github.io/rrcf/).\n\n## Installation\n\nUse `pip` to install `rrcf` via pypi:\n\n```shell\n$ pip install rrcf\n```\n\nCurrently, only Python 3 is supported.\n\n### Dependencies\n\nThe following dependencies are *required* to install and use `rrcf`:\n\n- [numpy](http://www.numpy.org/) (\u003e= 1.15)\n\nThe following *optional* dependencies are required to run the examples shown in the documentation:\n\n- [pandas](https://pandas.pydata.org/) (\u003e= 0.23)\n- [scipy](https://www.scipy.org/) (\u003e= 1.2)\n- [scikit-learn](https://scikit-learn.org/stable/) (\u003e= 0.20)\n- [matplotlib](https://matplotlib.org/) (\u003e= 3.0)\n\nListed version numbers have been tested and are known to work (this does not necessarily preclude older versions).\n\n## Robust random cut trees\n\nA robust random cut tree (RRCT) is a binary search tree that can be used to detect outliers in a point set. A RRCT can be instantiated from a point set. 
Points can also be added and removed from an RRCT.\n\n### Creating the tree\n\n```python\nimport numpy as np\nimport rrcf\n\n# A (robust) random cut tree can be instantiated from a point set (n x d)\nX = np.random.randn(100, 2)\ntree = rrcf.RCTree(X)\n\n# A random cut tree can also be instantiated with no points\ntree = rrcf.RCTree()\n```\n\n### Inserting points\n\n```python\ntree = rrcf.RCTree()\n\nfor i in range(6):\n    x = np.random.randn(2)\n    tree.insert_point(x, index=i)\n```\n\n```\n─+\n ├───+\n │   ├───+\n │   │   ├──(0)\n │   │   └───+\n │   │       ├──(5)\n │   │       └──(4)\n │   └───+\n │       ├──(2)\n │       └──(3)\n └──(1)\n```\n\n### Deleting points\n\n```\ntree.forget_point(2)\n```\n\n```\n─+\n ├───+\n │   ├───+\n │   │   ├──(0)\n │   │   └───+\n │   │       ├──(5)\n │   │       └──(4)\n │   └──(3)\n └──(1)\n```\n\n## Anomaly score\n\nThe likelihood that a point is an outlier is measured by its collusive displacement (CoDisp): if including a new point significantly changes the model complexity (i.e. bit depth), then that point is more likely to be an outlier.\n\n```python\n# Seed tree with zero-mean, normally distributed data\nX = np.random.randn(100,2)\ntree = rrcf.RCTree(X)\n\n# Generate an inlier and outlier point\ninlier = np.array([0, 0])\noutlier = np.array([4, 4])\n\n# Insert into tree\ntree.insert_point(inlier, index='inlier')\ntree.insert_point(outlier, index='outlier')\n```\n\n```python\ntree.codisp('inlier')\n\u003e\u003e\u003e 1.75\n```\n\n```python\ntree.codisp('outlier')\n\u003e\u003e\u003e 39.0\n```\n\n## Batch anomaly detection\n\nThis example shows how a robust random cut forest can be used to detect outliers in a batch setting. 
Outliers correspond to large CoDisp.\n\n```python\nimport numpy as np\nimport pandas as pd\nimport rrcf\n\n# Set parameters\nnp.random.seed(0)\nn = 2010\nd = 3\nnum_trees = 100\ntree_size = 256\n\n# Generate data\nX = np.zeros((n, d))\nX[:1000,0] = 5\nX[1000:2000,0] = -5\nX += 0.01*np.random.randn(*X.shape)\n\n# Construct forest\nforest = []\nwhile len(forest) \u003c num_trees:\n    # Select random subsets of points uniformly from point set\n    ixs = np.random.choice(n, size=(n // tree_size, tree_size),\n                           replace=False)\n    # Add sampled trees to forest\n    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]\n    forest.extend(trees)\n\n# Compute average CoDisp\navg_codisp = pd.Series(0.0, index=np.arange(n))\nindex = np.zeros(n)\nfor tree in forest:\n    codisp = pd.Series({leaf : tree.codisp(leaf) for leaf in tree.leaves})\n    avg_codisp[codisp.index] += codisp\n    np.add.at(index, codisp.index.values, 1)\navg_codisp /= index\n```\n\n![Image](https://github.com/kLabUM/rrcf/blob/master/resources/batch.png)\n\n## Streaming anomaly detection\n\nThis example shows how the algorithm can be used to detect anomalies in streaming time series data.\n\n```python\nimport numpy as np\nimport rrcf\n\n# Generate data\nn = 730\nA = 50\ncenter = 100\nphi = 30\nT = 2*np.pi/100\nt = np.arange(n)\nsin = A*np.sin(T*t-phi*T) + center\nsin[235:255] = 80\n\n# Set tree parameters\nnum_trees = 40\nshingle_size = 4\ntree_size = 256\n\n# Create a forest of empty trees\nforest = []\nfor _ in range(num_trees):\n    tree = rrcf.RCTree()\n    forest.append(tree)\n    \n# Use the \"shingle\" generator to create rolling window\npoints = rrcf.shingle(sin, size=shingle_size)\n\n# Create a dict to store anomaly score of each point\navg_codisp = {}\n\n# For each shingle...\nfor index, point in enumerate(points):\n    # For each tree in the forest...\n    for tree in forest:\n        # If tree is above permitted size, drop the oldest point (FIFO)\n        if 
len(tree.leaves) \u003e tree_size:\n            tree.forget_point(index - tree_size)\n        # Insert the new point into the tree\n        tree.insert_point(point, index=index)\n        # Compute codisp on the new point and take the average among all trees\n        if not index in avg_codisp:\n            avg_codisp[index] = 0\n        avg_codisp[index] += tree.codisp(index) / num_trees\n```\n\n![Image](https://github.com/kLabUM/rrcf/blob/master/resources/sine.png)\n\n## Obtain feature importance\n\nThis example shows how to estimate the feature importance using the dimension of cut obtained during the calculation of the CoDisp.\n\n\n```python\nimport numpy as np\nimport pandas as pd\nimport rrcf\n\n# Set parameters\nnp.random.seed(0)\nn = 2010\nd = 3\nnum_trees = 100\ntree_size = 256\n\n# Generate data\nX = np.zeros((n, d))\nX[:1000,0] = 5\nX[1000:2000,0] = -5\nX += 0.01*np.random.randn(*X.shape)\n\n# Construct forest\nforest = []\nwhile len(forest) \u003c num_trees:\n    # Select random subsets of points uniformly from point set\n    ixs = np.random.choice(n, size=(n // tree_size, tree_size),\n                           replace=False)\n    # Add sampled trees to forest\n    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]\n    forest.extend(trees)\n\n\n# Compute average CoDisp with the cut dimension for each point\ndim_codisp = np.zeros([n,d],dtype=float)\nindex = np.zeros(n)\nfor tree in forest:\n    for leaf in tree.leaves:\n        codisp,cutdim = tree.codisp_with_cut_dimension(leaf)\n        \n        dim_codisp[leaf,cutdim] += codisp \n\n        index[leaf] += 1\n\navg_codisp = dim_codisp.sum(axis=1)/index\n\n#codisp anomaly threshold and calculate the mean over each feature\nfeature_importance_anomaly = np.mean(dim_codisp[avg_codisp\u003e50,:],axis=0)\n#create a dataframe with the feature importance\ndf_feature_importance = 
pd.DataFrame(feature_importance_anomaly,columns=['feature_importance'])\ndf_feature_importance\n```\n![Image](https://raw.githubusercontent.com/kLabUM/rrcf/master/feature_importance.png)\n\n\n\n## Contributing\n\nWe welcome contributions to the `rrcf` repo. To contribute, submit a [pull request](https://help.github.com/en/articles/about-pull-requests) to the `dev` branch.\n\n#### Types of contributions\n\nSome suggested types of contributions include:\n\n- Bug fixes\n- Documentation improvements\n- Performance enhancements\n- Extensions to the algorithm\n\nCheck the issue tracker for any specific issues that need help. If you encounter a problem using `rrcf`, or have an idea for an extension, feel free to raise an issue.\n\n#### Guidelines for contributors\n\nPlease consider the following guidelines when contributing to the codebase:\n\n- Ensure that any new methods, functions or classes include docstrings. Docstrings should include a description of the code, as well as descriptions of the inputs (arguments) and outputs (returns). Providing an example use case is recommended (see existing methods for examples).\n- Write unit tests for any new code and ensure that all tests are passing with no warnings. Please ensure that overall code coverage does not drop below 80%.\n\n#### Running unit tests\n\nTo run unit tests, first ensure that `pytest` and `pytest-cov` are installed:\n\n```\n$ pip install pytest pytest-cov\n```\n\nTo run the tests, navigate to the root directory of the repo and run:\n\n```\n$ pytest --cov=rrcf/\n```\n\n## Citing\n\nIf you have used this codebase in a publication and wish to cite it, please use the [`Journal of Open Source Software article`](https://joss.theoj.org/papers/10.21105/joss.01336).\n\n\u003e M. Bartos, A. Mullapudi, \u0026 S. 
Troutman, *rrcf: Implementation of the Robust\n\u003e Random Cut Forest algorithm for anomaly detection on streams*,\n\u003e in: Journal of Open Source Software, The Open Journal, Volume 4, Number 35.\n\u003e 2019\n\n```bibtex\n@article{bartos_2019_rrcf,\n  title={{rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams}},\n  author={Matthew Bartos and Abhiram Mullapudi and Sara Troutman},\n  journal={{The Journal of Open Source Software}},\n  volume={4},\n  number={35},\n  pages={1336},\n  year={2019}\n}\n```\n","funding_links":[],"categories":["Python","📦 Packages","异常检测","异常检测包","Anomaly Detection Software"],"sub_categories":["Python"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FkLabUM%2Frrcf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FkLabUM%2Frrcf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FkLabUM%2Frrcf/lists"}