{"id":13532035,"url":"https://github.com/lyst/rpforest","last_synced_at":"2025-04-06T22:11:00.333Z","repository":{"id":57462817,"uuid":"39015546","full_name":"lyst/rpforest","owner":"lyst","description":"It is a forest of random projection trees","archived":false,"fork":false,"pushed_at":"2020-02-08T00:44:48.000Z","size":6772,"stargazers_count":224,"open_issues_count":0,"forks_count":43,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-03-30T21:13:40.623Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyst.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-13T14:02:32.000Z","updated_at":"2025-03-22T08:07:32.000Z","dependencies_parsed_at":"2022-09-05T17:22:08.741Z","dependency_job_id":null,"html_url":"https://github.com/lyst/rpforest","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Frpforest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Frpforest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Frpforest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyst%2Frpforest/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyst","download_url":"https://codeload.github.com/lyst/rpforest/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247557767,"owners_count":20958047,"icon_ur
l":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T07:01:07.717Z","updated_at":"2025-04-06T22:11:00.302Z","avatar_url":"https://github.com/lyst.png","language":"Python","readme":"# rpforest\n\n![rpforest](https://raw.githubusercontent.com/lyst/rpforest/master/rpforest.jpg)\n\n[![CircleCI](https://circleci.com/gh/lyst/rpforest/tree/master.svg?style=svg\u0026circle-token=6ab982f5b17307152e1f3b42b00b8ecc074a764d)](https://circleci.com/gh/lyst/rpforest/tree/master)\n\nrpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given query point in a fast but approximate manner.\n\nrpforest differs from alternative ANN packages such as [annoy](https://github.com/spotify/annoy) by not requiring the storage of all the vectors indexed in the model. Used in this way, rpforest serves to produce a list of candidate ANNs for use by a further service where point vectors are stored (for example, a relational database).\n\n## How it works\n\nIt works by building a forest of N binary random projection trees.\n\nIn each tree, the set of training points is recursively partitioned into smaller and smaller subsets until a leaf node of at most M points is reached. 
Each partition is based on the cosine of the angle the points make with a randomly drawn hyperplane: points whose angle is smaller than the median angle fall in the left partition, and the remaining points fall in the right partition.\n\nThe resulting tree has a predictable leaf size (no larger than M) and is approximately balanced because of median splits, leading to consistent tree traversal times.\n\nQuerying the model is accomplished by traversing each tree to the query point's leaf node to retrieve ANN candidates from that tree, then merging them and sorting by distance to the query point.\n\n## Installation\n\n1. Install numpy first.\n2. Install rpforest using pip: `pip install rpforest`\n\n## Usage\n\n### Fitting\n\nModel fitting is straightforward:\n\n```python\nfrom rpforest import RPForest\n\nmodel = RPForest(leaf_size=50, no_trees=10)\nmodel.fit(X)\n```\n\nThe speed-precision tradeoff is governed by the `leaf_size` and `no_trees` parameters. Increasing `leaf_size` leads the model to produce shallower trees with larger leaf nodes; increasing `no_trees` fits more trees.\n\n### In-memory queries\n\nWhere the entire set of points can be kept in memory, rpforest supports in-memory ANN queries. After fitting, ANNs can be obtained by calling:\n\n```python\nnns = model.query(x_query, 10)\n```\n\nThis returns the nearest neighbours of the query vector `x_query` by first retrieving candidate NNs from its leaf nodes, then merging them and sorting by cosine similarity with the query. At most `no_trees * leaf_size` NNs can be returned.\n\n### Candidate queries\n\nrpforest can support indexing and candidate ANN queries on datasets larger than would fit in available memory. 
This is accomplished by first fitting the model on a subset of the data, then indexing a larger set of data into the fitted model:\n\n```python\nfrom rpforest import RPForest\n\nmodel = RPForest(leaf_size=50, no_trees=10)\nmodel.fit(X_train)\n\nmodel.clear()  # Deletes X_train vectors\n\nfor point_id, x in get_x_vectors():\n    model.index(point_id, x)\n\nnns = model.get_candidates(x_query, 10)\n```\n\n### Model persistence\n\nModel persistence is achieved simply by pickling and unpickling.\n\n```python\nimport pickle\n\nmodel = pickle.loads(pickle.dumps(model))\n```\n\n### Performance\n\n[Erik Bernhardsson](https://twitter.com/fulhack), the author of annoy, maintains an ANN [performance shootout](https://github.com/erikbern/ann-benchmarks) repository, comparing a number of Python ANN packages.\n\nOn the GloVe cosine distance benchmark, rpforest is not as fast as highly optimised C and C++ packages like FLANN and annoy. However, it far outperforms scikit-learn's [LSHForest](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html) and [panns](https://github.com/ryanrhymes/panns).\n\n![Performance](https://raw.githubusercontent.com/lyst/rpforest/master/glove.png)\n\n## Development\n\nPull requests are welcome. To install for development:\n\n1. Clone the rpforest repository: `git clone git@github.com:lyst/rpforest.git`\n2. Install it for development using pip: `cd rpforest \u0026\u0026 pip install -e .`\n3. 
You can run tests by running `python setup.py test`.\n\nWhen making changes to the `.pyx` extension files, you'll need to run `python setup.py cythonize` in order to produce the extension `.cpp` files before running `pip install -e .`.\n","funding_links":[],"categories":["Machine Learning","Awesome Vector Search Engine"],"sub_categories":["Random Forests","Library"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyst%2Frpforest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyst%2Frpforest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyst%2Frpforest/lists"}