{"id":15064015,"url":"https://github.com/garethjns/incrementaltrees","last_synced_at":"2025-08-19T02:33:15.250Z","repository":{"id":47702555,"uuid":"163000023","full_name":"garethjns/IncrementalTrees","owner":"garethjns","description":"Adds partial fit method to sklearn's forest estimators to allow incremental training without being limited to a linear model. Works with Dask-ml's Incremental.","archived":false,"fork":false,"pushed_at":"2024-06-18T00:19:06.000Z","size":447,"stargazers_count":35,"open_issues_count":5,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-18T17:12:39.149Z","etag":null,"topics":["dask-ml","incremental-learning","random-forest","sklearn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/garethjns.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-24T14:49:29.000Z","updated_at":"2024-03-18T09:20:06.000Z","dependencies_parsed_at":"2024-09-25T00:10:20.917Z","dependency_job_id":"9fb9b210-63d4-4080-9d82-96c255f8b823","html_url":"https://github.com/garethjns/IncrementalTrees","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garethjns%2FIncrementalTrees","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garethjns%2FIncrementalTrees/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garethjns%2FIncrementalTrees/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/garethjns%2FIncrementalTrees/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/garethjns","download_url":"https://codeload.github.com/garethjns/IncrementalTrees/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230310245,"owners_count":18206470,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask-ml","incremental-learning","random-forest","sklearn"],"created_at":"2024-09-25T00:10:15.134Z","updated_at":"2024-12-18T17:12:43.362Z","avatar_url":"https://github.com/garethjns.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Incremental trees\n![The overcomplicated tests are...](https://github.com/garethjns/IncrementalTrees/workflows/The%20overcomplicated%20tests%20are.../badge.svg)\n\nAdds a partial fit method to sklearn's forest estimators (currently RandomForestClassifier/Regressor and ExtraTreesClassifier/Regressor) to allow [incremental training](https://scikit-learn.org/0.15/modules/scaling_strategies.html) without being limited to a linear model. Works with or without [Dask-ml's Incremental](http://ml.dask.org/incremental.html).\n\nThese methods don't try to implement partial fitting for decision trees; rather, they remove the requirement that individual decision trees within forests are trained with the same data (or equally sized bootstraps). This reduces memory burden, training time, and variance. 
The cost is that more weak learners will generally be required. \n\nThe resulting forests are not \"true\" online learners, as batch size affects performance. However, they should have similar (possibly better) performance to their standard versions after seeing an equivalent number of training rows.\n\n## Installing package\n\nQuick start:\n\n1) Install the package with pip.\n   ````bash\n    pip install incremental_trees\n   ````\n\n\n## Usage Examples\nCurrently implemented:\n - Streaming versions of RandomForestClassifier (StreamingRFC) and ExtraTreesClassifier (StreamingEXTC). They should work for binary and multi-class classification, but not multi-output yet.\n - Streaming versions of RandomForestRegressor (StreamingRFR) and ExtraTreesRegressor (StreamingEXTR). \n\nSee:\n- Below for examples of using different mechanisms to feed .partial_fit() and different parameter setups.  \n- [notes/PerformanceComparisons.ipynb](https://github.com/garethjns/IncrementalTrees/blob/master/notes/PerformanceComparisons.ipynb) and [notes/PerformanceComparisonsDask.ipynb](https://github.com/garethjns/IncrementalTrees/blob/master/notes/PerformanceComparisonsDask.ipynb) for more examples and performance comparisons against RandomForest. 
There are also some (unfinished) performance comparisons in tests/.\n\n\n### Data feeding mechanisms\n\n#### Fitting with .fit()\nFeeds .partial_fit() with randomly sampled rows.\n\n\n````python\nimport numpy as np\nfrom sklearn.datasets import make_blobs\nfrom incremental_trees.models.classification.streaming_rfc import StreamingRFC\n\n# Generate some data in memory\nx, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40, centers=2, cluster_std=100)\n\nsrfc = StreamingRFC(n_estimators_per_chunk=3,\n                    max_n_estimators=np.inf,\n                    spf_n_fits=30,  # Number of calls to .partial_fit()\n                    spf_sample_prop=0.3)  # Proportion of rows to sample on each call to .partial_fit()\n\nsrfc.fit(x, y, sample_weight=np.ones_like(y))  # sample_weight is optional; it gets sampled along with the data\n\n# Should be n_estimators_per_chunk * spf_n_fits\nprint(len(srfc.estimators_))\nprint(srfc.score(x, y))\n````\n\n#### Fitting with .fit() and Dask\nCall .fit() directly and let Dask handle sending data to .partial_fit().\n\n````python\nimport numpy as np\nimport dask_ml.datasets\nfrom dask_ml.wrappers import Incremental\nfrom dask.distributed import Client, LocalCluster\nfrom dask import delayed\nfrom incremental_trees.models.classification.streaming_rfc import StreamingRFC\n\n# Generate some data out-of-core (sample and chunk counts passed as ints)\nx, y = dask_ml.datasets.make_blobs(n_samples=int(2e5), chunks=int(1e4), random_state=0,\n                                   n_features=40, centers=2, cluster_std=100)\n\n# Create throwaway cluster and client to run on\nwith LocalCluster(processes=False, n_workers=2,\n                  threads_per_worker=2) as cluster, Client(cluster) as client:\n\n    # Wrap model with Dask Incremental\n    srfc = Incremental(StreamingRFC(dask_feeding=True,  # Turn dask on\n                                    n_estimators_per_chunk=10,\n                                    max_n_estimators=np.inf,\n                                    n_jobs=4))\n 
   \n    # Call fit directly, specifying the expected classes\n    srfc.fit(x, y,\n             classes=delayed(np.unique)(y).compute())\n\n    print(len(srfc.estimators_))\n    print(srfc.score(x, y))\n````\n\n#### Feeding .partial_fit() manually\n.partial_fit() can be called directly and fed data manually.\n\nFor example, this can be used to feed .partial_fit() sequentially (although the example below selects random rows, which is similar to the non-Dask example above).\n\n````python\nimport numpy as np\nfrom sklearn.datasets import make_blobs\nfrom incremental_trees.models.classification.streaming_rfc import StreamingRFC\n\nsrfc = StreamingRFC(n_estimators_per_chunk=20,\n                    max_n_estimators=np.inf,\n                    n_jobs=4)\n\n# Generate some data in memory\nx, y = make_blobs(n_samples=int(2e5), random_state=0, n_features=40,\n                  centers=2, cluster_std=100)\n\n# Feed .partial_fit() with random samples of the data\nn_chunks = 30\nchunk_size = int(2e3)\nfor i in range(n_chunks):\n    sample_idx = np.random.randint(0, x.shape[0], chunk_size)\n    # Call .partial_fit(), specifying the expected classes; other .fit() args such as sample_weight are also supported\n    srfc.partial_fit(x[sample_idx, :], y[sample_idx],\n                     classes=np.unique(y))\n\n# Should be n_chunks * n_estimators_per_chunk\nprint(len(srfc.estimators_))\nprint(srfc.score(x, y))\n````\n\n### Possible model setups\nThere are a couple of different model setups worth considering. It is not yet clear which works best. \n\n#### \"Incremental forest\"\nFor each chunk/fit, sample rows from X, then fit a number of single trees (with different column subsets), e.g.\n````python\nsrfc = StreamingRFC(n_estimators_per_chunk=10, max_features='sqrt')\n````\n#### \"Incremental decision trees\"\nSingle (or few) decision trees per data subset, with all features. 
\n````python\nsrfc = StreamingRFC(n_estimators_per_chunk=1, max_features=x.shape[1])\n````\n\n# Version history\n## v0.6.0\n - Update to work with scikit-learn==1.2, dask==2022.12, dask-glm==0.2.0, dask-ml==2022.5.27. Support Python 3.8 and 3.9.\n## v0.5.1\n - Add support for passing fit args/kwargs via `.fit` (specifically, `sample_weight`)\n## v0.5.0\n - Add support for passing fit args/kwargs via `.partial_fit` (specifically, `sample_weight`)\n## v0.4.0\n - Refactor and tidy, try with new versions of Dask/sklearn\n## v0.3.1-3\n  - Update Dask versions\n## v0.3.0\n  - Updated unit tests\n  - Added performance benchmark tests for classifiers, not finished.\n  - Added regressor versions of RandomForest (StreamingRFR) and ExtraTrees (StreamingEXTR, also renamed StreamingEXT to StreamingEXTC).\n  - Added .fit() overload to handle feeding .partial_fit() with random row samples, without using Dask. Adds compatibility with sklearn SearchCV objects.\n## v0.2.0\n  - Add ExtraTreesClassifier (StreamingEXT)\n## v0.1.0\n  - .partial_fit() for RandomForestClassifier (StreamingRFC)\n  - .predict_proba() for RandomForestClassifier\n  \n  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgarethjns%2Fincrementaltrees","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgarethjns%2Fincrementaltrees","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgarethjns%2Fincrementaltrees/lists"}