{"id":13423679,"url":"https://github.com/DmitryUlyanov/Multicore-TSNE","last_synced_at":"2025-03-15T17:32:06.321Z","repository":{"id":40685952,"uuid":"71323151","full_name":"DmitryUlyanov/Multicore-TSNE","owner":"DmitryUlyanov","description":"Parallel t-SNE implementation with Python and Torch wrappers.","archived":false,"fork":false,"pushed_at":"2024-02-06T10:59:55.000Z","size":455,"stargazers_count":1896,"open_issues_count":40,"forks_count":229,"subscribers_count":42,"default_branch":"master","last_synced_at":"2025-03-08T15:46:52.700Z","etag":null,"topics":["barnes-hut-tsne","multicore","py-bh-tsne","tsne"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DmitryUlyanov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-19T05:46:52.000Z","updated_at":"2025-02-28T18:06:55.000Z","dependencies_parsed_at":"2022-09-10T20:22:22.823Z","dependency_job_id":"e2e3acee-f3f7-4198-89de-9e8d79ab705f","html_url":"https://github.com/DmitryUlyanov/Multicore-TSNE","commit_stats":{"total_commits":93,"total_committers":17,"mean_commits":5.470588235294118,"dds":0.5913978494623655,"last_synced_commit":"89b8ce5b1911b024eeaf6b7f1083da3c30ad5e7c"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitryUlyanov%2FMulticore-TSNE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitryUlyanov%2FMulticore-TSNE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/DmitryUlyanov%2FMulticore-TSNE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DmitryUlyanov%2FMulticore-TSNE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DmitryUlyanov","download_url":"https://codeload.github.com/DmitryUlyanov/Multicore-TSNE/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243763221,"owners_count":20344184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["barnes-hut-tsne","multicore","py-bh-tsne","tsne"],"created_at":"2024-07-31T00:00:40.525Z","updated_at":"2025-03-15T17:32:05.875Z","avatar_url":"https://github.com/DmitryUlyanov.png","language":"C++","funding_links":[],"categories":["C++","其他_机器学习与深度学习","数据可视化"],"sub_categories":[],"readme":"# Multicore t-SNE [![Build Status](https://travis-ci.org/DmitryUlyanov/Multicore-TSNE.svg?branch=master)](https://travis-ci.org/DmitryUlyanov/Multicore-TSNE)\n\nThis is a multicore modification of [Barnes-Hut t-SNE](https://github.com/lvdmaaten/bhtsne) by L. Van der Maaten with Python CFFI-based wrappers. This code also works **faster than sklearn.TSNE** on 1 core (as of version 0.18).\n\n\u003ccenter\u003e\u003cimg src=\"mnist-tsne.png\" width=\"512\"\u003e\u003c/center\u003e\n\n# What to expect\n\nBarnes-Hut t-SNE is done in two steps.\n\n- First step: an efficient data structure for nearest neighbours search is built and used to compute probabilities. 
This can be done in parallel for each point in the dataset, which is why we can expect a good speed-up from using more cores.\n\n- Second step: the embedding is optimized using gradient descent. This part is essentially sequential, so we can only parallelize within an iteration. Some parts can be parallelized effectively, but not all of them are parallelized for now. That is why the second-step speed-up will not be as significant as the first-step speed-up, but there is still room for improvement.\n\nSo when can you benefit from parallelization? To a good approximation, the second step's computation time is independent of `D` and depends mostly on `N`. The first step's time depends heavily on `D`, so for small `D` `time(Step 1) \u003c\u003c time(Step 2)`, and for large `D` `time(Step 1) \u003e\u003e time(Step 2)`. As we are only good at parallelizing step 1, we benefit most when `D` is large enough (MNIST's `D = 784` is large; `D = 10`, even for `N=1000000`, is not). I originally wrote the multicore modification for the [Springleaf competition](https://www.kaggle.com/c/springleaf-marketing-response), where my data table was about `300000 x 3000` and only a few days were left until the end of the competition, so any speed-up was handy.\n\n# Benchmark\n\n### 1 core\n\nInterestingly, this code beats other implementations. We compare to `sklearn` (Barnes-Hut of course), L. Van der Maaten's [bhtsne](https://github.com/lvdmaaten/bhtsne), and the [py_bh_tsne repo](https://github.com/danielfrg/tsne) (Cython wrapper for bhtsne with QuadTree). `perplexity = 30, theta=0.5` for every run. 
In fact, the [py_bh_tsne repo](https://github.com/danielfrg/tsne) runs at the same speed as this code when compiled with more optimization flags.\n\nThis is a benchmark for `70000x784` MNIST data:\n\n| Method                       | Step 1 (sec)   | Step 2 (sec)  |\n| ---------------------------- |:---------------:| --------------:|\n| MulticoreTSNE(n_jobs=1)      | **912**         | **350**        |\n| bhtsne                       | 4257            | 1233           |\n| py_bh_tsne                   | 1232            | 367            |\n| sklearn(0.18)                | ~5400           | ~20920         |\n\nI did my best to find what is wrong with the sklearn numbers, but this is the best benchmark I could do (you can find the test script in the `MulticoreTSNE/examples` folder).\n\n### Multicore\n\nThis table shows the speed-up relative to 1 core when using `n` cores.\n\n| n_jobs        | Step 1    | Step 2   |\n| ------------- |:---------:| --------:|\n| 1             | 1x        | 1x       |\n| 2             | 1.54x     | 1.05x    |\n| 4             | 2.6x      | 1.2x     |\n| 8             | 5.6x      | 1.65x    |\n\n# How to use\n\n### Install\n\n#### Directly from PyPI\n`pip install MulticoreTSNE`\n\n#### From source\n\nMake sure `cmake` is installed on your system. You will also need a sensible C++ compiler, such as `gcc` or `llvm-clang`. 
On macOS, you can get both via [homebrew](https://brew.sh/).\n\nTo install the package, run:\n```bash\ngit clone https://github.com/DmitryUlyanov/Multicore-TSNE.git\ncd Multicore-TSNE/\npip install .\n```\n\nTested with Python \u003e= 3.6 (conda).\n\n### Run\n\nYou can use it as a near drop-in replacement for [sklearn.manifold.TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).\n\n```python\nfrom MulticoreTSNE import MulticoreTSNE as TSNE\n\ntsne = TSNE(n_jobs=4)\nY = tsne.fit_transform(X)\n```\n\nPlease refer to the [sklearn TSNE manual](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) for an explanation of the parameters.\n\nThis implementation supports only `n_components=2`, which is the most common case (use [Barnes-Hut t-SNE](https://github.com/lvdmaaten/bhtsne) or sklearn otherwise). Also note that some parameters are there just for the sake of compatibility with sklearn and are otherwise ignored. See the `MulticoreTSNE` class docstring for more info.\n\n#### MNIST example\n```python\nfrom sklearn.datasets import fetch_openml\nfrom MulticoreTSNE import MulticoreTSNE as TSNE\nfrom matplotlib import pyplot as plt\n\n# Load the 70000x784 MNIST data; y holds the digit labels as strings\nX, y = fetch_openml(\n  \"mnist_784\", version=1, return_X_y=True, as_frame=False, parser=\"pandas\"\n)\nembeddings = TSNE(n_jobs=4).fit_transform(X)\nvis_x = embeddings[:, 0]\nvis_y = embeddings[:, 1]\n# Color each point by its digit label (converted to int for the colormap)\nplt.scatter(vis_x, vis_y, c=y.astype(int), cmap=plt.get_cmap(\"jet\", 10), marker='.')\nplt.colorbar(ticks=range(10))\nplt.clim(-0.5, 9.5)\nplt.show()\n```\n\n### Test\n\nYou can test it on the MNIST dataset with the following command:\n\n```bash\npython MulticoreTSNE/examples/test.py --n_jobs \u003cn_jobs\u003e\n```\n\n#### Note on Jupyter use\nTo make the computation log visible in Jupyter, please install `wurlitzer` (`pip install wurlitzer`) and execute this line in any cell beforehand:\n```\n%load_ext wurlitzer\n```\nMemory leaks are possible if you interrupt the process. 
It should be fine if you let it run to completion.\n\n# License\n\nInherited from the [original repo's license](https://github.com/lvdmaaten/bhtsne).\n\n# Future work\n\n- Allow types other than double\n- Improve step 2 performance (possible)\n\n# Citation\n\nPlease cite this repository if it was useful for your research:\n\n```\n@misc{Ulyanov2016,\n  author = {Ulyanov, Dmitry},\n  title = {Multicore-TSNE},\n  year = {2016},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/DmitryUlyanov/Multicore-TSNE}},\n}\n```\n\nOf course, do not forget to cite [L. Van der Maaten's paper](http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDmitryUlyanov%2FMulticore-TSNE","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDmitryUlyanov%2FMulticore-TSNE","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDmitryUlyanov%2FMulticore-TSNE/lists"}