{"id":18295651,"url":"https://github.com/koheiw/proxyc","last_synced_at":"2025-04-05T12:31:40.958Z","repository":{"id":42382170,"uuid":"162182932","full_name":"koheiw/proxyC","owner":"koheiw","description":"R package for large-scale similarity/distance computation","archived":false,"fork":false,"pushed_at":"2024-04-25T11:39:11.000Z","size":1019,"stargazers_count":28,"open_issues_count":4,"forks_count":6,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-04-28T03:20:53.917Z","etag":null,"topics":["data-science","distance-measures","r","similarity-measures"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koheiw.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-17T19:58:04.000Z","updated_at":"2024-04-24T14:20:21.000Z","dependencies_parsed_at":"2022-08-12T10:00:16.270Z","dependency_job_id":"690e9c04-1a15-429f-b5f4-6e961d7e8823","html_url":"https://github.com/koheiw/proxyC","commit_stats":{"total_commits":189,"total_committers":3,"mean_commits":63.0,"dds":"0.17989417989417988","last_synced_commit":"9b139cea220c17aa188503a49ec96593468a64f8"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2FproxyC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2FproxyC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2FproxyC/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/koheiw%2FproxyC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koheiw","download_url":"https://codeload.github.com/koheiw/proxyC/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247338917,"owners_count":20923000,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","distance-measures","r","similarity-measures"],"created_at":"2024-11-05T14:36:54.610Z","updated_at":"2025-04-05T12:31:35.950Z","avatar_url":"https://github.com/koheiw.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r, echo = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"##\",\n  fig.path = \"man/figures/\",\n  fig.width = 9,\n  fig.height = 3,\n  warning = FALSE,\n  dpi = 150\n)\n```\n\n# proxyC: R package for large-scale similarity/distance computation\n\n\u003c!-- badges: start --\u003e\n\n[![CRAN Version](https://www.r-pkg.org/badges/version/proxyC)](https://CRAN.R-project.org/package=proxyC)\n[![Downloads](https://cranlogs.r-pkg.org/badges/proxyC)](https://CRAN.R-project.org/package=proxyC)\n[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/proxyC?color=orange)](https://CRAN.R-project.org/package=proxyC)\n[![R build status](https://github.com/koheiw/proxyC/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/proxyC/actions)\n[![codecov](https://codecov.io/gh/koheiw/proxyC/branch/master/graph/badge.svg)](https://app.codecov.io/gh/koheiw/proxyC)\n\u003c!-- 
badges: end --\u003e\n\n**proxyC** computes proximity between rows or columns of large matrices efficiently in C++. It is optimized for large sparse matrices using the Armadillo and Intel TBB libraries. Among the built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast.\n\nThis code was originally written for [**quanteda**](https://github.com/quanteda/quanteda) to compute similarity/distance between documents or features in large corpora, but was later separated into a stand-alone package to make it available for broader data science purposes.\n\n## Install\n\nSince v0.4.0, **proxyC** requires the Intel oneAPI Threading Building Blocks library for parallel computing. Windows and Mac users can download a binary package from CRAN, but Linux users must first install the library by executing the commands below:\n\n```{bash, eval=FALSE}\n# Fedora, CentOS, RHEL\nsudo yum install tbb-devel\n\n# Debian and Ubuntu\nsudo apt install libtbb-dev\n```\n\n```{r eval=FALSE}\ninstall.packages(\"proxyC\")\n```\n\n## Performance\n\n```{r}\nrequire(Matrix)\nrequire(microbenchmark)\nrequire(ggplot2)\nrequire(magrittr)\n\n# Set number of threads\noptions(\"proxyC.threads\" = 8)\n\n# Make a matrix with 99% zeros\nsm1k \u003c- rsparsematrix(1000, 1000, 0.01) # 1,000 columns\nsm10k \u003c- rsparsematrix(1000, 10000, 0.01) # 10,000 columns\n\n# Convert to dense format\ndm1k \u003c- as.matrix(sm1k) \ndm10k \u003c- as.matrix(sm10k)\n```\n\n## Cosine similarity between columns\n\nWith sparse matrices, **proxyC** is roughly 10 to 100 times faster than **proxy**. 
\n\n```{r, cache=TRUE}\nbm1 \u003c- microbenchmark(\n    \"proxy 1k\" = proxy::simil(dm1k, method = \"cosine\"),\n    \"proxyC 1k\" = proxyC::simil(sm1k, margin = 2, method = \"cosine\"),\n    \"proxy 10k\" = proxy::simil(dm10k, method = \"cosine\"),\n    \"proxyC 10k\" = proxyC::simil(sm10k, margin = 2, method = \"cosine\"),\n    times = 10\n)\nautoplot(bm1)\n```\n\n## Cosine similarity greater than 0.9\n\nIf `min_simil` is set, **proxyC** becomes even faster because similarity scores below the threshold are floored to zero.\n\n```{r, cache=TRUE}\nbm2 \u003c- microbenchmark(\n    \"proxyC all\" = proxyC::simil(sm1k, margin = 2, method = \"cosine\"),\n    \"proxyC min_simil\" = proxyC::simil(sm1k, margin = 2, method = \"cosine\", min_simil = 0.9),\n    times = 10\n)\nautoplot(bm2)\n```\n\nFlooring with `min_simil` makes the resulting object much smaller.\n\n```{r, cache=TRUE}\nproxyC::simil(sm10k, margin = 2, method = \"cosine\") %\u003e% \n  object.size() %\u003e% \n  print(units = \"MB\")\nproxyC::simil(sm10k, margin = 2, method = \"cosine\", min_simil = 0.9) %\u003e% \n  object.size() %\u003e% \n  print(units = \"MB\")\n```\n\n## Top-10 correlation\n\nIf `rank` is set, **proxyC** returns only the top-n values.\n\n```{r, cache=TRUE}\nbm3 \u003c- microbenchmark(\n    \"proxyC rank\" = proxyC::simil(sm1k, margin = 2, method = \"correlation\", rank = 10),\n    \"proxyC all\" = proxyC::simil(sm1k, margin = 2, method = \"correlation\"),\n    times = 10\n)\nautoplot(bm3)\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoheiw%2Fproxyc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoheiw%2Fproxyc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoheiw%2Fproxyc/lists"}