{"id":28451933,"url":"https://github.com/doxakis/cosinesimilaritycomparison","last_synced_at":"2025-07-11T19:37:08.928Z","repository":{"id":91102983,"uuid":"145944048","full_name":"doxakis/CosineSimilarityComparison","owner":"doxakis","description":"CPU vs GPU vs Advanced Vector Extensions (AVX, SSE, etc.) with varying number of threads","archived":false,"fork":false,"pushed_at":"2018-08-26T02:11:38.000Z","size":25,"stargazers_count":23,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-07-02T03:37:37.208Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/doxakis.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-08-24T04:54:26.000Z","updated_at":"2025-06-09T08:06:29.000Z","dependencies_parsed_at":"2023-07-02T10:46:14.319Z","dependency_job_id":null,"html_url":"https://github.com/doxakis/CosineSimilarityComparison","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/doxakis/CosineSimilarityComparison","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doxakis%2FCosineSimilarityComparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doxakis%2FCosineSimilarityComparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doxakis%2FCosineSimilarityComparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dox
akis%2FCosineSimilarityComparison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/doxakis","download_url":"https://codeload.github.com/doxakis/CosineSimilarityComparison/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doxakis%2FCosineSimilarityComparison/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264887915,"owners_count":23678773,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-06T17:09:36.767Z","updated_at":"2025-07-11T19:37:08.919Z","avatar_url":"https://github.com/doxakis.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cosine Similarity\n\nCosine similarity is a measure of similarity between two vectors. It is typically used as a text matching algorithm, where each vector holds the term frequencies of words (or of sequences of X characters) in a text document. To simplify the experiment, the dataset is filled with random values. The experiment computes the distance between each pair of vectors.\n\nThe same algorithm is written using different methods.\n\nMethods:\n\n- CPU (pure C#)\n- GPU (NVIDIA graphics card and CUDA)\n- Vectorized on CPU (using SIMD vector extensions, e.g. SSE, AVX, etc.)\n\nThe experiment can help to better understand the advantages of using one method over another. 
It also provides example code that we can refer to when needed.\n\nVarying parameters:\n\n- Method (CPU, Vectorized on CPU, GPU)\n- Dataset size (number of elements and number of dimensions)\n- Number of threads\n- Array types (int/double)\n\n# Methodology\n\nThe project is built in release x64 with the optimize code option checked.\n\nFirst, it performs the JIT compilation for the GPU on startup (about 1 sec.). It then runs the comparison for multiple matrix sizes.\n\nIt validates each result against the CPU version (1 thread) to make sure the results are the same.\n\n# Results\n\nComputer specs:\n\n- SSD\n- 16 GB DDR4\n- NVIDIA GeForce GTX 1060 6GB (1280 CUDA cores)\n- Ryzen 7 (8 cores, 16 threads)\n\n```\nInteger versions:\n\nGpu (JIT compilation): 979 ms\n\nDataset: 200x100000\nSimple 1 thread:        3386 ms\nSimple 2 threads:       1735 ms\nSimple 4 threads:       864 ms\nSimple 8 threads:       476 ms\nVectorizedV1 1 thread:  1420 ms\nVectorizedV1 2 threads: 715 ms\nVectorizedV1 4 threads: 376 ms\nVectorizedV1 8 threads: 244 ms\nVectorizedV2 1 thread:  982 ms\nVectorizedV2 2 threads: 508 ms\nVectorizedV2 4 threads: 283 ms\nVectorizedV2 8 threads: 242 ms\nGpu:                    725 ms\n\nDataset: 2000x5000\nSimple 1 thread:        17625 ms\nSimple 2 threads:       8784 ms\nSimple 4 threads:       4355 ms\nSimple 8 threads:       2467 ms\nVectorizedV1 1 thread:  7578 ms\nVectorizedV1 2 threads: 3894 ms\nVectorizedV1 4 threads: 2084 ms\nVectorizedV1 8 threads: 1257 ms\nVectorizedV2 1 thread:  6148 ms\nVectorizedV2 2 threads: 3326 ms\nVectorizedV2 4 threads: 1731 ms\nVectorizedV2 8 threads: 1322 ms\nGpu:                    1361 ms\n\nDataset: 5000x25\nSimple 1 thread:        930 ms\nSimple 2 threads:       589 ms\nSimple 4 threads:       446 ms\nSimple 8 threads:       276 ms\nVectorizedV1 1 thread:  626 ms\nVectorizedV1 2 threads: 488 ms\nVectorizedV1 4 threads: 389 ms\nVectorizedV1 8 threads: 203 ms\nVectorizedV2 1 thread:  5962 ms\nVectorizedV2 2 threads: 3490 
ms\nVectorizedV2 4 threads: 1961 ms\nVectorizedV2 8 threads: 1250 ms\nGpu:                    725 ms\n\nDataset: 1x1\nSimple 1 thread:        0 ms\nSimple 2 threads:       0 ms\nSimple 4 threads:       0 ms\nSimple 8 threads:       0 ms\nVectorizedV1 1 thread:  0 ms\nVectorizedV1 2 threads: 0 ms\nVectorizedV1 4 threads: 0 ms\nVectorizedV1 8 threads: 0 ms\nVectorizedV2 1 thread:  0 ms\nVectorizedV2 2 threads: 0 ms\nVectorizedV2 4 threads: 0 ms\nVectorizedV2 8 threads: 0 ms\nGpu:                    303 ms\n\nDouble versions:\n\nGpu (JIT compilation): 318 ms\n\nDataset: 2000x5000\nSimple 1 thread:        9804 ms\nSimple 2 threads:       5532 ms\nSimple 4 threads:       3121 ms\nSimple 8 threads:       2641 ms\nVectorizedV1 1 thread:  11518 ms\nVectorizedV1 2 threads: 6006 ms\nVectorizedV1 4 threads: 3201 ms\nVectorizedV1 8 threads: 2532 ms\nVectorizedV2 1 thread:  7338 ms\nVectorizedV2 2 threads: 4262 ms\nVectorizedV2 4 threads: 2867 ms\nVectorizedV2 8 threads: 2545 ms\nGpu:                    1614 ms\n\nDataset: 5000x25\nSimple 1 thread:        732 ms\nSimple 2 threads:       529 ms\nSimple 4 threads:       358 ms\nSimple 8 threads:       312 ms\nVectorizedV1 1 thread:  697 ms\nVectorizedV1 2 threads: 652 ms\nVectorizedV1 4 threads: 321 ms\nVectorizedV1 8 threads: 347 ms\nVectorizedV2 1 thread:  5546 ms\nVectorizedV2 2 threads: 3396 ms\nVectorizedV2 4 threads: 1807 ms\nVectorizedV2 8 threads: 1302 ms\nGpu:                    739 ms\n\nDataset: 10000x1000\nGpu:                    Exception: unspecified launch failure\n```\n\nIf we consider the dataset 200x100000:\n\n```\nGpuCosineSimilarityIntegerVersionCacheKernel:\n    (init):             256 ms\n    (ComputeDistances): 378 ms\n    (ComputeDistances): 356 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 341 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 342 ms\n    
(ComputeDistances): 345 ms\n    (ComputeDistances): 340 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 366 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 356 ms\n    (ComputeDistances): 420 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 349 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 354 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 348 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 355 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 351 ms\n    (ComputeDistances): 352 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 343 ms\n    (ComputeDistances): 351 ms\n    (ComputeDistances): 347 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 354 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 343 ms\n    min: 340 ms\n    max: 420 ms\n    (dispose):          34 ms\n```\n\nIf we consider the dataset: 200x100000:\n\n```\nSimple 1 thread:        3338 ms\nSimple 2 threads:       1694 ms\nSimple 4 threads:       845 ms\nSimple 8 threads:       446 ms\nSimpleV2 1 thread:      1940 ms\nSimpleV2 2 threads:     982 ms\nSimpleV2 4 threads:     549 ms\nSimpleV2 8 threads:     273 ms\n```\n\nIf we turn on/off 
optimize code:\n\n```\nDataset: 200x100000\n\nx64 (optimize code unchecked)\nCompute with integer, result with double: 14237 ms\nCompute with integer, result with float: 14393 ms\nCompute with float, result with float: 14196 ms\n\nx64 (optimize code checked)\nCompute with integer, result with double: 3331 ms\nCompute with integer, result with float: 3332 ms\nCompute with float, result with float: 1874 ms\n```\n\nWith the GPU, float vs integer:\n\n```\nDataset: 200x100000\nGpuCosineSimilarityFloatVersionCacheKernel:\n    (init):             273 ms\n    (ComputeDistances): 166 ms\n    (ComputeDistances): 212 ms\n    (ComputeDistances): 181 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 161 ms\n    (ComputeDistances): 167 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 162 ms\n    (ComputeDistances): 160 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 160 ms\n    (ComputeDistances): 165 ms\n    (ComputeDistances): 159 ms\n    (ComputeDistances): 155 ms\n    (ComputeDistances): 157 ms\n    (ComputeDistances): 160 ms\n    min: 155 ms\n    max: 212 ms\n    (dispose):          49 ms\n\nDataset: 200x100000\nGpuCosineSimilarityIntegerVersionCacheKernel:\n    (init):             274 ms\n    (ComputeDistances): 390 ms\n    (ComputeDistances): 563 ms\n    (ComputeDistances): 383 ms\n    (ComputeDistances): 338 ms\n    (ComputeDistances): 346 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 338 ms\n    (ComputeDistances): 340 ms\n    (ComputeDistances): 341 ms\n    (ComputeDistances): 348 ms\n    (ComputeDistances): 345 ms\n    (ComputeDistances): 340 ms\n    (ComputeDistances): 417 ms\n    (ComputeDistances): 344 ms\n    (ComputeDistances): 351 ms\n    (ComputeDistances): 341 ms\n    (ComputeDistances): 342 ms\n    (ComputeDistances): 359 ms\n    (ComputeDistances): 340 ms\n    (ComputeDistances): 348 
ms\n    min: 338 ms\n    max: 563 ms\n    (dispose):          43 ms\n```\n\n# Conclusion\n\nWith the simple method, adding more threads reduces the duration.\n\nThere is a minimum cost to communicate with the GPU device (about 300 ms in the experiment, incurred only on the first GPU call). You need a large amount of data for the GPU to be worthwhile. Otherwise, it's slower than the single thread version. The communication cost with the GPU is negligible when using large arrays. If the array is too large, we get an exception (unspecified launch failure on the 10000x1000 dataset). In that case, it may be time to do batch processing with multiple GPU calls.\n\nThe vector instructions of modern CPUs (SSE, AVX, etc.) can be used on each thread. Adding more threads reduces the computation time. Compared to the simple method, it uses about half (or less) the time to do the same job in the integer version. If the dataset is a double array, the performance is the same or worse.\n\nWith the vectorized and GPU methods, using double is slower than using integer (for example, 11518 ms vs 7578 ms for VectorizedV1 with 1 thread on the 2000x5000 dataset). The simple method did not follow this pattern on that dataset (9804 ms with double vs 17625 ms with integer). With the vectorized methods, prefer integer if possible. If you want to keep some digits, you could multiply the number by 10 or 100 and convert it to integer. If you really want to keep double, maybe you should consider using the GPU.\n\nIf we compare the two vectorized versions (integer array, v1 and v2), the dot product is faster than multiplying and adding into an accumulator vector and then summing the accumulator when the vectors have few dimensions. (In that case, the accumulator approach is even slower than the simple method on 1 thread.) But for vectors with a lot of dimensions, using an accumulator vector is faster than using the dot product operation.\n\nWith the GPU, the kernel function can be cached for multiple uses. If we consider the dataset 200x100000:\n- Initialization takes 256 ms\n- Dispose takes 34 ms\n- Computing the distances varies between 340 ms and 420 ms (a variation of about 80 ms)\n- Compared to GpuCosineSimilarityIntegerVersion (about 725 ms). 
It's faster if we do multiple calls.\n\nPrecalculating the magnitude of each vector greatly reduces the number of operations to do. (class: SimpleV2CosineSimilarityIntegerVersion)\n\nMake sure the optimize code option is checked (more than a 4x speedup). Using a float array results in better performance than a double or integer array. If you use .NET Core, the code is optimized when you deploy using `dotnet publish -c Release`.\n\nOn the GPU, a float array gives better performance than an integer array.\n\n# Copyright and license\n\nCode released under the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoxakis%2Fcosinesimilaritycomparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdoxakis%2Fcosinesimilaritycomparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoxakis%2Fcosinesimilaritycomparison/lists"}