{"id":19603078,"url":"https://github.com/devexcale/multicoreminhash","last_synced_at":"2025-07-21T06:32:50.738Z","repository":{"id":248358193,"uuid":"768769328","full_name":"devExcale/MulticoreMinHash","owner":"devExcale","description":"C implementations of the MinHash algorithm using MPI and OpenMP","archived":false,"fork":false,"pushed_at":"2024-07-14T08:15:36.000Z","size":91,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-13T06:43:04.366Z","etag":null,"topics":["acsai","c","mpi","multithreading","openmp","sapienza","sapienza-university"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devExcale.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-07T17:42:26.000Z","updated_at":"2024-07-14T08:18:13.000Z","dependencies_parsed_at":"2024-07-14T10:32:06.697Z","dependency_job_id":"5e455d72-4152-4cb4-a273-1913edd5862b","html_url":"https://github.com/devExcale/MulticoreMinHash","commit_stats":null,"previous_names":["devexcale/multicoreminhash"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/devExcale/MulticoreMinHash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devExcale%2FMulticoreMinHash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devExcale%2FMulticoreMinHash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devExcale%2FMulticoreMinHash/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/devExcale%2FMulticoreMinHash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devExcale","download_url":"https://codeload.github.com/devExcale/MulticoreMinHash/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devExcale%2FMulticoreMinHash/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266253614,"owners_count":23900053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acsai","c","mpi","multithreading","openmp","sapienza","sapienza-university"],"created_at":"2024-11-11T09:27:54.079Z","updated_at":"2025-07-21T06:32:45.730Z","avatar_url":"https://github.com/devExcale.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multicore MinHash\n\nThis project contains two C implementations of the MinHash algorithm that use parallelism to improve performance.\nOne implementation parallelizes the computation with MPI, the other with OpenMP.\n\nThe algorithm's output is a CSV file that contains the indices of the documents that matched\nand the similarity score between them.\n\n## Code structure\n\nThe code can be found inside the `src` folder:\nthe `MPI` and `OMP` folders contain the implementations of the algorithm using MPI and OpenMP respectively.\nPython scripts with various purposes can also be found in the `src` folder.\n\nThe whole project can be built using the `Makefile` in the root folder.\nDuring compilation, the `whichmp` variable must be used to specify which 
implementation to compile.\nThe possible values are `MPI` and `OMP`.\n\n## Running options\n\n- `docs`: the number of documents to use when running the program\n- `offset`: the number of documents to skip when running the program\n- `shingle`: the number of words to use for each shingle\n- `signature`: the number of hash functions to use for each signature\n- `bandrows`: the number of rows to use for each band\n- `seed`: the seed to use for the hash functions\n- `threshold`: the similarity threshold to use when filtering the results\n\n## Makefile rules\n\nThe `Makefile` contains the following rules:\n\n- `all`: compiles the project using the specified implementation\n- `clean`: removes all the compiled files\n- `run`: runs the compiled program\n- `debug`: runs the compiled program with the `gdb` debugger\n- `time`: runs the program with the `time` command\n- `report`: runs the program multiple times with an increasing number of processes\n  and saves the execution times in a CSV file\n- `report-check`: checks that the CSV outputs of the multiple runs by `report` are consistent\n- `extract-medpub`: extracts the PubMed dataset (`medpub`) from Kaggle's CSV file\n\n## Make options\n\nThe `Makefile` contains ready-to-use configurations that can be used to test the algorithm out of the box.\nIt accepts the following options:\n\n- `whichmp`: the implementation to use when compiling or running the program\n- `processes`: the number of processes to use when running once,\n  or the maximum number of processes to use when running multiple times\n- `dataset`: the dataset to use when running the program (see [below](#datasets) for more information)\n- `repeat`: the number of times to run the program with the same number of processes when using the `report` rule\n\n\u003e **Example:** the command `make report whichmp=OMP processes=12 repeat=3 dataset=medical` will run the OMP implementation on\nthe `medical` dataset from 1 to 12 processes, 3 times for each number of processes, for a total of 36 executions.\n\n## Datasets\n\nThe 
datasets we used to test the performance of the algorithms are downloadable from the Kaggle platform.\n\n| Name                                         | Execution code |  # Docs | Link                                                                                          |\n|----------------------------------------------|:--------------:|--------:|-----------------------------------------------------------------------------------------------|\n| 2k clean medical articles (MedicalNewsToday) |   `medical`    |   1'989 | [link](https://www.kaggle.com/datasets/trikialaaa/2k-clean-medical-articles-medicalnewstoday) |\n| 🌍 Environment News Dataset 📰               | `environment`  |  29'090 | [link](https://www.kaggle.com/datasets/beridzeg45/guardian-environment-related-news)          |\n| PubMed Article Summarization Dataset         |    `medpub`    | 106'330 | [link](https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset)    |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevexcale%2Fmulticoreminhash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevexcale%2Fmulticoreminhash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevexcale%2Fmulticoreminhash/lists"}