https://github.com/devexcale/multicoreminhash

C implementations of the MinHash algorithm using MPI and OpenMP
https://github.com/devexcale/multicoreminhash

acsai c mpi multithreading openmp sapienza sapienza-university

Last synced: about 1 month ago
JSON representation

C implementations of the MinHash algorithm using MPI and OpenMP

Host: GitHub
URL: https://github.com/devexcale/multicoreminhash
Owner: devExcale
Created: 2024-03-07T17:42:26.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-07-14T08:15:36.000Z (about 1 year ago)
Last Synced: 2025-02-26T15:46:42.900Z (5 months ago)
Topics: acsai, c, mpi, multithreading, openmp, sapienza, sapienza-university
Language: C
Homepage:
Size: 88.9 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Multicore MinHash

This project contains two C implementations of the MinHash algorithm that use parallelism to improve performances.
The two implementations use MPI one and OpenMP the other to parallelize computation.

The algorithm's output is a csv file that contains the indices of the documents that matched
and the similarity score between them.

## Code structure

The code can be found inside the `src` folder:
the `MPI` and `OMP` folders contain the implementations of the algorithm using MPI and OpenMP respectively.
Moreover, python scripts with various purposes can be found in the `src` folder.

The whole project can be built using the `Makefile` in the root folder.
During compilation, the `whichmp` variable must be used to specify which implementation to compile.
The possible values are `MPI` and `OMP`.

## Running options

- `docs`: the number of documents to use when running the program
- `offset`: the number of documents to skip when running the program
- `shingle`: the number of words to use for each shingle
- `signature`: the number of hash functions to use for each signature
- `bandrows`: the number of rows to use for each band
- `seed`: the seed to use for the hash functions
- `threshold`: the similarity threshold to use when filtering the results

## Makefile rules

The `Makefile` contains the following rules:

- `all`: compiles the project using the specified implementation
- `clean`: removes all the compiled files
- `run`: runs the compiled program
- `debug`: runs the compiled program with the `gdb` debugger
- `time`: runs the program with the `time` command
- `report`: runs the program multiple times with increasing number of processes
and saves the execution times in a csv file
- `report-check`: checks that the csv outputs of the multiple runs by `report` are consistent
- `extract-medpub`: extracts the MedPub dataset from kaggle's csv file

## Make options

The makefile contains ready-to-use configurations that can be used to test the algorithm out-of-the-box.

- `whichmp`: the implementation to use when compiling or running the program
- `processes`: the number of processes to use when running once,
or the maximum number of processes to use when running multiple times
- `dataset`: the dataset to use when running the program (see [below](#datasets) for more information)
- `repeat`: the number of times to run the program with the same number of processes when using the `report` rule

> **Example:** the command `make report whichmp=OMP processes=12 repeat=3 dataset=medical` will run the OMP implementation on
the `medical` dataset from 1 to 12 processes, 3 times for each number of processes, for a total of 36 executions.

## Datasets

The datasets we used to test the performance of the algorithms are downloadable from the Kaggle platform.

| Name | Execution code | # Docs | Link |
|----------------------------------------------|:--------------:|--------:|-----------------------------------------------------------------------------------------------|
| 2k clean medical articles (MedicalNewsToday) | `medical` | 1'989 | [link](https://www.kaggle.com/datasets/trikialaaa/2k-clean-medical-articles-medicalnewstoday) |
| 🌍 Environment News Dataset 📰 | `environment` | 29'090 | [link](https://www.kaggle.com/datasets/beridzeg45/guardian-environment-related-news) |
| PubMed Article Summarization Dataset | `medpub` | 106'330 | [link](https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset) |

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/devexcale/multicoreminhash

Awesome Lists containing this project

README