https://github.com/vida-nyu/samplingmethodsforinnerproductsketching
# Sampling Methods for Inner Product Sketching
This is the code for the paper [Sampling Methods for Inner Product Sketching](https://www.vldb.org/pvldb/vol17/p2185-musco.pdf) published at VLDB 2024.
We suggest reading the paper before using the code to better understand the experiments. The extended version of the paper (with appendices) is available at: https://arxiv.org/abs/2309.16157
For citations, please use:
> Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos,
and Haoxiang Zhang. Sampling Methods for Inner Product Sketching.
PVLDB, 17(9): 2185 - 2197, 2024. doi:10.14778/3665844.3665850

## Contents
This README file is divided into the following sections:
* [1. Requirements](#1-requirements)
* [2. Setup before reproducing the plots](#2-setup-before-reproducing-the-plots)
* [3. Reproducing the experimental results](#3-reproducing-the-experimental-results)

## 1. Requirements
The paper experiments were run using `Python 3.9.9` with the following required packages. They are also listed in the `requirements.txt` file.
- matplotlib==3.7.2
- numba==0.57.1
- numpy==1.24.4
- pandas==2.0.3
- scipy==1.11.1
- statsmodels==0.14.0
- sklearn==1.4.1

The instructions assume a Unix-like operating system (Linux or macOS). You may need to adjust the steps for machines running Windows.
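
As a quick sanity check, the pinned versions listed above can be compared against what is installed in the active environment. The following is a minimal sketch and not part of the repository: `PINNED` and `find_mismatches` are hypothetical names introduced here. Packages are looked up by the exact names in the list, so an entry whose installed distribution name differs (scikit-learn, for example, is normally published as `scikit-learn` rather than `sklearn`) may be reported as not installed.

```python
from importlib import metadata

# Pins copied verbatim from the requirements list above.
PINNED = {
    "matplotlib": "3.7.2",
    "numba": "0.57.1",
    "numpy": "1.24.4",
    "pandas": "2.0.3",
    "scipy": "1.11.1",
    "statsmodels": "0.14.0",
    "sklearn": "1.4.1",  # name as listed; the installed distribution name may differ
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # no distribution with this name is installed
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version. Running this after step 2.2 is the most useful point, since that step installs the dependencies.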
## 2. Setup before reproducing the plots
### 2.1 Create a virtual environment (optional, but recommended)
To isolate dependencies and avoid library conflicts with your local environment, you may want to use a Python virtual environment. The following commands create and activate one:
```bash
python -m venv ./venv
source ./venv/bin/activate
```

### 2.2 Make sure you have the required packages installed
You can install the dependencies using `pip`:
```bash
pip install -r requirements.txt
```

### 2.3 Set correct environment variables PROJECT_PATH and SCRIPT_PATH by running:
```bash
source .bashrc
```

To verify that this worked, you can run `echo $PROJECT_PATH` and confirm that the output points to the directory where the repository was downloaded.
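
The same check can be done from Python, which is closer to how the experiment scripts would see the environment. This is a minimal sketch under one assumption: that `.bashrc` exports the variables, so that child processes inherit them. The helper `missing_env_vars` is a hypothetical name introduced here and is not part of the repository.

```python
import os

# The two variables named in step 2.3.
REQUIRED_VARS = ("PROJECT_PATH", "SCRIPT_PATH")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        for name in REQUIRED_VARS:
            print(f"{name}={os.environ[name]}")
```

If either variable is reported as not set, re-running `source .bashrc` in the same shell session is the first thing to try.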
## 3. Reproducing the experimental results
### 3.1 Make sure you have done the [Setup](#2-setup-before-reproducing-the-plots).
### 3.2 Use the command line to run the script with the appropriate mode.
### 3.3 The following instructions reproduce the experiments needed for each figure in the paper. Each subsection below gives:
- explanation of the experiment
- command to run the experiment
- expected time to run the experiment, based on the machine used to run the experiments:
  - `MacBook Pro (15-inch, 2019)`
  - `2.3 GHz 8-Core Intel Core i9` with `16GB` RAM

#### Figure 3: Inner product estimation for synthetic *real* data.
- Command: `python super_script.py -mode=ip`
- Expected time:
  - 3.5 hours per plot
  - 14 hours for all 4 plots in Figure 3

#### Figure 4: Inner product estimation for synthetic *binary* data. This can be applied to problems like join size estimation for tables with unique keys and set intersection estimation.
- Command: `python super_script.py -mode=join_size`
- Expected time:
  - 1.8 hours per plot
  - 7.2 hours for all 4 plots in Figure 4

#### Figure 5: Comparison of End-Biased Sampling (TS-1norm) and its Priority Sampling counterpart (PS-1norm) against our TS-weighted and PS-weighted methods
- Command: `python super_script.py -mode=1normVS2norm`
- Expected time:
  - 16 min per plot
  - 64 min for all 4 plots in Figure 5

#### Figure 6: Join-Correlation estimation for synthetic data.
- Command: `python super_script.py -mode=corr`
- Expected time:
  - 7 hours per plot
  - 28 hours for all 4 plots in Figure 6

#### Figure 7: Sketch construction time. Based on the equipment used to run the experiments, you may not be able to reproduce the exact time. However, you can still see a similar trend in the time taken by each method.
- Command: `python super_script.py -mode=time`
- Expected time:
  - 3.5 hours for the plot

### Note that for the following real-data experiments, the results may vary slightly depending on the seed and samples. However, the trend will be similar.
#### Figure 8 and Table 2: Inner product, correlation, and join size estimations for the World Bank data
- Command: `python super_script.py -mode=wbf`
- Expected time:
  - 6 hours for the figure and CSVs

#### Figure 9: Text similarity estimation using the 20 Newsgroups dataset
- Command: `python super_script.py -mode=20news`
- Expected time:
  - 2 hours

#### Figure 10: Join size estimation for the Twitter and TPC-H datasets.
- Skewed TPC-H dataset
  - Command: `python super_script.py -mode=tpch`
  - Expected time:
    - 2 hours
- Twitter dataset
  - Command: `python super_script.py -mode=twitter`
  - Expected time:
    - 8 hours

### 3.4 Viewing the figures:
The figures are generated in PDF format under the directory `/fig`.
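
The per-figure commands in Section 3.3 can also be chained with a small driver script, which makes it easier to plan a long unattended run. The sketch below is not part of the repository and rests on several assumptions: each mode produces all plots for its figure (so the "all plots" timings apply), the script is started from the directory that contains `super_script.py`, and the names `EXPECTED_HOURS`, `build_command`, `total_hours`, and `run_modes` are hypothetical helpers introduced here. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Mode -> expected hours for the whole figure, copied from the timings in
# Section 3.3 (the 64 minutes for Figure 5 are written as 64 / 60 hours).
EXPECTED_HOURS = {
    "ip": 14.0,
    "join_size": 7.2,
    "1normVS2norm": 64 / 60,
    "corr": 28.0,
    "time": 3.5,
    "wbf": 6.0,
    "20news": 2.0,
    "tpch": 2.0,
    "twitter": 8.0,
}


def build_command(mode, python=sys.executable):
    """Build the argument list for one experiment, mirroring the README commands."""
    return [python, "super_script.py", f"-mode={mode}"]


def total_hours(modes):
    """Sum the expected run time (in hours) of the given modes."""
    return sum(EXPECTED_HOURS[mode] for mode in modes)


def run_modes(modes, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for mode in modes:
        cmd = build_command(mode)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing experiment.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    modes = list(EXPECTED_HOURS)
    print(f"Expected total: about {total_hours(modes):.1f} hours")
    run_modes(modes, dry_run=True)
```

Summing the listed timings gives roughly 72 hours for all experiments on the reference machine, so running a subset of modes (for example only `tpch` and `twitter`) may be more practical than a full pass.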