Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/divy9881/external-sorting-for-databases
External Sorting algorithm for Databases having constrained storage hierarchy
https://github.com/divy9881/external-sorting-for-databases
databases graceful-degradation sorting-algorithms tournament-algorithm
Last synced: 8 days ago
JSON representation
External Sorting algorithm for Databases having constrained storage hierarchy
- Host: GitHub
- URL: https://github.com/divy9881/external-sorting-for-databases
- Owner: divy9881
- License: gpl-3.0
- Created: 2023-09-22T00:58:02.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-12-10T03:46:26.000Z (11 months ago)
- Last Synced: 2024-01-28T00:56:37.798Z (10 months ago)
- Topics: databases, graceful-degradation, sorting-algorithms, tournament-algorithm
- Language: C++
- Homepage:
- Size: 438 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# External-Sorting-for-Databases
External Sorting algorithm for Databases having constrained storage hierarchy# Group Members
- Divy Patel (9085310937) [email protected]
- Sahil Naphade (9085746619) [email protected]
- Devaki Kulkarni (9086222321) [email protected]
- Manaswini Gogineni (9085432699) [email protected]# Individual Contributions
- __Divy__: Cache-size mini runs, Device-optimized page sizes, Spilling memory-to-SSD, Spilling from SSD to disk, Graceful degradation, Optimized merge patterns, Testing and Memory Leak Check
- __Sahil__: Tournament trees, Offset-value coding, Minimum count of row & column comparisons, Optimized merge patterns, Large-size records, Testing and Memory Leak Check
- __Devaki__: Tournament trees, Offset-value coding, Large-size records
- __Manaswini__: Verification# Techniques Implemented by our submission and the corresponding Source Files and Lines
- **Tournament trees**: `File Tree.cpp @ Line 196`
- **Offset-value coding**: `File DataRecord.cpp @ Line 122`
- **Minimum count of row & column comparisons**
- **Cache-size mini runs**: `File SortRecords.cpp @ Line 26`
- **Device-optimized page sizes**: `File SortRecords.cpp @ Line 81 and Line 136`
- **Spilling memory-to-SSD**: `File SortRecords.cpp @ Line 65`
- **Spilling from SSD to disk**: `File SortRecords.cpp @ Line 69 and Line 125`
- **Graceful degradation**: `File SortRecords.cpp @ Line 72, Line 74 and Line 151`
- **Into merging**
- **Beyond one merge step**
- **Optimized merge patterns**: `File SortRecords.cpp @ Line 150 and Line 151`
- **Verifying**: `File Iterator.cpp @ Line 69 and Line 84`
- **sets of rows & values**: `File Iterator.cpp @ Line 84`
- **sort order**: `File Iterator.cpp @ Line 69`- Replacement selection?
- Run size > memory size?
- Variable-size records?
- Compression?
- Prefix truncation?
- Quicksort# Reasons we chose to implement the specific subset of techniques
- `Tournament-tree priority queue` was used in order to achieve `high fan-in` for merging our sorted run inputs of records and less number of comparisons than a standard tree-of-winners
- `Offset-value coding` was used to achieve `minimum column value comparisons`
- `Cache-size mini runs` were used to be able to fit the sort inputs, for tournament-tree, in the cache. This enabled us to leverage the low-latency accesses when there are `cache hits`
- `Device-optimized page sizes` were used in order to being cognizant about the `access-profile(latency, bandwidth)` of various devices in the storage hierarchy. For `SSD`, we used `8KB(100 MB/s * 0.1 ms ~ 10KB)` and for `HDD`, we used `1MB(100 MB/s * 10 ms ~ 1MB)`
- We achieved graceful-degradation by spilling `cache-size runs from cache to memory`, `spilling memory-size runs from memory to SSD` and `spilling SSD-size runs from SSD to HDD`
- Also `HDD-page size(1MB)` sorted runs were written to `SSD` prior to actually merging runs on the `HDD`. This is to leverage low-latency accesses of flash drives(SSD)
- `Sort-order`, `set of rows` and their `values` were verified as part of sorting the input records. This is to verify the `correctness` and `integrity` of our sort algorihthm# Project's state
- The implementation of the `External-Sort` is complete with all of the techniques which were expected from us as part of the course project
- The sort was tested against `1KB` size records and with `12M` number of records(although it takes ~1hr to complete the sort, for this particular test-case)
- The sort algorithm was tested against `valgrind` to check for any memory leaks introduced while developing. The codebase does not have any memory leaks, from the latest leak-report on the most recent code version# How to run our programs
- To run our program, first compile the source code using following command, under `External-Sort` directory
```
$ cd External-Sort
$ make ExternalSort.exe
```
- After compiling the source code, to execute the External Sort with custom arguments, run following command inside `External-Sort` directory
```
# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 10 -o trace0.txt
```- The program creates three directories on the completion of the sort algorithm:
- `input`: This directory consist of the input table which has records generated by the random-generator in arbitrary order
- `output`: This directory consist of the output table which has records from input table but in a sorted order, sorted using our sort algorithm
- `trace`: This directory consists of trace files generated from the sort. The trace file consists of logs related SSD and HDD device accesses. And the logs related to sort state machine- In order to remove all the generated binaries, executables, and the utility directories mentioned above, run the following command
```
$ make clean
```# Initial Setup
```
$ docker run -it --privileged -v $pwd/External-Sort:/External-Sort ubuntu bash
$ apt-get update
$ apt-get install build-essential make g++ vim sudo valgrind -y
$ cd ./External-Sort
```# Generating ExternalSort.exe and Running the Executable
```
$ make ExternalSort.exe# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt
```# Run Valgrind for a Leak Check
```
# Creates a log file `valgrind` inside the External-Sort directory
$ valgrind --track-origins=yes --log-file="/External-Sort/valgrind" --leak-check=yes ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt
```