{"id":19304567,"url":"https://github.com/divy9881/external-sorting-for-databases","last_synced_at":"2026-05-18T06:02:09.073Z","repository":{"id":212605547,"uuid":"694918746","full_name":"divy9881/External-Sorting-for-Databases","owner":"divy9881","description":"External Sorting algorithm for Databases having constrained storage hierarchy","archived":false,"fork":false,"pushed_at":"2023-12-10T03:46:26.000Z","size":448,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-10-10T00:03:11.338Z","etag":null,"topics":["databases","graceful-degradation","sorting-algorithms","tournament-algorithm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/divy9881.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-09-22T00:58:02.000Z","updated_at":"2024-02-14T05:14:12.000Z","dependencies_parsed_at":"2023-12-15T07:27:23.865Z","dependency_job_id":"dbae977a-3128-4f43-9bd6-77babc8236eb","html_url":"https://github.com/divy9881/External-Sorting-for-Databases","commit_stats":null,"previous_names":["divy9881/external-sorting-for-databases"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/divy9881/External-Sorting-for-Databases","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/divy9881%2FExternal-Sorting-for-Databases","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/divy9881%2FExternal-Sorting-for-Databases/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d
ivy9881%2FExternal-Sorting-for-Databases/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/divy9881%2FExternal-Sorting-for-Databases/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/divy9881","download_url":"https://codeload.github.com/divy9881/External-Sorting-for-Databases/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/divy9881%2FExternal-Sorting-for-Databases/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33167429,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T05:43:36.989Z","status":"ssl_error","status_checked_at":"2026-05-18T05:43:19.133Z","response_time":71,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["databases","graceful-degradation","sorting-algorithms","tournament-algorithm"],"created_at":"2024-11-09T23:30:13.618Z","updated_at":"2026-05-18T06:02:09.054Z","avatar_url":"https://github.com/divy9881.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# External-Sorting-for-Databases\nExternal Sorting algorithm for Databases having constrained storage hierarchy\n\n# Group Members\n- Divy Patel (9085310937) dspatel6@wisc.edu\n- Sahil Naphade (9085746619) smnaphade@wisc.edu\n- Devaki Kulkarni (9086222321) dgkulkarni2@wisc.edu\n- Manaswini Gogineni (9085432699) mgogineni@wisc.edu 
\n\n# Individual Contributions\n- __Divy__: Cache-size mini runs, Device-optimized page sizes, Spilling memory-to-SSD, Spilling from SSD to disk, Graceful degradation, Optimized merge patterns, Testing and Memory Leak Check\n- __Sahil__: Tournament trees, Offset-value coding, Minimum count of row \u0026 column comparisons, Optimized merge patterns, Large-size records, Testing and Memory Leak Check\n- __Devaki__: Tournament trees, Offset-value coding, Large-size records\n- __Manaswini__: Verification\n\n# Techniques Implemented by our submission and the corresponding Source Files and Lines\n\n- **Tournament trees**: `File Tree.cpp @ Line 196`\n- **Offset-value coding**: `File DataRecord.cpp @ Line 122`\n- **Minimum count of row \u0026 column comparisons**\n- **Cache-size mini runs**: `File SortRecords.cpp @ Line 26`\n- **Device-optimized page sizes**: `File SortRecords.cpp @ Line 81 and Line 136`\n- **Spilling memory-to-SSD**: `File SortRecords.cpp @ Line 65`\n- **Spilling from SSD to disk**: `File SortRecords.cpp @ Line 69 and Line 125`\n- **Graceful degradation**: `File SortRecords.cpp @ Line 72, Line 74 and Line 151`\n  - **Into merging**\n  - **Beyond one merge step**\n- **Optimized merge patterns**: `File SortRecords.cpp @ Line 150 and Line 151`\n- **Verifying**: `File Iterator.cpp @ Line 69 and Line 84`\n  - **sets of rows \u0026 values**: `File Iterator.cpp @ Line 84`\n  - **sort order**: `File Iterator.cpp @ Line 69`\n\n- Replacement selection?\n- Run size \u003e memory size?\n- Variable-size records?\n- Compression?\n- Prefix truncation?\n- Quicksort\n\n\n# Reasons we chose to implement the specific subset of techniques\n- `Tournament-tree priority queue` was used to achieve `high fan-in` when merging our sorted runs of records, with fewer comparisons than a standard tree-of-winners\n- `Offset-value coding` was used to achieve `minimum column value comparisons`\n- `Cache-size mini runs` were used to fit the sort inputs, for 
tournament-tree, in the cache. This lets us benefit from low-latency accesses on `cache hits`\n- `Device-optimized page sizes` were used to account for the `access profile (latency, bandwidth)` of the various devices in the storage hierarchy. For `SSD`, we used `8KB(100 MB/s * 0.1 ms ~ 10KB)` and for `HDD`, we used `1MB(100 MB/s * 10 ms ~ 1MB)`\n- We achieved graceful degradation by spilling `cache-size runs from cache to memory`, `spilling memory-size runs from memory to SSD` and `spilling SSD-size runs from SSD to HDD`\n- Also, `HDD-page size(1MB)` sorted runs were written to `SSD` before merging runs on the `HDD`, to leverage the low-latency accesses of flash drives (SSD)\n- `Sort-order`, `set of rows` and their `values` were verified as part of sorting the input records. This verifies the `correctness` and `integrity` of our sort algorithm\n\n\n# Project's state\n- The implementation of the `External-Sort` is complete, with all of the techniques expected of us as part of the course project\n- The sort was tested with `1KB` size records and `12M` records (although the sort takes ~1hr to complete for this particular test case)\n- The sort algorithm was run under `valgrind` to check for any memory leaks introduced during development. 
The codebase has no memory leaks, according to the latest leak report on the most recent code version\n\n# How to run our programs\n- To run our program, first compile the source code using the following command, under the `External-Sort` directory\n```\n$ cd External-Sort\n$ make ExternalSort.exe\n```\n- After compiling the source code, to execute the External Sort with custom arguments, run the following command inside the `External-Sort` directory\n```\n# Where,\n# \"-c\" gives the total number of records\n# \"-s\" is the individual record size\n# \"-o\" is the trace of your program run\n$ ./ExternalSort.exe -c 120 -s 10 -o trace0.txt\n```\n\n- The program creates three directories on completion of the sort algorithm:\n  - `input`: This directory consists of the input table, which has records generated by the random-generator in arbitrary order\n  - `output`: This directory consists of the output table, which has the records from the input table in sorted order, sorted using our sort algorithm\n  - `trace`: This directory consists of trace files generated from the sort. The trace file consists of logs related to SSD and HDD device accesses. 
And the logs related to sort state machine\n\n- In order to remove all the generated binaries, executables, and the utility directories mentioned above, run the following command\n```\n$ make clean\n```\n\n# Initial Setup\n```\n$ docker run -it --privileged -v $pwd/External-Sort:/External-Sort ubuntu bash\n$ apt-get update\n$ apt-get install build-essential make g++ vim sudo valgrind -y\n$ cd ./External-Sort\n```\n\n# Generating ExternalSort.exe and Running the Executable\n```\n$ make ExternalSort.exe\n\n# Where,\n# \"-c\" gives the total number of records\n# \"-s\" is the individual record size\n# \"-o\" is the trace of your program run\n$ ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt\n```\n\n# Run Valgrind for a Leak Check\n```\n# Creates a log file `valgrind` inside the External-Sort directory\n$ valgrind --track-origins=yes --log-file=\"/External-Sort/valgrind\" --leak-check=yes ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdivy9881%2Fexternal-sorting-for-databases","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdivy9881%2Fexternal-sorting-for-databases","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdivy9881%2Fexternal-sorting-for-databases/lists"}