{"id":20153551,"url":"https://github.com/seqan/chopper","last_synced_at":"2025-04-09T21:33:11.218Z","repository":{"id":37856304,"uuid":"373461381","full_name":"seqan/chopper","owner":"seqan","description":"A tool for partitioning a set of sequences into similar batches.","archived":false,"fork":false,"pushed_at":"2024-10-23T19:38:28.000Z","size":3050,"stargazers_count":9,"open_issues_count":8,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-10-24T04:08:50.477Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://docs.seqan.de/chopper","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seqan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-03T10:01:35.000Z","updated_at":"2024-10-23T19:38:32.000Z","dependencies_parsed_at":"2023-10-16T19:30:56.373Z","dependency_job_id":"ad564bb3-acd8-4601-8d29-96b4315808b6","html_url":"https://github.com/seqan/chopper","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fchopper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fchopper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fchopper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seqan%2Fchopper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seqan","download_url":"https://codeload.github.com/seq
an/chopper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248114961,"owners_count":21050145,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T23:19:41.582Z","updated_at":"2025-04-09T21:33:11.182Z","avatar_url":"https://github.com/seqan.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chopper - partition your sequences [![build status][1]][2] [![codecov][3]][4]\n\n[1]: https://github.com/seqan/chopper/actions/workflows/ci_linux.yml/badge.svg?branch=main\n[2]: https://github.com/seqan/chopper/actions?query=branch%3Amain\n[3]: https://codecov.io/gh/seqan/chopper/branch/main/graph/badge.svg?token=SJVMYRUKW2\n[4]: https://codecov.io/gh/seqan/chopper\n\n## System requirements\n\n* GCC Version \u003e= 11\n* CMake Version \u003e= 3.18\n\n## General setup\n\nSet up the repository:\n\n```\ngit clone --recurse-submodules https://github.com/seqan/chopper\n```\n\nSet up the build directory\n```\nmkdir chopper_build\ncd chopper_build\ncmake ../chopper\n```\n\nBuild chopper\n```\nmake\n```\n\nOptional: Build the test to check if everything works\n```\nmake test\n```\nIn case anything fails, please open a GitHub issue mentioning your OS and compiler version.\n\n\n## Chopper\n\nChopper uses a hierarchical DP algorithm to layout user bins into a given number of technical bins,\noptimizing the space consumption of a Hierarchical Interleaved Bloom Filter (HIBF).\n\nChopper needs an **input file** with filenames. 
The file could look like this:\n(it is always good to give absolute instead of relative paths)\n\n```\n/path/to/file1.fa\n/path/to/file2.fa\n/path/to/file3.fa.gz\n...\n```\n\nYou can then **run chopper** with the following command:\n\n```\n./chopper --input data.tsv --kmer 21 --output chopper.layout\n```\n\nThere are **more options** to tweak the layout (with sensible defaults). You get detailed information if you run:\n```\n./chopper --help\n```\n\nThe resulting layout file can be used to build an HIBF index with\n[raptor](https://github.com/seqan/raptor).\n\n## Understanding the layout file\n\nThere is no need to actually understand the internals of the layout file, as you can just let\n[raptor](https://github.com/seqan/raptor) build the HIBF index automatically from the layout.\nIf you are interested, or you have a specific use case, here is some information about the layout.\nYou will also find a [visualisation of the layout file](#visualisation-of-the-layout-file) at the end of this section.\n\n**A layout file has 3 parts: (1) The config, (2) the header, (3) the layout content.**\n\nAt first, **the config** of chopper that created this particular file is stored.\nThe config part is identified by two hashes `##` at the beginning of each line. It starts with `##CONFIG` and ends with\n`##ENDCONFIG`. It could look like this:\n\n```\n##CONFIG:\n##{\n##    \"config\": {\n##        \"version\": 2,\n##        \"k\": 19,\n\n...\n\n##ENDCONFIG\n```\n\nThe config is followed by the actual **header** of the layout file, which stores important information for building the\nHIBF index.\nThe header is identified by one hash `#` at the beginning of each line. 
It starts with `#HIGH_LEVEL_IBF max_bin_id:[X]`\nand ends with `#FILES\tBIN_INDICES\t  NUMBER_OF_BINS`, the column names of the layout content.\n\nIt could look like this:\n\n```\n#HIGH_LEVEL_IBF max_bin_id:14\n#MERGED_BIN_0 max_bin_id:0\n#MERGED_BIN_5;3 max_bin_id:2\n\n...\n\n#FILES\tBIN_INDICES\tNUMBER_OF_BINS\n```\n\nEach line corresponds to one IBF in the hierarchy, identifying the maximum technical bin, maximal in its k-mer content.\nThis information is needed to compute the size of each IBF when building the HIBF.\n\nDetails:\n\n1. `HIGH_LEVEL_IBF max_bin_id:[X]`: Reports the id (`[X]`) of that technical bin in the top/high-level IBF that has the\n                                    highest k-mer content.\n2. `MERGED_BIN_[Y] max_bin_id:[X]`: Reports the id (`[X]`) of that technical bin in the IBF identified by `[Y]`.\n\n\nFollowing the header, the **layout content** describes the actual layout.\nEach line reports the structure for a particular user bin. In that sense, the number of content lines is exactly the\nsame as that of the inputs.\n\nIt could look like this:\n\n```\n/path/to/file1.fa\t12\t3\n/path/to/file2.fa\t0;2\t1;1\n/path/to/file3.fa.gz\t5;4\t1;7\n\n...\n```\n\nColumns of the layout content:\n\n1. `FILES`: The file path(s) for the user bin.\n2. `BIN_INDICES`: The technical bin indices on each level that the user bin is stored in. In the example:\n                  * `file1.fa` is stored in technical bin `12` of the top-level IBF.\n                  * `file2.fa` is stored in technical bin `0` of the top-level IBF which is a merged bin, so it's\n                    also stored in a lower level IBF. In this lower level IBF, it is stored in technical bin `2`.\n                  * `file3.fa.gz` is stored in technical bin `5` of the top-level IBF which is a merged bin, so it's\n                    also stored in a lower level IBF. In this lower level IBF, it is stored in technical bin `4`.\n3. 
`NUMBER_OF_BINS`: The number of technical bins the user bin is stored in on each level. For this example:\n                  * `file1.fa` is split into `3` technical bins (ids:`12,13,14`) on the top-level IBF.\n                  * `file2.fa` is stored in a merged bin (`1`) and in a single bin (`1`) on the lower level.\n                  * `file3.fa.gz` is stored in a merged bin (`1`) and is split into `7` bins (ids:`4,5,6,7,8,9,10`)\n                     on the lower level.\n\n### Visualisation of the layout file\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e\u003ci\u003eClick here to show a visualisation of the layout file\u003c/i\u003e\u003c/b\u003e\u003c/summary\u003e\n\n\u003cimg src=\"doc/layout_file.svg\" alt=\"Visualisation of the layout file\" width=\"100%\"\u003e\n\n\u003c/details\u003e\n\n## Multiple files per user bin\n\nCurrently, chopper always has a 1-to-1 relation between files and user bins when laying out.\nIf you want to assign multiple files to a user bin, you can use [raptor](https://github.com/seqan/raptor).\n\n`raptor prepare` also handles files like this:\n\n```\n/path/to/file1-a.fa;/path/to/file1-b.fa;/path/to/file1-c.fa\n/path/to/file2.fa\n/path/to/file3.fa.gz\n\n...\n```\n\n`chopper` or alternatively `raptor layout` (which calls chopper) then computes the layout based on the precomputed files.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseqan%2Fchopper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseqan%2Fchopper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseqan%2Fchopper/lists"}