{"id":39761161,"url":"https://github.com/agormp/greedysub","last_synced_at":"2026-01-18T11:38:07.112Z","repository":{"id":65326308,"uuid":"573046730","full_name":"agormp/greedysub","owner":"agormp","description":"Reduce redundancy in dataset using greedy algorithms: select subset of data such that no items are closely related","archived":false,"fork":false,"pushed_at":"2025-01-22T10:05:38.000Z","size":10694,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-30T22:51:37.060Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agormp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-12-01T15:29:04.000Z","updated_at":"2025-01-22T10:05:42.000Z","dependencies_parsed_at":"2025-04-11T16:49:38.157Z","dependency_job_id":"78d1b289-97ad-4f39-98cb-fb0ddf608a21","html_url":"https://github.com/agormp/greedysub","commit_stats":{"total_commits":127,"total_committers":1,"mean_commits":127.0,"dds":0.0,"last_synced_commit":"9242196db2201019f161bf9563e83ab80f92a966"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/agormp/greedysub","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agormp%2Fgreedysub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agormp%2Fgreedysub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agormp%2Fgreedysub/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agormp%2Fgreedysub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agormp","download_url":"https://codeload.github.com/agormp/greedysub/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agormp%2Fgreedysub/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28535169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T10:13:46.436Z","status":"ssl_error","status_checked_at":"2026-01-18T10:13:11.045Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-18T11:38:07.033Z","updated_at":"2026-01-18T11:38:07.095Z","avatar_url":"https://github.com/agormp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# greedysub\n\n![](https://img.shields.io/badge/version-1.2.5-blue)\n[![PyPI 
downloads](https://static.pepy.tech/personalized-badge/greedysub?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=blue\u0026left_text=PyPI%20downloads)](https://pepy.tech/project/greedysub)\n[![DOI](https://zenodo.org/badge/573046730.svg)](https://zenodo.org/doi/10.5281/zenodo.8383075)\n![Tests](https://github.com/agormp/greedysub/actions/workflows/tests.yml/badge.svg)\n![Codecov](https://codecov.io/gh/agormp/greedysub/branch/main/graph/badge.svg)\n\nThe `greedysub` command-line program selects a subset of input data such that no retained items are closely related (\"neighbors\"). \n\n![](https://github.com/agormp/greedysub/raw/main/maxindset.png?raw=true)\n\n## Overview\n\nOne use case for `greedysub` is to select a non-redundant subset of DNA or protein sequences, i.e., a subset where all pairwise sequence identities are below a given threshold. However, the program can be used to find representative subsets for any other type of item as well, or, more generally, to find a [\"maximal independent set\"](#theory) on a graph. \n\nThe program requires a list of pairwise similarities (or distances) as input, along with a cutoff specifying when two items are considered to be neighbors.\n\nReducing sequence redundancy is helpful, e.g., when using cross-validation for estimating the predictive performance of machine learning methods, such as neural networks, in order to avoid spuriously high performance estimates: if similar items (sequences) are present in both training and test sets, then the method will appear to be good at generalisation, when it may just have been overtrained to recognize items (sequences) similar to those in the training set. \n\nThe program implements two different [greedy](https://en.wikipedia.org/wiki/Greedy_algorithm) heuristics for solving the problem: \"greedy-max\" and \"greedy-min\". On average, the \"min\" algorithm performs best (giving the largest subset). 
See section \"Theory\" for details on the algorithms, and for comments on the non-optimality of the heuristics for this problem.\n\n\n## Availability\n\nThe `greedysub` source code is available on GitHub: https://github.com/agormp/greedysub. The executable can be installed from PyPI: https://pypi.org/project/greedysub/\n\n## Installation\n\n```\npython3 -m pip install greedysub\n```\n\nUpgrading to latest version:\n\n```\npython3 -m pip install --upgrade greedysub\n```\n\n## Citation\n\nTo cite greedysub: use the link in the right sidebar under About --\u003e Cite this repository.\n\n## Primary Dependencies\n\n* [pandas](https://pandas.pydata.org) (automatically installed when using pip to install greedysub)\n\n## Usage\n\n```\nusage: greedysub    [-h] [--algo ALGORITHM] [--val VALUETYPE] [-c CUTOFF] [-k KEEPFILE]\n                    INFILE OUTFILE\n\nSelects subset of items, based on list of pairwise similarities (or distances), such that\nno retained items are close neighbors\n\npositional arguments:\n  INFILE            input file containing similarity or distance for each pair of items:\n                    name1 name2 value\n  OUTFILE           output file contatining neighborless subset of items (one name per\n                    line)\n\noptions:\n  -h, --help        show this help message and exit\n  --algo ALGORITHM  algorithm: min, max [default: min]\n  --val VALUETYPE   specify whether values in INFILE are distances (--val dist) or\n                    similarities (--val sim)\n  -c CUTOFF         cutoff value for deciding which pairs are neighbors\n  -k KEEPFILE       (optional) file with names of items that must be kept (one name per\n                    line)\n```\n\n### Input file\n\nThe program requires an INFILE, which should be a textfile where each line contains the names of two sequences (items) and their pairwise similarity (option `--val sim`) or distance (option `--val dist`):\n\n```\nyfg1  yfg2  0.98\nyfg1  klp2  0.67\nyfg1  mcf9  
0.87\n...\n```\n\n**Note:** The input file must contain one line for *each possible pair of items*.\n\n### Output file\n\nThe results are written to the OUTFILE, which will contain a list of names (one name per line) of sequences (items) that should be retained: \n\n```\nyfg1\nklp2\n...\n```\n\n**Note:** It is guaranteed that no two items in the resulting subset are neighbors.\nThe program aims to find the maximally sized set of non-adjacent items (but see section Theory for why this is hard and not guaranteed).\n\n\n### Keepfile\n\nUsing the option `-k \u003cPATH TO KEEPFILE\u003e` the user can specify a list of names for items that must be retained in the subset no matter what (even if some of them are neighbors). This KEEPFILE should be a text file listing one name to be retained per line:\n\n```\nabc1\ndef3\n...\n```\n\n### Usage examples\n\n#### Select items such that pairwise *similarity* is less than 0.75, using \"greedy-min\" algorithm\n\n```\ngreedysub --algo min --val sim -c 0.75 simfile.txt resultfile.txt\n```\n\n#### Select items such that pairwise *distance* is at least 10, using \"greedy-min\" algorithm\n\n```\ngreedysub --algo min --val dist -c 10 distfile.txt resultfile.txt\n```\n\n#### Select items with pairwise *distance* at least 3, while keeping items in keeplist.txt, using \"greedy-max\"\n\n```\ngreedysub --algo max --val dist -c 3 -k keeplist.txt distfile.txt resultfile.txt\n```\n\n### Summary info written to stdout\n\nBasic information about the original and reduced data sets will be printed to stdout. 
\n\n#### Example output\n\n```\n\n\tNames in reduced set written to tests/outfile.txt\n\n\tNumber in original set:      1,414\n\tNumber in reduced set:         509\n\n\tNode degree original set:\n\t    min:       1\n\t    max:       9\n\t    ave:       3.03\n\n\tNode distances original set:\n\t    ave:       5.12\n\t    cutoff:    0.95\n\n```\n\nHere, the `node degree` of an item is the number of neighbors it has (i.e., the number of other items that are closer to the item than the cutoff value).\n\n\n## Theory\n\n### Equivalence to \"maximum independent set problem\" and other problems\n\nFinding the largest subset of non-neighboring sequences (items) from a list of pairwise similarities (or distances) is equivalent to the following problems:\n\n* [\"Maximum independent set problem\"](https://en.wikipedia.org/wiki/Independent_set_(graph_theory)) from graph-theory: find the largest set of nodes on a graph, such that none of the nodes are adjacent.\n* [\"Maximum clique problem\"](https://en.wikipedia.org/wiki/Clique_problem#Finding_maximum_cliques_in_arbitrary_graphs): if a set of nodes constitutes a maximum independent set, then the same nodes form a maximum [clique](https://en.wikipedia.org/wiki/Clique_(graph_theory)) on the [complement graph](https://en.wikipedia.org/wiki/Complement_graph).\n* [\"Minimum vertex cover problem\"](https://en.wikipedia.org/wiki/Vertex_cover): a vertex cover is a set of nodes that includes at least one endpoint of all edges of the graph. A minimum vertex cover is the smallest possible such set. A minimum vertex cover is the complement of a maximum independent set.\n\n### Computational intractability of problem\n\nThis problem is [strongly NP-hard](https://en.wikipedia.org/wiki/Strong_NP-completeness) and it is also\n[hard to approximate](https://projecteuclid.org/journals/acta-mathematica/volume-182/issue-1/Clique-is-hard-to-approximate-within-n1ε/10.1007/BF02392825.full). 
There are therefore no efficient, exact algorithms, although [there are exact algorithms with much better time complexity than the worst-case complexity of a naive, exhaustive search](https://arxiv.org/abs/1312.6260). \n\n### Implemented algorithms\n\n**Note:** of the two implemented greedy algorithms, `greedy-min` has the best guaranteed performance. However, in practice performance can be much better than the guaranteed minimum, and occasionally `greedy-max` may find a larger set (this depends on the specific graph).\n\n#### Greedy-min algorithm\n\nGiven a graph $G$, and an empty set $S$:\n\n* While there are still nodes in $G$:\n\t* Select a node $\\nu$ of *minimum* degree in $G$\n\t* Add $\\nu$ to $S$\n\t* Remove $\\nu$ and its neighbors from $G$\n* Output the set of nodes in $S$\n\n**Performance ratio:** On a graph with maximum node degree $\\Delta$, it [has been shown](https://link.springer.com/article/10.1007/BF02523693) that the greedy-min algorithm yields solutions that are within a factor $3 / (\\Delta + 2)$ of the optimal solution. For instance, for $\\Delta=4$ the algorithm is guaranteed to be no worse than $3 / (4 + 2) = 0.5$ times the optimal solution (i.e., the found solution will be at least half the size of the optimal one).\n\n#### Greedy-max algorithm\n\nGiven a graph $G$:\n\n* While there are still edges in $G$:\n\t* Select a node $\\nu$ of *maximum* degree in $G$\n\t* Remove $\\nu$\n* Output set of nodes left in $G$\n\n**Performance ratio:** On a graph with maximum node degree $\\Delta$, it [has been shown](https://www.sciencedirect.com/science/article/pii/S0166218X02002056?via%3Dihub) that the greedy-max algorithm yields solutions that are within a factor $1 / (\\Delta + 1)$ of the optimal solution. 
For instance, for $\\Delta=4$ the algorithm is guaranteed to be no worse than $1 / (4 + 1) = 0.2$ times the optimal solution (i.e., the found solution will be at least 20% of the size of the optimal one).\n\n**Note:** the greedy-max algorithm is the same as algorithm 2 from the following paper, and has also been implemented in the [`hobohm` program](https://github.com/agormp/hobohm) (but the algorithm has been described in the context of graph theory prior to this work): Hobohm et al.: [\"Selection of representative protein data sets\", Protein Sci. 1992. 1(3):409-17](https://pubmed.ncbi.nlm.nih.gov/1304348/).\n\n### Computational performance\n\nThe program has been optimized to run reasonably fast with limited memory usage, and to be able to handle large input files (including files larger than available RAM). A known (current) limitation is that the neighbor graph (the dictionary keeping track of which nodes connect to which other nodes) has to be small enough to fit in memory.\n\nThe table below shows examples of run times (wall-clock time) on a 2021 M1 MacBook Pro (64 GB memory), for different sizes of input files.\n\n\nCutoffs were chosen such that inputs were reduced to approximately 500 names regardless of starting size (except for the smallest file where the cutoff was chosen such that the input was reduced to a third of its initial size).\n\n| Size of input file  | Size of input file: lines | No. names, original | No. 
names, reduced | Peak memory     |Wall-clock time |\n|      :-----:        |       :-----:             |        -----:       |     -------:       |   -------:      |  -------:      |\n|      1.6 MB         |       100 K (1E5)         |         447         |       151          |    43 MB        |   0.36 s       |\n|      18 MB          |       1 mill (1E6)        |        1,414        |       509          |    88 MB        |   0.52 s       |\n|      91 MB          |       5 mill (5E6)        |        3,162        |       500          |    97 MB        |   1.23 s       |\n|      181 MB         |       10 mill (1E7)       |        4,472        |       501          |    102 MB       |   2.23 s       |\n|      2.0 GB         |       100 mill (1E8)      |        14,142       |       505          |    310 MB       |   21.3 s       |\n|      20 GB          |       1 bill (1E9)        |        44,721       |       501          |    6.7 GB       |   4:12 m:s     |      \n          \n\n\u003c!---\n\nCutoffs were chosen such that inputs were reduced to approximately 500 names regardless of starting size (except for the smallest file where the cutoff was chosen such that the input was reduced to a third of its initial size).\n\n| Size of input file  | Size of input file: lines | No. names, original | No. 
names, reduced | Time         |\n|      :-----:        |       :-----:             |        -----:       |     -------:       |   -------:   |\n|      1.6 MB         |       100 K (1E5)         |         447         |   c 1      151     |    0.36 s    |\n|      18 MB          |       1 mill (1E6)        |        1,414        |   c 0.95   509     |    0.52 s    |\n|      91 MB          |       5 mill (5E6)        |        3,162        |   c 1.75   500     |    1.23 s    |\n|      181 MB         |       10 mill (1E7)       |        4472         |   c 2.16   501     |    2.23 s    |\n|      2.0 GB         |       100 mill (1E8)      |        14142        |   c 3.65   505     |    21.3 s    |\n|      20 GB          |       1 bill (1E9)        |        44721        |   c 5.38   501     |                 |               |\n          \n--\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagormp%2Fgreedysub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagormp%2Fgreedysub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagormp%2Fgreedysub/lists"}