https://github.com/bbuchfink/deepclust-data
- Host: GitHub
- URL: https://github.com/bbuchfink/deepclust-data
- Owner: bbuchfink
- License: gpl-3.0
- Created: 2023-12-13T14:04:42.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-06-18T11:04:57.000Z (12 months ago)
- Last Synced: 2025-06-18T11:27:24.753Z (12 months ago)
- Language: Shell
- Size: 17.8 MB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Scripts to reproduce the results of Buchfink et al., "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust".
# Data files
Required data files (see Data availability):
- `arch80_all.tsv`: File mapping NCBI accessions to Pfam domain architectures.
- `nr_accessions.tsv.zst`: List of all accessions in the NR database used for the main benchmark.
- `clan2acc.tsv`: File mapping Pfam accessions to clans.
- `clust.dedup2.tsv.zst`: TSV file mapping cluster member sequences to their representatives for the large clustering run of ~19 billion sequences.
- `centroids.dedup.faa`: FASTA file of the representative sequences for the same ~19 billion sequence clustering run.
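
As a quick sanity check on a member-to-representative mapping such as `clust.dedup2.tsv.zst`, the following sketch counts members per representative. It is not part of the repository's scripts. The helper name `cluster_sizes` is hypothetical, and the column order (representative in column 1, member in column 2) is an assumption that should be checked against the actual file before relying on the output.

```shell
# Hypothetical helper: count members per representative from a
# two-column, tab-separated mapping read on stdin.
# ASSUMPTION: column 1 = representative, column 2 = member.
# Swap $1 for $2 in the awk program if the file uses the opposite order.
cluster_sizes() {
    tab="$(printf '\t')"
    awk -F'\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' |
        sort -t "$tab" -k2,2nr -k1,1   # largest clusters first, ties by name
}

# Example usage (requires zstd; streams the file without unpacking it to disk):
#   zstd -dc clust.dedup2.tsv.zst | cluster_sizes | head
```

The awk program holds one counter per representative in memory, so for the full ~19bn mapping this may need a machine with ample RAM or a sort-based alternative.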