{"id":49871168,"url":"https://github.com/anandsahuofficial/tbxt-hit-id","last_synced_at":"2026-05-15T07:35:12.803Z","repository":{"id":356977323,"uuid":"1232540661","full_name":"anandsahuofficial/tbxt-hit-id","owner":"anandsahuofficial","description":"Multi-signal computational pipeline for TBXT G177D (Brachyury), chordoma's master regulator. TBXT Hackathon   2026.","archived":false,"fork":false,"pushed_at":"2026-05-10T18:39:38.000Z","size":9389,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-10T19:32:25.789Z","etag":null,"topics":["autodock-vina","boltz-2","chordoma","computational-chemistry","drug-discovery","gnina","hit-identification","mmgbsa","muni-bio","onepot","openmm","pillar-vc","qsar","rdkit","rowan","tbxt","tbxt-hackathon-2026","virtual-screening"],"latest_commit_sha":null,"homepage":"https://github.com/anandsahuofficial/tbxt-hit-id","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anandsahuofficial.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-08T03:10:51.000Z","updated_at":"2026-05-10T18:39:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/anandsahuofficial/tbxt-hit-id","commit_stats":null,"previous_names":["anandsahuofficial/tbxt-hit-id"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/anandsahuofficial/tbxt-hit-id","re
pository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anandsahuofficial%2Ftbxt-hit-id","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anandsahuofficial%2Ftbxt-hit-id/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anandsahuofficial%2Ftbxt-hit-id/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anandsahuofficial%2Ftbxt-hit-id/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anandsahuofficial","download_url":"https://codeload.github.com/anandsahuofficial/tbxt-hit-id/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anandsahuofficial%2Ftbxt-hit-id/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33057988,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-15T02:00:06.351Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autodock-vina","boltz-2","chordoma","computational-chemistry","drug-discovery","gnina","hit-identification","mmgbsa","muni-bio","onepot","openmm","pillar-vc","qsar","rdkit","rowan","tbxt","tbxt-hackathon-2026","virtual-screening"],"created_at":"2026-05-15T07:35:11.986Z","updated_at":"2026-05-15T07:35:12.796Z","avatar_url":"https://github.com/anandsahuofficial.png","language":"Python","funding_links
":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/social_preview.png\" alt=\"tbxt-hit-id - multi-signal computational pipeline for TBXT G177D (Brachyury), chordoma's master regulator\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n# tbxt-hit-id\n\n**Multi-signal computational pipeline for identifying small-molecule\ninhibitors of TBXT (Brachyury) - chordoma's master transcription\nfactor and lineage-defining oncogenic driver.**\n\nA reproducible end-to-end pipeline that integrates docking (Vina,\nGNINA), generative co-folding (Boltz-2), free-energy refinement\n(MMGBSA), target-specific QSAR (RF + XGBoost on 653 measured Naar SPR\nKd), and T-box paralog selectivity into a strict 7-criterion filter\nchain - yielding 137 organizer-compliant hit candidates and a top-4\nranked submission.\n\n\u003e Built for the **TBXT Hit Identification Hackathon 2026**\n\u003e ([tbxtchallenge.org](https://tbxtchallenge.org)), hosted by\n\u003e [muni.bio](https://muni.bio) in Boston at\n\u003e [Pillar VC](https://www.pillar.vc/), with platform support from\n\u003e [Rowan](https://labs.rowansci.com) and\n\u003e [onepot.ai](https://www.onepot.ai).\n\n\u003e **Note on history.** This repository is the cleaned, post-hackathon\n\u003e version of the pipeline. The original development repository -\n\u003e including the full commit history from the pre-event build phase and\n\u003e the hackathon weekend - is preserved at\n\u003e [`anandsahuofficial/tbxt-hit-id-dev`](https://github.com/anandsahuofficial/tbxt-hit-id-dev).\n\n---\n\n## Citation\n\nIf you use this pipeline or adapt it for your own targets, please cite:\n\n\u003e Sahu, A. and contributors (2026). 
*tbxt-hit-id: Multi-signal\n\u003e computational pipeline for TBXT (Brachyury) hit identification.*\n\u003e https://github.com/anandsahuofficial/tbxt-hit-id\n\nA machine-readable [`CITATION.cff`](CITATION.cff) is included for\nautomated citation tools (GitHub renders this as a \"Cite this\nrepository\" button in the right sidebar).\n\n---\n\n## Why this matters\n\nChordoma is a rare, locally aggressive cancer of the notochord remnant\n(skull base, mobile spine, sacrum). It has no FDA-approved targeted\ntherapy and a 5-year survival around 70%. The transcription factor\n**TBXT (Brachyury)** is its master regulator and is required for\ntumor cell survival - silencing TBXT collapses chordoma cell viability.\nThis makes TBXT the most-validated drug target in chordoma, but it is\nalso a long-considered \"undruggable\" transcription factor with a\nshallow DNA-binding domain pocket and high T-box paralog homology.\n\nThe **G177D variant** (`rs2305089`, allele frequency ~0.42) is enriched\nin \u0026gt; 90% of Western chordoma cases and creates a unique pocket\n(site F: Y88 / D177 / L42) that engages the variant residue directly -\na structural feature absent from the other 16 T-box paralogs and\ntherefore intrinsically selective.\n\nThis repo is the open computational pipeline that selected hit\ncandidates against that pocket from a 570-compound novelty-filtered\npool.\n\n---\n\n## Results\n\n| | |\n|---|---|\n| Pool screened | **570** novelty-filtered compounds |\n| Pass all 7 strict criteria | **137** organizer-compliant candidates |\n| Submitted to experimental program | **24** compounds (4 to judges + 20 first batch) |\n| Best predicted Boltz-2 Kd | **3.2 µM** (dual-engine 1.02× agreement) |\n| Cost to source all 4 picks | **$875** via onepot.ai 100% catalog match |\n| Site coverage on top 4 picks | **4 / 4** at site F (variant residue D177) |\n\nThe **4 picks** (`results/top4.csv`):\n\n| # | ID | Boltz Kd (run A / B) | gnina Vina/pKd | Cost | Risks 
|\n|---:|---|---:|---:|---:|:---:|\n| 1 | FM002150_analog_0083 | 3.2 / 3.26 µM | -5.01 / 3.94 | $125 | low/low |\n| 2 | FM001452_analog_0104 | 3.7 / 4.97 µM | -5.77 / 4.03 | $250 | med/med |\n| 3 | FM001452_analog_0201 | 8.16 / 8.76 µM | -6.07 / 4.69 | $375 | high/med |\n| 4 | FM001452_analog_0171 | 8.32 / 8.17 µM | -6.19 / 4.44 | $250 | med/med |\n\nThe full 137-candidate pool with every per-criterion pass/fail flag\nis in `results/all_candidates_tiered.csv`.\n\n---\n\n## Approach - 6 orthogonal signals + 7-criterion strict gate\n\n![Pipeline architecture](docs/architecture.png)\n\n### Six orthogonal scoring signals\n\n| Signal | What it catches |\n|---|---|\n| **Vina ensemble** (6 receptor confs) | Geometric fit, receptor flexibility |\n| **GNINA CNN** pose + pKd | Vina-trap detection, ML affinity |\n| **TBXT-specific QSAR** (RF + XGBoost on 653 measured Naar SPR Kd) | Target-specific affinity |\n| **Boltz-2 generative co-folding** (two independent backends) | Independent affinity + binder/non-binder classifier |\n| **MMGBSA implicit-solvent refinement** (top 30) | Free-energy refinement |\n| **T-box paralog selectivity** (16 paralogs) | Off-target risk |\n\nEach signal has a known failure mode that another signal in the\nstack catches. No pick depends on a single score.\n\n### Seven-criterion strict filter (T-0 hard gate)\n\n```\nC1  onepot.ai 100% catalog match (similarity = 1.000)\nC2  strictly non-covalent\nC3  Chordoma rule: MW ≤ 600, LogP ≤ 6, HBD ≤ 6, HBA ≤ 12\nC4  lead-like ideal: 10–30 HA, HBD+HBA ≤ 11, \u003c 5 rings, ≤ 2 fused\nC5  PAINS-clean + no acid halides / aldehydes / diazo / imines /\n      polycyclic \u003e 2 fused / long alkyl\nC6  Tanimoto \u003c 0.85 to Naar / TEP / prior_art_canonical\nC7  ESOL log S \u003e -5 (DMSO @ 10 mM + aqueous @ 50 µM)\n```\n\n570 → 137 strict-pass → 4 picks. 
See [`docs/filter_chain.md`](docs/filter_chain.md).\n\n### Four-tier ranking\n\n| Tier | Definition | Count |\n|---|---|---:|\n| **T1 GOLD** | All criteria + Kd ≤ 5 µM + low/low risk | **0** (empty by design - honest finding, not overclaimed) |\n| **T2 SILVER** | All criteria + Kd ≤ 10 µM + soluble | 16 |\n| **T3 BRONZE** | All criteria + Kd ≤ 50 µM, borderline solubility | 89 |\n| **T4 RELAXED** | All criteria + Kd ≤ 100 µM | 32 |\n\nSee [`docs/tier_definitions.md`](docs/tier_definitions.md).\n\n---\n\n## Cross-validation\n\n- **Two independent Boltz-2 runs** on separate compute backends:\n  4 / 4 picks agree within 1.34×\n- **10-seed GNINA pose-stability** at site F across all 570 compounds:\n  identifies pose-stable picks (σ \u0026lt; 0.05)\n- **Rowan ADMET** (49 properties × 4 picks): all 4 ADMET-profiled\n- **Rowan pose-analysis MD** (explicit-solvent, 5 ns × 1 traj +\n  1 ns equil): protein-ligand RMSD trajectories captured per pick\n- **muni.bio `onepot` tool**: all 4 picks at similarity = 1.000 with\n  price + chemistry_risk + supplier_risk attached\n\nEvery pick is supported by multiple independent lines of evidence.\n\n---\n\n## Quickstart - reproduce the top 4 from a fresh clone\n\nThe recommended path is **container-first** - one CUDA-enabled image\nshipped from GHCR carries every binary and Python dep, including\nGNINA (which is otherwise glibc-fragile on HPC). A native conda\npath is also available - see [`setup/README.md`](setup/README.md).\n\n```bash\n# 1. Clone\ngit clone https://github.com/anandsahuofficial/tbxt-hit-id\ncd tbxt-hit-id\n\n# 2. Pull the all-batteries-included container (~6-12 GB; one-time)\nbash setup/pull_container.sh\n# → ./tbxt-hit-id.sif\n\n# 3. 
Fetch the receptor + bulk data assets\nbash setup/fetch_receptor.sh           # PDB 6F59:A from RCSB\nbash setup/fetch_data.sh               # candidate pool + Naar SPR Kd + receptor ensemble\nbash setup/fetch_data.sh --include-poses   # OPTIONAL: pre-computed scores (~600 MB, enables --demo)\n\n# 4. Confirm setup is ready (recommended before a multi-hour pipeline run)\napptainer exec --bind $PWD tbxt-hit-id.sif bash setup/smoke_test.sh\n\n# 5. Run the pipeline end-to-end (the container has every binary + Python dep baked in)\napptainer exec --nv --bind $PWD tbxt-hit-id.sif bash examples/reproduce_top4.sh\n# or:\napptainer exec --bind $PWD tbxt-hit-id.sif bash examples/reproduce_top4.sh --demo\n\n# 6. Inspect results\ncat results/top4.csv\n```\n\nTotal runtime - full mode: **~6 hours** on a single RTX 4090 (24 GB)\nfor the 570-compound pool; ~12 hours for the full HPC variant matrix.\nDemo mode: **\u003c 2 minutes**, no GPU. See\n[`setup/README.md`](setup/README.md) for HPC notes, the native conda\npath, and the container-internals reference.\n\n\u003e **Data availability.** The bulk data bundle (570-compound pool,\n\u003e Naar SPR Kd training set, 6-conformer crystallographic ensemble, and optional\n\u003e pre-computed scoring outputs) is hosted as a separate Hugging Face\n\u003e dataset and downloaded by `setup/fetch_data.sh`. If the dataset is\n\u003e not yet published, `fetch_data.sh` will print a clear error\n\u003e message; the curated **post-pipeline outputs** in\n\u003e [`results/`](results/) are committed directly to this repo and can\n\u003e be inspected without any data fetch.\n\n---\n\n## Reuse - adapt for your own target\n\nThis pipeline is **target-agnostic**. To screen against a different\nprotein:\n\n1. Replace the receptor and pocket centroid in `src/pipeline/define_pockets.py`\n2. Drop your candidate compound pool into `data/pool.csv` (SMILES + ID columns)\n3. (Optional) Re-train the QSAR model on your target's measured-affinity data\n4. 
Re-run `examples/reproduce_top4.sh`\n\nThe 6-signal architecture, 7-criterion filter chain, tier ranking,\nand cross-validation harness all generalize.\n\n---\n\n## What's in this repo\n\n```\ntbxt-hit-id/\n├── README.md                  ← you are here\n├── LICENSE  NOTICE          ← MIT + provenance\n├── AUTHORS.md  CITATION.cff\n├── environment.yml            ← one-command conda env\n│\n├── docs/                      ← deep-dive methodology\n│   ├── methodology.md         ← 6-signal pipeline\n│   ├── filter_chain.md        ← 7-criterion strict gate\n│   ├── tier_definitions.md\n│   └── architecture.png\n│\n├── slides/                    ← judges-facing deck\n│   ├── slides.md  slides.pdf\n│   ├── architecture.png       ← pipeline graphic (in deck)\n│   └── renders/               ← 2D + 3D pose PNGs (4 picks)\n│\n├── results/                   ← curated post-pipeline outputs (committed)\n│   ├── top4.csv  top5to24.csv  all_candidates_tiered.csv\n│   └── selected/              ← cross-val summary, MD attempt log, variants\n│\n├── src/                       ← pipeline source (33 Python modules)\n│   ├── pipeline/              ← vina, gnina, boltz, mmgbsa, qsar, paralog, consensus\n│   ├── filters/               ← 7-criterion strict gate (+ PAINS + onepot membership)\n│   ├── ranking/               ← 4-tier classifier\n│   ├── enumeration/           ← one-pot reaction enumeration\n│   └── viz/                   ← render helpers (2D + 3D pose)\n│\n├── setup/                     ← container + env + cold-start data fetchers\n│   ├── Containerfile          ← CUDA + GNINA + conda env + src baked in (CI builds → GHCR)\n│   ├── pull_container.sh      ← apptainer/singularity pull from ghcr.io\n│   ├── fetch_receptor.sh      ← PDB 6F59:A from RCSB\n│   ├── fetch_data.sh          ← candidate pool + Naar Kd + receptor ensemble (HF)\n│   ├── HPC.md                 ← Singularity GNINA recipe + Boltz cache + SLURM\n│   └── README.md              ← quick-start + 
troubleshooting\n│\n├── .github/workflows/\n│   └── container.yml          ← builds + pushes container to ghcr.io on every push to main\n│\n├── tools/\n│   └── render_slides.py       ← Markdown → HTML → Chromium PDF renderer\n│\n└── examples/\n    └── reproduce_top4.sh      ← one-command end-to-end (--full or --demo)\n```\n\n---\n\n## Citation\n\nIf you use this work, please cite:\n\n```bibtex\n@software{sahu2026tbxt,\n  author       = {Sahu, Anand and contributors},\n  title        = {tbxt-hit-id: Multi-signal computational pipeline\n                  for TBXT (Brachyury) hit identification},\n  year         = {2026},\n  url          = {https://github.com/anandsahuofficial/tbxt-hit-id},\n  note         = {TBXT Hit Identification Hackathon 2026, Boston}\n}\n```\n\nA machine-readable [`CITATION.cff`](CITATION.cff) is included.\n\n---\n\n## Honest expectations (calibration)\n\n- Public computational methods over-predict affinity (that is, they\n  predict Kd too low) by **6–25×** in the µM regime. Realistic SPR for\n  these 4 picks: **18–200 µM range**.\n- The pipeline is designed for **hit identification**, not lead\n  optimization - none of the picks are expected to be sub-µM at SPR\n  without follow-up SAR.\n- The **T1 GOLD tier is empty by design** - no compound in the pool\n  simultaneously hits Kd ≤ 5 µM AND low/low chemistry/supplier risk.\n  This is surfaced honestly rather than gamed by relaxing the gate.\n\n---\n\n## Credits\n\n- **Team Lead:** [Anand Sahu](https://github.com/anandsahuofficial) -\n  pipeline architecture, methodology, multi-signal consensus\n  integration, final pick selection, live demo\n- **Team contributors:** see [`AUTHORS.md`](AUTHORS.md) for the full\n  list of simulation, data-generation, and analysis contributors\n\n### Platform \u0026 venue partners\n\n- [muni.bio](https://muni.bio) - `onepot` catalog membership tool, CLI,\n  and platform credits\n- [Rowan](https://labs.rowansci.com) - ADMET, docking, and\n  pose-analysis MD platform\n- [onepot.ai](https://www.onepot.ai) - 
virtual catalog and one-pot\n  synthesis library\n- [Pillar VC](https://www.pillar.vc/) - Boston event venue\n- [TBXT Hackathon](https://tbxtchallenge.org) - organizers, mentors,\n  and TEPs\n\n### Datasets\n\n- **Naar SPR Kd dataset** (653 measured TBXT-binding affinities) -\n  TBXT Hackathon 2026\n- **PDB 6F59** chain A - TBXT G177D + DNA construct (matches the\n  hackathon CF Labs SPR assay)\n- **muni.bio onepot.ai catalog** - ~1.4M one-pot-accessible virtual\n  compounds\n\n### AI implementation assistance\n\nSubstantial portions of code, scripts, and prose were drafted with\nAI assistance (Claude) under the team lead's direction. All\nscientific decisions, parameter choices, and final outputs were\nreviewed and accepted by the team lead, who is responsible for\nall outcomes of this work.\n\n---\n\n## License\n\nMIT - see [`LICENSE`](LICENSE). Use it, adapt it, share it.\nAttribution required; see [`NOTICE`](NOTICE) for the authorship +\nprovenance record and [`CITATION.cff`](CITATION.cff) for the\npreferred citation.\n\n---\n\n## Keywords\n\n`chordoma` · `TBXT` · `Brachyury` · `T-box transcription factor` ·\n`G177D` · `rs2305089` · `drug discovery` · `virtual screening` ·\n`hit identification` · `multi-signal consensus` · `Vina` · `GNINA` ·\n`Boltz-2` · `co-folding` · `MMGBSA` · `QSAR` · `ADMET` ·\n`MD pose analysis` · `paralog selectivity` · `onepot.ai` ·\n`muni.bio` · `Rowan` · `Pillar VC` · `TBXT Hackathon 2026`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanandsahuofficial%2Ftbxt-hit-id","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanandsahuofficial%2Ftbxt-hit-id","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanandsahuofficial%2Ftbxt-hit-id/lists"}