{"id":46255241,"url":"https://github.com/elkins/synth-pdb","last_synced_at":"2026-06-03T01:01:37.850Z","repository":{"id":333215726,"uuid":"1136571644","full_name":"elkins/synth-pdb","owner":"elkins","description":"Generate realistic PDB files with mixed secondary structures for testing, education and bioinformatics tool development","archived":false,"fork":false,"pushed_at":"2026-05-27T01:22:45.000Z","size":271971,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-27T03:13:55.735Z","etag":null,"topics":["amino-acid-sequence","bioinformatics","biophysics","computational-structural-biology","molecular-modeling","nmr-spectroscopy","nmr-tools","peptide","peptide-sequences","protein","protein-data-bank","protein-structure","ramachandran","science-education","scientific-computing","secondary-structure","simulation","structural-bioinformatics","structural-biology"],"latest_commit_sha":null,"homepage":"https://elkins.github.io/synth-pdb/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elkins.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":".zenodo.json","notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-17T23:37:38.000Z","updated_at":"2026-05-27T01:39:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/elkins/synth-pdb","commit_stats":null,"previous_names":["elkins/stupid-pdb","elkins/sy
nth-pdb"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/elkins/synth-pdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elkins%2Fsynth-pdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elkins%2Fsynth-pdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elkins%2Fsynth-pdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elkins%2Fsynth-pdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elkins","download_url":"https://codeload.github.com/elkins/synth-pdb/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elkins%2Fsynth-pdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33843611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amino-acid-sequence","bioinformatics","biophysics","computational-structural-biology","molecular-modeling","nmr-spectroscopy","nmr-tools","peptide","peptide-sequences","protein","protein-data-bank","protein-structure","ramachandran","science-education","scientific-computing","secondary-structure","simulation","structural-bioinformatics","structural-biology"],"created_at":"2026
-03-03T23:12:27.138Z","updated_at":"2026-06-03T01:01:37.837Z","avatar_url":"https://github.com/elkins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# synth-pdb\n\nA command-line tool to generate Protein Data Bank (PDB) files with full atomic representation for testing, benchmarking and educational purposes.\n\n[![PyPI version](https://img.shields.io/badge/pypi-v1.38.0-blue)](https://pypi.org/project/synth-pdb/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18357242.svg)](https://doi.org/10.5281/zenodo.18357242)\n[![Tests](https://github.com/elkins/synth-pdb/actions/workflows/test.yml/badge.svg)](https://github.com/elkins/synth-pdb/actions/workflows/test.yml)\n[![codecov](https://codecov.io/gh/elkins/synth-pdb/branch/master/graph/badge.svg)](https://codecov.io/gh/elkins/synth-pdb)\n[![Documentation](https://img.shields.io/badge/docs-live-brightgreen)](https://elkins.github.io/synth-pdb/)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n[![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)\n\n📚 **[Read the full documentation](https://elkins.github.io/synth-pdb/)** | [Getting Started](https://elkins.github.io/synth-pdb/getting-started/quickstart/) | [API Reference](https://elkins.github.io/synth-pdb/api/overview/) | [Tutorials](examples/interactive_tutorials/gfp_molecular_forge.ipynb)\n\n## 📚 Interactive Tutorials\n\n### Prerequisites\n- **Python 3.10+** and basic Python knowledge\n- **Google Colab** account (free) or local Jupyter environment\n- Specific tutorials may require domain knowledge (noted in difficulty levels)\n\n### Tutorial Catalog\n\n| 
Tutorial | Difficulty | Time | Action |\n| :--- | :---: | :---: | :--- |\n| [**🔬 Cryo-EM \u0026 SAXS Lab**](examples/interactive_tutorials/cryo_em_saxs_lab.ipynb) | ⭐ Beginner | 20 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/cryo_em_saxs_lab.ipynb) |\n| [**🧪 The Virtual CD Lab**](examples/interactive_tutorials/virtual_cd_lab.ipynb) | ⭐ Beginner | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/virtual_cd_lab.ipynb) |\n| [**🤖 AI Protein Data Factory**](examples/ml_integration/ml_handover_demo.ipynb) | ⭐ Beginner | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/ml_handover_demo.ipynb) |\n| [**🏭 Bulk Dataset Factory**](examples/ml_integration/dataset_factory.ipynb) | ⭐ Beginner | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/dataset_factory.ipynb) |\n| [**🔗 Framework Handover**](examples/ml_loading/) | ⭐ Beginner | 10 min | [View JAX/PyTorch/MLX Examples](https://github.com/elkins/synth-pdb/tree/master/examples/ml_loading) |\n| [**🧪 BMRB Validation Pipeline**](examples/interactive_tutorials/bmrb_validation.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/bmrb_validation.ipynb) |\n| [**⭕ Macrocycle Design Lab**](examples/ml_integration/macrocycle_lab.ipynb) | ⭐⭐ Intermediate | 20 min | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/macrocycle_lab.ipynb) |\n| [**🪞 The Mirror World Lab**](examples/interactive_tutorials/mirror_world_lab.ipynb) | ⭐⭐ Intermediate | 20 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/mirror_world_lab.ipynb) |\n| [**💊 Bio-Active Hormone Lab**](examples/ml_integration/hormone_lab.ipynb) | ⭐⭐ Intermediate | 20 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/hormone_lab.ipynb) |\n| [**🔍 Protein Quality Assessment**](examples/interactive_tutorials/protein_quality_assessment.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/protein_quality_assessment.ipynb) |\n| [**🧠 GNN pLDDT Explorer**](examples/interactive_tutorials/gnn_plddt_explorer.ipynb) | ⭐⭐ Intermediate | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/gnn_plddt_explorer.ipynb) |\n| [**🔬 The Virtual NMR Spectrometer**](examples/interactive_tutorials/virtual_nmr_spectrometer.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/virtual_nmr_spectrometer.ipynb) |\n| [**🧲 RDC Alignment Tensor Explorer**](examples/interactive_tutorials/rdc_alignment_explorer.ipynb) | ⭐⭐ Intermediate | 30 min | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/rdc_alignment_explorer.ipynb) |\n| [**📊 RPF Score Validation**](examples/interactive_tutorials/nmr_validation_rpf.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/nmr_validation_rpf.ipynb) |\n| [**🛢️ The Oil Drop Model: Hydrophobic Burial**](examples/interactive_tutorials/sasa_hydrophobic_burial.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/sasa_hydrophobic_burial.ipynb) |\n| [**📡 Neural NMR Pipeline**](examples/ml_integration/neural_nmr_pipeline.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/neural_nmr_pipeline.ipynb) |\n| [**🔗 The NeRF Geometry Lab**](examples/interactive_tutorials/nerf_geometry_lab.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/nerf_geometry_lab.ipynb) |\n| [**📦 Modern Formats Lab**](examples/interactive_tutorials/modern_formats_lab.ipynb) | ⭐⭐ Intermediate | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/modern_formats_lab.ipynb) |\n| [**📏 Geometry Tools Lab**](examples/interactive_tutorials/geometry_tools_reference.ipynb) | ⭐⭐ Intermediate | 20 min | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/geometry_tools_reference.ipynb) |\n| [**🧪 The GFP Molecular Forge**](examples/interactive_tutorials/gfp_molecular_forge.ipynb) | ⭐⭐ Intermediate | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/gfp_molecular_forge.ipynb) |\n| [**⚙️ The Molecular Machine Lab**](examples/interactive_tutorials/molecular_machine_lab.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/molecular_machine_lab.ipynb) |\n| [**🧠 The Prion Chameleon Lab**](examples/interactive_tutorials/prion_chameleon_lab.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/prion_chameleon_lab.ipynb) |\n| [**🕸️ The NOE Network Explorer**](examples/interactive_tutorials/noe_network_explorer.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/noe_network_explorer.ipynb) |\n| [**📡 NMR Relaxation Fingerprint**](examples/interactive_tutorials/nmr_relaxation_fingerprint.ipynb) | ⭐⭐ Intermediate | 25 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/nmr_relaxation_fingerprint.ipynb) |\n| [**🔭 The SAXS Shape Decoder**](examples/interactive_tutorials/saxs_shape_decoder.ipynb) | ⭐⭐ Intermediate | 25 min | 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/saxs_shape_decoder.ipynb) |\n| [**🔬 The HS-AFM Lab**](examples/interactive_tutorials/hs_afm_lab.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/hs_afm_lab.ipynb) |\n| [**🎭 Protein Dynamics Theater**](examples/interactive_tutorials/protein_dynamics_theater.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/protein_dynamics_theater.ipynb) |\n| [**🧬 PLM Embeddings (ESM-2)**](examples/ml_integration/plm_embeddings.ipynb) | ⭐⭐ Intermediate | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/plm_embeddings.ipynb) |\n| [**📊 Ubiquitin Validation Suite**](examples/interactive_tutorials/ubiquitin_chemical_shift_validation.ipynb) | ⭐⭐⭐ Advanced | 45 min | [CS](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/ubiquitin_chemical_shift_validation.ipynb) / [J-Coupling](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/ubiquitin_j_coupling_validation.ipynb) / [RDC](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/ubiquitin_rdc_validation.ipynb) |\n| [**📐 6D Orientogram Lab**](examples/ml_integration/orientogram_lab.ipynb) | ⭐⭐⭐ Advanced | 30 min | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/orientogram_lab.ipynb) |\n| [**🎯 The Hard Decoy Challenge**](examples/ml_integration/hard_decoy_challenge.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/hard_decoy_challenge.ipynb) |\n| [**🔬 Structure Defensibility Dashboard**](examples/interactive_tutorials/structure_defensibility_dashboard.ipynb) | ⭐⭐⭐ Advanced | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/structure_defensibility_dashboard.ipynb) |\n| [**🧬 Co-evolution Factory**](examples/interactive_tutorials/coevolution_msa_factory.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/coevolution_msa_factory.ipynb) |\n| [**🗺️ Contact Map Fingerprinting**](examples/ml_integration/contact_map_fingerprinting.ipynb) | ⭐⭐⭐ Advanced | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/contact_map_fingerprinting.ipynb) |\n| [**🧬 Co-evolutionary Fitness Landscape**](examples/ml_integration/fitness_landscape_explorer.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/fitness_landscape_explorer.ipynb) |\n| [**💊 Drug Discovery Pipeline**](examples/ml_integration/drug_discovery_pipeline.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/drug_discovery_pipeline.ipynb) |\n| [**🌌 AI Latent Space Explorer**](examples/interactive_tutorials/latent_space_explorer.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/latent_space_explorer.ipynb) |\n| [**🏔️ The Live Folding Landscape**](examples/interactive_tutorials/folding_landscape.ipynb) | ⭐⭐⭐ Advanced | 40 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/folding_landscape.ipynb) |\n| [**☁️ IDP Conformational Ensembles**](examples/interactive_tutorials/idp_ensemble_validation.ipynb) | ⭐⭐⭐ Advanced | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/idp_ensemble_validation.ipynb) |\n| [**🤖 AlphaFold pLDDT vs NMR S²**](examples/interactive_tutorials/alphafold_vs_nmr_dynamics.ipynb) | ⭐⭐⭐ Advanced | 35 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/interactive_tutorials/alphafold_vs_nmr_dynamics.ipynb) |\n\n### 🎓 Learning Paths\n\nChoose a path based on your background and goals:\n\n#### 🤖 **For ML Engineers**\n*Build AI models with synthetic protein data*\n\n1. **🤖 AI Protein Data Factory** (15 min) - Learn zero-copy data handover to PyTorch/JAX\n2. **🏭 Bulk Dataset Factory** (15 min) - Generate thousands of training samples\n3. **🔗 Framework Handover** (10 min) - Integrate with your ML framework\n4. 
**🎯 Hard Decoy Challenge** (35 min) - Create negative samples for robust training\n5. **🧬 PLM Embeddings (ESM-2)** (30 min) - Add evolutionary context as per-residue node features\n6. **📐 6D Orientogram Lab** (30 min) - Work with rotation-invariant representations\n7. **🧬 Co-evolution Factory** (35 min) - Simulate sequence evolution kernels\n8. **🧠 The Prion Chameleon Lab** (25 min) - Generate high-quality misfolded decoys for robust structural scoring models\n\n#### 🔬 **For Biophysicists**\n*Understand structure, dynamics, and spectroscopy*\n\n1. **🔗 NeRF Geometry Lab** (25 min) - Learn internal coordinate systems\n2. **📏 Geometry Tools Reference** (20 min) - Kabsch, RMSD, and specialized geometry primitives\n3. **🧪 Virtual CD Lab** (15 min) - Learn how secondary structure encodes Far-UV spectral signatures\n4. **🔬 Virtual NMR Spectrometer** (25 min) - Predict relaxation rates and chemical shifts\n5. **🧲 RDC Alignment Tensor Explorer** (30 min) - Visualize the alignment tensor and RDC physics interactively\n6. **🕸️ NOE Network Explorer** (25 min) - Visualize the distance-restraint web that defines protein structure, rendered as a glowing 3D cylinder network\n7. **📡 NMR Relaxation Fingerprint** (25 min) - Read protein motion from R₁/R₂/hetNOE profiles; compare 600 vs 900 MHz field dependence\n8. **🔭 SAXS Shape Decoder** (25 min) - Decode protein architecture from Guinier, Kratky, and P(r) plots; distinguish folded from disordered\n9. **🔬 The HS-AFM Lab** (35 min) - Generate synthetic high-speed AFM images and movies; explore tip-dilation and scanning-lag artifacts\n10. **🎭 Protein Dynamics Theater** (35 min) - Compute normal modes, animate the global breathing motion, and compare NMA vs Langevin RMSF\n11. **🔍 Protein Quality Assessment** (25 min) - Validate structure quality and geometry\n12. 
**🧠 GNN pLDDT Explorer** (30 min) - Score structures with a Graph Neural Network; interpret per-residue pLDDT confidence using AlphaFold's colour scheme; compute TM-score, lDDT, and GDT-TS metrics\n13. **🧪 GFP Molecular Forge** (30 min) - Explore chromophore chemistry\n14. **⚙️ The Molecular Machine Lab** (25 min) - Simulate hinge motions and dynamic CD/NMR observables\n15. **🧠 The Prion Chameleon Lab** (25 min) - Model alpha-to-beta transitions and infectious folding decoys\n16. **🏔️ Live Folding Landscape** (40 min) - Visualize energy surfaces and Ramachandran space\n17. **📡 Neural NMR Pipeline** (25 min) - Connect structure to NMR observables\n18. **🧬 PLM Embeddings (ESM-2)** (30 min) - See how sequence encodes secondary structure context\n19. **☁️ IDP Conformational Ensembles** (30 min) - Validate unstructured physical domains\n20. **🤖 AlphaFold pLDDT vs NMR S²** (35 min) - Contrast AI rigidity with physical 15N flexibility\n21. **🔬 Cryo-EM \u0026 SAXS Lab** (20 min) - Simulate 3D density maps and 1D scattering\n22. **🧪 BMRB Validation Pipeline** (25 min) - Programmatic NMR validation\n\n#### 💊 **For Drug Designers**\n*Design and optimize therapeutic peptides*\n\n1. **💊 Drug Discovery Pipeline** (35 min) - End-to-end peptide library to lead selection\n2. **⭕ Macrocycle Design Lab** (20 min) - Create head-to-tail cyclic peptides\n3. **💊 Bio-Active Hormone Lab** (20 min) - Model bioactive peptide hormones\n4. **🪞 The Mirror World Lab** (20 min) - Design protease-resistant D-amino acid peptides\n5. **🎯 Hard Decoy Challenge** (35 min) - Generate decoys for docking validation\n6. **🌌 AI Latent Space Explorer** (35 min) - Navigate chemical space with ML\n7. **🔬 Virtual NMR Spectrometer** (25 min) - Predict experimental observables\n8. 
**🔬 Cryo-EM \u0026 SAXS Lab** (20 min) - Multi-modal verification of peptide folds\n\n\n## Table of Contents\n- [Features](#features)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Usage](#usage)\n  - [Command-Line Arguments](#command-line-arguments)\n  - [Examples](#examples)\n  - [ML Integration (AI Research)](#ml-integration-ai-research)\n- [Validation \u0026 Refinement](#validation--refinement)\n- [Output PDB Format](#output-pdb-format)\n- [Scientific Context](#scientific-context)\n- [Limitations](#limitations)\n- [Development](#development)\n- [Glossary of Scientific Terms \u0026 Acronyms](#glossary-of-scientific-terms--acronyms)\n- [License](#license)\n\n---\n\n## 🔬 Experimental Incubator\n\nThe [`/incubator`](./incubator/) directory is our frontier for \"What If?\" scenarios and advanced structural biology research. This space is dedicated to developing use cases that push `synth-pdb` beyond traditional experimental boundaries:\n\n- **Cryo-EM \"Standard Candle\"**: Generating atomic-resolution density maps for software benchmarking.\n- **IDP Ensemble-First Validation**: Automated pipelines for modeling Intrinsically Disordered Proteins.\n- **Mapping the \"Dark Proteome\"**: Creating hard decoys for unverified AI-predicted structures.\n- **De Novo Miniprotein Forge**: Rapid prototyping for synthetic biology designs.\n\nCheck out the [Incubator README](./incubator/README.md) for the full roadmap of these experimental explorations.\n\n---\n\n## Features\n\n✨ **Structure Generation**\n- Full atomic representation with backbone and side-chain heavy atoms + hydrogens\n- Customizable sequence (1-letter or 3-letter amino acid codes)\n- Random sequence generation with uniform or biologically plausible frequencies\n- **Conformational diversity**: Generate alpha helices, beta sheets, extended chains, or random conformations\n- **Prompt-to-Protein Interface**: Use natural language to describe structures via `--prompt`. 
Supports interactive input and piping for complex requirements.\n- **Backbone-Dependent Rotamers**: Side-chain conformations are selected based on local secondary structure (Helix/Sheet) to minimize steric clashes (Dunbrack library).\n- **Bulk Dataset Generation**: Generate thousands of (Structure, Sequence, Contact Map) triplets for AI training via `--mode dataset`.\n- **Metal Ion Coordination**: Automatic detection and structural injection of cofactors like **Zinc (Zn2+)** with physics-aware harmonic constraints.\n- **Disulfide Bonds**: Automatic detection and annotation of **SSBOND** records for Cysteine pairs.\n- **Salt Bridge Stabilization**: Automatic detection of ionic interactions with harmonic restraints in OpenMM.\n- **Advanced Chemical Shifts**: SPARTA-lite prediction + **Ring Current Effects** (shielding/deshielding from aromatic rings).\n- **Relaxation Rates**: Lipari-Szabo Model-Free formalism with **SASA-modulated Order Parameters** ($S^2$), allowing \"buried\" residues to be more rigid than \"exposed\" ones.\n- **Biophysical Realism**:\n    - **Backbone-Dependent Rotamers**: Chi angles depend on secondary structure.\n    - **Pre-Proline Bias**: Residues preceding Proline automatically adopt restricted conformations (extended/beta).\n    - **Cis-Proline Isomerization**: X-Pro bonds can adopt cis conformations (~5% probability).\n    - **Post-Translational Modifications**: Support for Phosphorylation (SEP, TPO, PTR) with valid physics parameters.\n- **Cyclic Peptides (Macrocycles)**: Support for **Head-to-Tail cyclization**. Closes the peptide bond between N- and C-termini using physics-based minimization.\n- **NMR Functionality**: As of v1.16.0, all NMR-related features (chemical shifts, relaxation, NOEs, J-couplings) have been refactored into the separate [`synth-nmr`](https://pypi.org/project/synth-nmr/) Python package.  
This allows for independent use and development of NMR tools.\n- **Residual Dipolar Couplings (RDCs)**: `synth_pdb.rdc` computes backbone N–H RDCs using the Saupe-matrix formalism given an alignment tensor (`Da`, `R`). Q-factor validation is demonstrated against published ubiquitin (1D3Z) data. Interactive alignment-tensor exploration is available in the `rdc_alignment_explorer.ipynb` tutorial.\n- **NMR Ensemble Analysis** (`synth_pdb.ensemble`): Comprehensive tools for evaluating NMR structure bundles:\n    - **`DAOPCalculator`**: Dihedral Angle Order Parameter (Hyberts et al. 1992) for quantifying backbone consistency across an ensemble; includes `find_well_defined_residues` (PDBStat S(φ)+S(ψ) ≥ 1.8 convention).\n    - **`EnsembleStatistics`**: Typed dataclass reporting pairwise RMSD, RMSF, medoid, well-defined residues, and overall quality (Tejero et al. 2013 thresholds).\n- **MSA Co-Evolution** (`synth_pdb.msa`): Generates deep multiple sequence alignments by simulating MCMC evolution on a 3D structural Potts Model — enabling zero-shot generation of DCA/AlphaFold-ready MSAs.\n    - Metropolis-Hastings sampling with O(1) Δ-Energy evaluation (~500× speedup).\n    - \"Magic Step\" coupled mutations for contacting residues (20% proposal rate).\n    - SASA selective pressure enforcing hydrophobic core isolation.\n    - Electrostatic salt-bridge rewards and charge-repulsion penalties in J_ij couplings.\n- **Protein Language Model Embeddings** (`synth_pdb.quality.plm`): ESM-2 per-residue and pooled embeddings for zero-shot quality scoring and downstream ML tasks. Install with `pip install synth-pdb[plm]`.\n- **GNN Quality Scorer** (`synth_pdb.quality.gnn`): Graph Neural Network model for structure quality assessment where nodes represent residues and edges encode sequence proximity and spatial contacts. 
Install with `pip install synth-pdb[gnn]`.\n\n🚀 **High Performance Physics**\n- **Hardware Acceleration**: Automatically detects and uses **GPU acceleration** (CUDA, OpenCL/Metal) if available.\n    - **Apple Silicon Support**: Fully supported on M1/M2/M3/M4 chips via OpenCL driver (5x speedup over CPU).\n- **Vectorized Geometry**: Construction kernels are optimized with NumPy vectorization for fast validation.\n- **Tunable Minimization**: Control `tolerance` and `max_iterations` to balance speed/quality for bulk datasets.\n\n🔬 **Validation Suite**\n- Bond length validation\n- Bond angle validation (**Engh \u0026 Huber Z-scores**: geometry validated against the landmark 1991 standard deviations)\n- Ramachandran angle checking — upgraded to **Top2018** high-resolution dataset (~15,000 chains)\n- Side-Chain Rotamer validation (Chi1/Chi2 angles checked against backbone-dependent Dunbrack library)\n- Steric clash detection (minimum distance + van der Waals overlap)\n- Peptide plane planarity (omega angle)\n- Sequence improbability detection (charge clusters, hydrophobic stretches, etc.)\n- **SASA-based Burial Validation**: Shrake-Rupley algorithm (via biotite) confirming hydrophobic core formation (Kauzmann 1959)\n- **`get_quality_report()`**: Multi-layered structural plausibility report covering Geometry, Physics, and Biophysics layers with peer-reviewed thresholds\n\n⚙️ **Quality Control**\n- `--best-of-N`: Generate multiple structures and select the one with fewest violations\n- `--guarantee-valid`: Iteratively generate until a violation-free structure is found\n- `--refine-clashes`: Iteratively adjust atoms to reduce steric clashes\n- `--quality-filter`: Use Random Forest-based Structure Quality Filter to validate structure geometry\n- `--quality-score-cutoff`: Set minimum confidence score for quality filter (0.0-1.0)\n\n📝 **Reproducibility**\n- Command-line parameters stored in PDB header (REMARK 3 records)\n- Timestamps in generated filenames and headers\n\n## 📚 
Understanding PDB Output - Educational Guide\n\n### Biophysical Realism\n\n**synth-pdb** generates structures with realistic properties that mimic real experimental data:\n\n#### 🌡️ B-factors (Temperature Factors)\n**What**: Measure atomic mobility/flexibility (columns 61-66)\n**Formula**: B = 8π²⟨u²⟩ (mean square displacement)\n**Range**: 5-60 Ų\n**Pattern**: Backbone (15-25) \u003c Side chains (20-35) \u003c Termini (30-50)\n\n#### 📊 Occupancy Values\n**What**: Fraction of molecules with atom at position (columns 55-60)\n**Range**: 0.85-1.00\n**Correlation**: High B-factor ↔ Low occupancy\n**Pattern**: Backbone (0.95-1.00) \u003e Side chains (0.85-0.95)\n\n#### 🔄 Backbone-Dependent Rotamer Libraries\n**Definition**: A **Rotamer** (Rotational Isomer) is a low-energy, stable conformation of an amino acid side chain defined by specific values of its side-chain dihedral angles ($\\chi_1, \\chi_2...$). Side chains are not flopping randomly; they snap into these discrete \"preset\" shapes.\n\n**The \"Backbone-Dependent\" Twist**:\nThe preferred shape of a side chain strongly depends on the shape of the backbone behind it (Alpha Helix vs Beta Sheet).\n*   **Helix ($\\alpha$)**: Side chains pack tightly. Bulky rotamers (like 'trans' chi1 for Val/Ile) often crash into the backbone (steric clash).\n*   **Sheet ($\\beta$)**: The backbone is extended, creating more room for different rotamers.\n\n**Implementation**: Synth-PDB uses a simplified version of the **Dunbrack Library**. It intelligently checks the backbone geometry ($\\phi, \\psi$) before picking a side chain shape, ensuring biophysical realism.\n\n#### ⭕ Macrocyclization (Cyclic Peptides)\n**What**: Creating a covalent bond between the N-terminal Amine and the C-terminal Carboxyl group to form a closed ring.\n**Biophysical Magnitude**:\n*   **Conformational Entropy**: Rigidifies the peptide. A linear peptide is a \"floppy\" string; a cyclic peptide is a \"locked\" ring. 
This reduces the entropy loss upon binding to a receptor, significantly increasing affinity.\n*   **Metabolic Stability**: Most degradation in the blood happens via *exopeptidases* (enzymes that clip ends). With no ends to clip, macrocycles are much more stable and long-lived in biological systems.\n*   **Pre-organization**: Cyclic peptides are \"pre-organized\" for their biological function, making them excellent drug scaffolds.\n**Coverage**: Supports **All 20 Standard Amino Acids** (including charged/polar residues).\n\n#### 🧬 D-Amino Acids (Inverted Stereochemistry)\n**What**: Mirror-images of standard L-amino acids.\n**Biophysical Magnitude**:\n*   **Protease Resistance**: Most enzymes that degrade proteins (proteases) are \"evolutionarily locked\" to only recognize L-amino acids. By replacing a single L-amino acid with a D-amino acid, a peptide can become hundreds of times more stable in human blood.\n*   **Bacterial Cell Walls**: Bacteria uniquely use D-amino acids (like D-Ala and D-Glu) in their cross-linked peptidoglycan cell walls. This is why many antibiotics (like Penicillin) target these non-L structures.\n*   **Non-Natural Foldamers**: D-amino acids allow for the creation of \"mirror-image\" helices and unique turns (e.g., Beta-turns involving D-Pro) that are impossible with standard biology.\n**Implementation**: **synth-pdb** mirrors sidechain coordinates across the N-CA-C backbone plane and uses standard PDB 3-letter codes (e.g., `DAL`, `DPH`).\n\n#### 🧬 Secondary Structures\n**What**: Regular backbone patterns (helices, sheets)\n**Control**: Per-region via `--structure` parameter\n**Example**: `--structure \"1-10:alpha,11-15:random,16-25:alpha\"`\n\n#### 🧪 Residue-Specific Ramachandran Validation (MolProbity-Style)\n\u003e [!TIP]\n\u003e **Realism Equals Efficiency**: By using valid backbone angles (Pre-Proline bias) and correct side-chain rotamers, `synth-pdb` structures start much closer to a physical energy minimum. 
Validation experiments show this reduces Energy Minimization time by **\u003e60%** due to fewer initial steric clashes.\n**What**: Realistic backbone geometry validation based on amino acid type using MolProbity/Top8000 data.\n- **Glycine (GLY)**: Correctly allowed in left-handed alpha region (phi \u003e 0).\n- **Proline (PRO)**: Checks against restricted phi angles.\n- **General**: All other residues are checked against standard Favored/Allowed polygons.\n- **Precision**: Uses point-in-polygon algorithms for accurate classification (Favored, Allowed, Outlier).\n\n#### 📐 NeRF Geometry (The Construction Engine)\n**What**: Natural Extension Reference Frame algorithm\n**Term**: Building 3D structures from \"Internal Coordinates\" (Z-Matrix)\n**Mechanism**: Places each atom (N, CA, C, O) relative to the local coordinate system of the three previous atoms.\n**Educational Value**: Teaches how math converts 1D sequences + 2D angles into 3D shapes.\n\n#### ⛓️ Metal Coordination (Cofactors)\n**What**: Structural integration of inorganic ions (e.g. 
Zinc).\n**Motifs**: Detected via ligand clustering (Cys/His sites).\n**Physics**: Applied via Harmonic Constraints in Energy Minimization.\n**Importance**: Models structural stability of Zinc Fingers and enzymatic sites.\n\n#### 🧲 Salt Bridge Stabilization\n**What**: Automatic detection of ionic interactions (e.g., LYS+ and ASP-).\n**Criteria**: Distance-based detection between charged side-chain atoms (cutoff 5.0 Å).\n**Physics**: Stabilized via harmonic restraints during energy minimization.\n**Importance**: Maintains tertiary structure integrity in synthetic protein models.\n\n#### 🔗 Disulfide Bonds (SSBOND)\n**What**: Covalent bonds between Cysteine residues\n**Detection**: Automatic detection of close CYS-CYS pairs (SG-SG distance 2.0-2.2 Å)\n**Output**: SSBOND records added to PDB header\n**Importance**: Annotates stabilizing post-translational modifications\n\n#### ⭕ Cyclic Peptides (Macrocyclization)\n**What**: Binds the N-terminal Nitrogen to the C-terminal Carbon to form a closed ring.\n**Mechanism**: Uses OpenMM's physics engine to regularize the covalent bond and minimize ring strain.\n**Bio-Context**: Many potent drugs (e.g., Cyclosporine) and toxins are cyclic peptides. Cyclization increases metabolic stability and reduces conformational entropy, improving binding affinity.\n\n### Educational Philosophy \u0026 Integrity\n\n`synth-pdb` is built on the principle of **\"Code as Textbook\"**.\n\n*   **Pedagogical Comments**: Key source files (`generator.py`, `test_bfactor.py`) contain detailed block comments explaining the *why* alongside the *how* (e.g., explaining Lipari-Szabo stiffness vs. B-factor flexibility).\n*   **Integrity Safeguards**: We include a specialized test suite (`tests/test_docs_integrity.py`) that strictly enforces the presence of these educational notes. This ensures that future refactoring never accidentally deletes the scientific context.\n*   **Visual Learning**: We believe that seeing is understanding. 
The integrated `--visualize` tool connects biophysical theory (minimized energy, restrained dynamics) to immediate visual feedback, helping visual learners grasp complex 3D relationships.\n*   **Universal Patterns**: The generator is tuned to reproduce universal biophysical phenomena (like terminal fraying and backbone rigidity) rather than just random noise, making it a valid tool for teaching structural biology concepts.\n\n## Installation\n\n### From PyPI (Recommended)\n\nInstall the latest stable release from PyPI:\n\n```bash\npip install synth-pdb\n```\n\nThis installs the `synth-pdb` package and makes the `synth-pdb` command available system-wide.\n\n### From Source (For Development)\n\nInstall directly from the project directory:\n\n```bash\ngit clone https://github.com/elkins/synth-pdb.git\ncd synth-pdb\npip install .\n```\n\n### Requirements\n- Python 3.10+\n- NumPy\n- Biotite (for residue templates and structure manipulation)\n\nDependencies are automatically installed with pip.\n\n## Quick Start\n\nGenerate a simple 10-residue peptide:\n```bash\nsynth-pdb --length 10\n```\n\nGenerate and validate a specific sequence:\n```bash\nsynth-pdb --sequence \"ACDEFGHIKLMNPQRSTVWY\" --validate --output my_peptide.pdb\n```\n\nGenerate with mixed secondary structures and visualize:\n```bash\nsynth-pdb --structure \"1-10:alpha,11-20:beta\" --visualize\n```\n\nGenerate the best of 10 attempts with clash refinement:\n```bash\nsynth-pdb --length 20 --best-of-N 10 --refine-clashes 5 --output refined_peptide.pdb\n```\n\n## 🤖 Feature Spotlight: AI Model Support \u0026 Hard Decoys\n\nGenerating \"good\" structures is only half the battle. To train robust AI models (like AlphaFold-3 or RosettaFold), researchers need **High-Quality Negative Samples**—structures that look physically plausible but are biologically or topologically incorrect.\n\n**Synth-PDB** provides three powerful mechanisms for generating these \"Hard Decoys\":\n\n### 1. 
Sequence Threading (Fold Mismatch)\nForce a specific sequence onto the backbone \"fold\" of a completely different sequence. This creates a realistic-looking structure where the side-chain packing is fundamentally incompatible with the backbone.\n```bash\n# Thread Poly-Ala sequence onto a backbone generated for Poly-Pro\nsynth-pdb --mode decoys --sequence AAAAA --template-sequence PPPPP --hard\n```\n\n### 2. Torsion Angle Drift (Conformational Noise)\nAdd controlled, random noise to ideal Ramachandran angles. This creates \"near-native\" decoys—structures that are *almost* correct but have subtle, realistic errors.\n```bash\n# Add 5 degrees of maximum drift to all phi/psi angles\nsynth-pdb --mode decoys --drift 5.0\n```\n\n### 3. Label Shuffling (Sequence Mismatch)\nGenerate a perfectly valid structure for a sequence, then randomly shuffle the identity of the residues in the final PDB. This tests if an AI model can detect that a residue (e.g., Trp) is in an environment meant for another (e.g., Gly).\n```bash\nsynth-pdb --mode decoys --sequence ACDEF --hard --shuffle-sequence\n```\n\n---\n\n## 🌟 Feature Spotlight: \"Spectroscopically Realistic\" Dynamics\n\nMost synthetic PDB generators create static bricks. They might create reasonable geometry, but the \"B-factor\" column (Column 11) is often just zero or random noise.\n\n**Synth-PDB is different.** It simulates the **physics of protein motion** to generate a unified model of structure AND dynamics.\n\n### The \"Structure-Dynamics Link\"\nWe implement the **Lipari-Szabo Model-Free formalism** (Nobel-adjacent physics) directly into the generator:\n1.  **Structure Awareness**: The engine analyzes the generated geometry (`alpha-helix` vs `random-coil`).\n2.  
**Order Parameter ($S^2$) Prediction**: It assigns specific rigidity values:\n    *   **Helices**: $S^2 \\approx 0.85$ (Rigid H-bond network)\n    *   **Loops**: $S^2 \\approx 0.65$ (Flexible nanosecond motions)\n    *   **Termini**: $S^2 \\approx 0.45$ (Disordered fraying)\n3.  **Unified Output**:\n    *   **PDB B-Factors**: Calculated via $B \\propto (1 - S^2)$. When you visualize the PDB in PyMOL, flexible regions *visually* appear thicker/redder, matching real crystal data distributions.\n    *   **NMR Relaxation**: $R_1, R_2, NOE$ rates are calculated from the *same* parameters.\n\n**Why this matters**:\n\u003e \"The correlation between NMR order parameters ($S^2$) and crystallographic B-factors is a bridge between solution-state and solid-state dynamics.\" — *Fenwick et al., PNAS (2014)*\n\nThis feature allows you to test **bioinformatics pipelines** that rely on correlation between sequence, structure, and experimental observables, without needing expensive Molecular Dynamics (MD) simulations.\n\n### 4. 
Relax (Simulate Dynamics)\nGenerate relaxation rates ($R_1, R_2, NOE$) with **realistic internal dynamics**:\n```bash\npython main.py relax --input output/my_peptide.pdb --output output/relaxation_data.nef --field 600 --tm 10.0\n```\nThis module now implements the **Lipari-Szabo Model-Free** formalism with structure-based Order Parameter ($S^2$) prediction:\n*   **Helices/Sheets**: $S^2 \\approx 0.85$ (Rigid, high $R_1/R_2$)\n*   **Loops/Turns**: $S^2 \\approx 0.65$ (Flexible, lower $R_1/R_2$)\n*   **Termini**: $S^2 \\approx 0.45$ (Highly disordered)\n\nThis creates realistic \"relaxation gradients\" along the sequence, perfect for testing dynamics software.\n\n## 🚀 Quick Visual Demo\n\nWant to see the **Physics + Visualization** capabilities in action?\n\nRun this command to generate a **Leucine Zipper** (classic alpha helix), **minimize** its energy using OpenMM, and immediately **visualize** it in your browser:\n\n```bash\nsynth-pdb --sequence \"LKELEKELEKELEKELEKELEKEL\" --conformation alpha --minimize --visualize\n```\n\nThis effectively demonstrates:\n1.  **Generation**: Creating the alpha-helical backbone.\n2.  **Minimization**: \"Relaxing\" the structure (geometry regularization).\n3.  
**Visualization**: Launching the interactive 3D viewer.\n\n## Usage\n\n### Command-Line Arguments\n\n#### **Structure Definition**\n\n- `--length \u003cLENGTH\u003e`: Number of residues in the peptide chain\n  - Type: Integer\n  - Default: `10`\n  - Example: `--length 50`\n\n- `--sequence \u003cSEQUENCE\u003e`: Specify an exact amino acid sequence\n  - Formats:\n    - 1-letter codes: `\"ACDEFG\"`\n    - 3-letter codes: `\"ALA-CYS-ASP-GLU-PHE-GLY\"`\n  - Overrides `--length`\n  - Example: `--sequence \"MVHLTPEEK\"`\n\n- `--plausible-frequencies`: Use biologically realistic amino acid frequencies for random generation\n  - Based on natural protein composition\n  - Ignored if `--sequence` is provided\n\n- `--conformation \u003cCONFORMATION\u003e`: Secondary structure conformation to generate\n  - Options: `alpha`, `beta`, `ppii`, `extended`, `random`\n  - Default: `alpha` (alpha helix)\n  - Choices:\n    - `alpha`: Alpha helix (φ=-57°, ψ=-47°)\n    - `beta`: Beta sheet (φ=-135°, ψ=135°)\n    - `ppii`: Polyproline II helix (φ=-75°, ψ=145°)\n    - `extended`: Extended/stretched conformation (φ=-120°, ψ=120°)\n    - `random`: Random sampling from allowed Ramachandran regions\n  - Example: `--conformation beta`\n\n#### 🤖 AI \u0026 Machine Learning: Bulk Dataset Generation\n\n`synth-pdb` can serve as a data generator for training Deep Learning models (GNNs, Transformers, Diffusion Models). It can generate large, diverse, labeled datasets.\n\n**Command:**\n```bash\nsynth-pdb --mode dataset --dataset-format npz --num-samples 1000 --output my_training_data\n```\n\n**Features:**\n*   **Formats**:\n    *   `npz`: (Recommended) Compressed NumPy archives. Contains `coords` (L,5,3), `sequence` (One-hot), and `contact_map` (LxL). 
Ideal for PyTorch/TensorFlow dataloaders.\n    *   `pdb`: Writes individual PDB files and CASP contact maps (slower, for legacy tools).\n*   **Multiprocessing**: Automatically uses all available CPU cores.\n*   **Manifest**: Generates a `dataset_manifest.csv` tracking all samples and their metadata (split, length, conformation).\n\n**Output Structure (`--dataset-format npz`)**:\n```\nmy_training_data/\n├── dataset_manifest.csv\n├── train/\n│   ├── synth_000001.npz\n│   ├── synth_000002.npz\n│   ...\n└── test/\n    ├── synth_000801.npz\n    ...\n```\n\n### 🔍 Visualization \u0026 Analysis\n#### **Validation \u0026 Quality Control**\n\n- `--validate`: Run validation checks on the generated structure\n  - Checks: bond lengths, bond angles, Ramachandran, steric clashes, peptide planes, sequence improbabilities\n  - Reports violations to console\n\n- `--guarantee-valid`: Generate structures until one with zero violations is found\n  - Implies `--validate`\n  - Use with `--max-attempts` to limit iterations\n  - Example: `--guarantee-valid --max-attempts 100`\n\n- `--max-attempts \u003cN\u003e`: Maximum generation attempts for `--guarantee-valid`\n  - Default: `100`\n\n- `--best-of-N \u003cN\u003e`: Generate N structures and select the one with fewest violations\n  - Implies `--validate`\n  - Overrides `--guarantee-valid`\n  - Example: `--best-of-N 20`\n\n- `--refine-clashes \u003cITERATIONS\u003e`: Iteratively adjust atoms to reduce steric clashes\n  - Applies after structure selection\n  - Iterates until improvements stop or max iterations reached\n  - Example: `--refine-clashes 10`\n\n#### **Structure Quality Filter (Random Forest)**\n\n\u003e [!NOTE]\n\u003e Despite the flag name history, this feature uses a **classical Random Forest classifier** (scikit-learn), not a neural network or generative AI. 
It scores structures on geometric quality metrics derived from Ramachandran angles, steric clashes, bond lengths, and radius of gyration.\n\n- `--quality-filter`: Enable the **Structure Quality Filter** to screen generated structures.\n  - Using a Random Forest classifier trained on thousands of samples, this filter automatically rejects \"low quality\" structures (clashing, distorted geometry).\n  - It considers Ramachandran angles, steric clashes, bond lengths, and radius of gyration.\n  - Useful for filtering out failed minimization attempts in bulk generation.\n\n- `--quality-score-cutoff \u003cFLOAT\u003e`: Minimum probability score (0.0-1.0) for a structure to be considered \"Good\".\n  - Higher values = stricter filtering (fewer false positives, more false negatives).\n  - Default: `0.5`\n  - Example: `--quality-score-cutoff 0.8` (Only keep highly confident good structures)\n  - Scores below `0.5` are typically rejected as \"Bad\".\n\n#### **Physics \u0026 Advanced Refinement**\n\n- `--minimize`: Run physics-based energy minimization (OpenMM).\n  - Defaults to implicit solvent (OBC2) and AMBER forcefield.\n  - Highly recommended for \"realistic\" geometry.\n  - Example: `--minimize`\n\n- `--solvent \u003cMODEL\u003e`: Specify the solvent model for minimization/equilibration.\n  - Options: `obc2` (default), `obc1`, `gbn`, `gbn2`, `hct`, `explicit`\n  - Example: `--solvent explicit` (simulates a TIP3P water box)\n\n- `--solvent-padding \u003cFLOAT\u003e`: Padding distance (in nm) for the explicit water box.\n  - Default: `1.0`\n  - Example: `--solvent-padding 1.5`\n\n- `--keep-solvent`: Retain the generated water molecules (HOH) in the final PDB file.\n  - Default: False (water is stripped for cleaner outputs)\n\n- `--optimize`: Run Monte Carlo side-chain optimization.\n  - Reduces steric clashes by rotating side chains.\n  - Example: `--optimize`\n\n- `--forcefield \u003cNAME\u003e`: Specify OpenMM forcefield.\n  - Default: `amber14-all.xml`\n  - Example: 
`--forcefield amber14-all.xml`\n\n- `--minimization-k \u003cFLOAT\u003e`: Energy minimization tolerance (kJ/mole/nm).\n  - Higher values = Faster but less precise.\n  - Recommended for bulk generation: `100.0`\n  - Default: `10.0` (High Precision)\n\n- `--minimization-max-iter \u003cINT\u003e`: Max iterations for minimization.\n  - `0` = Unlimited (Convergence based on tolerance)\n  - Recommended for bulk generation: `1000`\n  - Default: `0`\n\n#### **Synthetic NMR Data**\n\n\u003e **📦 NMR Functionality Powered by [`synth-nmr`](https://github.com/elkins/synth-nmr)**\n\u003e As of version 1.17.0, all NMR-related functionality (NOE calculation, relaxation rates, chemical shifts, J-couplings) is provided by the standalone [`synth-nmr`](https://pypi.org/project/synth-nmr/) package. This package can be used independently for NMR data generation in your own projects. The integration is fully backward compatible: all existing code continues to work without changes.\n\n\n- `--gen-nef`: Generate synthetic NOE restraints in NEF format.\n  - Scans the structure for H-H pairs closer than the cutoff.\n  - Outputs `.nef` file.\n  - Note: Requires hydrogens (use with `--minimize` or internal default).\n\n- `--noe-cutoff \u003cDIST\u003e`: Cutoff distance for NOEs in Angstroms.\n  - Default: `5.0`\n  - Example: `--noe-cutoff 6.0`\n\n- `--nef-output \u003cFILE\u003e`: Custom output filename for NEF.\n\n#### **Synthetic Relaxation Data**\n\n- `--gen-relax`: Generate synthetic NMR relaxation data ($R_1, R_2, \\{^1H\\}-^{15}N\\ NOE$) in NEF format.\n  - Calculates Model-Free parameters ($S^2 \\approx 0.85$ for core, $0.5$ for flexible termini).\n  - Outputs `_relax.nef` file.\n  - **Physics Note**: $NOE$ values depend on tumbling time, not just internal flexibility.\n\n- `--field \u003cMHZ\u003e`: Proton Larmor frequency in MHz.\n  - Default: `600.0`\n  - Calculates proper spectral density frequencies for this field.\n\n- `--tumbling-time \u003cNS\u003e`: Global 
rotational correlation time ($\\tau_m$) in nanoseconds.\n  - Default: `10.0`\n  - Controls the overall magnitude of relaxation rates. Larger proteins have larger $\\tau_m$.\n\n#### **Constraints Export**\n\n- `--export-constraints \u003cFILE\u003e`: Export contact map constraints for modeling/folding.\n  - Useful for checking agreement with AlphaFold/CASP predictions.\n  - Outputs a file containing residue-residue contacts.\n  - Example: `--export-constraints constraints.casp`\n\n- `--constraint-format {casp,csv}`: Format for the exported constraints.\n  - `casp`: Critical Assessment of Structure Prediction (RR) format.\n  - `csv`: Comma-separated values (i, j, distance).\n  - Default: `casp`\n\n- `--constraint-cutoff \u003cDIST\u003e`: Distance cutoff for defining binary contacts (Angstroms).\n  - Default: `8.0`\n\n#### **Torsion Angle Export**\n\n- `--export-torsion \u003cFILE\u003e`: Export backbone torsion angles (Phi, Psi, Omega) for every residue.\n  - Useful for training ML models on backbone geometry.\n  - Outputs a CSV or JSON file.\n  - Example: `--export-torsion angles.csv`\n\n- `--torsion-format {csv,json}`: Format for the exported data.\n  - Default: `csv`\n\n#### **Synthetic MSA (Evolution)**\n\n- `--gen-msa`: Generate a Multiple Sequence Alignment (MSA) by simulating neutral drift.\n  - Conserves hydrophobic core residues while mutating surface residues.\n  - Outputs a FASTA file useful for testing co-evolution signals in AI models.\n\n- `--msa-depth \u003cN\u003e`: Number of sequences to generate.\n  - Default: `100`\n\n- `--mutation-rate \u003cRATE\u003e`: Probability of mutation per position per sequence.\n  - Default: `0.1` (10% divergence per sequence).\n\n#### **Distogram Export (Spatial Relationships)**\n- `--export-distogram \u003cFILE\u003e`: Export NxN Distance Matrix representing the protein geometry.\n  - Rotation-invariant representation ideal for AI model training/validation.\n  - Supports `json`, `csv`, or `npz` (NumPy) formats.\n  
- Example: `--export-distogram dist.json`\n\n- `--distogram-format {json,csv,npz}`: Output format.\n  - Default: `json`\n\n#### **Biophysical Realism (Physics)**\n- `--ph \u003cVAL\u003e`: Set pH for titration (default 7.4).\n  - Automatically adjusts Histidine protonation (`HIS` $\\rightarrow$ `HIP` if pH \u003c 6.0).\n  - Critical for realistic electrostatics and NMR chemical shifts.\n\n- `--cap-termini`: Add terminal blocking groups.\n  - N-terminus: Acetyl (`ACE`)\n  - C-terminus: N-methylamide (`NME`)\n  - Removes charged termini ($\\text{NH}_3^+$/$\\text{COO}^-$) for realistic peptide modeling.\n\n- `--cyclic`: Generate a **Head-to-Tail cyclic peptide**.\n  - Connects the N-terminus and C-terminus with a covalent peptide bond.\n  - **Requirement**: Automatically implies `--minimize` to ensure proper closure.\n  - **Incompatibility**: Disables `--cap-termini`.\n\n- `--equilibrate`: Run Molecular Dynamics (MD) equilibration.\n  - Simulates the protein at **300 Kelvin** (solution state).\n  - Uses Langevin Dynamics to shake atoms out of local minima.\n  - Generates a \"thermalized\" structure closer to NMR conditions.\n  - Options: `--md-steps \u003cINT\u003e` (default 1000, $\\approx$ 2 ps).\n\n- `--metal-ions {auto,none}`: Control metal ion coordination.\n  - `auto` (default): Scans for binding sites and injects ions.\n  - `none`: Disables automatic coordination.\n\n- `--phosphorylation-rate \u003cFLOAT\u003e`: Probability of phosphorylating S/T/Y residues.\n  - Value between 0.0 and 1.0.\n  - Converts SER-\u003eSEP, THR-\u003eTPO, TYR-\u003ePTR.\n  - Mimics kinase activity for regulatory simulation.\n  - Example: `--phosphorylation-rate 0.5`\n\n- `--cis-proline-frequency \u003cFLOAT\u003e`: Probability of X-Pro peptide bond being Cis.\n  - Default: `0.05` (5%)\n  - Cis-Proline is critical for tight turns and folding.\n  - Set to `0.0` for all-Trans, `1.0` for all-Cis.\n\n#### **Bulk Dataset Generation (AI)**\n\n- `--mode dataset`: Enable bulk generation 
mode.\n- `--num-samples \u003cN\u003e`: Number of samples to generate (default 100).\n- `--min-length \u003cN\u003e`, `--max-length \u003cN\u003e`: Range for random sequence lengths (default 10-50).\n- `--train-ratio \u003cFLOAT\u003e`: Fraction of samples for the training set (default 0.8).\n- `--output \u003cDIR\u003e`: Directory to save the dataset.\n\n#### **Output Options**\n\n- `--output \u003cFILENAME\u003e`: Custom output filename\n  - If omitted, auto-generates: `random_linear_peptide_\u003clength\u003e_\u003ctimestamp\u003e.pdb`\n  - Example: `--output my_protein.pdb`\n\n- `--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}`: Logging verbosity\n  - Default: `INFO`\n  - Use `DEBUG` for detailed validation reports\n\n- `--seed \u003cINT\u003e`: Random seed for reproducible generation\n  - Default: `None` (Random)\n  - Example: `--seed 42`\n  - Guarantees identical output for the same command.\n\n- `--help`: Show the help message and exit.\n\n### Examples\n\n#### Basic Generation\n\n```bash\n# Simple 25-residue peptide\nsynth-pdb --length 25\n\n# Custom sequence with validation\nsynth-pdb --sequence \"ELVIS\" --validate --output elvis.pdb\n\n# Use biologically realistic frequencies\nsynth-pdb --length 100 --plausible-frequencies\n\n# Generate a random 20-residue alpha helix\nsynth-pdb --length 20 --conformation alpha --output random_helix.pdb\n\n# Generate a high-quality, physically realistic structure (Recommended)\n# Includes: Minimization, Terminal Capping, and Thermal Equilibration (MD)\nsynth-pdb --length 20 --minimize --cap-termini --equilibrate --output best_structure.pdb\n\n# Generate beta sheet conformation\nsynth-pdb --length 20 --conformation beta --output beta_sheet.pdb\n\n# Generate extended conformation\nsynth-pdb --length 15 --conformation extended\n\n# Generate random conformation (mixed alpha/beta regions)\nsynth-pdb --length 30 --conformation random\n\n# 🤖 Bulk dataset generation for AI training\nsynth-pdb --mode dataset --num-samples 500 
--min-length 10 --max-length 40 --output ./my_dataset\n\n# ⛓️ Generate a Zinc Finger with structural cofactors\nsynth-pdb --sequence \"CPHCGKSFSQKSDLVKHQRT\" --minimize --metal-ions auto --output zinc_finger.pdb\n```\n\n#### Quality Control\n\n```bash\n# Generate until valid (may take time!)\nsynth-pdb --length 15 --guarantee-valid --max-attempts 200 --output valid.pdb\n\n# Best of 50 attempts\nsynth-pdb --length 20 --best-of-N 50 --output best_structure.pdb\n```\n\n#### Explicit Solvent \u0026 Hardware Testing\n\nSimulate your protein in a realistic water box (TIP3P) for high-fidelity physics or export the explicit solvent map for downstream molecular dynamics.\n\n```bash\n# Basic explicit solvent: generate a small peptide and pad with 1.2 nm of water.\n# By default, synth-pdb strips the water atoms before saving the final clean PDB.\nsynth-pdb --sequence ALA-PRO-GLY --minimize --solvent explicit --solvent-padding 1.2 --output small_peptide.pdb\n\n# Retain the water box: save the entire simulated system (protein + thousands of HOH atoms)\nsynth-pdb --sequence TRP-TYR-PHE --minimize --solvent explicit --solvent-padding 1.5 --keep-solvent --output full_water_box.pdb\n\n# 🚀 EXTREME Hardware Limit Test\n# Generate a large 50-residue sequence, bury it in a massive 2.5 nm water box,\n# and run 10,000 steps of Langevin Dynamics equilibration.\n# WARNING: This will generate \u003e50,000 atoms and heavily tax your CPU/GPU!\nsynth-pdb --length 50 --conformation random --minimize --equilibrate --md-steps 10000 --solvent explicit --solvent-padding 2.5 --keep-solvent --output extreme_limit_test.pdb\n```\n\n## ML Integration (AI Research)\n\n**synth-pdb** is designed to be a high-performance \"Data Factory\" for Training Protein AI models. 
It can generate thousands of unique, physically plausible protein structures in seconds—bypassing the bottleneck of parsing millions of PDB files from disk.\n\n### 🤖 The Batch Walk (Vectorized Performance)\nUsing the `BatchedGenerator` module, the tool uses SIMD/Vectorized math (NeRF algorithm) to build peptide backbones in parallel.\n\n### ⚡ Zero-Copy Handover\nTransition from biological coordinates to Deep Learning tensors instantly. Our `BatchedPeptide` output is **C-Contiguous**, allowing tools like PyTorch and JAX to map the memory without copying data.\n\n```python\nfrom synth_pdb.batch_generator import BatchedGenerator\nimport torch\n\n# Generate 1,000 structures in milliseconds\nbg = BatchedGenerator(\"ALA-GLY-SER-TRP\", n_batch=1000)\nbatch = bg.generate_batch()\n\n# Instant PyTorch Handover (Shared RAM)\ncoords_tensor = torch.from_numpy(batch.coords).float()\n```\n\n### 🚀 Try it in the Cloud\n- **AI Protein Data Factory:** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elkins/synth-pdb/blob/master/examples/ml_integration/ml_handover_demo.ipynb)\n\n### 🧩 Framework Specifics\nFor detailed examples of how to load generated data into your favorite framework without any performance overhead, see our specialized handover notebooks:\n- [JAX Handover](examples/ml_loading/jax_handover.ipynb) - Zero-copy using `jax.numpy.asarray`.\n- [PyTorch Handover](examples/ml_loading/pytorch_handover.ipynb) - Unified memory mapping with `torch.from_numpy`.\n- [MLX Handover](examples/ml_loading/mlx_handover.ipynb) - Optimized for Apple Silicon (M-series CPUs/GPUs).\n\n#### Quality Control (Continued)\n\n```bash\n# Refine steric clashes (5 iterations)\nsynth-pdb --length 30 --refine-clashes 5 --output refined.pdb\n\n# Combined: best of 10 + refinement\nsynth-pdb --length 25 --best-of-N 10 --refine-clashes 3 --output optimized.pdb\n```\n\n#### Biologically-Inspired Examples\n\nGenerate structures that mimic 
real protein motifs for educational demonstrations:\n\n```bash\n# Collagen-like triple helix motif (polyproline II)\n# Collagen is rich in proline and glycine with PPII conformation\nsynth-pdb --sequence \"GPGPPGPPGPPGPPGPPGPP\" --conformation ppii --output collagen_like.pdb\n\n# Silk fibroin-like beta sheet\n# Silk proteins contain repeating (GAGAGS) motifs forming beta sheets\nsynth-pdb --sequence \"GAGAGSGAGAGSGAGAGS\" --conformation beta --output silk_like.pdb\n\n# Amyloid fibril-like beta structure\n# Amyloid fibrils are rich in beta sheets, often with hydrophobic residues\nsynth-pdb --sequence \"LVEALYLVCGERGFFYTPKA\" --conformation beta --best-of-N 10 --output amyloid_like.pdb\n\n# Leucine zipper motif (alpha helix)\n# Leucine zippers are alpha-helical with leucine repeats every 7 residues\nsynth-pdb --sequence \"LKELEKELEKELEKELEKELEKEL\" --conformation alpha --output leucine_zipper.pdb\n\n# Intrinsically disordered region (random conformation)\n# IDRs lack stable structure, rich in charged/polar residues\nsynth-pdb --sequence \"GGSEGGSEGGSEGGSEGGSE\" --conformation random --output disordered_region.pdb\n\n# Transmembrane helix-like structure (extended alpha helix)\n# Membrane-spanning regions are often long alpha helices with hydrophobic residues\nsynth-pdb --sequence \"LVIVLLVIVLLVIVLLVIVL\" --conformation alpha --output transmembrane_like.pdb\n\n# Beta-turn rich structure (mixed conformations)\n# Proline and glycine favor turns and loops\nsynth-pdb --sequence \"GPGPGPGPGPGPGPGP\" --conformation random --output beta_turn_rich.pdb\n\n# Elastin-like peptide (extended/random)\n# Elastin contains repeating VPGVG motifs with flexible structure\nsynth-pdb --sequence \"VPGVGVPGVGVPGVGVPGVG\" --conformation extended --output elastin_like.pdb\n\n# Antimicrobial peptide-like (alpha helix)\n# Many AMPs are short amphipathic alpha helices\nsynth-pdb --sequence \"KWKLFKKIGAVLKVL\" --conformation alpha --validate --output amp_like.pdb\n\n# Zinc finger motif-like (mixed 
structure)\n# Zinc fingers have beta sheets and alpha helices\nsynth-pdb --sequence \"CPHCGKSFSQKSDLVKHQRT\" --conformation random --best-of-N 5 --output zinc_finger_like.pdb\n```\n\n**Educational Notes:**\n- These examples demonstrate **sequence-structure relationships**\n- Real proteins would have more complex tertiary structures and post-translational modifications\n- Use these for teaching secondary structure concepts, not for actual molecular modeling\n- Combine with `--validate` to show how different conformations affect structural quality\n- Try `--best-of-N` and `--refine-clashes` to explore quality control strategies\n\n#### Visualization-Optimized Examples\n\nThese examples are specifically designed to look great in the 3D viewer with `--visualize`:\n\n```bash\n# 🧬 Compact Alpha Helix (BEST for visualization)\n# Short, tight helix - perfect for interactive viewing\nsynth-pdb --length 15 --conformation alpha --visualize\n\n# 🔗 Helix-Turn-Helix DNA-Binding Motif\n# Classic protein architecture with two helices and a turn\nsynth-pdb --sequence \"AAAAAAGGGAAAAA\" --structure \"1-6:alpha,7-9:random,10-14:alpha\" --visualize\n\n# 🧬 \"Textbook\" Stabilized Alpha Helix (Salt Bridges)\n# Demonstrates charge pairs (Glu-Lys) stabilizing the backbone (i, i+4)\n# Use --minimize to geometry-optimize these ionic interactions\nsynth-pdb --sequence \"EAAKEAAKEAAKEAAK\" --conformation alpha --minimize --cap-termini --visualize\n\n# 🔗 Zinc Finger with Metal Coordination\n# See the Zinc ion (Zn2+) automatically coordinated by Cys/His residues!\n# The --minimize flag applies harmonic constraints to the metal center.\nsynth-pdb --sequence \"CPHCGKSFSQKSDLVKHQRT\" --structure \"1-10:beta,11-20:alpha\" --metal-ions auto --minimize --visualize\n\n# 🎀 Refined Beta Hairpin\n# Two antiparallel beta strands connected by a turn, relaxed with physics\nsynth-pdb --sequence \"VVVVVGGVVVVV\" --structure \"1-5:beta,6-8:random,9-12:beta\" --minimize --visualize\n\n# 🧪 Polyproline II Helix 
(Collagen-like)\n# Left-handed helix, compact and visually distinct\nsynth-pdb --sequence \"GPGPPGPPGPPGPP\" --conformation ppii --minimize --visualize\n\n# 🧪 The \"Kitchen Sink\" (Features Demo)\n# Combines distinct secondary structures (Helix, Sheet) with a Type I Beta Turn and PTMs.\n# Look for the magenta helix, purple turn, and orange phosphorylated residues (SEP/TPO/PTR).\nsynth-pdb --length 25 --structure \"1-10:alpha,11-14:typeI,15-25:beta\" --phosphorylation-rate 0.3 --visualize\n\n# ⭕ The \"Molecular Hoop\" (Macrocycle)\n# A simple flexible ring of Glycines. Perfect for visualizing ring closure.\nsynth-pdb --sequence \"GGGGGGGGGGGG\" --cyclic --minimize --visualize\n```\n\n**Visualization Tips:**\n- **Best conformations for viewing**: `alpha` (most compact), `ppii` (distinctive shape)\n- **Optimal length**: 10-20 residues for clear visualization\n- **In the viewer**: Use \"Cartoon\" style and \"Spectrum\" color for best results\n- **Interactive**: Rotate with left-click, zoom with scroll, pan with right-click\n\n#### Mixed Secondary Structures\n\nThe `--structure` parameter enables creation of realistic protein-like structures with different conformations in different regions:\n\n```bash\n# Helix-turn-helix DNA-binding motif\n# Two alpha helices connected by a flexible turn region, minimized for realism\nsynth-pdb --length 25 --structure \"1-10:alpha,11-15:random,16-25:alpha\" --minimize --output helix_turn_helix.pdb\n\n# Beta-alpha-beta fold unit\n# Common protein architecture with sheet-helix-sheet\nsynth-pdb --length 30 --structure \"1-10:beta,11-15:random,16-25:alpha,26-30:beta\" --minimize --output bab_fold.pdb\n\n# Zinc finger with realistic structure\n# Beta sheet + alpha helix (actual zinc finger architecture)\nsynth-pdb --sequence \"CPHCGKSFSQKSDLVKHQRT\" --structure \"1-5:beta,6-10:random,11-20:alpha\" --minimize --output zinc_finger_realistic.pdb\n\n# Immunoglobulin domain\n# Multiple beta sheets connected by loops (antibody-like)\nsynth-pdb 
--length 40 --structure \"1-8:beta,9-12:random,13-20:beta,21-24:random,25-32:beta,33-40:random\" --minimize --output ig_domain.pdb\n\n# Coiled-coil with flexible linker\n# Two helical regions connected by disordered linker\nsynth-pdb --length 50 --structure \"1-20:alpha,21-30:random,31-50:alpha\" --minimize --output coiled_coil.pdb\n\n# Intrinsically disordered region with structured domain\n# Disordered N-terminus, structured C-terminus (common in signaling proteins)\nsynth-pdb --length 40 --structure \"1-15:random,16-40:alpha\" --minimize --output idr_with_domain.pdb\n\n# Collagen-like with flexibility\n# PPII helix with occasional flexible regions (more realistic than uniform)\nsynth-pdb --sequence \"GPGPPGPPGPPGPPGPPGPP\" --structure \"1-6:ppii,7-9:random,10-20:ppii\" --output collagen_flexible.pdb\n\n# Beta-hairpin motif\n# Two antiparallel beta strands connected by a turn\nsynth-pdb --length 20 --structure \"1-7:beta,8-12:random,13-20:beta\" --refine-clashes 5 --output beta_hairpin.pdb\n```\n\n**Why This Matters:**\n- Real proteins have **mixed secondary structures**, not uniform conformations\n- These examples are much more realistic than single-conformation structures\n- Useful for teaching protein architecture and domain organization\n- Great for testing structure analysis tools with realistic inputs\n- Demonstrates how sequence and structure work together\n\n#### Detailed Educational Case Studies\n\nThese comprehensive examples demonstrate how to use `synth-pdb` to model specific biological features found in well-known proteins.\n\n**1. Glucagon (Alpha Helix Hormone)**\n*29 residues | PDB: 1GCN*\nGlucagon is a peptide hormone that raises glucose levels. It folds into a characteristic alpha helix.\n```bash\nsynth-pdb --sequence HSQGTFTSDYSKYLDSRRAQDFVQWLMNT --conformation alpha --refine-clashes 0 --output glucagon.pdb\n```\n*Educational Concept*: Studying alpha-helical packing and amphipathicity.\n\n**2. 
Melittin (Bent Helix / Hinge)**\n*26 residues | PDB: 2MLT*\nThe principal toxin in bee venom. It forms two alpha helices separated by a \"hinge\" region, allowing it to puncture membranes.\n```bash\nsynth-pdb --sequence GIGAVLKVLTTGLPALISWIKRKRQQ --structure \"1-11:alpha,12-14:random,15-26:alpha\" --refine-clashes 50 --output melittin.pdb\n```\n*Educational Concept*: Modeling non-linear secondary structures and flexible linkers (hinges).\n\n**3. Bovine Pancreatic Trypsin Inhibitor (BPTI) (Disulfide Bonds)**\n*58 residues | PDB: 1BPI*\nA classic model for protein folding studies (\"The Hydrogen Atom of Protein Folding\"). It is stabilized by three disulfide bonds.\n```bash\nsynth-pdb --sequence RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA --conformation random --minimize --visualize --output bpti.pdb\n```\n*Educational Concept*: Automatic detection of disulfide bonds (`SSBOND` records). The `--minimize` flag brings cysteine sulfurs into proper bonding distance (2.0 Å).\n\n**4. Ubiquitin (Complex Mixed Fold)**\n*76 residues | PDB: 1UBQ*\nA highly conserved regulatory protein with a complex mixed alpha/beta fold (beta grasp fold).\n```bash\nsynth-pdb --sequence MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG --structure \"1-7:beta,12-16:beta,23-34:alpha,41-45:beta,48-49:beta,56-59:alpha,66-70:beta\" --minimize --best-of-N 5 --output ubiquitin.pdb\n```\n*Educational Concept*: Generating complex, multi-domain topologies. Physics-based minimization (`--minimize`) resolves steric clashes better than geometric heuristics alone.\n\n**5. 
SFTI-1 (Sunflower Trypsin Inhibitor)**\n*14 residues | PDB: 1SFI*\nA small, potent protease inhibitor that is both **cyclic** and stabilized by a **disulfide bond**.\n```bash\nsynth-pdb --sequence \"GRCTKSIPPICFPD\" --cyclic --minimize --visualize --output sfti1.pdb\n```\n*Educational Concept*: Combining multiple stabilizing modifications (**Cyclization** + **Disulfide Bonds**) to create a rigid, functional scaffold.\n\n**6. Gramicidin S (D-Amino Acid Antibiotic)**\n*10 residues | PDB: 1TK2*\nA powerful cyclic antibiotic produced by soil bacteria. It contains the rare **D-Phenylalanine** (`D-PHE`), which is critical for its \"beta-sheet-like\" hairpins.\n```bash\nsynth-pdb --sequence \"VAL-ORN-LEU-D-PHE-PRO-VAL-ORN-LEU-D-PHE-PRO\" --cyclic --minimize --visualize --output gramicidin_s.pdb\n```\n*Note: This uses ORN (Ornithine) if supported, with LYS as a substitute otherwise. The key is the D-PHE residue.*\n*Educational Concept*: Using D-amino acids to induce specific turns and achieve antimicrobial activity through membrane disruption.\n\n#### 🏗️ \"Architectural\" Protein Examples (The Giants)\n\nThese larger structures demonstrate domain organization and fibrous protein architectures.\n\n**1. \"Synthetic Spectrin\" (Multi-Domain Repeat)**\n*~150 Residues*\nSpectrin is a cytoskeletal protein made of repeating triple-helical bundles. We can simulate a simplified version: three distinct alpha-helical domains connected by flexible linkers.\n```bash\nsynth-pdb --length 150 --structure \"1-40:alpha,41-50:random,51-90:alpha,91-100:random,101-140:alpha,141-150:random\" --minimize --visualize --output synthetic_spectrin.pdb\n```\n*Educational Concept*: Demonstrates \"beads on a string\" domain organization and stable inter-domain flexibility.\n\n**2. 
\"Titin Segment\" (Poly-Beta Repeat)**\n*~120 Residues*\nTitin acts as a molecular spring in muscle, made of distinct Ig-like (beta sheet) domains.\n```bash\nsynth-pdb --length 120 --structure \"1-30:beta,31-40:random,41-70:beta,71-80:random,81-110:beta,111-120:random\" --minimize --visualize --output titin_segment.pdb\n```\n*Educational Concept*: Shows distinct rigid beta-regions separated by disordered \"hinges\", mimicking force-bearing structural proteins.\n\n**3. \"Giant Coiled-Coil\" (The Molecular Rod)**\n*~50 Residues*\nA super-long continuous alpha helix, modeled after Myosin tails or Tropomyosin.\n```bash\nsynth-pdb --sequence \"LKELEKELEKELEKELEKELEKELEKELEKELEKELEKELEKELEKELEKE\" --conformation alpha --minimize --visualize --output long_coil.pdb\n```\n*Educational Concept*: A massive, rigid rod where the helical groove is clearly visible. Excellent for demonstrating persistence length.\n\n**4. \"Synthetic Antibody\" (The Ultimate Stress Test)**\n*450 Residues*\nSimplified simulation of a full IgG Heavy Chain: 4 Beta-sandwich domains (VH, CH1, CH2, CH3) connected by linkers.\n```bash\nsynth-pdb --length 450 --structure \"1-100:beta,101-110:random,111-210:beta,211-230:random,231-330:beta,331-340:random,341-440:beta,441-450:random\" --minimize --visualize --output synthetic_antibody.pdb\n```\n*Note*: This is a computationally intensive task! 
Energy minimization for ~7000 atoms may take several minutes.\n*Educational Concept*: Simulating multi-domain packing and the flexibility of the hinge region (residues 211-230).\n\n#### For Structural Biologists\n\n```bash\n# All natural amino acids with validation report\nsynth-pdb --sequence \"ACDEFGHIKLMNPQRSTVWY\" --validate --log-level DEBUG\n\n# Test structure for MD simulation pipeline\nsynth-pdb --length 50 --guarantee-valid --max-attempts 500 --output test_md.pdb\n\n# Benchmark structure with known violations (good for testing validators)\nsynth-pdb --length 100 --validate --output benchmark.pdb\n```\n\n#### The \"Power User\" Pipeline ⚡️\n\nCombine all features to simulate a complete NMR structure determination workflow:\n\n1.  **Generate** a sequence.\n2.  **Fold** it (alpha helix).\n3.  **Refine** geometry (minimization).\n4.  **Simulate** experimental data (NOEs and Relaxation).\n5.  **Visualize** the result.\n\n```bash\nsynth-pdb --sequence \"LKELEKELEKELEKELEKELEKEL\" \\\n          --conformation alpha \\\n          --minimize \\\n          --gen-nef --noe-cutoff 6.0 \\\n          --gen-relax --field 800 \\\n          --visualize\n```\n\n\u003e **👀 Viewer Tip**: Since you used `--gen-nef`, the **synthetic NOE restraints** will automatically appear as **red cylinders** connecting the protons. Use the **\"🔴 Restraints\"** button in the viewer to toggle them on/off!\n\n![Ghost Mode with Restraints](https://raw.githubusercontent.com/elkins/synth-pdb/master/docs/images/viewer_restraints.png)\n\n#### 🌿 Amphipathic Helix Visualization\nA classic biophysical motif where one face of the helix is hydrophobic (L, V, I) and the other is hydrophilic (K, E, R).\n\n```bash\n# Generate and Minimize\nsynth-pdb --sequence \"LKWLKRLLKWLKRLLKWLKRL\" --conformation alpha --minimize --visualize\n```\n*In the viewer*: Switch to **\"Sphere\"** style and **\"Element\"** color. 
You will see the \"greasy\" hydrophobic patch (Carbon-rich) clearly separated from the charged residues (Nitrogen/Oxygen-rich). This \"hydrophobic moment\" drives membrane binding!\n\n\u003e **🎓 Academic Note - \"Amphipathic\"**:\n\u003e From Greek *amphi* (both) and *pathos* (feeling). An amphipathic helix has a \"split personality\":\n\u003e *   **Hydrophobic Face** (L, V, I, F): Hates water. Buries itself inside the protein core or membrane.\n\u003e *   **Hydrophilic Face** (K, R, E, D): Loves water. Faces the solvent to keep the protein soluble.\n\u003e This duality reflects the hydrophobic effect, a major driving force in protein folding! 🧬🌗\n\n## Validation \u0026 Refinement\n\n### Validation Checks\n\nWhen `--validate` is enabled, the tool checks for:\n\n1. **Bond Lengths**: Compares N-CA, CA-C, C-N, C-O distances against standard values (±0.05 Å tolerance)\n\n2. **Bond Angles**: Validates N-CA-C, CA-C-N, CA-C-O angles (±5° tolerance)\n\n3. **Ramachandran Angles**: Checks phi/psi dihedral angles against MolProbity-defined polygonal regions\n   - **Categories**: General, Glycine, Proline, Pre-Proline\n   - **Levels**: Distinguishes between Favored, Allowed, and Outlier status\n\n4. **Steric Clashes**: Detects atoms that are too close\n   - Minimum distance rule: ≥2.0 Å between any atoms\n   - van der Waals overlap: atoms closer than sum of vdW radii\n\n5. **Peptide Plane Planarity**: Checks omega (ω) dihedral angles\n   - Trans: ~180° (±30° tolerance)\n   - Cis: ~0° (±30° tolerance)\n\n6. **Sequence Improbabilities**: Flags unusual sequence patterns\n   - Charge clusters (4+ consecutive charged residues)\n   - Long hydrophobic stretches (8+ residues)\n   - Odd cysteine counts (unpaired cysteines)\n   - Poly-proline or poly-glycine runs\n\n7. 
**Chirality**: Validates L-amino acid stereochemistry\n   - Checks improper dihedral angle N-CA-C-CB\n   - L-amino acids should have proper chirality (improper dihedral ±60° to ±120°)\n   - Glycine is automatically exempt (no CB atom)\n   - Detects incorrect stereochemistry (D-amino acids)\n\n### Refinement Strategy\n\nThe `--refine-clashes` option uses an iterative approach:\n1. Identifies clashing atom pairs\n2. Slightly adjusts positions to increase separation\n3. Re-validates structure\n4. Stops when no improvement or max iterations reached\n\n\u003e **Note**: Refinement focuses on steric clashes and may introduce other violations. Use in combination with `--best-of-N` for better results.\n\n## Output PDB Format\n\n### Structure Representation\n\n- **Full Atomic Model**: All backbone atoms (N, CA, C, O) + side-chain heavy atoms + hydrogens\n- **Geometry**: Linear alpha-helix conformation along the X-axis\n- **Chain ID**: Always 'A'\n- **Residue Numbering**: Sequential from 1\n- **Terminal Modifications**: N-terminal and C-terminal hydrogens/oxygens included\n\n### Atomic Records \u0026 B-Factors\n\nEach atom line follows the standard PDB format. The **B-factor** (Temperature Factor) is stored in **columns 61-66**.\n\n```text\nATOM      1  N   ALA A   1      -2.193   1.858   1.271  0.85 56.71           N\nATOM      5  CB  ALA A   1       0.241   1.845   1.013  0.85 86.14           C\n                                                        ^^^^ ^^^^^\n                                                       Occpy B-Fact\n```\n\n*   **Occupancy (0.85)**: Reflects the Order Parameter ($S^2$) if calculated, or default.\n*   **B-Factor (56.71 vs 86.14)**: Reflects atomic mobility. 
Note how the side-chain atom (CB) has a higher B-factor than the backbone (N), indicating greater flexibility.\n\n### Header Information\n\nGenerated PDB files include standard header records:\n\n```\nHEADER    PEPTIDE           \u003cDATE\u003e\nTITLE     GENERATED LINEAR PEPTIDE OF LENGTH \u003cN\u003e\nREMARK 1  This PDB file was generated by the CLI 'synth-pdb' tool.\nREMARK 2  It represents a simplified model of a linear peptide chain.\nREMARK 2  Coordinates are idealized and do not reflect real-world physics.\nREMARK 3  GENERATION PARAMETERS:\nREMARK 3  Command: synth-pdb --length 10 --validate ...\n```\n\nThe **REMARK 3** records store the exact command-line arguments used for **reproducibility**.\n\n### Validation Reports\n\nWhen `--validate` is used, violations are reported:\n```\nWARNING  --- PDB Validation Report for /path/to/file.pdb ---\nWARNING  Final PDB has 5 violations.\nWARNING  Bond length violation: N-1-A to CA-1-A. Distance: 1.52Å, Expected: 1.46Å±0.05Å\nWARNING  Steric clash (min distance): Atoms CA-3-A and CB-3-A are too close (1.85Å)...\n```\n\n## Scientific Context\n\n### Intended Use Cases\n\n✅ **Appropriate Uses:**\n- Testing PDB parsers and file I/O\n- Benchmarking structure validation tools\n- Educational demonstrations of protein structure concepts\n- Generating test datasets for bioinformatics pipelines\n- Placeholder structures for software development\n\n❌ **Inappropriate Uses:**\n- Homology modeling templates\n- Drug docking studies\n- Experimental predictions\n- Publication-quality structures\n\nReal protein structures require sophisticated methods like:\n- Molecular dynamics with force fields (AMBER, CHARMM)\n- Quantum mechanics calculations (DFT)\n- Energy minimization and conformational search\n- Crystallographic or NMR experimental data\n\n## Limitations\n\n### Structural Limitations\n\n1. 
**Topology**:\n   - Primarily generates **linear** variations or simple **disulfide-bonded** loops.\n   - Does not perform *de novo* folding (prediction of tertiary structure from sequence).\n   - Multi-chain complexes are currently limited to simple docking preparations.\n\n2. **Geometry**:\n   - **Default Mode**: Uses idealized internal coordinates (perfect bond lengths/angles).\n   - **Physically Realistic Mode** (`--minimize`): Resolves this by relaxing the structure with OpenMM, but is computationally more expensive.\n\n3. **Rotamer Library**:\n   - **Backbone-Dependent**: Fully implemented for **All 20 Amino Acids**.\n   - **Mechanism**: Checks local secondary structure (Alpha/Beta) to select rotamers that avoid backbone clashes.\n   - **Rare Rotamers**: Very rare side-chain conformations (\u003c1% probability) may be undersampled.\n\n4. **Environmental Effects**:\n   - **Solvent**: Uses Implicit Solvent (OBC2) to model water screening, but lacks explicit water molecules.\n   - **Membranes**: No lipid bilayer simulation for transmembrane proteins.\n\n### Validation Limitations\n\n- **Ramachandran Regions**: Uses simplified **rectangular boundaries** for valid phi/psi regions. While faster, this is less rigorous than the contoured probability density functions used by MolProbity.\n- **Electrostatics**: Basic clash detection does not account for long-range electrostatic repulsion/attraction (though `--minimize` does).\n- **Protonation**: Simple pH-based titration (His/Asp/Glu) without full pKa calculation.\n\n### Terminology: Decoys vs NMR Ensembles\n\nThere is an important distinction between the \"Decoys\" generated by this tool and a traditional \"NMR Ensemble\":\n\n*   **NMR Ensemble**: A set of structures (usually 20) that *all satisfy* experimental restraints (NOEs) and have converged to the same fold. 
They represent the **precision** of the structure determination.\n*   **Decoys (Conformational Ensemble)**: A set of independent structures generated to sample the conformational space. They often have high RMSD (diversity) and represent the **search space**.\n\n`synth-pdb --mode decoys` generates the latter: independent snapshots. To create a pseudo-NMR ensemble, use `--rmsd-max 2.0` to filter for similar structures.\n\n### Performance Considerations\n\n- `--guarantee-valid` may **never converge** for long sequences (\u003e50 residues)\n  - Combinatorial explosion of possible violations\n  - Consider using `--best-of-N` instead\n\n- `--refine-clashes` is **iterative and may be slow** for large structures\n  - Each iteration requires full re-validation\n\n- Validation runtime scales with sequence length (O(N²) for steric clashes)\n\n## Development\n\n### Running Tests\n\n```bash\n# All tests\npytest -v\n\n# With coverage\npytest --cov=synth_pdb --cov-report=term-missing\n\n# Specific test file\npytest tests/test_generator.py -v\n```\n\n**Test Coverage**: 93% overall\n- 1369 tests covering generation, validation, CLI and edge cases\n\n\n\n### Project Structure\n\n```\nsynth-pdb/\n├── synth_pdb/\n│   ├── __init__.py\n│   ├── main.py              # CLI entry point\n│   ├── generator.py         # PDB structure generation (NeRF, rotamers, PTMs, D-AAs)\n│   ├── validator.py         # Validation checks \u0026 get_quality_report()\n│   ├── physics.py           # OpenMM energy minimization, MD, simulate_trajectory()\n│   ├── data.py              # Constants, rotamer library, Ramachandran polygons\n│   ├── nmr.py               # RPF scores, NOE compatibility shims (delegates to synth-nmr)\n│   ├── rdc.py               # Residual Dipolar Coupling (Saupe-matrix formalism)\n│   ├── msa.py               # MCMC Potts-model MSA co-evolution generator\n│   ├── plm.py               # ESM-2 protein language model embeddings\n│   ├── orientogram.py       # 6D rotation-invariant 
inter-residue orientation\n│   ├── batch_generator.py   # Vectorized BatchedGenerator for AI training\n│   ├── decoys.py            # Hard-decoy generation (threading, drift, shuffle)\n│   ├── dataset.py           # Bulk dataset generation (NPZ / PDB format)\n│   ├── chemical_shifts.py   # SPARTA-lite + ring-current shift prediction\n│   ├── biophysics.py        # Biophysical utility functions\n│   ├── viewer.py            # 3Dmol.js browser-based visualizer\n│   ├── geometry/            # Geometry subpackage (v1.27+)\n│   │   ├── superposition.py # Kabsch algorithm, apply_transformation, find_medoid\n│   │   ├── rmsd.py          # RMSD, pairwise RMSD, symmetry-aware variants\n│   │   ├── dihedral.py      # Dihedral angle calculations\n│   │   ├── nerf.py          # NeRF backbone construction kernels\n│   │   ├── sidechain.py     # Side-chain geometry helpers\n│   │   └── vectorized.py    # NumPy-vectorized / Numba-JIT geometry kernels\n│   ├── ensemble/            # NMR ensemble analysis subpackage (v1.34.1+)\n│   │   ├── daop.py          # DAOPCalculator (Hyberts 1992 dihedral order parameters)\n│   │   └── statistics.py    # EnsembleStatistics, QualityAssessment dataclasses\n│   └── quality/             # Structure quality scoring (v1.18+)\n│       ├── gnn/             # Graph Neural Network quality scorer\n│       ├── classifier.py    # Random Forest / GNN quality filter interface\n│       └── features.py      # Feature extraction for quality models\n├── tests/\n│   ├── test_generator.py\n│   ├── test_validator.py\n│   ├── test_scientific_validation.py\n│   ├── test_coupling.py\n│   ├── unit/                # Unit tests for geometry, ensemble, quality modules\n│   └── ... 
(many more)\n├── examples/\n│   ├── interactive_tutorials/\n│   ├── ml_integration/\n│   └── ml_loading/          # JAX / PyTorch / MLX zero-copy handover\n├── docs/\n├── incubator/\n├── pyproject.toml\n└── README.md\n```\n\n## 📚 Biophysical References \u0026 Further Reading\n\nFor students and researchers interested in the physics behind the code, here are key seminal papers:\n\n*   **Cis-Proline (~5% Frequency):**\n    *   MacArthur, M. W., \u0026 Thornton, J. M. (1991). Influence of proline residues on protein conformation. *J Mol Biol*, 218(2), 397-412.\n    *   Weiss, M. S., et al. (1998). Cis-proline. *Acta Cryst D*, 54, 323-329.\n\n*   **Macrocyclization \u0026 Cyclic Peptides:**\n    *   Horton, D. A., et al. (2003). The combinatorial synthesis of bicyclic peptides. *Chem. Rev.*, 103(3), 893-930. (Seminal review on macrocycles).\n    *   Craik, D. J., et al. (2013). The future of peptide-based drugs. *Chem. Biol. Drug Des.*, 81(1), 136-147.\n\n*   **NMR Structure Validation \u0026 Chirality:**\n    *   Montelione, G. T., et al. (2013). Recommendations of the wwPDB NMR Validation Task Force. *Structure*, 21(9), 1563-1570. (Defines standards for geometric validation).\n    *   Huang, Y. J., Powers, R., \u0026 Montelione, G. T. (2005). \"Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics.\" *Journal of the American Chemical Society*, 127(6), 1665-1674.\n    *   Raman, S., et al. (2010). \"NMR Structure Determination for Larger Proteins Using Backbone-Only Data.\" *Science*, 327(5968), 1014-1018. (Using chemical shifts and RDCs for structure prediction).\n    *   Bhattacharya, A., \u0026 Montelione, G. T. (2011). PDBStat: a server for validation of protein NMR structures.\n\n*   **Nuclear Overhauser Effect (NOE) \u0026 $r^{-6}$:**\n    *   Wüthrich, K. (1986). *NMR of Proteins and Nucleic Acids*. Wiley-Interscience. (The definitive text).\n    *   Wüthrich, K. (2003). 
Nobel Lecture: NMR Studies of Protein Structure and Dynamics.\n\n*   **Chemical Shift Prediction (SPARTA) \u0026 Referencing (DSS):**\n    *   Shen, Y., \u0026 Bax, A. (2010). SPARTA+: a modest improvement in empirical NMR chemical shift prediction... *J Biomol NMR*, 48, 13-22.\n    *   Markley, J. L., et al. (1998). Recommendations for the presentation of NMR structures... (IUPAC). *Pure Appl Chem*, 70(1), 117-142. (Defined DSS as the standard).\n\n*   **Internal Dynamics \u0026 Model-Free Formalism:**\n    *   Lipari, G., \u0026 Szabo, A. (1982). Model-free approach to the interpretation of nuclear magnetic resonance relaxation in macromolecules. 1. Theory and range of validity. *J. Am. Chem. Soc.*, 104(17), 4546–4559. (The foundational theory).\n    *   Kay, L. E., Torchia, D. A., \u0026 Bax, A. (1989). Backbone dynamics of proteins as studied by 15N inverse detected heteronuclear NMR spectroscopy... *Biochemistry*, 28(23), 8972-8979. (The seminal application to proteins).\n\n## References \u0026 Bibliography\n\n### Structure Generation \u0026 Rotamers\n1.  **Dunbrack, R. L., \u0026 Cohen, F. E. (1997).** Bayesian statistical analysis of protein side-chain rotamer preferences. *Protein Science, 6*(8), 1661–1681.\n    - Used for: Rotamer libraries and side-chain probability distributions.\n2.  **Parsons, J., et al. (2005).** Practical conversion from torsion space to Cartesian space for in silico protein synthesis. *Journal of Computational Chemistry, 26*(10), 1063–1068.\n    - Used for: The NeRF (Natural Extension Reference Frame) algorithm for backbone construction.\n3.  **MacArthur, M. W., \u0026 Thornton, J. M. (1991).** Influence of proline residues on protein conformation. *Journal of Molecular Biology*, 218(2), 397-412.\n    - Used for: Cis-Proline isomerization statistics (~5% cis frequency).\n4.  **Homeyer, N., et al. (2006).** AMBER force-field parameters for phosphorylated amino acids... 
*Journal of Molecular Modeling*, 12(3), 281-289.\n    - Used for: PTM physics parameters (SEP, TPO, PTR) in OpenMM.\n5.  **Smith, D. M. (2001).** Protein Composition and Structure. *Encyclopedia of Life Sciences*.\n    - Used for: Biological amino acid frequency data.\n\n### NMR Dynamics \u0026 Relaxation\n6.  **Lipari, G., \u0026 Szabo, A. (1982).** Model-free approach to the interpretation of nuclear magnetic resonance relaxation in macromolecules. *Journal of the American Chemical Society, 104*(17), 4546–4559.\n    - Used for: Calculating $S^2$ order parameters and relaxation rates ($R_1, R_2, NOE$).\n7.  **Wishart, D. S., et al. (1995).** 1H, 13C and 15N random coil NMR chemical shifts of the common amino acids. *Journal of Biomolecular NMR, 6*, 135–140.\n    - Used for: Random coil chemical shift baselines.\n8.  **Cavanagh, J., et al. (2007).** *Protein NMR Spectroscopy: Principles and Practice*. Academic Press.\n    - Used for: General NMR theory and relaxation equations.\n\n### Validation\n9.  **Williams, C. J., et al. (2018).** MolProbity: More and better reference data for improved all-atom structure validation. *Protein Science, 27*(1), 293–315.\n    - Used for: Ramachandran polygon definitions and validation criteria.\n10.  **Lovell, S. C., et al. (2003).** Structure validation by Calpha geometry: phi,psi and Cbeta deviation. *Proteins: Structure, Function and Bioinformatics, 50*(3), 437–450.\n    - Used for: Early reference for Ramachandran validation concepts.\n\n## Glossary of Scientific Terms \u0026 Acronyms\n\nThis section provides definitions and seminal references for the biophysical and computational terms used throughout `synth-pdb`.\n\n| Term | Definition | Reference |\n| :--- | :--- | :--- |\n| **AMBER** | **Assisted Model Building with Energy Refinement**. A widely-used suite of molecular simulation programs and force fields for biomolecules. | Case, D. A., et al. (2005). *J. Comput. 
Chem.* |\n| **B-factor** | **Temperature Factor** (8π²⟨u²⟩). Measures atomic displacement due to thermal motion and static disorder. Higher values indicate greater flexibility; lower values indicate rigidity. | — |\n| **Backbone-Dependent Rotamer** | A side-chain conformation probability that depends on the local backbone angles (φ, ψ). Used to select realistic side-chain orientations based on secondary structure context. | Dunbrack \u0026 Cohen (1997). *Protein Science.* |\n| **CASP** | **Critical Assessment of Structure Prediction**. A community-wide experiment held every two years to establish the state-of-the-art in protein structure modeling. | Kryshtafovych, A., et al. (2021). *Proteins.* |\n| **Chi Angles (χ)** | Dihedral angles describing side-chain conformation about successive bonds from Cα outward (χ₁, χ₂, …). Discrete preferred values define rotamers. | — |\n| **CSI** | **Chemical Shift Index**. A standard method used to deduce protein secondary structure (alpha helix vs. beta sheet) from detected NMR chemical shift deviations. | Wishart, D. S., et al. (1992). *Biochemistry.* |\n| **Macrocycle** | A cyclic macromolecule or macromolecular network, such as a cyclic peptide or a crown ether. In therapeutic chemistry, macrocyclization improves metabolic stability and binding affinity. | IUPAC Gold Book. |\n| **MolProbity** | A structure validation web service and scoring function providing the gold standard for Ramachandran and rotamer analysis. | Chen, V. B., et al. (2010). *Acta Cryst. D.* |\n| **NEF** | **NMR Exchange Format**. A unified, open standard for the exchange of NMR restraint data among various software packages. | Gutmanas, A., et al. (2015). *Nat. Struct. Mol. Biol.* |\n| **NeRF** | **Natural Extension Reference Frame**. An algorithm for rapidly constructing 3D Cartesian coordinates from internal coordinates (bond lengths, angles, and dihedrals). | Parsons, J., et al. (2005). *J. Comput. Chem.* |\n| **NOE** | **Nuclear Overhauser Effect**. 
A phenomenon where magnetization is transferred between spins through space, allowing measurement of inter-atomic distances (r⁻⁶ dependency). | Wüthrich, K. (1986). *NMR of Proteins and Nucleic Acids.* |\n| **OBC2** | **Onufriev-Bashford-Case model 2**. A computationally efficient implicit solvent model (Generalized Born) used to simulate the screening effect of water on charged groups. | Onufriev, A., et al. (2004). *Proteins.* |\n| **PDB** | **Protein Data Bank**. The global repository for 3D structural data of proteins, nucleic acids, and complex assemblies. | Berman, H. M., et al. (2000). *Nucleic Acids Res.* |\n| **Phi/Psi (φ, ψ)** | Backbone dihedral angles. φ is defined by C(i−1)−N−Cα−C; ψ is defined by N−Cα−C−N(i+1). Together they determine backbone geometry and are plotted on the Ramachandran plot. | — |\n| **Pre-Proline** | The residue immediately preceding a Proline. It has restricted conformational freedom due to steric clash with the Proline ring, and uses a distinct Ramachandran distribution. | — |\n| **Ramachandran Plot** | A 2D plot of φ vs ψ angles showing energetically allowed and disallowed backbone conformations for amino acids. The basis for structural validation. | Ramachandran et al. (1963). *J. Mol. Biol.* |\n| **Rotamer** | Short for \"Rotational Isomer\". Preferred, low-energy side-chain conformations defined by discrete χ-angle clusters. | Dunbrack, R. L. (2002). *Curr. Opin. Struct. Biol.* |\n| **S²** | **Model-Free Order Parameter** (Lipari-Szabo). A value between 0 (random/flexible) and 1 (perfectly rigid) describing the degree of spatial restriction of local backbone motion on ps–ns timescales. | Lipari, G., \u0026 Szabo, A. (1982). *J. Am. Chem. Soc.* |\n| **SASA** | **Solvent Accessible Surface Area**. The surface area of a biomolecule accessible to a solvent probe (typically a 1.4 Å water molecule). Low SASA indicates a buried residue; high SASA indicates solvent exposure. | Shrake \u0026 Rupley (1973). *J. Mol. 
Biol.* |\n| **BMRB** | **BioMagResBank**. The international repository for NMR spectroscopic data derived from biological molecules, including chemical shift assignments, restraint files, and relaxation data. | Ulrich, E. L., et al. (2008). *Nucleic Acids Res.* |\n| **DAOP** | **Dihedral Angle Order Parameter**. A circular statistics metric (range 0–1) quantifying the consistency of backbone dihedral angles (φ, ψ) across an NMR ensemble. Well-defined residues satisfy S(φ)+S(ψ) ≥ 1.8 (PDBStat convention). Available via `synth_pdb.ensemble.daop`. | Hyberts, S. G., et al. (1992). *Protein Science* 1:736. |\n| **DCA** | **Direct Coupling Analysis**. A statistical inference method that identifies evolutionarily co-varying residue pairs in a multiple sequence alignment to predict spatial contacts and generate AlphaFold-ready MSA inputs. | Morcos, F., et al. (2011). *PNAS* 108:E1293. |\n| **Engh \u0026 Huber** | The landmark (1991) set of ideal bond lengths and bond angles for the 20 standard amino acids, derived from small-molecule crystallography. `PDBValidator` uses these as Z-score reference distributions (v1.29+). | Engh, R. A., \u0026 Huber, R. (1991). *Acta Cryst. A* 47:392. |\n| **ESM-2 / PLM** | **Evolutionary Scale Modeling 2 / Protein Language Model**. A large transformer trained on millions of protein sequences that produces per-residue embeddings for zero-shot quality scoring. Available via `synth_pdb.quality.plm`; install with `pip install synth-pdb[plm]`. | Lin, Z., et al. (2023). *Science* 379:1123. |\n| **GNN** | **Graph Neural Network**. A deep learning model operating on graph-structured data. In `synth_pdb.quality.gnn`, residues are nodes and spatial/sequence contacts are edges, enabling structure quality assessment. Install with `pip install synth-pdb[gnn]`. | Kipf, T. N., \u0026 Welling, M. (2017). *ICLR.* |\n| **IDR / IDP** | **Intrinsically Disordered Region / Protein**. A protein region that lacks a stable 3D fold under physiological conditions. 
Characterised by high RMSF, low S², and low AlphaFold pLDDT. Validated against PRE NMR data in `idp_ensemble_validation.ipynb`. | Dyson, H. J., \u0026 Wright, P. E. (2005). *Nat. Rev. Mol. Cell Biol.* |\n| **Kauzmann (Hydrophobic Effect)** | The thermodynamic driving force for hydrophobic residues to bury in a protein's core, arising from the entropic cost of ordering water around non-polar groups. Cited in SASA burial validation (v1.29). | Kauzmann, W. (1959). *Adv. Protein Chem.* 14:1. |\n| **Magic Step** | A coupled MCMC mutation proposal in the MSA Potts-Model sampler where two spatially contacting residues are mutated simultaneously, preserving co-evolutionary constraints (20% proposal rate, v1.26+). | — |\n| **MCMC / Metropolis-Hastings** | **Markov Chain Monte Carlo**. A class of algorithms for sampling from probability distributions. Used in `synth_pdb.msa` to simulate protein sequence evolution on the Potts Model energy landscape. | Metropolis, N., et al. (1953). *J. Chem. Phys.* 21:1087. |\n| **Orientogram** | A 6D rotation-invariant representation of inter-residue orientations in a protein structure, used as a structural fingerprint and neural network input feature. See `synth_pdb.orientogram`. | — |\n| **pLDDT** | **Predicted Local Distance Difference Test**. AlphaFold2's per-residue confidence score (0–100). Low pLDDT (\u003c 50) often indicates an intrinsically disordered region rather than a prediction failure. Correlates positively with NMR S² and inversely with MD RMSF. | Jumper, J., et al. (2021). *Nature* 596:583. |\n| **Potts Model** | A statistical physics model of interacting spins on a lattice, applied in `synth_pdb.msa` to protein sequences: each position is a spin (amino acid) and J_ij couplings encode co-evolutionary interactions between residue pairs. | Weigt, M., et al. (2009). *PNAS* 106:67. |\n| **PPII** | **Polyproline II Helix**. A left-handed helical conformation (φ ≈ −75°, ψ ≈ +145°) common in collagen and proline-rich sequences. 
Specifiable via `--conformation ppii`. | — |\n| **PRE** | **Paramagnetic Relaxation Enhancement**. An NMR phenomenon where a paramagnetic spin label broadens nearby nuclear resonances in proportion to r⁻⁶. Used to validate IDP conformational ensembles. | Clore, G. M., \u0026 Iwahara, J. (2009). *Chem. Rev.* 109:4108. |\n| **Q-factor** | A dimensionless goodness-of-fit metric for Residual Dipolar Couplings: Q = RMS(D_calc − D_obs) / RMS(D_obs). Lower is better; high-quality structures typically achieve Q \u003c 0.20. | Cornilescu, G., et al. (1998). *J. Biomol. NMR* 12:373. |\n| **RDC** | **Residual Dipolar Coupling**. An NMR observable arising when a molecule is partially aligned in an anisotropic medium. Encodes long-range bond-vector orientation information relative to the molecular alignment frame. Computed by `synth_pdb.rdc`. | Tjandra, N., \u0026 Bax, A. (1997). *Science* 278:1111. |\n| **RMSF** | **Root Mean Square Fluctuation**. The standard deviation of each residue's position over time in an MD trajectory (after Kabsch rigid-body alignment). High RMSF indicates flexibility; low RMSF indicates rigidity. Inversely related to S² and pLDDT. | — |\n| **Saupe Matrix / Alignment Tensor** | The 3×3 traceless symmetric tensor describing the degree and orientation of molecular alignment in an anisotropic medium. Parameterised by axial component `Da` and rhombicity `R` for RDC calculations. | Saupe, A. (1968). *Angew. Chem.* 7:97. |\n| **Top2018** | A high-resolution Ramachandran reference dataset derived from ~15,000 protein chains (resolution \u003c 1.5 Å), superseding Top8000. Adopted in `PDBValidator` from v1.29 for more accurate φ/ψ boundary validation. | — |\n| **Top8000** | A high-quality curated dataset of ~8000 protein chains (resolution \u003c 2.0 Å, low sequence homology) used to derive accurate Ramachandran contours and rotamer libraries. | Lovell, S. C., et al. (2003). 
*Proteins.* |\n\n## License\n\nThis project is provided as-is for educational and testing purposes.\n\n---\n\n## Citation\n\nIf you use this software in your research, please cite:\n\n```bibtex\n@software{synth_pdb,\n  author = {Elkins, George},\n  title = {synth-pdb: Realistic Protein Structure Generator},\n  year = {2026},\n  url = {https://github.com/elkins/synth-pdb}\n}\n```\n\n## 🛠️ Software \u0026 Libraries\n\nThis project relies on the following open-source scientific software:\n\n- **[OpenMM](https://openmm.org/)**: High-performance molecular dynamics toolkit used for physics-based energy minimization (Implicit Solvent/OBC2).\n- **[Biotite](https://www.biotite-python.org/)**: Comprehensive structural biology library used for PDB I/O, atom manipulation, and geometric analysis.\n- **[3Dmol.js](https://3dmol.csb.pitt.edu/)**: JavaScript library for molecular visualization used in the `--visualize` browser-based viewer.\n- **[NumPy](https://numpy.org/)**: Fundamental package for scientific computing and matrix operations.\n\n### Tools with NEF Support\nThese external tools can import the data generated by `synth-pdb`:\n- **[CCPNMR Analysis](https://ccpn.ac.uk/)**: Premier software for NMR data analysis, assignment, and structure calculation (Native NEF support).\n- **[CYANA](http://www.cyana.org/)**: Automated NMR structure calculation.\n- **[XPLOR-NIH](https://nmr.cit.nih.gov/xplor-nih/)**: Biomolecular structure determination.\n\n## 📚 References \u0026 Scientific Publications\n\n### Key Publications in NMR Structure Validation\n\n1.  **Protein Structure Validation Suite (PSVS)**\n    *   Bhattacharya, A., Tejero, R., \u0026 Montelione, G. T. (2007). \"Evaluating protein structures determined by structural genomics consortia.\" *Proteins: Structure, Function, and Bioinformatics*, 66(4), 778-795.\n    *   [Link to Publisher](https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.21165)\n\n2.  **RPF Scores (Recall, Precision, F-measure)**\n    *   Huang, Y. 
J., Powers, R., \u0026 Montelione, G. T. (2005). \"Protein NMR recall, precision, and F-measure scores (RPF scores): structure quality assessment measures based on information retrieval statistics.\" *Journal of the American Chemical Society*, 127(6), 1665-1674.\n    *   [Link to Publisher](https://pubs.acs.org/doi/10.1021/ja0471963)\n\n3.  **DP Score (Discriminant Power)**\n    *   Huang, Y. J., Tejero, R., Powers, R., \u0026 Montelione, G. T. (2006). \"A topology-constrained distance network algorithm for protein structure determination from NOESY data.\" *Proteins: Structure, Function, and Bioinformatics*, 62(3), 587-603.\n    *   [Link to Publisher](https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.20784)\n\n### Data Standards\n\n- **NMR Exchange Format (NEF)**\n    *   Gutmanas, A., et al. (2015). \"NMR Exchange Format: a unified and open standard for representation of NMR restraint data.\" *Nature Structural \u0026 Molecular Biology*, 22, 433–434.\n    *   [Link to Publisher](https://www.nature.com/articles/nsmb.3041)\n    *   **Extension Proposal:** \"Proposal For Incorporating NMR Relaxation Data In NEF\" (GitHub PDF)\n        *   [Link to Proposal](https://github.com/NMRExchangeFormat/NEF/blob/master/specification/Proposal","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felkins%2Fsynth-pdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felkins%2Fsynth-pdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felkins%2Fsynth-pdb/lists"}