{"id":28414646,"url":"https://github.com/shahram-boshra/qm7_database_process","last_synced_at":"2026-04-29T10:33:34.620Z","repository":{"id":296486492,"uuid":"993546555","full_name":"shahram-boshra/qm7_database_process","owner":"shahram-boshra","description":"QM7 Dataset Processing and Curation","archived":false,"fork":false,"pushed_at":"2025-07-24T17:58:33.000Z","size":48,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-05T09:48:41.371Z","etag":null,"topics":["chemical-informatics","gnn","graph-neural-networks","machine-deep-learning","python","pytorch","pytorch-geometric","qm7-database","rdkit-chem"],"latest_commit_sha":null,"homepage":"https://github.com/shahram-boshra/qm7_database_process","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shahram-boshra.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"citations.py","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-31T01:52:30.000Z","updated_at":"2025-07-24T17:58:37.000Z","dependencies_parsed_at":"2025-05-31T13:58:48.696Z","dependency_job_id":"d1577e4f-e88c-4215-8365-ad014bc41cfe","html_url":"https://github.com/shahram-boshra/qm7_database_process","commit_stats":null,"previous_names":["shahram-boshra/qm7_database_process"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shahram-boshra/qm7_database_process","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shahram-boshra%2Fqm7_database_process","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/shahram-boshra%2Fqm7_database_process/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shahram-boshra%2Fqm7_database_process/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shahram-boshra%2Fqm7_database_process/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shahram-boshra","download_url":"https://codeload.github.com/shahram-boshra/qm7_database_process/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shahram-boshra%2Fqm7_database_process/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32421858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemical-informatics","gnn","graph-neural-networks","machine-deep-learning","python","pytorch","pytorch-geometric","qm7-database","rdkit-chem"],"created_at":"2025-06-03T08:46:58.547Z","updated_at":"2026-04-29T10:33:34.614Z","avatar_url":"https://github.com/shahram-boshra.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# QM7 Dataset Processing and Curation to PyTorch Geometric Molecular 
Graphs\n\n[![Python Version](https://img.shields.io/badge/python-%3E=3.8-blue.svg)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code Style](https://img.shields.io/badge/code%20style-pep8-brightgreen.svg)](https://peps.python.org/pep-0008/)\n[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%23167ac6)](https://pycqa.github.io/isort/)\n[![Linter: pylint](https://img.shields.io/badge/linter-pylint-yellowgreen)](https://www.pylint.org/)\n[![Formatter: black](https://img.shields.io/badge/formatter-black-000000?style=flat\u0026logo=python\u0026logoColor=yellow)](https://github.com/psf/black)\n\n\u003e A robust Python pipeline for processing the QM7 quantum chemistry dataset into graph-based format optimized for PyTorch Geometric.\n\n## 🧪 Overview\n\nThis repository provides a comprehensive solution for transforming the QM7 quantum chemistry dataset into PyTorch Geometric (PyG) graph format. 
The QM7 dataset contains 7,165 molecules with up to 23 atoms (C, O, N, S, H) and their corresponding atomization energies, making it an ideal benchmark for Graph Neural Networks (GNNs) in molecular property prediction tasks.\n\n### Key Capabilities\n\n- **Multi-format Data Loading**: Seamlessly processes SDF, CSV, and MAT files\n- **Rich Graph Construction**: Creates detailed molecular graphs with comprehensive node and edge features\n- **Data Quality Assurance**: Implements thorough consistency checks and alignment validation\n- **Memory-Efficient Processing**: Handles large datasets through intelligent chunking\n- **Feature Normalization**: Applies global standardization using scikit-learn's StandardScaler\n- **Flexible Filtering**: Supports custom pre-filtering based on molecular properties\n- **Extensible Transforms**: Integrates with PyTorch Geometric's transform ecosystem\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Python ≥ 3.8\n- CUDA-capable GPU (recommended for large-scale processing)\n\n### Installation\n\n1. **Clone the repository**\n   ```bash\n   git clone https://github.com/shahram-boshra/qm7_database_process.git\n   cd qm7_database_process\n   ```\n\n2. **Create and activate virtual environment**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. 
**Install dependencies**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n### Dataset Setup\n\nDownload the QM7 dataset files and organize them as follows:\n\n```\ndata/\n└── qm7/\n    └── raw_data/\n        ├── gdb7.sdf                    # Molecular structures\n        ├── atomization_energies.csv    # Target energies\n        └── qm7.mat                     # Coulomb matrices and charges\n```\n\n**Dataset Sources:**\n- [QM7 on Figshare](https://figshare.com/articles/QM7_dataset/1930773)\n- [MoleculeNet QM7](https://moleculenet.org/datasets/qm7)\n\n### Basic Usage\n\n```python\nfrom pathlib import Path\nfrom pyg_qm7_processing import process_qm7_data\nfrom qm7_curation import curate_qm7_data\n\n# Define paths\nbase_dir = Path(\"data/qm7\")\nraw_dir = base_dir / \"raw_data\"\nprocessed_dir = base_dir / \"processed\"\n\n# Step 1: Process raw data into chunks\nprocess_qm7_data(\n    sdf_file=raw_dir / \"gdb7.sdf\",\n    energies_file=raw_dir / \"atomization_energies.csv\", \n    mat_file=raw_dir / \"qm7.mat\",\n    intermediate_chunk_output_dir=base_dir / \"chunks\",\n    chunk_size=1000\n)\n\n# Step 2: Curate and normalize features\ncurate_qm7_data(\n    chunk_dir=base_dir / \"chunks\",\n    output_path=processed_dir / \"qm7_processed.pt\",\n    feature_keys_for_norm=['x', 'edge_attr']\n)\n```\n\n## 📊 Dataset Features\n\n### Node Features (per atom)\n- **Atom Type**: One-hot encoded atomic species\n- **Atomic Number**: Raw atomic number values\n- **Chemical Properties**: Aromaticity, hybridization state (SP/SP2/SP3)\n- **Hydrogen Count**: Total number of bonded hydrogens\n- **Quantum Properties**: Atomic charges and Coulomb matrix diagonal elements\n\n### Edge Features (per bond)\n- **Bond Type**: One-hot encoded (Single, Double, Triple, Aromatic)\n- **Coulomb Interactions**: Off-diagonal Coulomb matrix elements\n\n### Graph Properties\n- **3D Coordinates**: Atomic positions from conformers\n- **Target Values**: Atomization energies (eV)\n- **Metadata**: 
Original dataset indices for traceability\n\n## 🏗️ Architecture\n\n```\nscripts/\n├── pyg_qm7_processing.py    # Core data processing and graph construction\n├── qm7_curation.py          # Feature normalization and final transforms\n├── exceptions.py            # Custom exception handling\n└── main_process.py          # Complete pipeline orchestration\n```\n\n### Processing Pipeline\n\n1. **Data Loading \u0026 Validation**\n   - Load molecular structures from SDF files\n   - Parse energy targets from CSV\n   - Extract quantum properties from MAT files\n   - Validate data consistency across sources\n\n2. **Graph Construction**\n   - Build molecular graphs using RDKit\n   - Extract comprehensive node and edge features\n   - Apply optional pre-filtering criteria\n   - Save intermediate results in memory-efficient chunks\n\n3. **Feature Curation**\n   - Calculate global feature statistics\n   - Apply standardization transforms\n   - Integrate custom preprocessing steps\n   - Consolidate final dataset\n\n## 🔧 Advanced Configuration\n\n### Custom Filtering\n\n```python\n# Define molecule filters\ndef filter_by_complexity(data, min_atoms=5, max_atoms=20):\n    # Keep molecules whose atom count lies within [min_atoms, max_atoms]\n    return min_atoms \u003c= data.num_nodes \u003c= max_atoms\n\ndef filter_by_carbon_content(data, min_carbons=1):\n    # data.z holds atomic numbers; carbon has Z = 6\n    carbon_count = (data.z == 6).sum().item()\n    return carbon_count \u003e= min_carbons\n\n# Combine filters: keep a molecule only if it passes both checks\ndef combined_filter(data):\n    return (\n        filter_by_complexity(data, min_atoms=5, max_atoms=20)\n        and filter_by_carbon_content(data, min_carbons=1)\n    )\n```\n\n### Custom Transforms\n\n```python\nfrom torch_geometric.transforms import Compose\nfrom qm7_curation import CustomEdgeFeatureCombiner\n\ntransforms = Compose([\n    CustomEdgeFeatureCombiner(param1='value1'),  # placeholder argument\n    # Add your custom transforms here\n])\n```\n\n## 📈 Performance \u0026 Scalability\n\n- **Memory Efficiency**: Chunked processing handles datasets of arbitrary size\n- **Processing Speed**: Optimized for large-scale molecular 
datasets\n- **GPU Compatibility**: Full CUDA support for accelerated computation\n- **Robust Error Handling**: Comprehensive exception management for production use\n\n## 🤝 Contributing\n\nWe welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.\n\n### Development Setup\n\n```bash\n# Install development dependencies\npip install -r requirements-dev.txt\n\n# Run tests\npython -m pytest tests/\n\n# Code formatting\nblack scripts/\nisort scripts/\n\n# Linting\npylint scripts/\n```\n\n## 📋 Requirements\n\n```\ntorch\u003e=1.9.0\ntorch_geometric\u003e=2.0.0\nrdkit-pypi\u003e=2022.3.5\nnumpy\u003e=1.21.0\nscipy\u003e=1.7.0\npandas\u003e=1.3.0\ntqdm\u003e=4.62.0\nscikit-learn\u003e=1.0.0\nPyYAML\u003e=6.0\n```\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- **QM7 Dataset Creators** - For providing this valuable quantum chemistry benchmark\n- **RDKit Team** - Essential cheminformatics toolkit for molecular manipulation\n- **PyTorch Geometric** - Powerful graph neural network library\n- **PyTorch Team** - Foundational deep learning framework\n- **Scientific Python Community** - NumPy, Pandas, and scikit-learn developers\n\n## 📚 Citation\n\nIf you use this code in your research, please cite:\n\n```bibtex\n@software{qm7_processing_2024,\n  title={QM7 Dataset Processing and Curation to PyTorch Geometric Molecular Graphs},\n  author={[Your Name]},\n  year={2024},\n  url={https://github.com/shahram-boshra/qm7_database_process}\n}\n```\n\n## 🐛 Issues \u0026 Support\n\n- **Bug Reports**: [Open an issue](https://github.com/shahram-boshra/qm7_database_process/issues)\n- **Feature Requests**: [Request a feature](https://github.com/shahram-boshra/qm7_database_process/issues)\n- **Questions**: [Start a discussion](https://github.com/shahram-boshra/qm7_database_process/discussions)\n\n---\n\n**⭐ If this project helped your research, please consider giving 
it a star!**\n\nMade with ❤️ for the molecular machine learning community\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshahram-boshra%2Fqm7_database_process","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshahram-boshra%2Fqm7_database_process","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshahram-boshra%2Fqm7_database_process/lists"}