{"id":43500944,"url":"https://github.com/aimat-lab/ml4pxrds","last_synced_at":"2026-02-03T11:23:24.435Z","repository":{"id":168604540,"uuid":"573069939","full_name":"aimat-lab/ML4pXRDs","owner":"aimat-lab","description":"Contains code to train neural networks based on simulated powder XRDs from synthetic crystals.","archived":false,"fork":false,"pushed_at":"2023-07-14T08:17:06.000Z","size":62636,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-05T13:05:11.212Z","etag":null,"topics":["automated-analysis","diffractograms","high-throughput","machine-learning","powder","xrd"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aimat-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-01T16:24:29.000Z","updated_at":"2025-06-12T12:24:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"584eeb2e-25ae-4c36-81e3-4cc9687d8c52","html_url":"https://github.com/aimat-lab/ML4pXRDs","commit_stats":null,"previous_names":["aimat-lab/ml4pxrds"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/aimat-lab/ML4pXRDs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimat-lab%2FML4pXRDs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimat-lab%2FML4pXRDs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimat-lab%2FML4pXRDs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/aimat-lab%2FML4pXRDs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aimat-lab","download_url":"https://codeload.github.com/aimat-lab/ML4pXRDs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimat-lab%2FML4pXRDs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29044110,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-03T10:09:22.136Z","status":"ssl_error","status_checked_at":"2026-02-03T10:09:16.814Z","response_time":96,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated-analysis","diffractograms","high-throughput","machine-learning","powder","xrd"],"created_at":"2026-02-03T11:23:23.851Z","updated_at":"2026-02-03T11:23:24.426Z","avatar_url":"https://github.com/aimat-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML for pXRDs using synthetic crystals\nThis repository contains the code of the publication [\"Neural networks trained on\nsynthetically generated crystals can extract structural information from ICSD\npowder X-ray diffractograms\"](https://arxiv.org/abs/2303.11699). 
It can be used to train machine learning models\n(e.g., for the classification of space groups) on powder XRD diffractograms\nsimulated on-the-fly from synthetically generated random crystal structures.\n\nYou can find details about this project in our [`paper`](https://arxiv.org/abs/2303.11699). If you want to cite our work, you can use the provided bibtex file [CITATION.bib](CITATION.bib).\n\nIf you have any problems using the provided software, if documentation is\nmissing, or if you find any bugs, feel free to add a new issue on GitHub.\n\nThe repository contains the following components:\n\n1. Optimized simulation\n\n    The code of the optimized simulation of powder XRDs (using numba LLVM\n    just-in-time compilation) can be found in `./ml4pxrd_tools/simulation/`. This code\n    is based on the implementation found in the\n    [`pymatgen`](https://github.com/materialsproject/pymatgen) library.\n\n2. Generation of synthetic crystals\n\n    The code of the generation of synthetic crystals can be found in\n    `./ml4pxrd_tools/generation/`.\n\n3. Distributed training\n\n    The code of the distributed training architecture uses `tensorflow` with\n    the distributed computing framework `ray`. The relevant script files can be\n    found in `./training/`.\n\n# Documentation\n## Getting started\nFor convenience, the code for the optimized simulation of pXRDs and generation\nof synthetic crystals is provided as a package called `ml4pxrd_tools`. Before\ntraining, this should be installed, ideally in a separate virtual environment or\nanaconda environment. We tested the package with python 3.8.0 on Ubuntu, but it\nshould also work for other python versions and operating systems.\n\nTo install the package, call pip in the root of the repository:\n\n```\npip install -e .\n```\n\nThis will further install all required dependencies. 
\n\nTo further run the training script and some of the analysis scripts in\n`./training/analysis`, the following additional dependencies can be installed\nusing pip:\n\n- `ray`\n- `psutil`\n- `ase`\n- `tensorflow`\n- `tensorflow-addons`\n\nWe tested and recommend TensorFlow version 2.10.0. Also, make sure that the\n`CUDA` and `cuDNN` dependencies of `tensorflow` are installed and that the\nversions are compatible (we refer to the table available at\nhttps://www.tensorflow.org/install/source#tested_build_configurations). For\nTensorFlow 2.10.0, you can simply install the required `CUDA` and `cuDNN`\ndependencies using conda:\n\n```\nconda install -c conda-forge cudatoolkit==11.2.0\nconda install -c conda-forge cudnn==8.1.0.77\n```\n\n## Loading statistics of the ICSD\nIn order to be able to generate synthetic crystals, some general statistics\n(e.g., about the occupation of the Wyckoff positions for each space group) need\nto be extracted from the ICSD. If you only want to generate synthetic crystals\n(and simulate pXRDs based on them) without running your own training\nexperiments, you can use the statistical data provided by us in\n`./public_statistics`. We refer to section `Training` of this README if you want\nto create your own dataset and extract your own statistics from the ICSD.\n\nThe required data can be loaded using the function\n`ml4pxrd_tools.manage_dataset.load_dataset_info` with parameter\n`load_public_statistics_only=True`. The returned objects can then be passed to\nthe respective functions to generate synthetic crystals and simulate pXRDs (see\nbelow). 
\n\n```python\nfrom ml4pxrd_tools.manage_dataset import load_dataset_info\n\n(\n    probability_per_spg_per_element,\n    probability_per_spg_per_element_per_wyckoff,\n    NO_unique_elements_prob_per_spg,\n    NO_repetitions_prob_per_spg_per_element,\n    denseness_factors_density_per_spg,\n    denseness_factors_conditional_sampler_seeds_per_spg,\n    lattice_paras_density_per_lattice_type,\n    per_element,\n    represented_spgs,\n    probability_per_spg,\n) = load_dataset_info(load_public_statistics_only=True)\n```\n\n## Generating synthetic crystals\n\nAfter loading the statistics, you can use the statistics to generate synthetic\nstructures of a given space group (here for space group 125):\n\n```python\nfrom ml4pxrd_tools.generation.structure_generation import generate_structures\n\nstructures = generate_structures(\n    125,\n    N=1,\n    probability_per_spg_per_element=probability_per_spg_per_element,\n    probability_per_spg_per_element_per_wyckoff=probability_per_spg_per_element_per_wyckoff,\n    NO_unique_elements_prob_per_spg=NO_unique_elements_prob_per_spg,\n    NO_repetitions_prob_per_spg_per_element=NO_repetitions_prob_per_spg_per_element,\n    denseness_factors_conditional_sampler_seeds_per_spg=denseness_factors_conditional_sampler_seeds_per_spg,\n    lattice_paras_density_per_lattice_type=lattice_paras_density_per_lattice_type,\n)\n```\n\n## Simulating pXRDs\nThis repository provides various functions to simulate powder XRD diffractograms:\n\n- Use function `ml4pxrd_tools.simulation.simulation_core.get_pattern_optimized`\nfor fast simulation of the angles and intensities of all peaks in a given\n$2\\theta$ range. 
This uses an optimized version of the pymatgen implementation.\n- Use function `ml4pxrd_tools.simulation.simulation_smeared.get_smeared_patterns`\nto simulate one or more smeared patterns (peaks convolved with a Gaussian peak profile)\nfor a given structure object.\n- Use function `ml4pxrd_tools.simulation.simulation_smeared.get_synthetic_smeared_patterns`\nto generate synthetic crystals and simulate pXRDs based on them.\n\nHere is an example of how to call `get_synthetic_smeared_patterns` using the\nstatistics loaded using `load_dataset_info` (see above):\n\n```python\nfrom ml4pxrd_tools.simulation.simulation_smeared import get_synthetic_smeared_patterns\n\npatterns, labels = get_synthetic_smeared_patterns(\n    [125],\n    N_structures_per_spg=5,\n    wavelength=1.5406,\n    two_theta_range=(5, 90),\n    N=8501,\n    NO_corn_sizes=1,\n    probability_per_spg_per_element=probability_per_spg_per_element,\n    probability_per_spg_per_element_per_wyckoff=probability_per_spg_per_element_per_wyckoff,\n    NO_unique_elements_prob_per_spg=NO_unique_elements_prob_per_spg,\n    NO_repetitions_prob_per_spg_per_element=NO_repetitions_prob_per_spg_per_element,\n    denseness_factors_conditional_sampler_seeds_per_spg=denseness_factors_conditional_sampler_seeds_per_spg,\n    lattice_paras_density_per_lattice_type=lattice_paras_density_per_lattice_type,\n)\n```\n\nThe functions `get_smeared_patterns` and `get_synthetic_smeared_patterns`\ncalculate the FWHM of the Gaussian peak profiles using the Scherrer equation\nwith a random crystallite size uniformly sampled in the range\n`pymatgen_crystallite_size_gauss_min=20` to\n`pymatgen_crystallite_size_gauss_max=100` (in nm). 
You can change the default\nrange at the top of script file\n`./ml4pxrd_tools/simulation/simulation_smeared.py`.\n\n## Training\nYou can find the weights of our largest model (ResNet-101) trained using\nsynthetic crystals and the weights of the ResNet-50 trained with \nexperimental imperfections in our [latest release](https://github.com/aimat-lab/ML4pXRDs/releases/tag/v1.0).\n\n### Pre-simulate patterns for testing\nIf you want to run your own ML experiments, you need to generate your own\ndataset from the ICSD that contains the required simulated diffractograms\nand crystals. This is needed to test the accuracy of the ML models.\n\nIn order to generate a dataset, a license for the ICSD database is needed. If\nyou have the license and downloaded the database, you need to first simulate\npowder diffractograms based on the ICSD crystals. This can be accomplished by running\nthe script `./ml4pxrd_tools/simulation/icsd_simulator.py`. Before running this\nscript, make sure that you change the variables at the top of this script file,\nof the file `simulation_worker.py`, and of `simulation_smeared.py`.\n\nInstead of running the script directly, you can also use the provided slurm\nscript `submit_icsd_simulation_slurm.slr` to run it on a cluster. Make sure to\nadapt it to your cluster first and potentially change the path to your `.bashrc`\nfile and the name of your anaconda environment.\n\nAs a point of reference, it takes ~14 hours to simulate the full ICSD on 8 cores.\n\n### Extract statistics and generate dataset split\nTo generate a new dataset with prototype-based split using the just simulated\npatterns, you can use the script `./ml4pxrd_tools/manage_dataset.py`. Please\nfirst change the variables at the top of this script file. Then, you can\ngenerate the dataset and extract the statistics: \n\n```bash\npython manage_dataset.py\n```\n\nThis will take a while (~5 hours). 
Finally, you can find the prepared dataset\nincluding the statistics in the directory `./prepared_dataset`.\n\n### Run experiments\nAt the top of the training script (`./training/train_random_classifier.py`), you\ncan find some variables / options of the training experiment including detailed\nexplanations. While you should look through all options, the following options\nalways need to be changed:\n\n- `path_to_patterns`\n- `path_to_icsd_directory_local` or `path_to_icsd_directory_cluster`\n\nFurthermore, you might want to change the model used (see line `model =\nbuild_model_XX(...)`). You can find the models implemented by us in\nthe file `./training/models.py`.\n\nYou can call the training script like this:\n\n```bash\npython train_classifier.py \u003cUnique name / ID of experiment\u003e head-only \u003cnumber of ray workers\u003e\n```\n\nInstead of calling the script directly, you can also use the slurm script files\ncontained in `./training/submit_scripts_slurm/` to perform the training runs. You\ncan use `submit_head_only.sh` to run an experiment on a single node containing\none or more GPUs.\n\nHowever, to obtain reasonable training times, we recommend using additional\ncompute nodes to generate synthetic crystals and simulate their powder diffractograms. The number of cores needed to avoid\nthrottling the training process depends on the model size (bigger models train slower and need fewer\ncompute cores). You can use the script `submit.sh` (execute with `bash`, not\n`sbatch`) to automatically spawn three slurm jobs on different compute nodes:\none head job and two compute worker jobs. The three jobs will wait until all\njobs are started and then initiate the training experiment. 
If your cluster\nsupports heterogeneous jobs, feel free to adapt the scripts accordingly.\n\nMake sure to adapt all submit scripts to the exact specifications of your\ncluster and change the name of the anaconda environment and potentially the path\nto your `.bashrc` file in all submit scripts.\n\nEach training experiment will put its data (TensorBoard data, logs, checkpoint files)\nin a separate run directory. The current run directory is printed when the\ntraining script starts.\n\nThe easiest way to track the progress and results of your training runs is to use\n`TensorBoard`. Simply navigate to the run directory in your terminal and execute\n`tensorboard --logdir .`.\n\nThere are several metrics that are logged to TensorBoard during a run:\n- `accuracy/loss all`: Performance on ICSD test dataset\n- `accuracy/loss match`: Performance on ICSD test dataset, only using structures that match\nthe simulation parameters (volume \u003c 7000 cubic angstroms, fewer than 100 atoms in the asymmetric unit)\n- `accuracy/loss random`: Performance on pXRDs from synthetically generated crystals\n(same distribution as training data)\n- `accuracy/loss match_correct_spgs`: Performance on ICSD test dataset, only using structures\nthat match the simulation parameters. Furthermore, the space group labels obtained using \n`spglib` are used instead of those provided by the ICSD.\n- `accuracy/loss match_correct_spgs_pure`: Performance on ICSD test dataset, only using structures\nthat match the simulation parameters. Furthermore, the space group labels obtained using \n`spglib` are used instead of those provided by the ICSD. 
Also, only structures without partial\noccupancies are used.\n- `accuracy gap`: `accuracy random - accuracy match`\n\nIn addition to these metrics, after each epoch, the current learning rate and the current\nsize of the `ray` queue object (indicating whether enough workers are used) are logged.\n\n## Inference\n\nYou can either use one of the models provided in our [latest release](https://github.com/aimat-lab/ML4pXRDs/releases/tag/v1.0)\nor your own trained models to run inference on new diffractograms.\n\n```python\nfrom tensorflow import keras\n\nmodel = keras.models.load_model(\"path/to/your/model\")\n\npredictions = model.predict(your_diffractograms, batch_size=145)\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimat-lab%2Fml4pxrds","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimat-lab%2Fml4pxrds","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimat-lab%2Fml4pxrds/lists"}