https://github.com/imperial-genomics-facility/data-management-python
Python library for running data analysis pipelines for IGF team
- Host: GitHub
- URL: https://github.com/imperial-genomics-facility/data-management-python
- Owner: imperial-genomics-facility
- License: apache-2.0
- Created: 2017-03-24T11:28:45.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2025-04-08T02:06:20.000Z (6 months ago)
- Last Synced: 2025-04-19T10:28:23.186Z (6 months ago)
- Topics: illumina, mysql, ngs, pandas, python, sqlalchemy
- Language: Python
- Homepage: https://data-management-python.readthedocs.io
- Size: 3.19 MB
- Stars: 5
- Watchers: 5
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data Management Using Python Library
https://data-management-python.readthedocs.io
This repository contains the core Python library developed and maintained by the NIHR Imperial BRC Genomics Facility for managing raw and processed genomic datasets efficiently.
## Key Features
**1. Metadata Management**
* Utilizes an extended [ENA metadata model](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/metadata.html) for managing information about:
  * Projects
  * Samples
  * Sequencing runs
  * Analysis
  * File paths
  * Pipeline instances

**2. Genomic Sequencing Runs Processing**
* Tracks ongoing sequencing runs and initiates processing upon completion.
* Generates summary reports and sends email notifications to users.

**3. Analysis Pipelines**
* Includes wrappers for both community-developed and vendor-provided data pipelines.
* Automates:
  * Configuration generation
  * Input formatting
* Executes external pipelines on HPC using bash script wrappers.
* Manages post-processing, including:
  * Custom report generation
  * Analysis data validation

## Requirements

* Python v3.10

## Installation
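
Before starting, it can help to confirm that the active interpreter matches the Python v3.10 requirement. The snippet below is a minimal sketch and not part of the library: the helper name `python_matches` is hypothetical, and it assumes the requirement means the 3.10 series exactly rather than 3.10 or newer.

```python
import sys

# Required (major, minor) version, taken from the Requirements section.
REQUIRED = (3, 10)


def python_matches(required=REQUIRED, current=None) -> bool:
    """Return True when the interpreter's major.minor equals `required`.

    `python_matches` is a hypothetical helper for illustration only.
    `current` defaults to the running interpreter's version.
    """
    current = sys.version_info if current is None else current
    return tuple(current[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_matches() else "mismatch"
    print(f"Python {found}: {status}")
```
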
**1. Clone the Repository**
```bash
git clone https://github.com/imperial-genomics-facility/data-management-python.git
```

**2. Install Dependencies**

Install required Python libraries:

```bash
pip install -r requirements_2.10.4.txt # For compatibility with Apache Airflow v2.10.4
```

**3. Update PYTHONPATH**

Add the core library path to `PYTHONPATH`:
```bash
export PYTHONPATH=/PATH/data-management-python
```

## Update Airflow Version
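
The steps in this section install packages against the constraints file that the Apache Airflow project publishes for each release and Python version. As a rough illustration of how `CONSTRAINT_URL` in step 1 is assembled, the sketch below builds the same URL in Python. The function name `constraint_url` and the example versions are assumptions for illustration only. Constraints files are named after the major.minor Python version, so the sketch drops any patch component.

```python
def constraint_url(airflow_version: str, python_version: str) -> str:
    """Build the Airflow constraints URL for an Airflow and Python version.

    `constraint_url` is a hypothetical helper that mirrors the shell
    template used in step 1; it is not part of this library.
    """
    # Keep only major.minor, e.g. "3.10.12" -> "3.10".
    major_minor = ".".join(python_version.split(".")[:2])
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{major_minor}.txt"
    )


if __name__ == "__main__":
    # Illustrative values: Airflow 2.10.4 (named in the install step) on Python 3.10.
    print(constraint_url("2.10.4", "3.10"))
```

Computing the URL once and reusing it for every `pip install` call keeps the core Airflow packages and the additional libraries pinned to the same set of versions.
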
**1. Set Environment Variables**
```bash
export AIRFLOW_VERSION=VERSION   # target Airflow release, e.g. 2.10.4
export PYTHON_VERSION=VERSION    # major.minor only, e.g. 3.10
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
```

**2. Install Core Airflow Libraries**
```bash
pip install "apache-airflow[celery,postgres,redis,graphviz,pandas,apache-spark,airbyte,amazon,slack,singularity,ssh,sftp,smtp]==${AIRFLOW_VERSION}" --constraint ${CONSTRAINT_URL}
```

**3. Install Additional Libraries**
```bash
pip install asana gviz-api html5lib matplotlib PyMySQL pytest pytest-cov tox slackclient --constraint ${CONSTRAINT_URL}
```

**4. Save the Installed Python Library Versions to a Requirements File**
```bash
pip freeze > requirements_vVERSION.txt
```

## License
This project is licensed under the **Apache-2.0 License**. See the [LICENSE](LICENSE) file for details.