{"id":15060795,"url":"https://github.com/ostrokach/uniparc_xml_parser","last_synced_at":"2026-01-03T03:09:06.015Z","repository":{"id":57510351,"uuid":"337524843","full_name":"ostrokach/uniparc_xml_parser","owner":"ostrokach","description":"UniParc dataset describing ~300 million protein sequences converted into relational tables accessible through Google BigQuery (and as Parquet files).","archived":false,"fork":false,"pushed_at":"2021-10-29T21:55:36.000Z","size":32576,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-27T11:07:25.773Z","etag":null,"topics":["bigquery","bioinformatics","csv-files","parquet-files","protein-domains","protein-sequences"],"latest_commit_sha":null,"homepage":"https://gitlab.com/ostrokach/uniparc_xml_parser","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ostrokach.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-09T20:15:39.000Z","updated_at":"2023-08-09T07:16:11.000Z","dependencies_parsed_at":"2022-09-26T17:50:53.316Z","dependency_job_id":null,"html_url":"https://github.com/ostrokach/uniparc_xml_parser","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ostrokach%2Funiparc_xml_parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ostrokach%2Funiparc_xml_parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ostrokach%2Funiparc_xml_parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/ostrokach%2Funiparc_xml_parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ostrokach","download_url":"https://codeload.github.com/ostrokach/uniparc_xml_parser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243695589,"owners_count":20332629,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","bioinformatics","csv-files","parquet-files","protein-domains","protein-sequences"],"created_at":"2024-09-24T23:04:43.093Z","updated_at":"2026-01-03T03:09:05.963Z","avatar_url":"https://github.com/ostrokach.png","language":"Rust","readme":"# UniParc XML parser \u003c!-- omit in toc --\u003e\n\n[![gitlab](https://img.shields.io/badge/GitLab-main-orange?logo=gitlab)](https://gitlab.com/ostrokach/uniparc_xml_parser)\n[![docs](https://img.shields.io/badge/docs-v0.2.1-blue.svg?logo=gitbook)](https://ostrokach.gitlab.io/uniparc_xml_parser/v0.2.1/)\n[![crates.io](https://img.shields.io/crates/d/uniparc_xml_parser?logo=rust)](https://crates.io/crates/uniparc_xml_parser/)\n[![conda](https://img.shields.io/conda/dn/ostrokach-forge/uniparc_xml_parser?logo=conda-forge)](https://anaconda.org/ostrokach-forge/uniparc_xml_parser/)\n[![pipeline status](https://gitlab.com/ostrokach/uniparc_xml_parser/badges/v0.2.1/pipeline.svg)](https://gitlab.com/ostrokach/uniparc_xml_parser/commits/v0.2.1/)\n\n- [Introduction](#introduction)\n- [Usage](#usage)\n- [Table schema](#table-schema)\n- [Installation](#installation)\n  - [Binaries](#binaries)\n  - [Cargo](#cargo)\n  - [Conda](#conda)\n- 
[Output files](#output-files)\n  - [Parquet](#parquet)\n  - [Google BigQuery](#google-bigquery)\n- [Benchmarks](#benchmarks)\n- [Example SQL queries](#example-sql-queries)\n  - [Find and extract all Gene3D domain sequences](#find-and-extract-all-gene3d-domain-sequences)\n  - [Find and extract all _unique_ Gene3D domain sequences](#find-and-extract-all-unique-gene3d-domain-sequences)\n  - [Map Ensembl identifiers to UniProt](#map-ensembl-identifiers-to-uniprot)\n  - [Find crystal structures of all GPCRs](#find-crystal-structures-of-all-gpcrs)\n- [FAQ (Frequently Asked Questions)](#faq-frequently-asked-questions)\n- [Roadmap](#roadmap)\n\n## Introduction\n\n`uniparc_xml_parser` is a small utility which can process the UniParc XML file (`uniparc_all.xml.gz`), available from the UniProt [website](http://www.uniprot.org/downloads), into a set of CSV files that can be loaded into a relational database.\n\nWe also provide Parquet files, which can be queried using tools such as AWS Athena and Apache Presto, and have uploaded the generated data to Google BigQuery (see: [Output files](#output-files)).\n\n## Usage\n\nUncompressed XML data can be piped into `uniparc_xml_parser` in order to parse the data into a set of CSV files on the fly:\n\n```bash\n$ curl -sS ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/uniparc_all.xml.gz \\\n    | zcat \\\n    | uniparc_xml_parser\n```\n\nThe output is a set of CSV (or more specifically TSV) files:\n\n```bash\n$ ls\n-rw-r--r-- 1 user group 174G Feb  9 13:52 xref.tsv\n-rw-r--r-- 1 user group 149G Feb  9 13:52 domain.tsv\n-rw-r--r-- 1 user group 138G Feb  9 13:52 uniparc.tsv\n-rw-r--r-- 1 user group 107G Feb  9 13:52 protein_name.tsv\n-rw-r--r-- 1 user group  99G Feb  9 13:52 ncbi_taxonomy_id.tsv\n-rw-r--r-- 1 user group  74G Feb  9 20:13 uniparc.parquet\n-rw-r--r-- 1 user group  64G Feb  9 13:52 gene_name.tsv\n-rw-r--r-- 1 user group  39G Feb  9 13:52 component.tsv\n-rw-r--r-- 1 user group  32G Feb  9 13:52 
proteome_id.tsv\n-rw-r--r-- 1 user group  15G Feb  9 13:52 ncbi_gi.tsv\n-rw-r--r-- 1 user group  21M Feb  9 13:52 pdb_chain.tsv\n-rw-r--r-- 1 user group  12M Feb  9 13:52 uniprot_kb_accession.tsv\n-rw-r--r-- 1 user group 656K Feb  9 04:04 uniprot_kb_accession.parquet\n```\n\n## Table schema\n\nThe generated CSV files conform to the following schema:\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/images/uml-diagram.svg\" width=\"800px\" /\u003e\n\u003c/div\u003e\n\n## Installation\n\n### Binaries\n\nLinux binaries are available here: https://gitlab.com/ostrokach/uniparc_xml_parser/-/packages.\n\n### Cargo\n\nUse [`cargo`](https://crates.io/) to compile and install `uniparc_xml_parser` for your target platform:\n\n```bash\ncargo install uniparc_xml_parser\n```\n\n### Conda\n\nUse [`conda`](https://docs.conda.io/en/latest/miniconda.html) to install precompiled binaries:\n\n```bash\nconda install -c ostrokach-forge uniparc_xml_parser\n```\n\n## Output files\n\n### Parquet\n\nParquet files containing the processed data are available at the following URL and are updated monthly: \u003chttp://uniparc.data.proteinsolver.org/\u003e.\n\n### Google BigQuery\n\nThe latest data can also be queried directly using Google BigQuery: \u003chttps://console.cloud.google.com/bigquery?project=ostrokach-data\u0026p=ostrokach-data\u0026page=dataset\u0026d=uniparc\u003e.\n\n## Benchmarks\n\nParsing 10,000 XML entries takes around 30 seconds (the process is mostly IO-bound):\n\n```bash\n$ time bash -c \"zcat uniparc_top_10k.xml.gz | uniparc_xml_parser \u003e/dev/null\"\n\nreal    0m33.925s\nuser    0m36.800s\nsys     0m1.892s\n```\n\nThe actual `uniparc_all.xml.gz` file has around 373,914,570 elements.\n\n## Example SQL queries\n\n### Find and extract all Gene3D domain sequences\n\n```sql\nSELECT\n  uniparc_id,\n  database_id AS gene3d_id,\n  interpro_name,\n  interpro_id,\n  domain_start,\n  domain_end,\n  SUBSTR(sequence, domain_start, domain_end - domain_start + 1) AS 
domain_sequence\nFROM `ostrokach-data.uniparc.uniparc` u\nJOIN `ostrokach-data.uniparc.domain` d\nUSING (uniparc_id)\nWHERE d.database = 'Gene3D';\n```\n\nBigQuery: \u003chttps://console.cloud.google.com/bigquery?sq=930310419365:a29f957964174c6dbfba7caac1dfeee9\u003e.\n\n![query-result](docs/images/gene3d-domains-result.png)\n\n### Find and extract all _unique_ Gene3D domain sequences\n\n```sql\nSELECT\n  ARRAY_AGG(uniparc_id ORDER BY uniparc_id, domain_start, domain_end) uniparc_id,\n  ARRAY_AGG(gene3d_id ORDER BY uniparc_id, domain_start, domain_end) gene3d_id,\n  ARRAY_AGG(interpro_name ORDER BY uniparc_id, domain_start, domain_end) interpro_name,\n  ARRAY_AGG(interpro_id ORDER BY uniparc_id, domain_start, domain_end) interpro_id,\n  ARRAY_AGG(domain_start ORDER BY uniparc_id, domain_start, domain_end) domain_start,\n  ARRAY_AGG(domain_end ORDER BY uniparc_id, domain_start, domain_end) domain_end,\n  domain_sequence\nFROM (\n  SELECT\n    uniparc_id,\n    database_id AS gene3d_id,\n    interpro_name,\n    interpro_id,\n    domain_start,\n    domain_end,\n    SUBSTR(sequence, domain_start, domain_end - domain_start + 1) AS domain_sequence\n  FROM `ostrokach-data.uniparc.uniparc` u\n  JOIN `ostrokach-data.uniparc.domain` d\n  USING (uniparc_id)\n  WHERE d.database = 'Gene3D') t\nGROUP BY\n  domain_sequence;\n```\n\nBigQuery: \u003chttps://console.cloud.google.com/bigquery?sq=930310419365:f8fa36964fed48c8b187ccadcf070223\u003e.\n\n![query-result](docs/images/gene3d-unique-domains-result.png)\n\n### Map Ensembl identifiers to UniProt\n\nFind UniProt identifiers for Ensembl transcript identifiers corresponding to the same sequence:\n\n```sql\nSELECT\n  ensembl.db_id ensembl_id,\n  uniprot.db_id uniprot_id\nFROM (\n  SELECT uniparc_id, db_id\n  FROM `ostrokach-data.uniparc.xref`\n  WHERE db_type = 'Ensembl') ensembl\nJOIN (\n  SELECT uniparc_id, db_id\n  FROM `ostrokach-data.uniparc.xref`\n  WHERE db_type = 'UniProtKB/Swiss-Prot') uniprot\nUSING 
(uniparc_id);\n```\n\nBigQuery: \u003chttps://console.cloud.google.com/bigquery?sq=930310419365:488eace5d1524ba8bdc049935ba09251\u003e.\n\n![query-result](docs/images/ensembl-to-uniprot-result.png)\n\n### Find crystal structures of all GPCRs\n\n```sql\nSELECT\n  uniparc_id,\n  SUBSTR(p.value, 1, 4) pdb_id,\n  SUBSTR(p.value, 5) pdb_chain,\n  d.database_id pfam_id,\n  d.domain_start,\n  d.domain_end,\n  u.sequence pdb_chain_sequence\nFROM `ostrokach-data.uniparc.uniparc` u\nJOIN `ostrokach-data.uniparc.domain` d USING (uniparc_id)\nJOIN `ostrokach-data.uniparc.pdb_chain` p USING (uniparc_id)\nWHERE d.database = 'Pfam'\nAND d.database_id = 'PF00001';\n```\n\nBigQuery: \u003chttps://console.cloud.google.com/bigquery?sq=930310419365:8df5c3ad12144f418b0e1bc1285befc4\u003e.\n\n![query-result](docs/images/gpcr-structures-result.png)\n\n\n## FAQ (Frequently Asked Questions)\n\n**Why not split `uniparc_all.xml.gz` into multiple small files and process them in parallel?**\n\n- Splitting the file requires reading the entire file. If we're reading the entire file anyway, why not parse it as we read it?\n- Having a single process which parses `uniparc_all.xml.gz` makes it easier to create an incremental unique index column (e.g. `xref.xref_id`).\n\n## Roadmap\n\n- [ ] Add support for writing Apache Parquet files directly.\n- [ ] Add support for writing output to object stores (such as S3 and GCS).\n- [ ] Avoid decoding the input into strings (keep everything as bytes throughout).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fostrokach%2Funiparc_xml_parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fostrokach%2Funiparc_xml_parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fostrokach%2Funiparc_xml_parser/lists"}