{"id":28712429,"url":"https://github.com/polusai/pubmed-types","last_synced_at":"2025-07-22T11:33:54.047Z","repository":{"id":175079326,"uuid":"653312247","full_name":"PolusAI/pubmed-types","owner":"PolusAI","description":null,"archived":false,"fork":false,"pushed_at":"2023-06-13T20:29:59.000Z","size":503,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-07-13T11:49:47.755Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PolusAI.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-06-13T20:16:10.000Z","updated_at":"2023-06-13T20:30:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"7009b342-6d0e-4110-966b-0b67059992b9","html_url":"https://github.com/PolusAI/pubmed-types","commit_stats":null,"previous_names":["polusai/pubmed-types"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PolusAI/pubmed-types","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolusAI%2Fpubmed-types","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolusAI%2Fpubmed-types/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolusAI%2Fpubmed-types/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolusAI%2Fpubmed-types/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PolusAI","download_url":"https:/
/codeload.github.com/PolusAI/pubmed-types/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PolusAI%2Fpubmed-types/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266483965,"owners_count":23936452,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-14T23:05:49.574Z","updated_at":"2025-07-22T11:33:54.041Z","avatar_url":"https://github.com/PolusAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pubmed-types (v0.2.0)\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/dm/pubmed-types?style=flat-square\" /\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/l/pubmed-types?style=flat-square\"/\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/pubmed-types?style=flat-square\"/\u003e\n    \u003ca href=\"https://github.com/tefra/xsdata-pydantic\"\u003e\n        \u003cimg alt=\"Built with: xsdata-pydantic\" src=\"https://img.shields.io/badge/Built%20with-xsdata--pydantic-blue\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/dbrgn/coverage-badge\"\u003e\n        \u003cimg src=\"./images/coverage.svg\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n## Introduction\n\nA complete implementation of the XML schema for PMC Open Access articles 
and PubMed\narticle sets (citations).\n\nThis package parses PubMed XML data into Pydantic models. Parsing validates the\ninput XML data and provides type hints for working with the complex XML structures\npresent in PubMed data.\n\n## Most Recent Changes\n\n* **Breaking Change:** `parse_pubmed_xml` is replaced by `pmc_article` and `pubmed_article_set`.\n* More test coverage\n* PubMed articles containing MathML can now be parsed\n* Restructured code to separate out `jats` (PMC open access articles) and `pubmed` (PubMed article set)\n* One unit test with 99% coverage\n* Added [CHANGELOG.md](CHANGELOG.md)\n\n## Why do I need this?\n\nPubMed keeps track of tens of millions of citations, and a complex XML structure is\nused to store them. Parsing XML on its own is challenging enough. Add to it the\nfeature-rich data inside each citation, and you can spend hours or days\nnavigating the XML structure.\n\nThe approach here was to autogenerate Pydantic classes to parse the XML using the\n`xsdata-pydantic` tool. This approach has the benefit of making sure every piece of data\nis parsed properly, and an error is raised if something is missing or incorrect. 
Instead\nof using dictionaries to hold the data, Pydantic classes have the benefit of providing\ntype hints with tab completion for IDEs, making it easier to navigate the complex\nstructure of the citation data.\n\n## How do I use it?\n\nIt is possible to use `xsdata-pydantic` and the autogenerated classes directly to parse\nan XML file, but we provide convenience functions to easily open PubMed XML citations\nand PMC open access articles.\n\n### Example 1: A PMC Open Access Article\n\n```python\nimport tarfile\nimport urllib.request as request\nfrom contextlib import closing\nfrom pathlib import Path\n\nfrom pubmed_types import pmc_article\n\n# Input file source and output file destination\nsource = (\n    \"ftp://ftp.ncbi.nlm.nih.gov\"\n    + \"/pub/pmc/oa_bulk/oa_comm/xml\"\n    + \"/oa_comm_xml.incr.2023-03-21.tar.gz\"\n)\ndestination = Path(\"downloads\")\ndestination.mkdir(exist_ok=True)\n\n# 1. Get an open access article dataset from the FTP server\nwith closing(request.urlopen(source)) as url:\n    with tarfile.open(fileobj=url, mode=\"r:gz\") as fr:\n        fr.extractall(destination)\n\n# 2. Parse the file\nfile_path = destination.joinpath(\"PMC009xxxxxx\").joinpath(\"PMC9970662.xml\")\nfull_text = pmc_article(file_path)\n\n# 3. Print out the article title\nprint(f\"Title: {full_text.front.article_meta.title_group.article_title.content[0]}\")\n```\n\nOutput:\n\n```bash\nTitle: Lactate as a myokine and exerkine: drivers and signals of physiology and metabolism\n```\n\n### Example 2: A PubMed daily update citation file\n\n```python\nimport gzip\nimport urllib.request as request\nfrom contextlib import closing\nfrom pathlib import Path\n\nfrom pubmed_types import pubmed_article_set\n\n# Input file source and output file destination\nsource = \"ftp://ftp.ncbi.nlm.nih.gov\" + \"/pubmed/updatefiles\" + \"/pubmed23n1168.xml.gz\"\ndestination = Path(\"downloads\").joinpath(\"pubmed23n1168.xml\")\ndestination.parent.mkdir(exist_ok=True)\n\n# 1. 
Get a PubMed citation daily update file from the FTP server\nwith closing(request.urlopen(source)) as url:\n    with gzip.GzipFile(fileobj=url, mode=\"rb\") as fr:\n        with open(destination, mode=\"wb\") as fw:\n            fw.write(fr.read())\n\n# 2. Parse the file\narticle_set = pubmed_article_set(destination)\n\n# 3. Get the number of citations in the file\nprint(f\"Number of citations: {len(article_set.pubmed_article)}\")\nprint(\n    f\"{article_set.pubmed_article[0].medline_citation.article.article_title.content[0]}\"\n)\n```\n\nOutput:\n\n```bash\nNumber of citations: 2543\nA Patent and Pattern Mother.\n```\n\n## FAQ\n\n### Why does it take so long to parse a PubMed citation set?\n\nThere is a lot of data, and the schema is deep and complex.\n\n### Why are the return structures so complicated?\n\nThe return structures are a direct reflection of the XML format defined by the NLM. In\nthe future some utility classes might be made for common components (title, authors,\netc.), but for now this is intended to be an unbiased way of parsing the XML.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolusai%2Fpubmed-types","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpolusai%2Fpubmed-types","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolusai%2Fpubmed-types/lists"}