{"id":38976985,"url":"https://github.com/blackrock/xml_to_parquet","last_synced_at":"2026-01-17T16:47:48.574Z","repository":{"id":46296730,"uuid":"341669658","full_name":"blackrock/xml_to_parquet","owner":"blackrock","description":"Convert one or more XML files into Apache Parquet format. Only requires a XSD and XML file to get started.","archived":false,"fork":false,"pushed_at":"2023-01-20T14:20:57.000Z","size":22,"stargazers_count":29,"open_issues_count":4,"forks_count":17,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-03-26T15:46:25.204Z","etag":null,"topics":["parquet","python","xml","xsd"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blackrock.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-23T19:41:30.000Z","updated_at":"2024-03-21T23:17:25.000Z","dependencies_parsed_at":"2023-02-12T02:46:26.199Z","dependency_job_id":null,"html_url":"https://github.com/blackrock/xml_to_parquet","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/blackrock/xml_to_parquet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackrock%2Fxml_to_parquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackrock%2Fxml_to_parquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackrock%2Fxml_to_parquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackrock%2Fxml_to_parquet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blackroc
k","download_url":"https://codeload.github.com/blackrock/xml_to_parquet/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blackrock%2Fxml_to_parquet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28511870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T13:38:16.342Z","status":"ssl_error","status_checked_at":"2026-01-17T13:37:44.060Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parquet","python","xml","xsd"],"created_at":"2026-01-17T16:47:48.011Z","updated_at":"2026-01-17T16:47:48.543Z","avatar_url":"https://github.com/blackrock.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **XML To Parquet Converter**\n\nThis repository contains code for the XML to Parquet Converter.\nThis converter is written in Python and will convert one or more XML files into Parquet files.\n\n# Key Features\n\nConverts XML to valid Parquet\n\nRequires only two files to get started: your XML file and the XSD schema file for that XML file.\n\nMultiprocessing enabled to parse XML files concurrently if the XML files are in the same format. Call with the -m # option.\n\nUses Python's iterparse event-based methods, which enable parsing very large files with low memory requirements. 
This is very similar to Java's SAX parser\n\nFiles are processed in order with the largest files first to optimize overall parsing time\n\nOption to write results to either Linux or HDFS folders\n\n# Additional Notes\n\nXML files with xs:union data types are not currently supported. A parquet column can only support a single data type.\n\nFor larger XML files the block_size parameter is required to allocate enough memory to capture your XML data.\n\n# How to run?\n```python\npython xml_to_parquet.py\n```\n\n# Parameters\n```python\nusage: xml_to_parquet.py [-h] -x XSD_FILE [-t TARGET_PATH]\n                         [-p XPATHS] [-e EXCLUDEPATHS] [-m MULTI] [-l LOG]\n                         [-v VERBOSE] [-d] [-b BLOCK_SIZE]\n                         ...\n\nXML To Parquet Parser\n\npositional arguments:\n  xml_files             xml files to convert\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -x XSD_FILE, --xsd_file XSD_FILE\n                        xsd file name\n  -t TARGET_PATH, --target_path TARGET_PATH\n                        target path. hdfs targets require hadoop client\n                        installation. Examples: /proj/test, hdfs:///proj/test,\n                        hdfs://halfarm/proj/test\n  -p XPATHS, --xpaths XPATHS\n                        xpaths to parse out. Pass in as a comma separated string.\n                        /path/include1,/path/include2\n  -e EXCLUDEPATHS, --excludepaths EXCLUDEPATHS\n                        elements to exclude. Pass in as a comma separated string.\n                        /path/exclude1,/path/exclude2\n  -m MULTI, --multi MULTI\n                        number of parsers. Default is 1.\n  -l LOG, --log LOG     log file\n  -v VERBOSE, --verbose VERBOSE\n                        verbose output level. 
INFO, DEBUG, etc.\n  -d, --delete          delete source file when completed\n  -b, --block_size      allocate additional memory for large xml files in bytes\n  -f, --file_info       capture file information metadata in parquet file\n\n\n```\n\nSample XML\n```xml\n\u003c?xml version=\"1.0\"?\u003e\n\u003cpurchaseOrder orderDate=\"1999-10-20\"\u003e\n    \u003cshipTo country=\"US\"\u003e\n        \u003cname\u003eAlice Smith\u003c/name\u003e\n        \u003cstreet\u003e123 Maple Street\u003c/street\u003e\n        \u003ccity\u003eMill Valley\u003c/city\u003e\n        \u003cstate\u003eCA\u003c/state\u003e\n        \u003czip\u003e90952\u003c/zip\u003e\n    \u003c/shipTo\u003e\n    \u003cbillTo country=\"US\"\u003e\n        \u003cname\u003eRobert Smith\u003c/name\u003e\n        \u003cstreet\u003e8 Oak Avenue\u003c/street\u003e\n        \u003ccity\u003eOld Town\u003c/city\u003e\n        \u003cstate\u003ePA\u003c/state\u003e\n        \u003czip\u003e95819\u003c/zip\u003e\n    \u003c/billTo\u003e\n    \u003ccomment\u003eHurry, my lawn is going wild!\u003c/comment\u003e\n    \u003citems\u003e\n        \u003citem partNum=\"872-AA\"\u003e\n            \u003cproductName\u003eLawnmower\u003c/productName\u003e\n            \u003cquantity\u003e1\u003c/quantity\u003e\n            \u003cUSPrice\u003e148.95\u003c/USPrice\u003e\n            \u003ccomment\u003eConfirm this is electric\u003c/comment\u003e\n        \u003c/item\u003e\n        \u003citem partNum=\"926-AA\"\u003e\n            \u003cproductName\u003eBaby Monitor\u003c/productName\u003e\n            \u003cquantity\u003e1\u003c/quantity\u003e\n            \u003cUSPrice\u003e39.98\u003c/USPrice\u003e\n            \u003cshipDate\u003e1999-05-21\u003c/shipDate\u003e\n        \u003c/item\u003e\n    \u003c/items\u003e\n\u003c/purchaseOrder\u003e\n```\n\n# Convert a small XML file to a Parquet file\n```python\npython xml_to_parquet.py -x PurchaseOrder.xsd PurchaseOrder.xml\n\nINFO - 2021-01-21 12:32:38 - Parsing XML Files..\nINFO - 
2021-01-21 12:32:38 - Processing 1 files\nDEBUG - 2021-01-21 12:32:38 - Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:32:38 - Parsing PurchaseOrder.xml\nDEBUG - 2021-01-21 12:32:38 - Saving to file PurchaseOrder.xml.parquet\nDEBUG - 2021-01-21 12:32:38 - Completed PurchaseOrder.xml\n```\n\nJSON equivalent output\n(zip code looks funny, but blame Microsoft, which declares zip as a decimal in the XSD file spec: \u003cxs:element name=\"zip\" type=\"xs:decimal\"/\u003e)\n```json\n{\"purchaseOrder\":{\"purchaseOrder@orderDate\":\"1999-10-20 00:00:00.000\",\"shipTo\":{\"shipTo@country\":\"US\",\"name\":\"Alice Smith\",\"street\":\"123 Maple Street\",\"city\":\"Mill Valley\",\"state\":\"CA\",\"zip\":90952.0},\"billTo\":{\"billTo@country\":\"US\",\"name\":\"Robert Smith\",\"street\":\"8 Oak Avenue\",\"city\":\"Old Town\",\"state\":\"PA\",\"zip\":95819.0},\"comment\":\"Hurry, my lawn is going wild!\",\"items\":{\"item\":[{\"item@partNum\":\"872-AA\",\"productName\":\"Lawnmower\",\"quantity\":1,\"USPrice\":148.95,\"comment\":\"Confirm this is electric\",\"shipDate\":null},{\"item@partNum\":\"926-AA\",\"productName\":\"Baby Monitor\",\"quantity\":1,\"USPrice\":39.98,\"comment\":null,\"shipDate\":\"1999-05-21 00:00:00.000\"}]}}}\n```\n\n# Convert an entire directory of XML files to Parquet\nParse 3 files concurrently and extract only the /purchaseOrder/items/item elements\n```python\ncp PurchaseOrder.xml 1.xml\ncp 1.xml 2.xml\ncp 1.xml 3.xml\ncp 1.xml 4.xml\n\npython xml_to_parquet.py -m 3 -p /purchaseOrder/items/item -x PurchaseOrder.xsd *.xml\n\nINFO - 2021-01-21 12:38:00 - Parsing XML Files..\nINFO - 2021-01-21 12:38:00 - Processing 5 files\nINFO - 2021-01-21 12:38:00 - Parsing files in the following order:\nINFO - 2021-01-21 12:38:00 - ['1.xml', '4.xml', 'PurchaseOrder.xml', '2.xml', '3.xml']\nDEBUG - 2021-01-21 12:38:00 - Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:38:00 - Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:38:00 - 
Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:38:00 - Parsing 4.xml\nDEBUG - 2021-01-21 12:38:00 - Parsing 1.xml\nDEBUG - 2021-01-21 12:38:00 - Parsing PurchaseOrder.xml\nDEBUG - 2021-01-21 12:38:00 - Saving to file 4.xml.parquet\nDEBUG - 2021-01-21 12:38:00 - Saving to file PurchaseOrder.xml.parquet\nDEBUG - 2021-01-21 12:38:00 - Saving to file 1.xml.parquet\nDEBUG - 2021-01-21 12:38:00 - Completed 4.xml\nDEBUG - 2021-01-21 12:38:00 - Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:38:00 - Completed PurchaseOrder.xml\nDEBUG - 2021-01-21 12:38:00 - Completed 1.xml\nDEBUG - 2021-01-21 12:38:00 - Generating schema from PurchaseOrder.xsd\nDEBUG - 2021-01-21 12:38:00 - Parsing 3.xml\nDEBUG - 2021-01-21 12:38:00 - Parsing 2.xml\nDEBUG - 2021-01-21 12:38:00 - Saving to file 2.xml.parquet\nDEBUG - 2021-01-21 12:38:00 - Saving to file 3.xml.parquet\nDEBUG - 2021-01-21 12:38:00 - Completed 2.xml\nDEBUG - 2021-01-21 12:38:00 - Completed 3.xml\n\n```\nJSON equivalent output for PurchaseOrder.parquet\n```json\nls -l *.parquet\n-rw-rw-r-- 1 user users 3714 Jan 21 12:39 1.parquet\n-rw-rw-r-- 1 user users 3714 Jan 21 12:39 2.parquet\n-rw-rw-r-- 1 user users 3714 Jan 21 12:39 3.parquet\n-rw-rw-r-- 1 user users 3714 Jan 21 12:39 4.parquet\n-rw-rw-r-- 1 user users 3714 Jan 21 12:39 PurchaseOrder.parquet\n\n{\"purchaseOrder\":{\"purchaseOrder@orderDate\":\"1999-10-20 00:00:00.000\",\"items\":{\"item\":[{\"item@partNum\":\"872-AA\",\"productName\":\"Lawnmower\",\"quantity\":1,\"USPrice\":148.95,\"comment\":\"Confirm this is electric\",\"shipDate\":null},{\"item@partNum\":\"926-AA\",\"productName\":\"Baby Monitor\",\"quantity\":1,\"USPrice\":39.98,\"comment\":null,\"shipDate\":\"1999-05-21 00:00:00.000\"}]}}}\n```\n\n# Exclude xpath elements\nThis removes xpaths from your result\n```python\npython xml_to_parquet.py -e /purchaseOrder/comment,/purchaseOrder/items -x PurchaseOrder.xsd PurchaseOrder.xml\n```\nJSON equivalent 
output\n```json\n{\"purchaseOrder\":{\"purchaseOrder@orderDate\":\"1999-10-20 00:00:00.000\",\"shipTo\":{\"shipTo@country\":\"US\",\"name\":\"Alice Smith\",\"street\":\"123 Maple Street\",\"city\":\"Mill Valley\",\"state\":\"CA\",\"zip\":90952.0},\"billTo\":{\"billTo@country\":\"US\",\"name\":\"Robert Smith\",\"street\":\"8 Oak Avenue\",\"city\":\"Old Town\",\"state\":\"PA\",\"zip\":95819.0}}}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackrock%2Fxml_to_parquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblackrock%2Fxml_to_parquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblackrock%2Fxml_to_parquet/lists"}