{"id":28998554,"url":"https://github.com/luminati-io/parsing-xml-with-python","last_synced_at":"2025-06-25T07:09:14.406Z","repository":{"id":291979156,"uuid":"962435854","full_name":"luminati-io/parsing-xml-with-python","owner":"luminati-io","description":"Parse XML in Python using ElementTree, lxml, SAX, and more for efficient data processing and structured data integration.","archived":false,"fork":false,"pushed_at":"2025-05-07T13:34:39.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-07T14:44:26.081Z","etag":null,"topics":["elementtree","lxml","minidom","parsing","python","sax-parser","untangle","web-scraping","xml","xml-parser","xml-parsing"],"latest_commit_sha":null,"homepage":"https://brightdata.com/blog/how-tos/parsing-xml-in-python","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-08T06:46:11.000Z","updated_at":"2025-05-07T13:34:43.000Z","dependencies_parsed_at":"2025-05-07T14:45:27.459Z","dependency_job_id":"030cab41-de05-455d-a58b-bb309e49089b","html_url":"https://github.com/luminati-io/parsing-xml-with-python","commit_stats":null,"previous_names":["luminati-io/parsing-xml-with-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/parsing-xml-with-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fparsing-xml-with-python","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fparsing-xml-with-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fparsing-xml-with-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fparsing-xml-with-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/parsing-xml-with-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fparsing-xml-with-python/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261823771,"owners_count":23215149,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elementtree","lxml","minidom","parsing","python","sax-parser","untangle","web-scraping","xml","xml-parser","xml-parsing"],"created_at":"2025-06-25T07:09:11.742Z","updated_at":"2025-06-25T07:09:14.361Z","avatar_url":"https://github.com/luminati-io.png","language":null,"readme":"# Parsing XML with Python\n\n[![Bright Data Promo](https://github.com/luminati-io/LinkedIn-Scraper/raw/main/Proxies%20and%20scrapers%20GitHub%20bonus%20banner.png)](https://brightdata.com/)\n\nLearn how to parse XML in Python using libraries like ElementTree, lxml, and SAX to enhance your data processing projects.\n\n- [Key Concepts of an XML File](#key-concepts-of-an-xml-file)\n- [Various Ways to Parse XML in Python](#various-ways-to-parse-xml-in-python)\n- [ElementTree](#elementtree)\n- [lxml](#lxml)\n- 
[minidom](#minidom)\n- [SAX Parser](#sax-parser)\n\n## Key Concepts of an XML File\n\nBefore diving into how to parse XML in Python, it's important to first understand what XML Schema Definition (XSD) is and the fundamental elements that make up an XML file. This foundational knowledge will guide you in choosing the right Python library for your parsing needs.\n\n[XSD](https://en.wikipedia.org/wiki/XML_Schema_(W3C)) is a schema specification that defines the structure, content, and data types permitted in an XML document. It acts as a set of validation rules, ensuring that XML files follow a consistent format.\n\nAn XML file typically contains elements such as `Namespace`, `root`, `attributes`, `elements`, and `text content`, which together represent structured data.\n\n- **[`Namespace`](https://www.w3schools.com/xml/xml_namespaces.asp)** uniquely identifies elements and attributes in XML documents. It helps prevent naming collisions and supports interoperability between different XML datasets.\n- **[`root`](https://en.wikipedia.org/wiki/Root_element)** is the top-level element of an XML document. It serves as the entry point to the XML structure and encompasses all other elements.\n- **[`attributes`](https://www.w3schools.com/xml/xml_attributes.asp)** offer additional context about an element. Defined within an element's start tag, they consist of name-value pairs.\n- **[`elements`](https://www.w3schools.com/xml/xml_elements.asp)** are the core units of an XML file, representing the actual data or structural tags. Elements can nest within each other to build a hierarchy.\n- **[`text content`](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent)** refers to the actual textual data between an element’s start and end tags. 
This can include plain text, numeric values, or other characters.\n\nFor instance, the [Bright Data sitemap](https://brightdata.com/post-sitemap.xml) follows this XML structure:\n\n- **`urlset`** serves as the `root` element.\n- **`\u003curlset xsi:schemaLocation=\"http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd\"\u003e`** is the schema location declaration for the `urlset` element. It points to the XSD whose rules apply to `urlset` and all nested elements.\n- **`url`** is a direct child of the `root` element.\n- **`loc`** is a child element within the `url` element.\n\nNow that you’ve got a clearer picture of XSD and XML structure, it’s time to put that knowledge to use by parsing an XML file using a few helpful Python libraries.\n\n## Various Ways to Parse XML in Python\n\nLet's use the Bright Data sitemap. In the following examples, its content is fetched using the Python `requests` library.\n\nThe `requests` library is not built-in, so you need to install it before proceeding. You can do so using the following command:\n\n```sh\npip install requests\n```\n\n## ElementTree\n\nThe [ElementTree XML API](https://docs.python.org/3/library/xml.etree.elementtree.html) offers a straightforward and user-friendly way to parse and generate XML data in Python. 
Since it’s part of Python’s standard library, there’s no need for any additional installation.\n\nFor instance, you can use the [`findall()`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) method to retrieve all `url` elements from the root and print the text content of each `loc` element, like so:\n\n```python\nimport xml.etree.ElementTree as ET\nimport requests\n\nurl = 'https://brightdata.com/post-sitemap.xml'\n\nresponse = requests.get(url)\nif response.status_code == 200:\n   \n    root = ET.fromstring(response.content)\n\n    for url_element in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}url'):\n        loc_element = url_element.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')\n        if loc_element is not None:\n            print(loc_element.text)\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n\n```\n\nAll the URLs in the sitemap are printed in the output:\n\n```\nhttps://brightdata.com/case-studies/powerdrop-case-study\nhttps://brightdata.com/case-studies/addressing-brand-protection-from-every-angle\nhttps://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data\nhttps://brightdata.com/case-studies/the-seo-transformation\nhttps://brightdata.com/case-studies/data-driven-automated-e-commerce-tools\nhttps://brightdata.com/case-studies/highly-targeted-influencer-marketing\nhttps://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions\nhttps://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data\nhttps://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy\nhttps://brightdata.com/case-studies/data-intensive-analytical-solutions\nhttps://brightdata.com/case-studies/canopy-advantage-solutions\nhttps://brightdata.com/case-studies/seamless-digital-automations\n```\n\nElementTree is a simple and intuitive XML parser in Python, great for small scripts like 
reading RSS feeds. However, it lacks strong schema validation and may not be suitable for complex or large-scale XML parsing; libraries like `lxml` are better suited for those cases.\n\n## lxml\n\n[lxml](https://lxml.de/) is a fast, easy-to-use, and feature-rich API for parsing XML files in Python. You can [install `lxml`](https://lxml.de/installation.html#installation) using `pip`:\n\n```sh\npip install lxml\n```\n\nOnce installed, you can use `lxml` to parse XML files using [various API methods](https://lxml.de/apidoc/lxml.html), such as `find()`, `findall()`, `findtext()`, `get()`, and `get_element_by_id()`.\n\nFor instance, you can use the `findall()` method to iterate over the `url` elements, find their `loc` elements (which are child elements of the `url` element), and then print the location text using the following code:\n\n```python\nfrom lxml import etree\nimport requests\n\nurl = \"https://brightdata.com/post-sitemap.xml\"\n\nresponse = requests.get(url)\nif response.status_code == 200:\n    root = etree.fromstring(response.content)\n\n    # Name the loop variable url_element to avoid shadowing the url string\n    for url_element in root.findall(\".//{http://www.sitemaps.org/schemas/sitemap/0.9}url\"):\n        loc = url_element.find(\"{http://www.sitemaps.org/schemas/sitemap/0.9}loc\").text.strip()\n        print(loc)\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n```\n\nThe output displays all the URLs found in the 
sitemap:\n\n```\nhttps://brightdata.com/case-studies/powerdrop-case-study\nhttps://brightdata.com/case-studies/addressing-brand-protection-from-every-angle\nhttps://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data\nhttps://brightdata.com/case-studies/the-seo-transformation\nhttps://brightdata.com/case-studies/data-driven-automated-e-commerce-tools\nhttps://brightdata.com/case-studies/highly-targeted-influencer-marketing\nhttps://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions\nhttps://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data\nhttps://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy\nhttps://brightdata.com/case-studies/data-intensive-analytical-solutions\nhttps://brightdata.com/case-studies/canopy-advantage-solutions\nhttps://brightdata.com/case-studies/seamless-digital-automations\n```\n\nUp to this point, you’ve seen how to locate elements and display their values. Next, let’s look at how to validate an XML file against its schema before parsing. 
This step confirms that the file follows the structure defined in the XSD.\n\nHere’s what the sitemap’s XSD looks like:\n\n```xml\n\u003c?xml version=\"1.0\"?\u003e\n\u003cxs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"\n           targetNamespace=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n           xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n           elementFormDefault=\"qualified\"\n           xmlns:xhtml=\"http://www.w3.org/1999/xhtml\"\u003e\n\n  \u003cxs:element name=\"urlset\"\u003e\n    \u003cxs:complexType\u003e\n      \u003cxs:sequence\u003e\n        \u003cxs:element ref=\"url\" minOccurs=\"0\" maxOccurs=\"unbounded\"/\u003e\n      \u003c/xs:sequence\u003e\n    \u003c/xs:complexType\u003e\n  \u003c/xs:element\u003e\n\n  \u003cxs:element name=\"url\"\u003e\n    \u003cxs:complexType\u003e\n      \u003cxs:sequence\u003e\n        \u003cxs:element name=\"loc\" type=\"xs:anyURI\"/\u003e\n      \u003c/xs:sequence\u003e\n    \u003c/xs:complexType\u003e\n  \u003c/xs:element\u003e\n\n\u003c/xs:schema\u003e\n```\n\nTo use this schema for validation, copy it into a local file named `schema.xsd`.\n\nNow, validate the XML file using this XSD:\n\n```python\nfrom lxml import etree\nimport requests\n\nurl = \"https://brightdata.com/post-sitemap.xml\"\n\nresponse = requests.get(url)\nif response.status_code == 200:\n    root = etree.fromstring(response.content)\n\n    try:\n        print(\"Schema Validation:\")\n        schema_doc = etree.parse(\"schema.xsd\")\n        schema = etree.XMLSchema(schema_doc)\n        schema.assertValid(root)\n        print(\"XML is valid according to the schema.\")\n    except etree.DocumentInvalid as e:\n        print(\"XML validation error:\", e)\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n```\n\nIn this step, you parse the XSD file using the [`etree.parse()`](https://lxml.de/tutorial.html#the-parse-function) method, then build an XML Schema from the parsed content. 
Finally, you validate the XML root against that schema using `assertValid()`. If the XML passes validation, you'll see a message like `XML is valid according to the schema`; otherwise, a [`DocumentInvalid`](https://lxml.de/api/lxml.etree.DocumentInvalid-class.html) exception is raised.\n\nYour output should look like this:\n\n```\nSchema Validation:\nXML is valid according to the schema.\n```\n\nNow, let’s read the XML file with the `xpath()` method, which locates elements by their path:\n\n```python\nfrom lxml import etree\nimport requests\n\nurl = \"https://brightdata.com/post-sitemap.xml\"\nresponse = requests.get(url)\n\nif response.status_code == 200:\n    root = etree.fromstring(response.content)\n\n    print(\"XPath Support:\")\n\n    namespaces = {\"ns\": \"http://www.sitemaps.org/schemas/sitemap/0.9\"}\n    for loc in root.xpath(\".//ns:url/ns:loc\", namespaces=namespaces):\n        print(loc.text.strip())\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n```\n\nIn this snippet, you register the `ns` prefix and link it to the namespace URI `http://www.sitemaps.org/schemas/sitemap/0.9`. The XPath expression then uses this prefix to target namespaced elements. 
Specifically, `.//ns:url/ns:loc` selects all `loc` elements that are children of `url` elements within that namespace.\n\nThe output will look like this:\n\n```\nXPath Support:\nhttps://brightdata.com/case-studies/powerdrop-case-study\nhttps://brightdata.com/case-studies/addressing-brand-protection-from-every-angle\nhttps://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data\nhttps://brightdata.com/case-studies/the-seo-transformation\nhttps://brightdata.com/case-studies/data-driven-automated-e-commerce-tools\nhttps://brightdata.com/case-studies/highly-targeted-influencer-marketing\nhttps://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions\nhttps://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data\nhttps://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy\nhttps://brightdata.com/case-studies/data-intensive-analytical-solutions\nhttps://brightdata.com/case-studies/canopy-advantage-solutions\nhttps://brightdata.com/case-studies/seamless-digital-automations\n```\n\nThe `find()` and `findall()` methods are generally faster than `xpath()` for simple lookups: they accept only a restricted path syntax, which avoids the overhead of full XPath evaluation, and `xpath()` always collects its entire result set into a list in memory. Use `find()` or `findall()` unless you need more complex queries.\n\n`lxml` is a powerful library for parsing XML and HTML, supporting advanced features like XPath, schema validation, and XSLT. It's ideal for high-performance or complex tasks but requires separate installation.\n\nIf you're working with large or intricate XML data, such as financial feeds, `lxml` is a strong choice for efficient querying, validation, and transformation.\n\n## minidom\n\n[`minidom`](https://docs.python.org/3/library/xml.dom.minidom.html) is a simple and lightweight XML parsing library included in Python’s standard library. While not as feature-rich or efficient as `lxml`, it provides an easy way to parse and manipulate XML data.\n\nYou can use various DOM methods to access elements. 
For instance, the [`getElementsByTagName()` method](https://developer.mozilla.org/en-US/docs/Web/API/Document/getElementsByTagName) allows you to retrieve elements by their tag name.\n\nHere’s an example of using the `minidom` library to parse an XML file and fetch elements by their tag names:\n\n```python\nimport requests\nimport xml.dom.minidom\n\nurl = \"https://brightdata.com/post-sitemap.xml\"\n\nresponse = requests.get(url)\nif response.status_code == 200:\n    dom = xml.dom.minidom.parseString(response.content)\n    \n    urlset = dom.getElementsByTagName(\"urlset\")[0]\n    for url in urlset.getElementsByTagName(\"url\"):\n        loc = url.getElementsByTagName(\"loc\")[0].firstChild.nodeValue.strip()\n        print(loc)\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n```\n\nYour output would look like this:\n\n```\nhttps://brightdata.com/case-studies/powerdrop-case-study\nhttps://brightdata.com/case-studies/addressing-brand-protection-from-every-angle\nhttps://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data\nhttps://brightdata.com/case-studies/the-seo-transformation\nhttps://brightdata.com/case-studies/data-driven-automated-e-commerce-tools\nhttps://brightdata.com/case-studies/highly-targeted-influencer-marketing\nhttps://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions\nhttps://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data\nhttps://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy\nhttps://brightdata.com/case-studies/data-intensive-analytical-solutions\nhttps://brightdata.com/case-studies/canopy-advantage-solutions\nhttps://brightdata.com/case-studies/seamless-digital-automations\n```\n\n`minidom` represents XML data as a DOM tree, making it easy to navigate and manipulate. 
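As a quick illustration of this kind of DOM navigation, here is a minimal, self-contained sketch; the settings document, its element names, and the `unit` attribute are hypothetical examples, not a real configuration format:

```python
import xml.dom.minidom

# Hypothetical settings document; element and attribute names are
# illustrative assumptions, not a real configuration schema.
settings_xml = """<settings>
    <display>
        <font-size unit="pt">12</font-size>
        <theme>dark</theme>
    </display>
</settings>"""

dom = xml.dom.minidom.parseString(settings_xml)

# getElementsByTagName() returns all matching elements; take the first.
font_size_node = dom.getElementsByTagName("font-size")[0]

# The element's text content lives in its first child text node.
font_size = int(font_size_node.firstChild.nodeValue)

# Attribute values are read with getAttribute().
unit = font_size_node.getAttribute("unit")

print(f"font-size: {font_size}{unit}")  # font-size: 12pt
```

Note that indexing with `[0]` raises an `IndexError` when no matching element exists, so production code should check the returned list before indexing.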
It's ideal for basic tasks like reading, modifying, or creating simple XML structures.\n\nIf your program needs to read settings from an XML file, `minidom` allows easy access to specific elements, such as finding child nodes or attributes. For example, you can quickly retrieve a `font-size` node and use its value in your application.\n\n## SAX Parser\n\nThe [SAX parser](https://docs.python.org/3/library/xml.sax.html) is an event-driven XML parser that processes documents sequentially, emitting events like start tags, end tags, and text content. Unlike DOM parsers, SAX doesn’t load the entire document into memory, making it ideal for large XML files where memory efficiency is important.\n\nTo use SAX, you define event handlers for specific XML events, such as `startElement` and `endElement`, which you can customize to handle the document’s structure and content.\n\nHere’s an example of using the SAX parser to process an XML file, defining event handlers to extract URL information from a sitemap:\n\n```python\nimport requests\nimport xml.sax.handler\nfrom io import BytesIO\n\nclass MyContentHandler(xml.sax.handler.ContentHandler):\n    def __init__(self):\n        self.in_url = False\n        self.in_loc = False\n        self.url = \"\"\n\n    def startElement(self, name, attrs):\n        if name == \"url\":\n            self.in_url = True\n        elif name == \"loc\" and self.in_url:\n            self.in_loc = True\n\n    def characters(self, content):\n        if self.in_loc:\n            self.url += content\n\n    def endElement(self, name):\n        if name == \"url\":\n            print(self.url.strip())\n            self.url = \"\"\n            self.in_url = False\n        elif name == \"loc\":\n            self.in_loc = False\n\nurl = \"https://brightdata.com/post-sitemap.xml\"\n\nresponse = requests.get(url)\nif response.status_code == 200:\n\n    xml_content = BytesIO(response.content)\n    \n    content_handler = MyContentHandler()\n    parser = 
xml.sax.make_parser()\n    parser.setContentHandler(content_handler)\n    parser.parse(xml_content)\nelse:\n    print(\"Failed to retrieve XML file from the URL.\")\n```\n\nYour output would look like this:\n\n```\nhttps://brightdata.com/case-studies/powerdrop-case-study\nhttps://brightdata.com/case-studies/addressing-brand-protection-from-every-angle\nhttps://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data\nhttps://brightdata.com/case-studies/the-seo-transformation\nhttps://brightdata.com/case-studies/data-driven-automated-e-commerce-tools\nhttps://brightdata.com/case-studies/highly-targeted-influencer-marketing\nhttps://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions\nhttps://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data\nhttps://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy\nhttps://brightdata.com/case-studies/data-intensive-analytical-solutions\nhttps://brightdata.com/case-studies/canopy-advantage-solutions\nhttps://brightdata.com/case-studies/seamless-digital-automations\n```\n\nUnlike other parsers that load the entire file into memory, SAX processes files incrementally, saving memory and improving performance. However, it requires more code to handle each data segment and doesn’t allow revisiting parts of the data for later analysis.\n\nSAX is ideal for efficiently scanning large XML files (e.g., log files) to extract specific information (e.g., error messages). However, if your analysis needs to explore relationships between different data segments, SAX may not be the best choice.\n\n## Conclusion\n\nPython offers versatile libraries to simplify XML parsing. However, when using the requests library to access files online, you may face quota and throttling issues. [Bright Data](https://brightdata.com/) offers reliable proxy solutions to help bypass these limitations. 
\n\nIf you'd rather skip the scraping and parsing, check out our [dataset marketplace](https://brightdata.com/products/datasets) for free!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fparsing-xml-with-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Fparsing-xml-with-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fparsing-xml-with-python/lists"}