{"id":29630092,"url":"https://github.com/dps/go-xml-parse","last_synced_at":"2025-10-07T13:48:47.380Z","repository":{"id":3641819,"uuid":"4709091","full_name":"dps/go-xml-parse","owner":"dps","description":"Streaming XML parser example in go","archived":false,"fork":false,"pushed_at":"2015-07-04T01:01:46.000Z","size":259,"stargazers_count":133,"open_issues_count":2,"forks_count":27,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-07-21T10:19:44.271Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dps.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-19T03:38:44.000Z","updated_at":"2025-02-02T22:54:12.000Z","dependencies_parsed_at":"2022-08-18T18:21:04.673Z","dependency_job_id":null,"html_url":"https://github.com/dps/go-xml-parse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dps/go-xml-parse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dps%2Fgo-xml-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dps%2Fgo-xml-parse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dps%2Fgo-xml-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dps%2Fgo-xml-parse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dps","download_url":"https://codeload.github.com/dps/go-xml-parse/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dps%2Fgo-xml-parse/sbom","scorecard":nu
ll,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278786691,"owners_count":26045588,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-21T10:08:16.567Z","updated_at":"2025-10-07T13:48:47.357Z","avatar_url":"https://github.com/dps.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"go-xml-parse\n============\n\nStreaming XML parser example in Go\n\nIntro\n-----\n\nI've recently been messing around with the XML dumps of Wikipedia. These are pretty huge XML files - for instance the most recent revision is 36G when uncompressed. That's a lot of XML!\n\nI've been experimenting with a few different languages and parsers for my task (which also happens to involve some non-trivial processing for each article) and found Go to be a great fit.\n\nGo has a standard library package for parsing XML (encoding/xml) which is very convenient to code against. However, the simple version of the API (xml.Unmarshal) requires parsing the whole document at once, which for 36G is not a viable strategy. \n\nThe parser can also be used in a streaming mode but I found the documentation and examples online to be terse and nonexistent respectively, so here is my example code for parsing Wikipedia with encoding/xml and a little explanation! 
(full example code at https://github.com/dps/go-xml-parse/blob/master/go-xml-parse.go)\n\nHere's a little snippet of an example Wikipedia page in the doc:\n\n```xml\n\u003cpage\u003e \n  \u003ctitle\u003eApollo 11\u003c/title\u003e \n    \u003credirect title=\"Foo bar\" /\u003e \n    ... \n     \u003crevision\u003e \n     ... \n       \u003ctext xml:space=\"preserve\"\u003e \n       {{Infobox Space mission \n       |mission_name=\u003c!--See above--\u003e \n       |insignia=Apollo_11_insignia.png \n     ... \n       \u003c/text\u003e \n     \u003c/revision\u003e \n\u003c/page\u003e\n```\n\nIn our Go code, we define structs to match the \u003cpage\u003e element and its nested \u003credirect\u003e element, and grab a couple of fields we're interested in (\u003ctext\u003e and \u003ctitle\u003e).\n```go\ntype Redirect struct { \n    Title string `xml:\"title,attr\"` \n} \n\ntype Page struct { \n    Title string `xml:\"title\"` \n    Redir Redirect `xml:\"redirect\"` \n    Text string `xml:\"revision\u003etext\"` \n}\n```\nNow we would usually tell the parser that a Wikipedia dump contains a bunch of \u003cpage\u003es and try to read the whole thing, but let's see how we stream it instead.\n\nIt's quite simple when you know how - iterate over tokens in the file until you encounter a StartElement with the name \"page\" and then use the magic decoder.DecodeElement API to unmarshal the whole following page into an object of the Page type defined above. Cool!\n\n```go\ndecoder := xml.NewDecoder(xmlFile) \n\nfor { \n    // Read tokens from the XML document in a stream. \n    t, err := decoder.Token() \n    if err != nil { \n        // io.EOF means the stream is finished; report anything else. \n        if err != io.EOF { \n            log.Fatal(err) \n        } \n        break \n    } \n    // Inspect the type of the token just read. 
\n    switch se := t.(type) { \n    case xml.StartElement: \n        // If we just read a StartElement token \n        // ...and its name is \"page\" \n        if se.Name.Local == \"page\" { \n            var p Page \n            // decode a whole chunk of following XML into the\n            // variable p which is a Page (see above) \n            if err := decoder.DecodeElement(\u0026p, \u0026se); err != nil { \n                log.Fatal(err) \n            } \n            // Do some stuff with the page. \n            p.Title = CanonicalizeTitle(p.Title)\n            ...\n        } \n...\n```\n\n\nI hope this saves you some time if you need to parse a huge XML file yourself.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdps%2Fgo-xml-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdps%2Fgo-xml-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdps%2Fgo-xml-parse/lists"}