{"id":13878295,"url":"https://github.com/soulcutter/saxerator","last_synced_at":"2025-04-05T02:09:12.916Z","repository":{"id":2919574,"uuid":"3929772","full_name":"soulcutter/saxerator","owner":"soulcutter","description":"A SAX-based XML parser for parsing large files into manageable chunks","archived":false,"fork":false,"pushed_at":"2022-09-10T15:46:07.000Z","size":198,"stargazers_count":126,"open_issues_count":9,"forks_count":19,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-04-23T06:58:26.423Z","etag":null,"topics":["ruby","sax","xml"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soulcutter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-04-04T14:53:17.000Z","updated_at":"2024-03-30T13:24:19.000Z","dependencies_parsed_at":"2022-07-15T23:04:33.440Z","dependency_job_id":null,"html_url":"https://github.com/soulcutter/saxerator","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soulcutter%2Fsaxerator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soulcutter%2Fsaxerator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soulcutter%2Fsaxerator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soulcutter%2Fsaxerator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soulcutter","download_url":"https://codeload.github.com/soulcutter/saxerator/tar.gz/refs/heads/master","host":{"name":"GitHub",
"url":"https://github.com","kind":"github","repositories_count":247276164,"owners_count":20912288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ruby","sax","xml"],"created_at":"2024-08-06T08:01:45.458Z","updated_at":"2025-04-05T02:09:12.898Z","avatar_url":"https://github.com/soulcutter.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"Saxerator [![soulcutter](https://circleci.com/gh/soulcutter/saxerator.svg?style=shield)](https://circleci.com/gh/soulcutter/saxerator)[![Code Climate](https://codeclimate.com/github/soulcutter/saxerator.png)](https://codeclimate.com/github/soulcutter/saxerator)\n=========\n\nSaxerator is a streaming xml-to-hash parser designed for working with very large xml files by\ngiving you Enumerable access to manageable chunks of the document.\n\nEach xml chunk is parsed into a JSON-like Ruby Hash structure for consumption.\n\nYou can parse any valid xml in 3 simple steps.\n\n1. Initialize the parser\n1. Specify which tag you care about using a simple DSL\n1. Perform your work in an `each` block, or using any [Enumerable](http://apidock.com/ruby/Enumerable)\nmethod\n\nInstallation\n------------\n1. `gem install saxerator`\n1. Choose an xml parser\n    * (default) Use ruby's built-in REXML parser - no other dependencies necessary\n    * `gem install nokogiri`\n    * `gem install ox`\n1. 
If not using the default, specify your adapter in the [Saxerator configuration](#configuration)\n\nThe DSL\n-------\nThe DSL consists of predicates that may be combined to describe which elements the parser should enumerate over.\nSaxerator will only enumerate over chunks of xml that match all of the combined predicates (see the Examples section\nfor added clarity).\n\n| Predicate        | Explanation |\n|:-----------------|:------------|\n| `all`            | Returns the entire document parsed into a hash. Cannot combine with other predicates\n| `for_tag(name)`  | Elements whose name matches the given `name`\n| `for_tags(names)`| Elements whose name is in the `names` Array\n| `at_depth(n)`    | Elements `n` levels deep inside the root of an xml document. The root element itself is `n = 0`\n| `within(name)`   | Elements nested anywhere within an element with the given `name`\n| `child_of(name)` | Elements that are direct children of an element with the given `name`\n| `with_attribute(name, value)` | Elements that have an attribute with a given `name` and `value`. If no `value` is given, matches any element with the specified attribute name present\n| `with_attributes(attrs)` | Similar to `with_attribute` except takes an Array or Hash indicating the attributes to match\n\nOn any parsing error, Saxerator raises a `Saxerator::ParseException` with a message describing what is wrong with the XML document.\n**Warning** REXML won't raise an error if the root element wasn't closed. 
(will be fixed in Ruby 2.5)\n\nExamples\n--------\n```ruby\nrequire 'saxerator'\n\nparser = Saxerator.parser(File.new(\"rss.xml\"))\n\nparser.for_tag(:item).each do |item|\n  # where the xml contains \u003citem\u003e\u003ctitle\u003e...\u003c/title\u003e\u003cauthor\u003e...\u003c/author\u003e\u003c/item\u003e\n  # item will look like {'title' =\u003e '...', 'author' =\u003e '...'}\n  puts \"#{item['title']}: #{item['author']}\"\nend\n\n# a String is returned here since the given element contains only character data\nputs \"First title: #{parser.for_tag(:title).first}\"\n```\n\nAttributes are stored as a part of the Hash or String object they relate to.\n\n```ruby\n# author is a String here, but also responds to .attributes\nprimary_authors = parser.for_tag(:author).select { |author| author.attributes['type'] == 'primary' }\n```\n\nYou can combine predicates to isolate just the tags you want.\n\n```ruby\nrequire 'saxerator'\n\nparser = Saxerator.parser(bookshelf_xml)\n\n# You can chain predicates\nparser.for_tag(:name).within(:book).each { |book_name| puts book_name }\n\n# You can re-use intermediary predicates\nbookshelf_contents = parser.within(:bookshelf)\n\nbooks = bookshelf_contents.for_tag(:book)\nmagazines = bookshelf_contents.for_tag(:magazine)\n\nbooks.each do |book|\n  # ...\nend\n\nmagazines.each do |magazine|\n  # ...\nend\n```\n\nConfiguration\n-------------\n\nCertain options are available via a configuration block at parser initialization.\n\n```ruby\nSaxerator.parser(xml) do |config|\n  config.output_type = :xml\nend\n```\n\n| Setting           | Default | Values          | Description\n|:------------------|:--------|-----------------|------------\n| `adapter`         | `:rexml` | `:nokogiri`, `:oga`, `:ox`, `:rexml` | The XML parser used by Saxerator |\n| `output_type`     | `:hash` | `:hash`, `:xml` | The type of object generated by Saxerator's parsing. 
`:hash` generates a Ruby Hash, `:xml` generates a `REXML::Document`\n| `symbolize_keys!` | n/a     | n/a             | Call this method if you want the hash keys to be symbols rather than strings\n| `ignore_namespaces!`| n/a   | n/a             | Call this method if you want to treat the XML document as if it has no namespace information. It differs slightly from `strip_namespaces!` since it deals with how the XML is processed rather than how it is output\n| `strip_namespaces!`| n/a     | user-specified  | Called with no arguments this strips all namespaces, or you may specify an arbitrary number of namespaces to strip, e.g. `config.strip_namespaces! :rss, :soapenv`\n| `put_attributes_in_hash!` | n/a     | n/a             | Call this method if you want xml attributes included as elements of the output hash - only valid with `output_type = :hash`\n\nKnown Issues\n------------\n* JRuby closes the file stream at the end of parsing, so to perform multiple operations\n  that parse a file you will need to instantiate a new parser with a new File object.\n\nOther Documentation\n-------------------\n* [REXML](http://www.germane-software.com/software/rexml/) ([api docs](http://ruby-doc.org/stdlib-2.4.0/libdoc/rexml/rdoc/REXML/Document.html))\n* [Nokogiri](http://www.nokogiri.org/) ([api docs](http://www.rubydoc.info/github/sparklemotion/nokogiri))\n* [Oga](https://github.com/YorickPeterse/oga) ([api docs](http://code.yorickpeterse.com/oga/latest/))\n* [Ox](https://github.com/ohler55/ox) ([api docs](http://www.ohler.com/ox/))\n\nFAQ\n---\nWhy the name 'Saxerator'?\n\n  \u003e It's a combination of SAX + Enumerator.\n\nWhy use Saxerator over regular SAX parsing?\n\n  \u003e Much of the SAX parsing code I've written over the years has fallen into a pattern that Saxerator encapsulates:\n  \u003e marshal a chunk of an XML document into an object, operate on that object, then move on to the\n  \u003e next chunk. 
Saxerator alleviates the pain of marshalling and allows you to focus solely on operating on the\n  \u003e document chunk.\n\nWhy not DOM parsing?\n\n  \u003e DOM parsers load the entire document into memory. Saxerator only holds a single chunk in memory at a time. If your\n  \u003e document is very large, this can be an important consideration.\n\nWhen I fetch a tag that has one or more elements, sometimes I get an `Array`, and other times I get a `Hash` or `String`. Is there a way I can treat these consistently?\n\n  \u003e You can treat objects consistently as arrays using\n  \u003e [Ruby's built-in array conversion method](http://www.ruby-doc.org/core-2.1.1/Kernel.html#method-i-Array)\n  \u003e in the form `Array(element_or_array)`\n\nWhy does Active Record fail when I pass a String value to a query?\n\n  \u003e Saxerator doesn't return plain Array, Hash, or String objects. You can convert a value to the type you need by calling the appropriate `.to_\u003ctype\u003e` method, as you usually would.\n\n### Contribution ###\n\nTo run the tests for all parsers, run `rake spec:adapters`.\n\n### Acknowledgements ###\nSaxerator was inspired by - but not affiliated with - [nori](https://github.com/savonrb/nori) and Gregory Brown's\n[Practicing Ruby](http://practicingruby.com/)\n\n#### Legal Stuff ####\nCopyright © 2012-2020 Bradley Schaefer. MIT License (see LICENSE file).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoulcutter%2Fsaxerator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoulcutter%2Fsaxerator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoulcutter%2Fsaxerator/lists"}