{"id":13484121,"url":"https://gitlab.com/yorickpeterse/oga","last_synced_at":"2025-03-27T15:31:07.157Z","repository":{"id":50926558,"uuid":"4545021","full_name":"yorickpeterse/oga","owner":"yorickpeterse","description":"Moved to https://github.com/yorickpeterse/oga","archived":true,"fork":false,"pushed_at":null,"size":null,"stargazers_count":45,"open_issues_count":13,"forks_count":7,"subscribers_count":null,"default_branch":"master","last_synced_at":"2024-11-17T14:55:14.719Z","etag":null,"topics":["html","parser","ruby","xml"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://gitlab.com/uploads/-/system/project/avatar/4545021/oga.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-02T00:13:06.321Z","updated_at":"2022-08-19T02:41:22.400Z","dependencies_parsed_at":"2022-08-20T13:50:59.379Z","dependency_job_id":null,"html_url":"https://gitlab.com/yorickpeterse/oga","commit_stats":null,"previous_names":[],"tags_count":47,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/yorickpeterse%2Foga","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/yorickpeterse%2Foga/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/yorickpeterse%2Foga/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories/yorickpeterse%2Foga/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/owners/yorickpeterse","download_url":"https://gitlab.com/yorickpeterse/oga/-/archive/master/oga-mast
er.zip","host":{"name":"gitlab.com","url":"https://gitlab.com","kind":"gitlab","repositories_count":4518295,"owners_count":6897,"icon_url":"https://github.com/gitlab.png","version":null,"created_at":"2022-05-30T11:31:42.605Z","updated_at":"2024-07-18T11:24:13.055Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/gitlab.com/owners"}},"keywords":["html","parser","ruby","xml"],"created_at":"2024-07-31T17:01:19.629Z","updated_at":"2025-03-27T15:31:05.290Z","avatar_url":"https://gitlab.com/uploads/-/system/project/avatar/4545021/oga.png","language":null,"funding_links":[],"categories":["HTML/XML Parsing"],"sub_categories":[],"readme":"# Oga\n\n**NOTE:** my spare time is limited which means I am unable to dedicate a lot of\ntime on Oga. If you're interested in contributing to FOSS, please take a look at\nthe open issues and submit a pull request to address them where possible.\n\nOga is an XML/HTML parser written in Ruby. It provides an easy to use API for\nparsing, modifying and querying documents (using XPath expressions). Oga does\nnot require system libraries such as libxml, making it easier and faster to\ninstall on various platforms. To achieve better performance Oga uses a small,\nnative extension (C for MRI/Rubinius, Java for JRuby).\n\nOga provides an API that allows you to safely parse and query documents in a\nmulti-threaded environment, without having to worry about your applications\nblowing up.\n\nFrom [Wikipedia][oga-wikipedia]:\n\n\u003e Oga: A large two-person saw used for ripping large boards in the days before\n\u003e power saws. 
One person stood on a raised platform, with the board below him,\n\u003e and the other person stood underneath them.\n\nThe name is a pun on [Nokogiri][nokogiri].\n\n## Versioning Policy\n\nOga uses the version format `MAJOR.MINOR` (e.g. `2.1`). An increase of the MAJOR\nversion indicates backwards incompatible changes were introduced. The MINOR\nversion is _only_ increased when changes are backwards compatible, regardless of\nwhether those changes are bugfixes or new features. Up until version 1.0 the\ncode should be considered unstable meaning it can change (and break) at any\ngiven moment.\n\nAPIs explicitly tagged as private (e.g. using Ruby's `private` keyword or YARD's\n`@api private` tag) are not covered by these rules.\n\n## Examples\n\nParsing a simple string of XML:\n\n    Oga.parse_xml('\u003cpeople\u003e\u003cperson\u003eAlice\u003c/person\u003e\u003c/people\u003e')\n\nParsing XML using strict mode (disables automatic tag insertion):\n\n    Oga.parse_xml('\u003cpeople\u003efoo\u003c/people\u003e', :strict =\u003e true) # works fine\n    Oga.parse_xml('\u003cpeople\u003efoo', :strict =\u003e true)          # throws an error\n\nParsing a simple string of HTML:\n\n    Oga.parse_html('\u003clink rel=\"stylesheet\" href=\"foo.css\"\u003e')\n\nParsing an IO handle pointing to XML (this also works when using\n`Oga.parse_html`):\n\n    handle = File.open('path/to/file.xml')\n\n    Oga.parse_xml(handle)\n\nParsing an IO handle using the pull parser:\n\n    handle = File.open('path/to/file.xml')\n    parser = Oga::XML::PullParser.new(handle)\n\n    parser.parse do |node|\n      parser.on(:text) do\n        puts node.text\n      end\n    end\n\nUsing an Enumerator to download and parse an XML document on the fly:\n\n    enum = Enumerator.new do |yielder|\n      HTTPClient.get('http://some-website.com/some-big-file.xml') do |chunk|\n        yielder \u003c\u003c chunk\n      end\n    end\n\n    document = Oga.parse_xml(enum)\n\nParse a string of XML using the SAX 
parser:\n\n    class ElementNames\n      attr_reader :names\n\n      def initialize\n        @names = []\n      end\n\n      def on_element(namespace, name, attrs = {})\n        @names \u003c\u003c name\n      end\n    end\n\n    handler = ElementNames.new\n\n    Oga.sax_parse_xml(handler, '\u003cfoo\u003e\u003cbar\u003e\u003c/bar\u003e\u003c/foo\u003e')\n\n    handler.names # =\u003e [\"foo\", \"bar\"]\n\nQuerying a document using XPath:\n\n    document = Oga.parse_xml \u003c\u003c-EOF\n    \u003cpeople\u003e\n      \u003cperson id=\"1\"\u003e\n        \u003cname\u003eAlice\u003c/name\u003e\n        \u003cage\u003e28\u003c/age\u003e\n      \u003c/person\u003e\n    \u003c/people\u003e\n    EOF\n\n    # The \"xpath\" method returns an enumerable (Oga::XML::NodeSet) that you can\n    # iterate over.\n    document.xpath('people/person').each do |person|\n      puts person.get('id') # =\u003e \"1\"\n\n      # The \"at_xpath\" method returns a single node from a set; it's the same as\n      # person.xpath('name').first.\n      puts person.at_xpath('name').text # =\u003e \"Alice\"\n    end\n\nQuerying the same document using CSS:\n\n    document = Oga.parse_xml \u003c\u003c-EOF\n    \u003cpeople\u003e\n      \u003cperson id=\"1\"\u003e\n        \u003cname\u003eAlice\u003c/name\u003e\n        \u003cage\u003e28\u003c/age\u003e\n      \u003c/person\u003e\n    \u003c/people\u003e\n    EOF\n\n    # The \"css\" method returns an enumerable (Oga::XML::NodeSet) that you can\n    # iterate over.\n    document.css('people person').each do |person|\n      puts person.get('id') # =\u003e \"1\"\n\n      # The \"at_css\" method returns a single node from a set; it's the same as\n      # person.css('name').first.\n      puts person.at_css('name').text # =\u003e \"Alice\"\n    end\n\nModifying a document and serializing it back to XML:\n\n    document = Oga.parse_xml('\u003cpeople\u003e\u003cperson\u003eAlice\u003c/person\u003e\u003c/people\u003e')\n    name     = 
document.at_xpath('people/person[1]/text()')\n\n    name.text = 'Bob'\n\n    document.to_xml # =\u003e \"\u003cpeople\u003e\u003cperson\u003eBob\u003c/person\u003e\u003c/people\u003e\"\n\nQuerying a document using a namespace:\n\n    document = Oga.parse_xml('\u003croot xmlns:x=\"foo\"\u003e\u003cx:div\u003e\u003c/x:div\u003e\u003c/root\u003e')\n    div      = document.xpath('root/x:div').first\n\n    div.namespace # =\u003e Namespace(name: \"x\" uri: \"foo\")\n\n## Features\n\n* Support for parsing XML and HTML(5)\n  * DOM parsing\n  * Stream/pull parsing\n  * SAX parsing\n* Low memory footprint\n* High performance (taking into account most work happens in Ruby)\n* Support for XPath 1.0\n* CSS3 selector support\n* XML namespace support (registering, querying, etc.)\n* Windows support\n\n## Requirements\n\n| Ruby     | Required      | Recommended |\n|:---------|:--------------|:------------|\n| MRI      | \u003e= 2.3.0      | \u003e= 2.6.0    |\n| JRuby    | \u003e= 1.7        | \u003e= 1.7.12   |\n| Rubinius | Not supported |             |\n| Maglev   | Not supported |             |\n| Topaz    | Not supported |             |\n| mruby    | Not supported |             |\n\nMaglev and Topaz are not supported due to the lack of a C API (that I know of)\nand the lack of active development of these Ruby implementations. mruby is not\nsupported because it's a very different implementation altogether.\n\nTo install Oga on MRI or Rubinius you'll need to have a working compiler such as\ngcc or clang. Oga's C extension can be compiled with both. JRuby does not\nrequire a compiler as the native extension is compiled during the Gem building\nprocess and bundled inside the Gem itself.\n\n## Thread Safety\n\nOga does not use any unsynchronized global mutable state. As a result of this you\ncan parse/create documents concurrently without any problems. 
Modifying\ndocuments concurrently can lead to bugs as these operations are not\nsynchronized.\n\nSome querying operations will cache data in instance variables, without\nsynchronization. An example is `Oga::XML::Element#namespace` which will cache an\nelement's namespace after the first call.\n\nIn general it's recommended to _not_ use the same document in multiple threads\nat the same time.\n\n## Namespace Support\n\nOga fully supports parsing/registering XML namespaces as well as querying them\nusing XPath. For example, take the following XML:\n\n    \u003croot xmlns=\"http://example.com\"\u003e\n        \u003cbar\u003ebar\u003c/bar\u003e\n    \u003c/root\u003e\n\nIf one were to try and query the `bar` element (e.g. using XPath `root/bar`)\nthey'd end up with an empty node set. This is due to `\u003croot\u003e` defining an\nalternative default namespace. Instead you can query this element using the\nfollowing XPath:\n\n    *[local-name() = \"root\"]/*[local-name() = \"bar\"]\n\nAlternatively, if you don't really care where the `\u003cbar\u003e` element is located you\ncan use the following:\n\n    descendant::*[local-name() = \"bar\"]\n\nAnd if you want to specify an explicit namespace URI, you can use this:\n\n    descendant::*[local-name() = \"bar\" and namespace-uri() = \"http://example.com\"]\n\nLike Nokogiri, Oga provides a way to create \"dynamic\" namespaces.\nThat is, Oga allows one to query the above document as follows:\n\n    document = Oga.parse_xml('\u003croot xmlns=\"http://example.com\"\u003e\u003cbar\u003ebar\u003c/bar\u003e\u003c/root\u003e')\n\n    document.xpath('x:root/x:bar', namespaces: {'x' =\u003e 'http://example.com'})\n\nMoreover, because Oga assigns the name \"xmlns\" to default namespaces you can use\nthis in your XPath queries:\n\n    document = Oga.parse_xml('\u003croot xmlns=\"http://example.com\"\u003e\u003cbar\u003ebar\u003c/bar\u003e\u003c/root\u003e')\n\n    document.xpath('xmlns:root/xmlns:bar')\n\nWhen using this you can 
still restrict the query to the correct namespace URI:\n\n    document.xpath('xmlns:root[namespace-uri() = \"http://example.com\"]/xmlns:bar')\n\n## HTML5 Support\n\nOga fully supports HTML5 including the omission of certain tags. For example,\nthe following is parsed just fine:\n\n    \u003cli\u003eHello\n    \u003cli\u003eWorld\n\nThis is effectively parsed into:\n\n    \u003cli\u003eHello\u003c/li\u003e\n    \u003cli\u003eWorld\u003c/li\u003e\n\nOne exception Oga makes is that it does _not_ automatically insert `html`,\n`head` and `body` tags. Automatically inserting these tags requires a\ndistinction between documents and fragments as a user might not always want\nthese tags to be inserted if left out. This complicates the user-facing API as\nwell as complicating the parsing internals of Oga. As a result I have decided\nthat Oga _does not_ insert these tags when left out.\n\nA more in-depth explanation can be found here:\n\u003chttps://gitlab.com/yorickpeterse/oga/issues/98#note_45443992\u003e\n\n## Documentation\n\nThe documentation is best viewed [on the documentation website][doc-website].\n\n* {file:CONTRIBUTING Contributing}\n* {file:changelog Changelog}\n* {file:migrating\\_from\\_nokogiri Migrating From Nokogiri}\n* {Oga::XML::Parser XML Parser}\n* {Oga::XML::SaxParser XML SAX Parser}\n* {file:xml\\_namespaces XML Namespaces}\n\n## Why Another HTML/XML parser?\n\nCurrently there are a few existing parsers out there, the most famous one being\n[Nokogiri][nokogiri]. Another parser that's becoming more popular these days is\n[Ox][ox]. Ruby's standard library also comes with REXML.\n\nThe sad truth is that these existing libraries are problematic in their own\nways. Nokogiri for example is extremely unstable on Rubinius. On MRI it works\nbecause of the non-concurrent nature of MRI; on JRuby it works because it's\nimplemented in Java. 
Nokogiri also uses libxml2 which is a massive beast of a\nlibrary, is not thread-safe and problematic to install on certain platforms\n(apparently). I don't want to compile libxml2 every time I install Nokogiri\neither.\n\nTo give an example about the issues with Nokogiri on Rubinius (or any other\nRuby implementation that is not MRI or JRuby), take a look at these issues:\n\n* \u003chttps://github.com/rubinius/rubinius/issues/2957\u003e\n* \u003chttps://github.com/rubinius/rubinius/issues/2908\u003e\n* \u003chttps://github.com/rubinius/rubinius/issues/2462\u003e\n* \u003chttps://github.com/sparklemotion/nokogiri/issues/1047\u003e\n* \u003chttps://github.com/sparklemotion/nokogiri/issues/939\u003e\n\nSome of these have been fixed, some have not. The core problem remains:\nNokogiri acts in a way that there can be a large number of places where it\n*might* break due to throwing around void pointers and what not and expecting\nthat things magically work. Note that I have nothing against the people running\nthese projects, I just heavily, *heavily* dislike the resulting codebase one\nhas to deal with today.\n\nOx looks very promising but it lacks a rather crucial feature: parsing HTML\n(without using a SAX API). It's also again a C extension making debugging more\nof a pain (at least for me).\n\nI just want an XML/HTML parser that I can rely on stability wise and that is\nwritten in Ruby so I can actually debug it. In theory it should also make it\neasier for other Ruby developers to contribute.\n\n## License\n\nAll source code in this repository is subject to the terms of the Mozilla Public\nLicense, version 2.0 unless stated otherwise. 
A copy of this license can be\nfound in the file \"LICENSE\" or at \u003chttps://www.mozilla.org/MPL/2.0/\u003e.\n\n[nokogiri]: https://github.com/sparklemotion/nokogiri\n[oga-wikipedia]: https://en.wikipedia.org/wiki/Japanese_saw#Other_Japanese_saws\n[ox]: https://github.com/ohler55/ox\n[doc-website]: http://code.yorickpeterse.com/oga/latest/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/gitlab.com%2Fyorickpeterse%2Foga","html_url":"https://awesome.ecosyste.ms/projects/gitlab.com%2Fyorickpeterse%2Foga","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/gitlab.com%2Fyorickpeterse%2Foga/lists"}