https://github.com/JuliaWeb/Gumbo.jl
Julia wrapper around Google's gumbo C library for parsing HTML
- Host: GitHub
- URL: https://github.com/JuliaWeb/Gumbo.jl
- Owner: JuliaWeb
- License: other
- Created: 2014-05-04T21:53:08.000Z (about 12 years ago)
- Default Branch: master
- Last Pushed: 2025-01-02T19:16:10.000Z (over 1 year ago)
- Last Synced: 2026-03-05T18:53:51.043Z (3 months ago)
- Language: Julia
- Homepage:
- Size: 143 KB
- Stars: 159
- Watchers: 7
- Forks: 26
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-julia-security - Gumbo.jl - Julia wrapper around Google's gumbo HTML parser for web scraping and security analysis. (Web Security / HTTP and Web Frameworks)
README
# Gumbo.jl
Gumbo.jl is a Julia wrapper around
[the gumbo library](https://github.com/google/gumbo-parser) for
parsing HTML.
> [!WARNING]
> The underlying C library is currently unmaintained. Use at your own risk.
Getting started is very easy:
```julia
julia> using Gumbo

julia> parsehtml("<h1> Hello, world! </h1>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:
<HTML>
  <head></head>
  <body>
    <h1>
       Hello, world!
    </h1>
  </body>
</HTML>
```
Read on for further documentation.
## Installation
```jl
using Pkg
Pkg.add("Gumbo")
```
or activate `Pkg` mode in the REPL by typing `]`, and then:
```
add Gumbo
```
## Basic usage
The workhorse is the `parsehtml` function, which takes a single
positional argument, a valid UTF-8 string, and parses it as HTML,
e.g.:
```julia
parsehtml("<h1> Hello, world! </h1>")
```
Parsing an HTML file named `filename` can be done using:
```julia
julia> parsehtml(read(filename, String))
```
The result of a call to `parsehtml` is an `HTMLDocument`, a type which
has two fields: `doctype`, which is the doctype of the parsed document
(this will be the empty string if no doctype is provided), and `root`,
which is a reference to the `HTMLElement` that is the root of the
document.
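As a minimal sketch of the two fields described above (the input string is made up for illustration), the following parses a small document and reads `doctype` and `root`:

```julia
using Gumbo

# Illustrative input; any HTML string would do.
doc = parsehtml("<!DOCTYPE html><html><body><p>Hi</p></body></html>")

doctype  = doc.doctype    # doctype reported for the parsed document
root     = doc.root       # root HTMLElement of the tree
root_tag = tag(root)      # tag of the root as a Symbol
```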
Note that gumbo is a very permissive HTML parser, designed to
gracefully handle the insanity that passes for HTML out on the wild,
wild web. It will return a valid HTML document for *any* input, doing
all sorts of algorithmic gymnastics to twist what you give it into
valid HTML.
If you want an HTML validator, this is probably not your library. That
said, `parsehtml` does take an optional `Bool` keyword argument,
`strict`, which, if `true`, causes an `InvalidHTMLError` to be thrown
if the call to the gumbo C library reports any errors.
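A hedged sketch of how `strict` might be used follows. `try_strict` is a hypothetical helper, not part of Gumbo.jl, and it deliberately catches any exception rather than naming the exception type, since the exact type name is taken from the text above and is not verified here. Whether a given input counts as an error is decided by the gumbo C library, so no particular outcome is assumed.

```julia
using Gumbo

# Hypothetical helper: returns the parsed document, or `nothing` if
# strict parsing throws.
function try_strict(html::AbstractString)
    try
        return parsehtml(html; strict=true)
    catch err
        # Report the exception type instead of assuming its name.
        @warn "strict parsing failed" exception_type = typeof(err)
        return nothing
    end
end

result = try_strict("<p>unclosed <b>markup")

# The permissive default always yields a document for the same input.
lenient = parsehtml("<p>unclosed <b>markup")
```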
## HTML types
This library defines a number of types for representing HTML.
### `HTMLDocument`
`HTMLDocument` is what is returned from a call to `parsehtml`. It has a
`doctype` field, which contains the doctype of the parsed document,
and a `root` field, which is a reference to the root of the document.
### `HTMLNode`s
A document contains a tree of HTML nodes, which are represented as
subtypes of the `HTMLNode` abstract type. The first of these is
`HTMLElement`.
### `HTMLElement`
```julia
mutable struct HTMLElement{T} <: HTMLNode
    children::Vector{HTMLNode}
    parent::HTMLNode
    attributes::Dict{String, String}
end
```
`HTMLElement` is probably the most interesting and frequently used
type. An `HTMLElement` is parameterized by a symbol representing its
tag. So an `HTMLElement{:a}` is a different type from an
`HTMLElement{:body}`, etc. An empty `HTMLElement` of a given tag can be
constructed as follows:
```julia
julia> HTMLElement(:div)
HTMLElement{:div}:
<div></div>
```
`HTMLElement`s have a `parent` field, which refers to another
`HTMLNode`. `parent` will always be an `HTMLElement`, unless the
element has no parent (as is the case with the root of a document), in
which case it will be a `NullNode`, a special type of `HTMLNode` which
exists for just this purpose. Empty `HTMLElement`s constructed as in
the example above will also have a `NullNode` for a parent.
`HTMLElement`s also have `children`, which is a vector of
`HTMLNode` containing the children of this element, and
`attributes`, which is a `Dict` mapping attribute names to values.
`HTMLElement`s implement `getindex`, `setindex!`, and `push!`;
indexing into or pushing onto an `HTMLElement` operates on its
children array.
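For illustration, a small tree can be built by hand with `push!` and read back by index. This is only a sketch: the variable names are made up, and it does not rely on `push!` updating the child's `parent` field.

```julia
using Gumbo

container = HTMLElement(:div)              # empty <div>
push!(container, HTMLElement(:p))          # first child: an element
push!(container, HTMLText("some text"))    # second child: a text node

first_child = container[1]                 # indexing reads the children array
n_children  = length(children(container))
```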
There are a number of convenience methods for working with `HTMLElement`s:
- `tag(elem)`
    get the tag of this element as a symbol
- `attrs(elem)`
    return the attributes dict of this element
- `children(elem)`
    return the children array of this element
- `getattr(elem, name)`
    get the value of attribute `name` or throw a `KeyError`. Also
    supports being called with a default value (`getattr(elem, name,
    default)`) or function (`getattr(f, elem, name)`).
- `setattr!(elem, name, value)`
    set the value of attribute `name` to `value`
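The helpers listed above can be exercised on a hand-built element. In this sketch the attribute names and values are invented, and the function form of `getattr` is assumed to follow the `Base.get(f, dict, key)` convention, calling `f` with no arguments when the attribute is missing.

```julia
using Gumbo

link = HTMLElement(:a)                            # empty <a> element
setattr!(link, "href", "https://example.com")

href   = getattr(link, "href")                    # attribute is present
title  = getattr(link, "title", "untitled")       # default used when missing
rel    = getattr(() -> "nofollow", link, "rel")   # fallback computed by a function
has_id = haskey(attrs(link), "id")                # attrs returns the attributes Dict
```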
### `HTMLText`
```jl
mutable struct HTMLText <: HTMLNode
    parent::HTMLNode
    text::String
end
```
Represents text appearing in an HTML document. For example:
```julia
julia> doc = parsehtml("<h1> Hello, world! </h1>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:
<HTML>
  <head></head>
  <body>
    <h1>
       Hello, world!
    </h1>
  </body>
</HTML>

julia> doc.root[2][1][1]
HTML Text: Hello, world!
```
This type is quite simple, just a reference to its parent and the
actual text it represents (this is also accessible by a `text`
function). You can construct `HTMLText` instances as follows:
```jl
julia> HTMLText("Example text")
HTML Text: Example text
```
Just as with `HTMLElement`s, the parent of an instance so constructed
will be a `NullNode`.
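A short sketch of reading text nodes, using a made-up input string: the `text` function and the `text` field should give the same string, and a hand-constructed `HTMLText` has a `NullNode` parent as described above.

```julia
using Gumbo

doc  = parsehtml("<h1>Hello, world!</h1>")
node = doc.root[2][1][1]          # body -> h1 -> text node

content = text(node)              # accessor function
same    = content == node.text    # field access gives the same string

orphan = HTMLText("Example text") # constructed by hand, so it has no parent
```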
## Tree traversal
Use the iterators defined in
[AbstractTrees.jl](https://github.com/Keno/AbstractTrees.jl/), e.g.:
```julia
julia> using AbstractTrees
julia> using Gumbo
julia> doc = parsehtml("""
                       <html>
                         <body>
                           <div>
                             <p></p> <a></a> <p></p>
                           </div>
                           <div>
                             <span></span>
                           </div>
                         </body>
                       </html>
                       """);
julia> for elem in PreOrderDFS(doc.root) println(tag(elem)) end
HTML
head
body
div
p
a
p
div
span
julia> for elem in PostOrderDFS(doc.root) println(tag(elem)) end
head
p
a
p
div
span
div
body
HTML
julia> for elem in StatelessBFS(doc.root) println(tag(elem)) end
HTML
head
body
div
div
p
a
p
span
julia>
```
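The same iterators can be combined with type checks to pull specific nodes out of a document. The sketch below, with an invented input, collects `href` values from `a` elements and concatenates all text nodes. Note that the traversal also visits `HTMLText` nodes, so calls such as `tag` need a guard when the document contains text.

```julia
using AbstractTrees
using Gumbo

doc = parsehtml("<div><p>one</p><a href=\"/x\">link</a><p>two</p></div>")

# Keep only <a> elements that actually carry an href attribute.
hrefs = [getattr(n, "href") for n in PreOrderDFS(doc.root)
         if n isa HTMLElement{:a} && haskey(attrs(n), "href")]

# Concatenate the text nodes in document (pre-order) order.
all_text = join(text(n) for n in PreOrderDFS(doc.root) if n isa HTMLText)
```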
## TODOS
- support CDATA
- support comments