https://github.com/aantron/lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

css html ocaml scraping soup


# Lambda Soup   [![Coverage][coveralls-img]][coveralls]

[coveralls]: https://coveralls.io/github/aantron/lambdasoup?branch=master
[coveralls-img]: https://img.shields.io/coveralls/aantron/lambdasoup/master.svg

Lambda Soup is a functional HTML scraping and manipulation library for OCaml
aimed at being easy to use.

![Lambda Soup usage example][sample]

[sample]: https://raw.githubusercontent.com/aantron/lambdasoup/master/docs/sample.gif

Lambda Soup is *simple*. It provides a set of
[elementary traversals][traversals] for getting from node to node, familiar
functional [combinators][combinators] such as `filter`, `map`, and `fold`, and
support for all CSS selectors that still make sense when not running in a
browser (and a few obvious [extensions][extracss] on top of that).

Here is a trivial self-contained example:

```ocaml
(parse "<p class='Hello'>World!</p>") $ ".Hello" |> R.leaf_text;;
- : string = "World!"
```

And, a mutation:

```ocaml
let soup = parse "<p class='Hello'>World!</p>" in
wrap (soup $ ".Hello" |> R.child) (create_element "strong");
soup |> to_string;;
- : string = "<p class=\"Hello\"><strong>World!</strong></p>"
```
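
The combinators mentioned earlier compose with selection in the usual pipeline style. The following is a minimal sketch rather than an excerpt from the library's documentation: the sample markup and the names `hrefs` and `link_count` are made up for illustration, and the build command in the first comment assumes `ocamlfind` is installed.

```ocaml
(* Build sketch: ocamlfind ocamlopt -package lambdasoup -linkpkg links.ml *)
open Soup

(* Hypothetical sample document, not taken from the Lambda Soup docs. *)
let soup =
  parse "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li><li><a>C</a></li></ul>"

(* $$ selects every match; filter keeps the anchors that carry an href. *)
let hrefs =
  soup $$ "a"
  |> filter (fun a -> attribute "href" a <> None)
  |> to_list
  |> List.map (R.attribute "href")

(* fold accumulates over the selected nodes; here it just counts them. *)
let link_count = soup $$ "a[href]" |> fold (fun n _ -> n + 1) 0

let () = List.iter print_endline hrefs
```

Here `attribute` returns an option, while `R.attribute` returns the value directly and raises if the attribute is missing, which is safe after the `filter` step.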

For some more examples, see the Lambda Soup [postprocessor][postprocess] that
runs on Lambda Soup's own [documentation][docs] after it is generated by
`ocamldoc`.

The library is [tested][tests] thoroughly.

Lambda Soup is based on [Markup.ml][markupml]. As a consequence, it resolves
entity references, detects character encodings automatically, and converts
everything to UTF-8. You can also use Lambda Soup on XML by
[parsing][parse_xml] the XML with Markup.ml and [feeding][from_signals] the
resulting signals to Lambda Soup.
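
The XML route described above might look like the sketch below. It assumes the `markup` package is available alongside `lambdasoup`; the XML string and the names `xml`, `soup`, and `title` are invented for the example.

```ocaml
(* Build sketch: ocamlfind ocamlopt -package lambdasoup -linkpkg xml.ml *)
open Soup

(* Hypothetical sample XML. *)
let xml = "<catalog><book id=\"b1\"><title>OCaml</title></book></catalog>"

(* Parse with Markup.ml's XML parser, then hand the signal stream to
   Lambda Soup instead of calling Soup.parse (which parses HTML). *)
let soup =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> from_signals

(* From here on, the usual selectors and traversals apply. *)
let title = soup $ "book title" |> R.leaf_text

let () = print_endline title
```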

[parse_xml]: http://aantron.github.io/markup.ml/#VALparse_xml
[from_signals]: http://aantron.github.io/lambdasoup/#2_Parsingsignals


## Installing

    opam install lambdasoup

[contributing-install]: https://github.com/aantron/lambdasoup/blob/master/docs/CONTRIBUTING.md#developing


## Starting from scratch

To use Lambda Soup interactively as in the GIF at the top of this README, you
need to have done something like this:

```sh
your-package-manager install ocaml opam
opam init
eval `opam config env` # Or restart your shell
opam install lambdasoup
```

and make sure your `~/.ocamlinit` file looks something like this:

```ocaml
let () =
  try Topdirs.dir_directory (Sys.getenv "OCAML_TOPLEVEL_PATH")
  with Not_found -> ()
;;

#use "topfind";;
```

Then, run `ocaml -short-paths` to start the top-level, and scrape away!
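
Once the top-level is running, a first session might start like the following sketch. It assumes the `~/.ocamlinit` above has loaded `topfind`, so that the `#require` directive is available; the HTML string and the name `heading` are made up for illustration.

```ocaml
(* Load the library through findlib; requires topfind to be loaded. *)
#require "lambdasoup";;
open Soup;;

(* Hypothetical sample markup: select by id and read the element's text. *)
let heading = parse "<h1 id='title'>Hello</h1>" $ "#title" |> R.leaf_text;;
```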


## Depending

Lambda Soup uses semantic versioning, but is currently in `0.x.x`. For now, the
minor version number will be incremented on breaking changes. So, to give
yourself a chance to review the changelog before your code breaks, put the
following constraint on Lambda Soup: `lambdasoup {< "0.7.0"}`.


## Documentation

Lambda Soup's interface consists of one module `Soup`, whose signature is
documented [here][docs].
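
Because everything lives in `Soup`, the module can be opened locally instead of for a whole file. The sketch below defines a hypothetical helper `title_of` (not part of the library) and uses `$?` and `leaf_text`, which return options, rather than `$` and `R.leaf_text`, which raise on failure.

```ocaml
(* title_of is a hypothetical helper, not part of Lambda Soup. *)
let title_of (html : string) : string option =
  let open Soup in
  match parse html $? "title" with
  | None -> None                      (* no <title> element found *)
  | Some node -> leaf_text node       (* None if the element has no single text leaf *)

let () =
  match title_of "<html><head><title>Example</title></head><body></body></html>" with
  | Some t -> print_endline t
  | None -> print_endline "(no title)"
```

Returning an option keeps missing elements as ordinary values, which tends to be more convenient when scraping pages whose structure is not guaranteed.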


## Developing

See [`CONTRIBUTING`][contributing]. All feedback is welcome – open an issue on
GitHub, or send me an email at [[email protected]][email]. If you find
yourself repeatedly writing the same helper on top of Lambda Soup's functions,
perhaps we should add it to Lambda Soup.


## History

Lambda Soup was originally written to answer a [Stack Overflow question][so] in
November 2015.

[docs]: http://aantron.github.io/lambdasoup
[postprocess]: https://github.com/aantron/lambdasoup/blob/master/docs/postprocess.ml
[tests]: https://github.com/aantron/lambdasoup/blob/master/test/test.ml
[contributing]: https://github.com/aantron/lambdasoup/blob/master/docs/CONTRIBUTING.md
[email]: mailto:[email protected]
[extracss]: http://aantron.github.io/lambdasoup#VALselect
[traversals]: http://aantron.github.io/lambdasoup#2_Elementarytraversals
[combinators]: http://aantron.github.io/lambdasoup#2_Combinators
[markupml]: https://github.com/aantron/markup.ml
[so]: https://stackoverflow.com/questions/33489575/parsing-html-with-ocaml