Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aantron/lambdasoup
Functional HTML scraping and rewriting with CSS in OCaml
css html ocaml scraping soup
- Host: GitHub
- URL: https://github.com/aantron/lambdasoup
- Owner: aantron
- License: mit
- Created: 2015-11-11T22:09:13.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2024-11-18T19:27:20.000Z (about 1 month ago)
- Last Synced: 2024-11-24T16:59:06.756Z (about 1 month ago)
- Topics: css, html, ocaml, scraping, soup
- Language: OCaml
- Homepage: https://aantron.github.io/lambdasoup
- Size: 584 KB
- Stars: 384
- Watchers: 12
- Forks: 31
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE.md
Awesome Lists containing this project
- awesome-list - lambdasoup
README
# Lambda Soup

[coveralls]: https://coveralls.io/github/aantron/lambdasoup?branch=master
[coveralls-img]: https://img.shields.io/coveralls/aantron/lambdasoup/master.svg

**Lambda Soup** is a functional HTML scraping and manipulation library for OCaml
aimed at being easy to use.
[sample]: https://raw.githubusercontent.com/aantron/lambdasoup/master/docs/sample.gif
Lambda Soup is *simple*. It provides a set of
[elementary traversals][traversals] for getting from node to node, familiar
functional [combinators][combinators] such as `filter`, `map`, and `fold`, and
support for all CSS selectors that still make sense when not running in a
browser (and a few obvious [extensions][extracss] on top of that).

Here is a trivial self-contained example:
```ocaml
(parse "<p class='Hello'>World!</p>") $ ".Hello" |> R.leaf_text;;
- : string = "World!"
```

And, a mutation:
```ocaml
let soup = parse "<p class='Hello'>World!</p>" in
wrap (soup $ ".Hello" |> R.child) (create_element "strong");
soup |> to_string;;
- : string = "<p class=\"Hello\"><strong>World!</strong></p>"
```

For some more examples, see the Lambda Soup [postprocessor][postprocess] that
runs on Lambda Soup's own [documentation][docs] after it is generated by
`ocamldoc`.

The library is [tested][tests] thoroughly.
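
The traversals, combinators, and selectors mentioned earlier can be combined in a standalone program roughly as follows. This is a minimal sketch, not code from the Lambda Soup repository: the sample HTML, the `is_absolute` helper, the function names, and the build command in the first comment are all assumptions made for illustration.

```ocaml
(* Build (assumption): ocamlfind ocamlopt -package lambdasoup -linkpkg links.ml *)
open Soup

(* Hypothetical sample document, not taken from the README. *)
let html =
  {|<ul id="links">
  <li><a href="https://ocaml.org">OCaml</a></li>
  <li><a href="/local">Local page</a></li>
  <li><a>No target</a></li>
</ul>|}

(* Hypothetical helper: treats anything starting with "http" as absolute. *)
let is_absolute url =
  String.length url >= 4 && String.sub url 0 4 = "http"

(* CSS selection, then the [filter] combinator, then plain list processing. *)
let absolute_links soup =
  soup $$ "ul#links a[href]"
  |> filter (fun a -> is_absolute (R.attribute "href" a))
  |> to_list
  |> List.map (R.attribute "href")

(* Elementary traversal: direct children of the list, elements only,
   counted with [fold]. *)
let item_count soup =
  soup $ "#links" |> children |> elements |> fold (fun n _ -> n + 1) 0

(* Rewriting: tag each absolute link with rel="external". The nodes are
   collected into a list first, so the document is not traversed lazily
   while it is being modified. *)
let mark_external soup =
  soup $$ "a[href]"
  |> to_list
  |> List.filter (fun a -> is_absolute (R.attribute "href" a))
  |> List.iter (set_attribute "rel" "external")

let () =
  let soup = parse html in
  List.iter print_endline (absolute_links soup);
  Printf.printf "%d list items\n" (item_count soup);
  mark_external soup;
  print_endline (to_string soup)
```

`$` returns the first match and `$$` returns all matches, so the sketch uses `$` where exactly one element is expected and `$$` where a sequence is wanted.
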
Lambda Soup is based on [Markup.ml][markupml]. As a consequence, it resolves
entity references, detects character encodings automatically, and converts
everything to UTF-8. And, you can use Lambda Soup on XML, by
[parsing][parse_xml] the XML with Markup.ml and [feeding][from_signals] the
signals to Lambda Soup.

[parse_xml]: http://aantron.github.io/markup.ml/#VALparse_xml
[from_signals]: http://aantron.github.io/lambdasoup/#2_Parsingsignals
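
The XML route described above might look like the following sketch. It assumes the `markup` package is available alongside `lambdasoup`; the sample XML, the `soup_of_xml` and `titles` names, and the build command are hypothetical.

```ocaml
(* Build (assumption):
   ocamlfind ocamlopt -package lambdasoup,markup -linkpkg xml.ml *)

(* Parse XML with Markup.ml, then feed the resulting signal stream to
   Lambda Soup. *)
let soup_of_xml text =
  Markup.string text |> Markup.parse_xml |> Markup.signals |> Soup.from_signals

(* Hypothetical sample data. *)
let xml =
  {|<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>|}

(* Once the tree is built, the usual selectors apply. *)
let titles soup =
  let open Soup in
  soup $$ "book > title" |> to_list |> List.map R.leaf_text

let () = List.iter print_endline (titles (soup_of_xml xml))
```
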
## Installing
    opam install lambdasoup
[contributing-install]: https://github.com/aantron/lambdasoup/blob/master/docs/CONTRIBUTING.md#developing
## Starting from scratch
To use Lambda Soup interactively as in the GIF at the top of this README, you
need to have done something like this:

```sh
your-package-manager install ocaml opam
opam init
eval $(opam env) # Or restart your shell
opam install lambdasoup
```

and make sure your `~/.ocamlinit` file looks something like this:
```ocaml
let () =
  try Topdirs.dir_directory (Sys.getenv "OCAML_TOPLEVEL_PATH")
  with Not_found -> ()
;;

#use "topfind";;
```

Then, run `ocaml -short-paths` to start the top-level, and scrape away!
## Depending
Lambda Soup uses semantic versioning, but is currently in `0.x.x`. For now, the
minor version number will be incremented on breaking changes. So, to give
yourself a chance to review the changelog before your code breaks, put the
following constraint on Lambda Soup: `lambdasoup {< "0.7.0"}`.
## Documentation
Lambda Soup's interface consists of one module `Soup`, whose signature is
documented [here][docs].
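
As a rough orientation before reading the signature: many `Soup` functions return `option` values, while the `R` submodule used in the examples above (`R.leaf_text`, `R.child`) provides variants that raise when nothing is found. The sketch below uses only the option-returning forms; the function names and sample inputs are hypothetical.

```ocaml
(* Build (assumption): ocamlfind ocamlopt -package lambdasoup -linkpkg title.ml *)
open Soup

(* [$?] yields [None] when no element matches the selector, and
   [leaf_text] yields a [string option]. *)
let page_title html =
  match parse html $? "title" with
  | None -> None
  | Some title -> leaf_text title

(* [attribute] also returns an option, so a missing attribute is not an
   error. *)
let first_link_target html =
  match parse html $? "a" with
  | None -> None
  | Some a -> attribute "href" a

let () =
  let show = function Some s -> s | None -> "(none)" in
  print_endline
    (show (page_title
       "<html><head><title>Hi</title></head><body></body></html>"));
  print_endline (show (first_link_target "<a>no target</a>"))
```

The option-returning forms suit input that may be malformed; the `R` variants keep toplevel experiments short when a missing node would be a bug anyway.
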
## Developing
See [`CONTRIBUTING`][contributing]. All feedback is welcome – open an issue on
GitHub, or send me an email at [[email protected]][email]. If you find
yourself repeatedly writing the same helper on top of Lambda Soup's functions,
perhaps we should add it to Lambda Soup.
## History
Lambda Soup was originally written to answer a [Stack Overflow question][so] in
November 2015.

[docs]: http://aantron.github.io/lambdasoup
[postprocess]: https://github.com/aantron/lambdasoup/blob/master/docs/postprocess.ml
[tests]: https://github.com/aantron/lambdasoup/blob/master/test/test.ml
[contributing]: https://github.com/aantron/lambdasoup/blob/master/docs/CONTRIBUTING.md
[email]: mailto:[email protected]
[extracss]: http://aantron.github.io/lambdasoup#VALselect
[traversals]: http://aantron.github.io/lambdasoup#2_Elementarytraversals
[combinators]: http://aantron.github.io/lambdasoup#2_Combinators
[markupml]: https://github.com/aantron/markup.ml
[so]: https://stackoverflow.com/questions/33489575/parsing-html-with-ocaml