{"id":13494735,"url":"https://github.com/ruippeixotog/scala-scraper","last_synced_at":"2025-05-14T21:05:51.396Z","repository":{"id":21810668,"uuid":"25133358","full_name":"ruippeixotog/scala-scraper","owner":"ruippeixotog","description":"A Scala library for scraping content from HTML pages","archived":false,"fork":false,"pushed_at":"2025-05-07T22:51:30.000Z","size":907,"stargazers_count":723,"open_issues_count":8,"forks_count":105,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-05-07T23:29:32.679Z","etag":null,"topics":["dsl","hacktoberfest","html-parsing","scala","scraper"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ruippeixotog.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-10-12T21:58:22.000Z","updated_at":"2025-05-07T22:51:34.000Z","dependencies_parsed_at":"2023-11-07T04:41:24.753Z","dependency_job_id":"c1fbbc8c-dec0-42fc-85a0-cfa247390248","html_url":"https://github.com/ruippeixotog/scala-scraper","commit_stats":{"total_commits":723,"total_committers":16,"mean_commits":45.1875,"dds":0.4591977869986169,"last_synced_commit":"e33b8a5abb488cb4a0552eec12a025f67dfb9006"},"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruippeixotog%2Fscala-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruippeixotog%2Fscala-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/ruippeixotog%2Fscala-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruippeixotog%2Fscala-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ruippeixotog","download_url":"https://codeload.github.com/ruippeixotog/scala-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254227611,"owners_count":22035669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dsl","hacktoberfest","html-parsing","scala","scraper"],"created_at":"2024-07-31T19:01:27.628Z","updated_at":"2025-05-14T21:05:51.366Z","avatar_url":"https://github.com/ruippeixotog.png","language":"Scala","funding_links":[],"categories":["Scala","Table of Contents","XML / HTML"],"sub_categories":["XML / HTML"],"readme":"# Scala Scraper [![Build Status](https://github.com/ruippeixotog/scala-scraper/workflows/CI/badge.svg?branch=master)](https://github.com/ruippeixotog/scala-scraper/actions?query=workflow%3ACI+branch%3Amaster) [![Coverage Status](https://coveralls.io/repos/github/ruippeixotog/scala-scraper/badge.svg?branch=master)](https://coveralls.io/github/ruippeixotog/scala-scraper?branch=master) [![Maven Central](https://img.shields.io/maven-central/v/net.ruippeixotog/scala-scraper_2.13.svg)](https://maven-badges.herokuapp.com/maven-central/net.ruippeixotog/scala-scraper_2.13) [![Join the chat at https://gitter.im/ruippeixotog/scala-scraper](https://badges.gitter.im/ruippeixotog/scala-scraper.svg)](https://gitter.im/ruippeixotog/scala-scraper)\n\nA library 
providing a DSL for loading and extracting content from HTML pages.\n\nTake a look at [Examples.scala](core/src/test/scala/net/ruippeixotog/scalascraper/Examples.scala) and at the [unit specs](core/src/test/scala/net/ruippeixotog/scalascraper) for usage examples or keep reading for more thorough documentation. Feel free to use [GitHub Issues](https://github.com/ruippeixotog/scala-scraper/issues) for submitting any bug or feature request and [Gitter](https://gitter.im/ruippeixotog/scala-scraper) to ask questions.\n\nThis README contains the following sections:\n\n- [Quick Start](#quick-start)\n- [Core Model](#core-model)\n- [Browsers](#browsers)\n- [Content Extraction](#content-extraction)\n- [Content Validation](#content-validation)\n- [Other DSL Features](#other-dsl-features)\n- [Using Browser-Specific Features](#using-browser-specific-features)\n- [Working Behind an HTTP/HTTPS Proxy](#working-behind-an-httphttps-proxy)\n- [Integration with Typesafe Config](#integration-with-typesafe-config)\n- [New Features and Migration Guide](#new-features-and-migration-guide)\n- [Copyright](#copyright)\n\n## Quick Start\n\nTo use Scala Scraper in an existing SBT project with Scala 2.13 or newer, add the following dependency to your `build.sbt`:\n\n```scala\nlibraryDependencies += \"net.ruippeixotog\" %% \"scala-scraper\" % \"3.2.0\"\n```\n\nIf you are using an older version of this library, see this document for the version you're using: [1.x](https://github.com/ruippeixotog/scala-scraper/blob/v1.2.1/README.md), [0.1.2](https://github.com/ruippeixotog/scala-scraper/blob/v0.1.2/README.md), [0.1.1](https://github.com/ruippeixotog/scala-scraper/blob/v0.1.1/README.md), [0.1](https://github.com/ruippeixotog/scala-scraper/blob/v0.1/README.md).\n\nAn implementation of the `Browser` trait, such as `JsoupBrowser`, can be used to fetch HTML from the web or to parse a local HTML file or string:\n\n```scala\nimport net.ruippeixotog.scalascraper.browser.JsoupBrowser\n\nval browser = 
JsoupBrowser()\nval doc = browser.parseFile(\"core/src/test/resources/example.html\")\nval doc2 = browser.get(\"http://example.com\")\n```\n\nThe returned object is a `Document`, which already provides several methods for manipulating and querying HTML elements. For simple use cases, it can be enough. For others, this library improves the content extracting process by providing a powerful DSL.\n\nYou can open the [example.html](core/src/test/resources/example.html) file loaded above to follow the examples throughout the README.\n\nFirst of all, the DSL methods and conversions must be imported:\n\n```scala\nimport net.ruippeixotog.scalascraper.dsl.DSL._\nimport net.ruippeixotog.scalascraper.dsl.DSL.Extract._\nimport net.ruippeixotog.scalascraper.dsl.DSL.Parse._\n```\n\nContent can then be extracted using the `\u003e\u003e` extraction operator and CSS queries:\n\n```scala\nimport net.ruippeixotog.scalascraper.model._\n\n// Extract the text inside the element with id \"header\"\ndoc \u003e\u003e text(\"#header\")\n// res0: String = \"Test page h1\"\n\n// Extract the \u003cspan\u003e elements inside #menu\nval items = doc \u003e\u003e elementList(\"#menu span\")\n// items: List[Element] = List(\n//   JsoupElement(underlying = \u003cspan\u003e\u003ca href=\"#home\"\u003eHome\u003c/a\u003e\u003c/span\u003e),\n//   JsoupElement(underlying = \u003cspan\u003e\u003ca href=\"#section1\"\u003eSection 1\u003c/a\u003e\u003c/span\u003e),\n//   JsoupElement(underlying = \u003cspan class=\"active\"\u003eSection 2\u003c/span\u003e),\n//   JsoupElement(underlying = \u003cspan\u003e\u003ca href=\"#section3\"\u003eSection 3\u003c/a\u003e\u003c/span\u003e)\n// )\n\n// From each item, extract all the text inside their \u003ca\u003e elements\nitems.map(_ \u003e\u003e allText(\"a\"))\n// res1: List[String] = List(\"Home\", \"Section 1\", \"\", \"Section 3\")\n\n// From the meta element with \"viewport\" as its attribute name, extract the\n// text in the content attribute\ndoc \u003e\u003e 
attr(\"content\")(\"meta[name=viewport]\")\n// res2: String = \"width=device-width, initial-scale=1\"\n```\n\nIf the element may or may not be in the page, the `\u003e?\u003e` operator tries to extract the content and returns it wrapped in an `Option`:\n\n```scala\n// Extract the element with id \"footer\" if it exists, return `None` if it\n// doesn't:\ndoc \u003e?\u003e element(\"#footer\")\n// res3: Option[Element] = Some(\n//   value = JsoupElement(\n//     underlying = \u003cdiv id=\"footer\"\u003e\n//  \u003cspan\u003eNo copyright 2014\u003c/span\u003e\n// \u003c/div\u003e\n//   )\n// )\n```\n\nWith only these two operators, some useful things can already be achieved:\n\n```scala\n// Go to a news website and extract the hyperlink inside the h1 element if it\n// exists. Follow that link and print both the article title and its short\n// description (inside \".lead\")\nfor {\n  headline \u003c- browser.get(\"http://observador.pt\") \u003e?\u003e element(\"h1 a\")\n  headlineDesc = browser.get(headline.attr(\"href\")) \u003e\u003e text(\".lead\")\n} println(\"== \" + headline.text + \" ==\\n\" + headlineDesc)\n```\n\nThe next two sections present the core classes used by this library. They are followed by a description of the full capabilities of the DSL, including parsing content after extraction, validating the contents of a page and defining custom extractors or validators.\n\n## Core Model\n\nThe library represents HTML documents and their elements by [Document](core/src/main/scala/net/ruippeixotog/scalascraper/model/Document.scala) and [Element](core/src/main/scala/net/ruippeixotog/scalascraper/model/Element.scala) objects, simple interfaces containing methods for retrieving information and navigating through the DOM.\n\n[Browser](core/src/main/scala/net/ruippeixotog/scalascraper/browser/Browser.scala) implementations are the entrypoints for obtaining `Document` instances. 
Most notably, they implement `get`, `post`, `parseFile` and `parseString` methods for retrieving documents from different sources. Depending on the browser used, `Document` and `Element` instances may have different semantics, mainly in their immutability guarantees.\n\n## Browsers\n\nThe library currently provides two built-in implementations of `Browser`:\n\n* [JsoupBrowser](core/src/main/scala/net/ruippeixotog/scalascraper/browser/JsoupBrowser.scala) is backed by [jsoup](http://jsoup.org/), a Java HTML parser library. `JsoupBrowser` provides powerful and efficient document querying, but it doesn't run JavaScript in the pages. As such, it is limited to working strictly with the HTML sent in the page source;\n* [HtmlUnitBrowser](core/src/main/scala/net/ruippeixotog/scalascraper/browser/HtmlUnitBrowser.scala) is based on [HtmlUnit](http://htmlunit.sourceforge.net), a GUI-less browser for Java programs. `HtmlUnitBrowser` thoroughly simulates a web browser, executing JavaScript code in the pages in addition to parsing HTML. It supports several compatibility modes, allowing it to emulate browsers such as Internet Explorer.\n\nDue to its speed and maturity, `JsoupBrowser` is the recommended browser to use when JavaScript execution is not needed. More information about each browser and its semantics can be found in the Scaladoc of each implementation.\n\n## Content Extraction\n\nThe `\u003e\u003e` and `\u003e?\u003e` operators shown above accept an `HtmlExtractor` as their right argument, a trait with a very simple interface:\n\n```scala\ntrait HtmlExtractor[-E \u003c: Element, +A] {\n  def extract(doc: ElementQuery[E]): A\n}\n```\n\nOne can always create a custom extractor by implementing `HtmlExtractor`. However, the DSL provides several ways to create `HtmlExtractor` instances, which should be enough in most situations. 
In general, you can use the `extractor` factory method:\n\n```\ndoc \u003e\u003e extractor(\u003ccssQuery\u003e, \u003ccontentExtractor\u003e, \u003ccontentParser\u003e)\n```\n\nWhere the arguments are:\n\n* **cssQuery**: the CSS query used to select the elements to be processed;\n* **contentExtractor**: the content to be extracted from the selected elements, e.g. the element objects themselves, their text, a specific attribute, form data;\n* **contentParser**: an optional parser for the data extracted in the step above, such as parsing numbers and dates or using regexes.\n\nThe DSL provides several `contentExtractor` and `contentParser` instances, which were imported before with `DSL.Extract._` and `DSL.Parse._`. The full list can be seen in [ContentExtractors.scala](core/src/main/scala/net/ruippeixotog/scalascraper/scraper/ContentExtractors.scala) and [ContentParsers.scala](core/src/main/scala/net/ruippeixotog/scalascraper/scraper/ContentParsers.scala).\n\nSome usage examples:\n\n```scala\n// Extract the date from the \"#date\" element\ndoc \u003e\u003e extractor(\"#date\", text, asLocalDate(\"yyyy-MM-dd\"))\n// res5: org.joda.time.LocalDate = 2014-10-26\n\n// Extract the text of all \"#mytable td\" elements and parse each of them as a number\ndoc \u003e\u003e extractor(\"#mytable td\", texts, seq(asDouble))\n// res6: IterableOnce[Double] = non-empty iterator\n\n// Extract an element \"h1\" and do no parsing (the default parsing behavior)\ndoc \u003e\u003e extractor(\"h1\", element, asIs[Element])\n// res7: Element = JsoupElement(underlying = \u003ch1\u003eTest page h1\u003c/h1\u003e)\n```\n\nWith the help of the implicit conversions provided by the DSL, we can write the most common extraction cases more succinctly:\n\n* `\u003ccssQuery\u003e` is taken as `extractor(\u003ccssQuery\u003e, elements, asIs)` (by an implicit conversion);\n* `\u003ccontentExtractor\u003e` is taken as `extractor(\":root\", \u003ccontentExtractor\u003e, asIs)` (content extractors are 
also `HtmlExtractor` instances by themselves);\n* `\u003ccontentExtractor\u003e(\u003ccssQuery\u003e)` is taken as `extractor(\u003ccssQuery\u003e, \u003ccontentExtractor\u003e, asIs)` (by an implicit conversion).\n\nBecause of that, one can write the expressions in the Quick Start section, as well as:\n\n```scala\n// Extract all the \"h3\" elements (as a lazy iterable)\ndoc \u003e\u003e \"h3\"\n// res8: ElementQuery[Element] = Iterable(\n//   JsoupElement(underlying = \u003ch3\u003eSection 1 h3\u003c/h3\u003e),\n//   JsoupElement(underlying = \u003ch3\u003eSection 2 h3\u003c/h3\u003e),\n//   JsoupElement(underlying = \u003ch3\u003eSection 3 h3\u003c/h3\u003e)\n// )\n\n// Extract all text inside this document\ndoc \u003e\u003e allText\n// res9: String = \"Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014\"\n\n// Extract the elements with class \".active\"\ndoc \u003e\u003e elementList(\".active\")\n// res10: List[Element] = List(\n//   JsoupElement(underlying = \u003cspan class=\"active\"\u003eSection 2\u003c/span\u003e)\n// )\n\n// Extract the text inside each \"p\" element\ndoc \u003e\u003e texts(\"p\")\n// res11: Iterable[String] = List(\n//   \"Some text for testing\",\n//   \"More text for testing\"\n// )\n```\n\n## Content Validation\n\nWhile scraping web pages, it is a common use case to validate if a page effectively has the expected structure. 
This library provides special support for creating and applying validations.\n\nA `HtmlValidator` has the following signature:\n\n```scala\ntrait HtmlValidator[-E \u003c: Element, +R] {\n  def matches(doc: ElementQuery[E]): Boolean\n  def result: Option[R]\n}\n```\n\nAs with extractors, the DSL provides the `validator` constructor and the `\u003e/~` operator for applying a validation to a document:\n\n```\ndoc \u003e/~ validator(\u003cextractor\u003e)(\u003cmatcher\u003e)\n```\n\nWhere the arguments are:\n\n* **extractor**: an extractor as defined in the previous section;\n* **matcher**: a function mapping the extracted content to a boolean indicating if the document is valid.\n\nThe result of a validation is an `Either[R, A]` instance, where `A` is the type of the document and `R` is the result type of the validation (which will be explained later).\n\nSome validation examples:\n\n```scala\n// Check if the title of the page is \"Test page\"\ndoc \u003e/~ validator(text(\"title\"))(_ == \"Test page\")\n// res12: Either[Unit, browser.DocumentType] = Right(\n//   value = JsoupDocument(\n//     underlying = \u003c!doctype html\u003e\n// \u003chtml lang=\"en\"\u003e\n//  \u003chead\u003e\n//   \u003cmeta charset=\"utf-8\"\u003e\n//   \u003cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1\"\u003e\n//   \u003ctitle\u003eTest page\u003c/title\u003e\n//  \u003c/head\u003e\n//  \u003cbody\u003e\n//   \u003cdiv id=\"wrapper\"\u003e\n//    \u003cdiv id=\"header\"\u003e\n//     \u003ch1\u003eTest page h1\u003c/h1\u003e\n//    \u003c/div\u003e\n//    \u003cdiv id=\"menu\"\u003e\n//     \u003cspan\u003e\u003ca href=\"#home\"\u003eHome\u003c/a\u003e\u003c/span\u003e\u003cspan\u003e\u003ca href=\"#section1\"\u003eSection 1\u003c/a\u003e\u003c/span\u003e \u003cspan class=\"active\"\u003eSection 2\u003c/span\u003e \u003cspan\u003e\u003ca href=\"#section3\"\u003eSection 3\u003c/a\u003e\u003c/span\u003e\n//    \u003c/div\u003e\n//    \u003cdiv 
id=\"content\"\u003e\n//     \u003ch2\u003eTest page h2\u003c/h2\u003e\n//     \u003cspan id=\"date\"\u003e2014-10-26\u003c/span\u003e\u003cspan id=\"datefull\"\u003e2014-10-26T12:30:05Z\u003c/span\u003e \u003cspan id=\"rating\"\u003e4.5\u003c/span\u003e \u003cspan id=\"pages\"\u003e2\u003c/span\u003e\n//     \u003csection\u003e\n//      \u003ch3\u003eSection 1 h3\u003c/h3\u003e\n//      \u003cp\u003eSome text for testing\u003c/p\u003e\n//      \u003cp\u003eMore text for testing\u003c/p\u003e\n//     \u003c/section\u003e\n//     \u003csection\u003e\n//      \u003ch3\u003eSection 2 h3\u003c/h3\u003e\n//      \u003cspan\u003eMy Form\u003c/span\u003e\n//      \u003cform id=\"myform\" action=\"submit.html\"\u003e\n//       \u003cinput type=\"text\" name=\"name\" value=\"John\"\u003e\u003cinput type=\"text\" name=\"address\"\u003e \u003cinput type=\"submit\" value=\"Submit\"\u003e \u003cspan\u003e\u003ca href=\"#\"\u003eAdd field\u003c/a\u003e\u003c/span\u003e\n//      \u003c/form\u003e\n//     \u003c/section\u003e\n//     \u003csection\u003e\n//      \u003ch3\u003eSection 3 h3\u003c/h3\u003e\n//      \u003ctable id=\"mytable\"\u003e\n//       \u003ctbody\u003e\n//        \u003ctr\u003e\n//         \u003ctd\u003e3\u003c/td\u003e\n//         \u003ctd\u003e15\u003c/td\u003e\n//         \u003ctd\u003e15\u003c/td\u003e\n//         \u003ctd\u003e1\u003c/td\u003e\n//        \u003c/tr\u003e\n//       \u003c/tbody\u003e\n//      \u003c/table\u003e\n//     \u003c/section\u003e\n// ...\n\n// Check if there are at least 3 \".active\" elements\ndoc \u003e/~ validator(\".active\")(_.size \u003e= 3)\n// res13: Either[Unit, browser.DocumentType] = Left(value = ())\n\n// Check if the text in \"#mytable\" contains the word \"blue\"\ndoc \u003e/~ validator(allText(\"#mytable\"))(_.contains(\"blue\"))\n// res14: Either[Unit, browser.DocumentType] = Left(value = ())\n```\n\nWhen a document fails a validation, it may be useful to identify the problem by pattern-matching it against common 
scraping pitfalls, such as a login page that appears unexpectedly because of an expired cookie, dynamic content that disappeared or server-side errors. If we define validators for both the success case and error cases:\n\n```scala\nval succ = validator(text(\"title\"))(_ == \"My Page\")\n\nval errors = Seq(\n  validator(allText(\".msg\"), \"Not logged in\")(_.contains(\"sign in\")),\n  validator(\".item\", \"Too few items\")(_.size \u003c 3),\n  validator(text(\"h1\"), \"Internal Server Error\")(_.contains(\"500\")))\n```\n\nThey can be used in combination to create more informative validations:\n\n```scala\ndoc \u003e/~ (succ, errors)\n// res15: Either[String, browser.DocumentType] = Left(value = \"Too few items\")\n```\n\nValidators matching errors were constructed above using an additional `result` parameter after the extractor. That value is returned wrapped in a `Left` if that particular error occurs during a validation.\n\n## Other DSL Features\n\nAs shown before in the Quick Start section, one can test whether an extractor works on a page and obtain the extracted content wrapped in an `Option`:\n\n```scala\n// Try to extract an element with id \"optional\", return `None` if none exists\ndoc \u003e?\u003e element(\"#optional\")\n// res16: Option[Element] = None\n```\n\nNote that when using `\u003e?\u003e` with content extractors that return sequences, such as `texts` and `elements`, `None` will never be returned (`Some(Seq())` will be returned instead).\n\nIf you want to use multiple extractors on a single document or element, you can pass tuples or triples to `\u003e\u003e`:\n\n```scala\n// Extract the text of the title element and all inputs of #myform\ndoc \u003e\u003e (text(\"title\"), elementList(\"#myform input\"))\n// res17: (String, List[Element]) = (\n//   \"Test page\",\n//   List(\n//     JsoupElement(underlying = \u003cinput type=\"text\" name=\"name\" value=\"John\"\u003e),\n//     JsoupElement(underlying = \u003cinput type=\"text\" 
name=\"address\"\u003e),\n//     JsoupElement(underlying = \u003cinput type=\"submit\" value=\"Submit\"\u003e)\n//   )\n// )\n```\n\nThe extraction operators work on `List`, `Option`, `Either` and other instances for which a [Scalaz](https://github.com/scalaz/scalaz) `Functor` instance exists. The extraction occurs by mapping over the functors:\n\n```scala\n// Extract the titles of all documents in the list\nList(doc, doc) \u003e\u003e text(\"title\")\n// res18: List[String] = List(\"Test page\", \"Test page\")\n\n// Extract the title if the document is a `Some`\nOption(doc) \u003e\u003e text(\"title\")\n// res19: Option[String] = Some(value = \"Test page\")\n```\n\nYou can apply other extractors and validators to the result of an extraction, which is particularly powerful combined with the feature shown above:\n\n```scala\n// From the \"#menu\" element, extract the text in the \".active\" element inside\ndoc \u003e\u003e element(\"#menu\") \u003e\u003e text(\".active\")\n// res20: String = \"Section 2\"\n\n// Same as above, but in a scenario where \"#menu\" can be absent\ndoc \u003e?\u003e element(\"#menu\") \u003e\u003e text(\".active\")\n// res21: Option[String] = Some(value = \"Section 2\")\n\n// Same as above, but check if the \"#menu\" has any \"span\" element before\n// extracting the text\ndoc \u003e?\u003e element(\"#menu\") \u003e/~ validator(\"span\")(_.nonEmpty) \u003e\u003e text(\".active\")\n// res22: Option[Either[Unit, String]] = Some(\n//   value = Right(value = \"Section 2\")\n// )\n\n// Extract the links inside all the \"#menu \u003e span\" elements\ndoc \u003e\u003e elementList(\"#menu \u003e span\") \u003e?\u003e attr(\"href\")(\"a\")\n// res23: List[Option[String]] = List(\n//   Some(value = \"#home\"),\n//   Some(value = \"#section1\"),\n//   None,\n//   Some(value = \"#section3\")\n// )\n```\n\nThis library also provides a `Functor` for `HtmlExtractor`, making it possible to map over extractors and create chained extractors that can be 
passed around and stored like objects. For example, new extractors can be defined like this:\n\n```scala\nimport net.ruippeixotog.scalascraper.scraper.HtmlExtractor\n\n// An extractor for a list with the first link found in each \"span\" element\nval spanLinks: HtmlExtractor[Element, List[Option[String]]] =\n  elementList(\"span\") \u003e?\u003e attr(\"href\")(\"a\")\n\n// An extractor for the number of \"span\" elements that actually have links\nval spanLinksCount: HtmlExtractor[Element, Int] =\n  spanLinks.map(_.flatten.length)\n```\n\nYou can also \"prepend\" a query to any existing extractor by using its `mapQuery` method:\n\n```scala\n// An extractor for `spanLinks` that are inside \"#menu\"\nval menuLinks: HtmlExtractor[Element, List[Option[String]]] =\n  spanLinks.mapQuery(\"#menu\")\n```\n\nAnd they can be used just as extractors created using other means provided by the DSL:\n\n```scala\ndoc \u003e\u003e spanLinks\n// res24: List[Option[String]] = List(\n//   Some(value = \"#home\"),\n//   Some(value = \"#section1\"),\n//   None,\n//   Some(value = \"#section3\"),\n//   None,\n//   None,\n//   None,\n//   None,\n//   None,\n//   Some(value = \"#\"),\n//   None\n// )\n\ndoc \u003e\u003e spanLinksCount\n// res25: Int = 4\n\ndoc \u003e\u003e menuLinks\n// res26: List[Option[String]] = List(\n//   Some(value = \"#home\"),\n//   Some(value = \"#section1\"),\n//   None,\n//   Some(value = \"#section3\")\n// )\n```\n\nJust remember that you can only apply extraction operators `\u003e\u003e` and `\u003e?\u003e` to documents, elements or functors \"containing\" them, which means that the following is a compile-time error:\n\n```scala\n// The `texts` extractor extracts a list of strings and extractors cannot be\n// applied to strings\ndoc \u003e\u003e texts(\"#menu \u003e span\") \u003e\u003e \"a\"\n// error: value \u003e\u003e is not a member of Iterable[String]\n```\n\nFinally, if you prefer not using operators for the sake of code legibility, you can use 
alternative methods:\n\n```scala\n// `extract` is the same as `\u003e\u003e`\ndoc extract text(\"title\")\n// res28: String = \"Test page\"\n\n// `tryExtract` is the same as `\u003e?\u003e`\ndoc tryExtract element(\"#optional\")\n// res29: Option[Element] = None\n\n// `validateWith` is the same as `\u003e/~`\ndoc validateWith (succ, errors)\n// res30: Either[String, browser.DocumentType] = Left(value = \"Too few items\")\n```\n\n## Using Browser-Specific Features\n\n_NOTE: this feature is in a beta stage. Please expect API changes in future releases._\n\nAt this moment, Scala Scraper is focused on providing a DSL for querying documents efficiently and elegantly. Therefore, it doesn't support directly modifying the DOM or executing actions such as clicking an element. However, since version 2.0.0 a new typed element API allows users to interact directly with the data structures of the underlying `Browser` implementation.\n\nFirst of all, make sure your `Browser` instance has a concrete type, like `HtmlUnitBrowser`:\n\n```scala\nimport net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser\nimport net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser._\n\n// the `typed` method on the companion object of a `Browser` returns instances\n// with their concrete type\nval typedBrowser: HtmlUnitBrowser = HtmlUnitBrowser.typed()\n\nval typedDoc: HtmlUnitDocument = typedBrowser.parseFile(\"core/src/test/resources/example.html\")\n```\n\nNote that the `val` declarations are explicitly typed for explanation purposes only; the methods work just as well when types are inferred.\n\nThe content extractors `pElement`, `pElements` and `pElementList` are a special kind of extractor: they are polymorphic. They work just like their non-polymorphic counterparts `element`, `elements` and `elementList`, but they propagate the concrete types of the elements if the document or element being extracted also has a concrete type. 
For example:\n\n```scala\n// extract the \"a\" inside the second child of \"#menu\"\nval aElem = typedDoc \u003e\u003e pElement(\"#menu span:nth-child(2) a\")\n// aElem: HtmlUnitElement = HtmlUnitElement(\n//   underlying = HtmlAnchor[\u003ca href=\"#section1_2\"\u003e]\n// )\n```\n\nNote that extracting using CSS queries also keeps the concrete types of the elements:\n\n```scala\n// same thing as above\ntypedDoc \u003e\u003e \"#menu\" \u003e\u003e \"span:nth-child(2)\" \u003e\u003e \"a\" \u003e\u003e pElement\n// res31: pElement.Out[HtmlUnitElement] = HtmlUnitElement(\n//   underlying = HtmlAnchor[\u003ca href=\"#section1_2\"\u003e]\n// )\n```\n\nConcrete element types, like `HtmlUnitElement`, expose a public `underlying` field with the underlying element object used by the browser backend. In the case of HtmlUnit, that would be a [`DomElement`](http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomElement.html), which exposes a whole new range of operations:\n\n```scala\n// extract the current \"href\" this \"a\" element points to\naElem \u003e\u003e attr(\"href\")\n// res32: String = \"#section1\"\n\n// use `underlying` to update the \"href\" attribute\naElem.underlying.setAttribute(\"href\", \"#section1_2\")\n\n// verify that \"href\" was updated\naElem \u003e\u003e attr(\"href\")\n// res34: String = \"#section1_2\"\n\n// get the location of the document (without the host and the full path parts)\ntypedDoc.location.split(\"/\").last\n// res35: String = \"example.html\"\n\ndef click(elem: HtmlUnitElement): Unit = {\n  // the type param may be needed, as the original API uses Java wildcards\n  elem.underlying.click[org.htmlunit.Page]()\n}\n\n// simulate a click on our recently modified element\nclick(aElem)\n\n// check the new location\ntypedDoc.location.split(\"/\").last\n// res37: String = \"example.html#section1_2\"\n```\n\nUsing the typed element API provides much more flexibility when more than querying elements is required. 
However, one should avoid using it unless strictly necessary, as:\n\n* It binds code to specific `Browser` implementations, making it more difficult to change implementations later;\n* The code becomes subject to changes in the API of the underlying library;\n* It's heavier on the Scala type system and it is not as mature, leading to possible unexpected compilation errors. If that happens, please file an issue!\n\n## Working Behind an HTTP/HTTPS Proxy\n\nIf you are behind an HTTP or SOCKS proxy, you can configure `Browser` implementations to make connections through it by either using the browser's appropriate constructor (implementation-dependent) or by calling `withProxy` on any browser instance:\n\n```scala\nimport net.ruippeixotog.scalascraper.browser.Proxy\n\nval browser2 = JsoupBrowser().withProxy(Proxy(\"example.com\", 7000, Proxy.SOCKS))\n```\n\n## Integration with Typesafe Config\n\nThe [Scala Scraper Config module](modules/config/README.md) can be used to load extractors and validators from config files.\n\n## New Features and Migration Guide\n\nThe [CHANGELOG](CHANGELOG.md) is kept updated with the bug fixes and new features of each version. When there are breaking changes, they are listed there together with suggestions for migrating old code.\n\n## Copyright\n\nCopyright (c) 2014-2022 Rui Gonçalves. See LICENSE for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruippeixotog%2Fscala-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruippeixotog%2Fscala-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruippeixotog%2Fscala-scraper/lists"}