{"id":20709979,"url":"https://github.com/oxylabs/web-scraping-with-java","last_synced_at":"2025-12-26T03:12:10.386Z","repository":{"id":134336687,"uuid":"526125120","full_name":"oxylabs/web-scraping-with-java","owner":"oxylabs","description":"Web Scraping With Java. Let’s examine this library to create a Java website scraper.","archived":false,"fork":false,"pushed_at":"2024-04-19T11:19:11.000Z","size":16,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-17T21:07:21.291Z","etag":null,"topics":["java-web-scraper","jsoup-library","node-js","node-scraper","nodejs","web-scraping-with-java"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-08-18T08:33:44.000Z","updated_at":"2024-12-31T08:16:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"6091eae3-758f-43d5-9299-90addb584bcd","html_url":"https://github.com/oxylabs/web-scraping-with-java","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-with-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-with-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-with-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-with-java/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/web-scraping-with-java/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242980783,"owners_count":20216285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java-web-scraper","jsoup-library","node-js","node-scraper","nodejs","web-scraping-with-java"],"created_at":"2024-11-17T02:09:26.556Z","updated_at":"2025-12-26T03:12:10.376Z","avatar_url":"https://github.com/oxylabs.png","language":null,"readme":"# Web Scraping With Java\n\n[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=web-scraping-with-java-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n## Getting JSoup\n\nThe first step of  web scraping with Java is to get the Java libraries. Maven can help here. Use any Java IDE, and create a Maven project. 
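For example, a Maven project can be generated from the command line as well as from an IDE. A sketch of the standard quickstart invocation (the `groupId` and `artifactId` values below are illustrative placeholders, not from this tutorial):

```shell
# Generate a minimal Maven project from the quickstart archetype.
# groupId/artifactId are placeholders -- substitute your own.
mvn archetype:generate \
  -DgroupId=com.example.scraper \
  -DartifactId=java-scraper \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DinteractiveMode=false
```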
If you do not want to use Maven, head over to this page to find alternate downloads.\n\nIn the `pom.xml` (Project Object Model) file, add a new section for dependencies and add a dependency for JSoup. The `pom.xml` file would look something like this:\n\n```xml\n\u003cdependencies\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.jsoup\u003c/groupId\u003e\n        \u003cartifactId\u003ejsoup\u003c/artifactId\u003e\n        \u003cversion\u003e1.14.1\u003c/version\u003e\n    \u003c/dependency\u003e\n\u003c/dependencies\u003e\n```\n\nWith this, we are ready to create a Java scraper.\n\n## Getting and parsing the HTML\n\nThe second step of web scraping with Java is to get the HTML from the target URL and parse it into a Java object. Let’s begin with the imports:\n\n```java\nimport java.io.IOException;\n\nimport org.jsoup.Connection;\nimport org.jsoup.Jsoup;\nimport org.jsoup.nodes.Document;\nimport org.jsoup.nodes.Element;\nimport org.jsoup.select.Elements;\n```\n\nNote that it is not a good practice to import everything with a wildcard – `import org.jsoup.*`. Always import exactly what you need. The above imports are what we are going to use in this Java web scraping tutorial; `java.io.IOException` is needed for the error handling shown below.\n\nJSoup provides the `connect` method, which takes the URL and returns a `Connection` object; calling `get()` on it downloads and parses the page into a `Document`. Here is how you can get the page’s HTML:\n\n```java\nDocument doc = Jsoup.connect(\"https://en.wikipedia.org/wiki/Jsoup\").get();\n```\n\nYou will often see this one-liner in examples, but it has a disadvantage: there is no error handling. A better approach is to create a function that takes a URL as its parameter. First, it creates a connection and stores it in a variable. After that, the `get()` method of the connection object is called to retrieve the HTML document. This document is returned as an instance of the `Document` class. 
The `get()` method can throw an `IOException`, which needs to be handled.\n\n```java\npublic static Document getDocument(String url) {\n    Connection conn = Jsoup.connect(url);\n    Document document = null;\n    try {\n        document = conn.get();\n    } catch (IOException e) {\n        // Log the error, or rethrow as appropriate for your application\n        e.printStackTrace();\n    }\n    return document;\n}\n```\n\nIn some instances, you would need to pass a custom user agent. This can be done by passing the user agent string to the `userAgent()` method before calling `get()`.\n\n```java\nConnection conn = Jsoup.connect(url);\nconn.userAgent(\"custom user agent\");\ndocument = conn.get();\n```\n\nSetting a realistic user agent resolves many common access problems.\n\n## Querying HTML\n\nThe most crucial step of building any Java web scraper is to query the HTML `Document` object for the desired data. This is where you will spend most of your time while writing the web scraper in Java.\n\nJSoup supports many ways to extract the desired elements. Methods such as `getElementById`, `getElementsByTag`, etc. make it easy to query the DOM.\n\nHere is an example of navigating to the JSoup page on Wikipedia. Right-click the heading and select Inspect, thus opening the developer tool with the heading selected.\n\n![](https://images.prismic.io/oxylabs-sm/MjdmZDQ4NmEtNWNjOC00ZTJhLWEzNzctYWEzZDdjNmE2MTdh_getelementbyclass-1.png?auto=compress,format\u0026rect=0,0,1301,662\u0026w=1301\u0026h=662\u0026fm=webp\u0026dpr=2\u0026q=50)\n\nIn this case, either `getElementById` or `getElementsByClass` can be used. One important point to note here is that `getElementById` (note the singular `Element`) returns one `Element` object, whereas `getElementsByClass` (note the plural `Elements`) returns an `ArrayList` of `Element` objects.\n\nConveniently, this library has a class `Elements` that extends `ArrayList\u003cElement\u003e`. 
This makes code cleaner and provides more functionality.\n\nIn the code example below, the `first()` method can be used to get the first element from the `ArrayList`. After getting the reference of the element, the `text()` method can be called to get the text.\n\n```java\nElement firstHeading = document.getElementsByClass(\"firstHeading\").first();\nSystem.out.println(firstHeading.text());\n```\n\nThese methods work well; however, they are specific to JSoup. For most cases, the `select()` method can be a better choice. The only case where the select methods will not work is when you need to traverse up the document. In those cases, you may want to use `parent()`, `children()`, and `child()`. For a complete list of all the available methods, visit this page.\n\nThe following code demonstrates how to use the `selectFirst()` method, which returns the first match.\n\n```java\nElement firstHeading = document.selectFirst(\".firstHeading\");\n```\n\nIn this example, the `selectFirst()` method was used. If multiple elements need to be selected, you can use the `select()` method. It takes a CSS selector as a parameter and returns an instance of `Elements`, which is an extension of the type `ArrayList\u003cElement\u003e`.\n\n## Getting HtmlUnit\n\nThe HtmlUnit part of this tutorial also starts by getting the Java libraries, and Maven can help here too. Create a new Maven project or use the one created in the previous section. If you do not want to use Maven, head over to this page to find alternate downloads.\n\nIn the `pom.xml` file, add a new section for `dependencies` and add a dependency for HtmlUnit. 
The `pom.xml` file would look something like this:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003enet.sourceforge.htmlunit\u003c/groupId\u003e\n    \u003cartifactId\u003ehtmlunit\u003c/artifactId\u003e\n    \u003cversion\u003e2.51.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Getting the HTML\n\nThe second step of web scraping with Java is to retrieve the HTML from the target URL as a Java object. Let’s begin with the imports:\n\n```java\nimport com.gargoylesoftware.htmlunit.WebClient;\nimport com.gargoylesoftware.htmlunit.html.DomNode;\nimport com.gargoylesoftware.htmlunit.html.DomNodeList;\nimport com.gargoylesoftware.htmlunit.html.HtmlElement;\nimport com.gargoylesoftware.htmlunit.html.HtmlPage;\n```\n\nAs discussed in the previous section, it is not a good practice to use a wildcard import such as `import com.gargoylesoftware.htmlunit.html.*`. Import only what you need. The above imports are what we are going to use in this Java web scraping tutorial.\n\nIn this example, we will scrape this Librivox page.\n\nHtmlUnit uses the `WebClient` class to get the page. The first step is to create an instance of this class. In this example, there is no need for CSS rendering or JavaScript execution, so we can disable both options.\n\n```java\nWebClient webClient = new WebClient();\nwebClient.getOptions().setCssEnabled(false);\nwebClient.getOptions().setJavaScriptEnabled(false);\nHtmlPage page = webClient.getPage(\"https://librivox.org/the-first-men-in-the-moon-by-hg-wells\");\n```\n\nNote that the `getPage()` method can throw an `IOException`. 
You would need to surround it with a try-catch block.\n\nHere is one example implementation of a function that returns an instance of `HtmlPage`:\n\n```java\npublic static HtmlPage getDocument(String url) {\n    HtmlPage page = null;\n    try (final WebClient webClient = new WebClient()) {\n        webClient.getOptions().setCssEnabled(false);\n        webClient.getOptions().setJavaScriptEnabled(false);\n        page = webClient.getPage(url);\n    } catch (IOException e) {\n        e.printStackTrace();\n    }\n    return page;\n}\n```\n\nNow we can proceed with the next step.\n\n## Querying HTML\n\nThere are three categories of methods that can be used with `HtmlPage`. The first category consists of DOM methods such as `getElementById()` and `getElementByName()` that return one element. These also have counterparts, such as `getElementsById()`, that return all the matches. These methods return a `DomElement` object or a list of `DomElement` objects.\n\n```java\nHtmlPage page = webClient.getPage(\"https://en.wikipedia.org/wiki/Jsoup\");\nDomElement firstHeading = page.getElementById(\"firstHeading\");\nSystem.out.print(firstHeading.asNormalizedText()); // prints Jsoup\n```\n\nThe second category uses XPath selectors.\n\nNavigate to this page, right-click the book title and click Inspect. If you are already comfortable with XPath, you should be able to see that the XPath to select the book title would be `//div[@class=\"content-wrap clearfix\"]/h1`.\n\n![](https://images.prismic.io/oxylabs-sm/ODFjZjIwOWItMjhhMS00ZjlmLTg1NjctYmM5N2IyMzMxNDUy_selectbyxpath-1.png?auto=compress,format\u0026rect=0,0,1377,575\u0026w=1377\u0026h=575\u0026fm=webp\u0026dpr=2\u0026q=50)\n\nThere are two methods that can work with XPath: `getByXPath()`, which returns a list of all matches, and `getFirstByXPath()`, which returns the first match as an `HtmlElement` rather than a `DomElement`. 
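If you want to experiment with an XPath expression before wiring it into HtmlUnit, the JDK's built-in `javax.xml.xpath` package can evaluate the same expression against a small document. A self-contained sketch (the HTML snippet here is invented for illustration; HtmlUnit is not required):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in markup mimicking the page structure from the tutorial;
        // it is well-formed XML, so the standard DOM parser can load it.
        String html = "<div class=\"content-wrap clearfix\">"
                + "<h1>The First Men in the Moon</h1></div>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        // Evaluate the same expression the tutorial uses with HtmlUnit.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = xpath.evaluate("//div[@class=\"content-wrap clearfix\"]/h1", doc);
        System.out.println(title); // The First Men in the Moon
    }
}
```

This is only a convenience for testing expressions; real pages are rarely well-formed XML, which is exactly why HtmlUnit's lenient HTML parser is used for actual scraping.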
Note that special characters like quotation marks will need to be escaped using a backslash:\n\n```java\nHtmlElement book = page.getFirstByXPath(\"//div[@class=\\\"content-wrap clearfix\\\"]/h1\");\nSystem.out.print(book.asNormalizedText());\n```\n\nLastly, the third category of methods uses CSS selectors. These methods are `querySelector()` and `querySelectorAll()`. They return `DomNode` and `DomNodeList\u003cDomNode\u003e` respectively.\n\nTo make this Java web scraper tutorial more realistic, let’s print all the chapter names, reader names, and duration from the page. The first step is to determine the selector that can select all rows. Next, we will use the `querySelectorAll()` method to select all the rows. Finally, we will run a loop on all the rows and call `querySelector()` to extract the content of each cell.\n\n```java\nString selector = \".chapter-download tbody tr\";\nDomNodeList\u003cDomNode\u003e rows = page.querySelectorAll(selector);\nfor (DomNode row : rows) {\n    String chapter = row.querySelector(\"td:nth-child(2) a\").asNormalizedText();\n    String reader = row.querySelector(\"td:nth-child(3) a\").asNormalizedText();\n    String duration = row.querySelector(\"td:nth-child(4)\").asNormalizedText();\n    System.out.println(chapter + \"\\t \" + reader + \"\\t \" + duration);\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-with-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fweb-scraping-with-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-with-java/lists"}