{"id":26593747,"url":"https://github.com/luminati-io/jsoup-html-parsing","last_synced_at":"2026-05-05T03:31:08.800Z","repository":{"id":283784002,"uuid":"937938802","full_name":"luminati-io/jsoup-html-parsing","owner":"luminati-io","description":"How to parse HTML with jsoup in Java, covering DOM element selection methods, pagination, and advanced parsing techniques for efficient web scraping.","archived":false,"fork":false,"pushed_at":"2025-02-24T06:54:37.000Z","size":177,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-22T07:02:01.981Z","etag":null,"topics":["dom","getelementbyid","getelementsbyclassname","html","html-parsing","java","jsoup","maven","parsing","web-scraping"],"latest_commit_sha":null,"homepage":"https://brightdata.com/blog/web-data/parse-html-with-jsoup","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-24T06:49:15.000Z","updated_at":"2025-02-24T06:57:34.000Z","dependencies_parsed_at":"2025-03-22T07:02:05.794Z","dependency_job_id":"2ae481a6-5e29-4068-a179-7684835e8bc0","html_url":"https://github.com/luminati-io/jsoup-html-parsing","commit_stats":null,"previous_names":["luminati-io/jsoup-html-parsing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/jsoup-html-parsing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fjsoup-html-parsing","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/luminati-io%2Fjsoup-html-parsing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fjsoup-html-parsing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fjsoup-html-parsing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/jsoup-html-parsing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fjsoup-html-parsing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32634065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-04T10:08:07.713Z","status":"online","status_checked_at":"2026-05-05T02:00:06.033Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dom","getelementbyid","getelementsbyclassname","html","html-parsing","java","jsoup","maven","parsing","web-scraping"],"created_at":"2025-03-23T15:20:11.378Z","updated_at":"2026-05-05T03:31:08.767Z","avatar_url":"https://github.com/luminati-io.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parsing HTML With jsoup\n\n[![Promo](https://github.com/luminati-io/LinkedIn-Scraper/raw/main/Proxies%20and%20scrapers%20GitHub%20bonus%20banner.png)](https://brightdata.com/) \n\nThis guide explains how to parse HTML 
with `jsoup` in Java. You will learn how to use DOM methods, handle pagination, and optimize your parsing workflow.\n\n- [Using DOM Methods With Jsoup](#using-dom-methods-with-jsoup)\n  - [getElementById](#getelementbyid)\n  - [getElementsByTag](#getelementsbytag)\n  - [getElementsByClass](#getelementsbyclass)\n  - [getElementsByAttribute](#getelementsbyattribute)\n- [Advanced Techniques](#advanced-techniques)\n  - [CSS Selectors](#css-selectors)\n  - [Handling Pagination](#handling-pagination)\n- [Putting Everything Together](#putting-everything-together)\n\n## Getting Started\n\nThis tutorial assumes you are using [Maven](https://maven.apache.org/) for dependency management.\n\nOnce you’ve got Maven installed, create a new Java project called `jsoup-scraper`:\n\n```bash\nmvn archetype:generate -DgroupId=com.example -DartifactId=jsoup-scraper -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false\n```\n\nTo add the jsoup dependency, replace the contents of `pom.xml` with the code below:\n\n```xml\n\u003cproject xmlns=\"http://maven.apache.org/POM/4.0.0\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n  xsi:schemaLocation=\"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd\"\u003e\n  \u003cmodelVersion\u003e4.0.0\u003c/modelVersion\u003e\n  \u003cgroupId\u003ecom.example\u003c/groupId\u003e\n  \u003cartifactId\u003ejsoup-scraper\u003c/artifactId\u003e\n  \u003cpackaging\u003ejar\u003c/packaging\u003e\n  \u003cversion\u003e1.0-SNAPSHOT\u003c/version\u003e\n  \u003cname\u003ejsoup-scraper\u003c/name\u003e\n  \u003curl\u003ehttp://maven.apache.org\u003c/url\u003e\n  \u003cdependencies\u003e\n    \u003cdependency\u003e\n      \u003cgroupId\u003ejunit\u003c/groupId\u003e\n      \u003cartifactId\u003ejunit\u003c/artifactId\u003e\n      \u003cversion\u003e3.8.1\u003c/version\u003e\n      \u003cscope\u003etest\u003c/scope\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        
\u003cgroupId\u003eorg.jsoup\u003c/groupId\u003e\n        \u003cartifactId\u003ejsoup\u003c/artifactId\u003e\n        \u003cversion\u003e1.16.1\u003c/version\u003e\n    \u003c/dependency\u003e\n  \u003c/dependencies\u003e\n  \u003cproperties\u003e\n    \u003cmaven.compiler.source\u003e17\u003c/maven.compiler.source\u003e\n    \u003cmaven.compiler.target\u003e17\u003c/maven.compiler.target\u003e\n\u003c/properties\u003e\n\u003c/project\u003e\n```\n\nNow paste the code below into `App.java`:\n\n```java\npackage com.example;\n\nimport org.jsoup.Jsoup;\nimport org.jsoup.nodes.Document;\nimport org.jsoup.nodes.Element;\nimport org.jsoup.select.Elements;\n\npublic class App {\n    public static void main(String[] args) {\n\n        String url = \"https://books.toscrape.com\";\n        int pageCount = 1;\n\n        while (pageCount \u003c= 1) {\n\n            try {\n                System.out.println(\"---------------------PAGE \"+pageCount+\"--------------------------\");\n\n                //connect to a website and get its HTML\n                Document doc = Jsoup.connect(url).get();\n            \n                //print the title\n                System.out.println(\"Page Title: \" + doc.title());\n            \n                //count this page so the loop ends; the pagination code added later takes over this increment\n                pageCount++;\n            } catch (Exception e) {\n                e.printStackTrace();\n                //stop instead of retrying the same URL forever\n                break;\n            }\n        }\n        System.out.println(\"Total pages scraped: \"+(pageCount-1));\n    }\n}\n```\n\n- `Jsoup.connect(\"https://books.toscrape.com\").get()` fetches the page and returns a `Document` object that you can manipulate.\n- `doc.title()` returns the title of the HTML document, in this case: `All products | Books to Scrape - Sandbox`.\n\n## Using DOM Methods With Jsoup\n\n`jsoup` contains a variety of methods for finding elements in the DOM (Document Object Model). 
We can use any of the following to find page elements easily.\n\n- `getElementById()`: Find an element using its `id`.\n- `getElementsByClass()`: Find all elements using their CSS class.\n- `getElementsByTag()`: Find all elements using their HTML tag.\n- `getElementsByAttribute()`: Find all elements containing a certain attribute.\n\n### getElementById\n\nOn the website we are scraping, the sidebar contains a `div` with an `id` of `promotions_left`:\n\n![Inspect the sidebar](https://github.com/luminati-io/jsoup-html-parsing/blob/main/Images/Inspect-the-sidebar.png)\n\n```java\n//get by Id\nElement sidebar = doc.getElementById(\"promotions_left\");\n\nSystem.out.println(\"Sidebar: \" + sidebar);\n```\n\nThis code prints the same HTML element you see in the browser's Inspect panel.\n\n```\nSidebar: \u003cdiv id=\"promotions_left\"\u003e\n\u003c/div\u003e\n```\n\n### getElementsByTag\n\n`getElementsByTag()` finds all elements on the page with a certain tag. On this page, each book is contained in its own `article` tag:\n\n![Inspect books](https://github.com/luminati-io/jsoup-html-parsing/blob/main/Images/Inspect-books.png)\n\nThe code below returns an `Elements` collection of books, which is the starting point for extracting the rest of the data.\n\n```java\n//get by tag\nElements books = doc.getElementsByTag(\"article\");\n```\n\n### getElementsByClass\n\nLet's inspect the price of a book. 
The class is `price_color`:\n\n![Inspect price](https://github.com/luminati-io/jsoup-html-parsing/blob/main/Images/Inspect-price.png)\n\nThe code snippet below finds all elements with the `price_color` class and prints the text of the first one using `.first().text()`. Here, `book` is a single `article` element from the `books` collection:\n\n```java\nSystem.out.println(\"Price: \" + book.getElementsByClass(\"price_color\").first().text());\n```\n\n### getElementsByAttribute\n\nLet's use `getElementsByAttribute(\"href\")` to find all elements with an `href` attribute:\n\n```java\n//get by attribute\nElements hrefs = book.getElementsByAttribute(\"href\");\nSystem.out.println(\"Link: https://books.toscrape.com/\" + hrefs.first().attr(\"href\"));\n```\n\n## Advanced Techniques\n\n### CSS Selectors\n\nTo find elements by multiple criteria, let's pass CSS selectors to the `select()` method. It returns an `Elements` collection of all elements matching the selector. In the next code snippet, we use `li[class='next']` to find all `li` elements whose `class` attribute is `next`:\n\n```java\nElements nextPage = doc.select(\"li[class='next']\");\n```\n\n### Handling Pagination\n\nTo handle pagination, we start by using `nextPage.first()` to obtain the first element from the collection. We then call `getElementsByAttribute(\"href\").attr(\"href\")` on that element to extract its `href` value.\n\nFrom page 2 onward, the links no longer include the word `catalogue`, so we prepend `catalogue/` to the `href` if it does not already contain it. After that, we combine this updated link with our base URL to obtain the URL for the next page.\n\n```java\nif (!nextPage.isEmpty()) {\n    String nextUrl = nextPage.first().getElementsByAttribute(\"href\").attr(\"href\");\n    if (!nextUrl.contains(\"catalogue\")) {\n        nextUrl = \"catalogue/\"+nextUrl;\n    } \n    url = \"https://books.toscrape.com/\" + nextUrl;\n    pageCount++;\n}\n```\n\n## Putting Everything Together\n\nHere is the final Java code. To scrape more than one page, simply change the `1` in `while (pageCount \u003c= 1)`. 
For example, to scrape 4 pages, use `while (pageCount \u003c= 4)`.\n\n```java\npackage com.example;\n\nimport org.jsoup.Jsoup;\nimport org.jsoup.nodes.Document;\nimport org.jsoup.nodes.Element;\nimport org.jsoup.select.Elements;\n\npublic class App {\n    public static void main(String[] args) {\n\n        String url = \"https://books.toscrape.com\";\n        int pageCount = 1;\n\n        while (pageCount \u003c= 1) {\n\n            try {\n                System.out.println(\"---------------------PAGE \"+pageCount+\"--------------------------\");\n\n                //connect to a website and get its HTML\n                Document doc = Jsoup.connect(url).get();\n            \n                //print the title\n                System.out.println(\"Page Title: \" + doc.title());\n            \n                //get by Id\n                Element sidebar = doc.getElementById(\"promotions_left\");\n\n                System.out.println(\"Sidebar: \" + sidebar);\n\n                //get by tag\n                Elements books = doc.getElementsByTag(\"article\");\n\n                for (Element book : books) {\n                    System.out.println(\"------Book------\");\n                    System.out.println(\"Title: \" + book.getElementsByTag(\"img\").first().attr(\"alt\"));\n                    System.out.println(\"Price: \" + book.getElementsByClass(\"price_color\").first().text());\n                    System.out.println(\"Availability: \" + book.getElementsByClass(\"instock availability\").first().text());\n\n                    //get by attribute\n                    Elements hrefs = book.getElementsByAttribute(\"href\");\n                    System.out.println(\"Link: https://books.toscrape.com/\" + hrefs.first().attr(\"href\"));\n                }\n\n                //find the next button using its CSS selector\n                Elements nextPage = doc.select(\"li[class='next']\");\n                if (!nextPage.isEmpty()) {\n                    String nextUrl = 
nextPage.first().getElementsByAttribute(\"href\").attr(\"href\");\n                    if (!nextUrl.contains(\"catalogue\")) {\n                        nextUrl = \"catalogue/\"+nextUrl;\n                    } \n                    url = \"https://books.toscrape.com/\" + nextUrl;\n                    pageCount++;\n                } else {\n                    //no next button: this was the last page, so count it and stop\n                    pageCount++;\n                    break;\n                }\n\n            } catch (Exception e) {\n                e.printStackTrace();\n                //stop instead of retrying the same URL forever\n                break;\n            }\n        }\n        System.out.println(\"Total pages scraped: \"+(pageCount-1));\n    }\n}\n```\n\nCompile the code:\n\n```bash\nmvn package\n```\n\nNow you can run it:\n\n```bash\nmvn exec:java -Dexec.mainClass=\"com.example.App\"\n```\n\nHere is the output from the first page.\n\n```\n---------------------PAGE 1--------------------------\nPage Title: All products | Books to Scrape - Sandbox\nSidebar: \u003cdiv id=\"promotions_left\"\u003e\n\u003c/div\u003e\n------Book------\nTitle: A Light in the Attic\nPrice: £51.77\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html\n------Book------\nTitle: Tipping the Velvet\nPrice: £53.74\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html\n------Book------\nTitle: Soumission\nPrice: £50.10\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/soumission_998/index.html\n------Book------\nTitle: Sharp Objects\nPrice: £47.82\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/sharp-objects_997/index.html\n------Book------\nTitle: Sapiens: A Brief History of Humankind\nPrice: £54.23\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html\n------Book------\nTitle: The Requiem Red\nPrice: £22.65\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/the-requiem-red_995/index.html\n------Book------\nTitle: The Dirty Little Secrets of Getting Your Dream Job\nPrice: £33.34\nAvailability: In stock\nLink: 
https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html\n------Book------\nTitle: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull\nPrice: £17.93\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html\n------Book------\nTitle: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics\nPrice: £22.60\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html\n------Book------\nTitle: The Black Maria\nPrice: £52.15\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/the-black-maria_991/index.html\n------Book------\nTitle: Starving Hearts (Triangular Trade Trilogy, #1)\nPrice: £13.99\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html\n------Book------\nTitle: Shakespeare's Sonnets\nPrice: £20.66\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html\n------Book------\nTitle: Set Me Free\nPrice: £17.46\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/set-me-free_988/index.html\n------Book------\nTitle: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)\nPrice: £52.29\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html\n------Book------\nTitle: Rip it Up and Start Again\nPrice: £35.02\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html\n------Book------\nTitle: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991\nPrice: £57.25\nAvailability: In stock\nLink: 
https://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html\n------Book------\nTitle: Olio\nPrice: £23.88\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/olio_984/index.html\n------Book------\nTitle: Mesaerion: The Best Science Fiction Stories 1800-1849\nPrice: £37.59\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html\n------Book------\nTitle: Libertarianism for Beginners\nPrice: £51.33\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html\n------Book------\nTitle: It's Only the Himalayas\nPrice: £45.17\nAvailability: In stock\nLink: https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html\nTotal pages scraped: 1\n```\n\n## Conclusion\n\nScraping dynamic sites like product listings, news, or research data can be challenging. [Bright Data’s tools](https://brightdata.com/products) help you scale your efforts:\n\n- **[Residential Proxies](https://brightdata.com/proxy-types/residential-proxies):** Bypass IP bans and geo-restrictions.\n- **[Scraping Browser](https://brightdata.com/products/scraping-browser):** Easily handle JavaScript-heavy sites.\n- **[Ready-to-Use Datasets](https://brightdata.com/products/datasets):** Get structured data without scraping.\n\nCombine these with jsoup for efficient, low-risk data extraction. Try them for free today!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fjsoup-html-parsing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Fjsoup-html-parsing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fjsoup-html-parsing/lists"}