https://github.com/lysabrina/web-scraping-practice
Collection of projects to learn web-scraping with Java & Spring-boot
bootstrap dom html java react spring-boot web-scraping
- Host: GitHub
- URL: https://github.com/lysabrina/web-scraping-practice
- Owner: LySabrina
- Created: 2023-03-25T04:35:40.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2023-05-04T19:15:16.000Z (about 3 years ago)
- Last Synced: 2024-12-31T08:45:45.092Z (over 1 year ago)
- Topics: bootstrap, dom, html, java, react, spring-boot, web-scraping
- Language: Java
- Homepage:
- Size: 18.5 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Description
Collection of several web-scraping practice projects.
The two target sites, which were built specifically to be scraped, are from: http://toscrape.com/
The projects also serve as practice with Spring Boot + React.
# Notes
Document (from the org.jsoup.nodes package)
- is a class
- represents the entire HTML or XML document and provides access to the document data
Jsoup
- a Java library for working with real-world HTML
- has an API for fetching URLs and for extracting and manipulating data using HTML5 DOM methods and CSS selectors
Class Jsoup (documentation: https://jsoup.org/apidocs/org/jsoup/Jsoup.html)
- consists of static utility methods such as Jsoup.connect and Jsoup.parse
Connection (an interface rather than a class)
- Jsoup.connect(url) creates a new Connection (session) with the defined request URL
- used to fetch and parse an HTML page
- its get() method executes the request as a GET and returns the parsed HTML Document
Class Element
- represents an HTML element
- its methods manipulate the node in much the same way as JavaScript node manipulation in the DOM
Class Elements
- extends ArrayList
- hence an Elements object holds a collection of Element objects
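
The notes above can be tied together in a small sketch showing how Document, Element and Elements relate. It parses an in-memory string with Jsoup.parse instead of fetching a page, so it needs no network access. The class name QuoteParser, the sample markup and the CSS class names (quote, text, author) are assumptions made for illustration and are not taken from the projects in this repository.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class QuoteParser {

    // Hypothetical sample markup used only for this illustration.
    static final String SAMPLE_HTML =
            "<html><head><title>Sample Quotes</title></head><body>"
            + "<div class=\"quote\"><span class=\"text\">First quote</span>"
            + "<small class=\"author\">Author A</small></div>"
            + "<div class=\"quote\"><span class=\"text\">Second quote</span>"
            + "<small class=\"author\">Author B</small></div>"
            + "</body></html>";

    // Parses the HTML and returns one "text - author" string per quote block.
    static List<String> extractQuotes(String html) {
        // Document represents the whole parsed page.
        Document doc = Jsoup.parse(html);

        // select() takes a CSS selector and returns Elements,
        // which is a list of Element objects.
        Elements quotes = doc.select("div.quote");

        List<String> result = new ArrayList<>();
        for (Element quote : quotes) {
            // Each Element can be queried again with its own selectors.
            String text = quote.select("span.text").text();
            String author = quote.select("small.author").text();
            result.add(text + " - " + author);
        }
        return result;
    }

    public static void main(String[] args) {
        extractQuotes(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

To work on a live page, the Jsoup.parse(html) call could be swapped for Jsoup.connect(url).get(), which also returns a Document.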
## Anti-Scraping Systems
Some anti-scraping systems block HTTP requests that are missing expected HTTP headers.
How to avoid being blocked:
1) Always set the User-Agent header (identifies the application, OS, and vendor)
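
As a rough sketch of setting the User-Agent header, the example below builds a GET request with explicit headers. It uses the JDK's built-in java.net.http API (Java 11+) rather than jsoup, so that it runs without extra dependencies. The request is only built and printed, never sent. The class name HeaderedRequest, the User-Agent string and the extra Accept headers are assumptions chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class HeaderedRequest {

    // Hypothetical User-Agent value; use one that honestly describes your client.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ScrapingPractice/1.0)";

    // Builds a GET request that carries a User-Agent and two common browser headers.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .header("Accept-Language", "en-US,en;q=0.9")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://toscrape.com/");
        // Print the headers that would be sent with the request.
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

With jsoup, the Connection returned by Jsoup.connect(url) offers userAgent(String) and header(String, String) for the same purpose.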