Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lysabrina/web-scraping-practice
Collection of projects to learn web-scraping with Java & Spring-boot
bootstrap dom html java react spring-boot web-scraping
- Host: GitHub
- URL: https://github.com/lysabrina/web-scraping-practice
- Owner: LySabrina
- Created: 2023-03-25T04:35:40.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-05-04T19:15:16.000Z (almost 2 years ago)
- Last Synced: 2024-11-08T08:49:21.518Z (3 months ago)
- Topics: bootstrap, dom, html, java, react, spring-boot, web-scraping
- Language: Java
- Homepage:
- Size: 18.5 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Description
Collection of several web-scraping practice projects.
Two of the practice targets are dedicated sites that were made to be scraped, from http://toscrape.com/

Additionally, the projects are used to practice Spring Boot + React.
# Notes
Document
- is a class in the org.jsoup.nodes package
- represents the entire HTML or XML document and provides access to the document data

Jsoup
- a library for working with real-world HTML
- has an API for fetching URLs and for extracting and manipulating data using HTML5 DOM methods and CSS selectors

Class Jsoup (documentation: https://jsoup.org/apidocs/org/jsoup/Jsoup.html)
- provides static convenience methods

Class Connection
- Jsoup.connect(url) creates a new Connection (session) with the defined request URL
- used to fetch and parse an HTML page
- its get() method executes the request as a GET and fetches the HTML document

Class Element
- represents an HTML element
- can manipulate the node with methods similar to JavaScript node manipulation in the DOM

Class Elements
- extends ArrayList<Element>
- hence Elements holds a collection of Element objects

## Anti-Scraping Systems
Some anti-scraping systems block HTTP requests that do not carry the usual HTTP headers.

How to avoid blocking techniques:
1) Always set the User-Agent header (identifies the application, OS, and vendor)
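
As a rough illustration of the User-Agent advice, here is a minimal sketch that builds a GET request with an explicit User-Agent header. It uses the JDK's own `java.net.http` API (Java 11+) rather than jsoup so that it runs without extra dependencies. The class name `UserAgentExample`, the helper `buildRequest`, and the header value are made up for this example and are not part of the repository.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class UserAgentExample {
    // Hypothetical User-Agent value, for illustration only.
    // Pick a value that honestly describes your client.
    static final String USER_AGENT =
            "Mozilla/5.0 (compatible; web-scraping-practice/1.0)";

    // Builds a GET request that carries an explicit User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://toscrape.com/");
        // Show the header that would be sent; no network call is made here.
        System.out.println(
                request.headers().firstValue("User-Agent").orElse("<missing>"));
    }
}
```

With jsoup, the same header can be set by calling `userAgent(...)` on the `Connection` returned by `Jsoup.connect(url)` before calling `get()`.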