Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/lysabrina/web-scraping-practice

Collection of projects to learn web-scraping with Java & Spring-boot
bootstrap dom html java react spring-boot web-scraping

Host: GitHub
URL: https://github.com/lysabrina/web-scraping-practice
Owner: LySabrina
Created: 2023-03-25T04:35:40.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2023-05-04T19:15:16.000Z (almost 2 years ago)
Last Synced: 2024-11-08T08:49:21.518Z (3 months ago)
Topics: bootstrap, dom, html, java, react, spring-boot, web-scraping
Language: Java
Homepage:
Size: 18.5 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Description

Collection of several web-scraping practice projects.

Two dedicated sites, built specifically to be scraped, come from: http://toscrape.com/

The projects also serve as practice with Spring Boot and React.

# Notes

Class Document (package org.jsoup.nodes)

- represents the entire HTML or XML document

- provides access to the document data

Jsoup

- a Java library for working with real-world HTML

- has an API for fetching URLs and for extracting and manipulating data using HTML5 DOM methods and CSS selectors

Class Jsoup (documentation: https://jsoup.org/apidocs/org/jsoup/Jsoup.html)

- the main entry point to the library

- all of its methods are static utility methods, such as Jsoup.connect(url) and Jsoup.parse(html)

Interface Connection

- Jsoup.connect(url) creates a new Connection (session) with the defined request URL

- used to fetch and parse an HTML page

- its get() method executes the request as a GET and parses the response into a Document
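The Connection workflow described above can be sketched roughly as follows. This is only an illustration and assumes jsoup is on the classpath. The class name `ConnectionSketch`, the `USER_AGENT` string, and the target URL (a sandbox hosted under toscrape.com) are hypothetical choices, not taken from the repository. Request settings such as headers and the timeout are configured on the Connection before `get()` sends anything.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConnectionSketch {

    // Hypothetical User-Agent string; substitute one that honestly describes your client.
    static final String USER_AGENT = "web-scraping-practice/0.1 (learning project)";

    // Builds a Connection with explicit request settings. No request is sent yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)                       // sets the User-Agent header
                .header("Accept-Language", "en-US,en;q=0.9") // any other request header
                .timeout(10_000);                            // timeout in milliseconds
    }

    // Executes the request as a GET and parses the response into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        // Assumed target: one of the practice sites hosted under toscrape.com.
        Document doc = fetch("http://quotes.toscrape.com/");
        System.out.println(doc.title());
    }
}
```

Keeping `prepare` separate from `fetch` makes it possible to inspect the configured request (for example via `prepare(url).request()`) without touching the network.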

Class Element

- represents a single HTML element (tag name, attributes, and child nodes)

- can be manipulated with methods similar to JavaScript's DOM node methods

Class Elements

- extends ArrayList<Element>

- hence an Elements object holds a collection of Element objects
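To show how Document, Element, and Elements fit together, here is a minimal sketch that parses an inline HTML string, so it runs without network access. It assumes jsoup is on the classpath. The class name `JsoupNotesSketch`, the `extractTexts` helper, and the sample markup are hypothetical and not part of the repository.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class JsoupNotesSketch {

    // Hypothetical inline page, used so the sketch runs offline.
    static final String SAMPLE_HTML =
        "<html><head><title>Sample</title></head><body>"
        + "<div class=\"quote\"><span class=\"text\">First quote</span></div>"
        + "<div class=\"quote\"><span class=\"text\">Second quote</span></div>"
        + "</body></html>";

    // Parses the HTML and returns the text of every element matching the CSS selector.
    static List<String> extractTexts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);         // Document: the whole parsed page
        Elements matches = doc.select(cssQuery);  // Elements: an ArrayList<Element>
        List<String> texts = new ArrayList<>();
        for (Element match : matches) {           // Element: one HTML element
            texts.add(match.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(doc.title());
        System.out.println(extractTexts(SAMPLE_HTML, "div.quote span.text"));
    }
}
```

Because Elements is an ArrayList, the usual collection operations (for-each loops, `size()`, `get(i)`, streams) work on a selection result directly.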

## Anti-Scraping Systems

Some anti-scraping systems block HTTP requests that lack the headers a browser would normally send.

How to avoid being blocked:

1) Always set the User-Agent header (identifies the application, OS, and vendor)