https://github.com/sazuna/zscore_scraper
Scrape a web page's main content based on the z-score (HTML tag paths with abnormally long texts will be caught by this method)
beautifulsoup scraper selenium-webdriver webscraping zscore
- Host: GitHub
- URL: https://github.com/sazuna/zscore_scraper
- Owner: Sazuna
- Created: 2024-07-23T13:17:51.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T13:31:37.000Z (12 months ago)
- Last Synced: 2025-02-23T14:22:36.189Z (5 months ago)
- Topics: beautifulsoup, scraper, selenium-webdriver, webscraping, zscore
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# zscore_scraper
Scrape a web page's main content based on the z-score (abnormally long paragraphs, compared to menus and other boilerplate, will be caught by this method).

## Usage
```
python zscore_scraper [url] [z-score|clustering]
```
- Use z-score for the best result.
- Clustering attempts to split paths into relevant and irrelevant groups based on their mean text lengths. This method is not as effective as z-score.
- Do not call the main method twice if you intend to call it from another Python program (I will fix that, I promise).

## Returns
- title (if any is found by BeautifulSoup)
- publication datetime (if any is found by BeautifulSoup)
- the page's content

Feel free to reuse this code in your projects :)
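The z-score selection described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the path strings and the `outlier_paths` helper are hypothetical, and the threshold of 1.5 is an arbitrary choice for the example.

```python
# Sketch of the z-score idea: measure the text length under each tag path,
# then keep paths whose length is abnormally high relative to the page mean.
from statistics import mean, pstdev

def outlier_paths(path_lengths, threshold=1.5):
    """Return tag paths whose text length has a z-score above `threshold`."""
    values = list(path_lengths.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # every path has the same length: no outliers
        return []
    return [path for path, n in path_lengths.items()
            if (n - mu) / sigma > threshold]

# Hypothetical per-path text lengths: menus are short, the article is long.
lengths = {"body/div/nav/a": 12, "body/div/nav/a[2]": 9,
           "body/main/article/p": 1450, "body/footer/span": 20}
print(outlier_paths(lengths))  # → ['body/main/article/p']
```

Menu links and footer text sit near the mean, so only the unusually long article paragraph exceeds the threshold.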
### Time optimization
Use curl instead of Selenium when possible, as Selenium is time-consuming. We use Selenium to accept cookie banners and to scrape page content even when it is dynamically loaded by JavaScript. We set a total timer of 0.4 s, which can be increased by the page's loading time.
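For a static page (no cookie banner, no JavaScript-rendered content), the browser can be skipped entirely: fetch the HTML once with curl or urllib and tally text lengths per tag path yourself. The sketch below uses only the standard library's `html.parser` rather than the repository's BeautifulSoup/Selenium stack, and the `PathLengths` class is an illustrative assumption, not part of this project.

```python
# Accumulate visible-text length per tag path from already-fetched HTML,
# with no browser involved. The per-path lengths feed the z-score step.
from collections import defaultdict
from html.parser import HTMLParser

class PathLengths(HTMLParser):
    """Record total text length under each tag path (e.g. 'html/body/p')."""
    def __init__(self):
        super().__init__()
        self.stack = []                   # open tags from root to current
        self.lengths = defaultdict(int)   # path -> accumulated text length

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.lengths["/".join(self.stack)] += len(text)

parser = PathLengths()
# In practice the HTML would come from curl or urllib.request.urlopen(url).
parser.feed("<html><body><nav><a>Home</a></nav><p>"
            + "x" * 500 + "</p></body></html>")
print(dict(parser.lengths))  # → {'html/body/nav/a': 4, 'html/body/p': 500}
```

This avoids browser startup and page-render waits entirely, at the cost of missing content that only appears after JavaScript runs.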