https://github.com/sazuna/zscore_scraper
Scrape a web page's main content based on the z-score (HTML tag paths with abnormally long texts will be caught by this method)
beautifulsoup scraper selenium-webdriver webscraping zscore
- Host: GitHub
- URL: https://github.com/sazuna/zscore_scraper
- Owner: Sazuna
- Created: 2024-07-23T13:17:51.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T13:31:37.000Z (12 months ago)
- Last Synced: 2025-02-23T14:22:36.189Z (5 months ago)
- Topics: beautifulsoup, scraper, selenium-webdriver, webscraping, zscore
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# zscore_scraper
Scrape a web page's main content based on the z-score (abnormally long paragraphs, compared to menus and other boilerplate, will be caught by this method).

## Usage
```
python zscore_scraper [url] [z-score|clustering]
```
- Use z-score for the best result.
- Clustering attempts to split paths into relevant and irrelevant groups based on their mean text lengths. This method is not as effective as z-score.
- Do not call the main method twice if you intend to call it from another Python program (I will fix that, I promise).

## Returns
- title (if any is found by BeautifulSoup)
- publication datetime (if any is found by BeautifulSoup)
- the page's content

Feel free to reuse this code in your projects :)
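The z-score selection described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the path strings and the `outlier_paths` helper are hypothetical, and the threshold of 1.5 is an arbitrary choice for the example.

```python
# Sketch of the z-score idea: measure the text length under each tag path,
# then keep paths whose length is abnormally high relative to the page mean.
from statistics import mean, pstdev

def outlier_paths(path_lengths, threshold=1.5):
    """Return tag paths whose text length has a z-score above `threshold`."""
    values = list(path_lengths.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # every path has the same length: no outliers
        return []
    return [path for path, n in path_lengths.items()
            if (n - mu) / sigma > threshold]

# Hypothetical per-path text lengths: menus are short, the article is long.
lengths = {"body/div/nav/a": 12, "body/div/nav/a[2]": 9,
           "body/main/article/p": 1450, "body/footer/span": 20}
print(outlier_paths(lengths))  # → ['body/main/article/p']
```

Menu links and footer text sit near the mean, so only the unusually long article paragraph exceeds the threshold.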
### Time optimization
Use curl instead of Selenium when possible, as Selenium is time-consuming. We use Selenium to accept cookie banners and to scrape page content even when it is dynamically loaded by JavaScript. We set a total timer of 0.4 s, which can be increased by the page's loading time.
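For a static page (no cookie banner, no JavaScript-rendered content), the browser can be skipped entirely: fetch the HTML once with curl or urllib and tally text lengths per tag path yourself. The sketch below uses only the standard library's `html.parser` rather than the repository's BeautifulSoup/Selenium stack, and the `PathLengths` class is an illustrative assumption, not part of this project.

```python
# Accumulate visible-text length per tag path from already-fetched HTML,
# with no browser involved. The per-path lengths feed the z-score step.
from collections import defaultdict
from html.parser import HTMLParser

class PathLengths(HTMLParser):
    """Record total text length under each tag path (e.g. 'html/body/p')."""
    def __init__(self):
        super().__init__()
        self.stack = []                   # open tags from root to current
        self.lengths = defaultdict(int)   # path -> accumulated text length

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.lengths["/".join(self.stack)] += len(text)

parser = PathLengths()
# In practice the HTML would come from curl or urllib.request.urlopen(url).
parser.feed("<html><body><nav><a>Home</a></nav><p>"
            + "x" * 500 + "</p></body></html>")
print(dict(parser.lengths))  # → {'html/body/nav/a': 4, 'html/body/p': 500}
```

This avoids browser startup and page-render waits entirely, at the cost of missing content that only appears after JavaScript runs.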