https://github.com/martincastroalvarez/python-web-crawler
Web Crawler in Python
- Host: GitHub
- URL: https://github.com/martincastroalvarez/python-web-crawler
- Owner: MartinCastroAlvarez
- Created: 2019-06-19T18:39:03.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-04-04T10:58:47.000Z (about 4 years ago)
- Last Synced: 2025-02-14T17:31:35.272Z (over 1 year ago)
- Topics: probabilistic-programming, python3, web-crawling, web-scraping
- Language: HTML
- Homepage: https://martincastroalvarez.com
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Probabilistic Web Crawler
### Installation:
```
virtualenv -p python3 env
source env/bin/activate
pip install lxml
```
### Usage:
##### Find element in all variants:
```
python3 crawl.py train/sample-0-origin.html test/sample-1-evil-gemini.html test/sample-2-container-and-clone.html test/sample-3-the-escape.html test/sample-4-the-mash.html "make-everything-ok-button"
```
Expected output:
```
'body/div/div/div[3]/div[1]/div/div[2]/a[2]' (score=0.9921029164925812)
'body/div/div/div[3]/div[1]/div/div[2]/div/a' (score=0.9952613484261197)
'body/div/div/div[3]/div[1]/div/div[3]/a' (score=0.989535451948429)
'body/div/div/div[3]/div[1]/div/div[3]/a' (score=0.9940386760307208)
```
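If another script needs to consume these results, each output line can be split into a path and a score. The following is a minimal sketch, assuming the output format is exactly as shown above. `parse_result` and `RESULT_LINE` are hypothetical names used for illustration and are not part of this repository.

```python
import re

# Matches lines such as: 'body/div/a' (score=0.99)
# The pattern is an assumption based on the sample output in this README.
RESULT_LINE = re.compile(r"^'(?P<path>[^']*)' \(score=(?P<score>[0-9.eE+-]+)\)$")


def parse_result(line):
    """Return (path, score) for one output line, or None if it does not match."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("path"), float(match.group("score"))


if __name__ == "__main__":
    sample = "'body/div/div/div[3]/div[1]/div/div[2]/a[2]' (score=0.9921029164925812)"
    print(parse_result(sample))
```

Lines that do not follow the expected format return `None`, so callers can skip warnings or blank lines without raising an exception.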
##### Find element in one variant:
```
python3 crawl.py train/sample-0-origin.html test/sample-1-evil-gemini.html make-everything-ok-button
```
Expected output:
```
'body/div/div/div[3]/div[1]/div/div[2]/a[2]' (score=0.9921029164925812)
```
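This README does not document how the score is computed. Purely as an illustration of the general idea (find an element by id in the origin page, then look for the most similar element in a variant), here is a minimal sketch that compares elements by the Jaccard similarity of their tag, attributes, and text. The similarity measure and the names `features`, `jaccard`, and `best_match` are assumptions, not the method used by `crawl.py`. The sketch uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without dependencies, which means it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET


def features(element):
    """Collect comparable features of an element: tag, attributes, and text."""
    items = {("tag", element.tag)}
    for name, value in element.attrib.items():
        if name == "class":
            # Treat each CSS class as its own feature.
            items.update(("class", cls) for cls in value.split())
        else:
            items.add((name, value))
    text = (element.text or "").strip()
    if text:
        items.add(("text", text))
    return items


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(origin_root, variant_root, element_id):
    """Return (score, element) for the variant element most similar to the
    origin element that carries the given id."""
    target = next(
        (e for e in origin_root.iter() if e.get("id") == element_id), None
    )
    if target is None:
        raise ValueError(f"no element with id {element_id!r} in origin")
    wanted = features(target)
    scored = ((jaccard(wanted, features(c)), c) for c in variant_root.iter())
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical, hand-written markup; not taken from the repository samples.
    origin = ET.fromstring(
        '<body><div><a id="ok" class="btn btn-success" href="#ok">'
        "Make everything OK</a></div></body>"
    )
    variant = ET.fromstring(
        '<body><div><a class="btn btn-danger" href="#no">Break</a>'
        '<a class="btn btn-success" href="#ok">Make everything OK</a>'
        "</div></body>"
    )
    score, element = best_match(origin, variant, "ok")
    print(element.get("href"), score)
```

A real crawler would also need to report the path of the matched element and tolerate malformed HTML, which is where a parser such as `lxml` helps.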