https://github.com/jonolav95/finn_scraper
Scraping https://www.finn.no/
beautifulsoup lxml python requests scraping
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/jonolav95/finn_scraper
- Owner: JonOlav95
- Created: 2024-01-15T15:38:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-01T23:03:05.000Z (11 months ago)
- Last Synced: 2024-11-02T00:16:56.188Z (11 months ago)
- Topics: beautifulsoup, lxml, python, requests, scraping
- Language: Python
- Homepage:
- Size: 855 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# About
**CURRENTLY BROKEN XPATHS (TO BE FIXED)**
Scrapes Finn housing/work ads with Python and requests. **Work in progress**.

The scraper covers different subdomains within Finn *(see parameters.yml)*, e.g. housing ads, project ads, and work ads. Each subdomain requires its own set of XPaths, though many are shared between them *(see src/xpaths.py)*.

Only tested on Python 3.11.
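The split between shared and subdomain-specific XPaths can be pictured as a base table plus per-subdomain overrides. The following is only a minimal sketch of that idea: the names `COMMON_XPATHS`, `SUBDOMAIN_XPATHS`, and `xpaths_for`, the sub URL keys, and every XPath expression are hypothetical and are not taken from `src/xpaths.py`.

```python
# Hypothetical XPath tables; the real expressions live in src/xpaths.py
# and are not reproduced here.
COMMON_XPATHS = {
    "title": "//h1/text()",
    "description": "//section[@data-testid='description']//text()",
}

# Keys are illustrative sub URL names, not necessarily those in parameters.yml.
SUBDOMAIN_XPATHS = {
    "realestate": {"price": "//div[@data-testid='pricing']//text()"},
    "job": {
        "employer": "//dl/dd[1]/text()",
        # A subdomain may override a shared expression.
        "title": "//h2/text()",
    },
}


def xpaths_for(sub_url: str) -> dict[str, str]:
    """Return the shared XPaths merged with the overrides for one subdomain."""
    merged = dict(COMMON_XPATHS)  # copy so the shared table is never mutated
    merged.update(SUBDOMAIN_XPATHS.get(sub_url, {}))
    return merged
```

Each expression in the merged dictionary could then be evaluated against a parsed page, for example with lxml's `xpath()` method on an element tree.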
CSV example
Log example
# Setup
`mkdir scrapes`\
`mkdir logs`\
`pip install -r requirements.txt`

### Parameters
Adjust parameters in `parameters.yml`.\
**daily_scrape:** If true, the scraper only scrapes the daily ads.\
**finn_sub_urls:** Which parts of Finn to scrape. A separate CSV is created for each sub URL.

### To run
`python src/finn_scraper.py`

## Checklist
- [ ] Add detail to headers.
- [ ] Add sleep timer and folder etc to parameters.yml.
- [ ] Custom queries instead of binary daily/not daily scrape.
- [ ] Reduce line length across project.
- [ ] Check that all requests yield status code 200.
- [ ] Add a data-processing function for HTML -> text.
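
For the HTML -> text item, one possible shape is sketched below using only the standard library's `html.parser`. This is an assumption about what such a function might look like, not code from this project; `html_to_text` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace so the value fits in one CSV cell.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Text chunks are joined with a space, so adjacent blocks such as two `<p>` elements stay separated; the trade-off is that a word split by inline markup would also gain a space.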