https://github.com/jonolav95/finn_scraper

Scraping https://www.finn.no/

beautifulsoup lxml python requests scraping

Host: GitHub
URL: https://github.com/jonolav95/finn_scraper
Owner: JonOlav95
Created: 2024-01-15T15:38:54.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-11-01T23:03:05.000Z (11 months ago)
Last Synced: 2024-11-02T00:16:56.188Z (11 months ago)
Topics: beautifulsoup, lxml, python, requests, scraping
Language: Python
Homepage:
Size: 855 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# About
**The XPaths are currently broken (to be fixed).**\
Scrapes Finn housing and work ads with Python and requests. **Work in progress**.

The scraper covers different sections (sub-URLs) of Finn *(see parameters.yml)*, e.g. housing ads, project ads, and
work ads. Each section requires its own set of XPaths, though many XPaths are shared between sections *(see src/xpaths.py)*.
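
As a rough illustration of the shared-plus-specific layout described above, the sketch below merges a common set of XPaths with per-section additions. All names (`COMMON_XPATHS`, `SECTION_XPATHS`, `xpaths_for`, `extract`), the section keys, and the XPath expressions are hypothetical and not taken from `src/xpaths.py`. The standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, stands in for lxml so the snippet runs without dependencies.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical XPaths shared by every section (not the project's real ones).
COMMON_XPATHS = {
    "title": ".//h1",
    "description": ".//div[@class='description']",
}

# Hypothetical section-specific XPaths; keys may extend or override the common set.
SECTION_XPATHS = {
    "housing": {"price": ".//span[@class='price']"},
    "work": {"employer": ".//span[@class='employer']"},
}


def xpaths_for(section: str) -> dict[str, str]:
    """Return the common XPaths merged with those specific to `section`."""
    return {**COMMON_XPATHS, **SECTION_XPATHS.get(section, {})}


def extract(root: ET.Element, section: str) -> dict[str, Optional[str]]:
    """Apply every XPath for `section`; fields that do not match become None."""
    return {name: root.findtext(path) for name, path in xpaths_for(section).items()}


# A tiny, well-formed stand-in for an ad page.
SAMPLE_AD = """
<html><body>
  <h1>Cozy flat</h1>
  <div class="description">Two rooms near the park.</div>
  <span class="price">2 500 000 kr</span>
</body></html>
"""

ad = extract(ET.fromstring(SAMPLE_AD), "housing")
```

Keeping the shared expressions in one dict means a markup change on Finn that affects every section only has to be fixed in one place.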

Only tested on Python 3.11.

CSV example:
![CSV example](media/scrape_example.png)

Log example:
![Log example](media/log_example.png)

# Setup
`mkdir scrapes`\
`mkdir logs`\
`pip install -r requirements.txt`

### Parameters
Adjust the parameters in `parameters.yml`.\
**daily_scrape:** If true, the scraper only scrapes the daily ads.\
**finn_sub_urls:** Which parts of Finn to scrape. A separate CSV is created for
each sub-URL.
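
To make the one-CSV-per-sub-URL behaviour concrete, here is a minimal sketch of how output paths could be derived and rows written. The `params` dict stands in for a parsed `parameters.yml`; its sub-URL values, the helper names (`csv_path_for`, `append_rows`), the file-naming scheme, and the column names are assumptions for illustration, not the project's actual behaviour.

```python
import csv
from pathlib import Path

# Stand-in for the parsed parameters.yml; the sub-URL values are made up.
params = {
    "daily_scrape": True,
    "finn_sub_urls": ["realestate/homes", "job/fulltime"],
}


def csv_path_for(sub_url: str, folder: Path) -> Path:
    """Map a sub-URL to its own CSV file, e.g. 'job/fulltime' -> job_fulltime.csv."""
    return folder / (sub_url.strip("/").replace("/", "_") + ".csv")


def append_rows(sub_url: str, rows: list[dict], folder: Path) -> Path:
    """Append scraped rows to the CSV belonging to `sub_url`, writing the header once."""
    path = csv_path_for(sub_url, folder)
    if not rows:
        return path  # nothing scraped, nothing to write
    folder.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return path
```

With the setup from this README, `folder` would be `Path("scrapes")`, so each entry in `finn_sub_urls` ends up in its own file under `scrapes/`.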

### To run
`python src/finn_scraper.py`

## Checklist
- [ ] Add detail to headers.
- [ ] Add the sleep timer, output folder, etc. to parameters.yml.
- [ ] Custom queries instead of binary daily/not daily scrape.
- [ ] Reduce line length across project.
- [ ] Check that all requests return status code 200.
- [ ] Add a data-processing function for HTML -> text.
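
Two of the items above (the sleep timer and the status-code check) could look roughly like the sketch below. It is an illustration, not the project's code: `fetch_all`, its parameters, and the injected `get` callable are hypothetical. Passing `requests.get` as `get` would match the library the project uses, since its responses expose `status_code`.

```python
import time
from typing import Callable, Iterable


def fetch_all(urls: Iterable[str], get: Callable, delay_s: float = 1.0):
    """Fetch each URL, pausing between requests, and split results by status.

    `get` is any callable returning an object with a `status_code` attribute,
    e.g. `requests.get`. Returns (ok, failed): the responses with status 200,
    and (url, status_code) pairs for everything else.
    """
    ok, failed = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_s)  # wait between consecutive requests
        response = get(url)
        if response.status_code == 200:
            ok.append(response)
        else:
            failed.append((url, response.status_code))
    return ok, failed
```

Collecting the failures instead of raising on the first one lets a long scrape finish and log every URL that did not return 200.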
