https://github.com/aquatiko/craigslist-spider
A Python spider to scrape the jobs list and job details from https://newyork.craigslist.org.
- Host: GitHub
- URL: https://github.com/aquatiko/craigslist-spider
- Owner: aquatiko
- Created: 2018-07-18T10:16:09.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-07-18T10:38:33.000Z (almost 7 years ago)
- Last Synced: 2025-01-21T00:32:31.218Z (4 months ago)
- Topics: craigslist, dynamic, jobseeker, python3, scrapy-spider
- Language: Python
- Homepage:
- Size: 121 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Craigslist-spider
A Python spider to scrape the jobs list and job details from Craigslist.

## Usage
In a terminal or CMD, navigate to the main Scrapy project folder and run the spider:
```
scrapy crawl jobs -o output.csv
```
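For context, a jobs spider like this one can be sketched roughly as below. This is an illustrative sketch only: the CSS selectors, start URL path, and output fields are assumptions and may not match the project's actual spider code.

```python
# Minimal sketch of a Craigslist jobs spider (selectors and fields are illustrative)
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"  # matches the `scrapy crawl jobs` command above
    allowed_domains = ["newyork.craigslist.org"]
    # Assumed search path for the jobs category; verify against the live site
    start_urls = ["https://newyork.craigslist.org/search/jjj"]

    def parse(self, response):
        # Follow each result link to its detail page (selector is an assumption)
        for href in response.css("a.result-title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_job)

    def parse_job(self, response):
        # Yield a plain dict; `-o output.csv` exports these fields as CSV columns
        yield {
            "title": response.css("#titletextonly::text").get(),
            "posted": response.css("time.date::attr(datetime)").get(),
            "url": response.url,
        }
```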
### Settings
In `settings.py`, change these settings to make the spider site-friendly:

- `CONCURRENT_REQUESTS = 2` (or 5) sets the maximum number of concurrent requests the spider makes to the domain. A high limit might get the spider detected by the domain.
- `ROBOTSTXT_OBEY = False` allows scraping parts of the website that are not permitted by the domain. You can check those rules by visiting `www.site_name/robots.txt`.
- `DOWNLOAD_DELAY = 2` (in seconds) adds a gap between consecutive requests. This will make your spider slower, but it also lessens the chance of being detected by the domain.
You can also uncomment other settings in settings.py and set their values for a more customized spider.
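Put together, the relevant lines in `settings.py` might look like the sketch below. This is an illustrative excerpt, not the project's exact file; `BOT_NAME` and the comments are assumptions.

```python
# settings.py -- illustrative excerpt, not the project's exact configuration
BOT_NAME = "craigslist_spider"  # assumed project name, adjust to your own

# Keep concurrency low so the crawl is less likely to be flagged by the domain
CONCURRENT_REQUESTS = 2

# Ignore robots.txt rules (check www.<site_name>/robots.txt to see what is disallowed)
ROBOTSTXT_OBEY = False

# Wait 2 seconds between consecutive requests: slower, but gentler on the site
DOWNLOAD_DELAY = 2
```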