https://github.com/sfproductlabs/scrp
Horizontally scalable web scraper written in Go
- Host: GitHub
- URL: https://github.com/sfproductlabs/scrp
- Owner: sfproductlabs
- License: apache-2.0
- Created: 2020-05-22T22:17:01.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-08-07T02:11:13.000Z (about 5 years ago)
- Last Synced: 2024-05-19T00:34:29.839Z (over 1 year ago)
- Topics: altavista, baidu, bing, google, yahoo
- Language: Go
- Homepage:
- Size: 236 KB
- Stars: 7
- Watchers: 4
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scrp
A fully resumable, horizontally (infinitely) scalable web scraper in Go. Think thousands of machines scraping sites in a distributed way. Based on Docker Swarm, Cassandra, colly, gRPC, and my other [boilerplate](https://github.com/dioptre/gtrpc).

## Note
Why are you even here? You could probably just use colly, especially if you don't care about scalability, or use a shell script. For example:
```bash
#!/bin/bash
url="http://www.cityfeet.com/cont/api/search/listings-spatial"
cookie="ASP.NET_SessionId=x335iekckm5tqxcq12psv1p2; __RequestVerificationToken_L2NvbnQ1=FTTyjLMPpvjTLNYvWo5a5yFqhos830-fpyjtxwr4vsVnG8P7_bf5zEEpH4JjY2KfIKgHMuuotd9IyW4iUmSeYRHnLzQ1"
DATE=`date +%Y-%m-%d`
#for i in $(cat query.txt); do
for i in {1..9}; do
body="{'location':{'name':'San Francisco, CA','bb':[37.708131,-122.51777,37.863424,-122.3570311],'lat':37.7857775,'lng':-122.43740055,'state':'CA','city':'San Francisco','id':'3-19282','level':3},'lt':1,'pt':0,'sort':null,'partnerId':null,'lc':[],'mode':2,'portfolio':-1,'tt':0,'ignoreLocation':false,'KeyWord':null,'rent':{'type':1,'basis':0},'term':'San Francisco, CA','PageNum':$i,'PageSize':30,'state':{'\$type':'Cityfeet.Core.Listing.MultiSearchState, Core','ProviderPosition':{'PDS':$((30 * ($i -1))),'CF':0}}}"
content="$(curl -v -s "$url" --header "Cookie: $cookie" --header "Content-Type: application/json" --data "$body" --cookie "$cookie")"
echo "$content" > ./data/city-feet-com-listings-spatial-$DATE-$i.json
sleep 5
done
```
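The `PDS` value in the request body above is a record offset derived from the page number: with a page size of 30, page `i` starts at `30 * (i - 1)`. A minimal sketch of that arithmetic as a reusable function follows. The name `page_offset` is made up for illustration and is not part of the original script.

```bash
# page_offset PAGE [PAGE_SIZE]
# Prints the zero-based record offset for a one-based page number.
# Hypothetical helper; the original script inlines this arithmetic.
page_offset() {
  local page="$1"
  local size="${2:-30}"
  echo $(( size * (page - 1) ))
}

# Show the offset for the first three pages.
for i in {1..3}; do
  echo "page $i -> offset $(page_offset "$i")"
done
```

Keeping the offset in one place makes it harder for `PageNum` and `PDS` to drift apart if the page size ever changes.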
## Why
I built this to distribute scraping across multiple servers, so as to go undetected. I could have used proxies, but I wanted to reuse the code for other distributed apps.

## Local Execution
### Installing (local)
Run:
```
docker-compose up
```
Then (on linux - you can use brew on mac):
```
#apt install go
#./prereq.sh
#./build.sh
```

### Scrape Instructions (local)
Add backend.local to your /etc/hosts file:
```
bash -c 'echo "127.0.0.1 backend.local" >> /etc/hosts'
```
First run the server on all the nodes (use the testlocal.sh script for brevity):
```
#GOCQL_HOST_LOOKUP_PREFER_V4=true /usr/bin/nice -n 5 ./gsvc localhost false false ./.setup/keys/rootCa.crt ./.setup/keys/cassandra-client.crt ./.setup/keys/cassandra-client.key
```
Notice the parameters:
```
[0] - cassandra-databases (comma-separated, no spaces)
[1] - cassandra-retry (should we retry execution on the cassandra cluster)
[2] - cassandra-verify (should we verify the cassandra service)
[3] - cassandra-rootca (only use this if you need)
[4] - cassandra-client-cert (only use this if you need)
[5] - cassandra-client-key (only use this if you need)
```
Then send a request via the client:
```
./gcli backend.local:50551 https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
```
Or something a little more complex, with a domain filter and a regex (note that you can split the regex into multiple filters using ```||```):
```
./gcli backend.local:50551 https://en.wikipedia.org/wiki/List_of_HTTP_status_codes en.wikipedia.org,wikipedia.org ".*List.*status_codes$"
```
Or without the domain filter and just the regex (use the _ operator to skip):
```
./gcli backend.local:50551 https://en.wikipedia.org/wiki/List_of_HTTP_status_codes _ ".*List.*status_codes$"
```

## Running on Docker Swarm
### TL;DR Be your own google bot cluster in 5 minutes.
This will set you up with a cluster running 30 machines on [Hetzner Cloud](https://hetzner.cloud/?ref=kVvYlAsUNbOF) (change the first 20 lines to suit your own cloud provider). I use this on my own production servers. Nothing else. This assumes you've set up a project in Hetzner and created an API key. It should be a fresh environment. We may delete ALL the machines in the Hetzner project if things don't work, so start with a fresh project and use the RIGHT API key.
On your local/desktop/development machine (from the scrp github repository):
```
sudo apt install hcloud-cli
hcloud ssh-key create --name andy --public-key-from-file ~/.ssh/id_rsa.pub
hcloud network create --ip-range=10.1.0.0/16 --name=aftnet
hcloud network add-subnet --ip-range=10.1.0.0/16 --type=server --network-zone=eu-central aftnet
for n in {1..30}; do (hcloud server create --name scrp$RANDOM$RANDOM$RANDOM$RANDOM --type cx11 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy 2>&1 >/dev/null &) ; done
watch -n 5 "echo "Press Ctrl-c to exit when your server count meets the desired amount. You will need to copy and paste just the following instructions to proceed." && hcloud server list | grep 'running' | awk 'END {print NR}'"
```
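Instead of watching the count by eye, the same check can be polled in a loop that returns once enough servers are up. This is a minimal sketch: `count_running` and `wait_for_servers` are hypothetical names, and the match on the word `running` is the same loose test the `grep`/`awk` pipeline above uses.

```bash
# count_running
# Reads `hcloud server list` output on stdin and prints how many lines
# mention "running". Prints 0 (and still succeeds) when none match.
count_running() {
  grep -c 'running' || true
}

# wait_for_servers WANTED [INTERVAL_SECONDS]
# Polls until at least WANTED servers are running.
# Needs a configured hcloud CLI, so it is only defined here, not called.
wait_for_servers() {
  local wanted="$1"
  local interval="${2:-5}"
  local running
  while true; do
    running="$(hcloud server list | count_running)"
    echo "running: $running / $wanted"
    [ "$running" -ge "$wanted" ] && break
    sleep "$interval"
  done
}
```

With the 30 servers created above, `wait_for_servers 30` would block until all of them report as running.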
Wait until all your servers have been created then continue:
```
rm *.txt
hcloud server list -o columns=name -o noheader > scrps-names.txt
hcloud server list -o columns=ipv4 -o noheader > scrps-ips.txt
cat scrps-names.txt | xargs -I {} hcloud server describe -o json {} | jq -r '.private_net[0].ip' >> scrps-vips.txt
hcloud server create --name cassandra1 --type cx41 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy
hcloud server describe -o json cassandra1 | jq -r '.private_net[0].ip' > cassandra-vip.txt
hcloud server create --name manager1 --type cx11 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy
hcloud server describe -o json manager1 | jq -r '.private_net[0].ip' > manager-vip.txt
scp -o StrictHostKeyChecking=no *.txt root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
scp -o StrictHostKeyChecking=no ansible/* root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
scp -o StrictHostKeyChecking=no scrp-docker-compose.yml root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
scp -o StrictHostKeyChecking=no .setup/schema* root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
```
If it stuffs up, run the following. **DANGEROUS**: it will delete all the servers in the project:
```
hcloud server list -o columns=name -o noheader | xargs -P 8 -I {} hcloud server delete {}
```
If not, get on the manager node with ```ssh -l root -A $(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}')``` and run (copy and paste the whole thing):
```
apt-get update && \
apt-get upgrade -y && \
apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y && \
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add - && \
apt-key fingerprint 0EBFCD88 && \
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable" && \
apt-get update && \
apt-get install docker-ce docker-ce-cli containerd.io ansible -y && \
docker swarm init --advertise-addr=ens10 && \
docker swarm join-token worker | xargs | sed -r 's/^.*(docker.*).*$/\1/' > join.sh && \
chmod +x join.sh && \
printf "\n[defaults]\nhost_key_checking = False\n" >> /etc/ansible/ansible.cfg && \
printf "\n[cassandras]\n" >> /etc/ansible/hosts && \
cat cassandra-vip.txt >> /etc/ansible/hosts && \
printf "\n[managers]\n" >> /etc/ansible/hosts && \
cat manager-vip.txt >> /etc/ansible/hosts && \
printf "\n[dockers]\n" >> /etc/ansible/hosts && \
cat manager-vip.txt >> /etc/ansible/hosts && \
cat scrps-vips.txt >> /etc/ansible/hosts && \
cat cassandra-vip.txt >> /etc/ansible/hosts && \
printf "\n[scrps]\n" >> /etc/ansible/hosts && \
cat scrps-vips.txt >> /etc/ansible/hosts && \
ansible dockers -a "uptime" && \
printf "\n $(cat join.sh | awk '{print $0}')" >> swarm-init.yml && \
ansible-playbook swarm-init.yml && \
ansible dockers -a "docker stats --no-stream" && \
docker node ls && \
docker node update --label-add cassandra=true cassandra1 && \
docker network create -d overlay --attachable forenet --subnet 192.168.9.0/24 && \
ansible-playbook cassandras-init.yml && \
docker secret create schema.1.cql schema.1.cql && \
docker stack deploy -c scrp-docker-compose.yml scrp
```
Finally run a query (see above for more complex examples):
```
docker run -it --net=forenet sfproductlabs/scrp /app/scrp/gcli scrp_scrp:50551 https://httpbin.org/delay/2
```

### Deploy to a swarm
*Important: First make sure you deploy the [schema](https://github.com/sfproductlabs/scrp/blob/master/.setup/schema.1.cql) to Cassandra somewhere.*

Ex. ```cqlsh --ssl -f ./.setup/schema.1.cql```

[Check out and use the swarm-config example](https://github.com/sfproductlabs/scrp/blob/master/scrp-docker-compose.yml), then on your docker swarm manager:
```
docker stack deploy -c scrp-docker-compose.yml scrp
```
Then follow the logs to see if you need to update anything:
```
docker service logs scrp_scrp -f
```
Then issue a query to the swarm (as above):
```
docker run -it --net=forenet sfproductlabs/scrp /app/scrp/gcli scrp_scrp:50551 https://httpbin.org/delay/2
```
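The positional conventions for `gcli` described in the local instructions (target, URL, optional domain filter, optional regex, with `_` standing in for a skipped domain filter) can be wrapped in a small helper so the placeholder is never forgotten. This is only a sketch: `gcli_args` is a hypothetical name, and the argument order is taken from the examples in this README rather than from the `gcli` source.

```bash
# gcli_args TARGET URL [DOMAINS] [REGEX]
# Prints the gcli argument list, one argument per line.
# When a regex is given without a domain filter, "_" is substituted
# for the domain filter, as in the local examples.
gcli_args() {
  local target="$1" url="$2" domains="${3:-}" regex="${4:-}"
  local args=("$target" "$url")
  if [ -n "$regex" ]; then
    args+=("${domains:-_}" "$regex")
  elif [ -n "$domains" ]; then
    args+=("$domains")
  fi
  printf '%s\n' "${args[@]}"
}

# Example wiring (commented out; needs the swarm and the forenet network):
# mapfile -t args < <(gcli_args scrp_scrp:50551 https://httpbin.org/delay/2 "" '.*delay.*')
# docker run -it --net=forenet sfproductlabs/scrp /app/scrp/gcli "${args[@]}"
```

Printing one argument per line keeps a regex containing spaces or shell metacharacters intact when it is read back into an array.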
#### Deploying swarm on Hetzner
```
$ = Your client/development machine, run them from this git repository root
# = As root on the dockermanager
d# = As root on the docker swarm drone
```

##### Setting up Hetzner
**Remember to run the $ commands from the git repository root**
* Install hetzner cli:
```
$sudo apt install hcloud-cli
```
* Go to the cloud console and create a project (important! make sure it's a new one, we will be deleting every server in here when we are done)
* Then click on project->access->api tokens->generate token
* Set up access on your local machine to the datacenters/project:
```
$hcloud context create scrp
```
* Make sure there are no servers here (yet)
```
$hcloud server list
```
* Add your local machine to ssh auth
```
$hcloud ssh-key create --name andy --public-key-from-file ~/.ssh/id_rsa.pub
```
* Choose a server-type
```
$hcloud server-type list
$hcloud image list
$hcloud datacenter list
```
* Create a network
```
$hcloud network create --ip-range=10.1.0.0/16 --name=aftnet
$hcloud network add-subnet --ip-range=10.1.0.0/16 --type=server --network-zone=eu-central aftnet
```
* Create 30 servers (replace the type with your instance preference Ex. cx41)
```
$for n in {1..30}; do (hcloud server create --name scrp$RANDOM$RANDOM$RANDOM$RANDOM --type cx11 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy &) ; done
```
* In a SEPARATE terminal see the status of your booting machines (you can delete them all using the command below if something bad happens):
```
$watch -n 5 "echo "Press Ctrl-c to exit when your server count meets the desired amount" && hcloud server list | grep 'running' | awk 'END {print NR}'"
```
* Get a list of them. IMPORTANT. This will be used to delete the servers later. Check them!
```
$rm *.txt
$hcloud server list -o columns=name -o noheader > scrps-names.txt
$hcloud server list -o columns=ipv4 -o noheader > scrps-ips.txt
$cat scrps-names.txt | xargs -I {} hcloud server describe -o json {} | jq -r '.private_net[0].ip' >> scrps-vips.txt
```
* Create a cassandra server (16GB ram):
```
$hcloud server create --name cassandra1 --type cx41 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy
$hcloud server describe -o json cassandra1 | jq -r '.private_net[0].ip' > cassandra-vip.txt
```
* Create a manager node, copy some files to it and log in:
Additional step required *only* on a mac:
```
eval `ssh-agent`
ssh-add ~/.ssh/id_rsa
```
Now create a manager, and get to it:
```
$hcloud server create --name manager1 --type cx11 --image debian-9 --datacenter nbg1-dc3 --network aftnet --ssh-key andy
$hcloud server describe -o json manager1 | jq -r '.private_net[0].ip' > manager-vip.txt
$scp -o StrictHostKeyChecking=no *.txt root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
$scp -o StrictHostKeyChecking=no ansible/* root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
$scp -o StrictHostKeyChecking=no scrp-docker-compose.yml root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
$scp -o StrictHostKeyChecking=no .setup/schema* root@$(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}'):~/
$ssh -l root -A $(hcloud server list -o columns=ipv4,name -o noheader | grep manager1 | awk '{print $1}')
```
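The commands above resolve the manager's public address five times with the same `grep manager1` pipeline, which would also match a server named, say, `manager10`. A minimal sketch that looks the address up once with an exact name match is shown below. `ip_for_server` and `MANAGER_IP` are hypothetical names, and the input format is assumed to be the two-column output of `hcloud server list -o columns=ipv4,name -o noheader` used above.

```bash
# ip_for_server NAME
# Reads "IPV4 NAME" lines on stdin and prints the address of the
# server whose name equals NAME exactly (first match only).
ip_for_server() {
  awk -v name="$1" '$2 == name { print $1; exit }'
}

# Example wiring (commented out; needs a configured hcloud CLI):
# MANAGER_IP="$(hcloud server list -o columns=ipv4,name -o noheader | ip_for_server manager1)"
# scp -o StrictHostKeyChecking=no *.txt "root@${MANAGER_IP}:~/"
# scp -o StrictHostKeyChecking=no ansible/* "root@${MANAGER_IP}:~/"
# ssh -l root -A "$MANAGER_IP"
```

Resolving the address once also avoids a separate API call for every `scp`.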
##### Initializing a Docker Swarm
https://docs.docker.com/engine/install/debian/

From the #docker manager1 (last ssh command above) as root **(it's important to make sure this runs perfectly)** run:
```
apt-get update && \
apt-get upgrade -y && \
apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common -y && \
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add - && \
apt-key fingerprint 0EBFCD88 && \
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable" && \
apt-get update && \
apt-get install docker-ce docker-ce-cli containerd.io ansible -y && \
docker swarm init --advertise-addr=ens10 && \
docker swarm join-token worker | xargs | sed -r 's/^.*(docker.*).*$/\1/' > join.sh && \
chmod +x join.sh
```
Now we can set up docker on all the client machines using ansible (still on the docker manager1):
```
printf "\n[defaults]\nhost_key_checking = False\n" >> /etc/ansible/ansible.cfg
printf "\n[cassandras]\n" >> /etc/ansible/hosts
cat cassandra-vip.txt >> /etc/ansible/hosts
printf "\n[managers]\n" >> /etc/ansible/hosts
cat manager-vip.txt >> /etc/ansible/hosts
printf "\n[dockers]\n" >> /etc/ansible/hosts
cat manager-vip.txt >> /etc/ansible/hosts
cat scrps-vips.txt >> /etc/ansible/hosts
cat cassandra-vip.txt >> /etc/ansible/hosts
printf "\n[scrps]\n" >> /etc/ansible/hosts
cat scrps-vips.txt >> /etc/ansible/hosts
```
Test the machines are contactable:
```
ansible dockers -a "uptime"
```
If that worked, install docker on all the machines:
```
printf "\n $(cat join.sh | awk '{print $0}')" >> swarm-init.yml
ansible-playbook swarm-init.yml
```
Test the dockers are up:
```
ansible dockers -a "docker stats --no-stream"
docker node ls
```
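The inventory steps earlier in this section append to `/etc/ansible/hosts`, so running them a second time duplicates every entry. One way around that is to generate the whole inventory in a single pass and write it to its own file. The sketch below assumes the `*-vip.txt` files copied to the manager and the same four group names; `build_inventory` and `inventory.ini` are hypothetical names.

```bash
# build_inventory [DIR]
# Prints an Ansible INI inventory assembled from cassandra-vip.txt,
# manager-vip.txt and scrps-vips.txt in DIR (default: current directory).
build_inventory() {
  local dir="${1:-.}"
  printf '[cassandras]\n'
  cat "$dir/cassandra-vip.txt"
  printf '\n[managers]\n'
  cat "$dir/manager-vip.txt"
  printf '\n[dockers]\n'
  cat "$dir/manager-vip.txt" "$dir/scrps-vips.txt" "$dir/cassandra-vip.txt"
  printf '\n[scrps]\n'
  cat "$dir/scrps-vips.txt"
}

# Example wiring (commented out; run on the manager):
# build_inventory ~ > inventory.ini
# ansible -i inventory.ini dockers -a "uptime"
```

Because the function only prints, redirecting with `>` overwrites the file each time, which makes the step safe to repeat.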
Now deploy the swarm stack:
```
docker node update --label-add cassandra=true cassandra1
ansible-playbook cassandras-init.yml
docker network create -d overlay --attachable forenet --subnet 192.168.9.0/24
docker secret create schema.1.cql schema.1.cql
docker stack deploy -c scrp-docker-compose.yml scrp
```
Give it a few minutes to boot. The scrps will take a while and will likely fail a few times before they finally connect to Cassandra. To debug, for example:
```
docker service ps scrp_cassandra --no-trunc
docker service logs scrp_cassandra -f
```
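Since the services may fail a few times before Cassandra is reachable, a generic retry loop can save re-running the first query by hand. This is a sketch only: `retry` is a hypothetical helper, and it assumes `gcli` exits with a non-zero status when it cannot reach the service, which this README does not confirm.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while true; do
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example wiring (commented out; needs the swarm deployed above):
# retry 20 15 docker run --net=forenet sfproductlabs/scrp /app/scrp/gcli scrp_scrp:50551 https://httpbin.org/delay/2
```

A fixed delay keeps the sketch short; an increasing delay would be gentler on a cluster that is still booting.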
Finally run a query (see above for more complex examples):
```
docker run -it --net=forenet sfproductlabs/scrp /app/scrp/gcli scrp_scrp:50551 https://httpbin.org/delay/2
```

##### Getting my data out
If you are new to Cassandra, here's a quick intro:
```
$ssh -l root -A $(hcloud server list -o columns=ipv4,name -o noheader | grep cassandra1 | awk '{print $1}')
#docker ps
#docker exec -it c41 bash #replace c41 with your container
## cqlsh --ssl
cqlsh> select count(*) from scrp.content;
cqlsh> COPY scrp.content TO 'content.csv' WITH HEADER = FALSE;
```

##### Deleting Machines
* DELETE THEM. Yes. Let's get used to it, and make sure we know what we're doing. Double check everything before executing these commands.
```
$cat scrps-names.txt | xargs -I {} hcloud server delete {}
```
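To double check before deleting, the same list can be turned into a dry run that prints the commands without executing them. This is a minimal sketch; `delete_plan` is a hypothetical name, and it assumes one server name per line in `scrps-names.txt`, as produced earlier.

```bash
# delete_plan FILE
# Prints one `hcloud server delete NAME` line per non-empty line in FILE.
# Nothing is executed, so the output can be reviewed first.
delete_plan() {
  local file="$1" name
  while IFS= read -r name || [ -n "$name" ]; do
    if [ -n "$name" ]; then
      printf 'hcloud server delete %s\n' "$name"
    fi
  done < "$file"
  return 0
}

# Review:  delete_plan scrps-names.txt
# Execute: delete_plan scrps-names.txt | sh
```

Piping the reviewed plan to `sh` runs exactly the lines that were printed, nothing more.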
Alternatively, the DANGEROUS option (great for cleaning up, but it also deletes cassandra) removes every server in the project, in parallel:
```
hcloud server list -o columns=name -o noheader | xargs -P 8 -I {} hcloud server delete {}
```

##### Misc
Example commands (https://docs.hetzner.cloud/):
```
$source <(hcloud completion bash) # bash
$source <(hcloud completion zsh) # zsh
$hcloud server list
$hcloud ssh-key create --name demo --public-key-from-file ~/.ssh/id_rsa.pub
$hcloud server create --name demoserver --type cx11 --image debian-9 --ssh-key demo
$hcloud server list
$hcloud server list | grep -E "[0-9]+.[0-9]+.[0-9]+.[0-9]+" | sed -r 's/.*(\w[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*$/\1/' > scrps-ips.txt
$hcloud server list | grep -E '^[^ID]' | sed -r 's/^[0-9]+ +([^ ]+).*$/\1/ig' > scrps-names.txt
```

##### Issues
Sometimes I get a network issue when running a new docker container (outside swarm mode, using ```docker run```) on a node in a drained state. Try running this to fix things:
```
docker node update --availability active manager1
```
For example:
```
docker node update --availability active manager1 && docker run -it --net=forenet sfproductlabs/scrp /app/scrp/gcli scrp_scrp:50551 https://www.bf.com/?country=us bf.com,www.bf.com
```