An easy and powerful template with the minimum you need to start web scraping with Ruby + Selenium + Docker + Google Kubernetes Engine
- Host: GitHub
- URL: https://github.com/nullnull/scraping_sample
- Owner: nullnull
- License: MIT
- Created: 2018-09-15T14:02:27.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-04-12T05:20:42.000Z (about 3 years ago)
- Last Synced: 2025-07-24T21:05:07.080Z (9 months ago)
- Topics: docker, gke, kubernetes, ruby, selenium
- Language: Ruby
- Homepage: https://qiita.com/nullnull/items/61dae392f853f260cfb0
- Size: 18.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Scraping Sample
An easy and powerful template with the minimum you need to start web scraping with Ruby + Selenium + Docker + Google Kubernetes Engine
## Setup for local development on docker
```sh
git clone git@github.com:nullnull/scraping_sample.git
cd scraping_sample
docker-compose build
docker-compose up -d
docker-compose exec scraper sh setup.sh
docker-compose exec scraper bundle exec ruby app/fetch_search_results.rb
```
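The repository's actual `docker-compose.yml` isn't reproduced here. Purely as a sketch of what a setup like this typically pairs together, the scraper service usually runs alongside a standalone Selenium container (the service names, image, and environment variables below are illustrative assumptions, not the repo's file):

```yaml
version: "3"
services:
  selenium:
    # Legacy debug image that ships a VNC server on port 5900
    image: selenium/standalone-chrome-debug
    ports:
      - "5900:5900"
  scraper:
    build: .
    environment:
      # Where the Ruby Selenium client connects
      SELENIUM_URL: http://selenium:4444/wd/hub
      # Optional, see the Slack Integration section
      SLACK_WEBHOOK_URL: ""
    depends_on:
      - selenium
```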
## Monitor scraping progress with VNC
You can use a [VNC server](https://qiita.com/yszk0123/items/840f16ed388fb52b0e21) to monitor Selenium. Run `open vnc://localhost:5900/` and enter `secret` as the password.
## Deploy and Run Scraper
```sh
$ sh cronjob.sh
# to check progress with VNC
$ kubectl get pods
$ kubectl port-forward pod/<pod-name> 5900:5900
$ open vnc://localhost:5900/
```
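`cronjob.sh` presumably applies `kube/cronjob.yml` to the cluster. The manifest itself isn't shown above, but a minimal Kubernetes CronJob for a scraper like this might look roughly as follows (the schedule, image path, and names are illustrative assumptions; clusters older than v1.21 use `batch/v1beta1` instead of `batch/v1`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scraper
spec:
  schedule: "0 3 * * *"   # run daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: scraper
              image: gcr.io/<your-project>/scraper:latest
              command: ["bundle", "exec", "ruby", "app/fetch_search_results.rb"]
          restartPolicy: Never
```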
## Slack Integration (Optional)
Set your webhook URL as `SLACK_WEBHOOK_URL` in `docker-compose.yml`, `kube/cronjob.yml`, and `kube/deploy.yml`.
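The repo's notification code isn't shown here; as a minimal sketch using only the Ruby standard library, posting to a Slack incoming webhook can look like this (`build_slack_payload` and `notify_slack` are hypothetical names, not functions from this repo):

```ruby
require 'net/http'
require 'uri'
require 'json'

# Build the JSON body Slack's incoming webhooks expect: {"text": "..."}.
def build_slack_payload(text)
  JSON.generate(text: text)
end

# POST a message to the webhook URL (defaults to the SLACK_WEBHOOK_URL
# env var). Returns the Net::HTTPResponse, or nil when no URL is set,
# so the scraper still works without Slack configured.
def notify_slack(text, webhook_url = ENV['SLACK_WEBHOOK_URL'])
  return nil if webhook_url.nil? || webhook_url.empty?
  uri = URI.parse(webhook_url)
  Net::HTTP.post(uri, build_slack_payload(text),
                 'Content-Type' => 'application/json')
end
```

Calling `notify_slack("Scraping finished")` at the end of a run keeps the integration optional: with the env var unset, the method is a no-op.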
## Data Visualization (Optional)
We recommend [Redash](https://redash.io/) to visualize your scraping results. It's easy to set up and has powerful visualization features.
### Run Redash on GCE
https://redash.io/help/open-source/setup
```sh
$ gcloud compute images create "redash-2-0-0" --source-uri gs://redash-images/redash.2.0.0.b2990.tar.gz
$ gcloud compute instances create redash \
--image redash-2-0-0 --scopes storage-ro,bigquery \
--machine-type g1-small --zone asia-east1-a
# then finish the configuration in the GCE console.
```