https://github.com/maxpleaner/showscraper
automated listing of concerts in san francisco area
- Host: GitHub
- URL: https://github.com/maxpleaner/showscraper
- Owner: MaxPleaner
- License: other
- Created: 2022-04-26T14:53:39.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2025-04-28T04:40:54.000Z (about 1 year ago)
- Last Synced: 2025-04-28T05:19:53.376Z (about 1 year ago)
- Language: Ruby
- Homepage: https://bayareashows.org
- Size: 85.8 MB
- Stars: 5
- Watchers: 3
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# ShowScraper
This is a collection of scrapers that fetch concert listings from
various venues in the Bay Area.
It also includes a website for viewing the listings.
## Tech Stack
The scraper uses headless [Selenium](https://www.selenium.dev/)
([Chromedriver](https://chromedriver.chromium.org/downloads))
and is written in [Ruby](https://www.ruby-lang.org/en/).
Results are converted to [JSON](https://www.json.org/json-en.html)
and pushed to a [Google Cloud Storage](https://cloud.google.com/storage)
bucket.
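As a rough illustration of the JSON conversion step, here is a minimal sketch. The event fields (`title`, `date`, `url`) are hypothetical placeholders, not the schema the scrapers actually emit, and the upload to the bucket is left out.

```ruby
require "json"

# Hypothetical event records; the real field names used by the scrapers may differ.
events = [
  { title: "Example Band", date: "2025-05-01", url: "https://example.com/show" }
]

# Serialize to a JSON string, the form that would be pushed to the bucket.
payload = JSON.pretty_generate(events)
puts payload

# Round-trip to confirm the payload parses back to the same data.
parsed = JSON.parse(payload, symbolize_names: true)
```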
The front end is written in [React](https://reactjs.org/)
and styled with [Bulma CSS framework](https://bulma.io/).
## Setup - Scraper
1. Run `bundle install` in the root of the repo using a stable Ruby version
1. You're gonna need to install Chromedriver.
- **On Raspberry Pi**: run `sudo apt-get install chromium-chromedriver`. No further setup is needed, because the scraper checks your system architecture and uses the right
chromedriver path.
- **On OSX and Windows**: download a release from [https://chromedriver.chromium.org/](https://chromedriver.chromium.org/) and then add it to your PATH.
For example, add this to your `~/.bash_profile`:
```
export PATH=$PATH:/path/to/folder/containing/chromedriver/
```
1. Run `cp .env.example .env` in the root of the repo. The `.env` file sets environment variables which can be used to customize the application's behavior. This file can be edited directly.
1. Make a new "project" on Google Cloud. Create a GCS bucket in the project. Add the credentials to `.env`:
```
STORAGE_PROJECT = "my-project-id"
STORAGE_CREDENTIALS = "path/to/keyfile.json"
```
1. Change the GCS bucket permissions so all files are publicly available by default.
1. Configure `gsutil` to use your new project, then upload the CORS file which I've included in the repo:
```
gsutil cors set cors-json-file.json gs://
```
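If you want to catch a missing storage setting before a scrape starts, a check along these lines could help. This is only a sketch: `storage_config` and `REQUIRED_STORAGE_KEYS` are hypothetical names, not helpers that exist in the repo, and the only thing taken from the setup steps is the pair of variable names `STORAGE_PROJECT` and `STORAGE_CREDENTIALS`.

```ruby
# Hypothetical helper, not part of the repo: collect the storage settings
# and fail early with a readable message if one is missing or blank.
REQUIRED_STORAGE_KEYS = %w[STORAGE_PROJECT STORAGE_CREDENTIALS].freeze

def storage_config(env = ENV)
  missing = REQUIRED_STORAGE_KEYS.reject { |key| env[key] && !env[key].empty? }
  raise ArgumentError, "Missing in .env: #{missing.join(', ')}" unless missing.empty?

  { project_id: env["STORAGE_PROJECT"], keyfile: env["STORAGE_CREDENTIALS"] }
end
```

Passing a plain hash instead of `ENV` makes it easy to try out, e.g. `storage_config("STORAGE_PROJECT" => "my-project-id", "STORAGE_CREDENTIALS" => "path/to/keyfile.json")`.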
## Setup - Frontend
1. Make sure you're using a stable Node version
2. `cd frontend/react-app`
3. Run `yarn install` to get dependencies
4. Run `yarn start` and then open `localhost:3000`
To build the project for production, use `yarn build`.
_Note_: There is also a script which you can run from the root of the repo to start the react server:
```
bin/run_frontend
```
This runs `nvm use 14; cd frontend/react_app && yarn install & yarn start`
Note that you will probably have to change `nvm use 14` if you are using a different Node version.
## Running Scraper
There is a command line tool at `bin/run_scraper`.
By default it will run all scrapers (each will fetch a maximum of 200 events)
and then upload the results to GCS.
**Options** (note that most of these can also be set from `.env`)
```
# Limits each scraper to N results
--limit=10
# Just print the results, don't upload them to GCS
--skip-persist
# Don't rescue scraping errors - one broken scraper will stop the whole script
# If rescue=true (the default) then broken scrapers will just be skipped
--rescue=false
# Just update the list of venues. Don't actually scrape any events.
--no-scrape
# Limit the scrape to a set of venues. Comma-separated list.
--sources=GreyArea,Cornerstone
# Run headlessly, or not
--headless=true
--headless=false
```
For example, combining options:
```
bin/run_scraper --headless=true --limit=5 --skip-persist --rescue=false --sources=ElboRoom,Knockout
```
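To make the flag semantics concrete, here is a sketch of how options like these could be parsed with Ruby's standard `OptionParser`. It is an illustration only: `parse_scraper_options` is a hypothetical name, the option keys are made up for the example, and `bin/run_scraper` may well parse its arguments differently.

```ruby
require "optparse"

# Hypothetical parser mirroring the documented flags; not the repo's actual code.
def parse_scraper_options(argv)
  # Defaults follow the README: 200 events per scraper, errors rescued, results uploaded.
  # The headless default here is an assumption.
  options = { limit: 200, persist: true, rescue: true, scrape: true, sources: nil, headless: true }

  OptionParser.new do |opts|
    opts.on("--limit=N", Integer) { |n| options[:limit] = n }
    opts.on("--skip-persist") { options[:persist] = false }
    opts.on("--rescue=BOOL") { |v| options[:rescue] = (v == "true") }
    opts.on("--no-scrape") { options[:scrape] = false }
    # The Array acceptor splits a comma-separated value into a list.
    opts.on("--sources=LIST", Array) { |list| options[:sources] = list }
    opts.on("--headless=BOOL") { |v| options[:headless] = (v == "true") }
  end.parse(argv)

  options
end
```

For instance, `parse_scraper_options(%w[--limit=5 --sources=ElboRoom,Knockout])` yields a hash whose `:limit` is `5` and whose `:sources` is `["ElboRoom", "Knockout"]`.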
Some other configuration options are set through environment variables; see `.env.example`.
Note that every time you run a scraper, it will completely overwrite the list of events for that venue.
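The interaction between `--limit` and `--rescue` can be sketched as follows. This is a simplified model, not the repo's implementation: `run_scrapers` is a hypothetical function, and each scraper is stood in for by a lambda that returns an array of events.

```ruby
# Hypothetical runner: `scrapers` maps a venue name to a callable returning events.
def run_scrapers(scrapers, limit: 200, rescue_errors: true)
  scrapers.each_with_object({}) do |(name, scraper), results|
    begin
      # Keep at most `limit` events per venue.
      results[name] = scraper.call.first(limit)
    rescue StandardError => e
      # With rescue disabled, one broken scraper stops the whole run.
      raise unless rescue_errors
      warn "Skipping #{name}: #{e.message}"
    end
  end
end
```

With `rescue_errors: true`, a venue whose scraper raises is simply absent from the result hash; with `rescue_errors: false`, the error propagates to the caller.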
## Testing
There are some basic automated tests for the scrapers. Run `bundle exec rspec` from the root of the repo.
Using rspec you can also isolate certain tests to run (left as an exercise to the reader).
## Adding a new scraper
1. Add a new entry to `sources.json`. You can get `latlng` from Google Maps
(right click the marker on the map and the coords will pop up). For `desc` you can
just copy the blurb from Google Maps as well.
2. Create a new file `scraper/lib/sources/venue_name.rb` (replacing `venue_name`, obviously).
3. You can copy one of the existing scraper classes as a starting point.
Note that there are a few different types of websites (calendar view, infinite scroll, all-on-one-page)
so it's best to find another scraper that is similar in that regard.
4. Make sure the class name is the exact same as the `name` value in `sources.json`
5. Fill out the contents of the scraper, using `binding.pry` and the `HEADLESS=false`
environment variable as needed for debugging.
6. Add a test case to `scraper_spec.rb` (can just use `generic_run_test` like the other scrapers)
Note that there is no need to explicitly `require` the scraper class anywhere in the codebase.
Autoloading is already set up based on `sources.json`.
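To see why the class name and the `name` value have to match, here is a sketch of how a name from `sources.json` could be mapped to a file under `scraper/lib/sources/`. It assumes that `sources.json` is an array of objects and that file names are the snake_case form of the class name; `source_file_for` is a hypothetical helper, and the repo's actual autoloading code may work differently.

```ruby
require "json"

# Convert a CamelCase class name such as "GreyArea" to a snake_case file path.
def source_file_for(class_name)
  snake = class_name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  "scraper/lib/sources/#{snake}.rb"
end

# Hypothetical sources.json content; only the `name` key matters for this sketch.
sources = JSON.parse('[{"name": "GreyArea"}, {"name": "ElboRoom"}]')
paths = sources.map { |source| source_file_for(source["name"]) }
puts paths
```

Under these assumptions, a class named `GreyArea` would live in `scraper/lib/sources/grey_area.rb`, so a mismatch between the two names would leave the loader looking for a constant or file that does not exist.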
## A note about the unused/server and unused/db folders
These are both unused. I kept them here in case I want to have a dedicated backend at some point.
For now it suffices to go backend-less and just host the results on GCS.
## Development - TODOS
- [ ] Map View
- [ ] Add more meta-scrapers (e.g. scrape other scrapers/aggregators), especially for electronic shows which aren't really captured by the current venue list or "The List"
- [ ] Add more venues (have specifically received requests for South Bay, but probably there are new SF / East Bay venues as well).
- [ ] Add Venue Events List view (accessible from Venue List View)
- [x] Find a way to handle events that don't have an explicit year in their date
- [x] Add Submit Event / About pages