A simple initial implementation of a web crawler for events from www.wegottickets.com.
- Host: GitHub
- URL: https://github.com/maxcnunes/wegottickets-events-crawler
- Owner: maxcnunes
- Created: 2016-08-10T23:30:09.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2016-08-10T23:30:46.000Z (over 8 years ago)
- Last Synced: 2024-10-18T07:22:40.100Z (3 months ago)
- Language: HTML
- Size: 29.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# wegottickets-events-crawler
This is a simple initial implementation of a web crawler for events from www.wegottickets.com. It saves the events it finds to a local file.
For more details, please take a look at the **Running** section.

## Usage
### Building
```bash
go build
```

Or, for cross-platform builds, use [goxc](https://github.com/laher/goxc):
```bash
goxc
```

### Running
```bash
Usage of ./wegottickets-events-crawler:
-limit int
The limit of pages the crawler will fetch.
-out string
The output path the crawler will save the events. (default "./events.json")
-url string
The wegottickets search result URL. (default "http://www.wegottickets.com/searchresults/region/0/latest")
```

**Some valid URLs**
* http://www.wegottickets.com/searchresults/region/0/latest
* http://www.wegottickets.com/searchresults/region/0/all
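The usage text above is the standard output of Go's `flag` package, so the options were presumably declared along these lines. This is a minimal sketch with defaults copied from the usage text, not the repository's actual code:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag declarations mirroring the usage output above; defaults copied from it.
	limit := flag.Int("limit", 0, "The limit of pages the crawler will fetch.")
	out := flag.String("out", "./events.json", "The output path the crawler will save the events.")
	url := flag.String("url", "http://www.wegottickets.com/searchresults/region/0/latest", "The wegottickets search result URL.")
	flag.Parse()

	// The real program would start crawling here; this sketch only echoes the flags.
	fmt.Printf("crawling %s (limit=%d) -> %s\n", *url, *limit, *out)
}
```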
### Testing

```bash
./scripts/test.sh
```
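The contents of `scripts/test.sh` are not shown here; presumably it wraps `go test`. A minimal test in that spirit, saved as a `_test.go` file, might look like this (`pageURL` is a hypothetical helper invented for illustration, not necessarily something this repo defines):

```go
package main

import (
	"fmt"
	"testing"
)

// pageURL is a hypothetical helper that appends a page number to a
// search-result URL; it stands in for whatever the crawler actually uses.
func pageURL(base string, page int) string {
	return fmt.Sprintf("%s/page/%d", base, page)
}

func TestPageURL(t *testing.T) {
	got := pageURL("http://www.wegottickets.com/searchresults/region/0/latest", 2)
	want := "http://www.wegottickets.com/searchresults/region/0/latest/page/2"
	if got != want {
		t.Errorf("pageURL() = %q, want %q", got, want)
	}
}
```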
## Tasks (summary)

* Fetch all pages from: http://www.wegottickets.com/searchresults/all
* Fetch the detail data for each event. Fields:
* the artists playing
* the city
* the name of the venue
* the date
* the price
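Taken together, those fields suggest a record shape along these lines. The struct name, field types, and JSON tags below are assumptions for illustration, not the repository's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the fields listed in the tasks above; names and tags are guesses.
type Event struct {
	Artists []string `json:"artists"`
	City    string   `json:"city"`
	Venue   string   `json:"venue"`
	Date    string   `json:"date"`  // raw date string as scraped from the page
	Price   string   `json:"price"` // raw price string, e.g. "£10.00"
}

func main() {
	e := Event{
		Artists: []string{"Some Artist"},
		City:    "London",
		Venue:   "Some Venue",
		Date:    "WED 10TH AUG, 2016",
		Price:   "£10.00",
	}
	b, _ := json.MarshalIndent(e, "", "  ")
	fmt.Println(string(b)) // events like this would be written to ./events.json
}
```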
**Ideas for Improvements**

* Option to run a "light crawler" that fetches data only from the search result pages
* Keep a hash of all fetched data, so the crawler can avoid re-fetching a previously "visited" event (see the sketch after this list)
* Respect robots.txt
* Show progress (remaining pages to fetch)
* Filter the search to only include music events
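For the "visited" idea above, here is a minimal sketch of the bookkeeping, assuming events are keyed by their detail-page URL (the type and method names are invented for illustration):

```go
package main

import "fmt"

// visited records event URLs that have already been fetched, so overlapping
// search-result pages don't cause the same event to be downloaded twice.
type visited map[string]bool

// seen reports whether url was fetched before, and marks it as fetched.
func (v visited) seen(url string) bool {
	if v[url] {
		return true
	}
	v[url] = true
	return false
}

func main() {
	v := visited{}
	urls := []string{
		"http://www.wegottickets.com/event/1",
		"http://www.wegottickets.com/event/1", // duplicate across search pages
	}
	for _, u := range urls {
		if v.seen(u) {
			fmt.Println("skipping already-fetched event:", u)
			continue
		}
		fmt.Println("fetching:", u) // the real crawler would fetch and parse here
	}
}
```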