Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/harrisoncramer/cloture.scrapers
The webscrapers that pull down information for the Cloture application.
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/harrisoncramer/cloture.scrapers
- Owner: harrisoncramer
- Created: 2020-05-25T21:57:46.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-06-02T20:57:25.000Z (over 3 years ago)
- Last Synced: 2023-04-22T00:48:42.359Z (over 1 year ago)
- Language: TypeScript
- Homepage: https://www.cloture.app
- Size: 3.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🏛️ Cloture Scrapers
These are the webscrapers that pull down information for [Cloture.app](https://www.cloture.app), an online tool for journalists tracking congressional committee procedures.
The project uses [Bull.js](https://github.com/OptimalBits/bull) to handle the processing of roughly 40 different congressional committee websites.
_NOTE: This project requires connections to Redis and MongoDB, which must be installed locally in development. See the Environment section for how to configure them. It also requires a local installation of Chromium._
## Development
You must install and configure Chromium, Redis, and MongoDB locally.
1. `npm install`
2. `npm run dev:start`
## Production
1. `docker build -t cloture_scrapers:latest .`
2. `docker run -dit --env NODE_ENV=production --env-file .env cloture_scrapers:latest`
## Environment
This project connects to [Redis](https://redis.io/) for the queue and to [MongoDB](https://docs.mongodb.com/manual/installation/) for storing the scraped results. In development (if you're on macOS) I'd recommend installing both with [Homebrew](https://brew.sh/).
In production, I'm using free cloud-based storage options to avoid having to configure these in separate Docker containers. Redis' free option is available on [redislabs.com](https://redislabs.com); MongoDB likewise offers a managed DB which is adequate for this project.
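Whichever environment is used, the variables in the `.env` files in this section end up as connection settings. The following is a minimal sketch of that step, not the project's actual code: the helper names `redisSettingsFromEnv` and `mongoUriFromEnv` and the `RedisSettings` type are hypothetical, and only the variable names come from this README.

```typescript
// Hypothetical helpers: none of these names are taken from the Cloture source.
// Only the environment variable names come from the README.

interface RedisSettings {
  host: string;
  port: number;
  password?: string; // only set in production, per the README's .env files
}

type Env = Record<string, string | undefined>;

// Build Redis connection settings from REDIS_URL, REDIS_PORT and REDIS_PASSWORD.
function redisSettingsFromEnv(env: Env): RedisSettings {
  const host = env.REDIS_URL;
  if (!host) {
    throw new Error("REDIS_URL is not set");
  }
  // Fall back to Redis' default port when REDIS_PORT is absent.
  const port = Number(env.REDIS_PORT ?? "6379");
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  }
  const settings: RedisSettings = { host, port };
  if (env.REDIS_PASSWORD) {
    settings.password = env.REDIS_PASSWORD;
  }
  return settings;
}

// Return MONGODB_URI, failing early if it is missing.
function mongoUriFromEnv(env: Env): string {
  const uri = env.MONGODB_URI;
  if (!uri) {
    throw new Error("MONGODB_URI is not set");
  }
  return uri;
}
```

In a Node process these would typically be called with `process.env`. The resulting object has the `host`, `port` and `password` fields that Bull accepts under its `redis` queue option. Failing early on a missing variable is a design choice of this sketch; it is meant to surface a misconfigured `.env` file at startup rather than on the first scrape.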
Development `.env` file:
```
MONGODB_URI="mongodb://username:password@localhost:27017/database?authSource=admin"
MONGODB_USER=username
MONGODB_PASS=password
REDIS_PORT=6379
REDIS_URL="127.0.0.1"
MONGOOSE_LOGS=true
HEADLESS=true
```
Production `.env` file:
```
MONGODB_URI=mongodb+srv://username:password@connection-string-on-mongodb's-cloud-service
REDIS_URL=redis-special-url-goes-here.cloud.redislabs.com
REDIS_PASSWORD=your_password
REDIS_PORT=your_port
```
## How does this work?
Each congressional committee has its own instructions stored in a simple JavaScript object: the tags to grab, plus specific instructions that we'll use inside of Puppeteer when scraping the page. Each job is set up to run every half hour. Our Bull.js queue sends the results of the completed jobs to the listener queue, which then takes the data and saves it to our database. All of the jobs communicate over Redis.
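The per-committee objects and the half-hour schedule can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the `CommitteeInstructions` fields, the example committee, the selectors, and the helpers `buildSchedule` and `scheduleAll` are all hypothetical. The only Bull behavior relied on is that `queue.add(data, opts)` accepts a `repeat.every` interval in milliseconds.

```typescript
// Hypothetical shape of one committee's instructions. Field names are
// illustrative and not taken from the Cloture source.
interface CommitteeInstructions {
  committee: string;
  url: string; // page Puppeteer would open
  selectors: { row: string; title: string; date: string }; // "tags to grab"
}

// Job options in the shape Bull accepts for a repeatable job:
// `repeat.every` is an interval in milliseconds.
interface RepeatOptions {
  repeat: { every: number };
}

interface ScheduledJob {
  data: CommitteeInstructions;
  opts: RepeatOptions;
}

const HALF_HOUR_MS = 30 * 60 * 1000;

// Made-up example entry; the real project has roughly 40 of these.
const committees: CommitteeInstructions[] = [
  {
    committee: "example-committee",
    url: "https://example.com/hearings",
    selectors: { row: "table tr", title: "td.title", date: "td.date" },
  },
];

// Pair every committee with a repeat interval (half an hour by default).
function buildSchedule(
  list: CommitteeInstructions[],
  everyMs: number = HALF_HOUR_MS
): ScheduledJob[] {
  return list.map((data) => ({ data, opts: { repeat: { every: everyMs } } }));
}

// Minimal stand-in for the part of a Bull queue used here, so the sketch
// type-checks without Bull or Redis installed.
interface QueueLike {
  add(data: CommitteeInstructions, opts: RepeatOptions): Promise<unknown>;
}

// Register each scheduled job on the queue, one after another.
async function scheduleAll(queue: QueueLike, jobs: ScheduledJob[]): Promise<void> {
  for (const job of jobs) {
    await queue.add(job.data, job.opts);
  }
}
```

Building the schedule as plain data before touching the queue keeps that step easy to inspect without a running Redis instance. A real Bull queue could then be passed to `scheduleAll` in place of the `QueueLike` stand-in.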