https://github.com/mfbx9da4/scrape_email
Uses google search to gather emails for a given set of queries
- Host: GitHub
- URL: https://github.com/mfbx9da4/scrape_email
- Owner: mfbx9da4
- Created: 2018-06-18T17:51:45.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-09-16T14:53:24.000Z (over 6 years ago)
- Last Synced: 2025-01-26T17:45:46.323Z (4 months ago)
- Language: JavaScript
- Size: 242 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
## Scrape emails
- Provide a JSON file with a list of names
- Scrapes the first page of Google results for `${name} contact`, looking for `@`
- Then scrapes the first few links until it finds `@` on a page
## Method
part 1 - save html files:
- create html files for all google pages and top results
- in parallel
- would be faster to store in db

part 2 - read each html file:
- scrape for email regex, accumulate all emails
## Naming
Always prefix functions and files with either `fetchBar` or `extractFoo`
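
As a rough illustration of this convention and of the two-part method above, the sketch below pairs a hypothetical `fetchPage` helper with a hypothetical `extractEmails` helper. Neither name, nor the regex, comes from this repository; they are assumptions made for the example, and the global `fetch` call assumes Node 18 or later.

```javascript
// Hypothetical helpers illustrating the fetch*/extract* naming convention.
// The names and the regex are assumptions, not code from this repository.

// Deliberately simple email pattern; it will miss some valid addresses
// and may match some invalid ones.
const EMAIL_REGEX = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Part 1: fetch a page and return its HTML (assumes Node 18+ global fetch).
async function fetchPage(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}: ${url}`);
  }
  return response.text();
}

// Part 2: extract every unique email address found in an HTML string.
function extractEmails(html) {
  const matches = html.match(EMAIL_REGEX) || [];
  // Lowercase and de-duplicate while keeping first-seen order.
  return [...new Set(matches.map((email) => email.toLowerCase()))];
}

const sampleHtml =
  '<a href="mailto:Jane@Example.com">jane@example.com</a> or sales@example.co.uk';
console.log(extractEmails(sampleHtml));
```

Keeping the network step and the parsing step in separate functions means the extraction can be re-run on saved HTML files without fetching the pages again.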
## Getting Started
Requires docker and docker compose
docker-compose up
yarn
Use redis-commander for debugging values
npm install -g redis-commander
redis-commander
## Deploying
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html#docker-basics-create-image
╰─$ aws ecr get-login --no-include-email --region eu-west-2
## paste output to login
╰─$ docker build -t scrape_email .
╰─$ docker tag scrape_email:latest 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest
╰─$ docker push 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest
## Run Once
See package.json for scripts
node index.js
## Testing docker locally
docker build -t scrape_email .
docker run -p 49160:3001 -p 3000:3000 -d scrape_email
## Streamlining: When to update db?
- Individual agents should be updated independently of other agents, therefore:
  - Wait for all subpages to be scraped before updating an agent record? YES
  - Wait for all properties to be scraped before updating an agent record? YES
  - Wait for all agents to be scraped before updating an agent record? NO
  - Wait for all regions to be scraped before updating an agent record? NO
- => Emit an event every time you extract some data completely for an agent
- Emit an event when:
  - Found agent name, address, etc
  - Extracted agent property stats
  - Extracted agent email
## API
- Get email for query
- Get email for list of queries
## TODO
- Support native CSV upload
- Support download
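
Referring back to the Streamlining section, one way to update each agent independently is to emit an event as soon as one piece of data is complete for that agent. The sketch below uses Node's built-in `EventEmitter`; the event names, the `agentId` field and the in-memory `Map` standing in for the database are all assumptions for illustration, not the repository's actual code.

```javascript
const { EventEmitter } = require('events');

// Hypothetical event names, one per item in the "Emit an event when" list.
const AGENT_EVENTS = ['agent:details', 'agent:propertyStats', 'agent:email'];

// In-memory stand-in for the database (an assumption for this sketch).
const agentRecords = new Map();
const agentEvents = new EventEmitter();

// Merge the completed piece of data into that agent's record only.
function updateAgentRecord(agentId, data) {
  const existing = agentRecords.get(agentId) || {};
  agentRecords.set(agentId, { ...existing, ...data });
}

for (const eventName of AGENT_EVENTS) {
  agentEvents.on(eventName, ({ agentId, data }) => updateAgentRecord(agentId, data));
}

// Each emit updates one agent without waiting for other agents or regions.
agentEvents.emit('agent:details', { agentId: 'a1', data: { name: 'Example Agent' } });
agentEvents.emit('agent:email', { agentId: 'a1', data: { email: 'agent@example.com' } });
console.log(agentRecords.get('a1'));
```

Because every handler touches a single agent's record, a slow or failed scrape for one agent or region does not hold back updates for the others.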