An open API service indexing awesome lists of open source software.

https://github.com/mfbx9da4/scrape_email

Uses google search to gather emails for a given set of queries
https://github.com/mfbx9da4/scrape_email

Last synced: about 2 months ago
JSON representation

Uses google search to gather emails for a given set of queries

Awesome Lists containing this project

README

        

## Scrape emails

- Provide either a json file with list of names
- Scrapes google first page for `${name} contact` @
- Then scrapes first few links until it finds @ on page

## Method

part 1 - save html files:
- create html files for all google pages and top results
- in parallel
- would be faster to store in db

part 2 - read each html file:
- scrape for email regex, accumalate all email

## Naming

Always prefix functions and files with either `fetchBar` or `extractFoo`

## Getting Started

requries docker and docker compose

docker-compose up
yarn

Use for debugging values

npm install -g redis-commander
redis-commander

## Deploying

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html#docker-basics-create-image

╰─$ aws ecr get-login --no-include-email --region eu-west-2
## paste output to login
╰─$ docker build -t scrape_email .
╰─$ docker tag scrape_email:latest 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest
╰─$ docker push 016582366134.dkr.ecr.eu-west-2.amazonaws.com/scrape_email:latest

## Run Once

See package.json for scripts

node index.js

## Testing docker locally

docker build -t scrape_email .
docker run -p 49160:3001 3000:3000 -d scrape_email

## Streamlining: When to update db?

- Individual agents should be updated independent of other agents therefore:
- Wait for all subpages to be scraped before updating an agent record? YES
- Wait for all properties to be scraped before updating an agent record? YES
- Wait for all agents to be scraped before updating an agent record? NO
- Wait for all regions to be scraped before updating an agent record? NO
- => Emit an event everytime you extract some data completely for an agent
- Emit an event when:
- Found agent name, address, etc
- Extracted agent property stats
- Extracted agent email

## API
- Get email for query
- Get email for list of queries

## TODO
- Support upload csv native upload
- Support download