Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/illimag/cl_scrapper
Python Web Scrapper with Rotating Proxies, Rotating User Agents, BS4
Last synced: 20 days ago
- Host: GitHub
- URL: https://github.com/illimag/cl_scrapper
- Owner: Illimag
- Created: 2019-05-07T02:00:42.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-11-08T06:58:51.000Z (almost 2 years ago)
- Last Synced: 2024-06-07T04:33:03.099Z (5 months ago)
- Language: Python
- Homepage:
- Size: 26 MB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 44
Metadata Files:
- Readme: README.md
# Lead Generator & Mail Automation
![lead_generator](docs/leads_generator.png)
## Overview
This is a lead generator and a mail automation system. The lead generator gets leads from Craigslist. The lead generator runs on User2 and the mail automation system runs on Desktop-E1V0N4N. The system takes in URLs, filters them by keyword, and removes duplicates. It then transfers the files via FTP to Desktop-E1V0N4N.
Once the files, which are JSON objects, are transferred via FTP to Desktop-E1V0N4N, Watchdog, a Python dependency, automatically detects them and runs check_master.py, which compares the leads to the master.json file. master.json is where all the leads are stored.
The new leads are added to master.json and a new lead JSON object is created.
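
The comparison against master.json might look roughly like the sketch below. It assumes each lead is a JSON object with a `url` field and that master.json holds a list of such objects. The field name, the function name `merge_new_leads`, and the file arguments are illustrative assumptions, not taken from check_master.py.

```python
import json
from pathlib import Path


def merge_new_leads(incoming_path, master_path, new_leads_path):
    """Add unseen leads to the master file and write them to a separate file."""
    incoming = json.loads(Path(incoming_path).read_text())
    master_file = Path(master_path)
    # Start from an empty master list if the file does not exist yet.
    master = json.loads(master_file.read_text()) if master_file.exists() else []

    known_urls = {lead["url"] for lead in master}
    new_leads = []
    for lead in incoming:
        if lead["url"] not in known_urls:
            known_urls.add(lead["url"])
            new_leads.append(lead)

    if new_leads:
        master_file.write_text(json.dumps(master + new_leads, indent=2))
        # Assumption: this separate file is the "new lead JSON object"
        # that the next step picks up.
        Path(new_leads_path).write_text(json.dumps(new_leads, indent=2))
    return new_leads
```

Tracking known URLs in a set keeps the lookup cheap as master.json grows, and it also drops duplicates within a single incoming file.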
Watchdog detects this new lead JSON object and runs a UBot Studio executable.
This executable takes the URLs and converts them to email addresses.
This step is needed because Craigslist does not include the email address in the page HTML until a script is triggered on the client side, so we use UBot Studio to automate the mouse click. UBot Studio is a Windows application.
The email addresses are put into a CSV file.
Watchdog then runs another UBot Studio executable, AUTO_LOAD.UBOT.
This loads the CSV file into the mail server application and sends the emails.
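
The first Watchdog hand-off in the overview (a transferred file triggering check_master.py) could be wired up along these lines. This is a sketch under assumptions: it treats every new `.json` file in the watched directory as a transferred lead file, and it passes the file path to check_master.py as a command-line argument, which the real script may not expect. `command_for` and `watch` are hypothetical names.

```python
import subprocess
import sys


def command_for(path):
    """Return the command to run for a newly created file, or None to ignore it."""
    # Assumption: transferred lead files are the only .json files dropped here.
    if path.lower().endswith(".json"):
        return [sys.executable, "check_master.py", path]
    return None


def watch(directory):
    """Block until interrupted, running check_master.py on each new JSON file."""
    # Imported here so command_for can be used without watchdog installed.
    import time
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class LeadHandler(FileSystemEventHandler):
        def on_created(self, event):
            if event.is_directory:
                return
            command = command_for(event.src_path)
            if command is not None:
                subprocess.run(command, check=False)

    observer = Observer()
    observer.schedule(LeadHandler(), directory, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Keeping the file-to-command decision in a plain function makes it easy to add the later hand-offs (the UBot Studio executables) as further branches without touching the observer loop.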
## Automation Procedure
The two machines, User0 and Desktop-E1V0N4N, automatically power on at 5:00 AM and power off at 11:59 AM.
At power-on, starter.py is run automatically.
At power-off, closer.py is run automatically.
## Lead Generator
### Craigslist Scraper
The scraper rotates IP addresses and user agents to spoof Craigslist.
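
A minimal sketch of the rotation idea, using only the standard library. The proxy addresses and user-agent strings are placeholders, and `next_identity` and `build_opener` are hypothetical helpers; spider.py may implement this differently.

```python
import itertools
import urllib.request

# Placeholder values; a real run would load working proxies and current user agents.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

proxy_pool = itertools.cycle(PROXIES)
agent_pool = itertools.cycle(USER_AGENTS)


def next_identity():
    """Return the next (proxy, user_agent) pair, wrapping around at the end."""
    return next(proxy_pool), next(agent_pool)


def build_opener(proxy, user_agent):
    """Create an opener that routes through the proxy and sends the user agent."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

Each request would then call `build_opener(*next_identity())` and pass the fetched HTML to BS4. Because the two lists have different lengths, the proxy and user-agent pairings shift on every pass rather than repeating in lockstep.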
The main scraping program is spider.py.
spider.py cycles through the URLs, which are divided into lead_cycles.
The lead_cycles depend on the traffic of the URLs.
spider_cycle.txt keeps track of the cycle number.
Once a cycle is completed, spider.py updates spider_cycle.txt to the next cycle.
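
The cycle bookkeeping can be sketched as follows, assuming spider_cycle.txt holds a single integer and that the counter wraps back to the first cycle after the last one. Both points, and the name `advance_cycle`, are assumptions rather than details taken from spider.py.

```python
from pathlib import Path


def advance_cycle(cycle_file, total_cycles):
    """Read the current cycle number, store the next one, and return the current."""
    path = Path(cycle_file)
    # Assumption: the file holds a single integer; a missing file means cycle 0.
    current = int(path.read_text().strip()) if path.exists() else 0
    next_cycle = (current + 1) % total_cycles  # wrap back to the first cycle
    path.write_text(str(next_cycle))
    return current
```

Persisting the counter to a file lets the scraper resume at the right cycle after the daily power-off.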
### Filters
Leads are filtered by keyword, and duplicates are removed.
clean_lead.py combines the keyword filter and the FTP transfer.
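
A small sketch of the filtering step. It assumes each lead has `title` and `url` fields and that the keyword filter keeps leads whose title contains at least one keyword; the source does not say whether keywords include or exclude leads, so the direction is a guess. `filter_leads` is a hypothetical name.

```python
def filter_leads(leads, keywords):
    """Keep leads whose title mentions a keyword, dropping repeated URLs."""
    lowered = [k.lower() for k in keywords]
    seen = set()
    kept = []
    for lead in leads:
        title = lead["title"].lower()
        if not any(k in title for k in lowered):
            continue  # no keyword matched
        if lead["url"] in seen:
            continue  # duplicate of a lead already kept
        seen.add(lead["url"])
        kept.append(lead)
    return kept
```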
### FTP
The FTP server is located on the Desktop-E1V0N4N machine.
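
The transfer itself could be done with the standard `ftplib` module, as in this sketch. The host, credentials, and the helper names `upload_file` and `send_leads` are assumptions; the actual transfer code in clean_lead.py may differ.

```python
from ftplib import FTP
from pathlib import Path


def upload_file(ftp, local_path):
    """Upload one file in binary mode using an already connected FTP session."""
    path = Path(local_path)
    with path.open("rb") as handle:
        ftp.storbinary(f"STOR {path.name}", handle)
    return path.name


def send_leads(host, user, password, local_path):
    """Connect, log in, upload the file, and close the session."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        return upload_file(ftp, local_path)
```

Splitting the upload from the connection handling lets the upload logic be exercised with a stand-in session object, without a live FTP server.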
### URLs to Email
Currently using UBot Studio to get emails from the URLs.
Need something that will click the button, because without it the emails won't show.
Server-side JavaScript?
## Mail Automation
A UBot Studio executable runs automatically and sends the mail.
With Mail for Good and AWS SES, we can send the emails.