Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mohamed17717/scraper-helper
some classes i used most when i try to scrape
- Host: GitHub
- URL: https://github.com/mohamed17717/scraper-helper
- Owner: mohamed17717
- License: mit
- Created: 2018-12-30T10:12:30.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T01:30:24.000Z (about 2 years ago)
- Last Synced: 2024-08-03T01:26:28.637Z (6 months ago)
- Language: Python
- Homepage:
- Size: 26.4 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-rainmana - mohamed17717/scraper-helper - some classes i used most when i try to scrape (Python)
README
# Scraping Helper
Classes that save time on the repetitive parts of the scraping process.
## Getting Started
### Prerequisites
- [Python3.6](https://www.python.org/downloads/) or later
#### For Browser class
- [Firefox](https://www.mozilla.org/en-US/firefox/)
- [Selenium](https://pypi.org/project/selenium/)
- [Geckodriver](https://github.com/mozilla/geckodriver/releases) (a quick setup check follows below)
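
If you want to sanity-check these Browser prerequisites, a minimal Selenium + Firefox run looks roughly like the sketch below. This is only an illustration (the URL is a placeholder), and it assumes Firefox is installed and geckodriver is available on your PATH.

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Start Firefox headlessly just to confirm Selenium can drive it through geckodriver.
options = Options()
options.add_argument('-headless')

driver = webdriver.Firefox(options=options)
try:
    driver.get('https://example.com/')   # placeholder page
    print(driver.title)                  # prints the page title if the setup works
finally:
    driver.quit()
```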
### Installing

#### Clone the repository and install dependencies
``` bash
git clone https://github.com/mohamed17717/scraper-helper.git
cd scraper-helper
pip install -r requirements.txt
```

## Usage examples

The examples below show how to use the **Scraper** and **Browser** classes.
### **Scraper** class
> `download video`
```python
from scraper import Scraper
s = Scraper()
video_link = 'https://video.faly2-1.fna.fbcdn.net/v/t42.9040-2/62397431_315586462709416_146625252862984192_n.mp4?_nc_cat=102&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_oc=AQkgWetwMBu9sezw9cSv95KrlX03X1wX_ZOaSzMxtmcfq1_Ix_tXVWefEr2Xyq_8Ka4&_nc_ht=video.faly2-1.fna&oh=4ca74f2b4d64ea2856819efa3ce4fe4f&oe=5D3E91FB'
s.download(link=video_link)
```

> `download Facebook video using cookies`
``` python
from scraper import *
from urllib.parse import unquote

url = 'https://www.facebook.com/livingin2077/videos/2296419330686852/'
url = url.replace('www.', 'm.')  # use the mobile site, which the selector below relies on
parsed_url = UrlParser(url)

s = Scraper()
cookies = s.get_cookies('firefox', parsed_url.domain)  # this domain's cookies from Firefox
s.set_cookies(cookies)

s.get(url)
soup = s.html_soup()

a = soup.select_one('a[href^="/video_redirect"]')
video_link = unquote(a['href'].replace('/video_redirect/?src=', ''))
s.download(video_link)
```

### **Browser** class
> `Log in to Instagram`
``` python
from scraper import Browser
from time import sleep

b = Browser(hide=False)  # hide=False: do not hide the browser window
b.get('http://instagram.com', with_cookies=False)
sleep(1)
b.click_btn('a[href^="/accounts/login"]')
sleep(1)
b.fill_input('input[name="username"]', 'any_user')
b.fill_input('input[name="password"]', 'wrong password')
b.click_btn('button[type="submit"]')
```

> and so on.\
> Scripts become much shorter than before.\
> There are many more capabilities to discover.

## Built With
- **Requests** - Fetches pages over HTTP (see the sketch below)
- **Beautiful Soup** - Parses HTML
- **Selenium** - Controls the browser
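
For context, the sketch below shows the bare Requests + Beautiful Soup workflow that the **Scraper** class wraps. It is only an illustration; the URL and selector are placeholders rather than anything from this project.

```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/'  # placeholder URL

# Fetch the page and fail loudly on HTTP errors.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out a single element with a CSS selector.
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.select_one('title')
print(title.get_text(strip=True) if title else 'no <title> found')
```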
## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## TODO
- [ ] add proxy class
- [ ] add traversing class - to move through the DOM easily
- [ ] make code cleaner