# Scraping Helper

Classes that save time on the repetitive parts of the scraping process.

## Getting Started

### Prerequisites

- [Python 3.6](https://www.python.org/downloads/) or later

#### For Browser class

- [Firefox](https://www.mozilla.org/en-US/firefox/)
- [Selenium](https://pypi.org/project/selenium/)
- [Geckodriver](https://github.com/mozilla/geckodriver/releases)

### Installing

#### Clone The repository and install dependencies

``` bash
git clone https://github.com/mohamed17717/scraper-helper.git
cd scraper-helper
pip install -r requirements.txt
```
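A quick way to confirm the dependencies installed correctly is to import the main classes from the cloned directory (a minimal sanity check, not an official step of the project):

```python
# Run this from inside the cloned scraper-helper directory.
from scraper import Scraper, Browser

print(Scraper, Browser)  # importing without errors means the install worked
```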

## Usage

The examples below show the main classes in action.

### **Scraper** class

> `download a video`

```python
from scraper import Scraper

s = Scraper()

# Direct link to the video file (example Facebook CDN URL)
video_link = 'https://video.faly2-1.fna.fbcdn.net/v/t42.9040-2/62397431_315586462709416_146625252862984192_n.mp4?_nc_cat=102&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_oc=AQkgWetwMBu9sezw9cSv95KrlX03X1wX_ZOaSzMxtmcfq1_Ix_tXVWefEr2Xyq_8Ka4&_nc_ht=video.faly2-1.fna&oh=4ca74f2b4d64ea2856819efa3ce4fe4f&oe=5D3E91FB'

s.download(link=video_link)
```
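Under the hood, `download` presumably streams the file to disk. Here is a minimal sketch of the equivalent with plain requests (an assumption about the internals, not this project's actual code):

```python
# Rough plain-requests equivalent of Scraper.download (assumed behaviour).
import requests

def download(link, filename='video.mp4'):
    # Stream the response so large video files are not held in memory.
    with requests.get(link, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(filename, 'wb') as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)
```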

> `download a Facebook video using browser cookies`

``` python
from scraper import *
from urllib.parse import unquote

# Switch to the mobile version of the page
url = 'https://www.facebook.com/livingin2077/videos/2296419330686852/'
url = url.replace('www.', 'm.')
parsed_url = UrlParser(url)

s = Scraper()

# Reuse the Facebook session cookies already stored by Firefox
cookies = s.get_cookies('firefox', parsed_url.domain)
s.set_cookies(cookies)

s.get(url)
soup = s.html_soup()

# The mobile page links to the raw video through /video_redirect
a = soup.select_one('a[href^="/video_redirect"]')
video_link = unquote(a['href'].replace('/video_redirect/?src=', ''))
s.download(video_link)
```
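For comparison, roughly the same flow written directly against requests and Beautiful Soup (the libraries listed under Built With) might look like this; the cookie values below are placeholders standing in for a real Facebook session:

```python
# Plain requests + Beautiful Soup version of the flow above, for comparison.
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = 'https://m.facebook.com/livingin2077/videos/2296419330686852/'
cookies = {'session_cookie': 'value'}  # placeholder: real browser cookies required

resp = requests.get(url, cookies=cookies)
soup = BeautifulSoup(resp.text, 'html.parser')

# The mobile page links to the raw video through /video_redirect
a = soup.select_one('a[href^="/video_redirect"]')
video_link = unquote(a['href'].replace('/video_redirect/?src=', ''))
```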

### **Browser** class

> `Log in to Instagram`

``` python
from scraper import Browser
from time import sleep

b = Browser(hide=False)
b.get('http://instagram.com', with_cookies=False)
sleep(1)

# Open the login form and submit the credentials
b.click_btn('a[href^="/accounts/login"]')
sleep(1)
b.fill_input('input[name="username"]', 'any_user')
b.fill_input('input[name="password"]', 'wrong password')
b.click_btn('button[type="submit"]')
```

> And so on.\
> Scripts end up much shorter than they would be with the raw libraries,\
> and there are many more capabilities to discover.
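As a rough illustration of the boilerplate the Browser class hides, the same Instagram login written directly against Selenium might look like this (a sketch for comparison, not code from this project):

```python
# The same login flow against raw Selenium, for comparison with the example above.
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # requires geckodriver on PATH
driver.get('http://instagram.com')
sleep(1)
driver.find_element(By.CSS_SELECTOR, 'a[href^="/accounts/login"]').click()
sleep(1)
driver.find_element(By.CSS_SELECTOR, 'input[name="username"]').send_keys('any_user')
driver.find_element(By.CSS_SELECTOR, 'input[name="password"]').send_keys('wrong password')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
driver.quit()
```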

## Built With

- **Requests** - HTTP client used to fetch pages and files
- **Beautiful Soup** - HTML parsing
- **Selenium** - browser automation

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## TODO

- [ ] Add a proxy class
- [ ] Add a traversing class - to move through the DOM easily
- [ ] Make the code cleaner