https://github.com/rootkot/invader
Python simple module for data grabbing from websites with JavaScript support
beautifulsoup grabber javascript parsing python2-7 python3 scraper web
- Host: GitHub
- URL: https://github.com/rootkot/invader
- Owner: rootKot
- Created: 2017-02-17T18:40:05.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-07-24T09:30:07.000Z (over 7 years ago)
- Last Synced: 2024-11-17T11:16:22.959Z (about 3 hours ago)
- Topics: beautifulsoup, grabber, javascript, parsing, python2-7, python3, scraper, web
- Language: HTML
- Homepage: https://pypi.python.org/pypi/invader
- Size: 39.1 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Invader
============
### Invader is a simple Python module for grabbing data from websites, with JavaScript support!

Invader is based on BeautifulSoup and dryscrape.
---
Dependencies
============
* **[Requests](http://docs.python-requests.org/en/master/)**
* **[Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/)**
* **[Dryscrape](https://github.com/niklasb/dryscrape)**

Getting Started
============
* Install all dependencies if you haven't already:
```
$ sudo pip install requests
```
```
$ sudo apt-get install python-bs4
$ sudo pip install beautifulsoup4
```
```
$ sudo apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
$ sudo pip install dryscrape
```
* Install invader:
```
$ sudo pip install invader
```

Items list data grabbing example:
```python
from invader import Invader

url = 'https://duckduckgo.com/?q=python&t=hb&ia=web'
invader = Invader(url, js=True)

res = invader.take_list('#links .result', {
    'title': ['.result__a', 'text'],
    'src': ['.result__a', 'href']
})

print(res)
```
The response will be a list of dictionaries, each containing an item's title and link URL (`src`):

```json
[
{"title": "Welcome to Python.org", "src": "https://www.python.org/"},
{"title": "Python (programming language) - Wikipedia", "src": "https://en.wikipedia.org/wiki/Python_%28programming_language%29"},
{"title": "Python | Codecademy", "src": "https://www.codecademy.com/learn/python"}
]
```

Here are some **[examples](https://github.com/rootKot/invader/tree/master/examples)** of usage.
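
Since `take_list` returns a plain list of dictionaries, the result can be post-processed with the standard library. The following is only a sketch, assuming Python 3: it uses the sample output shown above as hard-coded data instead of a live request, and `results_to_csv` is a hypothetical helper, not part of invader.

```python
import csv
import io

# Sample data copied from the example output above; in real use this
# would be the value returned by invader.take_list(...).
results = [
    {"title": "Welcome to Python.org", "src": "https://www.python.org/"},
    {"title": "Python (programming language) - Wikipedia",
     "src": "https://en.wikipedia.org/wiki/Python_%28programming_language%29"},
    {"title": "Python | Codecademy",
     "src": "https://www.codecademy.com/learn/python"},
]


def results_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()   # first row: the column names
    writer.writerows(rows)  # one row per scraped item
    return buffer.getvalue()


csv_text = results_to_csv(results, ["title", "src"])
print(csv_text)
```

The field names passed to the helper should match the keys used in the `take_list` dictionary.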
Documentation
============

First, import the `Invader` class from `invader`.
Then create an instance of `Invader`, passing the URL of the website as the argument, plus `js=True` if JavaScript support is needed:

```python
from invader import Invader
invader = Invader('http://some.site', js=True)
```

After that, the content of the website is fetched and stored in the instance.
### **Public functions**
### take(selector_list)
For example, if you have the URL of a specific topic page on a forum and only need the topic title, or you need a list of all image sources, you can easily use this function.
**take()** receives a single list argument: the first element is the CSS selector of an HTML element, and the second is what to take from it (for example `text` or an attribute name). It returns a string or a list of results.

```python
res = invader.take(['.content .topic-title', 'text'])
```
In this example, we get the text of the element with the class `topic-title`.
You can also take an attribute value from the element:

```python
res = invader.take(['.content .topic-title a', 'href'])
```
The result will be:

```
http://some.site/link
```

### take_list(wrapper, fields_dict)
If you need to get information about each item on a shopping site, use this function!
**take_list()** receives two arguments. The first is a string with the selector of the item wrapper element.
The second is a dictionary that maps each key to a selector and the thing we need from it (text, src, href, etc.):

```python
res = invader.take_list('.products-wrap > a', {
    'img_url': ['.pr-item-wrap > img', 'src'],
    'title': ['.pr-title', 'text']
})
```
The response will be a list of dictionaries, each containing an item's `img_url` and `title`:

```json
[
{"img_url": "/files/items/30735/icon_219x270.jpg", "title": "Поло Vit 16 9713tr"},
{"img_url": "/files/items/30734/icon_219x240.jpg", "title": "Поло Vit 16 9713tr"}
]
```

You can also pass `None` as the first argument if the items have no wrapper element and simply follow one another.
**But Warning!** Be careful in that case!
Make sure that every item has the same HTML elements that you want to get! Otherwise the order will be broken and the result will be wrong.

### screenshot(path)
If JS is enabled, requests go through a virtual browser using dryscrape,
so you can take a screenshot of the website you visited.
Optionally, pass the path where the screenshot should be saved.
```python
invader = Invader('https://google.com', js=True)
invader.screenshot('/var/www/screenshots/')
```
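
The path in the example above is a directory. The sketch below assumes (the README does not confirm this) that the target directory has to exist before `screenshot()` is called. `prepare_screenshot_dir` is a hypothetical helper, not part of invader, and the invader call itself is left as a comment.

```python
import os
from pathlib import Path


def prepare_screenshot_dir(path):
    """Create the directory (with parents) if missing; return it with a trailing separator.

    Hypothetical helper: it only prepares the filesystem and does not call invader.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return os.path.join(str(directory), '')       # joining with '' appends the separator


# Intended use (assumption: screenshot() accepts a directory path, as in the example above):
# invader.screenshot(prepare_screenshot_dir('/var/www/screenshots/'))
```

Calling the helper repeatedly is safe, because `exist_ok=True` makes the directory creation idempotent.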