https://github.com/rootkot/invader

Python simple module for data grabbing from websites with JavaScript support
beautifulsoup grabber javascript parsing python2-7 python3 scraper web

Host: GitHub
URL: https://github.com/rootkot/invader
Owner: rootKot
Created: 2017-02-17T18:40:05.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2017-07-24T09:30:07.000Z (about 8 years ago)
Last Synced: 2025-04-25T15:22:56.236Z (5 months ago)
Topics: beautifulsoup, grabber, javascript, parsing, python2-7, python3, scraper, web
Language: HTML
Homepage: https://pypi.python.org/pypi/invader
Size: 39.1 KB
Stars: 6
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Invader
============

### Invader is a simple Python module for grabbing data from websites, with JavaScript support.

Invader is built on Beautiful Soup and dryscrape.

---

Dependencies
============

* **[Requests](http://docs.python-requests.org/en/master/)**

* **[Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/)**

* **[Dryscrape](https://github.com/niklasb/dryscrape)**

Getting Started
============

* Install the dependencies if you haven't already:

```
$ sudo pip install requests
```

```
$ sudo apt-get install python-bs4
$ sudo pip install beautifulsoup4
```

```
$ sudo apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
$ sudo pip install dryscrape
```

* Install invader:

```
$ sudo pip install invader
```

Example of grabbing a list of items:

```python
from invader import Invader

url = 'https://duckduckgo.com/?q=python&t=hb&ia=web'
invader = Invader(url, js=True)

res = invader.take_list('#links .result', {
    'title': ['.result__a', 'text'],
    'src': ['.result__a', 'href']
})

print(res)
```

The result is a list of dictionaries, each containing an item's title and its link URL (stored under the `src` key):

```json
[
    {"title": "Welcome to Python.org", "src": "https://www.python.org/"},
    {"title": "Python (programming language) - Wikipedia", "src": "https://en.wikipedia.org/wiki/Python_%28programming_language%29"},
    {"title": "Python | Codecademy", "src": "https://www.codecademy.com/learn/python"}
]
```
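Since the result is a plain list of dictionaries, it can be passed straight to standard-library tools. The sketch below writes such a list to CSV text; `to_csv` is a hypothetical helper (not part of Invader), and `res` is a hard-coded stand-in shaped like the output above. It is written for Python 3.

```python
import csv
import io

# Stand-in data shaped like the take_list() result shown above.
res = [
    {"title": "Welcome to Python.org", "src": "https://www.python.org/"},
    {"title": "Python | Codecademy", "src": "https://www.codecademy.com/learn/python"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(res, ['title', 'src']))
```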

Here are some more usage **[examples](https://github.com/rootKot/invader/tree/master/examples)**.

Documentation
============

First, import the `Invader` class from `invader`.

Create an instance of `Invader`, passing the URL of the website as the argument, and `js=True` if the page needs JavaScript support.

```python
from invader import Invader

invader = Invader('http://some.site', js=True)
```

The content of the website is then fetched and stored in the instance.

### **Public functions**

### take(selector_list)

For example, if you have the URL of a specific topic page on a forum and only need to pull the topic title, or you need a list of all image sources on the page, this function makes that easy.

**take()** receives a single list argument: the first element is the CSS selector of an HTML element, and the second is what to take from it (`text` or an attribute name). It returns a string or a list of results.

```python
res = invader.take(['.content .topic-title', 'text'])
```

In this example, we get the text of the element with the class `topic-title`.

You can also take an attribute value from the element.

```python
res = invader.take(['.content .topic-title a', 'href'])
```

The result will be:

```
http://some.site/link
```
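Because **take()** may return either a single string or a list, calling code sometimes wants one consistent shape. A minimal sketch of a normalizing helper follows. `as_list` is a hypothetical name, not part of Invader, and treating `None` as "nothing matched" is an assumption rather than documented behavior.

```python
def as_list(value):
    """Normalize a take()-style result to a list.

    Hypothetical helper, not part of Invader. Assumes a missing match
    might be represented as None (an assumption, not documented behavior).
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Stand-in values shaped like the results described above.
print(as_list('http://some.site/link'))
print(as_list(['/a.png', '/b.png']))
```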

### take_list(wrapper, fields_dict)

If you need to get information about each item on a shopping site, use this function.

**take_list()** receives two arguments. The first is a string with the selector of the item wrapper element.

The second is a dictionary that maps each output key to a list holding a selector and what to take from it (`text`, `src`, `href`, etc.).

```python
res = invader.take_list('.products-wrap > a', {
    'img_url': ['.pr-item-wrap > img', 'src'],
    'title': ['.pr-title', 'text']
})
```

The result is a list of dictionaries, each containing an item's `img_url` and `title`:

```json
[
  {"img_url": "/files/items/30735/icon_219x270.jpg", "title": "Поло  Vit 16 9713tr"},
  {"img_url": "/files/items/30734/icon_219x240.jpg", "title": "Поло  Vit 16 9713tr"}
]
```
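The `img_url` values in this output are relative to the site root. If absolute URLs are needed, they can be resolved with the standard library. The sketch below assumes Python 3; `absolutize` and `BASE_URL` are hypothetical names, and the base URL is made up for illustration.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

# Assumed address of the scraped page, for illustration only.
BASE_URL = 'http://some.site/products/'


def absolutize(items, base_url, key='img_url'):
    """Return copies of the items with `key` resolved against base_url."""
    return [dict(item, **{key: urljoin(base_url, item[key])}) for item in items]


# Stand-in data shaped like the take_list() result shown above.
res = [
    {'img_url': '/files/items/30735/icon_219x270.jpg', 'title': 'Поло  Vit 16 9713tr'},
]

print(absolutize(res, BASE_URL))
```

The helper builds new dictionaries, so the original `res` list is left untouched.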

You can also pass `None` as the first argument if the items have no wrapper element and simply follow one another.

**Warning:** be careful in that case!

Make sure every item contains all of the HTML elements you want to get. Otherwise the values fall out of order and the results will be wrong.
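To see why a missing element breaks the order, suppose each field is collected as its own flat list and the lists are then paired by position. This is only an illustration of the failure mode with made-up data; it does not claim to reproduce Invader's internals, and `same_length` is a hypothetical sanity check.

```python
# Made-up data: three items on the page, but the second one has no image.
titles = ['Item A', 'Item B', 'Item C']
img_urls = ['/a.jpg', '/c.jpg']  # nothing was found for 'Item B'

# Pairing by position shifts every value after the gap.
paired = [{'title': t, 'img_url': u} for t, u in zip(titles, img_urls)]
print(paired)
# 'Item B' is now paired with '/c.jpg', and 'Item C' is dropped entirely.


def same_length(*columns):
    """Cheap sanity check: all field lists should have the same length."""
    return len({len(c) for c in columns}) <= 1


print(same_length(titles, img_urls))
```

A length check like this catches only missing values, not values that are present but belong to a different item, so a wrapper selector is still the safer choice when one exists.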

### screenshot(path)

If `js` is enabled, requests go through a virtual browser using dryscrape.

You can take a screenshot of the website you visited.

Pass the path where the screenshot should be saved, if needed.

```python
invader = Invader('https://google.com', js=True)

# Method name corrected from `sceenshot` to match the heading above.
invader.screenshot('/var/www/screenshots/')
```
