https://github.com/hakanu/selenium_scraper

A Python scraper built on Selenium that parses a site after it has fully loaded (AJAX calls, asynchronous Flash loads, etc.).

Host: GitHub
URL: https://github.com/hakanu/selenium_scraper
Owner: hakanu
License: apache-2.0
Created: 2014-09-13T03:50:52.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2014-09-13T04:35:55.000Z (almost 12 years ago)
Last Synced: 2025-03-05T14:37:40.772Z (over 1 year ago)
Language: Python
Size: 148 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

Selenium Scraper
================

A Python scraper built on Selenium that parses a site after it has fully loaded (AJAX calls, asynchronous Flash loads, etc.).

# Fire up an instance

Not so easy...

## Prereqs

* Ubuntu machine (Preferably latest)

* Not an ARM architecture. I could not get this running on my Raspberry Pi. If somebody has managed it, send me an email.

* sudo easy_install selenium

* sudo easy_install pyvirtualdisplay

* sudo apt-get install xvfb

In my case, Firefox and PhantomJS could not display the Flash videos. Chrome was the only browser that worked.
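To make the role of each prerequisite concrete, here is a minimal sketch of how Chrome, Selenium, and a virtual display might be combined to read a page after its scripts have run. It is not the code from this repository. It assumes Python 3 and the Selenium 4 API (`Service`), and the names `fetch_rendered_html`, `find_matches`, `chromedriver_path`, and `wait_seconds` are hypothetical.

```python
import re
import time


def fetch_rendered_html(url, chromedriver_path="./chromedriver", wait_seconds=5):
    """Load `url` in Chrome inside a virtual display and return the rendered DOM."""
    # Third-party imports are done lazily so that find_matches() below
    # can be used on a machine without a browser stack installed.
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    display = Display(visible=0, size=(1280, 800))  # backed by Xvfb
    display.start()
    try:
        driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path))
        try:
            driver.get(url)
            # Crude fixed wait (an assumption) to let asynchronous content settle.
            time.sleep(wait_seconds)
            return driver.page_source
        finally:
            driver.quit()
    finally:
        display.stop()


def find_matches(html, pattern):
    """Return every non-overlapping match of the regular expression `pattern`."""
    return [match.group(0) for match in re.finditer(pattern, html)]
```

A call such as `find_matches(fetch_rendered_html("http://hakanu.net"), "hakan")` would then search the rendered page rather than the raw HTTP response. The fixed sleep is the simplest possible wait; an explicit wait on a specific element would be more robust when the target page is known.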

### Install chrome

* http://www.howopensource.com/2011/10/install-google-chrome-in-ubuntu-11-10-11-04-10-10-10-04/

* wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -

* sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'

* sudo apt-get update

* sudo apt-get install google-chrome-stable

* Make sure Chrome is installed at /usr/bin/google-chrome:

`ls /usr/bin | grep chrome`

* Get ChromeDriver from [here](http://chromedriver.storage.googleapis.com/index.html) so that Selenium can drive Chrome.

`wget http://chromedriver.storage.googleapis.com/2.10/chromedriver_linux64.zip`

`unzip chromedriver_linux64.zip`
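Before starting the server, it may be worth confirming that the pieces installed so far are where the scraper would look for them. The sketch below is not part of the repository. It assumes Python 3, that the `xvfb` package put an `Xvfb` binary on the PATH, and that `chromedriver` was unzipped into the current directory (the `chromedriver_path` default and the function name are hypothetical).

```python
import os
import shutil


def check_prerequisites(chromedriver_path="./chromedriver"):
    """Return a dict mapping each prerequisite name to True if it was found."""
    return {
        "google-chrome": shutil.which("google-chrome") is not None,
        "Xvfb": shutil.which("Xvfb") is not None,
        # The driver must exist and be executable.
        "chromedriver": os.path.isfile(chromedriver_path)
        and os.access(chromedriver_path, os.X_OK),
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print("{:15s} {}".format(name, "ok" if found else "MISSING"))
```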

Then start the server:

`python selenium_scraper_server.py`

Go to [localhost:8080/url=%url%&p=%pattern%](http://localhost:8080/url=%url%&p=%pattern%)

E.g. [localhost:8080/url=hakanu.net&p=hakan](http://localhost:8080/url=hakanu.net&p=hakan)
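Request URLs in this format can also be assembled programmatically. The following sketch assumes Python 3 and percent-encodes both values with the standard library. The function name `build_request_url` and its `host` and `port` parameters are hypothetical, and the path layout simply copies the example above.

```python
from urllib.parse import quote


def build_request_url(target_url, pattern, host="localhost", port=8080):
    """Build a scraper request URL with the target URL and pattern percent-encoded."""
    # safe="" makes quote() encode reserved characters such as "/" and ":" too.
    encoded_url = quote(target_url, safe="")
    encoded_pattern = quote(pattern, safe="")
    return "http://{}:{}/url={}&p={}".format(host, port, encoded_url, encoded_pattern)


print(build_request_url("hakanu.net", "hakan"))
# http://localhost:8080/url=hakanu.net&p=hakan
```

A target that contains a scheme, such as `http://hakanu.net/`, is encoded to `http%3A%2F%2Fhakanu.net%2F`, which keeps its slashes and colon from being read as part of the request path.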

## Restrictions:

* The pattern and URL must be percent-encoded.

http://www.url-encode-decode.com/

* The pattern should not use `+`; use `*` instead, because URL encoding's `+` (which stands for a space) gets confused with the regexp `+`.
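The `+` restriction can be illustrated with the Python 3 standard library. This is a sketch of the general problem, not of the server's actual decoding logic, which is an assumption here: if the server decodes form-style, a literal `+` arrives as a space. Rewriting `a+` as `aa*` gives an equivalent pattern that avoids the character entirely.

```python
import re
from urllib.parse import quote, unquote_plus

original = "ha+n"

# Form-style decoding turns "+" into a space, which corrupts the pattern.
received = unquote_plus(original)
print(repr(received))  # 'ha n'

# "a+" (one or more) is equivalent to "aa*" (one, then zero or more).
plus_free = "haa*n"
samples = ["hn", "han", "haaan", "ha n"]
same_matches = all(
    bool(re.fullmatch(original, text)) == bool(re.fullmatch(plus_free, text))
    for text in samples
)
print(same_matches)  # True

# The rewritten pattern percent-encodes without any "+" in the result.
print(quote(plus_free, safe=""))  # haa%2An
```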
