Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/slaveofcode/pycrawler

A Python crawler tool to grab page(s) information from their html data
https://github.com/slaveofcode/pycrawler

Last synced: about 8 hours ago
JSON representation

A Python crawler tool to grab page(s) information from their html data

Awesome Lists containing this project

README

        

[![Build Status](https://travis-ci.org/slaveofcode/pycrawler.svg?branch=master)](https://travis-ci.org/slaveofcode/pycrawler) [![GitHub license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/slaveofcode/pycrawler/blob/master/LICENSE)

# Pycrawler
A Python crawler tool to grab page(s) information from their html data or web url.
This library using python 3 and some dependencies with java runtime.

# Installation

You can install this lib directly from github repository by execute

# Install from last stable release

pip install git+ssh://[email protected]/slaveofcode/pycrawler@master

# install by pip

pip install pycrawler3

# How To Use?

First of all you must installed java runtime machine to get the boilerpipe works, because it's depends on java machine.

from pycrawler.crawler import Crawler

# returns page object

page = Crawler.grab('http://www.pasarpanda.com')

# Here you can execute or get the information of page object

print(page.title) # print the title of page

print(page.images()) # get the image urls

print(page.content) # Print the extracted content

# Available Methods and Attributes

# Grab from URL
page = Crawler.grab('http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/')

# Grab from file
page = Crawler.from_file('/home/aditya/mydir/myhtml.html')

# Grab from string
page = Crawler.from_text('My title yoThe content of my html')

# Page Object Methods and Properties

page.title # get the title of the page object
>>> 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak | SCOOP Berita'

page.encoding # get encoding of page
>>> 'UTF-8'

page.canonical_url # get the canonical url
>>> 'http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/'

page.favicon # get favicon icon as list
>>> ['http://www.getscoop.com/berita/wp-content/themes/metro-pro/images/favicon.ico']

page.language # get language
>>> 'en-US'

page.metas # get meta tags as list dictionary
>>> [{'charset': 'UTF-8'}, {'name': 'description', 'content': 'SCOOP ingin meningkatkan aktivitas edukatif dan pengaruh positif bagi anak di dunia digital. Baca selengkapnya SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak.'}, {'name': 'robots', 'content': 'noodp,noydir'}, ...]

page.content # get extracted content
>>> 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak\nNovember 18, 2015\nby Ita Istiqomah Leave a Comment\nSetelah sukses dengan fitur SCOOP Premium, kami kembali melakukan terobosan dan inovasi, salah satunya dengan merilis layanan terbaru "Parental Control” pada bulan November ini....'

page.links # get links
>>> ['http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/#respond', 'http://www.getscoop.com/berita/category/entrepreneurship/', 'http://www.getscoop.com/berita/category/technology/', ...]

page.original_links # get original links that same as page url
>>> ['http://www.getscoop.com/berita/2015/10/', 'http://www.getscoop.com/berita/tag/scoop/', 'http://www.getscoop.com/berita/barbie-girl-happy-sumpah-pemuda/#comment-101088', 'http://www.getscoop.com/berita/category/feature/', 'http://www.getscoop.com/berita/scoop-webstore/', ...]

page.js_links # get javascript links
>>> ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', 'http://www.getscoop.com/berita/wp-content/themes/metro-pro/js/backstretch-set.js?ver=1.0.0', ...]

page.css_links # get css links
>>> ['http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/css/wpfront-scroll-top.css?ver=1.4.2', 'http://www.getscoop.com/berita/wp-content/plugins/ultimate-social-deux/public/assets/css/style.css?ver=3.1.6', '//fonts.googleapis.com/css?family=Oswald%3A400&ver=2.0.0', ...]

page.resource_links # get combined js & css links
>>> ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', ...]

page.images() # get images
>>> ['http://www.getscoop.com/berita/wp-content/uploads/2015/11/parental-control-scoop.jpg', 'http://kacang.apps-foundry.com/www/delivery/avw.php?zoneid=38&cb=INSERT_RANDOM_NUMBER_HERE&n=afd1f9fe', 'http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/images/icons/1.png']

page.html('article .entry-content') # get html by css selector
>>> '



> nosetests

>> ----------------------------------------------------------------------

>> Ran 5 tests in 4.726s

>> OK