https://github.com/slaveofcode/pycrawler
A Python crawler tool to grab page(s) information from their html data
- Host: GitHub
- URL: https://github.com/slaveofcode/pycrawler
- Owner: slaveofcode
- License: mit
- Created: 2015-11-19T09:26:10.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2019-10-23T00:41:32.000Z (about 5 years ago)
- Last Synced: 2024-10-07T09:35:01.442Z (about 1 month ago)
- Language: HTML
- Size: 63.5 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
[![Build Status](https://travis-ci.org/slaveofcode/pycrawler.svg?branch=master)](https://travis-ci.org/slaveofcode/pycrawler) [![GitHub license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/slaveofcode/pycrawler/blob/master/LICENSE)
# Pycrawler
A Python crawler tool that extracts page information from raw HTML or from a web URL.
This library runs on Python 3, and some of its dependencies require a Java runtime.
# Installation
You can install this library directly from the GitHub repository, or by package name with pip:
# Install the latest stable release from GitHub
pip install git+ssh://git@github.com/slaveofcode/pycrawler@master
# Install with pip by package name
pip install pycrawler3
# How To Use?
First, make sure a Java runtime is installed. Content extraction relies on boilerpipe, which depends on Java.
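Because the Java requirement only surfaces at run time, it can help to check for it up front. The following is a minimal sketch using only the standard library; `java_available` is a hypothetical helper, not part of pycrawler, and it assumes the runtime is exposed as a `java` executable on `PATH`.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if a Java launcher is on PATH and runs successfully.

    Hypothetical helper for illustration; not part of pycrawler.
    """
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        # `java -version` writes its banner to stderr and exits with 0.
        result = subprocess.run(
            [path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java runtime found")
    else:
        print("No Java runtime on PATH; install one before using the crawler")
```

Running this check before importing the crawler gives a clearer message than a failure inside the extraction step.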
from pycrawler.crawler import Crawler
# returns page object
page = Crawler.grab('http://www.pasarpanda.com')
# Call methods or read attributes on the page object
print(page.title) # print the page title
print(page.images()) # print the image urls
print(page.content) # print the extracted content
# Available Methods and Attributes
# Grab from URL
page = Crawler.grab('http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/')
# Grab from file
page = Crawler.from_file('/home/aditya/mydir/myhtml.html')
# Grab from string
page = Crawler.from_text('My title yoThe content of my html')
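When the input may be a URL, a file path, or raw HTML, a small dispatcher can pick between the three entry points above. This is only a sketch: `pick_entry_point` is a hypothetical helper, and its routing rules (an http/https prefix means URL, an existing path means file, anything else is text) are assumptions rather than behaviour of the library.

```python
import os


def pick_entry_point(source):
    """Return the name of the Crawler method that suits `source`.

    Hypothetical helper; the routing rules below are assumptions.
    """
    # Treat anything with an http(s) prefix as a web URL.
    if source.lower().startswith(("http://", "https://")):
        return "grab"
    # An existing regular file is read from disk.
    if os.path.isfile(source):
        return "from_file"
    # Everything else is handed over as raw text.
    return "from_text"


if __name__ == "__main__":
    for candidate in ("http://www.pasarpanda.com",
                      "My title yoThe content of my html"):
        print(candidate, "->", pick_entry_point(candidate))
```

With pycrawler installed, the result could be used as `getattr(Crawler, pick_entry_point(source))(source)`, since each of the three methods shown above takes a single argument.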
# Page Object Methods and Properties
page.title # get the title of the page object
>>> 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak | SCOOP Berita'
page.encoding # get encoding of page
>>> 'UTF-8'
page.canonical_url # get the canonical url
>>> 'http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/'
page.favicon # get favicon urls as a list
>>> ['http://www.getscoop.com/berita/wp-content/themes/metro-pro/images/favicon.ico']
page.language # get language
>>> 'en-US'
page.metas # get meta tags as a list of dictionaries
>>> [{'charset': 'UTF-8'}, {'name': 'description', 'content': 'SCOOP ingin meningkatkan aktivitas edukatif dan pengaruh positif bagi anak di dunia digital. Baca selengkapnya SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak.'}, {'name': 'robots', 'content': 'noodp,noydir'}, ...]
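Since `page.metas` is a plain list of dictionaries, looking up one tag takes a short loop. The sketch below works on sample data shaped like the output above, so it runs without the library; `meta_content` is a hypothetical helper, not a pycrawler function.

```python
def meta_content(metas, name, default=None):
    """Return the `content` of the first meta tag whose `name` matches.

    Hypothetical helper for illustration; not part of pycrawler.
    """
    wanted = name.lower()
    for meta in metas:
        # Entries such as {'charset': 'UTF-8'} have no 'name' key; skip them.
        if meta.get("name", "").lower() == wanted:
            return meta.get("content", default)
    return default


# Sample data copied from the output above (description shortened to its first sentence).
metas = [
    {"charset": "UTF-8"},
    {"name": "description",
     "content": "SCOOP ingin meningkatkan aktivitas edukatif dan pengaruh positif bagi anak di dunia digital."},
    {"name": "robots", "content": "noodp,noydir"},
]

print(meta_content(metas, "robots"))    # noodp,noydir
print(meta_content(metas, "keywords"))  # None
```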
page.content # get extracted content
>>> 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak\nNovember 18, 2015\nby Ita Istiqomah Leave a Comment\nSetelah sukses dengan fitur SCOOP Premium, kami kembali melakukan terobosan dan inovasi, salah satunya dengan merilis layanan terbaru "Parental Control” pada bulan November ini....'
page.links # get links
>>> ['http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/#respond', 'http://www.getscoop.com/berita/category/entrepreneurship/', 'http://www.getscoop.com/berita/category/technology/', ...]
page.original_links # get links that belong to the same site as the page url
>>> ['http://www.getscoop.com/berita/2015/10/', 'http://www.getscoop.com/berita/tag/scoop/', 'http://www.getscoop.com/berita/barbie-girl-happy-sumpah-pemuda/#comment-101088', 'http://www.getscoop.com/berita/category/feature/', 'http://www.getscoop.com/berita/scoop-webstore/', ...]
page.js_links # get javascript links
>>> ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', 'http://www.getscoop.com/berita/wp-content/themes/metro-pro/js/backstretch-set.js?ver=1.0.0', ...]
page.css_links # get css links
>>> ['http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/css/wpfront-scroll-top.css?ver=1.4.2', 'http://www.getscoop.com/berita/wp-content/plugins/ultimate-social-deux/public/assets/css/style.css?ver=3.1.6', '//fonts.googleapis.com/css?family=Oswald%3A400&ver=2.0.0', ...]
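Note that the list above can contain protocol-relative entries such as `//fonts.googleapis.com/...`, which have no scheme of their own. If fully qualified URLs are needed, one option is to resolve each link against the page URL with the standard library's `urljoin`. `absolutize` is a hypothetical helper, and the sample values are copied from the outputs above.

```python
from urllib.parse import urljoin


def absolutize(page_url, links):
    """Resolve each link against the page URL so every entry carries a scheme.

    Hypothetical helper for illustration; not part of pycrawler.
    """
    return [urljoin(page_url, link) for link in links]


page_url = "http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/"
css_links = [
    "http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/css/wpfront-scroll-top.css?ver=1.4.2",
    "//fonts.googleapis.com/css?family=Oswald%3A400&ver=2.0.0",
]

for link in absolutize(page_url, css_links):
    print(link)
```

Absolute links pass through unchanged, while the protocol-relative entry inherits the `http` scheme of the page URL.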
page.resource_links # get combined js & css links
>>> ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', ...]
page.images() # get images
>>> ['http://www.getscoop.com/berita/wp-content/uploads/2015/11/parental-control-scoop.jpg', 'http://kacang.apps-foundry.com/www/delivery/avw.php?zoneid=38&cb=INSERT_RANDOM_NUMBER_HERE&n=afd1f9fe', 'http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/images/icons/1.png']
page.html('article .entry-content') # get html by css selector
>>> '