{"id":21031497,"url":"https://github.com/slaveofcode/pycrawler","last_synced_at":"2026-02-13T14:20:25.879Z","repository":{"id":62579620,"uuid":"46481722","full_name":"slaveofcode/pycrawler","owner":"slaveofcode","description":"A Python crawler tool to grab page(s) information from their html data","archived":false,"fork":false,"pushed_at":"2019-10-23T00:41:32.000Z","size":65,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-22T22:38:45.992Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/slaveofcode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-19T09:26:10.000Z","updated_at":"2015-11-19T18:20:36.000Z","dependencies_parsed_at":"2022-11-03T19:31:50.899Z","dependency_job_id":null,"html_url":"https://github.com/slaveofcode/pycrawler","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slaveofcode%2Fpycrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slaveofcode%2Fpycrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slaveofcode%2Fpycrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/slaveofcode%2Fpycrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/slaveofcode","download_url":"https://codeload.github.com/slaveofcode/pycrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://gi
thub.com","kind":"github","repositories_count":243472040,"owners_count":20296249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T12:29:01.437Z","updated_at":"2026-02-13T14:20:20.853Z","avatar_url":"https://github.com/slaveofcode.png","language":"HTML","readme":"[![Build Status](https://travis-ci.org/slaveofcode/pycrawler.svg?branch=master)](https://travis-ci.org/slaveofcode/pycrawler) [![GitHub license](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/slaveofcode/pycrawler/blob/master/LICENSE)\n\n# Pycrawler\nA Python crawler tool that grabs page information from HTML data or a web URL. \nThis library uses Python 3, and some of its dependencies require a Java runtime.  
\n\n# Installation\n\nYou can install this library directly from the GitHub repository by running \n\n    # Install from the last stable release\n    \n    pip install git+ssh://git@github.com/slaveofcode/pycrawler@master\n    \n    # Install with pip\n    \n    pip install pycrawler3\n\n# How To Use?\n\nFirst, install a Java runtime: boilerpipe depends on it and will not work without one.\n\n    from pycrawler.crawler import Crawler\n    \n    # returns a page object\n    \n    page = Crawler.grab('http://www.pasarpanda.com')\n    \n    # Here you can read information from the page object\n    \n    print(page.title)  # print the page title\n     \n    print(page.images())  # get the image urls\n    \n    print(page.content)  # print the extracted content\n    \n# Available Methods and Attributes\n\n    # Grab from URL\n    page = Crawler.grab('http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/')\n    \n    # Grab from file\n    page = Crawler.from_file('/home/aditya/mydir/myhtml.html')\n    \n    # Grab from string\n    page = Crawler.from_text('\u003chtml\u003e\u003chead\u003e\u003ctitle\u003eMy title yo\u003c/title\u003e\u003c/head\u003e\u003cbody\u003eThe content of my html\u003c/body\u003e\u003c/html\u003e')\n    \n    # Page Object Methods and Properties\n    \n    page.title  # get the title of the page object\n    \u003e\u003e\u003e 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak | SCOOP Berita'\n    \n    page.encoding  # get the encoding of the page\n    \u003e\u003e\u003e 'UTF-8'\n    \n    page.canonical_url  # get the canonical url\n    \u003e\u003e\u003e 'http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/'\n\n    page.favicon  # get the favicon urls as a list\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/wp-content/themes/metro-pro/images/favicon.ico']\n    \n    page.language  # get language\n    \u003e\u003e\u003e 'en-US'\n    \n    
page.metas  # get meta tags as a list of dictionaries\n    \u003e\u003e\u003e [{'charset': 'UTF-8'}, {'name': 'description', 'content': 'SCOOP ingin meningkatkan aktivitas edukatif dan pengaruh positif bagi anak di dunia digital. Baca selengkapnya SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak.'}, {'name': 'robots', 'content': 'noodp,noydir'}, ...]\n    \n    page.content  # get extracted content\n    \u003e\u003e\u003e 'SCOOP Meluncurkan Fitur Baru Parental Control Untuk Mendukung Konten Edukasi dan Anak\\nNovember 18, 2015\\nby Ita Istiqomah Leave a Comment\\nSetelah sukses dengan fitur SCOOP Premium, kami kembali melakukan terobosan dan inovasi, salah satunya dengan merilis layanan terbaru \"Parental Control” pada bulan November ini....'\n    \n    page.links  # get links\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/scoop-meluncurkan-fitur-baru-parental-control/#respond', 'http://www.getscoop.com/berita/category/entrepreneurship/', 'http://www.getscoop.com/berita/category/technology/', ...]\n    \n    page.original_links  # get links on the same site as the page url\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/2015/10/', 'http://www.getscoop.com/berita/tag/scoop/', 'http://www.getscoop.com/berita/barbie-girl-happy-sumpah-pemuda/#comment-101088', 'http://www.getscoop.com/berita/category/feature/', 'http://www.getscoop.com/berita/scoop-webstore/', ...]\n\n    page.js_links  # get javascript links\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', 'http://www.getscoop.com/berita/wp-content/themes/metro-pro/js/backstretch-set.js?ver=1.0.0', ...]\n\n    page.css_links  # get css links\n    \u003e\u003e\u003e 
['http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/css/wpfront-scroll-top.css?ver=1.4.2', 'http://www.getscoop.com/berita/wp-content/plugins/ultimate-social-deux/public/assets/css/style.css?ver=3.1.6', '//fonts.googleapis.com/css?family=Oswald%3A400\u0026ver=2.0.0', ...]\n    \n    page.resource_links  # get combined js \u0026 css links\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/af-custom/js/jquery-1.7.2.min.js', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.js?ver=1.11.3', 'http://www.getscoop.com/berita/wp-includes/js/jquery/jquery.color.min.js?ver=2.1.1', ...]\n    \n    page.images()  # get images\n    \u003e\u003e\u003e ['http://www.getscoop.com/berita/wp-content/uploads/2015/11/parental-control-scoop.jpg', 'http://kacang.apps-foundry.com/www/delivery/avw.php?zoneid=38\u0026cb=INSERT_RANDOM_NUMBER_HERE\u0026n=afd1f9fe', 'http://www.getscoop.com/berita/wp-content/plugins/wpfront-scroll-top/images/icons/1.png']\n    \n    page.html('article .entry-content')  # get html by css selector\n    \u003e\u003e\u003e  '\u003cdiv class=\"entry-content\" itemprop=\"text\"\u003e\u003cdiv class=\"us_posts_top\" style=\"margin-top:0px;margin-bottom:0px;\"\u003e\u003cdiv class=\"us_wrapper tal\"\u003e\u003cdiv class=\"us_button us_share_text\" data-text=\"Share this:\"\u003e\u003cspan class=\"us_share_text_span\"\u003e\u003c/span\u003e\u003c/div\u003e\u003cdiv class=\"us_facebook us_button\" data-text=\"SCOOP Meluncurkan Fitur Baru Parental Control ...'\n    \n    page.text('article .entry-content')  # get text by css selector\n    \u003e\u003e\u003e '  \\nSetelah sukses dengan fitur SCOOP Premium, kami kembali melakukan terobosan dan inovasi, salah satunya dengan merilis layanan terbaru \"Parental Control” pada bulan November ini.\\nParental Control didukung dengan berbagai konten anak dan edukasi, dengan harapan SCOOP dapat meningkatkan aktivitas edukatif dan memberikan pengaruh positif bagi anak di dunia digital...'\n    
\n\n## Run The Test\n\nRun the tests with nosetests. If nose is not installed yet, \nrun `pip install nose` to install it.\n\n    \u003e\u003e nosetests\n    \n    \u003e\u003e ----------------------------------------------------------------------\n    \n    \u003e\u003e Ran 5 tests in 4.726s\n    \n    \u003e\u003e OK\n\n    \n    \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslaveofcode%2Fpycrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fslaveofcode%2Fpycrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fslaveofcode%2Fpycrawler/lists"}