https://github.com/theadnan/webscraper



# webscraper

Extracts data from HTML pages using a simple JSON definition of CSS selectors.
Under the hood it uses Jsoup.

Sample

````

// Get the HTML using an HTTP client, or load it from a String
String html = getHtml(...);

// Load the definition from a resource path, or construct the Map yourself
Map def = getDefinition(path);

IWebScraperExtractor webExtractor = new WebScraperJsoupExtractorImpl();

// Returns the extracted values as a Map
Map m = webExtractor.run(html);

````

Sample extractor definition for the Wikipedia landing page (https://www.wikipedia.org/):

````
{
  "selectors": [
    {
      "key": "langs",
      "type": "container",
      "css": ".central-featured .central-featured-lang",
      "items": [
        {
          "key": "title",
          "type": "item",
          "css": ".link-box strong",
          "attr": "text"
        },
        {
          "key": "url",
          "type": "item",
          "css": "a",
          "attr": "href"
        }
      ]
    },
    {
      "key": "otherProjects",
      "type": "container",
      "css": ".other-projects .other-project",
      "items": [
        {
          "key": "title",
          "type": "item",
          "css": ".other-project-title",
          "attr": "text"
        },
        {
          "key": "url",
          "type": "item",
          "css": "a",
          "attr": "href"
        }
      ]
    }
  ]
}

````
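The usage sample mentions constructing the definition as a `Map` instead of loading it from a resource. Below is a minimal sketch of what that could look like for the Wikipedia definition, using only `java.util` collections. The class `DefinitionBuilder` and its helper methods are hypothetical names introduced for illustration, and it is an assumption that the extractor accepts a plain nested `Map`/`List` structure with the same keys as the JSON.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the library): builds the same structure
// as the JSON definition using immutable Java collections.
class DefinitionBuilder {

    // A leaf selector. In the sample, "attr" is "text" or an attribute name such as "href".
    static Map<String, Object> item(String key, String css, String attr) {
        return Map.of("key", key, "type", "item", "css", css, "attr", attr);
    }

    // A container selector. Judging by the sample output, its items appear to be
    // extracted once per element matched by the container's CSS selector.
    static Map<String, Object> container(String key, String css, List<Map<String, Object>> items) {
        return Map.of("key", key, "type", "container", "css", css, "items", items);
    }

    // Mirrors the Wikipedia landing page definition key for key.
    static Map<String, Object> wikipediaDefinition() {
        return Map.of("selectors", List.of(
            container("langs", ".central-featured .central-featured-lang", List.of(
                item("title", ".link-box strong", "text"),
                item("url", "a", "href"))),
            container("otherProjects", ".other-projects .other-project", List.of(
                item("title", ".other-project-title", "text"),
                item("url", "a", "href")))));
    }

    public static void main(String[] args) {
        // Prints the definition only; no HTML is fetched or scraped here.
        System.out.println(wikipediaDefinition());
    }
}
```

Since `Map.of` and `List.of` return immutable collections, copy them into a `HashMap` or `ArrayList` first if the definition needs to be modified after construction.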
Sample output (JSON) of running the Wikipedia definition:

````
{
  "langs": [
    {
      "title": "English",
      "url": "//en.wikipedia.org/"
    },
    {
      "title": "Español",
      "url": "//es.wikipedia.org/"
    },
    {
      "title": "日本語",
      "url": "//ja.wikipedia.org/"
    },
    {
      "title": "Deutsch",
      "url": "//de.wikipedia.org/"
    },
    {
      "title": "Русский",
      "url": "//ru.wikipedia.org/"
    },
    {
      "title": "Français",
      "url": "//fr.wikipedia.org/"
    },
    {
      "title": "Italiano",
      "url": "//it.wikipedia.org/"
    },
    {
      "title": "中文",
      "url": "//zh.wikipedia.org/"
    },
    {
      "title": "Português",
      "url": "//pt.wikipedia.org/"
    },
    {
      "title": "Polski",
      "url": "//pl.wikipedia.org/"
    }
  ],
  "otherProjects": [
    {
      "title": "Commons",
      "url": "//commons.wikimedia.org/"
    },
    {
      "title": "Wikivoyage",
      "url": "//www.wikivoyage.org/"
    },
    {
      "title": "Wiktionary",
      "url": "//www.wiktionary.org/"
    },
    {
      "title": "Wikibooks",
      "url": "//www.wikibooks.org/"
    },
    {
      "title": "Wikinews",
      "url": "//www.wikinews.org/"
    },
    {
      "title": "Wikidata",
      "url": "//www.wikidata.org/"
    },
    {
      "title": "Wikiversity",
      "url": "//www.wikiversity.org/"
    },
    {
      "title": "Wikiquote",
      "url": "//www.wikiquote.org/"
    },
    {
      "title": "MediaWiki",
      "url": "//www.mediawiki.org/"
    },
    {
      "title": "Wikisource",
      "url": "//www.wikisource.org/"
    },
    {
      "title": "Wikispecies",
      "url": "//species.wikimedia.org/"
    },
    {
      "title": "Meta-Wiki",
      "url": "//meta.wikimedia.org/"
    }
  ]
}

````
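The `url` values in the sample output are protocol-relative (`//en.wikipedia.org/`), exactly as they appear in the page's `href` attributes. If absolute URLs are needed, one option is to resolve them against the page URL afterwards. The sketch below uses `java.net.URI` from the standard library; the class `UrlResolver` is a hypothetical helper, and treating each extracted row as a `Map<String, String>` is an assumption based on the sample output, not something the library documents.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper (not part of the library).
class UrlResolver {

    // Resolves a relative or protocol-relative href against the URL of the scraped page.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Returns a copy of the rows in which every "url" value is absolute.
    // Rows without a "url" key are copied unchanged.
    static List<Map<String, String>> absolutize(String pageUrl, List<Map<String, String>> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> copy = new LinkedHashMap<>(row);
            copy.computeIfPresent("url", (key, value) -> resolve(pageUrl, value));
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> langs = List.of(
            Map.of("title", "English", "url", "//en.wikipedia.org/"));
        System.out.println(absolutize("https://www.wikipedia.org/", langs));
    }
}
```

Resolving against the page URL rather than prepending `https:` by hand also handles path-relative links such as `/wiki/Main_Page`.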

See the tests for more samples.