https://github.com/theadnan/webscraper
- Host: GitHub
- URL: https://github.com/theadnan/webscraper
- Owner: TheAdnan
- License: apache-2.0
- Created: 2017-11-22T15:34:18.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-11-16T07:28:40.000Z (over 3 years ago)
- Last Synced: 2025-01-29T09:49:24.759Z (over 1 year ago)
- Topics: devtools-extension, firefox, webextension, webextensions, webextensions-apis, webscraper, webscraping
- Language: Java
- Size: 28.3 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
README
# webscraper
Simple CSS selector JSON definition to extract data from HTML sites.
Under the hood, it uses Jsoup.
Sample
````
// Get the HTML using an HTTP client, or load it from a String
String html = getHtml(...);
// Load the definition from a resource path, or construct the Map directly
Map def = getDefinition(path);
IWebScraperExtractor webExtractor = new WebScraperJsoupExtractorImpl();
// Returns the extracted values as a Map
Map m = webExtractor.run(html);
````
Sample extractor definition for the Wikipedia landing page (https://www.wikipedia.org/):
````
{
  "selectors": [
    {
      "key": "langs",
      "type": "container",
      "css": ".central-featured .central-featured-lang",
      "items": [
        {
          "key": "title",
          "type": "item",
          "css": ".link-box strong",
          "attr": "text"
        },
        {
          "key": "url",
          "type": "item",
          "css": "a",
          "attr": "href"
        }
      ]
    },
    {
      "key": "otherProjects",
      "type": "container",
      "css": ".other-projects .other-project",
      "items": [
        {
          "key": "title",
          "type": "item",
          "css": ".other-project-title",
          "attr": "text"
        },
        {
          "key": "url",
          "type": "item",
          "css": "a",
          "attr": "href"
        }
      ]
    }
  ]
}
````
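The usage sample notes that the definition can also be constructed as a `Map` instead of being loaded from a resource. The following is a minimal sketch of building the `langs` container from the definition above with plain `java.util` collections. The class and helper names (`DefinitionSketch`, `item`, `container`) are hypothetical, and it is an assumption that the extractor accepts nested `Map` and `List` values shaped like the JSON.

```java
import java.util.List;
import java.util.Map;

final class DefinitionSketch {

    // Hypothetical helper: one leaf selector (CSS query plus the attribute to read).
    static Map<String, Object> item(String key, String css, String attr) {
        return Map.of("key", key, "type", "item", "css", css, "attr", attr);
    }

    // Hypothetical helper: a container selector holding nested item selectors.
    static Map<String, Object> container(String key, String css,
                                         List<Map<String, Object>> items) {
        return Map.of("key", key, "type", "container", "css", css, "items", items);
    }

    // Mirrors the "langs" entry of the JSON definition (assumed Map/List shape).
    static Map<String, Object> langsDefinition() {
        return Map.of("selectors", List.of(
                container("langs", ".central-featured .central-featured-lang", List.of(
                        item("title", ".link-box strong", "text"),
                        item("url", "a", "href")))));
    }

    public static void main(String[] args) {
        // Map.of does not guarantee iteration order, so the printed key order may vary.
        System.out.println(langsDefinition());
    }
}
```

`Map.of` and `List.of` produce immutable collections, which suits a definition that is only read after it is built.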
Sample output (JSON):
````
{
  "langs": [
    {
      "title": "English",
      "url": "//en.wikipedia.org/"
    },
    {
      "title": "Español",
      "url": "//es.wikipedia.org/"
    },
    {
      "title": "日本語",
      "url": "//ja.wikipedia.org/"
    },
    {
      "title": "Deutsch",
      "url": "//de.wikipedia.org/"
    },
    {
      "title": "Русский",
      "url": "//ru.wikipedia.org/"
    },
    {
      "title": "Français",
      "url": "//fr.wikipedia.org/"
    },
    {
      "title": "Italiano",
      "url": "//it.wikipedia.org/"
    },
    {
      "title": "中文",
      "url": "//zh.wikipedia.org/"
    },
    {
      "title": "Português",
      "url": "//pt.wikipedia.org/"
    },
    {
      "title": "Polski",
      "url": "//pl.wikipedia.org/"
    }
  ],
  "otherProjects": [
    {
      "title": "Commons",
      "url": "//commons.wikimedia.org/"
    },
    {
      "title": "Wikivoyage",
      "url": "//www.wikivoyage.org/"
    },
    {
      "title": "Wiktionary",
      "url": "//www.wiktionary.org/"
    },
    {
      "title": "Wikibooks",
      "url": "//www.wikibooks.org/"
    },
    {
      "title": "Wikinews",
      "url": "//www.wikinews.org/"
    },
    {
      "title": "Wikidata",
      "url": "//www.wikidata.org/"
    },
    {
      "title": "Wikiversity",
      "url": "//www.wikiversity.org/"
    },
    {
      "title": "Wikiquote",
      "url": "//www.wikiquote.org/"
    },
    {
      "title": "MediaWiki",
      "url": "//www.mediawiki.org/"
    },
    {
      "title": "Wikisource",
      "url": "//www.wikisource.org/"
    },
    {
      "title": "Wikispecies",
      "url": "//species.wikimedia.org/"
    },
    {
      "title": "Meta-Wiki",
      "url": "//meta.wikimedia.org/"
    }
  ]
}
````
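The `url` values in the sample output are protocol-relative (`//en.wikipedia.org/`), exactly as they appear in the page's `href` attributes. If absolute URLs are needed, one option is to resolve each value against the URL of the scraped page after extraction. The sketch below uses only `java.net.URI` from the standard library; the class and method names (`UrlResolveSketch`, `resolve`) are hypothetical and not part of this project.

```java
import java.net.URI;

final class UrlResolveSketch {

    // Resolves a scraped href against the URL of the page it was extracted from.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // The protocol-relative href inherits the scheme of the page URL.
        System.out.println(resolve("https://www.wikipedia.org/", "//en.wikipedia.org/"));
        // prints "https://en.wikipedia.org/"
    }
}
```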
See the tests for more samples.