https://github.com/digitalbazaar/cc-structured-data

Analyzes HTML content for RDFa, Microdata and Microformats

This is a library that analyzes the CommonCrawl dataset for structured
data expressed as RDFa, Microdata or Microformats.

To build
--------

You'll need Apache Ant (http://ant.apache.org/manual/install.html)
installed. Once it is, run:

# ant dist

This step will compile the libraries and Hadoop code into an Elastic MapReduce-
friendly JAR at dist/lib/StructuredDataAnalyzer.jar, suitable for use as a
custom JAR-based Elastic MapReduce workflow.
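
As a minimal sketch of the build step above, the following shell helpers
check that Ant is on the PATH, run the build, and confirm that the JAR was
produced. The function names (have_tool, jar_built, build_and_check) are
hypothetical and not part of this repository; only the 'ant dist' command and
the JAR path come from the text above.

```shell
#!/bin/sh
# Hypothetical helper script; not shipped with this repository.

# JAR location as stated in the build instructions above.
JAR_PATH="dist/lib/StructuredDataAnalyzer.jar"

# Return 0 if the named command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Return 0 if the given file exists and is non-empty.
jar_built() {
    [ -s "$1" ]
}

# Build the project and verify the output. Assumes it is run from the
# repository root, where Ant can find the build file.
build_and_check() {
    have_tool ant || { echo "Apache Ant not found on PATH" >&2; return 1; }
    ant dist || return 1
    jar_built "$JAR_PATH" || { echo "expected $JAR_PATH was not produced" >&2; return 1; }
    echo "built $JAR_PATH"
}
```

Source the script and call build_and_check from the repository root. It
returns a non-zero status if Ant is missing, the build fails, or the JAR is
absent afterwards.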

To run locally
--------------

You'll need a running Hadoop installation. If you don't have one, Cloudera
provides a set of OS-specific Hadoop packages that make installation easy.
Check out their site:

https://ccp.cloudera.com/display/SUPPORT/Downloads

Once you've got Hadoop installed, you can use the 'hadoop jar' command to run
the analyzer. Here's the pattern:

hadoop jar /dist/lib/StructuredDataAnalyzer.jar
com.digitalbazaar.analyzer.StructuredDataAnalyzer