https://github.com/digitalbazaar/cc-structured-data
Analyzes HTML content for RDFa, Microdata and Microformats
- Host: GitHub
- URL: https://github.com/digitalbazaar/cc-structured-data
- Owner: digitalbazaar
- Created: 2012-02-04T19:35:12.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2012-02-04T19:52:39.000Z (over 14 years ago)
- Last Synced: 2025-04-19T11:08:38.144Z (about 1 year ago)
- Language: Java
- Homepage: http://www.w3.org/community/data-driven-standards/
- Size: 20 MB
- Stars: 6
- Watchers: 13
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README
README
This is a library that analyzes the CommonCrawl dataset for structured
data expressed as RDFa, Microdata or Microformats.
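To give a rough idea of what "structured data" means here, the sketch below
checks an HTML string for attributes typical of each of the three syntaxes.
It is only an illustration: the MarkupSniffer class, its regular expressions
and the short list of Microformats class names are assumptions made for this
example and are not taken from this repository's code. A real analyzer would
use a proper HTML parser rather than regular expressions.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of this repository.
final class MarkupSniffer {

    /** The three syntaxes named in this README. */
    public enum Syntax { RDFA, MICRODATA, MICROFORMATS }

    // Assumed heuristic: a tag carrying a common RDFa attribute.
    private static final Pattern RDFA = Pattern.compile(
        "<[^>]*\\s(?:property|typeof|vocab|about)\\s*=",
        Pattern.CASE_INSENSITIVE);

    // Assumed heuristic: a tag carrying a Microdata attribute.
    private static final Pattern MICRODATA = Pattern.compile(
        "<[^>]*\\s(?:itemscope|itemprop|itemtype)\\b",
        Pattern.CASE_INSENSITIVE);

    // Assumed heuristic: a class attribute containing a few classic
    // Microformats root names (this list is far from complete).
    private static final Pattern MICROFORMATS = Pattern.compile(
        "<[^>]*\\sclass\\s*=\\s*[\"'][^\"']*\\b(?:vcard|vevent|hentry|hreview)\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns the set of syntaxes whose marker attributes appear in html. */
    public static Set<Syntax> detect(String html) {
        Set<Syntax> found = EnumSet.noneOf(Syntax.class);
        if (RDFA.matcher(html).find()) {
            found.add(Syntax.RDFA);
        }
        if (MICRODATA.matcher(html).find()) {
            found.add(Syntax.MICRODATA);
        }
        if (MICROFORMATS.matcher(html).find()) {
            found.add(Syntax.MICROFORMATS);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div itemscope itemtype=\"http://schema.org/Person\">"
                    + "<span itemprop=\"name\">Alice</span></div>";
        // Prints the syntaxes detected in the sample snippet.
        System.out.println(detect(html));
    }
}
```

A single page can use more than one syntax, which is why detect() returns a
set rather than a single value.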
To build
--------
You'll need Apache Ant (http://ant.apache.org/manual/install.html)
installed. Once it is, run:
# ant dist
This step will compile the libraries and Hadoop code into an Elastic MapReduce-
friendly JAR at dist/lib/StructuredDataAnalyzer.jar, suitable for use as a
custom JAR-based Elastic MapReduce workflow.
To run locally
--------------
You'll need a running Hadoop installation. If you don't have one, Cloudera
provides OS-specific Hadoop packages that make setup easy. Check out their
site:
https://ccp.cloudera.com/display/SUPPORT/Downloads
Once you've got Hadoop installed, you can use the 'hadoop jar' command to run
the analyzer. Here's the pattern:
hadoop jar /dist/lib/StructuredDataAnalyzer.jar
com.digitalbazaar.analyzer.StructuredDataAnalyzer