Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/cmader/qSKOS

Vocabulary quality assessment tools
https://github.com/cmader/qSKOS

Last synced: 4 days ago
JSON representation

Vocabulary quality assessment tools

Awesome Lists containing this project

README

        

= qSKOS

qSKOS is a tool for finding {quality issues}[https://github.com/cmader/qSKOS/wiki/Quality-Issues] in SKOS vocabularies. It can be used as command line tool or API. For release information please see the {changelog}[https://github.com/cmader/qSKOS/blob/devel/CHANGELOG.rdoc].

qSKOS is also the basis for the {PoolParty SKOS Quality Checker}[http://qskos.poolparty.biz/], an online service that lets you check your vocabularies without needing to locally
install qSKOS.

Currently we work on integrating qSKOS into the {PoolParty Thesaurus Server}[http://www.poolparty.biz], a commercially available Web-based development tool for SKOS vocabularies. In it's final stage we combine qSKOS' scientifically-grounded quality checks with the easy-to-use graphical interface of the PoolParty software. Integrated in the development workflow, quality checks can work as an assistive technology, just like syntax-checkers in development environments or spell-checkers in word processing applications.

Parts of qSKOS' functionality have also been integrated into the {rsine}[https://github.com/rsine/rsine] (Resource Subscription and Notification Service) project. It enables you to monitor changes to SKOS vocabularies as soon as they occur and send out immediate notifications (e.g., by email) if some potential quality problems are detected. Rsine is also published as open source and hosted on GitHub ({here}[https://github.com/rsine/rsine]).

== Using the Online Service

To use the service, you need to:
* Navigate to the {PoolParty SKOS Quality Checker}[http://qskos.poolparty.biz/] in your browser
* Login with your existing accounts at the networks of Google, LinkedIn, Xing or Twitter
* If you want to perform a quality check on a SKOS vocabulary, you first need to provide a name for the vocabulary in the text box on the top of the page
* Upload the SKOS vocabulary into the newly created vocabulary section and the checking process starts

If the check takes longer, you can also leave the page and come back later - the checks will continue running and, once finished, the report is available for download on the vocabulary overview page which is presented directly after you log in.

=== Please note
The intention behind requiring the user to give a name for uploaded vocabularies is to "group" different versions of the same vocabulary. For example, if you
develop a thesaurus MyThesaurus_Version1.rdf, you will give it the name MyThesaurus on the online service and upload it. As development of the vocabulary progresses, you might want to upload subsequent versions of the vocabulary, e.g. MyThesaurus_Ver2.rdf. You are advised to upload these improved vocabularies also under the name MyThesaurus. In future versions of the service we may be able to exploit this information by sending you a report on how your vocabulary improved over time.

== Local Installation

=== Use the latest released version

You can find the released versions on the {qSKOS GitHub releases page}[https://github.com/cmader/qSKOS/releases] and download the {latest}[https://github.com/cmader/qSKOS/releases/latest] .jar files.
Alternatively you can

=== Build from Source

Requirements:
* Verify that Java v.1.7 or greater is installed: javac -version
* Make sure Maven v.3.0 or greater is installed: mvn -v
* Make sure you have the current version of the {git version control system}[http://git-scm.com/] installed on your system

=== 1) Get the source code

You have two options:
* Clone the project (git clone https://github.com/cmader/qSKOS.git) to your system using git.
* Download the latest {release}[https://github.com/cmader/qSKOS/releases] of the project and extract it to a properly named directory, e.g., +qSKOS+.

=== 2) Build the tool

* Change into the newly created +qSKOS+ directory and build the application: mvn clean package. Wait for tests and build to complete.
* Two jar files are now located in the +qSKOS/target+ directory:
* qSKOS-cmd.jar: The *executable* jar file that can directly be used for vocabulary evaluation
* qSKOS-[version].jar: The library to integrate qSKOS' functionality into other applications

== Using the qSKOS command line tool

=== General Usage

* Change to the +qSKOS/target+ directory
* Run the tool using java -jar qSKOS-cmd.jar
* A synopsis on the application's parameters is displayed.

=== Examples

The following examples demonstrate typical qSKOS use cases. For demonstration purposes we use the IPSV vocabulary available from the {qSKOS-data}[https://github.com/cmader/qSKOS-data] repository: {Download IPSV vocabulary}[https://github.com/cmader/qSKOS-data/raw/master/IPSV/ipsv_skos.rdf.bz2]. In the examples below we assume that the vocabulary file is placed in the same directory than the +qSKOS-cmd.jar+ file.

==== 1) Retrieving basic vocabulary statistics
Basic statistical properties (e.g., number of concepts, semantic relations or concept schemes) can be retrieved and stored into the file +report.txt+ by issuing the command:

java -jar qSKOS-cmd.jar summarize ipsv_skos.rdf -o report.txt

==== 2) Finding quality issues
To perform an analysis for quality issues, use the following command:

java -jar qSKOS-cmd.jar analyze -dc mil,bl ipsv_skos.rdf -o report.txt

Please keep in mind that a full analysis can take quite some time, depending on the vocabulary size and structure. Especially link checking sometimes can take hours, so it is often useful to analyze only a subset of all issues. In the following examples you'll learn how this can be done. In the example above, checking for missing inlinks and broken links has been disabled to speed up the checking process (using the parameter -dc mil,bl).

==== 3) Output a list of *supported* statistical properties and quality issues
By starting the evaluation using either the +summarize+ or +analyze+ command and omitting the vocabulary filename, you get an overview about the supported statistical properties and quality issues:

java -jar qSKOS-cmd.jar summarize

or

java -jar qSKOS-cmd.jar analyze

Here's an excerpt from the output:

ID: chr
Name: Cyclic Hierarchical Relations
Description: Finds all hierarchy cycle containing components

Every property/issue is identified by an ID string, has a name and a description. For more detailed information on the quality issues see the qSKOS {wiki page}[https://github.com/cmader/qSKOS/wiki/Quality-Issues].

==== 4) Testing for a specific issue or a subset of issues
Specific issues can be tested by passing the -c parameter followed by one or more (comma-separated) issue IDs (see example above). Keep in mind, that the -c parameter has to be placed between between the +analyze+ command and the vocabulary file like this:

java -jar qSKOS-cmd.jar analyze -c ol,oc ipsv_skos.rdf -o report.txt

The command above triggers analysis of the "Overlapping Labels" and "Orphan Concepts" issues. In a very similar way it is possible to explicitly *exclude* issues from testing. For example, the command

java -jar qSKOS-cmd.jar analyze -dc mil ipsv_skos.rdf -o report.txt

checks for all issues except "Missing In-Links".

== FAQ

=== What are "Authoritative Concepts"?
Every concept in a SKOS vocabulary is a resource and should be identified by an URI to be referenced from other vocabularies on the Web. However, when using qSKOS, for some issues it is required to distinguish between concepts that are originally specified (authoritative) in the vocabulary that's about to be analyzed, and concepts (implictly or explicitly) defined in other vocabularies somewhere on the Web.

qSKOS is to some extent able to perform this distinction by examining the host part of the concept's URIs. Depending on the vocabulary's structure in some cases it might be needed to pass an "Authoritative resource identifier" (command line argument -a) to qSKOS. This is a substring of an URI that identifies a concept as authoritative.

=== What version of qSKOS do I use?
Simply pass the command line switch -v like this:

java -jar qSKOS-cmd.jar -v

The version will be printed in the first line of the output, directly before the usage information.

== Using the qSKOS API

The +QSkos+ class serves as facade for calculating all criteria. For each criterion it provides a corresponding public method. Please read the Javadoc for further infos. Here is an example:

// instantiation
qskos = new QSkos(new File("stw.rdf"));
qskos.setAuthResourceIdentifier("zbw.eu");

// the fun part
Issue orphanConcepts = qSkos.getIssues("oc").iterator().next();
long numberOfOrphans = issue.getResult().occurrenceCount());

== Publications
A subset of the quality issues qSKOS supports (including an analysis of several existing vocabularies) have been published in our paper {Finding Quality Issues in SKOS Vocabularies}[http://arxiv.org/abs/1206.1339v1].

@inproceedings{cs3444,
booktitle = {TPDL 2012 Therory and Practice of Digital Libraries},
month = {May},
title = {Finding Quality Issues in SKOS Vocabularies},
author = {Christian Mader and Bernhard Haslhofer and Antoine Isaac},
address = {Germany},
year = {2012},
url = {http://arxiv.org/pdf/1206.1339v1},
abstract = {The Simple Knowledge Organization System (SKOS) is a standard model for controlled vocabularies on the Web. However, SKOS vocabularies often differ in terms of quality, which reduces their applicability across system boundaries. Here we investigate how we can support taxonomists in improving SKOS vocabularies by pointing out quality issues that go beyond the integrity constraints defined in the SKOS specification. We identified potential quality issues and formalized them into computable quality checking functions that can find affected resources in a given SKOS vocabulary. We implemented these functions in the qSKOS quality assessment tool, analyzed 15 existing vocabularies, and found possible quality issues in all of them.}
}

We performed a survey among experts in the field of vocabulary development in order to get feedback about the quality issues checked by qSKOS. The paper {Perception and Relevance of Quality Issues in Web Vocabularies}[http://eprints.cs.univie.ac.at/3720/1/iSemantics2013-cr_version-mader-haslhofer.pdf] reports on a subset of these issues. It furthermore sets them into relation to common usage scenarios for controlled vocabularies on the Web.

@inproceedings{cs3720,
booktitle = {I-SEMANTICS 2013},
title = {Perception and Relevance of Quality Issues in Web Vocabularies},
author = {Christian Mader and Bernhard Haslhofer},
address = {Graz, AUT},
year = {2013},
url = {http://eprints.cs.univie.ac.at/3720/},
abstract = {Web vocabularies provide organization and orientation in information environments and can facilitate resource discovery and retrieval. Several tools have been developed that support quality assessment for the increasing amount of vocabularies expressed in SKOS and published as Linked Data. However, these tools do not yet take into account the users' perception of vocabulary quality. In this paper, we report the findings from an online survey conducted among experts in the field of vocabulary development to study the perception and relevance of vocabulary quality issues in the context of real-world application scenarios. Our results indicate that structural and labeling issues are the most relevant ones. We also derived design recommendations for vocabulary quality checking tools.}
}

Presentation slides are available for the {I-SEMANTICS 2013 talk}[https://docs.google.com/presentation/d/1QMprChHRNbci_4wcbccm6Z9pTsXCvbkboGw5aldReYI/edit?usp=sharing] as well as for my {ISKO 2013 presentation}[https://docs.google.com/file/d/0BzYMwvL-nDZ1NG1ZekJrZ3dPN0E/edit?usp=sharing].

In order to get an impression about the impact of using qSKOS for improving vocabularies, we set up a case study with students in a teaching context. They were given the task of creating a thesaurus, checking it's quality with qSKOS and submit a revised version of the vocabulary after being confronted with the quality report. We published details of this case study at the {HSWI2014 workshop}[http://hswi.referata.com/wiki/HSWI2014]. The paper can be retrieved {online}[http://eprints.cs.univie.ac.at/4045/1/HSWI2014_mader_wartena_cr_version_2.pdf].

@inproceedings{cs4045,
booktitle = {Workshop on Human-Semantic Web Interaction (HSWI?14)},
month = {May},
title = {Supporting Web Vocabulary Development by Automated Quality Assessment: Results of a Case Study in a Teaching Context},
author = {Christian Mader and Christian Wartena},
year = {2014},
series = {CEUR Workshop Proceedings},
url = {http://eprints.cs.univie.ac.at/4045/},
abstract = {Constructing controlled Web vocabularies such as thesauri for search and retrieval tasks is a widely intellectual process that relies on human experts. Thus, errors can occur which decrease the overall quality of the vocabulary. qSKOS is a tool that automatically checks Web vocabularies for potential quality problems and generates quality reports. In this paper we present the results of a case study designed to evaluate the impact of integrating qSKOS in the vocabulary creation process. It was carried out among students skilled in construction of controlled vocabularies. We collected in total 13 vocabularies in two versions and detected 15 different kinds of quality problems. For 11 of these problems we observed reduced occurrences after the vocabularies were revised by the participants based on the results of the generated quality report.}
}

== Contributors (in alphabetical order)

* Riccardo Albertoni ({@riccardoAlbertoni}[https://github.com/riccardoAlbertoni])
* Thomas Francart ({@tfrancart}[https://github.com/tfrancart])
* Bernhard Haslhofer ({@behas}[https://github.com/behas])
* Antoine Isaac ({@aisaac}[https://github.com/aisaac])
* Christian Mader (maintainer, {@cmader}[https://github.com/cmader])
* Lars G. Svensson ({@larsgsvensson}[https://github.com/larsgsvensson])

=== How can I contribute?

* Fork, add and/or improve, and send merge requests
* File issues and/or feature requests

== Copyright

Copyright (c) 2011-2018 Christian Mader. See LICENSE.txt for details