https://github.com/hartwork/rnv

:tropical_fish: Relax NG Compact Syntax validator by David Tolpin; official upstream maintenance repository

expat-xml-parser relax-ng relaxng rnv xml-validation

RNV -- Relax NG Compact Syntax Validator in C

Version 1.7

Table of Contents

News since 1.6
New since 1.5
Acknowledgements
Package Contents
Installation
Invocation
Limitations
Applications

ARX
RVP

User-Defined Datatype Libraries

Datatype Library Plug-in
Scheme Datatypes

New versions

Abstract

RNV is an implementation of Relax NG Compact Syntax,
http://relaxng.org/compact-20021121.html. It is written in ANSI C,
the command-line utility uses Expat,
http://www.jclark.com/xml/expat.html. It is distributed under BSD
license, see license.txt for details.

RNV is part of ongoing work, and the current code may have bugs and
shortcomings; however, it validates documents against a number of
grammars. I use it.

News since 1.6

The format for error messages is similar to that of Jing (file name,
line and column are colon-separated). Entity and DTD processing has
been moved out of RNV; use XX, available from the same download
location, to expand entities.

New since 1.5

Better reporting: required and permitted content is reported
separately, which helps debug grammars. Several bugfixes: I relied on
an acquired test suite and published schemata, but found that I can
make more bugs than they cover, so a reworked and extended test suite
is now used for testing. The code has also been cleaned up and
simplified in places during porting to Plan 9.

Acknowledgements

I would like to thank those who have helped me develop RNV.

Dave Pawson has been the first user of the program.

Alexander Peshkov helps me with testing and I have been able to
correct very well hidden errors with his help.

Sebastian Rahtz encouraged me to continue working on RNV since the
first release, and has helped me to improve it on more than one
occasion.

Package Contents

Note

I have put rnv.exe and arx.exe, Win32 executables statically linked
with a current version of Expat from
http://expat.sourceforge.net/, into a separate distribution
archive (with name ending in -win32bin). It contains only the program
binaries and should be available from the same location as the source
distribution.

The package consists of:
* the license, license.txt;
* the source code, *.[ch];
* the source code map, src.txt;
* Makefile.bsd for BSD make;
* Makefile.gnu for GNU Make;
* Makefile.bcc for Win32 and Borland C/C++ Compiler;
* tools/xck, a simple shell script I am using to validate documents;
* tools/*.rnc, sample Relax NG grammars;
* scm/*.scm, program modules in Scheme, for Scheme Datatypes
Library;
* the log of changes, changes.txt;
* this file, readme.txt;
* other scripts, samples and plug-ins, which appear in tools/ over
time.

Installation

On Unix-like systems, run make -f Makefile.gnu or make -f
Makefile.bsd, depending on which flavour of make you have.
Makefile.bsd should probably work on SysV, but, unfortunately, I have
had nowhere to check this for the last couple of years. If you are
using Expat 1.2, define EXPAT_H as xmlparse.h instead of expat.h.

On Windows, use rnv.exe. To recompile from the sources, use
Makefile.bcc with Borland C/C++ Compiler, or create a makefile or
project for your environment.

Invocation

The command-line syntax is

rnv {-q|-p|-c|-s|-v|-h} grammar.rnc {document1.xml}

If no documents are specified, RNV attempts to read the XML document
from the standard input. The options are:

-q
names of files being processed are not printed; in error
messages, expected elements and attributes are not listed;

-n
sets the maximum number of reported expected elements and
attributes; -q sets this to 0 and can be overridden;

-p
copies the input to the output;

-c
if the only argument is a grammar, checks the grammar and
exits;

-s
uses less memory and runs slower;

-v
prints version number;

-h
displays usage summary and exits.
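
A minimal sketch, in Python, of assembling the command line described
above when rnv is driven from a script. The names build_rnv_command and
run_rnv and their keyword arguments are hypothetical; only the flags and
the argument order are taken from the syntax line. The -n option is left
out because the form of its argument is not shown here, and rnv is
assumed to be on PATH.

```python
import subprocess
from typing import List, Sequence


def build_rnv_command(grammar: str,
                      documents: Sequence[str] = (),
                      quiet: bool = False,
                      low_memory: bool = False,
                      executable: str = "rnv") -> List[str]:
    """Return an argv list of the form: rnv {options} grammar.rnc {document.xml}."""
    argv = [executable]
    if quiet:
        argv.append("-q")   # no file names, no expected elements/attributes
    if low_memory:
        argv.append("-s")   # use less memory, run slower
    argv.append(grammar)
    argv.extend(documents)  # with no documents, rnv reads standard input
    return argv


def run_rnv(grammar: str, documents: Sequence[str],
            **options: bool) -> subprocess.CompletedProcess:
    """Run rnv and capture whatever it prints; interpreting it is left to the caller."""
    return subprocess.run(build_rnv_command(grammar, documents, **options),
                          capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with file
names that contain spaces.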

Limitations

* RNV assumes that the encoding of the syntax file is UTF-8.
* Support for XML Schema Part 2: Datatypes is partial.
+ ordering for duration is not implemented;
+ only local parts of QName values are checked for equality,
ENTITY values are only checked for lexical validity.
* The schema parser does not check that all restrictions are obeyed,
in particular, restrictions 7.3 and 7.4 are not checked.
* RNV for Win32 platforms is a Unix program compiled on Win32. It
expects file paths to be written with forward slashes; if a schema
is in a different directory and includes or refers to external
files, then the schema's path must be written in the Unix way for
the relative paths to work. For example, under Windows, rnv that
uses ..\schema\docbook.rnc to validate userguide.dbx should be
invoked as

rnv.exe ../schema/docbook.rnc userguide.dbx
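
When the schema path comes from Windows-style input, the rewrite to
forward slashes can be done before rnv is invoked. A minimal Python
sketch; to_rnv_path is a hypothetical helper name and not part of RNV.

```python
from pathlib import PureWindowsPath


def to_rnv_path(path: str) -> str:
    """Rewrite a Windows-style path with forward slashes, as rnv expects."""
    return PureWindowsPath(path).as_posix()
```

PureWindowsPath only manipulates the string, so the helper behaves the
same on any platform and never touches the file system.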

Applications

The distribution includes several utilities built upon RNV; they are
listed and described in the following sections.

ARX

ARX is a tool to automatically determine the type of a document from
its name and contents. It is inspired by James Clark's schema location
approach for nXML,
http://groups.yahoo.com/group/emacs-nxml-mode/message/259, and is
a development of the idea described in
http://relaxng.org/pipermail/relaxng-user/2003-December/000214.html.

ARX is a command-line utility. The invocation syntax is

arx {-n|-v|-h} document.xml arx.conf {arx.conf}

ARX either prints a string corresponding to the document's type or
nothing if the type cannot be determined. The options are:

-n
turns off prepending base path of the configuration file to the
result, even if it looks like a relative path (useful when the
configuration file and the grammars are in separate
directories, or for association with something that is not a
file);

-v
prints current version;

-h
displays usage summary and exits.

The configuration file must conform to the following grammar:

arx = grammars route*
grammars = "grammars" "{" type2string+ "}"
type2string = type "=" literal
type = nmtoken
route = match|nomatch|valid|invalid
match = "=~" regexp "=>" type
nomatch = "!~" regexp "=>" type
valid = "valid" "{" rng "}" "=>" type
invalid = "!valid" "{" rng "}" "=>" type

literal=string in '"', '"' inside must be prepended by '\'
regexp=string in '/', '/' inside must be prepended by '\'
rng=Relax NG Compact Syntax

Comments start with # and continue to the end of the line.

Rules are processed sequentially; the first matching rule determines
the file's type. Relax NG templates are matched against file contents;
regular expressions are applied to file names. The sample below
associates documents with grammars for XSLT, DocBook or XSL FO.

grammars {
docbook="docbook.rnc"
xslt="xslt.rnc"
xslfo="fo.rnc"
}

valid {
start = element (book|article|chapter|reference) {any}
any = (element * {any}|attribute * {text}|text)*
} => docbook

!valid {
default namespace xsl = "http://www.w3.org/1999/XSL/Transform"
start = element *-xsl:* {not-xsl}
not-xsl = (element *-xsl:* {not-xsl}|attribute * {text}|text)*
} => xslt

=~/.*\.xsl/ => xslt
=~/.*\.fo/ => xslfo

ARX can also be used to link documents to any type of information or
processing.
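
The first-match rule for the name-based routes (=~ and !~) can be
sketched in Python as follows. This is an illustration only, not ARX's
implementation: route_by_name and the tuple layout are hypothetical,
valid and !valid rules are not modelled because they need a Relax NG
validator, Python's re dialect differs from the one ARX uses, and
matching the whole file name (rather than a substring) is an assumption
suggested by the leading .* in the sample.

```python
import re
from typing import List, Optional, Tuple

# (operator, regular expression, type) triples, in configuration order.
Route = Tuple[str, str, str]


def route_by_name(file_name: str, routes: List[Route]) -> Optional[str]:
    """Return the type chosen by the first name-based rule that applies."""
    for operator, regexp, doc_type in routes:
        matched = re.fullmatch(regexp, file_name) is not None
        if (operator == "=~" and matched) or (operator == "!~" and not matched):
            return doc_type
    return None  # like ARX, report nothing when the type is undetermined


# The two name-based rules and the grammars block from the sample above.
GRAMMARS = {"docbook": "docbook.rnc", "xslt": "xslt.rnc", "xslfo": "fo.rnc"}
ROUTES = [("=~", r".*\.xsl", "xslt"), ("=~", r".*\.fo", "xslfo")]
```

Looking up the returned type in GRAMMARS gives the string that the
grammars block associates with it.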

RVP

RVP is an abbreviation for Relax NG Validation Pipe. It reads validation
primitives from the standard input and reports results to the standard
output; its main purpose is to ease embedding of a Relax NG validator
into various languages and environments. An application would launch
RVP as a parallel process and use a simple protocol to perform
validation. The protocol, in BNF, is:

query ::= (
quit
| start
| start-tag-open
| attribute
| start-tag-close
| text
| end-tag) z.
quit ::= "quit".
start ::= "start" [gramno].
start-tag-open ::= "start-tag-open" patno name.
attribute ::= "attribute" patno name value.
start-tag-close ::= "start-tag-close" patno name.
text ::= ("text"|"mixed") patno text.
end-tag ::= "end-tag" patno name.
response ::= (ok | er | error) z.
ok ::= "ok" patno.
er ::= "er" patno erno.
error ::= "error" patno erno error.
z ::= "\0" .

* RVP assumes that the last colon in a name separates the local part
from the namespace URI (this is what one gets by specifying `:' as
the namespace separator to Expat).
* Error codes can be grabbed from rvp sources by grep _ER_ *.h and
OR-ing them with corresponding masks from erbit.h. Additionally,
error 0 is the protocol format error.
* Either er or error responses are returned, not both; -q chooses
between concise and verbose forms (invocation syntax described
later).
* start passes the index of a grammar (first grammar in the list of
command-line arguments has number 0); if the number is omitted, 0
is assumed.
* quit is not the opposite of start; instead, it quits RVP.
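
The message layout in the BNF above can be sketched in Python as
follows. The BNF does not say how the fields of a message are separated;
a single space is assumed here. The names Response, encode_query,
parse_response and split_name are hypothetical helpers, not part of RVP.

```python
from typing import NamedTuple, Optional, Tuple


class Response(NamedTuple):
    kind: str                      # "ok", "er" or "error"
    patno: int
    erno: Optional[int] = None
    message: Optional[str] = None


def encode_query(keyword: str, *fields: object) -> bytes:
    """Join the fields with spaces (an assumption) and append the NUL terminator."""
    return " ".join([keyword, *map(str, fields)]).encode("utf-8") + b"\0"


def parse_response(raw: bytes) -> Response:
    """Parse one response, with or without its trailing NUL."""
    text = raw.rstrip(b"\0").decode("utf-8")
    kind, _, rest = text.partition(" ")
    if kind == "ok":
        return Response(kind, int(rest))
    if kind == "er":
        patno, erno = rest.split()
        return Response(kind, int(patno), int(erno))
    if kind == "error":
        parts = rest.split(" ", 2)
        message = parts[2] if len(parts) > 2 else ""
        return Response(kind, int(parts[0]), int(parts[1]), message)
    raise ValueError(f"unexpected response: {text!r}")


def split_name(name: str) -> Tuple[str, str]:
    """Split 'namespace-uri:local-part' at the last colon."""
    uri, _, local = name.rpartition(":")
    return uri, local
```

Splitting at the last colon matters because namespace URIs usually
contain colons themselves.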

The command-line syntax is:

rvp {-q|-s|-v|-h} {schema.rnc}

The options are:

-q
returns only error numbers, suppresses messages;

-s
takes less memory and runs slower;

-v
prints current version;

-h
displays usage summary and exits.

To assist embedding RVP, samples in Perl (tools/rvp.pl) and Python
(tools/rvp.py) are provided. The scripts use Expat wrappers for each
of the languages to parse documents; they take a Relax NG grammar (in
the compact syntax) as the command line argument and read the XML from
the standard input. For example, the following commands validate
rnv.dbx against docbook.rnc:

perl rvp.pl docbook.rnc < rnv.dbx
python rvp.py docbook.rnc < rnv.dbx

The scripts are kept simple and unobscured to illustrate the
technique, rather than being designed as general-purpose modules.
Programmers using Perl, Python, Ruby and other languages are
encouraged to implement and share reusable RVP-based components for
their languages of choice.
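
As a starting point for such a component, here is a minimal Python
sketch of keeping one rvp process running and exchanging NUL-terminated
messages with it. It is not taken from tools/rvp.py; RvpClient and
read_until_nul are hypothetical names, rvp is assumed to be on PATH, and
one response per query is assumed.

```python
import subprocess
from typing import BinaryIO


def read_until_nul(stream: BinaryIO) -> bytes:
    """Read one NUL-terminated message from a binary stream."""
    chunks = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream closed before the terminating NUL")
        if byte == b"\0":
            return bytes(chunks)
        chunks += byte


class RvpClient:
    """Hypothetical wrapper around a single long-running rvp process."""

    def __init__(self, grammar: str, executable: str = "rvp") -> None:
        # Unbuffered pipes, so each query reaches rvp as soon as it is written.
        self.process = subprocess.Popen(
            [executable, grammar],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0)

    def query(self, text: str) -> str:
        """Send one query and return the raw text of the response."""
        self.process.stdin.write(text.encode("utf-8") + b"\0")
        self.process.stdin.flush()
        return read_until_nul(self.process.stdout).decode("utf-8")

    def close(self) -> None:
        """Send quit, then wait for rvp to exit."""
        self.process.stdin.write(b"quit\0")
        self.process.stdin.close()
        self.process.wait()
```

A caller would create RvpClient("docbook.rnc"), send "start", feed the
pattern number from each response into the next query, and call close()
when the document ends.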

User-Defined Datatype Libraries

Relax NG relies on XML Schema Datatypes to check validity of data in
an XML document. The specification allows the implementation to
support other datatype libraries; a library is required to provide two
services, datatypeAllows and datatypeEqual.

A powerful and popular technique is the use of string regular
expressions to restrict values of attributes and character data.
However, XML Schema regular expressions must be written as single
strings, without any parameterization; they often grow to several
dozens of characters in length and are very hard to read or debug.

A solution to this problem would be to allow the user to define
custom datatypes and to specify them in a high-level programming
language. The user can then either use regular expressions as such,
employ lex for lexical analysis, or any other technique which is best
suited for each particular case (for example XSL FO datatypes would
benefit from a custom datatype library). With many datatype libraries
eventually implemented, it is likely that a clearer picture of the
right language for validation of data will eventually emerge.

RNV provides two different ways to implement this solution; I believe
that they correspond to different tastes and traditions. In both
cases, a high-level language can be used to implement a datatype
library, the language is not related to the implementation language of
RNV, and RNV need not be recompiled to add a new datatype library.

Datatype Library Plug-in

A datatype plug-in is an executable. RNV invokes it as either
program allows type key value ... data

or
program equal type data1 data2

program is the executable's name, the rest is the command line; key
and value pairs are datatype parameters and can be repeated. The
program is executed for each datatype in library
http://davidashen.net/relaxng/pluggable-datatypes; the exit status
is 0 for success, non-zero for failure.
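
A minimal sketch of such a plug-in, written in Python to show the
calling convention only. The datatype int and its parameters min and
max are hypothetical and exist purely for this illustration; xsdck.c in
the distribution is the real example.

```python
from typing import Dict, List


def allows(type_name: str, params: Dict[str, str], data: str) -> bool:
    """Hypothetical type 'int' with optional 'min' and 'max' parameters."""
    if type_name != "int":
        return False
    try:
        value = int(data)
        low = int(params["min"]) if "min" in params else value
        high = int(params["max"]) if "max" in params else value
    except ValueError:
        return False
    return low <= value <= high


def equal(type_name: str, data1: str, data2: str) -> bool:
    """Two 'int' values are equal when they denote the same integer."""
    if type_name != "int":
        return False
    try:
        return int(data1) == int(data2)
    except ValueError:
        return False


def main(argv: List[str]) -> int:
    """Return the exit status: 0 for success, 1 for failure."""
    args = argv[1:]
    if len(args) >= 3 and args[0] == "allows":
        # program allows type key value ... data
        type_name, pairs, data = args[1], args[2:-1], args[-1]
        if len(pairs) % 2:
            return 1  # malformed key/value list
        params = dict(zip(pairs[0::2], pairs[1::2]))
        return 0 if allows(type_name, params, data) else 1
    if len(args) == 4 and args[0] == "equal":
        # program equal type data1 data2
        return 0 if equal(args[1], args[2], args[3]) else 1
    return 1

# A script used as a plug-in would end with: sys.exit(main(sys.argv))
```

Because the program is started once per checked value, a plug-in written
in an interpreted language pays its start-up cost on every call.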

Both RNV and RVP can use pluggable datatypes, and must be compiled
with DXL_EXC set to 1 (make DXL_EXC=1) to support them, in which case
they accept an additional command-line option -d with the name of the
plugin as the argument. An implementation of XML Schema datatypes as a
plugin (in C) is included in the distribution, see xsdck.c. For
example,
rnv -d xsdck xslt-dxl.rnc $HOME/work/docbook/xsl/*/*.xsl

will validate all DocBook XSL stylesheets on my workstation against a
grammar for XSLT 1.0 modified to use RNV Pluggable Datatypes Library
instead of XML Schema Datatypes.

Scheme Datatypes

Another way to add custom datatypes to RNV is to use the built-in
Scheme interpreter (SCM,
http://www.swiss.ai.mit.edu/~jaffer/SCM.html) to implement the
library in Scheme, a dialect of Lisp. This solution is more flexible
and robust than the previous one, but requires knowledge of a
particular programming language (or at least desire to learn it, and
the result is definitely worth the effort).

To support it, SCM must be installed on the computer, and RNV or RVP
must be compiled with DSL_SCM set to 1 (make DSL_SCM=1), in which case
they accept an additional option -e with the name of a scheme program
as an argument. The datatype library is bound to
http://davidashen.net/relaxng/scheme-datatypes; a sample
implementation is in scm/dsl.scm. For example,
rnv -e scm/dsl.scm xslt-dsl.rnc $HOME/work/docbook/xsl/*/*.xsl

checks the stylesheets against an XSLT 1.0 grammar modified to use an
RNV Scheme Datatypes Library implemented in scm/dsl.scm.

A Datatype Library in Scheme must provide two functions in top-level
environment:
(dsl-equal? string string string)

and
(dsl-allows? string '((string . string)*) string)

To assist development of datatype libraries, a Scheme implementation
of XML Schema Regular Expressions is included in the distribution as
scm/rx.scm. The Regular Expression library is not just a way to
re-implement the built-in datatypes. Owing to flexibility of the
language it is much easier to write and debug regular expressions in
Scheme, even if they are to be used with built-in XML Schema Datatypes
in the end. For example, a regular expression for e-mail address, with
insignificant simplifications, is:
pattern=
"(\(([^\(\)\\]|\\.)*\) )?"
~ "([a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
~ "(\.[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+)*"
~ """|"([^"\\]|\\.)*")"""
~ "@"
~ "([a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
~ "(\.[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+)*"
~ "|\[([^\[\]\\]|\\.)*\])"
~ "( \(([^\(\)\\]|\\.)*\))?"

which, even split across several lines, is ugly-looking and hard to read.
Meanwhile, it consists of a few repeating subexpressions, which could
easily be factored out, but the syntax does not have the means for
that.

Using the Scheme interpreter, it is as simple as
(define addr-spec-regex
(let* (
(atom "[a-zA-Z0-9!#$%&'*+\\-/=?\\^_`{|}~]+")
(person "\"([^\"\\\\]|\\\\.)*\"")
(location "\\[([^\\[\\]\\\\]|\\\\.)*\\]")
(domain (string-append atom "(\\." atom ")*")))
(string-append
"(" domain "|" person ")"
"@"
"(" domain "|" location ")")))

This code is much simpler to read and debug, and then the parts can be
joined and added to the grammar for production use. Furthermore, it is
easy to implement the parsing of structured regular expressions
embedded into parameters of datatypes in Relax NG itself. dsl.scm, the
sample datatype library, can handle parameter s-pattern with regular
expressions split into named parts, and the example above becomes:
s-pattern="""
comment = "\(([^\(\)\\]|\\.)*\)"
atom = "[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
atoms = atom "(\." atom ")*"
person = "\"([^\"\\]|\\.)*\""
location = "\[([^\[\]\\]|\\.)*\]"
local-part = "(" atom "|" person ")"
domain = "(" atoms "|" location ")"
start = "(" comment " )?" local-part "@" domain "( " comment ")?"
"""

addr-spec-dsl.rnc is included in the distribution.
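
The same factoring into named parts can be tried outside RNV. The
sketch below rebuilds the e-mail pattern in Python for quick
experiments. It follows the flat pattern shown earlier, so the local
part allows dot-separated atoms; it uses Python's re dialect rather than
XML Schema regular expressions, and is_addr_spec is a hypothetical
helper name.

```python
import re

# Named parts, joined by plain string concatenation.
comment = r"\(([^\(\)\\]|\\.)*\)"
atom = r"[a-zA-Z0-9!#$%&'*+\-/=?\^_`{|}~]+"
atoms = atom + r"(\." + atom + ")*"
person = r'"([^"\\]|\\.)*"'
location = r"\[([^\[\]\\]|\\.)*\]"
local_part = "(" + atoms + "|" + person + ")"
domain = "(" + atoms + "|" + location + ")"

addr_spec = re.compile(
    "(" + comment + " )?" + local_part + "@" + domain + "( " + comment + ")?")


def is_addr_spec(value: str) -> bool:
    """True when the whole value matches the assembled pattern."""
    return addr_spec.fullmatch(value) is not None
```

Each part can be compiled and tested separately before the parts are
joined, which is the debugging advantage described above.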

New versions

Visit http://davidashen.net/ for news and downloads.