Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/candlerb/zml

Store and retrieve XML in a SQL database
https://github.com/candlerb/zml

Last synced: 5 days ago
JSON representation

Store and retrieve XML in a SQL database

Host: GitHub
URL: https://github.com/candlerb/zml
Owner: candlerb
Created: 2010-04-30T20:44:20.000Z (over 14 years ago)
Default Branch: master
Last Pushed: 2010-04-30T20:48:48.000Z (over 14 years ago)
Last Synced: 2024-10-30T12:12:00.825Z (about 2 months ago)
Language: Ruby
Homepage:
Size: 223 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README

Awesome Lists containing this project

README

        WHAT IS ZML?

============

ZML (pronounced "Zimmel") is currently just a toy, a proof-of-concept (but

it does actually run!) It allows you to store and retrieve native XML

documents in a SQL database, retrieving either whole documents or just

subtrees.

Documentation is pretty much non-existent, apart from this file.

HOW DOES IT WORK?

=================

(1) PATHS

ZML keeps tables for elements and attributes, and ancilliary tables which

map tag id to name, attribute id to name, and namespace id to URI.

The vaguely clever bit is its use of a 'path' string as the primary key. An

XML document is a tree, and the path is rather like the path used to

traverse a filesystem tree (/dir1/dir2/...), but is actually stored as a

compacted string value.

                              filesystem   string

                           /           ""

                         /0          "0"

                          /1          "1"

                    /1/0        "10"

                    /1/1        "11"

                    /1/2        "12"

     

   

The values 0-9 and A-V are used to represent child 0 to child 31. If a node

has more than 32 children, longer values are used:

     Wxx         32 to 2^10-1     (x = 0-9 or A-V)

     Xxxxx       2^10 to 2^20-1

     Yxxxxxx     2^20-1 to 2^30

     Zxxxxxxxx   2^30 to 2^40-1

Why go to this trouble? Well, several reasons. Firstly, you can locate all

elements under node 123 just by saying "path like '123%'". Secondly, if you

ORDER BY this field, then you get the nodes out in exactly the right order

you need to regenerate the XML: every node preceeds its children.

Hence, this system allows you to spool XML into and out of the database,

without actually creating any intermediate representation of the tree in

memory; document size is limited only by the storage capacity of your

database. See for example bin/zml_dump and lib/zml/sql/sqltostream.rb for

the code which does this.

Furthermore, it should allow a large subset of XPATH queries to be mapped

directly into SQL queries. See README.paths for a fuller description of why

I choose this path structure, and how it can be used for XPATH.

(2) ELEMENTS

The elements table has a (unique) path, an element type, and some optional

content.

A leaf text node is the most common case, and is represented by a single row

in the database containing both the tag and its content: hello at

path 123 is

	path	elem	content

	----	----	-------

	123	foo	"hello"

(where 'foo' is actually an integer foreign key into the element_tags table,

but I'll ignore that for now)

 is the same, but the content is NULL.

Nesting of elements is implied from the paths:

	123	p	"hello "

	1230	b	"world!"

is:

hello world!


There is a TEXT element for cases where text follows a child element:

	123	p	"This is a "

	1230	i	"concrete"

	1231	TEXT	" example"

This is a concrete example


There are other elements for COMMENT and processing instructions.

The element row also contains the number of the next child to be added. In

effect, each element contains a 'sequence' for its children.

(3) ATTRIBUTES

are just held in an attributes table, indexed by the element path and the

attribute id (which forces you to have no more than one instance of any

particular attribute, as required by the XML spec)

Quick start demonstration of command-line zml utilities

=======================================================

These examples assume you are using Sqlite as the backend. However, you

should be able to use other DBI backends (tested with dbi:mysql, dbi:pg)

(1) Create file ~/.zmlconf.rb which contains the path to the zml libraries;

this is used by all the zml binaries, and means you don't actually have to

install zml anywhere in the library search path.

$ vi ~/.zmlconf.rb

    # Set the path to the zml/lib directory here

    $:.push "/home/brian/projects/zml/lib"

    # You can also set a default ZML database if you wish:

    #$zmldb = ['dbi:sqlite:/home/brian/projects/zml/test/mydb.db']

(2) Use zml_initdb to create the tables

$ cd test

$ ../bin/zml_initdb dbi:sqlite:mydb.db

(3) Replace the entire (empty) database with a root node, and dump it

    back out

$ ../bin/zml_restore -r -f root.xml dbi:sqlite:mydb.db

Load complete, element path=""    <-- this is the root node

$ ../bin/zml_dump dbi:sqlite:mydb.db

(4) Add some more documents under the root

$ ../bin/zml_restore -f test1.xml dbi:sqlite:mydb.db

Load complete, element path="0"

$ ../bin/zml_restore -f XMLSchema.xsd dbi:sqlite:mydb.db

Load complete, element path="1"

Note: Leading/trailing whitespace is not preserved in the element/document

unless you set attribute xml:space='preserve'

(5) You can dump the entire document tree, or individual documents

$ ../bin/zml_dump dbi:sqlite:mydb.db          # whole database

$ ../bin/zml_dump -p "0" dbi:sqlite:mydb.db   # just elements under path "0"

$ ../bin/zml_dump -p "02" dbi:sqlite:mydb.db  # just elements under path "02"

(6) Look at the data using SQL queries: e.g. to see the elements (with

    their element ids mapped into names) under path "0" you can type

$ sqlite mydb.db

sqlite> select e.path, ns.prefix, et.tag, e.content from elements e

   ...> left join element_tags et on e.elemid = et.elemid

   ...> left join namespaces ns on et.nsid = ns.nsid

   ...> where e.path like '0%' order by e.path;

(6) Other options:

zml_restore -p "path"     # add a new child element underneath node "path"

zml_restore -r -p "path"  # *replace* node "path" with this data

(path defaults to "", i.e. the root node)

zml_dump can only select elements by path at the moment. The next big module

is to convert XPATH queries into SQL, and do indexing of elements and

attributes.