https://github.com/fgregg/smered

Mirror of https://bitbucket.org/resteorts/smered
deduplication entity-resolution record-linkage


# Bayesian record linkage

This repository builds a program to perform Bayesian record linkage using the model described in a forthcoming paper (ref. when available). The source code comes with an [ant](http://ant.apache.org) build script. To compile the program, run `ant` from the base directory; this creates an executable jar file named `MHSampler.jar`. You can then run the program with

> java -jar MHSampler.jar CONFIG_FILE FILE FILE ...

where the first command-line argument, `CONFIG_FILE`, is an XML configuration file and the remaining command-line arguments are whitespace-delimited data files that you wish to link. You should supply at least two files to link. For example:

> java -jar MHSampler.jar config.xml *.dat

(assuming there are at least two `.dat` files in the current directory).

## Configuration file format

The configuration file is an XML file with a top-level `` element which contains ``, ``, and `` elements.

The `` element is optional. If present, it contains elements corresponding to the specific options you wish to set. Options are set using the `value` attribute. Supported options are:

* ``, boolean, if true then all files are assumed to be deduplicated (default: false).
* ``, positive integer, number of split-merge Metropolis-Hastings (MH) steps per outer iteration (default: 10,000).
* ``, positive integer, write output every so many Gibbs iterations (default: 100).
* ``, positive integer, begin taking averages only after this many Gibbs iterations (default: 5,000).
* ``, positive integer, number of Gibbs iterations (default: 1,005,001).

For example, to specify a burn-in of 7,000, you would write




The `` element contains a number of `` elements corresponding to the fields in the files you wish to match. Each `` element has `name` and `type` attributes. The `type` must be either `KEY` or `VAR`. There can be at most one field of type `KEY` and, if present, it must be the first field. The `` element is required.
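The field-type rules above can be checked mechanically before a run. The following is a minimal sketch, not part of the program itself: the class `SchemaCheck` and its `validate` method are hypothetical names, and the sketch assumes the field types have already been read from the configuration file into a list, in file order.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of MHSampler): checks the field-type rules
// stated in this README against a list of type strings in file order.
public class SchemaCheck {

    /** Returns null if the types obey the rules, else a description of the first problem. */
    public static String validate(List<String> types) {
        for (int i = 0; i < types.size(); i++) {
            String type = types.get(i);
            // Rule 1: every type must be KEY or VAR.
            if (!"KEY".equals(type) && !"VAR".equals(type)) {
                return "field " + i + " has unknown type '" + type + "'";
            }
            // Rule 2: a KEY field may only appear in the first position.
            // This also guarantees there is at most one KEY field.
            if ("KEY".equals(type) && i != 0) {
                return "field " + i + " is a KEY but is not the first field";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative inputs only.
        String problem = validate(Arrays.asList("VAR", "KEY"));
        System.out.println(problem == null ? "schema ok" : problem);
    }
}
```

Because a `KEY` is only accepted at index 0, the "at most one `KEY`" rule needs no separate counter.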

The `` element is optional, and consists of a number of `` elements. Each `` element must have a `name` attribute, and the names given should correspond to names of fields in the ``.
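One way to catch a misspelled name early is to compare the names used in this optional element against the schema's field names. This is a minimal sketch under stated assumptions: `BlockNameCheck` and `unknownNames` are hypothetical and not part of the program, both name lists are assumed to have been extracted from the configuration file already, and the field names shown are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of MHSampler): reports names that do not
// match any field name declared in the schema.
public class BlockNameCheck {

    /** Returns the entries of usedNames that are absent from schemaNames, in order. */
    public static List<String> unknownNames(List<String> schemaNames, List<String> usedNames) {
        Set<String> known = new HashSet<>(schemaNames);
        List<String> unknown = new ArrayList<>();
        for (String name : usedNames) {
            if (!known.contains(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Illustrative field names only.
        List<String> schema = Arrays.asList("id", "surname", "birth_year");
        List<String> used = Arrays.asList("surname", "zip");
        System.out.println("Names not found in schema: " + unknownNames(schema, used));
    }
}
```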

Here is a complete example configuration file: