https://github.com/fgregg/smered
Mirror of https://bitbucket.org/resteorts/smered
- Host: GitHub
- URL: https://github.com/fgregg/smered
- Owner: fgregg
- License: bsd-3-clause
- Created: 2017-03-11T23:25:09.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-03-12T03:27:58.000Z (almost 9 years ago)
- Last Synced: 2025-04-14T10:49:29.438Z (10 months ago)
- Topics: deduplication, entity-resolution, record-linkage
- Language: Java
- Size: 4.48 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README
- License: LICENSE.txt
README
# Bayesian record linkage
This repository contains a program that performs Bayesian record linkage using the model described in a forthcoming paper (ref. when available). The source code comes with an [ant](http://ant.apache.org) build script. To compile the program, run `ant` from the base directory. This creates an executable jar file named `MHSampler.jar`. You can then run the program by calling
> java -jar MHSampler.jar CONFIG_FILE FILE FILE ...
where the first command-line argument, `CONFIG_FILE`, is an XML configuration file and the remaining command-line arguments are whitespace-delimited data files that you wish to link. You should supply at least two files to link. For example:
> java -jar MHSampler.jar config.xml *.dat
(assuming there are at least two `.dat` files in the current directory).
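The argument layout above (one configuration file followed by two or more data files) can be sketched as a small check. This is only an illustration: the class `LinkArgs` and the method `check` are hypothetical names, not part of MHSampler, and treating "at least two files" as a hard requirement is an assumption made for the sketch.

```java
public class LinkArgs {
    /**
     * Returns an error message, or null when the argument list holds a
     * configuration file followed by at least two data files.
     * Hypothetical helper; MHSampler's own argument handling may differ.
     */
    public static String check(String... args) {
        if (args.length == 0) {
            return "missing CONFIG_FILE";
        }
        int dataFiles = args.length - 1; // everything after the config file
        if (dataFiles < 2) {
            return "expected at least two data files, got " + dataFiles;
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample argument lists, checked in the same order as on the command line.
        System.out.println(check("config.xml", "a.dat", "b.dat"));
        System.out.println(check("config.xml", "a.dat"));
    }
}
```

A wrapper script could run a check of this kind before launching the jar, so that a missing file argument is reported before the sampler starts.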
## Configuration file format
The configuration file is an XML file with a top-level `` element which contains ``, ``, and `` elements.
The `` element is optional. If present, it contains elements corresponding to the specific options you wish to set. Options are set using the `value` attribute. Supported options are:
* ``, boolean, if true then all files are assumed to be deduplicated (default: false).
* ``, positive integer, number of split-merge (MH) steps per outer iteration (default: 10,000).
* ``, positive integer, write output every so many Gibbs iterations (default: 100).
* ``, positive integer, begin taking averages only after this many Gibbs iterations (default: 5,000).
* ``, positive integer, number of Gibbs iterations (default: 1,005,001).
For example, to specify a burn-in of 7,000, you would write
The `` element contains a number of `` elements corresponding to the fields in the files you wish to match. Each `` element has a `name` and `type` attribute. The `type` must be one of `KEY` or `VAR`. There can be at most one field of type `KEY` and, if present, it must be the first field. The `` element is required.
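The rule on field types (each type is `KEY` or `VAR`, at most one `KEY`, and a `KEY` must come first) can be expressed as a short check. The sketch below is an assumption about how such a rule might be enforced, not the program's actual code; `FieldTypeCheck` and `isValid` are hypothetical names.

```java
public class FieldTypeCheck {
    /**
     * Checks a list of field types, given in the order the fields are declared.
     * Hypothetical helper, not taken from the MHSampler sources.
     */
    public static boolean isValid(String... types) {
        for (int i = 0; i < types.length; i++) {
            String type = types[i];
            if (type.equals("KEY")) {
                // A KEY is only allowed in first position, which also
                // guarantees there is never more than one.
                if (i != 0) {
                    return false;
                }
            } else if (!type.equals("VAR")) {
                return false; // only KEY and VAR are recognised
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("KEY", "VAR", "VAR"));
        System.out.println(isValid("VAR", "KEY"));
    }
}
```

Requiring `KEY` to sit at index 0 covers both constraints at once, so no separate counter is needed.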
The `` element is optional, and consists of a number of `` elements. Each `` element must have a `name` attribute, and the names given should correspond to names of fields in the ``.
Here is a complete example configuration file: