https://github.com/vida-nyu/urban-data-db
- Host: GitHub
- URL: https://github.com/vida-nyu/urban-data-db
- Owner: VIDA-NYU
- License: apache-2.0
- Created: 2018-09-26T18:19:29.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-03-20T14:45:36.000Z (about 5 years ago)
- Last Synced: 2025-01-24T15:36:37.711Z (4 months ago)
- Language: Java
- Size: 2.68 MB
- Stars: 0
- Watchers: 7
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Urban Data Integration - Database
=================================

This Java library is part of the **Urban Data Integration** project. It provides classes and functionality to maintain and transform (open urban) data sets.

Compute Database Column Similarity
----------------------------------

The library contains JAR files to compute pairwise similarity between database columns. The computation assumes that a unique term index has been generated beforehand.

### Create Unique Term Index
The JAR file `TermIndexGenerator.jar` generates an index of the unique terms in the database. Each term is assigned a unique identifier. For each term, the index stores the assigned data type and a comma-separated list of column:frequency pairs; each pair gives the frequency of the term in the identified column.
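
To make the index structure concrete, here is a minimal in-memory sketch. It is not the code inside `TermIndexGenerator.jar`: the class and method names are hypothetical, the tab-separated line layout is an assumption (the on-disk format is not specified here), and the type guess is deliberately simplified (no `DATE` detection). It also ignores the memory buffer and the intermediate files written to disk.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: all names here are hypothetical and are not taken
// from TermIndexGenerator.jar.
class TermIndexSketch {

    // Data type codes as listed in this README (DATE = 4 is not detected here).
    static final int INTEGER = 1, DECIMAL = 2, LONG = 3, STRING = 5;

    // term -> (column id -> frequency of the term in that column)
    private final Map<String, Map<Integer, Integer>> postings = new LinkedHashMap<>();

    // Record one occurrence of a term in the given column.
    void add(int columnId, String term) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .merge(columnId, 1, Integer::sum);
    }

    // Simplified type guess (assumption): try INTEGER, then LONG, then DECIMAL,
    // and fall back to STRING.
    static int guessType(String term) {
        try { Integer.parseInt(term); return INTEGER; } catch (NumberFormatException e) { }
        try { Long.parseLong(term); return LONG; } catch (NumberFormatException e) { }
        try { new BigDecimal(term); return DECIMAL; } catch (NumberFormatException e) { }
        return STRING;
    }

    // One line per term: id, term, type code, and column:frequency pairs.
    // The tab-separated layout is an assumption made for this sketch.
    List<String> toLines() {
        List<String> lines = new ArrayList<>();
        int termId = 0;
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            StringBuilder pairs = new StringBuilder();
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                if (pairs.length() > 0) pairs.append(',');
                pairs.append(p.getKey()).append(':').append(p.getValue());
            }
            lines.add(termId++ + "\t" + e.getKey() + "\t"
                    + guessType(e.getKey()) + "\t" + pairs);
        }
        return lines;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add(0, "brooklyn");
        index.add(0, "brooklyn");
        index.add(1, "brooklyn");
        index.add(1, "11201");
        index.toLines().forEach(System.out::println);
    }
}
```

In this sketch the term `brooklyn` ends up with the pairs `0:2,1:1`, meaning it occurs twice in column 0 and once in column 1.
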
The possible data types that a term can have are:
```
1=INTEGER
2=DECIMAL
3=LONG
4=DATE
5=STRING
```

Usage:
```
java -jar TermIndexGenerator.jar
: Directory with column files
: Size of the memory-buffer. Once buffer is full intermediate results are written to disk.
: Output file for term index
```

### Compute Pairwise Column Similarity
Use the JAR file `ComputeColumnSimilarity.jar` to compute pairwise similarity between columns. It requires a term index file generated using `TermIndexGenerator.jar`.
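
As a rough illustration of what a pairwise comparison involves, the sketch below computes a plain Jaccard index over the distinct terms of two columns, plus one common weighted Jaccard formulation. The class and method names are hypothetical. The weighted formula (sum of minimum weights divided by sum of maximum weights, with each frequency divided by a per-column scale) is an assumption: this README only names the normalization scale used by each `WJI` variant, not the exact formula the JAR applies.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: names and the weighted formula are assumptions,
// not the implementation inside ComputeColumnSimilarity.jar.
class ColumnSimilaritySketch {

    // Jaccard index: size of the intersection divided by size of the union
    // of the distinct terms in both columns.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Weighted Jaccard (assumed formulation): each term frequency is divided
    // by the column's scale, then sum(min) / sum(max) is taken over all terms.
    static double weightedJaccard(Map<String, Integer> a, double scaleA,
                                  Map<String, Integer> b, double scaleB) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double minSum = 0.0, maxSum = 0.0;
        for (String t : terms) {
            double wa = a.getOrDefault(t, 0) / scaleA;
            double wb = b.getOrDefault(t, 0) / scaleB;
            minSum += Math.min(wa, wb);
            maxSum += Math.max(wa, wb);
        }
        return maxSum == 0.0 ? 0.0 : minSum / maxSum;
    }

    // Scale in the spirit of WJI-COLSIZE: total number of values in the column.
    static double columnSize(Map<String, Integer> col) {
        return col.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Scale in the spirit of WJI-COLMAX: frequency of the most frequent value.
    static double columnMax(Map<String, Integer> col) {
        return col.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("x", 2, "y", 2);
        Map<String, Integer> b = Map.of("x", 1, "z", 3);
        System.out.println(jaccard(a.keySet(), b.keySet()));
        System.out.println(weightedJaccard(a, columnSize(a), b, columnSize(b)));
        System.out.println(weightedJaccard(a, columnMax(a), b, columnMax(b)));
    }
}
```

Passing the scale as a parameter keeps the two weighted variants on a single code path; only the normalization differs.
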
```
java -jar ComputeColumnSimilarity.jar
: The term-index file generated using TermIndexGenerator.jar
: Similarity function [
JI : Jaccard-Index |
WJI-COLSIZE : Weighted Jaccard-Index that uses total number of values in a column as normalization scale |
WJI-COLMAX : Weighted Jaccard-Index that uses the most frequent value in a column as normalization scale
]
: Outputs only column pairs with similarity above the given threshold
: Number of parallel threads to use
: Output file for similarities. The format is tab-delimited:
1) ID of column 1,
2) ID of column 2,
3) similarity
```

### N-Gram Column Generator
To compute column similarity based on n-grams, first transform the database column files into n-gram column files. These files contain only the n-grams of the column values. Then use `TermIndexGenerator.jar` to create a unique n-gram index before computing column similarity.
```
java -jar NGramColumnGenerator.jar
: Directory with database column files
: Size of the generated n-grams
: Add special padding characters ('$' and '#') at beginning and end of each value [true | false]
: Output directory for n-gram column files. The file format is the same as for the input files
```
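
The n-gram transformation described in this section can be sketched for a single value as follows. The class and method names are hypothetical, and the padding width is an assumption: this sketch adds `n - 1` copies of `'$'` before and `'#'` after the value, which is a common convention but may differ from what `NGramColumnGenerator.jar` does.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the implementation inside NGramColumnGenerator.jar.
class NGramSketch {

    // Returns all character n-grams of the value, in order of occurrence.
    static List<String> ngrams(String value, int n, boolean pad) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        String s = value;
        if (pad) {
            // Assumption: n - 1 padding characters on each side.
            s = "$".repeat(n - 1) + value + "#".repeat(n - 1);
        }
        List<String> result = new ArrayList<>();
        // Slide a window of length n over the (optionally padded) value.
        for (int i = 0; i + n <= s.length(); i++) {
            result.add(s.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("abc", 2, false)); // [ab, bc]
        System.out.println(ngrams("abc", 2, true));  // [$a, ab, bc, c#]
    }
}
```

Without padding, a value shorter than `n` produces no n-grams in this sketch; with padding, every character of the value appears in at least one n-gram.
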