https://github.com/vida-nyu/urban-data-db


Host: GitHub
URL: https://github.com/vida-nyu/urban-data-db
Owner: VIDA-NYU
License: apache-2.0
Created: 2018-09-26T18:19:29.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-03-20T14:45:36.000Z (about 5 years ago)
Last Synced: 2025-01-24T15:36:37.711Z (4 months ago)
Language: Java
Size: 2.68 MB
Stars: 0
Watchers: 7
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Urban Data Integration - Database
=================================

This Java library is part of the **Urban Data Integration** project. It provides classes and functionality to maintain and transform (open urban) data sets.

Compute Database Column Similarity
----------------------------------

The library contains JAR files to compute the pairwise similarity between database columns. The computation assumes that a unique term index has been generated beforehand.

### Create Unique Term Index

The JAR file `TermIndexGenerator.jar` generates the set of unique terms in the database and assigns each term a unique identifier. For each term, the index stores the term's data type and a comma-separated list of `column:frequency` pairs, where each pair gives the frequency of the term in the identified column.
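As an illustration of the `column:frequency` list described above, the following is a minimal sketch of how such a list could be parsed. The class name `ColumnFrequencies` and its `parse` method are hypothetical and not part of this library. The sketch also assumes that column identifiers contain no commas; the README does not specify the exact layout of the rest of the index file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the library.
final class ColumnFrequencies {

    /**
     * Parses a list such as "3:12,7:1" into a map from column identifier
     * to the frequency of the term in that column.
     */
    static Map<String, Integer> parse(String pairs) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (pairs == null || pairs.trim().isEmpty()) {
            return result;
        }
        for (String pair : pairs.split(",")) {
            // Split at the last ':' so the frequency is always the final token.
            int pos = pair.lastIndexOf(':');
            if (pos < 0) {
                throw new IllegalArgumentException("Not a column:frequency pair: " + pair);
            }
            String column = pair.substring(0, pos).trim();
            int frequency = Integer.parseInt(pair.substring(pos + 1).trim());
            result.put(column, frequency);
        }
        return result;
    }
}
```

For example, `ColumnFrequencies.parse("3:12,7:1")` yields a map in which column `3` has frequency 12 and column `7` has frequency 1.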

The possible data types that a term can have are:

```
1=INTEGER
2=DECIMAL
3=LONG
4=DATE
5=STRING
```
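Code that reads the term index may want to map these numeric codes to named constants. The enum below is a small sketch of one way to do so; the name `TermType` and the `fromCode` method are hypothetical and are not taken from the library's source.

```java
// Hypothetical enum mirroring the data type codes listed above.
enum TermType {
    INTEGER(1), DECIMAL(2), LONG(3), DATE(4), STRING(5);

    private final int code;

    TermType(int code) {
        this.code = code;
    }

    /** Returns the numeric code used for this type in the term index. */
    int code() {
        return code;
    }

    /** Looks up the type for a numeric code; rejects codes outside 1..5. */
    static TermType fromCode(int code) {
        for (TermType type : values()) {
            if (type.code == code) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown data type code: " + code);
    }
}
```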

Usage:

```
java -jar TermIndexGenerator.jar
: Directory with column files
: Size of the in-memory buffer. Once the buffer is full, intermediate results are written to disk.
: Output file for term index
```

### Compute Pairwise Column Similarity

Use the JAR file `ComputeColumnSimilarity.jar` to compute the pairwise similarity between columns. It requires a term index file generated with `TermIndexGenerator.jar`.

```
java -jar ComputeColumnSimilarity.jar
: The term-index file generated using TermIndexGenerator.jar
: Similarity function [
JI : Jaccard-Index |
WJI-COLSIZE : Weighted Jaccard-Index that uses total number of values in a column as normalization scale |
WJI-COLMAX : Weighted Jaccard-Index that uses the most frequent value in a column as normalization scale
]
: Outputs only column pairs with similarity above the given threshold
: Number of parallel threads to use
: Output file for similarities. The format is tab-delimited:
1) ID of column 1,
2) ID of column 2,
3) similarity
```
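To make the three similarity functions more concrete, here is a rough sketch, not the library's implementation. The Jaccard index is the standard set-based definition. For the weighted variants, the README only states which normalization scale is used, so the sketch assumes a common formulation: each term frequency is divided by the column's scale (total number of values, or frequency of the most frequent value), and the similarity is the sum of the per-term minima divided by the sum of the per-term maxima. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the actual formulas used by the library may differ.
final class ColumnSimilaritySketch {

    /** Jaccard index over term sets: |A intersect B| / |A union B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /**
     * Weighted Jaccard over term frequencies (assumed formulation).
     * useColumnMax = false normalizes by the total number of values,
     * useColumnMax = true by the frequency of the most frequent value.
     */
    static double weightedJaccard(Map<String, Integer> a, Map<String, Integer> b,
                                  boolean useColumnMax) {
        double scaleA = scale(a, useColumnMax);
        double scaleB = scale(b, useColumnMax);
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double minSum = 0.0;
        double maxSum = 0.0;
        for (String term : terms) {
            double wa = a.getOrDefault(term, 0) / scaleA;
            double wb = b.getOrDefault(term, 0) / scaleB;
            minSum += Math.min(wa, wb);
            maxSum += Math.max(wa, wb);
        }
        return maxSum == 0.0 ? 0.0 : minSum / maxSum;
    }

    /** Normalization scale of a column; 1.0 for an empty column to avoid division by zero. */
    private static double scale(Map<String, Integer> column, boolean useColumnMax) {
        int sum = 0;
        int max = 0;
        for (int frequency : column.values()) {
            sum += frequency;
            max = Math.max(max, frequency);
        }
        int s = useColumnMax ? max : sum;
        return s == 0 ? 1.0 : s;
    }
}
```

Under these assumptions, two columns with the same relative term distribution but different sizes receive a weighted similarity of 1.0, which is the motivation for normalizing the frequencies before comparing them.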

### N-Gram Column Generator

To compute column similarity based on n-grams, first transform the database column files into n-gram column files. These files contain only the n-grams of the column values. Then use `TermIndexGenerator.jar` to create a unique n-gram index before computing column similarity.

```
java -jar NGramColumnGenerator.jar
: Directory with database column files
: Size of the generated n-grams
: Add special padding characters ('$' and '#') at beginning and end of each value [true | false]
: Output directory for n-gram column files. The file format is the same as for the input files
```
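The transformation of a single column value into character n-grams can be sketched as follows. This is an illustration only: the class `NGramSketch` is hypothetical, and the padding scheme (n-1 copies of `$` before the value and n-1 copies of `#` after it) is an assumption, since the README names the padding characters but not how many are added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the library's padding scheme may differ.
final class NGramSketch {

    /**
     * Returns all character n-grams of the given value, in order.
     * With padding, n-1 '$' characters are prepended and n-1 '#'
     * characters are appended first (assumed scheme).
     */
    static List<String> ngrams(String value, int n, boolean pad) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        String text = value;
        if (pad) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n - 1; i++) {
                sb.append('$');
            }
            sb.append(value);
            for (int i = 0; i < n - 1; i++) {
                sb.append('#');
            }
            text = sb.toString();
        }
        List<String> result = new ArrayList<>();
        // Slide a window of length n over the (possibly padded) text.
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }
}
```

With this scheme, `ngrams("abc", 2, true)` produces `$a`, `ab`, `bc`, `c#`, while `ngrams("abc", 2, false)` produces only `ab` and `bc`. Without padding, values shorter than n yield no n-grams at all.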
