https://github.com/okdistribute/unique-columns

Find the unique columns in a tabular dataset.
https://github.com/okdistribute/unique-columns

Last synced: 5 months ago
JSON representation

Find the unique columns in a tabular dataset.

Host: GitHub
URL: https://github.com/okdistribute/unique-columns
Owner: okdistribute
Created: 2015-08-07T01:38:27.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2016-01-13T09:19:41.000Z (over 9 years ago)
Last Synced: 2024-05-01T19:29:33.831Z (about 1 year ago)
Language: JavaScript
Homepage:
Size: 9.77 KB
Stars: 12
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        unique-columns

----------------

 [![Travis](http://img.shields.io/travis/karissa/unique-columns.svg?style=flat)](https://travis-ci.org/karissa/unique-columns)

Get the unique columns in a dataset. Useful if you need to quickly see what columns in a table are unique. `unique-columns` is **streaming**, too, which means that running this on a very large dataset won't kill your RAM because the table is only loaded into memory one row at a time. Cool!

```

npm install -g unique-columns

```

## Example

Imagine, your worst nightmare. You are given a dataset that contains two different columns that could possibly be the unique identifier. But you aren't sure.

In this case, two of these columns could be unique, 'username' and 'id', but how will you know for sure?

Run `unique-columns people.csv`, of course!

people.csv

```

name, age, id, username, description

angela, 29, 1, ange___1, "just a person here"

bob, 34, 2, iambob, "innovator in technology"

bob, 55, 3, bob2234, "Dad, Father, Do-gooder"

samantha, 34, 5, sam_does, "making a living, being a person"

laura, 34, 5, hanna_boss, "continuing to be a badass"

```

```

$ unique-columns people.csv

uniques:

   username

   description

duplicates:

   age: 3

   id: 2

```

We see that `id` and `age` contain duplicate values, while `username` and `description` are unique.

## Usage

unique-columns will try to guess the format, but you can supply it if you really want to.

```

$ unique-columns  [--format=csv/ndjson]

OR

$ cat  | unique-columns -

```

## Options

`--format/-f` : the format in which to expect data. tries to guess if not supplied.

`--json`: output machine-readable format in json

## JavaScript usage

An incoming jsonStream gets turned into a dictionary of the duplicates:

```

  {

    "name": 2,

    "age": 3,

    "id": 0,

    "description": 0,

    "username": 0,

  }

```

Use it with any parsed json input stream:

```js

var parseInputStream = require('parse-input-stream')

var dupes = require('unique-columns')

var args = {"format": "csv"}

var jsonStream = process.stdin.pipe(parseInputStream(args))

dupes(jsonStream, function done (err, duplicates) {

  console.log(duplicates)

})

```

## TODO

1. Give a limit for the number of rows to peek on before stopping

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/okdistribute/unique-columns

Awesome Lists containing this project

README