# couchimport
## Introduction
When populating CouchDB databases, the source data is often a file of JSON documents, or structured CSV/TSV data exported from another database.
*couchimport* is designed to assist with importing such data into CouchDB efficiently. Simply pipe a file full of JSON documents into *couchimport*, telling it the URL and database to send the data to.
> Note: `couchimport` used to handle the CSV-to-JSON conversion itself, but this part is now handled by [csvtojsonlines](https://www.npmjs.com/package/csvtojsonlines), keeping this package smaller and easier to maintain. The `couchimport@1.x` releases are the last to support CSV/TSV natively - from 2.0 onwards, `couchimport` is only for pouring JSONL files into CouchDB.
> Also note: the companion CSV export utility (couchexport) is now hosted at [couchcsvexport](https://www.npmjs.com/package/couchcsvexport).
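For reference, "one JSON document per line" (JSONL) means each line of the input file is a complete, self-contained JSON object. A tiny illustrative example (the field names are made up):

```
{"_id":"doc1","name":"Alice","dob":"1980-01-01"}
{"_id":"doc2","name":"Bob","dob":"1975-06-15"}
```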
## Installation
Install using npm or another Node.js package manager:
```sh
npm install -g couchimport
```
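To check the install worked, show the help text (a quick smoke test using the documented `--help` flag):

```sh
couchimport --help
```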
## Usage

_couchimport_ can either read JSON docs (one per line) from _stdin_ e.g.
```sh
cat myfile.json | couchimport
```

or from a file, by passing a filename as the last parameter:
```sh
couchimport myfile.json
```

*couchimport*'s configuration parameters can be stored in environment variables or supplied as command-line arguments.
## Configuration - environment variables
Set the `COUCH_URL` environment variable, e.g. for a hosted Cloudant database:
```sh
# substitute your own account details
export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"
```

and define the name of the CouchDB database to write to by setting the `COUCH_DATABASE` environment variable e.g.
```sh
export COUCH_DATABASE="mydatabase"
```

Then pipe the text data into `couchimport`:
```sh
cat mydata.jsonl | couchimport
```

## Configuration - command-line options
Supply the `--url` and `--database` values as command-line parameters instead:
```sh
couchimport --url "http://user:password@localhost:5984" --database "mydata" mydata.jsonl
```

or by piping data into _stdin_:
```sh
cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --database "mydata"
```

## Handling CSV/TSV data
We can use another package [csvtojsonlines](https://www.npmjs.com/package/csvtojsonlines) to convert CSV/TSV files into a JSONL stream acceptable to `couchimport`:
```sh
# CSV file ----> JSON lines ---> CouchDB
cat transactions.csv | csvtojsonlines --delimiter ',' | couchimport --db ledger
```

## Generating random data
_couchimport_ can be paired with [datamaker](https://www.npmjs.com/package/datamaker) to generate any amount of sample data:
```sh
# template ---> datamaker ---> 100 JSON docs ---> couchimport ---> CouchDB
echo '{"_id":"{{uuid}}","name":"{{name}}","email":"{{email true}}","dob":"{{date 1950-01-01}}"}' | datamaker -f json -i 100 | couchimport --db people
written {"docCount":100,"successCount":1,"failCount":0,"statusCodes":{"201":1}}
written {"batch":1,"batchSize":100,"docSuccessCount":100,"docFailCount":0,"statusCodes":{"201":1},"errors":{}}
Import complete
```

or with the template in a file:
```sh
cat template.json | datamaker -f json -i 10000 | couchimport --db people
```

## Understanding errors
We know that if we get an HTTP 4xx/5xx response, all of the documents in that request failed to be written to the database. But because _couchimport_ writes data in bulk, a bulk request can get an HTTP 201 response even when not _all_ of the documents were written - some of the document ids may have been in the database already. The _couchimport_ output therefore includes counts of the number of documents that were written successfully and the number that failed, plus a tally of the HTTP response codes and individual document error messages, e.g.
```js
written {"batch":10,"batchSize":1,"docSuccessCount":4,"docFailCount":6,"statusCodes":{"201":10},"errors":{"conflict":6}}
```

The above log line shows that after the tenth batch of writes, we have written 4 documents successfully and failed to write 6 others. There were six "conflict" errors, meaning that there was a clash of document id, or id/rev combination.
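A common way to encounter conflicts is to import the same file twice. A hypothetical run, assuming the documents carry explicit `_id` fields:

```sh
# first import: all ids are new, so every document is written
cat mydata.jsonl | couchimport --db mydata
# second import of the same file: every _id already exists,
# so every write fails with a "conflict" error
cat mydata.jsonl | couchimport --db mydata
```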
## Parallel writes
Older versions of _couchimport_ could have multiple HTTP requests in flight at any one time, but the new, simplified _couchimport_ does not. To achieve the same effect, split your file of JSON docs into smaller pieces and run multiple _couchimport_ jobs:
```sh
# split large file into files 1m lines each
# this will create files xaa, xab, xac etc
split -l 1000000 massive.txt
# find all files starting with x and, using xargs,
# spawn a max of 2 processes at once running couchimport,
# one for each file
find . -name "x*" | xargs -t -I % -P 2 couchimport --db test %
```

## Environment variables reference

* `COUCH_URL` - the URL of the CouchDB instance (required, or to be supplied on the command line)
* `COUCH_DATABASE` - the database to deal with (required, or to be supplied on the command line)
* `COUCH_BUFFER_SIZE` - the number of records written to CouchDB per bulk write (defaults to 500, not required - see the example after this list)
* `IAM_API_KEY` - to authenticate using IBM IAM, set `IAM_API_KEY` to your API key and a bearer token will be used in the HTTP requests
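For example, to trade memory for fewer HTTP round trips by using larger bulk writes (a sketch - the best buffer size depends on your document size and server):

```sh
export COUCH_URL="http://user:password@localhost:5984"
export COUCH_DATABASE="mydata"
# write 2000 documents per bulk request instead of the default 500
export COUCH_BUFFER_SIZE="2000"
cat mydata.jsonl | couchimport
```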
## Command-line parameters reference

You can also configure `couchimport` using command-line parameters:
* `--help` - show help
* `--url`/`-u` - the URL of the CouchDB instance (required, or to be supplied in the environment)
* `--database`/`--db`/`-d` - the database to deal with (required, or to be supplied in the environment)
* `--buffer`/`-b` - the number of records written to CouchDB per bulk write (defaults to 500, not required - see the example below)
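The same buffer tuning can be done on the command line (a sketch, using the documented flags):

```sh
cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --db "mydata" --buffer 2000
```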
## Using programmatically

In your project, add `couchimport` to the dependencies in your package.json or run `npm install --save couchimport`. In your code, require the library with:
```js
const couchimport = require('couchimport')
```

Your options are set in an object whose keys are the same as the command-line parameters, e.g.
```js
const fs = require('fs')
const opts = { url: "http://localhost:5984", database: "mydb", rs: fs.createReadStream('myfile.json') }
await couchimport(opts)
```

> Note: `rs` is the read stream from which data will be read (default: `stdin`) and `ws` is the write stream where output will be written (default: `stdout`).
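A fuller sketch that reads from a file and captures the progress output in a log file, assuming a local CouchDB; the file names are illustrative:

```js
const fs = require('fs')
const couchimport = require('couchimport')

const main = async () => {
  await couchimport({
    url: 'http://user:password@localhost:5984',
    database: 'mydb',
    rs: fs.createReadStream('myfile.jsonl'), // read JSONL from a file instead of stdin
    ws: fs.createWriteStream('import.log')   // send progress output to a file instead of stdout
  })
}

main()
```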