https://github.com/glynnbird/docstream
Node.js utility to turn CouchDB's _all_docs stream in to stream of plain documents
- Host: GitHub
- URL: https://github.com/glynnbird/docstream
- Owner: glynnbird
- Created: 2014-04-25T09:37:01.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-05-23T08:43:05.000Z (over 10 years ago)
- Last Synced: 2024-04-26T01:19:46.023Z (7 months ago)
- Language: JavaScript
- Size: 148 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# DocStream
## Installation
This utility requires Node.js and can be installed with npm:
```
sudo npm install -g docstream
```

After that, the "docstream" command should be available at the command line.
## Introduction
CouchDB's "_all_docs" endpoint returns all the documents in a database in the form shown below, e.g. for a database named "mydb" (a placeholder):

http://mycouchdbserver/mydb/_all_docs?include_docs=true
```
{
  "total_rows": 93985131,
  "offset": 0,
  "rows": [
    {
      "id": "0000230a35e724e12b8c18a8f700065d",
      "key": "0000230a35e724e12b8c18a8f700065d",
      "value": {
        "rev": "1-adf8311047fcdd953543118e7d501fa1"
      },
      "doc": {
        "_id": "0000230a35e724e12b8c18a8f700065d",
        "_rev": "1-adf8311047fcdd953543118e7d501fa1",
        "a": "1",
        "b": "2",
        "c": "3"
      }
    },
    {
      "id": "0000230a35e724e12b8c18a8f7000ccd",
      "key": "0000230a35e724e12b8c18a8f7000ccd",
      "value": {
        "rev": "1-5ce610ff79bc1cfe62b4a1a68e5b09cf"
      },
      "doc": {
        "_id": "0000230a35e724e12b8c18a8f7000ccd",
        "_rev": "1-5ce610ff79bc1cfe62b4a1a68e5b09cf",
        "a": "2",
        "b": "5",
        "c": "6"
      }
    }
  ]
}
```

Notice that the documents themselves are nested: each one sits in the "doc" property of an object inside the "rows" array. In practice, CouchDB emits the data like this:
```
{"total_rows":93985131,"offset":0,"rows":[
{"id":"0000230a35e724e12b8c18a8f700065d","key":"0000230a35e724e12b8c18a8f700065d","value":{"rev":"1-adf8311047fcdd953543118e7d501fa1"},"doc":{"_id":"0000230a35e724e12b8c18a8f700065d","_rev":"1-adf8311047fcdd953543118e7d501fa1","a":"1","b":"2","c":"3"}},
{"id":"0000230a35e724e12b8c18a8f7000ccd","key":"0000230a35e724e12b8c18a8f7000ccd","value":{"rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf"},"doc":{"_id":"0000230a35e724e12b8c18a8f7000ccd","_rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf","a":"2","b":"5","c":"6"}}
]}
```

with each row object on its own line.

If you want to export the data and load it into Redshift, for example, the JSON needs to be in this form:
```
{"_id":"0000230a35e724e12b8c18a8f700065d","_rev":"1-adf8311047fcdd953543118e7d501fa1","a":"1","b":"2","c":"3"}
{"_id":"0000230a35e724e12b8c18a8f7000ccd","_rev":"1-5ce610ff79bc1cfe62b4a1a68e5b09cf","a":"2","b":"5","c":"6"}
```

### Solution 1 - jq
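The jq utility parses and reformats JSON on the command line. For example, the following pipeline drops the first and last lines, strips the trailing comma from each row, and prints only the "doc" object: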
```
curl 'http://mycouchdbserver/mydb/_all_docs?include_docs=true' | tail -n +2 | head -n -1 | sed 's/,\s*$//' | jq -c '.doc'
```

(Thanks to @fugu13 for this solution.)
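To make the stages of that pipeline explicit, here is a rough JavaScript equivalent that works on the whole response held in memory. It is only an illustration of what the shell commands do, not code from this repository; the function name `allDocsToLines` is hypothetical, and it assumes the request was made with include_docs=true so that every row has a "doc" property.

```javascript
'use strict';

// Hypothetical helper: convert the text of an _all_docs response
// (one row per line) into an array of compact JSON strings, one per doc.
function allDocsToLines(text) {
  return text
    .split('\n')
    // Ignore blank lines, such as the one produced by a trailing newline.
    .filter((line) => line.trim() !== '')
    // Drop the header and footer lines, like `tail -n +2 | head -n -1`.
    .slice(1, -1)
    // Strip the comma that separates rows, like the sed expression.
    .map((line) => line.replace(/,\s*$/, ''))
    // Keep only the "doc" object, like jq's `.doc` filter, printed compactly.
    .map((line) => JSON.stringify(JSON.parse(line).doc));
}
```

Because this version loads the entire response before processing it, it only suits small result sets; the shell pipeline processes the data as a stream.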
### Solution 2 - Use this docstream.js utility
DocStream reads _all_docs data on stdin and outputs just the "doc" section of each row:
```
curl 'http://mycouchdbserver/mydb/_all_docs?include_docs=true' | docstream
```

This should work with any size of data set, as long as each row appears on its own line.
For example, to convert a saved _all_docs response and compress the result:
```
cat sample.txt | docstream | gzip > output.txt.gz
```
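
The one-row-per-line requirement is what lets a filter like this handle input of any size: only the current line needs to be held in memory. The following is a minimal sketch of that idea, not the docstream source. The name `createDocFilter` and its `emit` callback are hypothetical, and the sketch assumes UTF-8 text input in which each row sits on its own line.

```javascript
'use strict';

// Hypothetical sketch of a line-oriented filter. `emit` is called with one
// compact JSON string for each document found in the input.
function createDocFilter(emit) {
  let pending = ''; // text received after the last newline seen so far

  function handleLine(line) {
    // Strip surrounding whitespace and the comma that separates rows.
    const trimmed = line.trim().replace(/,$/, '');
    // Skip the header and footer lines, which are not complete objects.
    if (!trimmed.startsWith('{') || !trimmed.endsWith('}')) return;
    let row;
    try {
      row = JSON.parse(trimmed);
    } catch (err) {
      return; // not a parseable row, so ignore it
    }
    if (row && row.doc) emit(JSON.stringify(row.doc));
  }

  return {
    // Accept a chunk of text that may start or end in the middle of a line.
    write(chunk) {
      const lines = (pending + chunk).split('\n');
      pending = lines.pop(); // keep the incomplete last line for later
      lines.forEach(handleLine);
    },
    // Process whatever is left once the input ends.
    end() {
      if (pending) handleLine(pending);
      pending = '';
    },
  };
}

// Example wiring to stdin and stdout, left commented out so that loading
// this sketch has no side effects:
// const filter = createDocFilter((doc) => process.stdout.write(doc + '\n'));
// process.stdin.setEncoding('utf8');
// process.stdin.on('data', (chunk) => filter.write(chunk));
// process.stdin.on('end', () => filter.end());
```

The sketch ignores backpressure on the output for brevity, so a production tool would more likely be built on Node's stream APIs.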