Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/simonw/csv-diff
Python CLI tool and library for diffing CSV and JSON files
https://github.com/simonw/csv-diff
click csv csv-diff datasette-io datasette-tool diff git-scraping tsv-diff
Last synced: 22 days ago
JSON representation
Python CLI tool and library for diffing CSV and JSON files
- Host: GitHub
- URL: https://github.com/simonw/csv-diff
- Owner: simonw
- License: apache-2.0
- Created: 2019-03-13T01:11:26.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2024-09-06T05:20:04.000Z (2 months ago)
- Last Synced: 2024-10-06T20:31:29.410Z (about 1 month ago)
- Topics: click, csv, csv-diff, datasette-io, datasette-tool, diff, git-scraping, tsv-diff
- Language: Python
- Homepage:
- Size: 37.1 KB
- Stars: 292
- Watchers: 9
- Forks: 47
- Open Issues: 24
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-starred - simonw/csv-diff - Python CLI tool and library for diffing CSV and JSON files (others)
README
# csv-diff
[![PyPI](https://img.shields.io/pypi/v/csv-diff.svg)](https://pypi.org/project/csv-diff/)
[![Changelog](https://img.shields.io/github/v/release/simonw/csv-diff?include_prereleases&label=changelog)](https://github.com/simonw/csv-diff/releases)
[![Tests](https://github.com/simonw/csv-diff/workflows/Test/badge.svg)](https://github.com/simonw/csv-diff/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/csv-diff/blob/main/LICENSE)Tool for viewing the difference between two CSV, TSV or JSON files. See [Generating a commit log for San Francisco’s official list of trees](https://simonwillison.net/2019/Mar/13/tree-history/) (and the [sf-tree-history repo commit log](https://github.com/simonw/sf-tree-history/commits)) for background information on this project.
## Installation
pip install csv-diff
## Usage
Consider two CSV files:
`one.csv`
id,name,age
1,Cleo,4
2,Pancakes,2`two.csv`
id,name,age
1,Cleo,5
3,Bailey,1`csv-diff` can show a human-readable summary of differences between the files:
$ csv-diff one.csv two.csv --key=id
1 row changed, 1 row added, 1 row removed1 row changed
Row 1
age: "4" => "5"1 row added
id: 3
name: Bailey
age: 11 row removed
id: 2
name: Pancakes
age: 2The `--key=id` option means that the `id` column should be treated as the unique key, to identify which records have changed.
The tool will automatically detect if your files are comma- or tab-separated. You can over-ride this automatic detection and force the tool to use a specific format using `--format=tsv` or `--format=csv`.
You can also feed it JSON files, provided they are a JSON array of objects where each object has the same keys. Use `--format=json` if your input files are JSON.
Use `--show-unchanged` to include full details of the unchanged values for rows with at least one change in the diff output:
% csv-diff one.csv two.csv --key=id --show-unchanged
1 row changedid: 1
age: "4" => "5"Unchanged:
name: "Cleo"### JSON output
You can use the `--json` option to get a machine-readable difference:
$ csv-diff one.csv two.csv --key=id --json
{
"added": [
{
"id": "3",
"name": "Bailey",
"age": "1"
}
],
"removed": [
{
"id": "2",
"name": "Pancakes",
"age": "2"
}
],
"changed": [
{
"key": "1",
"changes": {
"age": [
"4",
"5"
]
}
}
],
"columns_added": [],
"columns_removed": []
}### Adding templated extras
You can specify additional keys to be displayed in the human-readable format using the `--extra` option:
--extra name "Python format string with {id} for variables"
For example, to output a link to `https://news.ycombinator.com/latest?id={id}` for each item with an ID, you could use this:
```bash
csv-diff one.csv two.csv --key=id \
--extra latest "https://news.ycombinator.com/latest?id={id}"
```
These extras display something like this:
```
1 row changedid: 41459472
points: "24" => "25"
numComments: "5" => "6"
extras:
latest: https://news.ycombinator.com/latest?id=41459472
```## As a Python library
You can also import the Python library into your own code like so:
from csv_diff import load_csv, compare
diff = compare(
load_csv(open("one.csv"), key="id"),
load_csv(open("two.csv"), key="id")
)`diff` will now contain the same data structure as the output in the `--json` example above.
If the columns in the CSV have changed, those added or removed columns will be ignored when calculating changes made to specific rows.
## As a Docker container
### Build the image
$ docker build -t csvdiff .
### Run the container
$ docker run --rm -v $(pwd):/files csvdiff
Suppose current directory contains two csv files : one.csv two.csv
$ docker run --rm -v $(pwd):/files csvdiff one.csv two.csv
## Alternatives- [csvdiff](https://github.com/aswinkarthik/csvdiff) is a "fast diff tool for comparing CSV files" - you may get better results from this than from `csv-diff` against larger files.