https://github.com/zgornel/datalinter
A data linter written at the VUB AI Lab
- Host: GitHub
- URL: https://github.com/zgornel/datalinter
- Owner: zgornel
- License: gpl-3.0
- Created: 2024-08-06T17:03:03.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-02-19T13:34:09.000Z (8 months ago)
- Last Synced: 2025-02-19T14:31:41.429Z (8 months ago)
- Language: Julia
- Size: 453 KB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE
README
# DataLinter
A data linter written in Julia at the Vrije Universiteit Brussel.
[CI workflow](https://github.com/zgornel/DataLinter/actions/workflows/ci.yml?query=branch%3Amaster)
[License](LICENSE.md)
[Documentation](https://zgornel.github.io/DataLinter/dev)
## Installation
The recommended way to install `DataLinter` is to pull the Docker image:
```
$ docker pull ghcr.io/zgornel/datalinter-compiled:latest
```
This will download a Docker image with the compiled version of the data linter. For development, one can download the repository and build the Docker image separately if needed.

> Note: Before running the linter, make sure that the Docker container has mapped all the relevant directories. Check out [the Dockerfile](https://github.com/zgornel/DataLinter/blob/master/docker/Dockerfile.datalinter-compiled.alpine) of the image to see what directories are available inside the container (created with the `mkdir -p` commands).
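
To make the note above concrete, here is a minimal sketch that lists the directories a Dockerfile creates with `mkdir -p`. The helper name `list_container_dirs` is hypothetical and not part of DataLinter, and the sketch assumes the directories are written as plain paths on the same line as `mkdir -p`.

```shell
#!/bin/sh
# Hypothetical helper (not part of DataLinter): print the directories that a
# Dockerfile creates with `mkdir -p`, one per line.
# Assumption: the paths appear on the same line as the `mkdir -p` command.
list_container_dirs() {
    # 1. keep only the `mkdir -p ...` fragments (stop at `&`, `;` or `\`)
    # 2. drop the command itself, leaving the paths
    # 3. put each path on its own line and discard empty lines
    grep -o 'mkdir -p [^&;\\]*' "$1" \
        | sed 's/^mkdir -p //' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}
```

From the root of a clone of the repository, it could be called as `list_container_dirs docker/Dockerfile.datalinter-compiled.alpine`. The paths it prints are candidates for the container side of a `--volume` mapping.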
## Running the linter
To perform a sample run on the test dataset, run the following from the root of the repository:
```
$ time docker run -it --rm \
--volume=./test/data:/_data \
--volume=./config:/_config \
ghcr.io/zgornel/datalinter-compiled:latest \
/datalinter/bin/datalinter /_data/data.csv \
--config-path /_config/default.toml \
--log-level warn
```
The output should look something like:
```
┌ Warning: Could not load KB@. Returning empty Dict().
└ @ DataLinter.KnowledgeBaseNative ~/.julia/packages/DataLinter/5mybQ/src/kb.jl:22
• info (tokenizable_string) column: x6 the values of 'column: x6' could be tokenizable i.e. contain spaces
• info (tokenizable_string) column: x8 the values of 'column: x8' could be tokenizable i.e. contain spaces
• info (large_outliers) column: x1 the values of 'column: x1' contain large outliers
! warn (int_as_float) column: x4 the values of 'column: x4' are floating point but can be integers
! warn (enum_detector) column: x5 just a few distinct values in 'column: x5', it could be an enum
! warn (enum_detector) column: x8 just a few distinct values in 'column: x8', it could be an enum
! warn (enum_detector) column: x4 just a few distinct values in 'column: x4', it could be an enum
! warn (empty_example) row: 10 the example at 'row: 10' looks empty
! warn (empty_example) row: 11 the example at 'row: 11' looks empty
! warn (uncommon_signs) column: x1 uncommon signs (+/-/NaN/0) present in 'column: x1'
! warn (long_tailed_distrib) column: x1 the distribution for 'column: x1' has 'long tails'
11 issues found from 14 linters applied (13 OK, 1 N/A) .
docker run -it --rm --volume=./test/data:/_data --volume=./config:/_config 0.02s user 0.01s system 0% cpu 4.197 total
```
### Using the script
The linter can be run quickly through the `datalinter.sh` shell script. To run it on the test dataset, one can do:
```
$ ./datalinter.sh ./test/data/data.csv
```
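The report can also be post-processed with standard tools. The sketch below counts the lines tagged `warn`, for example to decide whether a CI job should fail. It assumes the script writes the same report format as the sample output shown earlier to standard output; the function name `count_warnings` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the "warn" lines of a DataLinter report on stdin.
# Assumption: warnings are printed as "! warn (<linter>) ...", as in the
# sample output of the Docker run.
count_warnings() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so `|| true` keeps callers running under `set -e`.
    grep -c '! warn (' || true
}
```

A possible use is `./datalinter.sh ./test/data/data.csv | count_warnings`. Applied to the sample report shown earlier, the count would be 8, since the remaining 3 of the 11 issues are tagged `info`.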
The script can be run from any directory and accepts a single argument, the dataset that is to be linted.
## License
This code is released under a GPL license (GPL-3.0) and is therefore free software.
## Reporting Bugs
Please [file an issue](https://github.com/zgornel/DataLinter/issues/new) to report a bug or request a feature.
## References
[1] https://en.wikipedia.org/wiki/Lint_(software)
[2] A [data linter](https://github.com/brain-research/data-linter) written by Google
## Acknowledgements
The initial version of DataLinter was fully inspired by [this work](https://github.com/brain-research/data-linter) from Google Brain research.