Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/warfox/dqt

Data Quality Tool
https://github.com/warfox/dqt

clojure data-quality data-reliability

Last synced: about 2 months ago
JSON representation

Data Quality Tool

Awesome Lists containing this project

README

        

#+title: dqt - Data Quality Tool

A simple data quality tool. Collect and publish metrics about quality of data anywhere.

* Docs

1. [[./docs/dimensions.org][Dimensions of data quality]]

* Features [2/12]

- [X] get metrics
- [X] run tests
- [ ] publish metrics to aws
- [ ] publish metrics to prometheus
- [ ] publish metadata to DataHub
- [ ] build dashboards
- [ ] alets based on CloudWatch
- [ ] other cloud providers?
- [ ] multiple data sources
- [ ] example dag with airflow
- [ ] example with prefect
- [ ] re-conciliation between two data sources, % missing, matching columns vs mismatch

* Installation

Download from https://github.com/warfox/dqt

* Usage

`dqt` is a command line tool that runs on JVM.

Make sure you have the jdbc drivers in classpath.

#+begin_src
java -jar dqt.jar run -d datasource.edn -t table.edn
#+end_src

#+begin_src
java -cp "/path/to/jdbc/driver/jar/:./dqt.jar" dqt.core run -d examples/postgres.edn -t examples/tables/employees.edn
#+end_src

** datasource.edn

** table.edn

* Development

Run the project directly, via `:main-opts` (`-m dqt.core`):

#+begin_src
$ clojure -M:run
#+end_src

Run the project, with parameters

#+begin_src
$ clojure -M:run -d datasource.edn -t table.edn
#+end_src

Run the project's tests (they'll fail until you edit them):

#+begin_src
$ clojure -T:build test
#+end_src

#+begin_src
$ ./bin/kaocha
#+end_src

Build uberjar

#+begin_src
$ clojure -T:build uberjar
#+end_src

This will produce an updated =pom.xml= file with synchronized dependencies inside the =META-INF=
directory inside =target/classes= and the uberjar in =target=. You can update the version (and SCM tag)
information in generated =pom.xml= by updating =build.clj=.

If you don't want the =pom.xml= file in your project, you can remove it. The =ci= task will
still generate a minimal =pom.xml= as part of the =uber= task, unless you remove =version=
from =build.clj=.

Run that uberjar:

#+begin_src
$ java -jar target/dqt-0.1.0-SNAPSHOT.jar
#+end_src

If you remove =version= from =build.clj=, the uberjar will become =target/dqt-standalone.jar=.

* Options

FIXME: listing of options this app accepts.

* Examples

** datasource file datasource.edn
#+begin_src clojure
{:dbtype "postgresql"
:dbname "postgres"
:host #or [#env DATABASE_HOSTNAME "localhost"]
:user "postgres"
:password "postgres"
:ssl false
:classname "org.postgres.Driver"
:sslfactory "org.postgresql.ssl.NonValidatingFactory"}
#+end_src

** table file employees.edn

#+begin_src clojure
{:table-name :employees
:metrics [:row-count
:avg-length
:max-length
:min-length
:avg
:sum
:max
:min
:stddev
:variance]

:tests [[:row-count > 10]
[:avg-length-phone-number < 13]
[:stddev-salary > 4500]
[:sum-salary > 20000]
[:max-length-email < 30]]}
#+end_src

** Run

#+begin_src shell
clj -M:dev:run run -d datasource.edn -t tables/employees.edn
#+end_src

#+begin_src shell
bb run-example
#+end_src

* Development

** Run development mode with babashka

#+begin_src shell
bb dev
#+end_src

** Test database

Run =docker compose up= to have postgress running

** Run migraion

#+begin_src shell
bb migrate
#+end_src

** Run test

#+begin_src shell
clj -M:dev:test
clj -M:dev:test --watch
bb test
bb test:watch
#+end_src

#+begin_src
$ bin/koacha
#+end_src

#+begin_src
$ bin/koacha --watch
#+end_src

* References

- https://www.sweettooth.dev/endpoint/dev/architecture/integrant-tutorial.html

* License

Copyright © 2021 Warfox

Distributed under the MIT License.