# README

Predator (Profiler and Auditor) is a tool that provides statistical descriptions and data quality checking of downstream data.

Predator consists of two components:
* Profile : collects basic metrics of tables and columns and calculates data quality metrics.
* Audit : compares the data quality metrics against tolerance rules.


### Requirements

* Go v1.18
* Postgres Instance

  ```
  docker run -d -p 127.0.0.1:5432:5432/tcp --name predator-abcd -e POSTGRES_PASSWORD=secretpassword -e POSTGRES_DB=predator -e POSTGRES_USER=predator postgres
  ```
* Tolerance Store

  * Local directory

    To produce metrics with Profile and check issues with Audit, a tolerance specification is needed. Each `.yaml` file in the local directory represents the tolerance specification for a BigQuery table. This option can be used for local testing.
    This store is used by setting a local directory as `TOLERANCE_STORE_URL`:

    ```
    example/tolerance
    ```

  * Google Cloud Storage

    A Google Cloud Storage bucket is the preferred file-based tolerance spec store for a Predator service, especially when combined with a git repository so multiple users can collaborate on the spec files.

    Please read the doc on creating a GCS bucket [here](https://cloud.google.com/storage/docs/creating-buckets). The bucket can then be used as the tolerance storage configuration in `TOLERANCE_STORE_URL`:

    ```
    gs://your-bucket/audit-spec
    ```


* Unique Constraint Store (optional)

  A single CSV file listing the unique constraint column for each resource, used to calculate the unique count and duplication percentage metrics. This is an alternative for when the unique constraint column is not specified in the tolerance specification of each table. Please see the documentation below for details of the CSV content format.

* Publisher

  Predator publishes profile and audit data for real-time data/event processing.

  * Apache Kafka
      * Download Apache Kafka: https://kafka.apache.org/quickstart
      * Start ZooKeeper: `bin/zookeeper-server-start.sh config/zookeeper.properties`
      * Start Kafka: `bin/kafka-server-start.sh config/server.properties`
      * Create Kafka topics for profile and audit:
          * `bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic profile`
          * `bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic audit`

  * Console

    If the Kafka broker and topic configuration are empty, Predator publishes the data to the terminal/console.
    This type of publisher is intended for local testing purposes.

* Google Cloud credentials

  Google Cloud credentials are needed for Predator to access the BigQuery API.

  * Google Cloud personal account credentials

    With these credentials we can use our own Google Workspace email to access Google Cloud APIs, including the BigQuery API.
    They are the most suitable for local testing/exploration purposes.

  * Application Default Credentials

    This type of Google Cloud credential is needed to deploy Predator as a service, especially in a non-local environment.

    * Create Google Cloud application credentials.
      Please read this [doc](https://cloud.google.com/docs/authentication/production) on creating Application Default Credentials (ADC).

    * Set the local environment variable:
      ```
      GOOGLE_APPLICATION_CREDENTIALS=/path/key.json
      ```

### How to Build
* `make`

### How to Test
`make test`

### How to run the predator service

#### Create a .env file

1. Copy `conf/.env.template` to a new `.env` file
2. Put the `.env` file in the root of the repository
3. Set the env variables

    Example config:
    ```
    PORT=

    DB_HOST=localhost
    DB_PORT=5432
    DB_NAME=predator
    DB_USER=predator
    DB_PASS=secretpassword

    BIGQUERY_PROJECT_ID=sample-project

    PROFILE_KAFKA_TOPIC=profile
    AUDIT_KAFKA_TOPIC=audit
    KAFKA_BROKER=localhost:6668

    TOLERANCE_STORE_URL=example/tolerance

    UNIQUE_CONSTRAINT_STORE_URL=example/uniqueconstraints.csv
    MULTI_TENANCY_ENABLED=true
    GIT_AUTH_PRIVATE_KEY_PATH=~/.ssh/private.key
    TZ=UTC
    ```

#### Setup DB
`./predator migrate -e .env` runs the DB migration.

Note: if any changes are made to the migration files, re-run the command below to regenerate the migration resources.

`make generate-db-resource`

#### How to Run
`./predator start -e .env`

#### How to do Profile and Audit using API Call
Before beginning, decide on the profiling details below.
  * URN
    The target table ID.
  * Filter (optional)
    A filter expression in SQL syntax. This expression is applied in the WHERE clause of the profiling query.
    For example: `__PARTITION__ = '2021-01-01'`.
  * Group (optional)
    The field the result should be grouped by. Can be any field, or `__PARTITION__`.
  * Mode
    The profiling mode determines how the result will be visualized: `complete` presents the results as an
    independent data result, `incremental` presents them as part of other results in the same group.
  * Audit time
    The timestamp of when the audit happened.

1. Create a profile job: `POST /v1beta1/profile`. Please include the profiling details as the payload.
2. Wait until `status` becomes `completed`

    Call `GET /v1beta1/profile/{profile_id}` periodically until `status` becomes `completed`

3. Audit the profiled data: `POST /v1beta1/profile/{profile_id}/audit`


#### How to do Profile and Audit using CLI
First, build by running `make build`

* To profile and audit:
  `profile_audit -s {server} -u {urn} -f {filter} -g {group} -m {mode} -a {audit_time}`

* To only profile:
  `profile -s {server} -u {urn} -f {filter} -g {group} -m {mode} -a {audit_time}`

Usage example:
```shell
predator profile_audit \
-s http://sample-predator-server \
-u sample-project.sample_dataset.sample_table \
-g "date(sample_timestamp_field)" \
-f "date(sample_timestamp_field) in (\"2020-12-02\",\"2020-12-01\",\"2020-11-30\")" \
-m complete \
-a "2020-12-02T07:00:00.000Z"
```

Usage example using Docker:
```shell
docker run --rm -e SUB_COMMAND=profile_audit \
-e PREDATOR_URL=http://sample-predator-server \
-e URN=sample-project.sample_dataset.sample_table \
-e GROUP="date(sample_timestamp_field)" \
-e FILTER="__PARTITION__ = \"2020-11-01\"" \
-e MODE=complete \
-e AUDIT_TIME="2020-12-02T07:00:00.000Z" \
predator:latest
```

### Local Testing Guide

#### Dependencies

When doing local testing, some external dependencies can be replaced with local files and folders. Here are the steps to
set up the configuration and run Predator for local testing.

* Tolerance Rules Configuration
  Use the yaml files in `example/tolerance`.

* Publisher
  For local testing, Apache Kafka is not required. The protobuf-serialised messages will be shown in the console log.


#### How to do local testing

* check out the predator repository
* go to the predator repository directory
* build the predator binary by running `make build`
* create a .env file
* set up the postgres database; please follow the `Requirements` section for a quick setup of a postgres DB.
  Make sure to also run the DB migration: `./predator migrate -e .env`
* run the predator service: `./predator start -e .env`
* prepare the tolerance spec file
* create a Profile job using an API call
    ```shell script
        curl --location --request POST 'http://localhost:5000/v1beta1/profile' \
        --header 'Content-Type: application/json' \
        --data-raw '{
            "urn": "sample-project.sample_dataset.sample_table",
            "filter": "__PARTITION__ = '\''2020-03-01'\''",
            "group": "__PARTITION__",
            "mode": "complete"
        }'
    ```
* API call to get the Profile job status & result; poll until the status becomes `completed`
    ```shell script
    curl --location --request GET 'http://localhost:5000/v1beta1/profile/${profile_id}'
    ```
* API call to audit and get the result
    ```shell script
    curl --location --request POST 'http://localhost:5000/v1beta1/profile/${profile_id}/audit'
    ```

## Register Entity (optional)
Predator provides an upload-tolerance-spec feature for better collaboration among users (via git) and within a multi-entity
environment.
Each entity can be registered with its own git URL; at upload time Predator clones the
git repository to find the tolerance specs and uploads them to the destination storage, to be used when profiling & auditing.

* register an entity
    ```shell script
    curl --location --request POST 'http://localhost:5000/v1/entity/entity-1' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "entity_name": "sample-entity-1",
        "git_url": "git@sample-url:sample-entity-1.git",
        "environment" : "sample-env",
        "gcloud_project_ids": [
            "entity-1-project-1"
        ]
    }'
    ```


## Data Quality Spec

### Specifying Data Quality Spec

  ```
  tableid: "sample-project.sample_dataset.sample_table"

  tablemetrics:
  - metricname: "duplication_pct"
    tolerance:
      less_than_eq: 0
    metadata:
      uniquefields:
      - field_1

  fields:
  - fieldid: "field_1"
    fieldmetrics:
    - metricname: "nullness_pct"
      tolerance:
        less_than_eq: 10.0
  ```

  * Tolerance Rules
    * `less_than_eq`
    * `less_than`
    * `more_than_eq`
    * `more_than`

  * Available data quality metrics
    * `duplication_pct` (needs `uniquefields` metadata)
    * `nullness_pct`
    * `trend_inconsistency_pct`
    * `row_count`

### Data Quality Spec storage
  * Using Google Cloud Storage as the file store
    * Decide on the GCS bucket and base path

      For example, if `gs://our-bucket` is our GCS bucket, we can add an `audit-spec` folder.
      So our base path becomes `gs://our-bucket/audit-spec`

    * Save the spec to a file named using the `<gcp-project-id>.<dataset>.<tablename>.yaml` format, for example: `sample-project.sample_dataset.sample_table.yaml`

    * Upload the file to this path: `gs://our-bucket/audit-spec/sample-project.sample_dataset.sample_table.yaml`
    * Put other specs in the same folder/base path

  * Using a local directory as the file store

    * Create a local directory, for example `/Users/username/Documents/predator/tolerance`

    * Save the spec to a file named using the `<gcp-project-id>.<dataset>.<tablename>.yaml` format, for example: `sample-project.sample_dataset.sample_table.yaml`

    * Move the file into the created directory, so the file location will be `/Users/username/Documents/predator/tolerance/sample-project.sample_dataset.sample_table.yaml`
    * Put more spec files in the directory as needed



### Upload Data Quality Spec
There are multiple ways to upload a data quality spec to Predator storage; one of them is the `POST /v1beta1/spec/upload` API.
The Predator CLI provides the same functionality.
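The spec file naming convention and the tolerance rules described above can be sketched as follows. This is an illustrative sketch only, not Predator's implementation (Predator itself is written in Go); the helper names `spec_file_name` and `passes` are hypothetical and not part of Predator's API.

```python
# Illustrative sketch only: spec file naming and tolerance-rule checks
# as described in the "Data Quality Spec" sections above.

def spec_file_name(table_urn: str) -> str:
    """Derive the spec file name from a <gcp-project-id>.<dataset>.<tablename> URN."""
    parts = table_urn.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected <project>.<dataset>.<table>, got {table_urn!r}")
    return table_urn + ".yaml"

# The four tolerance rules listed under "Specifying Data Quality Spec".
RULES = {
    "less_than_eq": lambda value, limit: value <= limit,
    "less_than":    lambda value, limit: value < limit,
    "more_than_eq": lambda value, limit: value >= limit,
    "more_than":    lambda value, limit: value > limit,
}

def passes(value: float, tolerance: dict) -> bool:
    """A metric passes its audit only if every rule in its tolerance block holds."""
    return all(RULES[rule](value, limit) for rule, limit in tolerance.items())

print(spec_file_name("sample-project.sample_dataset.sample_table"))
# -> sample-project.sample_dataset.sample_table.yaml
print(passes(7.2, {"less_than_eq": 10.0}))  # a nullness_pct of 7.2 is within tolerance
```

For instance, with the example spec above, a `nullness_pct` of 7.2 on `field_1` passes its `less_than_eq: 10.0` tolerance, while any `duplication_pct` above 0 fails.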

#### Upload through Predator CLI
```shell script
    usage: predator upload --host=HOST --git-url=GIT-URL [<flags>]

    upload spec from git repository to storage

    Flags:
          --help             Show context-sensitive help (also try --help-long and --help-man).
      -h, --host=http://sample-predator-server        predator server
      -g, --git-url=git@sample-url:sample-entity.git  url of git, the source of data quality spec
      -c, --commit-id="[sample-commit-id]"     specific git commit hash; defaults to empty, which uploads the latest commit
      -p, --path-prefix="predator"   path to the root of the predator specs directory; defaults to empty
```

* Path Prefix (`--path-prefix`) is the path to the predator root directory in the git repository; set it if the specs root directory is not the git root.
    ```yaml
    git_root:
        predator:
          sample-entity-1-project-1:
            dataset_a:
              table_x.yaml
    ```
* Commit ID (`--commit-id`) is the git commit hash to upload. It is optional; when not set, the latest commit is used.
* Git URL (`--git-url`) is the git URL used for `git clone`; only the `git@sample-url:sample-entity.git` format is supported.

```shell script
    ./predator upload \
    --host http://sample-predator-server \
    --path-prefix predator --git-url git@sample-url:sample-entity-1.git \
    --commit-id sample-commit-id
```

#### Example of Upload through API call
From a git repository to the tolerance store (optional):
```shell script
    curl --location --request POST 'http://localhost:5000/v1beta1/spec/upload' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "git_url": "git@sample-url:sample-entity.git",
        "commit_id": "sample-commit-id",
        "path_prefix": "predator"
    }'
```


### API docs

`api/predator.postman_collection.json` or
`api/swagger.json`

### Tech Debt
* remove the ProfileMetric type and use only the Metric type
* remove Meta from MetricSpec and Metric
* better abstraction of QualityMetricProfiler
* better abstraction of BasicMetricProfiler

### Monitoring

How to set up monitoring:

This step-by-step tutorial is adapted from the [cortex getting started tutorial](https://cortexmetrics.io/docs/getting-started/getting-started-chunks-storage/).
Prometheus is not required, because it is only used as a metric collector for Cortex; in this setup stats are pushed from Telegraf to Cortex directly using remote write.

#### Cortex

* build cortex
```shell
git clone https://github.com/cortexproject/cortex.git
cd cortex
go build ./cmd/cortex
```

* run cortex
```shell
./cortex -config.file=${PREDATOR_REPO_ROOT}/example/monitoring/single-process-config.yaml
```

#### Grafana
```shell
docker run --rm -d --name=grafana -p 3000:3000 grafana/grafana
```

In the Grafana UI (username/password admin/admin), add a Prometheus datasource for Cortex (http://host.docker.internal:9009/api/prom).
Dashboard config will be added later.

Import the dashboard by uploading this [file](./example/monitoring/Predator-1614083874842.json)

#### Telegraf

* clone telegraf
```shell
cd ~/src
git clone https://github.com/influxdata/telegraf.git
```

* make the binary
```shell
cd ~/src/telegraf
make
```

* run telegraf
```shell
./telegraf --config ${PREDATOR_REPO_ROOT}/example/monitoring/telegraf.conf
```