{"id":20347450,"url":"https://github.com/opendatadiscovery/odd-collectors","last_synced_at":"2025-04-12T00:55:02.817Z","repository":{"id":204424399,"uuid":"692091792","full_name":"opendatadiscovery/odd-collectors","owner":"opendatadiscovery","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-29T12:52:46.000Z","size":2115,"stargazers_count":9,"open_issues_count":26,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-12T00:54:55.384Z","etag":null,"topics":["data-catalog","data-governance","data-observability"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opendatadiscovery.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-15T14:38:35.000Z","updated_at":"2025-03-26T16:47:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"e7a8c53d-50d4-40fe-ba16-029bf0492a5d","html_url":"https://github.com/opendatadiscovery/odd-collectors","commit_stats":null,"previous_names":["opendatadiscovery/odd-collectors"],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendatadiscovery%2Fodd-collectors","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendatadiscovery%2Fodd-collectors/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendatadiscovery%2Fodd-collectors/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opendatadiscovery%2Fodd-collect
ors/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/opendatadiscovery","download_url":"https://codeload.github.com/opendatadiscovery/odd-collectors/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248501880,"owners_count":21114683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-catalog","data-governance","data-observability"],"created_at":"2024-11-14T22:16:41.589Z","updated_at":"2025-04-12T00:55:02.804Z","avatar_url":"https://github.com/opendatadiscovery.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# OpenDataDiscovery Collectors.\n\n[Generic Collector](#odd-collector) | [AWS Collector](#odd-collector-aws) | [Azure Collector](#odd-collector-azure) | [GCP Collector](#odd-collector-gcp)\n\n\n### What is collector?\n[Usage example](#usage-example)\n\nA collector is a service that loads and runs [adapters](#what-is-adapter). Collectors are separated by data sources. Each collector has examples of configuration files for each adapter.\nA collector works as a daemon and periodically loads metadata from data sources. 
Data sources can be configured in the `plugins` field of the collector config.\nEach plugin has its own configuration to connect to a data source and load metadata.\n\n### What is adapter?\nAn adapter is an abstraction that allows loading metadata from different data sources.\nGiven all the information needed to connect to a data source, an adapter can load metadata from it and send it to ODD Platform.\nAdapters do not depend on each other and can be used separately. Adapters do not read actual data from a data source, only metadata.\n\n# odd-collector\n[Image](https://github.com/opendatadiscovery/odd-collector/pkgs/container/odd-collector) | [Configuration examples](odd-collector/config_examples)\n\nCollector for common data sources. It provides adapters for databases, vector stores (pgvector PostgreSQL extension), BI tools, and ML platforms such as MLflow.\n\nSupported data sources:\n- [Airbyte](odd-collector/README.md)\n- [Cassandra](odd-collector/README.md)\n- [CKAN](odd-collector/README.md)\n- [ClickHouse](odd-collector/README.md)\n- [cockroachdb](odd-collector/README.md)\n- [CubeJS](odd-collector/README.md)\n- [Druid](odd-collector/README.md)\n- [Duckdb](odd-collector/README.md)\n- [Elasticsearch](odd-collector/README.md)\n- [Feast](odd-collector/README.md)\n- [Fivetran](odd-collector/README.md)\n- [Hive](odd-collector/README.md)\n- [Kafka](odd-collector/README.md)\n- [Kubeflow](odd-collector/README.md)\n- [MariaDB](odd-collector/README.md)\n- [Metabase](odd-collector/README.md)\n- [Mlflow](odd-collector/README.md)\n- [Mode](odd-collector/README.md)\n- [MongoDB](odd-collector/README.md)\n- [MSSql](odd-collector/README.md)\n- [MySql](odd-collector/README.md)\n- [Neo4j](odd-collector/README.md)\n- [ODBC](odd-collector/README.md)\n- [ODD Adapter](odd-collector/README.md)\n- [PostgreSQL](odd-collector/README.md)\n- [Presto](odd-collector/README.md)\n- [Redash](odd-collector/README.md)\n- [Redshift](odd-collector/README.md)\n- [Scylladb](odd-collector/README.md)\n- 
[SingleStore](odd-collector/README.md)\n- [Snowflake](odd-collector/README.md)\n- [Superset](odd-collector/README.md)\n- [sqlite](odd-collector/README.md)\n- [Tableau](odd-collector/README.md)\n- [Tarantool](odd-collector/README.md)\n- [Trino](odd-collector/README.md)\n- [Vertica](odd-collector/README.md)\n\n# odd-collector-aws\n[Image](https://github.com/opendatadiscovery/odd-collector/pkgs/container/odd-collector-aws) | [Configuration examples](odd-collector-aws/config_examples)\n\nCollector that provides adapters for Amazon cloud services.\n\nSupported data sources:\n- [Athena](odd-collector-aws/README.md)\n- [DynamoDB](odd-collector-aws/README.md)\n- [Database Migration Service \\(DMS\\)](odd-collector-aws/README.md)\n- [Glue](odd-collector-aws/README.md)\n- [Quicksight](odd-collector-aws/README.md)\n- [Kinesis](odd-collector-aws/README.md)\n- [S3](odd-collector-aws/README.md)\n- [S3 Delta](odd-collector-aws/README.md)\n- [Sagemaker](odd-collector-aws/README.md)\n- [SagemakerFeaturestore](odd-collector-aws/README.md)\n- [SQS](odd-collector-aws/README.md)\n\n# odd-collector-azure\n[Image](https://github.com/opendatadiscovery/odd-collector/pkgs/container/odd-collector-azure) | [Configuration examples](odd-collector-azure/config_examples)\n\nCollector that provides adapters for Microsoft Azure cloud services.\n\nSupported data sources:\n- [Azure SQL](odd-collector-azure/README.md)\n- [Blob Storage](odd-collector-azure/README.md)\n- [PowerBI](odd-collector-azure/README.md)\n\n# odd-collector-gcp\n[Image](https://github.com/opendatadiscovery/odd-collector/pkgs/container/odd-collector-gcp) | [Configuration examples](odd-collector-gcp/config_examples)\n\nCollector that provides adapters for Google Cloud services. 
[Detailed documentation](odd-collector-gcp/README.md).\n\nSupported data sources:\n- [BigQuery](odd-collector-gcp/README.md#bigquery)\n- [BigTable](odd-collector-gcp/README.md#bigtable)\n- [GoogleCloudStorage](odd-collector-gcp/README.md#googlecloudstorage)\n- [GoogleCloudStorageDeltaTables](odd-collector-gcp/README.md#googlecloudstoragedeltatables)\n\n# Ingestion Filters Configuration\nThis section is a reference for configuring the ingestion filters available in several ODD collectors.\nThe table below lists those collectors with their adapters, filter configuration parameters, and a brief description of each filter.\n\n| Collector           | Adapter                    | Filter Config Parameter   | Filter Description                    |\n|---------------------|----------------------------|---------------------------|---------------------------------------|\n| odd-collector       | PostgreSQL                 | schemas_filter            | Filter object by database schema name |\n| odd-collector       | Snowflake                  | schemas_filter            | Filter object by database schema name |\n| odd-collector-aws   | S3                         | filename_filter           | Filter by file name                   |\n| odd-collector-aws   | S3 Delta                   | filter                    | Filter by file name                   |\n| odd-collector-gcp   | BigQuery                   | datasets_filter           | Filter by data set name               |\n| odd-collector-gcp   | Google Cloud Storage       | filename_filter           | Filter by file name                   |\n| odd-collector-gcp   | Google Cloud Storage Delta | filter                    | Filter by file name                   |\n| odd-collector-azure | Azure Data Factory (ADF)   | pipeline_filter           | Filter by pipeline name               |\n| odd-collector-azure | Azure BLOB Storage         | file_filter               
| Filter by file name                   |\n\n# Relationships\nThe goal of this feature is to build relationships on top of core data entities that are logically related.\nThe table below lists the adapters that currently support this feature and can construct the\nRelationship DataEntity.\nThere are 2 types of relationships: ERD (Entity-Relationship Diagram) and GRAPH.\n- ERD relationships represent associations between entities within a relational database. We distinguish 4 cardinality types of relationships:\n    - ONE_TO_EXACTLY_ONE - a single instance of an entity is related to a single instance of another entity.\n    - ONE_TO_ZERO_OR_ONE - a single instance of an entity is related to either zero instances or one instance of another entity.\n    - ONE_TO_ONE_OR_MORE - a single instance of an entity is related to one or more instances of another entity.\n    - ONE_TO_ZERO_ONE_OR_MORE - a single instance of an entity is related to zero, one, or more instances of another entity.\n- GRAPH relationships refer to connections between entities represented in a graph data structure. 
For example, in Neo4j these are\n  relationships between nodes.\n\n\n| Collector             | Adapter    | Relationship Type | Relationship Description                                                                                                    |\n|-----------------------|------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------|\n| odd-collector         | PostgreSQL | ERD               | Relationship between 2 related table entities (supports cross-schema relation) that is determined by foreign key constraint |\n| odd-collector         | Snowflake  | ERD               | Relationship between 2 related table entities (supports cross-schema relation) that is determined by foreign key constraint |\n\n\n# Collector configuration using alternative Secrets Backend\nCollector configuration settings can optionally be stored in a Secrets Backend (only AWS SSM Parameter Store is supported for now).\nWith this approach, you need to create your secrets in the chosen Secrets Backend provider according to the naming and backend configuration\nspecified in the `secrets_backend` section of `collector_config.yaml`. More detailed information with usage examples can be found below in the\n\"Usage Example\" section. Up-to-date information can also be found in the `odd-collector` documentation and the `odd-collector/collector_config.yaml` snippet.\n\n# Usage Example\n\n## Collector configuration\nThe config file must be named `collector_config.yaml` and placed in the same directory as the collector package.\nCollector config fields:\n```yaml\ndefault_pulling_interval: Optional[int] = None # Minutes to wait between runs of the job; if not set, the job runs only once\ntoken: str # Token to access ODD Platform\nplugins: list[Plugin] # List of adapter configs to be loaded\nplatform_host_url: str # URL of ODD Platform instance, e.g. 
http://localhost:8080\nchunk_size: int = 250 # Number of records to be sent in one request to the platform\nconnection_timeout_seconds: int = 300 # Seconds to wait for connection to the platform\nmisfire_grace_time: Optional[int] = None  # Seconds after the designated runtime that the job is still allowed to be run\nmax_instances: Optional[int] = 1  # Maximum number of concurrently running instances allowed\nverify_ssl: bool = True # For cases when self-signed certificates are used\n```\nFields are initialized in the following priority order:\n1) Fetching fields from `Secrets Backend` (if configured, see the \"Secrets Backend configuration\" section).\nAll collector config fields described above can be stored in `Secrets Backend`. If a field is\nconfigured both in `Secrets Backend` and in `collector_config.yaml`, the value\nstored in `Secrets Backend` takes priority. The settings for a single plugin must be stored in one place:\nall of its connection settings go either in `collector_config.yaml` or in `Secrets Backend`.\nDifferent plugins, however, can be stored in different places.\nIf the same plugin (determined by name) is defined in both `Secrets Backend` and `collector_config.yaml`,\nthe `Secrets Backend` definition takes priority.\n2) Fetching fields from `collector_config.yaml`.\n3) Fetching fields from environment variables (all fields except `plugins`). Environment variables must have\nthe same names as the fields in `collector_config.yaml`, but they are case-insensitive, so `platform_host_url`,\n`PLATFORM_HOST_URL` and `PlAtFoRm_HoSt_UrL` are all valid environment variable names.\n4) Setting default values. 
For `default_pulling_interval`, `chunk_size`, `connection_timeout_seconds`, `misfire_grace_time`,\n`max_instances` and `verify_ssl`, default values are available (see the field descriptions above).\n\nIf any of the `token`, `plugins` or `platform_host_url` fields is not specified by any of these means, the collector\nthrows a config parsing error.\n\n## Secrets Backend configuration\nThe Secrets Backend section must be specified only when you are using one of the supported\nbackends. If you use only the local `collector_config.yaml` file for configuration, you can\nskip the `secrets_backend:` section (delete it or leave it commented out).\nIf you need this functionality, configure it in `collector_config.yaml` alongside the collector config fields.\nAs only AWSSystemsManagerParameterStore is supported for now, all the examples below use it.\n```yaml\nsecrets_backend:\n  provider: \"AWSSystemsManagerParameterStore\"\n  # the section below is for key-value arguments provider needs\n  region_name: \"eu-central-1\"    # region where you store secrets\n  collector_settings_parameter_name: \"/odd/collector_config/collector_settings\"   # parameter name for storing\n                                     # collector settings, default is \"/odd/collector_config/collector_settings\"\n  collector_plugins_prefix: \"/odd/collector_config/plugins\"   # prefix for parameters, that contain\n                            # plugins configurations, default is \"/odd/collector_config/plugins\"\n```\n`provider` is a required parameter with no default value.\n\n`region_name` is resolved in the following order:\n1. The `AWS_REGION` environment variable has the highest priority: if it is set, its value is used.\n2. If `AWS_REGION` is not set, the `region_name` value from `collector_config.yaml` is used.\n3. If `region_name` is not specified either, the collector tries to retrieve the AWS region from the instance metadata service (IMDS).\n4. 
If none of the above works, an error is thrown, because the connection to the service cannot be established.\n\n`collector_settings_parameter_name` and `collector_plugins_prefix` have default values, so if the default names suit you,\nthese parameters can be omitted.\n\n## Example of collector config:\n```yaml\nsecrets_backend:\n  provider: \"AWSSystemsManagerParameterStore\"\n  # the section below is for key-value arguments provider needs\n  region_name: \"eu-central-1\"\n  collector_settings_parameter_name: \"/odd/collector_config/collector_settings\"\n  collector_plugins_prefix: \"/odd/collector_config/plugins\"\n\ndefault_pulling_interval: 10\ntoken: '****'\nplatform_host_url: http://localhost:8080\nchunk_size: 1000\nplugins:\n  - type: postgresql\n    name: postgresql_adapter\n    database: database\n    host: localhost\n    port: 5432\n    user: postgres\n    password: !ENV ${POSTGRES_PASSWORD}\n```\n\n## Using any collector in a docker container:\nFor a more complete example, take a look at the [docker compose for demo](https://github.com/opendatadiscovery/odd-platform/blob/main/docker/README.md).\n```yaml\nversion: \"3.8\"\nservices:\n  odd-collector:\n    image: ghcr.io/opendatadiscovery/odd-collector:latest\n    restart: always\n    volumes:\n      # The leading ./ makes this a bind mount of the local file rather than a named volume\n      - ./collector_config.yaml:/app/collector_config.yaml\n    environment:\n      - LOGLEVEL=DEBUG # Optional, default is INFO; use DEBUG for more verbose logs\n      - PLATFORM_HOST_URL=${PLATFORM_HOST_URL}\n      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}\n```\n\n## For developers\n\n### Collectors release process\nA large part of this process is automated with a GitHub Actions workflow named \"ODD Collector release\".\nRequired steps to create a release:\n1. Merge all code changes to the `main` branch, including the manual package version bump\n(this part is not automated: you need to update the version in the `./\u003codd_collector\u003e/__version__.py`\nand `./pyproject.toml` files).\n2. 
Update your local `main` branch with the command: `git pull`.\n3. Create a tag locally using the command `git tag \u003ctag_name\u003e`, where `\u003ctag_name\u003e` should be named after the\ncollector you are releasing and its version number: odd-collector - `generic/1.0.0`,\nodd-collector-aws - `aws/1.0.0`, odd-collector-azure - `azure/1.0.0`, odd-collector-gcp - `gcp/1.0.0`.\n4. Push the locally created tag to the repository: `git push origin \u003ctag_name\u003e`.\n5. Go to GitHub Actions and choose the \"ODD Collector release\" action.\n6. On the right side click \"Run workflow\" and choose the appropriate tag (in \"Use workflow from\")\nand the service you are releasing (in \"Select service to build\"). Example: if you are releasing odd-collector,\nthe tag should look like `generic/0.1.61` and the service should be `odd-collector`.\n7. Click \"Run workflow\" and wait until the action completes. As a result, the new image will be\npublished; you can check it here: https://github.com/orgs/opendatadiscovery/packages.\n8. Now go back to the odd-collectors repo - https://github.com/opendatadiscovery/odd-collectors, go\nto \"Releases\" on the right side and edit the release draft (if needed) that was created for you. The naming\nconvention is the following (depending on the service you have released): \"Generic ODD Collector 1.0.0\",\n\"AWS ODD Collector 1.0.0\", \"Azure ODD Collector 1.0.0\", \"GCP ODD Collector 1.0.0\".\n9. Save the release changes.\n\n### Testing\n1. To run tests, go to the folder of the collector you need. For the generic collector - \n`cd odd-collector`.\n2. Activate the poetry virtual environment with installed dependencies - `poetry shell`.\n3. Run the tests - `pytest ./tests -v`. The `-v` option is not mandatory; it stands for\n__*verbose*__ and gives more detailed feedback on test results. 
Also, if you want to run tests\nonly for a particular adapter, you can just modify the relative path, like this - \n`pytest ./tests/integration/test_postgres.py -v`. Tests can also be invoked with\n`poetry run pytest ./tests/integration/test_postgres.py -v`, which can be helpful,\nfor instance, for automated testing in GitHub Actions, where you cannot directly activate the venv with\n`poetry shell` in the created testing environment.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopendatadiscovery%2Fodd-collectors","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopendatadiscovery%2Fodd-collectors","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopendatadiscovery%2Fodd-collectors/lists"}