{"id":24763195,"url":"https://github.com/googlecloudplatform/dcm2bq","last_synced_at":"2026-02-28T02:58:09.908Z","repository":{"id":274541618,"uuid":"920306017","full_name":"GoogleCloudPlatform/dcm2bq","owner":"GoogleCloudPlatform","description":"A service for creating a JSON metadata representation for DICOM from multiple input sources and storing into Google Cloud BigQuery (BQ).","archived":false,"fork":false,"pushed_at":"2025-09-13T00:17:01.000Z","size":10227,"stargazers_count":3,"open_issues_count":2,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-13T02:28:16.133Z","etag":null,"topics":["bigquery","dicom","gcs","googlecloud","googlecloudplatform","googlecloudstorage","json"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-21T23:12:00.000Z","updated_at":"2025-09-13T00:17:04.000Z","dependencies_parsed_at":"2025-08-22T18:37:47.302Z","dependency_job_id":null,"html_url":"https://github.com/GoogleCloudPlatform/dcm2bq","commit_stats":null,"previous_names":["googlecloudplatform/dcm2bq"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/GoogleCloudPlatform/dcm2bq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdcm2bq","tags_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/GoogleCloudPlatform%2Fdcm2bq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdcm2bq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdcm2bq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/dcm2bq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdcm2bq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279006846,"owners_count":26084206,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","dicom","gcs","googlecloud","googlecloudplatform","googlecloudstorage","json"],"created_at":"2025-01-28T20:21:06.811Z","updated_at":"2026-01-26T06:13:41.363Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"JavaScript","readme":"# DCM2BQ\n\n\n`DCM2BQ` (DICOM to BigQuery) is a tool for extracting metadata and generating vector embeddings from DICOM files, loading both into Google BigQuery. 
It can be run as a standalone CLI or as a containerized service, making it easy to integrate into data pipelines.\n\nBy generating vector embeddings for DICOM images, Structured Reports, and PDFs, DCM2BQ enables powerful semantic search and similarity-based retrieval across your medical imaging data. This allows you to find related studies, cases, or reports even when traditional metadata fields do not match exactly.\n\nThis open-source package can be used as an alternative to the DICOM metadata streaming feature in the [Google Cloud Healthcare API](https://cloud.google.com/healthcare-api), enabling similar functionality for DICOM data stored in [Google Cloud Storage](https://cloud.google.com/storage). It can also be used to complement a Healthcare API DICOM store by generating embeddings for existing or new data.\n\n## Why DCM2BQ?\n\nTraditional imaging systems like PACS and VNAs offer limited query capabilities over DICOM metadata. By ingesting the complete metadata and vector embeddings into [BigQuery](https://cloud.google.com/bigquery), you unlock powerful, large-scale analytics and insights from your imaging data.\n\n**Benefits of Embedding-Based Search:**\n\n- Go beyond exact field matching: Find similar images, reports, or studies based on visual or textual content, not just metadata.\n- Enable content-based retrieval: Search for \"cases like this one\" or \"find similar findings\" using embeddings.\n- Support multi-modal queries: Use embeddings from images, SRs, and PDFs for unified search across modalities.\n- Improve research, cohort discovery, and clinical decision support by surfacing relevant cases that would be missed by keyword or tag-based search alone.\n\n## Features\n\n-   Parse DICOM Part 10 files.\n-   Convert DICOM metadata to a flexible JSON representation.\n-   Load DICOM metadata and vector embeddings into a BigQuery table.\n-   Enable semantic and similarity search over your imaging archive using embeddings.\n-   Run as a containerized 
service, ideal for event-driven pipelines.\n-   Run as a command-line interface (CLI) for manual or scripted processing.\n-   Handle Google Cloud Storage object lifecycle events (creation, deletion) to keep BigQuery synchronized.\n-   Process zip and tar.gz/tgz archives containing multiple DICOM files with a single event.\n-   Generate vector embeddings from DICOM images, Structured Reports, and encapsulated PDFs using Google's multi-modal embedding model.\n-   Highly configurable to adapt to your needs.\n\n## BigQuery schema\n\nThe project stores DICOM metadata and vector embeddings in a single consolidated BigQuery table with the following columns:\n\n- `id`: STRING (REQUIRED) - Deterministic SHA256 hash of `path|version`\n- `timestamp`: TIMESTAMP (REQUIRED) - When the record was written\n- `path`: STRING (REQUIRED) - Full path to the DICOM file\n- `version`: STRING (NULLABLE) - Object version identifier\n- `info`: RECORD (REQUIRED) - Processing metadata with structured fields:\n  - `event`: STRING - Event type (e.g., OBJECT_FINALIZE)\n  - `input`: RECORD - DICOM file metadata (size, type)\n  - `embedding`: RECORD - Embedding generation details\n    - `model`: STRING - Model used for embedding\n    - `input`: RECORD - Object used for embedding (path, size, mimeType)\n- `metadata`: JSON (NULLABLE) - Complete DICOM JSON metadata\n- `embeddingVector`: FLOAT ARRAY (NULLABLE) - Vector embedding for semantic search\n\nThe Cloud Run service is configured with the table ID via the `gcpConfig.bigQuery.instancesTableId` setting (see `config.defaults.js`). Use the `embeddingVector` column when running vector searches or creating vector indexes and models.\n\nNote: the project includes sample DDL and queries — see `src/bq-samples.sql`.\n\n## Example queries\n\nYou can find example queries and DDL for creating the embedding model and vector index in `src/bq-samples.sql`. 
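For orientation, here is a minimal sketch of a nearest-neighbor search over the `embeddingVector` column using BigQuery's `VECTOR_SEARCH` function (the dataset, table, and GCS path names below are placeholders, not values from this repo):\n\n```sql\nSELECT base.path, distance\nFROM VECTOR_SEARCH(\n  TABLE my_dataset.my_table,\n  'embeddingVector',\n  (SELECT embeddingVector FROM my_dataset.my_table WHERE path = 'gs://my-bucket/ct.dcm'),\n  top_k =\u003e 5\n)\nORDER BY distance;\n```\n\n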
The file includes:\n\n- example SELECTs against the consolidated metadata table,\n- sample aggregation queries for vector search,\n- and DDL samples to create an embedding model and a vector index on the `embeddingVector` column.\n\nBefore running vector searches, ensure you have created the embedding model and vector index (the samples show how to do this with `bq query`).\n\n## Installation\n\n### Dependencies\n\nFor image processing and vector embedding generation, `dcm2bq` relies on two external toolkits that must be installed in the execution environment:\n\n-   **DCMTK**: A collection of libraries and applications for working with DICOM files.\n-   **GDCM**: A library for reading and writing DICOM files, used here for image format conversion.\n\nThese are included in the provided Docker image. If you are building from source or running the CLI locally, you will need to install them manually.\n\n**On Debian/Ubuntu:**\n```bash\nsudo apt-get update \u0026\u0026 sudo apt-get install -y dcmtk gdcm-tools\n```\n\n### Docker\n\nThe service is distributed as a container image. You can find the latest releases on [Docker Hub](https://hub.docker.com/r/jasonklotzer/dcm2bq).\n\n```bash\ndocker pull jasonklotzer/dcm2bq:latest\n```\n\n### From Source (for CLI)\n\nTo use the CLI, you can install it from the source code.\n\n1.  Ensure you have `node` and `npm` installed. We recommend using nvm.\n2.  Ensure you have installed the required [Dependencies](#dependencies).\n3.  Clone the repository:\n    ```bash\n    git clone https://github.com/googlecloudplatform/dcm2bq.git\n    ```\n4.  Navigate to the directory and install dependencies and the CLI:\n    ```bash\n    cd dcm2bq\n    npm install\n    npm install -g .\n    ```\n5.  
Verify the installation:\n    ```bash\n    dcm2bq --help\n    ```\n\n## Usage\n\n### As a Service (Cloud Run)\n\nThe recommended deployment uses Google Cloud Storage, Pub/Sub, and Cloud Run.\n\n![Deployment Architecture](assets/arch.svg)\n\nThe workflow is as follows:\n\n1.  An object operation (e.g., creation, deletion) occurs in a GCS bucket.\n2.  A notification is sent to a Pub/Sub topic.\n3.  A Pub/Sub subscription pushes the message to a Cloud Run service running the `dcm2bq` container.\n4.  The `dcm2bq` container processes the message:\n    -   It validates the message schema and checks for a DICOM-like file extension (e.g., `.dcm`) or supported archive (`.zip`, `.tar.gz`, `.tgz`).\n    -   For new objects, it reads the file from GCS and parses the DICOM metadata.\n    -   For archive files, it extracts all `.dcm` files and processes each one individually.\n    -   If embeddings are enabled, it generates a vector embedding from the DICOM data (for supported types like images, SRs, and PDFs) by calling the Vertex AI Embeddings API.\n    -   It inserts a JSON representation of the metadata and the embedding into BigQuery.\n    -   For deleted objects, it records the deletion event in BigQuery.\n5.  If an error occurs, the message is NACK'd for retry. After maximum retries, it's sent to a dead-letter topic for analysis.\n\n**Note:** When deploying to Cloud Run, ensure the container has enough memory allocated to handle your largest DICOM files.\n\n### Archive Support (.zip, .tar.gz, .tgz)\n\nThe service can process archives containing multiple DICOM files. When a `.zip`, `.tar.gz`, or `.tgz` file is uploaded to the configured GCS bucket:\n\n1. The archive is downloaded to memory\n2. All `.dcm` files are extracted to a temporary directory\n3. Each DICOM file is processed individually (metadata extraction and optional embedding generation)\n4. All files share the same base path (the archive file path) for tracking purposes\n5. 
Temporary files are automatically cleaned up after processing\n\nThis feature is useful for batch uploads or when DICOM files are already archived. All DICOM files within the archive will be processed as separate entries in BigQuery, maintaining the original archive file path as the base path for version tracking.\n\n### As a CLI\n\nThe CLI is useful for testing, development, and batch processing.\n\n\n**Example: Dump DICOM metadata as JSON**\n\n```bash\ndcm2bq dump test/files/dcm/ct.dcm | jq\n```\n\nThis command will output the full DICOM metadata in JSON format, which can be piped to tools like `jq` for filtering and inspection.\n\n**Example: Generate a vector embedding**\n\n```bash\ndcm2bq embed test/files/dcm/ct.dcm\n```\n\nThis command will process the DICOM file, generate a vector embedding using the configured model, and output the embedding as a JSON array.\n\n**Example: Extract rendered image or text from a DICOM file**\n\n```bash\ndcm2bq extract test/files/dcm/ct.dcm\n```\n\nThis command will extract and save a rendered image (JPG) or extracted text (TXT) from the DICOM file, depending on its type (image, SR, or PDF). The output file extension is chosen automatically unless you specify `--output`.\n\n**Example: Extract with summarization (SR/PDF only)**\n\n```bash\ndcm2bq extract test/files/dcm/sr.dcm --summary\n```\n\nBy default, summarization is disabled for extracted text. If you pass `--summary`, the extracted text from Structured Reports (SR) or PDFs will be summarized using Gemini before saving. 
This is useful for generating concise, embedding-friendly text.\n\n**Example: Extract without summarization (explicitly)**\n\n```bash\ndcm2bq extract test/files/dcm/sr.dcm\n```\n\nIf you do not pass `--summary`, the full extracted text will be saved (subject to length limits for embedding).\n\n## Configuration\n\nConfiguration options can be found in the [default config file](./src/config.defaults.js).\n\nYou can override these defaults in two ways.\n\n**Important:** When providing an override via environment variable or a file, you must supply the entire configuration object. The default configuration is not merged with your overrides; your provided configuration will be used as-is.\n\n1.  **Environment Variable:** Set `DCM2BQ_CONFIG` to a JSON string containing the full configuration.\n    ```bash\n    export DCM2BQ_CONFIG='{\"gcpConfig\":{\"projectId\":\"my-gcp-project\",\"bigQuery\":{\"datasetId\":\"my_dataset\",\"instancesTableId\":\"my_table\"},\"embedding\":{\"input\":{\"vector\":{\"model\":\"multimodalembedding@001\"}}}},\"jsonOutput\":{...}}'\n    ```\n2.  **Config File:** Set `DCM2BQ_CONFIG_FILE` to the path of a JSON file containing your full configuration.\n    ```bash\n    # config.json\n    # {\n    #   \"gcpConfig\": {\n    #     \"projectId\": \"my-gcp-project\",\n    #     \"bigQuery\": {\n    #       \"datasetId\": \"my_dataset\",\n    #       \"instancesTableId\": \"my_table\"\n    #     },\n    #     \"embedding\": {\n    #       \"input\": {\n    #         \"vector\": {\n    #           \"model\": \"multimodalembedding@001\"\n    #         }\n    #       }\n    #     }\n    #   },\n    #   \"jsonOutput\": {\n    #      ...\n    #   }\n    # }\n    export DCM2BQ_CONFIG_FILE=./config.json\n    ```\n\n### Embedding and Summarization Configuration\n\nTo enable vector embedding generation and input extraction, configure the `embedding.input` section within `gcpConfig`. 
The configuration uses a hierarchical structure where the presence of settings indicates they are enabled.\n\nExample `config.json` override:\n```json\n{\n  \"gcpConfig\": {\n    \"embedding\": {\n      \"input\": {\n        \"gcsBucketPath\": \"gs://my-bucket/processed-data\",\n        \"summarizeText\": {\n          \"model\": \"gemini-2.5-flash-lite\",\n          \"maxLength\": 1024\n        },\n        \"vector\": {\n          \"model\": \"multimodalembedding@001\"\n        }\n      }\n    }\n  }\n}\n```\n\n**Note:** The JSON snippet above is a partial example showing only the embeddings-related settings. When providing an override (via `DCM2BQ_CONFIG` or `DCM2BQ_CONFIG_FILE`), you must supply the entire configuration object — partial merges are not supported.\n\n### Embedding Input Configuration\n\n- `embedding.input.gcsBucketPath`: GCS bucket path where processed images (.jpg) and text (.txt) files will be saved. Format: `gs://bucket-name/optional-path`. Files are organized as `{gcsBucketPath}/{StudyInstanceUID}/{SeriesInstanceUID}/{SOPInstanceUID}.{jpg|txt}`. If this is omitted or empty, no files will be saved. **Important:** This bucket should be separate from the DICOM source bucket to avoid triggering unwanted events when processed files are created.\n- `embedding.input.vector.model`: If present, vector embeddings will be generated using the specified Vertex AI model (e.g., `multimodalembedding@001`). Omit this section to only extract and save inputs without generating embeddings.\n\n### Text Summarization Configuration\n\n- `embedding.input.summarizeText.model`: If present, long text extracted from SR/PDF will be summarized using the specified Gemini model before processing. Omit this section to skip summarization. This can be overridden at runtime by the CLI `--summary` flag.\n- `embedding.input.summarizeText.maxLength`: Maximum character length for summarized text (default: 1024). 
The summarization prompt instructs the model to keep output under this limit. This also controls when summarization is triggered: text longer than `maxLength` will be summarized when embedding compatibility is required.\n\n## Development\n\nTo get started with development, follow the installation steps for the CLI.\n\nThe `test` directory contains numerous examples, unit tests, and integration tests that are helpful for understanding the codebase and validating changes.\n\n### Running Tests\n\nThe unit tests are fully mocked and can be run without any GCP dependencies or configuration files. All external service calls (BigQuery, Cloud Storage, Vertex AI, Gemini) are stubbed to ensure fast, reliable test execution.\n\nTo run the unit test suite:\n\n```bash\nnpm test\n# or using the helper script\n./helpers/run-unit-tests.sh\n```\n\nThe tests use a mock configuration defined in [test/test-config.js](test/test-config.js) and don't require any real GCP resources or the `test/testconfig.json` file.\n\n### Integration Tests\n\nFor testing against real GCP services, integration test files are available that require:\n\n1. A properly configured `test/testconfig.json` file (generated by running `./helpers/deploy.sh my-project-name`)\n2. GCP authentication (`gcloud auth application-default login`)\n3. 
Deployed GCP resources (BigQuery dataset/table, GCS buckets)\n\n**When to run integration tests**\n- Run unit tests (`npm test`) locally before opening a PR or tagging a release; they are fully mocked and fast.\n- Run integration tests **only after** deploying the test stack (e.g., `./helpers/deploy.sh \u003cproject\u003e`) or promoting a build to a staging environment, because they need live GCP resources.\n- Recommended checkpoints: after dependency or schema changes, before a release cut once the candidate container is deployed to the test/staging project, and periodically in CI on a schedule against that deployed environment.\n\nAvailable integration test suites:\n\n- **`semantic_compare.integration.js`** - Tests semantic similarity between text and image embeddings\n- **`pipeline.integration.js`** - End-to-end pipeline tests (GCS upload → processing → BigQuery insertion)\n- **`storage-embeddings.integration.js`** - Storage and embedding feature tests\n- **`config-validation.integration.js`** - Configuration, schema, and permissions validation tests\n\nTo run all integration tests:\n\n```bash\nnpm run test:integration\n# or using the helper script directly\n./helpers/run-integration-tests.sh\n```\n\nOr manually with mocha:\n\n```bash\nDCM2BQ_CONFIG_FILE=test/testconfig.json mocha test/*.integration.js\n```\n\nTo run a specific integration test suite:\n\n```bash\nDCM2BQ_CONFIG_FILE=test/testconfig.json mocha test/pipeline.integration.js\n```\n\n**Note:** Integration tests make real API calls to Google Cloud services and may incur costs. They also upload test files to GCS and insert rows into BigQuery (cleanup is performed automatically).\n\n## Contributing\n\nContributions are welcome! 
Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for details on how to contribute to this project.\n\n## License\n\nThis project is licensed under the Apache 2.0 License.\n\n## Deployment with Terraform\n\nThe recommended way to deploy the service and all required Google Cloud resources is using Terraform. This will provision:\n- Google Cloud Storage bucket(s)\n- Pub/Sub topics and subscriptions\n- BigQuery dataset and tables\n- Cloud Run service\n- All necessary IAM permissions\n\nA helper script is provided to automate the process:\n\n```bash\n./helpers/deploy.sh [OPTIONS] [destroy|upload] \u003cgcp_project_id\u003e\n```\n- `upload`: Upload test DICOM files from `test/files/dcm/*.dcm` to the GCS bucket created by Terraform (standalone; does not deploy).\n- `destroy`: Destroy all previously created resources (cleanup).\n- `--debug`: Enable debug mode with verbose logging in the Cloud Run service.\n- `--help` or `-h`: Show usage instructions.\n\n**Examples**\n\n- Deploy infrastructure:\n  ```bash\n  ./helpers/deploy.sh my-gcp-project-id\n  ```\n\n- Deploy with debug mode enabled:\n  ```bash\n  ./helpers/deploy.sh --debug my-gcp-project-id\n  ```\n\n- Upload test data only (no deploy):\n  ```bash\n  ./helpers/deploy.sh upload my-gcp-project-id\n  ```\n\n- Deploy and then upload test data (two steps):\n  ```bash\n  ./helpers/deploy.sh my-gcp-project-id\n  ./helpers/deploy.sh upload my-gcp-project-id\n  ```\n\n- Destroy all resources:\n  ```bash\n  ./helpers/deploy.sh destroy my-gcp-project-id\n  ```\n\nThe script will:\n1. Ensure all dependencies (Terraform, gcloud, gsutil) are installed.\n2. Create a GCS bucket for Terraform state (if needed).\n3. Generate a backend config for Terraform.\n4. Deploy all infrastructure using Terraform.\n5. Optionally upload test DICOM files if the flag is supplied.\n\n\u003e **Note:** All resource names (buckets, datasets, tables, etc.) 
are made unique per deployment to avoid collisions.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdcm2bq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fdcm2bq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdcm2bq/lists"}