https://github.com/zefdelgadillo/gcs-fixity-function
Fixity metadata tracker for Google Cloud Storage file archives
- Host: GitHub
- URL: https://github.com/zefdelgadillo/gcs-fixity-function
- Owner: zefdelgadillo
- Created: 2019-08-27T03:22:14.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:41:11.000Z (over 3 years ago)
- Last Synced: 2025-01-12T13:24:30.908Z (over 1 year ago)
- Topics: fixity, google-cloud-platform
- Language: Python
- Homepage:
- Size: 88.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# Fixity Metadata for GCS
This script pulls metadata and checksums for file archives in Google Cloud Storage and stores them in a manifest file and in BigQuery to track changes over time. The script uses the [BagIt](https://tools.ietf.org/html/rfc8493) specification.
## Overview
Each time this Fixity function runs for a file archive bag that follows the BagIt specification, the following are created:
- An MD5 checksum manifest file
- Records in BigQuery containing the following metadata: bucket, bag, file name, file size, checksum, file modified date, fixity run date.
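As a rough illustration of the two outputs above, the sketch below models one metadata record and renders BagIt-style manifest lines (`<checksum> <path>`). The `FixityRecord` class, its field names, and both helper functions are hypothetical and not taken from this repository; the only external facts assumed are that GCS reports an object's MD5 as a base64 string and that BagIt manifests list hex checksums.

```python
import base64
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class FixityRecord:
    """One row of fixity metadata (hypothetical field names)."""
    bucket: str
    bag: str
    file_name: str          # path relative to the bag, e.g. "data/book1"
    file_size: int
    checksum: str           # MD5 as a hex string
    file_modified: datetime
    fixity_run: datetime


def md5_base64_to_hex(md5_b64: str) -> str:
    """Convert a base64-encoded MD5 digest into the hex form used in manifests."""
    return base64.b64decode(md5_b64).hex()


def manifest_text(records: Iterable[FixityRecord]) -> str:
    """Render records as manifest lines, one '<checksum> <path>' per file."""
    ordered = sorted(records, key=lambda r: r.file_name)
    return "".join(f"{r.checksum} {r.file_name}\n" for r in ordered)


# Example: an empty file, whose base64 MD5 is "1B2M2Y8AsgTpgAmY7PhCfg==".
run_time = datetime(2020, 1, 1, tzinfo=timezone.utc)
example = FixityRecord(
    bucket="rare-books",
    bag="uncategorized/",
    file_name="data/a",
    file_size=0,
    checksum=md5_base64_to_hex("1B2M2Y8AsgTpgAmY7PhCfg=="),
    file_modified=run_time,
    fixity_run=run_time,
)
print(manifest_text([example]), end="")
```

Sorting by file name keeps the manifest stable between runs, which makes successive manifests easier to compare.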
## Process

* A Google Cloud Function listens for changes to a GCS bucket (new file archives, file updates)
* (or) Google Cloud Scheduler invokes the Cloud Function manually or on a predefined schedule
* The function reads the metadata of files in each Bag* that has file updates and writes a new manifest file into each Bag
* The function writes records into BigQuery for each Bag with new metadata
_* If the function is invoked by a change to the GCS bucket, Fixity runs only for the Bag that changed. If it is invoked by Cloud Scheduler, Fixity runs for the entire GCS bucket._
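The scoping rule in the note above can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the repository's code: `bag_for_object` and `bags_to_process` are hypothetical names, and the event is assumed to be a dict carrying the changed object's path under `"name"`, as storage-triggered background functions receive.

```python
from typing import Iterable, List, Optional


def bag_for_object(object_name: str) -> Optional[str]:
    """Return the bag prefix that owns an object, or None if the object
    does not sit under some `data/` directory."""
    dirs = object_name.split("/")[:-1]      # drop the file name itself
    if "data" not in dirs:
        return None
    prefix = "/".join(dirs[: dirs.index("data")])
    return prefix + "/" if prefix else ""   # "" means the bucket root is the bag


def bags_to_process(event: Optional[dict], all_bags: Iterable[str]) -> List[str]:
    """Pick the bags for this run.

    Bucket-change event: only the bag containing the changed object.
    No event (scheduled or manual run): every bag in the bucket.
    """
    if event and event.get("name"):
        bag = bag_for_object(event["name"])
        return [bag] if bag is not None else []
    return sorted(all_bags)


# A change to one book only touches its own bag.
changed = {"bucket": "rare-books", "name": "collection-europe/italy/data/book1"}
print(bags_to_process(changed, ["collection-europe/italy/", "uncategorized/"]))
```

In this sketch, an object outside any `data/` directory (for example a manifest written at the top of a bag) maps to no bag, so such an event would not schedule further work.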
### Buckets
This Fixity function is configured for 1 Google Cloud Storage bucket containing any number of Bags.
### Bags
Bags should be created using the [BagIt Specification (RFC 8493)](https://tools.ietf.org/html/rfc8493). A Bag is a directory in a GCS bucket that contains a `data/` directory containing archived files.
Any number of bags can be created in a GCS bucket, **as long as each bag contains a `data/` directory**. In the following example, this function will recognize 4 bags: `collection-europe/italy/`, `collection-europe/france/`, `collection-na/1700s/`, and `uncategorized/`.
```
BUCKET: Rare Books
.
├── collection-europe
│   ├── italy
│   │   └── data
│   │       ├── book1
│   │       ├── book2
│   │       └── book3
│   └── france
│       └── data
│           ├── book1
│           └── book2
├── collection-na
│   └── 1700s
│       └── data
│           ├── book1
│           ├── book2
│           └── book3
└── uncategorized
    └── data
        └── a
```
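The "a bag is any directory that contains a `data/` directory" rule can be expressed over a flat list of object names, which is how GCS exposes a bucket. The function below is a minimal sketch with a hypothetical name (`find_bags`); it is not the repository's implementation.

```python
from typing import Iterable, List


def find_bags(object_names: Iterable[str]) -> List[str]:
    """Return the sorted bag prefixes found in a list of object paths.

    A bag is the directory immediately above the first `data/` segment
    of an object's path.
    """
    bags = set()
    for name in object_names:
        dirs = name.split("/")[:-1]          # directory segments only
        if "data" in dirs:
            prefix = "/".join(dirs[: dirs.index("data")])
            bags.add(prefix + "/" if prefix else "")
    return sorted(bags)


# Object names corresponding to the example bucket above.
rare_books = [
    "collection-europe/italy/data/book1",
    "collection-europe/italy/data/book2",
    "collection-europe/italy/data/book3",
    "collection-europe/france/data/book1",
    "collection-europe/france/data/book2",
    "collection-na/1700s/data/book1",
    "collection-na/1700s/data/book2",
    "collection-na/1700s/data/book3",
    "uncategorized/data/a",
]
for bag in find_bags(rare_books):
    print(bag)
```

Run against the example bucket, this yields the same four bags listed above.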
### BigQuery
The setup instructions create the following BigQuery views:
- `fixity.current_manifest`: A current list of all files in the archive across all Bags.
- `fixity.file_operations`: A running list of all file operations (file updated, file changed, file created) across all bags.
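The view definitions themselves live in the setup instructions and are not reproduced here. To illustrate the idea behind a list of file operations, the sketch below compares the checksums from two fixity runs of one bag and labels what happened in between. The function name, the labels `"created"` and `"changed"`, and the dict-based input are assumptions for illustration only, not the SQL behind the views.

```python
from typing import Dict, List, Tuple


def file_operations(
    previous: Dict[str, str], current: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Compare two runs (path -> MD5 hex) and list (path, operation) pairs.

    A path absent from the previous run is "created"; a path whose
    checksum differs is "changed". Unchanged files are omitted.
    """
    operations = []
    for path, checksum in sorted(current.items()):
        if path not in previous:
            operations.append((path, "created"))
        elif previous[path] != checksum:
            operations.append((path, "changed"))
    return operations


# Made-up checksums for two consecutive runs of one bag.
first_run = {"data/book1": "aaa", "data/book2": "bbb"}
second_run = {"data/book1": "aaa", "data/book2": "ccc", "data/book3": "ddd"}
print(file_operations(first_run, second_run))
```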
## Setup
* Use **[Setup Instructions](./docs/setup.md)** to set up Fixity functions.
* Use **[Removal Instructions](./docs/remove.md)** to disable or remove Fixity functions.
## Limitations
This Cloud Function has a default memory limit of 256MB per invocation. To avoid hitting that limit, distribute bags and objects across multiple buckets. Keeping each bucket under 250,000 objects is recommended.