# **IMDB Dataform Project**

# About

An example Dataform project to load and transform the publicly available dataset from [IMDB](https://imdb.com).

# Prerequisites

## Google Cloud Project

Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services, such as Dataform and BigQuery.

If you do not already have a Google Cloud project into which you want to load the IMDB dataset, you will need to create one. The documentation on how to do this can be found [here](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project).

Once you have a Google Cloud project, remember to take note of the Project Number and Project ID.
These can be found on the Google Cloud project console welcome page, which you can find [here](https://console.cloud.google.com/welcome).

## Google Cloud Storage Bucket

Now that you have a Google Cloud project, you need to create a Google Cloud Storage bucket into which the IMDB dataset files will be uploaded, and from which Dataform will load the data into BigQuery. The documentation on how to create a new storage bucket can be found [here](https://cloud.google.com/storage/docs/creating-buckets).

Remember to take note of the bucket name, as it is required for one of the Dataform config variables.

## Enable Dataform Service

Next, you will need to enable the Dataform service within the Google Cloud project you just created. This can be achieved by clicking the "Enable" button [here](https://console.cloud.google.com/marketplace/product/google/dataform.googleapis.com).

## Create a Dataform Repository

After the Dataform service has been enabled, you will be redirected to the BigQuery Dataform page within the Google Cloud console. For reference, this can be found [here](https://console.cloud.google.com/bigquery/dataform).

Go ahead and create a repository. For more information on how to do this, see the documentation page found [here](https://cloud.google.com/dataform/docs/create-repository).

## Grant Permissions to Dataform Service Account

When you create your first Dataform repository, Dataform automatically generates a service account.
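The service account address follows a fixed pattern, described just below. As a quick sanity check, you can assemble it yourself in the shell from your project number. This is a minimal sketch using a placeholder project number; substitute your own value:

```shell
# Placeholder project number -- substitute your own, which you can look up with:
#   gcloud projects describe YOUR_PROJECT_ID --format='value(projectNumber)'
PROJECT_NUMBER="123456789012"

# Assemble the Dataform service account address from the project number.
DATAFORM_SA="service-${PROJECT_NUMBER}@gcp-sa-dataform.iam.gserviceaccount.com"

echo "${DATAFORM_SA}"
```

Keep this address handy; it is the principal you will grant IAM roles to in the next step.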
Dataform uses the service account to interact with BigQuery on your behalf.

Your Dataform service account ID is in the following format:

```
service-YOUR_PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com
```

Replace YOUR_PROJECT_NUMBER with the Project Number of your Google Cloud project, which you previously took note of.

The Dataform service account requires a number of IAM roles in order to execute the workflows in BigQuery and to load data from the Google Cloud Storage bucket. These can be granted by following these steps:

1. In the Google Cloud console, go to the [IAM page](https://console.cloud.google.com/iam-admin).
2. Click Add.
3. In the New principals field, enter your Dataform service account ID.
4. In the Select a role drop-down list, select the BigQuery Job User role.
5. Click Add another role, and then in the Select a role drop-down list, select the BigQuery Data Editor role.
6. Click Add another role, and then in the Select a role drop-down list, select the BigQuery Data Viewer role.
7. Click Add another role, and then in the Select a role drop-down list, select the Storage Object Viewer role.
8. Click Save.

## IMDB Public Dataset

Even though the public datasets can be accessed without any user credentials, it is recommended that you create an IMDB account, which you can do [here](https://contribute.imdb.com/dataset).
Once signed in, you can go to the [Alternate Interfaces](http://www.imdb.com/interfaces) page for general data access, which provides details on licensing and metadata for each of the data files.

The data files can be downloaded directly from [here](https://datasets.imdbws.com/), or individually via the links below:

-   [name.basics.tsv.gz](https://datasets.imdbws.com/name.basics.tsv.gz)
-   [title.akas.tsv.gz](https://datasets.imdbws.com/title.akas.tsv.gz)
-   [title.basics.tsv.gz](https://datasets.imdbws.com/title.basics.tsv.gz)
-   [title.crew.tsv.gz](https://datasets.imdbws.com/title.crew.tsv.gz)
-   [title.episode.tsv.gz](https://datasets.imdbws.com/title.episode.tsv.gz)
-   [title.principals.tsv.gz](https://datasets.imdbws.com/title.principals.tsv.gz)
-   [title.ratings.tsv.gz](https://datasets.imdbws.com/title.ratings.tsv.gz)

Download each of these files before uploading them to the Google Cloud Storage bucket you created.

## Dataform Workflow Settings

The `workflow_settings.yaml` file contains the following parameters:

-   `defaultProject`: The Project ID of your Google Cloud project, which you previously took note of
-   `defaultLocation`: Target BigQuery location
-   `defaultDataset`: Name of the BigQuery dataset in which the IMDB tables are to be created
-   `defaultAssertionDataset`: Name of the BigQuery dataset in which any Dataform assertions are to be created and executed
-   `LOAD_GCS_BUCKET`: Name of the Google Cloud Storage bucket, which you previously took note of
-   `RAW_DATA`: Name of the BigQuery dataset into which the IMDB data files are to be loaded
-   `TARGET_DATA`: Name of the BigQuery dataset in which the final transformed IMDB tables are to be located

Here is what an example configuration looks like:

```yaml
dataformCoreVersion: 3.0.0-beta.4
defaultProject: winter-dataform
defaultLocation: australia-southeast1
defaultDataset: imdb
defaultAssertionDataset: imdb_assertions
vars:
    LOAD_GCS_BUCKET: winter-data/imdb
    RAW_DATA: imdb_staging
    TARGET_DATA: imdb
```
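For reference, the manual setup steps above (granting the four IAM roles to the Dataform service account, then downloading and staging the data files) can also be scripted. The sketch below deliberately only *prints* the `gcloud` and `curl` commands it would run, using placeholder project, service account, and bucket names; substitute your own values and remove the leading `echo`s to actually execute it:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values -- substitute your own.
PROJECT_ID="your-project-id"
PROJECT_NUMBER="123456789012"
BUCKET="your-bucket-name"
DATAFORM_SA="service-${PROJECT_NUMBER}@gcp-sa-dataform.iam.gserviceaccount.com"

# The four roles from the numbered IAM steps above.
ROLES=(
  roles/bigquery.jobUser
  roles/bigquery.dataEditor
  roles/bigquery.dataViewer
  roles/storage.objectViewer
)

# Print one add-iam-policy-binding command per role.
for role in "${ROLES[@]}"; do
  echo gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${DATAFORM_SA}" --role="${role}"
done

# The seven IMDB data files listed above.
FILES=(
  name.basics title.akas title.basics title.crew
  title.episode title.principals title.ratings
)

# Print a download command per file, then one upload command for the bucket.
for f in "${FILES[@]}"; do
  echo curl -fsSLO "https://datasets.imdbws.com/${f}.tsv.gz"
done
echo gcloud storage cp "*.tsv.gz" "gs://${BUCKET}/"
```

The scripted route is equivalent to clicking through the console, and is easier to repeat if you set the project up again.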