{"id":23250597,"url":"https://github.com/inrupt/etl-tutorial","last_synced_at":"2025-06-22T11:08:39.991Z","repository":{"id":37525966,"uuid":"447216840","full_name":"inrupt/etl-tutorial","owner":"inrupt","description":"A tutorial showing how to implement an ETL process that Extracts from various data sources (e.g., APIs, local files, JSON objects), Transforms the extracted data to Linked Data, and Loads to Solid Pods (and, optionally, triplestores).","archived":false,"fork":false,"pushed_at":"2024-04-04T16:18:13.000Z","size":29927,"stargazers_count":0,"open_issues_count":2,"forks_count":1,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-06-22T11:08:31.636Z","etag":null,"topics":["etl","linkeddata","pod","rdf","solid","vocabs"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/inrupt.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-12T12:52:43.000Z","updated_at":"2023-10-19T14:31:12.000Z","dependencies_parsed_at":"2025-02-15T20:01:38.379Z","dependency_job_id":null,"html_url":"https://github.com/inrupt/etl-tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/inrupt/etl-tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/inrupt%2Fetl-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/inrupt%2Fetl-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/inrupt%2Fetl-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/inrupt%2Fetl-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/inrupt","download_url":"https://codeload.github.com/inrupt/etl-tutorial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/inrupt%2Fetl-tutorial/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261282320,"owners_count":23134940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl","linkeddata","pod","rdf","solid","vocabs"],"created_at":"2024-12-19T09:14:11.144Z","updated_at":"2025-06-22T11:08:34.980Z","avatar_url":"https://github.com/inrupt.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Inrupt ETL Tutorial\n\nThis repository contains code demonstrating how to Extract, Transform, and\nLoad (ETL) into user Pods from various data sources, including publicly\naccessible 3rd-party data sources, local files, JSON Objects, etc.\n\nDeveloped by [Inrupt, inc](https://www.inrupt.com).\n\n## Prerequisites\n\nThis 
## Prerequisites

This repository is intended for use by experienced TypeScript developers who
are also familiar with running Node.js applications from the command line, and
who are somewhat familiar with the principles of Linked Data and using Solid
Pods.

## Background

To aid in the understanding of Linked Data, which is the foundation for
everything in Solid, we first recommend reading this
[High-level overview of how Solid stores data](docs/LinkedDataOverview/LinkedData-HighLevel.md).

## Quick-start Worksheet

If you want a complete but quick, start-to-finish run-through of the entire
process of installing, configuring, testing, and running the ETL Tutorial,
you can follow our detailed Worksheet instructions
[here](docs/Worksheet/Worksheet.md).

The following instructions provide more background, and go into greater
detail.

## Install and Run

Since you may not yet wish to publicly publish any of the vocabularies we
develop during this tutorial (namely the vocabularies you create on behalf of
3rd-party data sources that don't yet provide RDF vocabularies themselves), we
recommend first generating a local `npm` package that bundles together all the
JavaScript classes representing all the terms from all those vocabularies.

To do this, we run Inrupt's open-source
[Artifact Generator](https://github.com/inrupt/artifact-generator), pointing
it at our local configuration YAML file that references all the local
vocabularies we wish to use terms from (all located in the
[./resources/Vocab](./resources/Vocab) directory), and that bundles together
the generated JavaScript classes containing constants for all the terms from
all of those vocabularies:

```script
npx @inrupt/artifact-generator generate --vocabListFile "resources/Vocab/vocab-etl-tutorial-bundle-all.yml" --outputDirectory "./src/InruptTooling/Vocab/EtlTutorial" --noPrompt --force --publish npmInstallAndBuild
```

**Note:** If you have the Artifact Generator installed locally (e.g., for
faster execution), then you can run it directly:

```script
node ../SDK/artifact-generator/src/index.js generate --vocabListFile "resources/Vocab/vocab-etl-tutorial-bundle-all.yml" --outputDirectory "./src/InruptTooling/Vocab/EtlTutorial" --noPrompt --force --publish npmInstallAndBuild
```

**Note**: During ETL development it's common to re-run the Artifact Generator
regularly, for example after any local vocabulary changes or updates. We can
also keep it running constantly in 'file watcher' mode so that it re-runs
automatically on any local vocab file changes, which can be really convenient.

Since it gets run so regularly, it's generally a good idea to clone and run
the open-source Artifact Generator locally (as that's much faster than using
`npx`).
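To give a feel for what those generated artifacts provide, here is a minimal
sketch of building a Thing with Inrupt's `solid-client` library using a
generated term constant. The import path and the `ETL_TUTORIAL.Passport` term
below are hypothetical - the actual names depend on your Artifact Generator
YAML configuration:

```typescript
import {
  buildThing,
  createThing,
  createSolidDataset,
  setThing,
} from "@inrupt/solid-client";

// Hypothetical import - the real module and constant names come from the
// vocab-etl-tutorial-bundle-all.yml configuration.
import { ETL_TUTORIAL } from "./src/InruptTooling/Vocab/EtlTutorial";

// Build a Thing typed with a generated term IRI (illustrative 'Passport' class).
const passport = buildThing(createThing({ name: "passport" }))
  .addUrl(
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    ETL_TUTORIAL.Passport,
  )
  .build();

// Add the Thing to a new (in-memory) SolidDataset.
const dataset = setThing(createSolidDataset(), passport);
```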
Now install our ETL Tutorial:

```script
npm install
```

Finally, execute the unit tests to ensure everything is configured correctly:

**Note**: You can expect to see error output in the console, as we have **_100%
branch code coverage_**, meaning our tests deliberately force lots of error
situations so that we can test that our code handles all those situations
correctly. What we expect to see is a completely green final Jest report, with
100% coverage across the board.

```script
npm test
```

## End-2-End tests

We provide multiple forms of End-2-End tests to demonstrate and test different
aspects of the overall ETL process in isolation. This clear separation also
helps you understand the various credentials we generally need for an overall
ETL process flow.

For all these tests, and indeed when running the actual ETL process 'for real',
we only need to create and edit a single local environment file to provide the
various credentials needed.

Note, however, that all these End-2-End tests, and the ETL process itself, will
all 'pass' without any such environment file at all, as the ETL code actually
treats the Extraction of data from credentialed sources, and the Loading of
data into Pods or a triplestore, as completely optional (to allow easy testing
of just the Extract phase, or just the Transformation phase). So even if no
credentials are provided at all, everything will still pass, but you'll see
lots of console output saying the Extraction and/or the Loading phases are
being ignored!

### Overview of our End-2-End test suites

The End-2-End test suites we provide are described below, and you can safely
run them now without first creating the local environment file. But without
credentials, you should see console output telling you various steps are being
'ignored':

1. Extract, Transform, and display to console:
   ```script
   npm run e2e-test-node-ExtractTransform-display
   ```
   (Without any credentials, we'll see this test successfully Extract data
   from local copies of 3rd-party data, successfully Transform that data into
   Linked Data, and then display that data to the console, but we'll see it
   ignore Extraction from any 3rd-parties that **require** credentials.)
2. Extract locally, Transform, and Load to Pods (and/or a triplestore, if
   configured):
   ```script
   npm run e2e-test-node-localExtract-TransformLoad
   ```
   (Without any credentials, we'll see this test successfully Extract data
   from local copies of 3rd-party data, successfully Transform that data into
   Linked Data, but then ignore all attempts to Load those resources into any
   Pod or triplestore.)

### Overview of test suites

Here we describe our test suites in a bit more detail...

#### 1. Extract, Transform, and display to console.

Tests that connect to each of our 3rd-party data sources to Extract data,
Transform that extracted data into Linked Data, and then just output some of
that data to the console (rather than Loading it anywhere!). This is to
demonstrate and test **_only_** the Extract and Transform stages of the ETL
process, and so for these tests we don't need to configure or set up anything
to do with Solid Pods or triplestores (since we deliberately don't attempt to
'Load' this Extracted and Transformed data anywhere yet).

#### 2. Extract locally, Transform, and Load to Pods (and/or triplestore).

Tests that read local copies of 3rd-party data (so in this case, we are
deliberately avoiding the need for any credentials to connect to any of our
3rd-party data sources). These tests Transform that local data into Linked
Data, and attempt to Load it into a Solid Pod (and optionally, a triplestore).
In other words, this is for demonstrating and testing **_only_** the
Transformation and Loading phases of the ETL process.
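Conceptually, the local-Extract and Transform steps these suites exercise look
something like the following simplified sketch (this is not the project's
actual code; the file path and predicate IRI are purely illustrative):

```typescript
import { readFileSync } from "node:fs";
import { buildThing, createThing } from "@inrupt/solid-client";

// Extract: read a local copy of a 3rd-party API response (illustrative path).
const apiResponse = JSON.parse(
  readFileSync("resources/test/RealData/RealApiResponse/example.json", "utf-8"),
);

// Transform: convert the raw JSON into Linked Data (predicate IRI illustrative).
const record = buildThing(createThing({ name: "example-record" }))
  .addStringNoLocale("https://schema.org/name", apiResponse.name)
  .build();

// Display: there is no Load phase here at all - just show the result.
console.log(record);
```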
### Create a local-only environment file

To run our ETL Tutorial or execute our End-2-End tests 'for real' (i.e., where
we attempt to Extract real data from actual 3rd-parties, and/or Load data into
real Solid Pods or a triplestore), we need to provide real, valid credentials -
i.e., to allow our application to authenticate with the real APIs of our
3rd-party data sources, and/or to allow our application to write Linked Data to
real users' Solid Pods (and/or, optionally, to a triplestore).

To allow us to do all of this, we simply need to create and configure a single
local environment file, as follows:

1. Make a copy of the `./e2e/node/.env.example` file, naming your copy
   `./e2e/node/.env.test.local`.
2. Now we need to replace the example placeholder values with valid credential
   values, depending on what we wish to do (i.e., run one or more of the
   End-2-End tests, and/or the full ETL process itself).

We can now configure this local environment file in various ways, and re-run
our End-2-End test suites to understand all the possible mix-and-match
variations of ETL.

### Loading into a triplestore

If you are already familiar with triplestores, then perhaps the easiest option
initially is to simply create a new repository in your favorite triplestore
and provide that repository's SPARQL Update endpoint in your local environment
file.

If you are not already familiar with triplestores, you can safely ignore this
entire section (although registering, downloading, and running a free
triplestore can literally take less than 10 minutes - see the detailed
instructions [here](./docs/VisualizePodData/VisualizePodData.md)).

#### Configuring your triplestore

For example, if your favored triplestore is
[Ontotext's GraphDB](https://www.ontotext.com/products/graphdb/), and you
created a default repository named `inrupt-etl-tutorial`, then you'd simply
need to add the following value to your `.env.test.local` environment file:

```
INRUPT_TRIPLESTORE_ENDPOINT_UPDATE="http://localhost:7200/repositories/inrupt-etl-tutorial/statements"
```

Now when you run the End-2-End tests that Load (there's little point in
re-running the 'only-display-to-console' test suite, since it deliberately
never attempts to Load any data anywhere anyway)...

```script
npm run e2e-test-node-localExtract-TransformLoad
```

...you should see console output informing you that our ETL application
successfully Loaded resources into your triplestore repository. If your
triplestore is not running, or you provided an incorrect SPARQL Update
endpoint, then this test suite should fail (for example, with
`connect ECONNREFUSED` errors).

Assuming the test suite passes, you should now be able to see and query the
Loaded data using your familiar triplestore tools. For example, simply running
a `select * where { ?s ?p ?o }` SPARQL query should show you all our ETL'ed
data. (If you like, you can turn off the 'Include inferred data in the
results' option (i.e., using the '>>' icon on the right-hand side of the
'SPARQL Query & Update' user interface) - this will reduce the noise of the
default inferred triples in all search results.)

**Note**: At this stage (i.e., by only configuring our triplestore via the ETL
environment file), all triples will be added directly to the `default` Named
Graph. For more information on how to populate separate Named Graphs per user,
see later in this documentation.
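Under the hood, Loading into a triplestore amounts to POSTing SPARQL Update
requests to that endpoint. The following is a minimal sketch of such a request
(not the project's actual Load code), using Node's built-in `fetch` and the
same environment variable as above; the triple itself is illustrative:

```typescript
// Minimal sketch of a SPARQL Update 'Load' (assumes Node 18+ for global fetch).
async function loadTriple(): Promise<void> {
  const endpoint = process.env.INRUPT_TRIPLESTORE_ENDPOINT_UPDATE;
  if (!endpoint) {
    throw new Error("No SPARQL Update endpoint configured");
  }

  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/sparql-update" },
    // Illustrative triple - the real ETL writes the Transformed Linked Data.
    body: `INSERT DATA { <https://example.com/thing/1> <https://schema.org/name> "Example" . }`,
  });

  if (!response.ok) {
    throw new Error(`Load failed: ${response.status}`);
  }
}
```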
If your triplestore supports visualizing triples (as GraphDB does), then our
data can already be intuitively inspected and navigated, starting at the
following test node, which our test suite generates by default if no Solid Pod
storage root was configured:
`https://different.domain.example.com/testStorageRoot/private/inrupt/etl-tutorial/etl-run-1/`

**Note**: One thing you might notice is that the triplestore does not contain
any Linked Data Platform (LDP) containment triples (e.g., no `ldp:contains` or
`ldp:BasicContainer`, etc., triples). This is because this test specifically
Loaded data into a raw triplestore, which has no inherent notion of LDP
containment. We'll see later that Loading the same resources into a Solid Pod
does result in these containment triples, since they'll have been created by
virtue of Solid servers (currently) also being LDP servers.

### Running just Extract and Transform

The test suite `e2e/node/ExtractTransform-display.test.ts` tests the
Extraction of data from each of our 3rd-party data sources, Transforms that
Extracted data into Linked Data, and then displays it to the console for
manual, visual verification (i.e., it deliberately does **_not_** attempt to
Load this Transformed data anywhere, such as a Solid Pod or a triplestore).

To execute this test suite, run this script from the root directory:

```script
npm run e2e-test-node-ExtractTransform-display
```

**Note**: We can still run these tests without any environment file at all,
but the code simply won't attempt any Extraction or Loading.

If the credentials you supplied were all valid, you should see data displayed
on-screen, with colorful console output (via the
[debug](https://www.npmjs.com/package/debug) library) from all data sources that
have configured credentials. Data sources without credentials are simply
ignored, so these tests are convenient for testing individual data sources in
isolation (i.e., simply comment out the credentials for the other data sources),
or collectively.

### Running the ETL application 'for real'

To execute the entire ETL process 'for real' (i.e., hitting the 3rd-party APIs
and populating real Solid Pods (and optionally also a triplestore)), we run
the application from the command line, and drive it via credential resources
stored as Linked Data.

For example, we can have user credential resources as local Turtle files (or
we could extend this to store these credentials as resources in user Pods),
one per user, configured with that user's API credentials for each of the
3rd-party data sources that that user has access to (each individual user may
have credentials for none, some, or all, of the data sources), and also
providing their Solid Pod credentials, such as their WebID and storage root,
and the ETL application registration credentials (see
[below](#registering-our-etl-application-for-each-user)).

**_Note:_** Within these user-specific credential resources we can also
provide a SPARQL Update endpoint URL for a triplestore, and a Named Graph IRI
to represent that user's separate 'Pod' within that triplestore repository.
This allows us to populate multiple users' data in a single triplestore
instance, with each user's Pod isolated by having its data in its own Named
Graph. If no Named Graph value is provided, then that user's data will be
loaded into the 'default' graph of the triplestore, which is really only
useful when running the ETL for a single user (as loading multiple users would
just result in each user's data overwriting the data of the previously ETL'ed
user).
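To make the Named Graph isolation concrete, here is a sketch of the same kind
of SPARQL Update request as before, but with the user's data scoped to their
own Named Graph (all IRIs are illustrative):

```typescript
// Hypothetical per-user Named Graph IRI, e.g., taken from that user's
// credential resource.
const userGraph = "https://example.com/users/alice/pod";

// Wrapping the data in a GRAPH block keeps each user's triples isolated.
const sparqlUpdate = `
  INSERT DATA {
    GRAPH <${userGraph}> {
      <https://example.com/thing/1> <https://schema.org/name> "Example" .
    }
  }`;
```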
## ETL Application

Our ETL process runs as an automated application - one that individual end
users need to specifically grant access to, to allow that application to Load
data into their Pods on their behalf. (Note: if an Enterprise provides their
users with Pods, then that Pod provisioning phase can automate the granting of
that permission, so the actual end users themselves may never need to take any
specific action here at all.)

To allow the ETL application to be granted access to any Pod, it needs to have
an identifier (i.e., it needs a WebID). The easiest way to do this is simply to
create a new Pod for **_your_** ETL Tutorial application.

**Note**: [YopMail](https://yopmail.com/en/) is a very convenient, easy-to-use
tool that can be used to create 'burner' email addresses for creating
development or test accounts.

**Note**: [Secure Password Generator](https://passwordsgenerator.net/) is a very
convenient, easy-to-use tool that can be used to create secure 'burner'
passwords for creating development or test accounts.

### Registering our ETL application for each user

In order for our ETL application to populate any user's Pod, the application
itself must first be registered. This simple registration process will generate
standard OAuth `Client ID` and `Client Secret` values that our application will
use to authenticate itself, allowing it to access individual users' Pods to
Load their respective data.

1. Go to the `registration` endpoint of the user's Identity Provider. For
   example, for Pods registered with Inrupt's PodSpaces, that would be:

   ```
   https://login.inrupt.com/registration.html
   ```

2. Log in as the ETL user.
3. After successful login, the "Inrupt Application Registration" is
   redisplayed.
4. In the "Register an App" field, enter a descriptive name for our ETL
   application, and click the "REGISTER" button.
5. After registration, record the displayed `Client ID` and `Client Secret`
   values, which we'll need in the next step.
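For reference, these `Client ID` and `Client Secret` values are what allow the
application to authenticate non-interactively. Here is a minimal sketch of
such a login using Inrupt's `@inrupt/solid-client-authn-node` library (the
credential values shown are placeholders):

```typescript
import { Session } from "@inrupt/solid-client-authn-node";

async function loginAsEtlApp(): Promise<Session> {
  const session = new Session();

  // Authenticate using the Client ID / Client Secret recorded at registration.
  await session.login({
    clientId: "PASTE-CLIENT-ID-HERE",
    clientSecret: "PASTE-CLIENT-SECRET-HERE",
    oidcIssuer: "https://login.inrupt.com",
  });

  // session.fetch is now an authenticated fetch that can read from, and write
  // to, any Pod resources this application has been granted access to.
  return session;
}
```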
### Providing a Turtle credential file for the ETL application

Our ETL application needs credentials with which it can connect to user Pods, so
that (once authorized by each user) it can then Load user-specific data into
those Pods.

The easiest way to provide these credentials is to use a local Turtle file. An
example Turtle file is provided here:
`resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl`.

For more detailed instructions, see the
[README.md](resources/CredentialResource/RegisteredApp/README.md) file in that
directory.

### Providing a Turtle credential file per user

We can drive our ETL application using a credential resource per user, and the
easiest way to provide these resources is to use local Turtle files - one per
user (these could also be stored as resources within each user's individual
Pod!).

1. Make a copy of the example user credentials Turtle file
   `resources/CredentialResource/User/example-user-credential.ttl` in the same
   directory.
2. Name the copied file using a simple naming convention such as
   `user-credential-<USER-NAME>.ttl`.
3. Repeat this process, once for each user, filling in that user's 3rd-party
   API and Solid Pod credentials as appropriate for each user (if a user
   doesn't have credentials for a particular data source, simply leave out
   those credentials, or provide empty string values - our ETL application will
   skip that data source for that user).

### Executing the ETL application

Make sure the project is successfully compiled to JavaScript:

```script
npm run build
```

...and then run it from the command line:

```script
node dist/index.js runEtl --etlCredentialResource <resources/CredentialResource/RegisteredApp/MY-REGISTERED-APP-CREDENTIAL-FILE.ttl> --localUserCredentialResourceGlob "resources/CredentialResource/User/user-credential-*.ttl"
```

**_Note_**: Make sure to enclose the directory and glob pattern matching your
user credential file naming convention in double quote characters, so that any
wildcard characters (like asterisks or question marks) are passed through to
the application rather than being expanded by your shell.

### Using `ts-node` to speed up running the ETL application repeatedly

Since this project is TypeScript, it can be very convenient to use `ts-node`
so that we don't have to repeatedly re-run the TypeScript compilation step.
Here we install `ts-node` globally, but you don't have to do this - you can
install it locally too, or you could choose to use `npx`:

```script
npm install -g ts-node typescript '@types/node'
```

Now the entire ETL application can be run for multiple users with just a single
command:

```script
ts-node src/index.ts runEtl --etlCredentialResource <resources/CredentialResource/RegisteredApp/MY-REGISTERED-APP-CREDENTIAL-FILE.ttl> --localUserCredentialResourceGlob "resources/CredentialResource/User/user-credential-*.ttl"
```

**_Note_**: As before, make sure to enclose the glob pattern in double quote
characters so that wildcards are interpreted by the application, not your
shell.
### Running Extract, Transform, and Load (from local data only)

The next set of End-2-End tests runs the full ETL process, and can populate
both real Solid Pods and a triplestore, but is fed from local copies of data
source API responses. These are convenient if we don't wish to continually be
hitting the 3rd-party data sources to retrieve information (e.g., some data
sources can have rate limits, or actually charge per API invocation).

Being able to run the entire ETL process from local copies of API responses is
very convenient during active development, or to run purely locally without
any need for an internet connection (e.g., to populate a locally running
triplestore, or Pods hosted on a locally running Solid server).

For these tests to run, though, we need local copies of real API responses.
Currently, the tests are hard-coded to look for specifically named JSON files
in the directory `resources/test/RealData/RealApiResponse/` (see the imports
at the top of the test file `e2e/node/localExtract-TransformLoad.test.ts` for
the expected filenames).

To run these tests, execute this script from the root directory:

```script
npm run e2e-test-node-localExtract-TransformLoad
```

If the credentials you supplied are all valid, you should see data displayed
on-screen (with colorful console output via the
[debug](https://www.npmjs.com/package/debug) library) from all data sources
that have configured credentials.
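That colorful, per-data-source console output comes from the `debug` library's
namespacing. A minimal sketch of the pattern (the namespace string here is
illustrative, not necessarily the one this project uses):

```typescript
import createDebugger from "debug";

// Each module creates its own namespaced logger.
const debug = createDebugger("etl-tutorial:extract");

// Output only appears when the namespace is enabled, e.g.:
//   DEBUG=etl-tutorial:* npm test
debug("Extracted [%d] records from data source [%s]", 42, "ExampleSource");
```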
## Publishing simulated events to a user Pod

The ETL application can publish simulated events to any user Pod, using the
command-line command `publishEvent`. Events are represented as JSON files,
with the format of each event determined by the data source that produces
the event (e.g., Flume for water-usage or leak-related events, or Sense for
device or electricity-usage events like devices turning on or off).

There are a number of example event files provided in the directory
`./resources/test/DummyData/DummyEvent`.

We need to provide command-line switches with references to the ETL
application's credential resource, the targeted user's credential resource,
and the event filename:

```
ts-node src/index.ts publishEvent --etlCredentialResource <MY-REGISTERED-APP-CREDENTIAL-FILE.ttl> --userCredentialResource <USER-CREDENTIALS.ttl> --eventResource ../../resources/test/DummyData/DummyEvent/Flume/dummy-event-flume-leak.json
```

## Generating HTML documentation

The Inrupt [Artifact Generator](https://github.com/inrupt/artifact-generator)
integrates closely with a sophisticated open-source documentation-generating
tool named [Widoco](https://github.com/dgarijo/Widoco).

Widoco automatically generates a detailed website describing all the
information contained in a vocabulary (it can actually be configured to
generate a website per human language, meaning if our vocabularies have
descriptive meta-data in English, French, Spanish, etc., Widoco can generate
websites in each of those languages, with very convenient links between them
all).

**_Note_**: Widoco is a Java application, and requires Java version 8 or
higher to be installed on your machine. See
[here](https://github.com/inrupt/artifact-generator/blob/main/documentation/feature-overview.md#to-generate-human-readable-documentation-for-a-vocabulary-using-widoco)
for installation and setup guidance.

To tell the Artifact Generator to generate Widoco documentation, simply add
the following command-line switch:

```
--runWidoco
```

Websites will automatically be generated in the
`<OUTPUT DIRECTORY>/Generated/Widoco` directory (e.g., in
`./src/InruptTooling/Vocab/EtlTutorial/Generated/Widoco`).

Documentation can be generated in multiple human languages, configured via the
Artifact Generator configuration file using the `widocoLanguages` field, by
providing 2-character language codes with hyphen separators (e.g., `en-es-fr`
for English, Spanish, and French documentation).

For example usage, see our local configuration file here:
[vocab-etl-tutorial-bundle-all.yml](resources/Vocab/vocab-etl-tutorial-bundle-all.yml).

If you successfully ran the Artifact Generator locally with the `--runWidoco`
switch, you should see the documentation for our local vocabularies
**in Spanish**
[here](src/InruptTooling/Vocab/EtlTutorial/Generated/Widoco/etl-tutorial/index-es.html).

## Postman collections to invoke APIs

Postman provides a very convenient means to quickly and easily make API calls,
and we've provided example Postman collections (version 2.1 format) in the
directory [./resources/Postman](./resources/Postman). Each collection provides
a number of sample API calls.

### Credentials provided by environment variables

We've been careful not to include any credentials in our Postman collections,
instead relying on environment variables. To see which environment variables
are required for the various data sources, see the example environment file we
provide for running our End-2-End tests [here](e2e/node/.env.example).

### Making Postman calls

The general approach for invoking 3rd-party APIs is to first request a fresh
access token, which generally requires providing identifying credentials, such
as a username and password. This access token is then generally provided as
the value of the HTTP `Authorization:` header in all subsequent API calls.

Our collections automatically store the data-source-specific access token in
internal environment variables using the 'Tests' feature of Postman. For
example, for EagleView, open the `https://webservices-integrations.eagleview.com/Token`
request in Postman and look at the 'Tests' tab to see the code that copies the
access token from a successful authentication request into the `eagleViewAuthToken`
environment variable. This environment variable is then used in subsequent
EagleView calls, such as the
`https://webservices-integrations.eagleview.com/v2/Product/GetAvailableProducts`
request, where it is specified as the `Authorization:` header value (look in
the 'Headers' tab).
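The same token-then-call pattern is easy to reproduce in code. Here is a
minimal sketch using Node's built-in `fetch`; the EagleView URLs are the ones
from the Postman collection, but the request body fields and token response
shape are illustrative assumptions - check the data source's API documentation
for the real request format:

```typescript
async function getAvailableProducts(username: string, password: string) {
  // Step 1: request a fresh access token (grant parameters are assumptions).
  const tokenResponse = await fetch(
    "https://webservices-integrations.eagleview.com/Token",
    {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: new URLSearchParams({ grant_type: "password", username, password }),
    },
  );
  const { access_token: accessToken } = (await tokenResponse.json()) as {
    access_token: string;
  };

  // Step 2: pass the token as the Authorization header on subsequent calls.
  return fetch(
    "https://webservices-integrations.eagleview.com/v2/Product/GetAvailableProducts",
    { headers: { Authorization: `Bearer ${accessToken}` } },
  );
}
```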
## Advanced Vocabulary Management

For details on how to efficiently update even remotely published vocabularies
from 3rd-parties, see
[Advanced Vocabulary management](docs/AdvancedVocabManagement/AdvancedVocabularyManagement.md).

## Contents

["_People think [Linked Data] is a pain because it is complicated._"](https://book.validatingrdf.com/bookHtml005.html) - Dan Brickley (Google and Schema.org) and Libby Miller (BBC).

**Vocabularies:**

- Schema.org (from Google, Microsoft, and Yahoo): [RDF](https://schema.org/docs/developers.html#defs)
- Inrupt: [Common Terms](https://github.com/inrupt/solid-common-vocab-rdf/blob/main/inrupt-rdf/Core/CopyOfVocab/inrupt-common.ttl)
- 3rd-parties: [Vocabulary per data source](./resources/Vocab/ThirdParty/CopyOfVocab)

**Generated artifacts:**

- After running the Artifact Generator as described in the install instructions
  above, a series of artifacts will be generated in
  [./src/InruptTooling/Vocab/EtlTutorial](./src/InruptTooling/Vocab/EtlTutorial).
- You'll notice multiple 'forms' of generated artifacts here, for both Java
  and JavaScript (see
  [this detailed description of the various forms of artifact](https://github.com/inrupt/artifact-generator/blob/main/documentation/multiple-forms-of-artifact.md)).
- Initially we'll use the simplest form (just term identifiers as simple
  strings), but we'll build up to demonstrate also using RDF library IRI
  forms, and finally to using the form that provides full programmatic access
  to all term meta-data (e.g., each term's labels and descriptions (available
  in multiple human languages), 'seeAlso' links, etc.).

**Generated documentation:**

- If you installed Widoco, and ran the Artifact Generator command from the
  install instructions above with the optional `--runWidoco` switch, the
  generator will also generate really nice HTML documentation for each of the
  vocabularies we use in this project (and do so in both English and Spanish).
- This documentation is in the form of an entire website per vocabulary, with
  each website generated under
  [./src/InruptTooling/Vocab/EtlTutorial/Generated/Widoco](./src/InruptTooling/Vocab/EtlTutorial/Generated/Widoco).
- To open any of these websites, browse to the `./index-en.html` file in the
  root of each directory.
  **_Note:_** Notice that the documentation is generated in both English and
  Spanish (language selection is available in the top-right-hand corner of the
  vocabulary webpage), as all our vocabularies describe themselves and the
  terms they contain in both of those languages (see our
  [Companies House vocabulary](./resources/Vocab/ThirdParty/CopyOfVocab/inrupt-3rd-party-companies-house-uk.ttl)
  for instance).

**Postman Collections for API calls:**

- [Postman collections](./resources/Postman)
- We've been careful not to include any credentials in our Postman
  collections, instead relying on environment variables to provide these.

**Extract, Transform, and Load (ETL):**

- TypeScript modules for external data sources (all under
  [./src/dataSource](./src/dataSource)):

  - Companies House UK.
  - Local Passport data as JSON.

- **_Note:_** We have 100% branch code coverage, which we plan to maintain
  throughout this project.
**Real-time data (example using Sense electricity monitor):**

- **_Note:_** Currently _NOT_ included - but we expect to add it soon!

- WebSocket client for Sense: [./src/websocket](./src/websocket)

- To run, you need to provide valid credentials for Sense in the End-2-End
  environment file (i.e., in your local `./e2e/node/.env.test.local` file,
  within which you provide credentials for all data sources), and then run:

  ```
  cd ./src/websocket/Sense
  npm i
  node index.js
  ```

  You should see a successful authentication to Sense, and then live updates
  every 500 milliseconds from the Sense monitor.
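For a feel of what such a real-time client involves, here is a bare-bones
WebSocket sketch using the `ws` package. The endpoint URL and token handling
are placeholders - the real Sense client in [./src/websocket](./src/websocket)
handles authentication and the actual feed format:

```typescript
import WebSocket from "ws";

// Placeholder endpoint - a real client would first authenticate, then connect
// to the data source's live feed using the token it received.
const socket = new WebSocket(
  "wss://realtime.example.com/feed?access_token=PASTE-TOKEN-HERE",
);

socket.on("open", () => console.log("Connected to live feed"));

// Each message is a JSON snapshot of current readings (shape is illustrative).
socket.on("message", (data) => {
  const reading = JSON.parse(data.toString());
  console.log("Live update:", reading);
});
```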
## References

**Vocabularies**

- Inrupt's vocabulary repository (**_public_**):
  [inrupt/solid-common-vocab-rdf](https://github.com/inrupt/solid-common-vocab-rdf)
- Inrupt's Artifact Generator (**_public_**):
  [inrupt/artifact-generator](https://github.com/inrupt/artifact-generator)
- Inrupt's published artifacts:
  - Java: Cloudsmith [SDK Development](https://cloudsmith.io/~inrupt/repos/sdk-development/packages/)
    (**_private_** - we intend to open-source, based on demand)
  - JavaScript: [npmjs.org](https://www.npmjs.com/search?q=%40inrupt%2Fvocab-) (**_public_**)

**Utilities**

- Bash scripts to manage Artifact Generator configuration files (YAMLs) (**_public_**):
  [inrupt/solid-common-vocab-script](https://github.com/inrupt/solid-common-vocab-script)
- Widoco documentation generation (integrated with Inrupt's Artifact Generator) (**_public_**):
  [Widoco](https://github.com/dgarijo/Widoco)

**Libraries**

- JavaScript:
  - Linked Data (RDF) library:
    [solid-client-js](https://github.com/inrupt/solid-client-js)
  - Solid Authentication Library:
    [solid-client-authn-js](https://github.com/inrupt/solid-client-authn-js)
- Java:
  - Linked Data (RDF) library: (**_private_** - we intend to open-source, based
    on demand)

## Potential Future Work

- We should probably only delete Containers for data sources that are being
  re-loaded (so that I can re-run just for InsideMaps by leaving all other
  creds empty). But then, how would I delete or remove entire data source
  Containers?

- Error handling - should the ETL tool exit when it encounters an error
  reading from a data source? Should it just fail for the current user, or the
  entire process?

- Investigate getting local government public data, e.g., perhaps local areas
  publish their local water charge rates (so we could query that data in
  real-time to provide water charges based on the user's "census_tract"
  field).

- User preferences, for things like gallons vs liters, or feet vs meters.
  Ideally these should be accessible by any app accessing the Pod, but if not
  yet defined in the Pod, the user could stipulate they wish to create these
  Pod-wide preferences based on current preferences they may have set.

- Drive the ETL process based on direct user input via a WebApp, to do things
  like:
  - Manually enter credentials for pre-defined data sources.
  - Detect discrepancies in preferences between data sources (e.g., the user
    has set imperial as their Pod-wide choice, but the preference set within a
    specific data source is metric). Alert the user to this, and allow them to
    update their Pod-wide preference if they choose.

## Knowledge Transfer

- Runs as a new CLI option (i.e., `--runEod`, for 'Run End-of-Day').
  - Mandatory param: `--etlCredentialResource`
  - Mandatory param: `--localUserCredentialResourceGlob`
  - Optional param: `--localUserCredentialResourceGlobIgnore`
  - Optional param: `--fromDate` (defaults to yesterday)
  - Optional param: `--toDate` (defaults to yesterday)
- This code is also called as part of the normal `--runEtl` process for each
  user, and defaults to yesterday.

## Frequently Asked Questions

- Is it OK to create the same `Thing` or `Container` multiple times?

  - A: Yes.

- Should I extract API responses from multiple data sources and then do a single
  transform?

  - A: It's just personal choice, really. Personally I don't think I would, as
    generally all data sources should be isolated from each other, and know
    nothing of each other. So I would perform the ETL for each source
    independently, and Load each data source's usage data into its own
    resources in the Pod.

- Where should we put partner credential secrets?

  - A: It may be easiest (and perhaps most appropriate) to simply add these
    credentials to the existing registered application credentials resource
    (e.g.,
    [here](resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl)).
    Alternatively, I already added a placeholder partner-credential resource
    here:
    `resources/CredentialResource/Partner/example-partner-credential.ttl`, but
    the ETL tool would need to be extended to look for, read from, and use, any
    values from this resource.

- What is `owl:Property`?

  - A: It's a fairly common vocabulary practice to include simple metadata from
    OWL for things like Classes, Properties, and NamedIndividuals, but it's
    certainly not required. I only included them in the vocabs for this project
    because the documentation tool [Widoco](https://github.com/dgarijo/Widoco)
    looks for them when generating its HTML documentation for vocabularies.

- How much code should be shared between one-off jobs and `runEtl`?

  - A: In general, I'd try to reuse code as much as possible. Commonly, EOD
    jobs run either reconciliation or aggregation functions, and so
    conceptually it can be very useful to run them directly after all ETL jobs.
    It certainly makes sense in our case to run the EOD process directly after
    the initial ETL so that the user can immediately see yesterday's usage
    statistics as soon as their Pod is created (as opposed to having to wait
    until the following day for the EOD job to run that night).

- What are the relationships between resources, datasets, containers, and
  things?

  - A: 'Resource' is the most general term we use to refer to any 'entity' in a
    Solid Pod, so it could be a Container, an RDF resource, or a binary Blob.

    A 'Dataset' is the formal name for a collection of RDF quads, and Solid
    Pods are really made up of **only** RDF Datasets and binary Blobs.

    'Containers' are simply RDF Datasets that logically 'contain' other
    Resources. The name basically comes from the W3C Linked Data Platform
    (LDP) standard (but that standard should just be seen as a guide - i.e.,
    Solid took initial inspiration from it, but may move away from formally
    requiring compliance with it).

    A 'Thing' is simply a collection of RDF triples where all the triples have
    the exact same RDF Subject IRI. It can be a convenient conceptual model when
    working with RDF, as we often wish to read or write a number of properties
    of a single 'thing' or entity at once.

- When should we call `getThing` as opposed to `getSolidDataset`?
  - A: We use `getThing` to extract a very specific collection of triples from
    an RDF Dataset (i.e., from a SolidDataset), where all the triples in that
    collection have the exact same RDF Subject IRI value (which is the IRI value
    we pass as the 2nd parameter to the `getThing` function). This is useful
    because a single Dataset can contain any number of triples with differing
    RDF Subject values (i.e., a single Dataset can have triples describing
    multiple 'things').
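A minimal sketch of that `getSolidDataset`/`getThing` relationship, using
Inrupt's `@inrupt/solid-client` library (the resource IRI, Subject IRI, and
property IRI are all illustrative):

```typescript
import {
  getSolidDataset,
  getThing,
  getStringNoLocale,
} from "@inrupt/solid-client";

async function readName(): Promise<string | null> {
  // Fetch the entire SolidDataset (it may describe many 'things').
  const dataset = await getSolidDataset("https://example.com/pod/profile");

  // Extract just the triples whose Subject is this one specific IRI.
  const me = getThing(dataset, "https://example.com/pod/profile#me");
  if (me === null) {
    return null;
  }

  // Read one property of that single Thing.
  return getStringNoLocale(me, "https://schema.org/name");
}
```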
## Changelog

See [the release notes](https://github.com/inrupt/etl-tutorial/blob/main/CHANGELOG.md).

## License

MIT © [Inrupt](https://inrupt.com)