{"id":24546794,"url":"https://github.com/materials-data-science-and-informatics/metador-push","last_synced_at":"2025-08-25T16:19:06.350Z","repository":{"id":40505139,"uuid":"426229818","full_name":"Materials-Data-Science-and-Informatics/metador-push","owner":"Materials-Data-Science-and-Informatics","description":"The metadata-aware mailbox and structured submission interface for research data submission.","archived":false,"fork":false,"pushed_at":"2023-02-06T14:53:00.000Z","size":959,"stargazers_count":18,"open_issues_count":10,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-18T01:44:46.640Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Materials-Data-Science-and-Informatics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-11-09T12:58:24.000Z","updated_at":"2025-05-13T13:35:43.000Z","dependencies_parsed_at":"2025-04-16T09:19:00.195Z","dependency_job_id":null,"html_url":"https://github.com/Materials-Data-Science-and-Informatics/metador-push","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Materials-Data-Science-and-Informatics/metador-push","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Materials-Data-Science-and-Informatics%2Fmetador-push","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Materials-Data-Science-and-Informatics%2Fmetador-push/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Materials-Data-Science-and-Informatics%2Fmetador-push/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Materials-Data-Science-and-Informatics%2Fmetador-push/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Materials-Data-Science-and-Informatics","download_url":"https://codeload.github.com/Materials-Data-Science-and-Informatics/metador-push/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Materials-Data-Science-and-Informatics%2Fmetador-push/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272094042,"owners_count":24872278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-25T02:00:12.092Z","response_time":1107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-22T22:17:10.927Z","updated_at":"2025-08-25T16:19:06.342Z","avatar_url":"https://github.com/Materials-Data-Science-and-Informatics.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Metador\n\n![Project 
status](https://img.shields.io/badge/status-beta-%23ffff00)\n[\n![Test](https://img.shields.io/github/workflow/status/Materials-Data-Science-and-Informatics/metador/test?label=test)\n](https://github.com/Materials-Data-Science-and-Informatics/metador/actions?query=workflow:test)\n[\n![Coverage](https://img.shields.io/codecov/c/gh/Materials-Data-Science-and-Informatics/metador?token=BJJ15RHNJA)\n](https://app.codecov.io/gh/Materials-Data-Science-and-Informatics/metador)\n[\n![Docs](https://img.shields.io/badge/read-docs-success)\n](https://materials-data-science-and-informatics.github.io/metador/)\n\u003c!-- TODO: dockerhub badge or something like that, if we ever offer prebuilt containers --\u003e\n\n\u003cimg style=\"center-align: middle;\" alt=\"Metador Logo\" src=\"https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/Metador/Metador_Logo_Text.png\" width=70% height=70% /\u003e\n\n**M**etadata **E**nrichment and **T**ransmission **A**ssistance for **D**igital **O**bjects in **R**esearch\n\n## TL;DR\n\n* **Summary:** File upload service with resumable uploads and rich metadata requirements\n* **Purpose:** Comfortably get data from someone else while **enforcing submission of relevant metadata**\n* Easy to set up, no complicated dependencies or requirements\n* Metadata validation based on dataset profiles using file name pattern matching and [JSON Schema](https://json-schema.org/)\n* Authentication via ORCID, with optional allowlist to restrict access\n* Successfully uploaded and annotated datasets can be passed over to some post-processing:\n  - Either launch a script to handle the completed dataset directory,\n  - or notify a different service via HTTP,\n  - (or just collect the (meta)data from your \"data mailbox\" manually)\n\nhttps://user-images.githubusercontent.com/371708/155495188-caa23208-3092-4639-acfc-970f062f98c4.mp4\n\n## Overview\n\nMetador is a metadata-aware mailbox for research data.\n\nLike a real mailbox, it should be really 
simple to set up and use and should not be in your way.\n\nUnlike a real mailbox where any content can be dropped, Metador wants to help you, the\ndata receiver, to make sense of the data, by requiring the uploader to fill out a form of\nyour choice for each file to provide all necessary metadata.\n\n**Thereby, Metador facilitates FAIRification of research data by providing a structured\ninterface to condense implicit contextual domain knowledge into machine-readable\nstructured metadata.**\n\nIf you **formalize** your metadata requirements in the form of JSON Schemas, then Metador\nwill **enforce** those requirements, if you let your collaborators share their data\nwith you through it. At the same time, you are **in full control** of the data, because\nMetador is simple to set up locally and eliminates the need for a middle-man service that\nassists you with moving larger (i.e., multiple GB) files over the internet.\n\nUsing Metador, the sender can upload a dataset (i.e., a collection of files) and must\nprovide metadata for the files according to your requirements. After the upload and\nmetadata annotation are completed, Metador can notify other services to collect and further\nprocess this data (*post-processing hooks*). For example, you can use this to put\ncompleted and fully annotated datasets into your existing in-lab repository or storage, or\napply necessary transformations on the data or metadata. 
This makes Metador **easy to\nintegrate into your existing workflows**.\n\nTo achieve these goals, Metador combines state-of-the-art resumable file-upload technology\nusing [Uppy](https://uppy.io) and the [tus](https://tus.io/) protocol\nwith a [JSON Schema](https://json-schema.org/) driven multi-view metadata editor based on\n[react-jsonschema-form](https://github.com/rjsf-team/react-jsonschema-form)\nand [JSONEditor](https://github.com/josdejong/jsoneditor).\n\n## Why not a different self-hosted file uploader?\n\nTo the best of our knowledge, before starting work on Metador, there was no off-the-shelf\nsolution checking all of the following boxes:\n\n* lightweight (easy to deploy and use on a typical institutional Linux server)\n* supports convenient upload of large files (with progress indication, pauseable/resumable)\n* supports rich and expressive metadata annotation that is **generic** (schema-driven)\n\nIf you care about this combination of features, then Metador is for you.\nIf you do not care about collecting metadata, feel free to pick a different solution.\n\n## Installation\n\n*If you are not a fluent command line user, it is recommended to let your local system\nadministrator set up an instance of Metador for you. 
You should then provide them with the\ndomain-specific dataset profiles and JSON Schemas as it is explained further below.*\n\n* Clone this repository: `git clone git@github.com:Materials-Data-Science-and-Informatics/metador.git`\n\n* Check that you have Python \u003e=3.7 and Node.js \u003e= 14.15 by running `python --version`, `node --version`\n\nIf you do not have a sufficiently recent Python or Node.js version installed,\nuse respectively [`pyenv`](https://github.com/pyenv/pyenv) or [`nvm`](https://github.com/nvm-sh/nvm)\nin order to install a suitable Python or Node.js locally.\n\n* Download and install [`tusd`](https://github.com/tus/tusd) (tested with 1.6.0)\n\nThis is the server component for the Tus protocol that will handle the low-level details\nof the file uploading process. It is downloaded as a OS-specific executable to be placed into\neither `/usr/bin` (global) or `~/.local/bin` (user) directory.\n\n* go to the `frontend` subdirectory in the cloned repository and run `npm install \u0026\u0026 npm run build` to build the frontend.\n\n* go back to the top level directory of the repository and install Metador using `poetry install`, if you use poetry, otherwise use `pip install --user .` (as usual, using a `venv` is recommended).\n\n## Usage\n\n### I want to see it in action, now!\n\n* Ensure that tusd and Metador are installed\nand that the `tusd` and `metador-cli` scripts are on your path, i.e. executable.\n\n* Run `tusd` like this: `tusd -hooks-http \"$(metador-cli tusd-hook-url)\"`\n\n* Run Metador like this: `metador-cli run`\n\n* Navigate to `http://localhost:8000/` in your browser.\n\n### I want to deploy Metador properly! 
(for system administrators)\n\nAs Metador tries to be a general building block and not impose too many assumptions on\nyour setup, only a general overview of the required steps is provided here.\n\nFor serious deployment into an existing infrastructure, the following steps are required:\n\n* Prepare JSON Schemas for the metadata you want to collect for the files.\n\n* Write dataset profiles linking the JSON Schemas to file name patterns (explained below).\n\n* Think about how both `tusd` and Metador will be visible to the outside world.\n  This probably involves a reverse proxy, e.g. Apache or nginx, to serve both applications\n  and take care of HTTPS. Make sure that the public hostname is an alias for localhost in\n  `/etc/hosts`, if running both services on the same machine.\n\n  Whatever your setup looks like, you need to make sure that:\n\n  1. Both `tusd` and Metador are accessible from the client side (notice that by default\n    they run on two different ports, unless you mask that with route rewriting).\n\n  2. The passed hook endpoint URL of Metador is accessible by `tusd`.\n\n  3. 
The file upload directory of `tusd` is accessible (read + write) by Metador.\n\n* Use `metador-cli default-conf \u003e metador.toml` to get a copy of the default config file,\n  add your JSON schemas and\n  at least change the `metadir.site` and `tusd.endpoint` entries according to your\n  planned setup (you will probably at least change the domain, and maybe the ports).\n  You can delete everything in your config that you do not want to override.\n\n* *Optional:* For ORCID integration, you need access to the ORCID public API.\n  If you don't use ORCID, you have to take care of authorization yourself!\n\n* Run `tusd` as required with your setup, passing\n  `-hooks-http \"$(metador-cli tusd-hook-url)\"` as argument.\n\n* Run Metador with your configuration: `metador-cli run --conf YOUR_CONFIG_FILE`\n\n  Metador will use the current directory as the working directory and also look for\n  profiles and the configuration file there, unless told otherwise via CLI interface or\n  configuration settings.\n\n* To start metador and tusd automatically on start of the system or virtual machine,\n  look at the example `.service` files. These must be adapted according to your setup\n  and placed in `/etc/systemd/system`.\n  After doing this, they can be enabled with\n  `systemctl enable metador` and `systemctl enable metador-tusd`,\n  and managed just like any other background service, i.e. the logs that are printed to\n  the standard output then can be inspected with `journalctl`.\n\n### Using HTTPS\n\n**You definitely should set up some kind of encryption! 
Especially if you work with\nsensitive data, classified data, data under an embargo or even just unpublished data!\nIf you do not encrypt your traffic, someone could read it in transit like a postcard!**\n\nTo enable HTTPS, use a reverse proxy that handles traffic encryption for you.\nYou can use the provided example nginx configuration as a starting point.\nRemember to set the `-behind-proxy` flag when starting `tusd` in order to\nhandle proxy headers correctly.\n\nFor testing purposes, you can easily generate a self-signed certificate:\n```\nopenssl req -nodes -x509 -newkey rsa:4096 -keyout cert.key -out cert.pem -days 365\n```\n\nFor production use, get a certificate that is signed by your institution, or get\none from [Let's Encrypt](https://letsencrypt.org/), e.g. by following\n[these](https://www.itzgeek.com/how-tos/linux/debian/how-to-install-lets-encrypt-ssl-certificate-for-nginx-on-debian-11.html)\n[guides](https://stevenwestmoreland.com/2017/11/renewing-certbot-certificates-using-a-systemd-timer.html).\n\n## Setting up ORCID Authentication\n\nFollow the instructions given e.g.\n[here](https://info.orcid.org/documentation/integration-guide/registering-a-public-api-client/).\nAs the redirect URL, you should register the value you get from `metador-cli orcid-redir-url`\n(the output is based on your configuration).\n\nAfterwards, fill out the `orcid` section of your Metador configuration accordingly,\nadding your client ID and secret token.\n\nIf you register on the ORCID sandbox server, do not forget to set `sandbox=true`!\n\n## Deployment using Docker\n\nTo build a Docker image with a pre-configured setup of Metador, tusd and nginx, run:\n\n```\ndocker build -t metador:latest .\n```\n\nPrepare a directory (e.g. 
called `metador_rundir`) that contains the following items:\n\n* your Metador configuration `metador.toml`\n* a `profiles` directory containing your dataset profiles (explained below) and JSON schemas\n* SSL certificate and key named `cert.pem` and `cert.key` valid for the domain you will use for accessing Metador\n\nThis directory is used both for configuration and retrieval of the data.\n\nTo run Metador with your prepared directory `./metador_rundir`, run:\n\n```\ndocker run -it --mount type=bind,source=\"$(pwd)\"/metador_rundir,target=/mnt -p 80:80 -p 443:443 metador:latest\n```\n\nNow you should be able to access Metador on your computer\nby visiting `https://localhost` in your browser.\n\n## Dataset profiles and JSON Schemas\n\nIn your configuration you must provide an existing directory that contains your\ndataset profiles and (local) JSON Schemas that are referenced in the profiles.\n\nA dataset profile must have the following shape:\n```\n{\n  \"title\": \"Dataset Profile Title\",\n  \"description\": \"Short summary of what this dataset profile is intended for\",\n  \"schemas\": {\n    \"SCHEMA_NAME_1\": \u003cJSONSCHEMA\u003e,\n    ...,\n    \"SCHEMA_NAME_N\": \u003cJSONSCHEMA\u003e,\n  },\n  \"rootSchema\": \"SCHEMA_NAME\" | bool,\n  \"patterns\": [\n    {\"pattern\": \".*\\\\.txt\", \"useSchema\": \"SCHEMA_NAME\" | bool},\n    {\"pattern\": \".*\\\\.jpg\", \"useSchema\": \"SCHEMA_NAME\" | bool},\n    {\"pattern\": \".*\\\\.mp4\", \"useSchema\": \"SCHEMA_NAME\" | bool}\n  ],\n  \"fallbackSchema\": \"SCHEMA_NAME\" | bool\n}\n```\n\nThe `title` and `description` keys are self-explanatory.\nThe `schemas` section can be used to embed arbitrary JSON Schemas that are e.g. 
not\nrelevant for other profiles or for some other reason are not stored in a separate file.\n\nThe keys `rootSchema`, `fallbackSchema` and `patterns` are defining the behavior of the\ndataset profile.\n\nThe key `rootSchema` defines the JSON Schema that is defining the metadata required for\nthe dataset itself, i.e. file-independent, general metadata or metadata that applies to\ne.g. all the files (e.g. for reducing the effort for the user).\n\nFor each uploaded file, the filename is matched against the listed `patterns` in the\nprovided order. In case of a full match (i.e. the pattern must match the complete\nfilename) the corresponding schema is used. If no pattern matches, then the\n`fallbackSchema` is applied.\n\nRemember that in the pattern a regex is expected, so characters like `.`, `*` etc. are\ninterpreted as special symbols unless escaped by a backslash. But because backslashes also\nmust be escaped in a string, e.g. in order to match an actual `*` symbol, the pattern must\nbe `\"\\\\*\"`.\n\nAs a schema to be applied, you can use\n\n* a boolean\n* the name of an embedded schema in the `schemas` section\n* a filename of a JSON schema (relative to the profile directory)\n\nSetting the schema to `true` means that arbitrary metadata can be provided. In the UI this\nspecial case is treated by providing the possibility to add arbitrary key-value pairs as\nmetadata.\n\nSetting the schema to `false` would literally mean that no metadata could be valid. 
This\nis not useful, so instead a `false` schema is interpreted as forbidding to use a file with\na name that matched the pattern (in case of `useSchema`) or to upload files that do not\nmatch any pattern (in case of `fallbackSchema`).\n\n## Cleaning up abandoned uploads and datasets\n\nThe upload server tusd will create intermediate files in its own data directory, in\nnormal operation they will be removed/relocated when a file upload is completed.\nIn the case that an upload is abandoned, these intermediate files will stay there forever,\nunless cleaned up.\n\nTo clean up, you can run `metador-cli tusd-cleanup YOUR_TUSD_DATA_DIR`. The command should\nbe launched in the same directory where the server is run and with the same configuration,\nso the tusd directory shall be either relative to that location, or an absolute path.\n\nTo automate this, you can e.g. set up a cronjob to run this script regularily.\n\n## Technical FAQ\n\nThe following never actually asked questions might be of interest to you.\n\n### Feature: Can I upload existing JSON metadata for the files?\n\nThe only purpose of Metador is to enable a human-friendly input form to simplify the\nannotation of the data. If the users happen to have JSON files that are valid according to\nthe required schema, of course you can just switch to the raw JSON Editor view and paste\nthe content there. But if you do already have structured metadata, you most likely do not\nneed Metador.\n\n### Feature: Will there be an API for external tools to automate uploads?\n\nThis service is intended for use by human beings, to send you data that originally has\nad-hoc or lacking metadata. If the dataset is already fully annotated, it probably should\nbe transferred in a different and simpler way offered by your database or repository\nsolution.\n\nIf you insist on using this service mechanically, in fact it is designed to be as RESTful\nas possible so you might try to script an auto-uploader. 
The API is accessible under the\n`/docs` route. For uploads you would also need a\n[tus client](https://tus.io/implementations.html).\nThe only difficult point would be the automated ORCID authentication that you must handle.\nThere is no \"API token\" support, and there probably never will be.\n\n### Feature: Will there be support for e.g. cloud-based storage?\n\nNo, this service is meant to bring annotated data to **your** hard drives, which must be\nlarge enough to store the files at least temporarily.\nYour post-processing script can do with completed datasets whatever it wants, including\nmoving them to arbitrary different locations.\n\n### Feature: Will there be support for authentication mechanisms besides ORCID?\n\nORCID is widely adopted in research and allows signing in using other mechanisms,\nso that in the research domain it should be sufficient. If you or your partners do not\nhave an ORCID yet, maybe now is the time!\n\nIf you want to restrict access to your instance to a narrow circle of persons, instead of\nallowing anyone with an ORCID to use your service, just use the provided **allowlist**\nfunctionality.\n\nIt is **not** planned to add authentication that requires the user to register\nspecifically to use this service. No one likes to create new accounts and invent new\npasswords, and you probably have more important things to do and do not need the\nadditional responsibility of keeping these credentials secure.\n\nIf you, against all advice, want to have a custom authentication mechanism, use\nthis with ORCID disabled, i.e. in \"open mode\", and then restrict access to the service in\na different way.\n\n## Development\n\n### Initial Preparations\n\nTo set up Metador for development, perform the following steps:\n\n1. Go to your Metador project directory and run `poetry install`.\n\n2. Next, run `pre-commit install` to enable the required pre-commit hooks.\n\n3. Create a separate directory (ideally outside the `metador` project directory),\n   e.g. 
call it `metador_rundir`.\n\n4. Symlink the `profiles` from the project directory to your runtime directory or create a\n   new profiles directory with your profiles for testing and development.\n\n5. Create a `metador.toml` config file in the runtime directory and override some default\n   settings. You might want to use a configuration like this:\n```\n[metador.log]\nlevel = 'DEBUG'\nfile = 'metador.log'\n\n[orcid]\nenabled = true\nuse_fake = true\nsandbox = true\n\n[uvicorn]\nreload = true\n```\n\nThe option `use_fake` will enable a dummy sign-on that does not use real ORCID servers and\ndoes not require having an ORCID. When `use_fake` is disabled, but `sandbox` is enabled,\nMetador will use the [ORCID test server](https://sandbox.orcid.org/) instead of\nthe real one. Notice that both the test and production servers of ORCID require\nthe setup described above and a registered user account.\n\n### Running Metador for development\n\nTo start, you should first go to the Metador project directory and run `poetry shell`.\nThen switch to the runtime directory (the one with your `metador.toml`)\nand run `metador-cli run`.\n\nFor frontend development, you can run `npm run dev` in the frontend directory in addition\nto the Metador server to get auto-reload for the frontend. Be sure to use the frontend\nserved by the backend server (by default running on port 8000), though!\n\nBefore committing, run `pytest` and make sure you did not break anything.\n\nTo generate documentation locally, run `pdoc -o docs metador`.\n\nTo check coverage, use `pytest --cov`.\n\nAlso verify that the pre-commit hooks all run and complete successfully.\n\n## Copyright and Licence\n\nSee [LICENSE](./LICENSE).\n\n### Main used libraries and dependencies\n\nThe following libraries are used directly (i.e. 
not only transitively) in this project:\n\n**CLI:** typer (MIT), toml (MIT), colorlog (MIT)\n\n**Backend:** FastAPI (MIT), pydantic (MIT), httpx (BSD-3), tusd (MIT), jsonschema (MIT)\n\n**Backend testing:** pytest (MIT), pytest-cov (MIT), pytest-asyncio (Apache 2), aiotus (Apache 2)\n\n**Frontend:** Svelte (MIT), svelte-fa (MIT), svelte-navigator (MIT), svelte-notifications (MIT),\nsvelte-jsoneditor (ISC), uppy (MIT), Picnic CSS (MIT), Font-Awesome (MIT/CC-BY-4.0), FileSaver.js (MIT), react-jsonschema-form (Apache 2)\n\nMore information is in the documentation of the corresponding packages.\n\n## Acknowledgements\n\n\u003cdiv\u003e\n\u003cimg style=\"vertical-align: middle;\" alt=\"HMC Logo\" src=\"https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/HMC/HMC_Logo_M.png\" width=50% height=50% /\u003e\n\u0026nbsp;\u0026nbsp;\n\u003cimg style=\"vertical-align: middle;\" alt=\"FZJ Logo\" src=\"https://github.com/Materials-Data-Science-and-Informatics/Logos/raw/main/FZJ/FZJ.png\" width=30% height=30% /\u003e\n\u003c/div\u003e\n\u003cbr /\u003e\n\nThis project was developed at the Institute for Materials Data Science and Informatics\n(IAS-9) of the Jülich Research Center and funded by the Helmholtz Metadata Collaboration\n(HMC), an incubator-platform of the Helmholtz Association within the framework of the\nInformation and Data Science strategic initiative.\n\n\u003cimg src=\"https://user-images.githubusercontent.com/371708/156158094-9b5fff08-ec62-4dc3-9715-dfe5be432fa2.png\" width=50% height=50% /\u003e\n\nThis project has received financial support from the European Research Council through the\nERC Grant Agreement No. 
759419 MuDiLingo (‘A Multiscale Dislocation Language for\nData-Driven Materials Science’).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaterials-data-science-and-informatics%2Fmetador-push","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaterials-data-science-and-informatics%2Fmetador-push","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaterials-data-science-and-informatics%2Fmetador-push/lists"}