{"id":48054997,"url":"https://github.com/poseidon-framework/poseidon-schema","last_synced_at":"2026-04-04T14:25:43.764Z","repository":{"id":43375217,"uuid":"278626197","full_name":"poseidon-framework/poseidon-schema","owner":"poseidon-framework","description":"An archaeogenetic genotype data organisation file format","archived":false,"fork":false,"pushed_at":"2026-03-23T05:06:40.000Z","size":1208,"stargazers_count":3,"open_issues_count":8,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2026-03-23T23:59:43.308Z","etag":null,"topics":["adna","ancient-dna","data-format","standardization"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/poseidon-framework.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-07-10T12:19:47.000Z","updated_at":"2026-03-23T05:06:43.000Z","dependencies_parsed_at":"2023-12-19T17:49:31.595Z","dependency_job_id":"cdfbc287-89e4-4bad-9368-a4d285446c10","html_url":"https://github.com/poseidon-framework/poseidon-schema","commit_stats":null,"previous_names":["poseidon-framework/poseidon-schema"],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/poseidon-framework/poseidon-schema","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poseidon-framework%2Fposeidon-schema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poseidon-framework%2Fposeidon-schema
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poseidon-framework%2Fposeidon-schema/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poseidon-framework%2Fposeidon-schema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/poseidon-framework","download_url":"https://codeload.github.com/poseidon-framework/poseidon-schema/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/poseidon-framework%2Fposeidon-schema/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31402415,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adna","ancient-dna","data-format","standardization"],"created_at":"2026-04-04T14:25:38.126Z","updated_at":"2026-04-04T14:25:43.753Z","avatar_url":"https://github.com/poseidon-framework.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## The Poseidon Standard v3.0.0\n\nPoseidon is a solution for archaeogenetic genotype data organisation. 
It is geared towards human data, but is to a large extent species-agnostic and can be used to track archaeogenetic data also of non-human species.\n\nThis standard defines a data structure: the **Poseidon package**. A Poseidon package stores genotype data with meta- and context information.\n\nA .pdf version of the latest instance of this document can be downloaded [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/poseidon_package_specification.pdf).\n\nFurther details on [genotype data](https://poseidon-framework.github.io/#/genotype_data), the [.janno file](https://poseidon-framework.github.io/#/janno_details) and the [.ssf file](https://poseidon-framework.github.io/#/ssf_details) are documented on the Poseidon website.\n\nA changelog documents the changes across different schema versions [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/changelog.md).\n\nThe key words *MUST*, *MUST NOT*, *REQUIRED*, *SHALL*, *SHALL NOT*, *SHOULD*, *SHOULD NOT*, *RECOMMENDED*, *MAY*, and *OPTIONAL* in this document are to be interpreted as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).\n\n### Primary entities of a Poseidon package\n\nThe main operational entities in a Poseidon package are discrete sets of genotype data attributed to a single human or non-human individual, scientifically generated for archaeogenetic research questions. Within a Poseidon package each of these sets gets attributed a unique identifier: the `Poseidon_ID`.\n\nGenerally, archaeogenetics operates on depositional contexts, e.g. graves, with one or multiple (ancient) human or non-human individuals. Usually, it is possible to attribute the (skeletal) remains within these contexts to individuals based on archaeological evidence and physical-anthropological analysis. 
Each individual can get sampled one or multiple times, either by directly probing their preserved tissue, or by sampling any reagent that contains their DNA (through whatever pathway or taphonomic process). From one such sample one or multiple extracts can be derived, which can be transformed into one or multiple libraries, which may or may not be subjected to a DNA capture protocol and then sequenced one or multiple times. The raw sequencing data can undergo various different forms of computational processing and eventually genotyping to produce the data relevant for most derived analyses and thus stored in a Poseidon package.\n\nWhile the wetlab-processes yield a relatively predictable tree of separate physical and digital products for any given sample, the computational data-processing breaks the conceptual tree-ness by allowing for arbitrary conflation of sequencing data obtained through potentially separate means: Data from different libraries, for example, may be merged if they are from the same individual, even if they are not from the same sample.\n\n`Poseidon_ID`s therefore represent one consciously selected end-point in the complex data preparation graph laid out above. Typically this end-point corresponds to an optimal result for a given individual, research question and publication.\n\nFor the sake of convenience and despite the lack of conceptual clarity, below we sometimes use the term *sample* to denote `Poseidon_ID` entities. 
Data aggregation on the level of physical samples is often sensible, and the term is conventionally used for analysis endpoints in the community of practice.\n\n### The Poseidon package structure\n\nA Poseidon package is defined by the POSEIDON.yml file, which holds relative paths to all other files in the package.\n\nA package therefore MUST contain:\n\n- A `POSEIDON.yml` file to formally define the package\n- Genotype data in PLINK, EIGENSTRAT or VCF format\n\nIt SHOULD additionally contain:\n\n- A `.janno` file to store context information on spatiotemporal origin or sample quality\n- A `.bib` file for literature references\n\nIt MAY also contain:\n\n- A `README.md` file for arbitrary, additional context information\n- A `CHANGELOG.md` file to document changes to the package\n- A `.ssf` file with information on the underlying raw sequencing data\n\nHere is an example of a package `Switzerland_LNBA_Roswita` in one directory:\n\n```default\nSwitzerland_LNBA_Roswita/POSEIDON.yml\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.bed\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.bim\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.fam\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.janno\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.ssf\nSwitzerland_LNBA_Roswita/Switzerland_LNBA.bib\nSwitzerland_LNBA_Roswita/README.md\nSwitzerland_LNBA_Roswita/CHANGELOG.md\n```\n\n### Text encoding\n\nAll text files in the package MUST be UTF-8 encoded. They SHOULD use Unix-style line endings, so a single Line Feed (LF, `\\n`) character, NOT a Carriage Return and Line Feed (CRLF) pair (`\\r\\n`) as in MS DOS and Windows.\n\n`Poseidon_ID`s and `Group_Name`s, so the primary sample and group identifiers across `.janno`, `.ssf`, and genotype data files, SHOULD contain only characters of a subset of the 7-bit ASCII code set. 
Specifically the alphanumeric characters `A-Z`, `a-z`, `0-9`, and the symbols `_` (underscore), `-` (hyphen-minus), and `.` (period, dot or full stop).\n\n### The `POSEIDON.yml` file\n\nThe `POSEIDON.yml` file defines Poseidon packages by listing metainformation and relative paths in a standardised, machine-readable format.\n\n- It MUST be a valid [YAML file](https://yaml.org).\n- Its mandatory and optional fields are documented in the [POSEIDON_yml_fields.tsv file](https://github.com/poseidon-framework/poseidon-schema/blob/master/POSEIDON_yml_fields.tsv).\n\nHere is an example for a `POSEIDON.yml` file:\n\n```yml\nposeidonVersion: 2.7.1\ntitle: Switzerland_LNBA_Roswita\ndescription: LNBA Switzerland genetic data not yet published\ncontributor:\n  - name: Roswita Malone\n    email: roswita.malone@example.org\n    orcid: 1234-1234-1234-1234\n  - name: Paul Panther\n    email: paul.panther@example.edu\npackageVersion: 1.1.2\nlastModified: 2021-01-28\nlicense:\n  name: CC BY 4.0\n  url: https://creativecommons.org/licenses/by/4.0/\n  file: license.md\ngenotypeData:\n  format: PLINK\n  genoFile: Switzerland_LNBA_Roswita.bed\n  genoFileChkSum: 95b093eefacc1d6499afcfe89b15d56c\n  snpFile: Switzerland_LNBA_Roswita.bim\n  snpFileChkSum: 6771d7c873219039ba3d5bdd96031ce3\n  indFile: Switzerland_LNBA_Roswita.fam\n  indFileChkSum: f77dc756666dbfef3bb35191ae15a167\n  snpSet: 1240K\njannoFile : Switzerland_LNBA_Roswita.janno\njannoFileChkSum: 555d7733135ebcabd032d581381c5d6f\nsequencingSourceFile: Switzerland_LNBA_Roswita.ssf\nsequencingSourceFileChkSum: 19db1906240ee2f076e1a9659567dca4\nbibFile: Switzerland_LNBA_Roswita.bib\nbibFileChkSum: 70cd3d5801cee8a93fc2eb40a99c63fa\nreadmeFile: README.md\nchangelogFile: CHANGELOG.md\n```\n\nWhen a package is modified in any way (including updates of the context information in the `.janno` file), then the `packageVersion` field SHOULD be incremented and the `lastModified` field updated to the current date.\n\n#### Package 
versioning\n\nThe `packageVersion` field is a mandatory entry of the `POSEIDON.yml` file. It denotes the version of the individual package, using a three-component versioning system derived from [semantic versioning](https://semver.org).\n\nEach version number consists of three numbers separated by a `.`, for example `0.1.0`, `1.0.0` or `2.1.3`. The first number gives the `Major`, the second the `Minor` and the third the `Patch` component of the version number. For a Poseidon package these components SHOULD be incremented when the following changes occur:\n\n- **`Major`** (e.g. `1.4.2` -\u003e `2.0.0`)\n  - When samples are added to a package.\n  - When samples are removed from a package.\n  - When the genotype data (i.e. the contents of the `.bed`/`.bim`/`.fam` or `.geno`/`.snp`/`.ind` files) for any number of samples is changed.\n\n- **`Minor`** (e.g. `1.4.2` -\u003e `1.5.0`)\n  - When larger pieces of meta- or context information are added or modified in any package file, except the genotype data. For example:\n    - An entire `.janno`, `.bib` or `.ssf` file is added or replaced.\n    - Entire columns in the `.janno` or `.ssf` file are added or replaced.\n    - Primary publications for samples in the `.janno` and `.bib` file are added or replaced.\n\n- **`Patch`** (e.g. `1.4.2` -\u003e `1.4.3`)\n  - When smaller pieces of meta- or context information are added or modified in any package file, except the genotype data. 
For example:\n    - Individual entries in the `.janno` or `.ssf` file are added or replaced.\n    - Secondary publications for samples in the `.janno` and `.bib` file are added or replaced.\n    - BibTeX entries in the `.bib` file are modified.\n    - The package `description` changes in the `POSEIDON.yml` file.\n    - The `CHANGELOG.md` file is modified with additional information on previous entries.\n\nWhen the `packageVersion` is changed, then the `lastModified` date MUST be updated and an entry to the `CHANGELOG.md` file SHOULD be added summarising the changes made.\n\nPackages SHOULD start at `packageVersion` `0.1.0`.\n\n### Data licensing and the `license.md` file\n\nData licences are a common way to grant the public permission to use a dataset under copyright law.\n\nPoseidon packages MAY specify a license, and if so, SHOULD use [Creative Commons licences](https://creativecommons.org/share-your-work/cclicenses).\n\nLicences are documented in the `POSEIDON.yml` file in the `license` section, either with just the `name`, or with a license `file`, or with both the `name` and a `file`. `name` SHOULD include a short string with name and version of the license, e.g. `CC BY 4.0`. The `file`, typically named `license.md`, MAY include the full text of a license, or a short notifier further contextualizing the entry in the `name` field. For example:\n\n```default\nThe Poseidon package Switzerland_LNBA_Roswita © 2021 by Roswita Malone is licensed under Creative Commons Attribution 4.0 International. 
To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/\n```\n\n### Genotype data\n\nGenotype data in Poseidon packages is stored in (binary) PLINK, EIGENSTRAT or Variant Call Format (VCF).\n\n|   | PLINK (binary) | EIGENSTRAT | VCF |\n|---|---|---|---|\n| genotype file | [`.bed` (binary biallelic genotype table) or `.bed.gz`](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file) or `.geno.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | [`.vcf` or `.vcf.gz`](https://samtools.github.io/hts-specs/VCFv4.2.pdf) |\n| SNP file  | [`.bim` (extended MAP file) or `.bim.gz`](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file) or `.snp.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |  |\n| individual file  | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) |  |\n\nBoth PLINK and EIGENSTRAT formats require three files to be specified. In PLINK, the genotype file is binary (with 2 bits per genotype), while in EIGENSTRAT, the genotype file is text-based (with 8 bits per genotype). The SNP and individual files are text-based for both formats (see links behind the file endings in the table above). The EIGENSTRAT format specifically is common within archaeogenetics, compatible with many important tools, e.g. [EIGENSOFT](https://github.com/DReichLab/EIG) and [ADMIXTOOLS](https://github.com/DReichLab/AdmixTools). Finally, the VCF format is the most formally specified format, with properly versioned specifications being released regularly. 
VCF is well established in the wider genetics community and is the de facto standard for storing variants in the field of medical genetics.\n\nVCF files, as well as genotype and SNP files in PLINK and EIGENSTRAT, can be stored in gzipped form, signified by an additional file ending (`*.gz`).\n\nTo make VCF files fully convertible to PLINK and EIGENSTRAT, they MUST be biallelic and contain only genotypes coded as `0/0`, `0/1`, `1/1`, `./.`. Furthermore, they MAY encode group names and genetic sex for all samples through special header fields `##group_names=name1,name2,...` and `##genetic_sex=F,U,M,...`, respectively. If these fields are not present, then group names are assumed to be \"unknown\" and genetic sex \"U\" (unknown) for all samples.\n\n### The `.janno` file\n\nThe `.janno` file is a tab-separated text file with a header line. It holds context information (variables/columns) for each sample (objects/rows) in a package.\n\n- A set of strictly defined core variables (defined by column name) and their possible content are documented here: [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv)\n- A `.janno` file MAY have all of these core variables, or only a subset of them.\n- Only three columns MUST be present to make the file valid: **Poseidon_ID**, **Group_Name** and **Genetic_Sex**.\n- Arbitrary columns not defined here MAY be added as long as their column names do not clash with the defined ones.\n- Arbitrary, additional free-text information directly related to a column **\u003cColumn_Name\u003e** from the set of specified core variables in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv) SHOULD be added in a column whose name has the form **\u003cColumn_Name\u003e_Note**. 
Example: `Contamination_Note`.\n- The column order is not fixed, but MAY follow the order in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv). **\u003cColumn_Name\u003e_Note** columns SHOULD be placed directly after the respective column they are referring to.\n- If information is unknown or a variable does not apply for a certain sample, then the respective cell(s) MAY be filled with `n/a` or simply an empty string.\n- The order of the samples (rows) in the `.janno` file MUST be equal to the order in the genetic data files (`.ind`, `.fam`) in the package.\n- The values in the columns **Poseidon_ID**, **Group_Name** and **Genetic_Sex** MUST be equal to the terms used in the genetic data files (`.ind`, `.fam`).\n- Multiple predefined columns of the `.janno` file are list columns that can hold multiple values (either strings or numerics) separated by `;`.\n- The decimal separator for all floating point numbers MUST be `.`.\n\n### The `.bib` file\n\nA [BibTeX](http://www.bibtex.org/) file with all references listed in the `.janno` file. The entry keys MUST match the ones used in the `.janno` file.\n\nExample:\n\n```default\n@article{CassidyPNAS2015,\n    doi = {10.1073/pnas.1518445113},\n    url = {https://doi.org/10.1073%2Fpnas.1518445113},\n    year = 2015,\n    month = {dec},\n    publisher = {Proceedings of the National Academy of Sciences},\n    volume = {113},\n    number = {2},\n    pages = {368--373},\n    author = {Lara M. Cassidy and Rui Martiniano and Eileen M. Murphy and Matthew D. Teasdale and James Mallory and Barrie Hartwell and Daniel G. 
Bradley},\n    title = {Neolithic and Bronze Age migration to Ireland and establishment of the insular Atlantic genome},\n    journal = {Proceedings of the National Academy of Sciences}\n}\n```\n\nTo connect a sample in the package to this particular literature reference, the .janno file column `Publication` would have to be filled with `CassidyPNAS2015`.\n\n### The `README.md` file\n\nA simple [markdown](https://daringfireball.net/projects/markdown) file with informal, arbitrarily structured information accompanying the package.\n\nExample:\n\n```default\nThis package contains a rather interesting set of samples relevant for the peopling of the Territory of Christmas Island in the Indian Ocean. We consider this especially relevant, because ...\n```\n\n### The `CHANGELOG.md` file\n\nA markdown file to document changes in the history of a package.\n\nExample:\n\n```default\n- V 1.1.1: Fixed a spelling mistake in one site name: \"Hosenacker\" -\u003e \"Rosenacker\"\n- V 1.1.0: Added mtDNA contamination estimation to the .janno file\n- V 1.0.0: Added spatial coordinates and age information to the .janno file and finalized a first stable version of the package\n- V 0.2.0: Added previously restricted sample L1337\n- V 0.1.0: Creation of the package\n```\n\nThe structure with `- V X.X.X:` at the beginning of each line is not mandatory, but SHOULD be followed for reasons of interoperability.\n\n### The `.ssf` file\n\nThe `.ssf` file is another tab-separated text file with a header line. It stores sequencing source data, so metainformation about the raw sequencing data behind the genotypes in a Poseidon package. 
The primary entities in this table are sequencing entities, typically corresponding to DNA libraries or even multiple runs/lanes of the same library.\n\n- The predefined columns are specified here: [ssf_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/ssf_columns.tsv)\n- All columns of this schema are optional, so a `.ssf` MAY have all of these core variables, only a subset of them, or even none. It SHOULD have a `poseidon_IDs` column, though, to link the sequencing entities to the Poseidon package.\n- The link to the individuals listed in the `.janno`-file (and therefore to the entire Poseidon package) is made through a many-to-many foreign-key relationship between the .janno column `Poseidon_ID` and the .ssf column `poseidon_IDs`. That means each entry in the .janno file can be linked to many rows in the .ssf file and vice versa.\n- As in the `.janno` file arbitrary columns not defined here MAY be added to the `.ssf` file as long as their column names do not clash with the defined ones.\n- The order of columns and rows is irrelevant.\n- If information is unknown or a variable does not apply, then the respective cell(s) MAY be filled with `n/a` or simply an empty string.\n- Multiple predefined columns of the `.ssf` file are list columns that can hold multiple values (either strings or numerics) separated by `;`.\n- The decimal separator for all floating point numbers MUST be `.`.\n\n### Details\n\n#### The `Capture_Type` .janno column\n\nThe following protocols are specified:\n\n- `Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.).\n- `1240K`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array, see [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152).\n- `ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture 
as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities).\n- `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122).\n- `WISC2013`: Whole genome capture as described by [@Carpenter2013](https://doi.org/10.1016/j.ajhg.2013.10.002).\n- `OtherCapture`: Target enrichment with hybridization capture for any other set of sequences.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposeidon-framework%2Fposeidon-schema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fposeidon-framework%2Fposeidon-schema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposeidon-framework%2Fposeidon-schema/lists"}