https://github.com/genspectrum/ingest
- Host: GitHub
- URL: https://github.com/genspectrum/ingest
- Owner: GenSpectrum
- Created: 2023-10-01T18:18:39.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-01T12:35:26.000Z (about 2 months ago)
- Last Synced: 2025-04-01T13:34:38.299Z (about 2 months ago)
- Language: Kotlin
- Size: 237 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# GenSpectrum ingest
## Run with Docker
See help page:
```bash
docker run --rm ghcr.io/genspectrum/ingest:main -h
```

Preprocess GISAID data:
```bash
docker run --rm \
-v :/data \
ghcr.io/genspectrum/ingest:main \
ingest-sc2-gisaid \
/data \
\
\
\
\
/app/gisaid_geoLocationRules.tsv
```

## Internal data format
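The program uses Zstandard-compressed, newline-delimited JSON (`ndjson.zst`) as its default format, with one JSON entry per line. A single entry could look as follows (pretty-printed here for readability):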
```json
{
  "id": "MW12345",
  "metadata": {
    "genbankAccession": "MW12345",
    "sequencingDate": "2022-08-15",
    "country": "Schweiz",
    "pangoLineage": "XBB.1.5"
  },
  "unalignedNucleotideSequences": {
    "main": "AATTCC..."
  },
  "alignedNucleotideSequences": {
    "main": "NNNNNAATTCC..."
  },
  "nucleotideInsertions": {
    // ???
  },
  "alignedAminoAcidSequences": {
    "S": "XXMSR...",
    "ORF1a": "...",
    // ...
  },
  "aminoAcidInsertions": {
    // ???
  }
}
```
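
To illustrate how such entries might be consumed, here is a minimal Python sketch that reads already-decompressed NDJSON line by line and pulls out a few fields. It is not part of this repository: the helper names `read_entries` and `summarize`, the sample records, and the second ID are made up for illustration, and it assumes that real entries are plain JSON without the `//` placeholder comments shown above.

```python
import io
import json
from typing import Iterator, TextIO


def read_entries(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed entry per non-empty line of an NDJSON text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(entry: dict) -> dict:
    """Collect the id, Pango lineage, and aligned sequence length per segment."""
    aligned = entry.get("alignedNucleotideSequences", {})
    return {
        "id": entry["id"],
        "pangoLineage": entry.get("metadata", {}).get("pangoLineage"),
        "alignedLengths": {name: len(seq) for name, seq in aligned.items()},
    }


# Two made-up entries, one per line, standing in for a decompressed file.
SAMPLE = (
    '{"id": "MW12345", "metadata": {"pangoLineage": "XBB.1.5"}, '
    '"alignedNucleotideSequences": {"main": "NNNNNAATTCC"}}\n'
    '{"id": "MW67890", "metadata": {"pangoLineage": "BA.2"}, '
    '"alignedNucleotideSequences": {"main": "NNAATT"}}\n'
)

summaries = [summarize(entry) for entry in read_entries(io.StringIO(SAMPLE))]
for summary in summaries:
    print(summary)
```

For a real file, one option is to decompress with the `zstd` command-line tool (`zstd -dc`) and pass `sys.stdin` to `read_entries` instead of the in-memory sample. Reading line by line keeps memory use bounded, which matters because each entry carries full sequences.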