https://github.com/luizirber/2024-07-26-sra-metadata
- Host: GitHub
- URL: https://github.com/luizirber/2024-07-26-sra-metadata
- Owner: luizirber
- Created: 2024-07-27T05:46:08.000Z (over 1 year ago)
- Default Branch: latest
- Last Pushed: 2024-08-02T04:16:09.000Z (over 1 year ago)
- Last Synced: 2025-04-02T12:43:05.701Z (8 months ago)
- Language: Python
- Size: 33.2 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
# Prepare a subset from SRA metadata using duckdb
The instructions for accessing [the SRA metadata in the cloud](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/)
show a command to list the contents of a bucket containing the tables:
```
aws s3 ls s3://sra-pub-metadata-us-east-1/sra/metadata/ --no-sign-request
```
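For readers without the AWS CLI, the same listing can be sketched with the Python standard library. This is a minimal sketch, not part of this repo: it assumes the bucket allows anonymous listing over HTTPS (which `--no-sign-request` suggests), and the helper names `listing_url`, `parse_listing`, and `fetch_listing` are hypothetical. Unlike `aws s3 ls`, it returns every key under the prefix (closer to `aws s3 ls --recursive`), and only the first page of results.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket and prefix come from the `aws s3 ls` command above.
BUCKET = "sra-pub-metadata-us-east-1"
PREFIX = "sra/metadata/"
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket=BUCKET, prefix=PREFIX):
    """Build an unsigned ListObjectsV2 URL (assumes anonymous listing is allowed)."""
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    entries = []
    for item in root.findall("s3:Contents", S3_NS):
        key = item.findtext("s3:Key", namespaces=S3_NS)
        size = int(item.findtext("s3:Size", namespaces=S3_NS))
        entries.append((key, size))
    return entries


def fetch_listing(url=None):
    """Fetch and parse the first page of results (needs network access)."""
    with urllib.request.urlopen(url or listing_url()) as response:
        return parse_listing(response.read())
```

Calling `fetch_listing()` returns a list of `(key, size)` tuples that can be printed or summed to estimate the download size.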
These tables are parquet files.
Instead of using Athena to query them,
this repo has a SQL script for `duckdb` that runs a similar query locally
and saves the results to a `subset.parquet` file for later use.
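To give an idea of the shape such a query can take, here is a minimal sketch that builds a DuckDB `COPY ... TO` statement over local parquet files. It is not the `query.sql` from this repo: the helper `build_subset_query`, the `metadata/*` path, the column names (`acc`, `organism`, `assay_type`), and the filter are all assumptions chosen for illustration.

```python
def build_subset_query(source_glob, columns, where, out_path="subset.parquet"):
    """Return DuckDB SQL that filters parquet files and writes a parquet subset."""
    column_list = ", ".join(columns)
    return (
        "COPY (\n"
        f"    SELECT {column_list}\n"
        f"    FROM read_parquet('{source_glob}')\n"
        f"    WHERE {where}\n"
        f") TO '{out_path}' (FORMAT PARQUET);"
    )


# Hypothetical inputs: adjust the path, columns, and filter to the real tables.
example_query = build_subset_query(
    source_glob="metadata/*",
    columns=["acc", "organism", "assay_type"],
    where="assay_type = 'WGS'",
)
print(example_query)
```

The resulting text can be saved to a `.sql` file for the `duckdb` CLI, or passed to `duckdb.sql()` if the `duckdb` Python package is installed.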
## Steps
### Install pixi
```bash
curl -fsSL https://pixi.sh/install.sh | bash
```
### Prepare duckdb
`duckdb` in conda-forge is missing the [parquet extension](https://github.com/conda-forge/duckdb-split-feedstock/issues/9),
but it can be installed with:
```
pixi run install_parquet
```
### Download data
```bash
pixi run download
```
This downloads around 9 GiB of data (as of 2024-07-26).
### Subset data
```bash
pixi run subset
```
This executes the `query.sql` file in `duckdb`.
On my laptop this took a bit more than 17 minutes on an HDD
and 4m30s on an SSD.
The final parquet file is `subset.parquet`.
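Since the subset step runs for several minutes, a cheap sanity check on the output can catch an interrupted write. The sketch below relies on the parquet format's `PAR1` magic bytes at the start and end of every file. The helper name `looks_like_parquet` is hypothetical, and passing this check does not prove the contents are valid.

```python
import os
from pathlib import Path

# Parquet files begin and end with these four bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the parquet magic bytes."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        # Jump to the last four bytes of the file.
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

For example, `looks_like_parquet("subset.parquet")` should return `True` after a completed run, and `False` if the file is missing or was cut off before the footer was written.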
### Load into duckdb database
```bash
pixi run load
```
This took 20 minutes on an SSD.
You may need to limit memory usage (change the `memory_limit` in `load.sql`),
and the load uses quite a bit (100GB) of temporary disk space.
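Because the load needs this much scratch space, it can help to check the free space before starting. The sketch below is an illustration, not the contents of `load.sql`. The helper names, the `8GB` limit, and the `duckdb_tmp` directory are assumptions. `memory_limit` and `temp_directory` are DuckDB settings, but whether `load.sql` sets `temp_directory` is not confirmed here.

```python
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching the 100GB figure above


def free_space_gb(path="."):
    """Return the free space, in GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / GB


def has_enough_temp_space(path=".", required_gb=100):
    """Check whether `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb


def load_settings(memory_limit="8GB", temp_directory="duckdb_tmp"):
    """Return DuckDB SET statements that cap memory and choose where spill files go."""
    return (
        f"SET memory_limit = '{memory_limit}';\n"
        f"SET temp_directory = '{temp_directory}';"
    )


# Hypothetical values: pick a limit and a directory that suit your machine.
print(load_settings())
print(f"Free space here: {free_space_gb():.1f} GB")
```

If `has_enough_temp_space()` returns `False`, pointing `temp_directory` at a larger disk is one option before running `pixi run load`.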
After the data is loaded,
you can explore it by opening the database with the `duckdb` CLI:
```
pixi run duckdb metadata.db
```
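The database can also be explored from Python. This is a minimal sketch that assumes the `duckdb` Python package is installed separately. The helpers `list_tables` and `preview` are hypothetical, and since the table names inside `metadata.db` are not documented here, the sketch lists them first.

```python
def list_tables(con):
    """Return the names of tables visible through a DuckDB connection."""
    rows = con.execute(
        "SELECT table_name FROM information_schema.tables ORDER BY table_name"
    ).fetchall()
    return [row[0] for row in rows]


def preview(con, table, limit=5):
    """Return the first `limit` rows of `table` as a list of tuples."""
    return con.execute(f'SELECT * FROM "{table}" LIMIT {int(limit)}').fetchall()


# Usage (requires the `duckdb` package and a loaded metadata.db):
#
#   import duckdb
#   con = duckdb.connect("metadata.db", read_only=True)
#   for name in list_tables(con):
#       print(name, preview(con, name, limit=3))
```

Opening the file with `read_only=True` avoids modifying the database while browsing it.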