https://github.com/luizirber/2024-07-26-sra-metadata

Host: GitHub
URL: https://github.com/luizirber/2024-07-26-sra-metadata
Owner: luizirber
Created: 2024-07-27T05:46:08.000Z (over 1 year ago)
Default Branch: latest
Last Pushed: 2024-08-02T04:16:09.000Z (over 1 year ago)
Last Synced: 2025-04-02T12:43:05.701Z (8 months ago)
Language: Python
Size: 33.2 KB
Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Prepare a subset from SRA metadata using duckdb

The instructions for accessing [the SRA metadata in the cloud](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/)
show a command to list the contents of a bucket containing the tables:

```bash
aws s3 ls s3://sra-pub-metadata-us-east-1/sra/metadata/ --no-sign-request
```
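
The same listing can be sketched without the AWS CLI. The example below assumes the bucket allows anonymous listing over the S3 REST API (which the `--no-sign-request` flag above suggests) and uses only the Python standard library. The function names are illustrative and are not part of this repo.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "sra-pub-metadata-us-east-1"
PREFIX = "sra/metadata/"
# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(bucket=BUCKET, prefix=PREFIX):
    """Build an unsigned ListObjectsV2 URL (assumes a publicly listable bucket)."""
    query = urllib.parse.urlencode({"list-type": "2", "prefix": prefix})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_bucket(bucket=BUCKET, prefix=PREFIX):
    """Fetch the first page of results only (no pagination handling)."""
    with urllib.request.urlopen(listing_url(bucket, prefix)) as response:
        return parse_listing(response.read().decode("utf-8"))


# Example (needs network access):
#     for key, size in list_bucket():
#         print(size, key)
```

This only reads the first page of results, so a long listing would need the continuation token handled as well.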

These tables are parquet files.
Instead of querying them with Athena,
this repo has a SQL script for `duckdb` that runs a similar query locally
and saves the results to a `subset.parquet` file for later use.
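
As a rough illustration of what such a query can look like, the sketch below builds a DuckDB `COPY ... TO` statement that reads parquet input and writes a filtered subset. It is not the contents of `query.sql`: the input path, the column names (`acc`, `organism`, `mbases`, `assay_type`) and the filter are assumptions chosen for the example.

```python
def build_subset_query(source_glob, output_path, columns, where):
    """Build a DuckDB statement that filters parquet input into one parquet file."""
    column_list = ", ".join(columns)
    return (
        f"COPY (SELECT {column_list} "
        f"FROM read_parquet('{source_glob}') "
        f"WHERE {where}) "
        f"TO '{output_path}' (FORMAT PARQUET);"
    )


# Hypothetical values: adjust the path, columns and filter to the real data.
query = build_subset_query(
    source_glob="data/*",
    output_path="subset.parquet",
    columns=["acc", "organism", "mbases"],
    where="assay_type = 'WGS'",
)
print(query)
```

The printed statement can be pasted into the `duckdb` CLI, or run from Python with `duckdb.execute(query)` if the `duckdb` package is installed.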

## Steps

### Install pixi

```bash
curl -fsSL https://pixi.sh/install.sh | bash
```

### Prepare duckdb

`duckdb` in conda-forge is missing the [parquet extension](https://github.com/conda-forge/duckdb-split-feedstock/issues/9),
but it can be installed with:
```bash
pixi run install_parquet
```

### Download data

```bash
pixi run download
```
This will download around 9 GiB of data (as of 2024-07-26).

### Subset data

```bash
pixi run subset
```

This executes the `query.sql` file in `duckdb`.
On my laptop this took a bit more than 17 minutes on HDD
and 4m30s on SSD.
The final parquet file is `subset.parquet`.

### Load into duckdb database

```bash
pixi run load
```

This took 20 minutes on SSD.
You may need to limit memory usage (change the `memory_limit` in `load.sql`).
The load also uses quite a bit (100GB) of temp disk space.
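
Because the load needs that much scratch space, a quick check before starting can save a failed run. The sketch below uses only the Python standard library to compare free disk space against the 100GB figure above, and shows the form of the DuckDB settings involved. `memory_limit` is the setting named above; `temp_directory` is a standard DuckDB option, but whether `load.sql` sets it is an assumption, and the values shown are placeholders.

```python
import shutil

GB = 10**9  # decimal gigabytes, matching the 100GB estimate


def free_space_gb(path="."):
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GB


def has_room_for_temp(path=".", required_gb=100):
    """True if `path` has at least `required_gb` of free space."""
    return free_space_gb(path) >= required_gb


def duckdb_settings(memory_limit="8GB", temp_directory="/tmp/duckdb_tmp"):
    """SQL statements that cap memory and pick the spill directory (placeholder values)."""
    return [
        f"SET memory_limit = '{memory_limit}';",
        f"SET temp_directory = '{temp_directory}';",
    ]


if not has_room_for_temp("."):
    print(f"Only {free_space_gb('.'):.0f} GB free here; the load may run out of temp space.")
```

Pointing the temp directory at a disk with enough free space is one way to keep the load from filling up the main drive.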

After the data is loaded,
you can explore it by opening the database with the `duckdb` CLI:
```bash
pixi run duckdb metadata.db
```
