Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/astrabert/supaseqs
Basically BLAST, but written in PostgreSQL😉
https://github.com/astrabert/supaseqs
bioinformatics blast data-management database dna-sequences fastapi nucleotide-sequence postgresql sqalchemy supabase vector-database vector-search
Last synced: 9 days ago
JSON representation
Basically BLAST, but written in PostgreSQL😉
- Host: GitHub
- URL: https://github.com/astrabert/supaseqs
- Owner: AstraBert
- License: mit
- Created: 2024-09-01T22:37:20.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-09-01T22:45:19.000Z (4 months ago)
- Last Synced: 2024-12-25T20:12:36.597Z (16 days ago)
- Topics: bioinformatics, blast, data-management, database, dna-sequences, fastapi, nucleotide-sequence, postgresql, sqalchemy, supabase, vector-database, vector-search
- Language: Python
- Homepage: https://astrabert.github.io/SupaSeqs/
- Size: 34.2 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
SupaSeqs
Basically BLAST written in PostgreSQL😉
**SupaSeqs** is a tool that can be used to manage DNA sequences databases locally, thanks to the PostgreSQL implementation offered by [*Supabase*](https://supabase.com/).
It leverages PostgreSQL as backend database manager, kmer-based vectorization and vector search to mimic the functionalities of BLAST.
### 1. Installation
If you are working in a Linux environment, you may want to just download/copy [setup.sh](./setup.sh) and launch it:
```bash
# Linux
wget https://raw.githubusercontent.com/AstraBert/SupaSeqs/main/scripts/setup.sh
bash setup.sh
```### 1a. Pre-requirements
Make sure that your environment has:
- `git`
- `Node v18` or following
- `npm` and `npx`
- `python 3.10` or following
The installation process should work both on Windows and on Linux.#### 1b. Environment setup
First of all, clone this repository:
```bash
# BOTH Windows and Linux
git clone https://github.com/AstraBert/SupaSeqs
cd SupaSeqs
```Get the `supabase` command line executables:
```bash
# BOTH Windows and Linux
npm install supabase
```Create and start a Supabase instance:
```bash
# BOTH Windows and Linux
npx supabase init
npx supabase start
```Retrieve the connection string from the `DB URL` that will be printed after this command:
```bash
# BOTH Windows and Linux
npx supabase status
```Create a virtual environment, activate it and install the necessary dependencies:
```bash
# Linux
python3 -m venv apienv
source apienv/bin/activate
python3 -m pip install -r requirements.txt
```Or
```powershell
# Windows
python3 -m venv .\apienv
.\apienv\Scripts\activate # For Command Prompt
# or
.\apienv\Activate.ps1 # For PowerShell
python3 -m pip install -r .\requirements.txt
```#### 1c. Application start
Within the virtual environment, run:
```bash
# BOTH Windows and Linux
cd scripts
python3 -m fastapi dev
```If there are problems with the connection to the Supabase client, make sure to replace the connection string in [line of 16 `main.py`](./scripts/main.py#L16) with the one you found running `supabase status`.
### 2. How does it work
The application works as an API service, leveraging [FastAPI](https://fastapi.tiangolo.com/). The connection to Supabase is handled via a [`sqlalchemy`](https://docs.sqlalchemy.org/en/20/) implementation of a client which is similar to the one built in the [`vecs`](https://github.com/supabase/vecs) library.
The application accepts two request types:
1- **POST** - *Upload a sequence or a FASTA file*:
```bash
# Single sequence
curl -X POST "http://127.0.0.1:8000/seqs/" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"sequence\":\"GGCAGAACCCAGGGCACCAGCACGCCGAAGGACCACCGCAGGCTGGCCAGCGCTCCACCCTCCCTGCACCACACCCTGCGAGCAAAAGGCAGCAGAAATGAAGAGCATTTACTTTGTGGCTGGATTGTTTGTAATGCTGGTACAAGGCAGCTGGCAACACCCACTTCAAGACACAGAGGAAAAACCCAGGTCTTTCTCAACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT\", \"description\": \"M57688.1 Octodon degus glucagon mRNA, complete cds\"}"
# FASTA file
curl -X POST "http://127.0.0.1:8000/seqs/" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"sequence\": \"sequence.fasta\"}"
```
Each sequence gets vectorized with a 5-mer-based representation (a 1024-dim array), which is then uploaded to the `sequences` table on Supabase along with a description (if provided in the case of the single sequence, the headers of the sequences for those in a FASTA file) and the original sequence.2- **GET** - *Search through the sequence database*
```bash
curl -X 'GET' 'http://localhost:8000/seqs/AACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT?limit=100&threshold=75' -H 'accept: application/json'
```The query sequence gets vectorized and the database is searched: a number of sequences (specified with the _limit_ key, maximum is 1000) is returned if they are compliant with a similarity threshold (specified as a percentage value with the _threshold_ key); the typical response looks like this:
```json
{"1":{"sequence":"GGCAGAACCCAGGGCACCAGCACGCCGAAGGACCACCGCAGGCTGGCCAGCGCTCCACCCTCCCTGCACCACACCCTGCGAGCAAAAGGCAGCAGAAATGAAGAGCATTTACTTTGTGGCTGGATTGTTTGTAATGCTGGTACAAGGCAGCTGGCAACACCCACTTCAAGACACAGAGGAAAAACCCAGGTCTTTCTCAACTTCTCAAACAGACTTGCTTGATGATCCGGATCAGATGAATGAAGACAAGCGTCATTCACAGGGTACATTCACCAGTGACTACAGCAAGTTCCTCGACACCAGGCGTGCTCAAGACTTCTTGGATTGGCTGAAGAACACCAAGAGGAACAGGAATGAAAT","description":"M57688.1 Octodon degus glucagon mRNA, complete cds","cos_dist":0.23987939711631145}}
```This is accomplished thanks to a function called `match_page_sections` and defined as follows:
```sql
create or replace function public.match_page_sections (
embedding vector(1024),
match_threshold float,
match_count int
)
returns setof public.sequences
language sql
as $$
select *
from public.sequences
where public.sequences.embedding <=> embedding < 1 - match_threshold
order by public.sequences.embedding <=> embedding asc
limit least(match_count, 1000);
$$;
```### 3. Contributions
Contributions are more than welcome! See [contribution guidelines](./CONTRIBUTING.md) for more information :)
### 4. Funding
If you found this project useful, please consider to [fund it](https://github.com/sponsors/AstraBert) and make it grow: let's support open-source together!😊
### 5. License and rights of usage
This project is provided under [MIT license](./LICENSE): it will always be open-source and free to use.
If you use this project, please cite the author: [Astra Clelia Bertelli](https://astrabert.vercel.app)