https://github.com/zenodeapp/protein-crud

A basic CRUD for Proteins with string query functionality.
https://github.com/zenodeapp/protein-crud

awk contracts crud proteins query-parser solidity

Last synced: 4 months ago
JSON representation

A basic CRUD for Proteins with string query functionality.

Host: GitHub
URL: https://github.com/zenodeapp/protein-crud
Owner: zenodeapp
Created: 2022-07-31T23:05:39.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-12-07T20:04:18.000Z (almost 2 years ago)
Last Synced: 2025-02-06T23:19:14.384Z (9 months ago)
Topics: awk, contracts, crud, proteins, query-parser, solidity
Language: Solidity
Homepage:
Size: 72 MB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

Awesome Lists containing this project

README

          # Protein CRUD (This readme is outdated!)

A basic CRUD for Proteins with string query functionality.

This has been built within the Hardhat environment and is merely a basis for querying protein sequences and PDBID/ACCESSIONs.

## Getting Started

### 1. Installation

To get started, install all dependencies using a package manager of your choosing. For instance: yarn install or npm install.

### 2. Run Test Node

After having installed all dependencies use npx hardhat node to locally run a test environment where we could deploy our ProteinQuery contract to.

You could also use a different network of your choosing, which you could configure in the hardhat.config.js-file (for more info on this: https://hardhat.org/hardhat-runner/docs/config).

### 3. Deployment

Now after having a node up and running we'll have to deploy our contract using npx hardhat run scripts/deploy.js. Which will return the contract address, found in your terminal.

#### 3.1 Contract Address

After deploying, add the contract address to the proteins.config.js in the root folder, assign the address to the key named contractAddress.

### 4. Adding the Proteins

So our contract's deployed, we now just have to add our proteins to the contract's storage. The script addProteins.js will help with this. It's a simple Javascript that reads data from a .txt file located in the datasets-folder. These files contain the first n proteins from the Genesis L1 dataset (https://datasetnft.org/), where 104059 is the file containing all proteins. Change which dataset size you wish to use by editing this in the proteins.config.js file.

Use: npx hardhat run scripts/insertProteins.js to run the script.

#### 4.1 Adding the Seeds (Optional, but recommended)

Since version 1.1.0 a new way to query was added (semi-blast), which depends on the insertion of short seeds. Basically, these seeds are created by cutting all sequences into tiny n-sized words, where n could be adjusted by interacting with the contract. Each short sequence holds information on where this segment could be found.

Use: npx hardhat run scripts/insertSeeds.js to run the script. Do know that this is costly and takes longer than the proteins script (optimization is needed).

### 5. Querying the Proteins

We can finally query our dataset of proteins! I've written a task in the hardhat.config.js-file which calls the function naiveQuery and one calling semiBlastQuery (only works with seeds added) present in contracts/ProteinQuery.sol. It enables us to query by sequence OR ID OR both (an exclusive query). 

To run this task, use:

npx hardhat naiveQuery --id "your_id_query" --sequence "your_sequence_query" --exclusive "true/false"

OR

npx hardhat semiBlastQuery --sequence "your_sequence_query" --casesensitive "true/false" alot faster!

#### 5.1 Flags

All flags are optional. So if you want to, let's say, only search for id's containing "1A", you'd only set the flag --id to "1A". If you wanted to search for sequences containing "AAA" but also contain "1A" in its id, you'd have to set both flags to the corresponding values AND set --exclusive to "true". This, because a value of "false" would return all sequences that match "AAA" AND all sequences that have a id containing "1A", while in this particular case we'd only want the values where both queries are true.

The default values for each flag, if omitted, are:

`--id`: ""

`--sequence`: ""

`--exclusive`: "false"

#### 5.2 Returned Value

The query returns an object containing an array with all found proteins and an integer stating the amount of results found: {proteins: Array of ProteinStruct, proteinsFound: uint} where ProteinStruct is an object of the format {nftId: uint, id: string, sequence: string}.

So, for instance, getting the sequence of the third protein in the returned value, in Javascript, would look like this: result.proteins[2].sequence. See the naiveQuery-task in hardhat.config.js for a working example on how to loop through all the query results.

## Remarks

- ~~Solidity is not the most optimal when it comes to handling strings. Especially when it comes to larger strings. Therefore ideas like pre-processing the database and storing smaller segments are possible routes to explore to get this working faster (which I'm currently working on).~~ - included since version 1.1.0.

- ~~The searches are case-sensitive at the moment. This could be solved upon insertion of the proteins (but would discard whether the letters were lower or uppercase) or solved by adding an extra toLowerCase function in Solidity. But then again, Solidity is not optimal for string manipulation and this would degrade performance.~~ - included since version 1.2.2.

- There's a limitation in Solidity where `memory arrays` can't be dynamic in size. And since we cannot know beforehand how many results a query will have, we temporarily store the results in an array of size n, with n = the total amount of proteins. To prevent returning an array with a bunch of empty values, we copy the query results, in the temporary array, to a smaller sized array in the last line of the naiveQuery function. But, ofcourse this is an extra step, degrading the speed of our queries. More info about this issue can be found in the contract.

## Credits and sources of inspiration

I've tried to credit everyone else's code by commenting in code whenever this was the case!

- Hardhat's infrastructure! (https://hardhat.org/)

- Rob Hitchen's User CRUD (https://bitbucket.org/rhitchens2/soliditycrud/src/master/)

- Hermes Ateneo's "contains" function (https://github.com/HermesAteneo/solidity-repeated-word-in-string/blob/main/RepeatedWords.sol)

- Ottodevs' toLowerCase function (https://gist.github.com/ottodevs/c43d0a8b4b891ac2da675f825b1d1dbf) 

- Comparing strings (https://ethereum.stackexchange.com/questions/30912/how-to-compare-strings-in-solidity)

- Quick-sort from Subhodi, but in code altered for our needs (https://gist.github.com/subhodi/b3b86cc13ad2636420963e692a4d896f)

- Semi-blast is inspired by the first steps of the Blast algorithm (by reading research papers, lectures, pseudo-code and implementations in other languages by others)

Tousuke (ZEN - https://twitter.com/KeymasterZen)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/zenodeapp/protein-crud

Awesome Lists containing this project

README