https://github.com/dnbaker/bioseq
Tokenizers and Machine Learning Models for biological sequence data
- Host: GitHub
- URL: https://github.com/dnbaker/bioseq
- Owner: dnbaker
- Created: 2021-07-10T16:55:36.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-09-09T22:15:44.000Z (5 months ago)
- Last Synced: 2024-09-10T02:38:52.788Z (5 months ago)
- Topics: biological-sequences, machine-learning, tokenization, transformers
- Language: C++
- Homepage:
- Size: 190 KB
- Stars: 22
- Watchers: 4
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# bioseq
A C++/Python package that performs fast one-hot encoding of DNA or protein sequences in C++, optionally converting the result to PyTorch tensors and moving it to a device.
It offers the 4-letter DNA alphabet, the 20-letter amino acid alphabet, and a variety of other compressed protein and DNA alphabets, and encoding can optionally be parallelized.
## Tokenizing
bioseq.Tokenizer performs the tokenizing. Pre-made tokenizers are provided for every alphabet, in combinations of BOS, EOS, and whether padding gets a unique character or is simply masked:
`bos_tokenizers` is a dictionary from alphabets to Tokenizers with a BOS tag prepended.
`eos_tokenizers` is a dictionary from alphabets to Tokenizers with an EOS tag appended.
`pos_tokenizers` is a dictionary from alphabets to Tokenizers with a padding character used.
`beos_tokenizers` adds both BOS and EOS.
`pbeos_tokenizers` adds BOS, EOS, and padding characters.

A Tokenizer can tokenize (`batch_tokenize`), which produces an array of tokens (uint8 by default),
or it can one-hot encode (`batch_onehot_encode`), which takes the tokens one step further into a one-hot encoding.
Both of these `Tokenizer::batch_*` functions can be parallelized by providing `nthreads={int}`. Tokenizing uses seq-first ordering by default, but this can be changed with `batch_first=True`.
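As a quick illustration, here is a minimal sketch of both calls; the exact output shapes and the `padlen` handling for unequal-length inputs are assumptions.
```python
import bioseq

# Pre-made DNA tokenizer with BOS, EOS, and a dedicated padding token.
tok = bioseq.pbeos_tokenizers['DNA']
seqs = ["ACGT", "GGGGA", "TTT"]

# Integer tokens (uint8 by default); nthreads parallelizes across sequences,
# batch_first=True puts the batch on axis 0 instead of the default seq-first order.
tokens = tok.batch_tokenize(seqs, padlen=8, batch_first=True, nthreads=2)

# One-hot encoding of the same batch (seq-first; see the note below).
onehot = tok.batch_onehot_encode(seqs, padlen=8, nthreads=2)
```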
One-hot encoding uses seq-first ordering (not batch-first) and does not support `batch_first`. Both of these are ~30x as fast as using bytes.translate + np.frombuffer + np.vstack + `torch.from_numpy`,
and ~500x as fast as transformers.tokenizer.batch\_encode\_plus.

1. To train Transformers, use `batch_first=True`, followed by torch.nn.Embedding.
2. To train CNNs, tokenize with `batch_first=True`, embed with torch.nn.Embedding, and then apply `lambda x: einops.rearrange(x, 'batch seq emb -> batch emb seq')`.
This is because CNNs expect (Batch, C, L).
3. To train LSTMs, tokenize with `batch_first=False` and embed with torch.nn.Embedding.

In short, you only want `batch_first=False` for LSTM training, and CNNs require a rearrange call because they expect a different dimension ordering; see the sketch below.
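A minimal sketch of the CNN path under these conventions; the vocabulary size, embedding dimension, and convolution parameters below are arbitrary choices, not values prescribed by the package.
```python
import numpy as np
import torch
import einops
import bioseq

tok = bioseq.pbeos_tokenizers['DNA']
tokens = tok.batch_tokenize(["ACGT", "GGGG"], padlen=7, batch_first=True)

# Embed integer tokens; num_embeddings just needs to cover the tokenizer's
# vocabulary (4 bases plus BOS/EOS/PAD here), so 16 is a safe overestimate.
embed = torch.nn.Embedding(num_embeddings=16, embedding_dim=32)
x = embed(torch.as_tensor(np.asarray(tokens), dtype=torch.long))  # (batch, seq, emb)

# CNNs expect (batch, channels, length), so move the embedding axis.
x = einops.rearrange(x, 'batch seq emb -> batch emb seq')
y = torch.nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3)(x)
```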
## Decoding
You can decode a sequence with a tokenizer.
```python
import bioseq
tok = bioseq.pbeos_tokenizers['DNA'] # To add BOS, EOS, and PAD characters separately.
tokens = tok.batch_tokenize(["ACGT", "GGGG"], padlen=7, batch_first=True)
decoded = tok.decode_tokens(tokens)
# decoded == ['ACGT', 'GGGG']
```
decode_tokens accepts 1D and 2D arrays. Be careful: if `batch_first` was not set when tokenizing, you may get the wrong outputs; you can fix this by swapping dimensions.
If you have a one-hot encoded array (or logits), just take an argmax over the vocabulary dimension to convert back to tokens for decoding.
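A minimal sketch of that round trip; treating the trailing axis as the vocabulary, and the use of `padlen` with `batch_onehot_encode`, are assumptions here.
```python
import numpy as np
import bioseq

tok = bioseq.pbeos_tokenizers['DNA']

# One-hot encode a batch; as noted above, this output is seq-first.
onehot = np.asarray(tok.batch_onehot_encode(["ACGT", "GGGG"], padlen=7))

# Argmax over the (assumed trailing) vocabulary axis recovers integer tokens
# with shape (seq, batch); transpose so each row is one sequence.
tokens = onehot.argmax(axis=-1).astype(np.uint8)
decoded = tok.decode_tokens(tokens.T)
```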
*Warning* (sharp edges):
1. If you're using a reduced amino acid alphabet, each token represents several amino acids, and we simply pick the lexicographically smallest as a representative.
To get the full set of characters for ambiguous tokens, use `tokenizer.token_decoder()`, which returns a dictionary mapping integers to all possible characters; see the sketch after this list.
2. Consider ensuring padding gets its own character. `pbeos_tokenizers`, for instance, adds padding tokens as well as beginning/end-of-sequence tokens.
Since sequences have different lengths, we have to pad to equal length within a batch. If `padding=True` on the `Tokenizer`, then we add padding tokens at the ends.
One-hot encoding simply leaves padded positions as 0s by default, but for tokens a dedicated character is particularly important: in DNA, empty padding marked as 0 would otherwise be read as A. You pay slightly more (and use more tokens), but models learn the pattern of padding tokens at the end rather quickly, and you avoid such mistakes.
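A small sketch of inspecting ambiguous tokens this way; the reduced-alphabet key used below is hypothetical, so substitute one your build actually provides.
```python
import bioseq

# Hypothetical reduced amino-acid alphabet key -- replace with a real one.
tok = bioseq.eos_tokenizers['REDUCED_AA']

# token_decoder() maps each integer token to every character it can represent,
# rather than only the lexicographically smallest one used by decode_tokens.
for token_id, chars in tok.token_decoder().items():
    print(token_id, chars)
```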
## DataLoading
We use bioseq.FlatFile, which provides random access to the sequences in a FASTA/FASTQ file.
This is then wrapped by bioseq.FlatFileDataset for use with torch.utils.data.DataLoader. For an example, see training/trainh.py and training/compute.py.
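A minimal sketch of that pipeline, assuming FlatFileDataset wraps a FlatFile directly and yields items the default collate can batch; the file name and constructor arguments are assumptions, so check training/trainh.py for the real usage.
```python
import bioseq
from torch.utils.data import DataLoader

# Random access over a FASTA/FASTQ file.
ff = bioseq.FlatFile("sequences.fasta")

# PyTorch-compatible dataset over the flat file (constructor details assumed).
dataset = bioseq.FlatFileDataset(ff)

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    ...  # tokenize / embed / train on each batch
```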
## Sequence augmentation
We also support augmentation by random mutations sampled according to BLOSUM62 transition probabilities.
This is only valid for tokenizers using the full 20-character amino acid alphabet ("PROTEIN" or "AMINO20"); we may modify this in the future to support other alphabets. bioseq.AmineTokenizer is a pre-built tokenizer without BOS, EOS, or padding which is valid for this.
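As a conceptual illustration of what this augmentation does (not the package's API), substitutions can be sampled from per-residue transition probabilities derived from a substitution matrix; the uniform matrix below is a stand-in for row-normalized BLOSUM62 values.
```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the full 20-letter amino acid alphabet

# Stand-in for row-normalized BLOSUM62 transition probabilities:
# transition[i, j] = P(residue i is replaced by residue j). Uniform for brevity.
transition = np.full((20, 20), 1.0 / 20)

def mutate(seq: str, rate: float = 0.05, rng=np.random.default_rng(0)) -> str:
    """Randomly replace roughly `rate` of positions, sampling each replacement
    from that residue's transition distribution."""
    out = list(seq)
    for i, aa in enumerate(out):
        if aa in AA and rng.random() < rate:
            out[i] = AA[rng.choice(20, p=transition[AA.index(aa)])]
    return "".join(out)

print(mutate("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```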
## Dependencies
pybind11 v2.7 is required in order to support bytearray.
numpy is required
pytorch (as torch) is also required.
Besides these, there are some Python-only dependencies which setup.py should download for you.
All of these can be manually installed via `python3 -m pip install -r requirements.txt`.
## Version history
v0.1.3: Bug fix - previous versions mapped Proline ("P") to Lysine ("K"), instead of mapping Pyrrolysine ("O") to "K".
v0.1.2: Dependencies made optional, token decoding added
v0.1.1: Initial version