Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/openvoiceos/quebra_frases
chunks strings into byte sized pieces
- Host: GitHub
- URL: https://github.com/openvoiceos/quebra_frases
- Owner: OpenVoiceOS
- License: apache-2.0
- Created: 2021-04-15T13:06:37.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2024-01-28T13:54:26.000Z (11 months ago)
- Last Synced: 2024-10-28T14:42:44.700Z (2 months ago)
- Topics: chunk, chunking, sentence-chunking, tokenization, tokenize, tokenized, tokenizer, word-tokenizing
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 1
- Watchers: 4
- Forks: 3
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
  - Changelog: CHANGELOG.md
  - License: LICENSE
Awesome Lists containing this project
README
# Quebra Frases
quebra_frases chunks strings into byte sized pieces
## Usage
### Tokenization
```python
import quebra_frases

sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"

print(quebra_frases.word_tokenize(sentence))
# ['sometimes', 'i', 'develop', 'stuff', 'for', 'mycroft', ',',
#  'mycroft', 'is', 'FOSS', '!']

print(quebra_frases.span_indexed_word_tokenize(sentence))
# [(0, 9, 'sometimes'), (10, 11, 'i'), (12, 19, 'develop'),
#  (20, 25, 'stuff'), (26, 29, 'for'), (30, 37, 'mycroft'),
#  (37, 38, ','), (39, 46, 'mycroft'), (47, 49, 'is'),
#  (50, 54, 'FOSS'), (54, 55, '!')]

print(quebra_frases.sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
# ['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.',
#  'Did he mind?',
#  "Adam Jones Jr. thinks he didn't.",
#  "In any case, this isn't true...",
#  "Well, with a probability of .9 it isn't."]

print(quebra_frases.span_indexed_sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
# [(0, 82, 'Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.'),
#  (83, 95, 'Did he mind?'),
#  (96, 128, "Adam Jones Jr. thinks he didn't."),
#  (129, 160, "In any case, this isn't true..."),
#  (161, 201, "Well, with a probability of .9 it isn't.")]

print(quebra_frases.paragraph_tokenize('This is a paragraph!\n\t\nThis is another '
                                       'one.\t\n\tUsing multiple lines\t \n '
                                       '\n\tparagraph 3 says goodbye'))
# ['This is a paragraph!\n\t\n',
#  'This is another one.\t\n\tUsing multiple lines\t \n \n',
#  '\tparagraph 3 says goodbye']

print(quebra_frases.span_indexed_paragraph_tokenize('This is a paragraph!\n\t\nThis is another '
                                                    'one.\t\n\tUsing multiple lines\t \n '
                                                    '\n\tparagraph 3 says goodbye'))
# [(0, 23, 'This is a paragraph!\n\t\n'),
#  (23, 77, 'This is another one.\t\n\tUsing multiple lines\t \n \n'),
#  (77, 102, '\tparagraph 3 says goodbye')]
```
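The `span_indexed_*` variants return `(start, end, token)` triples. Assuming, as the sample output above suggests, that `start` and `end` are ordinary Python slice indices into the original string, each token can be recovered by slicing. A minimal sketch under that assumption:

```python
import quebra_frases

sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"

# Sketch (not from the README): if (start, end) are plain slice indices
# into the original string, slicing should reproduce each token exactly.
for start, end, token in quebra_frases.span_indexed_word_tokenize(sentence):
    assert sentence[start:end] == token
    print(f"{start:>2}-{end:<2} {token!r}")
```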
### Chunking
```python
import quebra_frases

delimiters = ["mycroft"]
sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"
print(quebra_frases.chunk(sentence, delimiters))
# ['sometimes i develop stuff for', 'mycroft', ',', 'mycroft', 'is FOSS!']

samples = ["tell me what do you dream about",
           "tell me what did you dream about",
           "tell me what are your dreams about",
           "tell me what were your dreams about"]
print(quebra_frases.get_common_chunks(samples))
# {'tell me what', 'about'}
print(quebra_frases.get_uncommon_chunks(samples))
# {'do you dream', 'did you dream', 'are your dreams', 'were your dreams'}
print(quebra_frases.get_exclusive_chunks(samples))
# {'do', 'did', 'are', 'were'}

samples = ["what is the speed of light",
           "what is the maximum speed of a firetruck",
           "why are fire trucks red"]
print(quebra_frases.get_exclusive_chunks(samples))
# {'light', 'maximum', 'a firetruck', 'why are fire trucks red'}
print(quebra_frases.get_exclusive_chunks(samples, squash=False))
# [['light'],
#  ['maximum', 'a firetruck'],
#  ['why are fire trucks red']]
```
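The sample output above suggests that with `squash=False` the exclusive chunks come back as one list per input sample, in input order. Under that assumption (not stated explicitly in the README), each sample can be paired with the words that distinguish it:

```python
import quebra_frases

samples = ["what is the speed of light",
           "what is the maximum speed of a firetruck",
           "why are fire trucks red"]

# Sketch (assumption): squash=False appears to yield one list of exclusive
# chunks per sample, aligned with the order of `samples`.
for sample, exclusive in zip(samples,
                             quebra_frases.get_exclusive_chunks(samples, squash=False)):
    print(f"{sample!r} -> {exclusive}")
# 'what is the speed of light' -> ['light']
# 'what is the maximum speed of a firetruck' -> ['maximum', 'a firetruck']
# 'why are fire trucks red' -> ['why are fire trucks red']
```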
## Install
```bash
pip install quebra_frases
```
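A quick post-install check, reusing the `word_tokenize` call shown in Usage above (the expected output is an assumption based on that sample, not taken from the README):

```python
import quebra_frases

# Based on the word_tokenize behaviour shown above, punctuation is split off.
print(quebra_frases.word_tokenize("hello world!"))
# expected: ['hello', 'world', '!']
```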