Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/openvoiceos/quebra_frases
chunks strings into byte sized pieces
- Host: GitHub
- URL: https://github.com/openvoiceos/quebra_frases
- Owner: OpenVoiceOS
- License: apache-2.0
- Created: 2021-04-15T13:06:37.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2024-01-28T13:54:26.000Z (11 months ago)
- Last Synced: 2024-10-28T14:42:44.700Z (2 months ago)
- Topics: chunk, chunking, sentence-chunking, tokenization, tokenize, tokenized, tokenizer, word-tokenizing
- Language: Python
- Homepage:
- Size: 35.2 KB
- Stars: 1
- Watchers: 4
- Forks: 3
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
  - Changelog: CHANGELOG.md
  - License: LICENSE
Awesome Lists containing this project
README
# Quebra Frases
quebra_frases chunks strings into byte sized pieces
## Usage
### Tokenization
```python
import quebra_frases

sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"

print(quebra_frases.word_tokenize(sentence))
# ['sometimes', 'i', 'develop', 'stuff', 'for', 'mycroft', ',',
#  'mycroft', 'is', 'FOSS', '!']

print(quebra_frases.span_indexed_word_tokenize(sentence))
# [(0, 9, 'sometimes'), (10, 11, 'i'), (12, 19, 'develop'),
#  (20, 25, 'stuff'), (26, 29, 'for'), (30, 37, 'mycroft'),
#  (37, 38, ','), (39, 46, 'mycroft'), (47, 49, 'is'),
#  (50, 54, 'FOSS'), (54, 55, '!')]

print(quebra_frases.sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
# ['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.',
#  'Did he mind?',
#  "Adam Jones Jr. thinks he didn't.",
#  "In any case, this isn't true...",
#  "Well, with a probability of .9 it isn't."]

print(quebra_frases.span_indexed_sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
# [(0, 82, 'Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.'),
#  (83, 95, 'Did he mind?'),
#  (96, 128, "Adam Jones Jr. thinks he didn't."),
#  (129, 160, "In any case, this isn't true..."),
#  (161, 201, "Well, with a probability of .9 it isn't.")]

print(quebra_frases.paragraph_tokenize('This is a paragraph!\n\t\nThis is another '
                                       'one.\t\n\tUsing multiple lines\t \n '
                                       '\n\tparagraph 3 says goodbye'))
# ['This is a paragraph!\n\t\n',
#  'This is another one.\t\n\tUsing multiple lines\t \n \n',
#  '\tparagraph 3 says goodbye']

print(quebra_frases.span_indexed_paragraph_tokenize('This is a paragraph!\n\t\nThis is another '
                                                    'one.\t\n\tUsing multiple lines\t \n '
                                                    '\n\tparagraph 3 says goodbye'))
# [(0, 23, 'This is a paragraph!\n\t\n'),
#  (23, 77, 'This is another one.\t\n\tUsing multiple lines\t \n \n'),
#  (77, 102, '\tparagraph 3 says goodbye')]
```
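The `span_indexed_*` variants return `(start, end, token)` triples. Assuming, as the sample output above suggests, that `start` and `end` are ordinary Python slice indices into the original string, each token can be recovered by slicing. A minimal sketch under that assumption:

```python
import quebra_frases

sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"

# Sketch (not from the README): if (start, end) are plain slice indices
# into the original string, slicing should reproduce each token exactly.
for start, end, token in quebra_frases.span_indexed_word_tokenize(sentence):
    assert sentence[start:end] == token
    print(f"{start:>2}-{end:<2} {token!r}")
```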
### Chunking
```python
import quebra_frases

delimiters = ["mycroft"]
sentence = "sometimes i develop stuff for mycroft, mycroft is FOSS!"
print(quebra_frases.chunk(sentence, delimiters))
# ['sometimes i develop stuff for', 'mycroft', ',', 'mycroft', 'is FOSS!']

samples = ["tell me what do you dream about",
           "tell me what did you dream about",
           "tell me what are your dreams about",
           "tell me what were your dreams about"]
print(quebra_frases.get_common_chunks(samples))
# {'tell me what', 'about'}
print(quebra_frases.get_uncommon_chunks(samples))
# {'do you dream', 'did you dream', 'are your dreams', 'were your dreams'}
print(quebra_frases.get_exclusive_chunks(samples))
# {'do', 'did', 'are', 'were'}

samples = ["what is the speed of light",
           "what is the maximum speed of a firetruck",
           "why are fire trucks red"]
print(quebra_frases.get_exclusive_chunks(samples))
# {'light', 'maximum', 'a firetruck', 'why are fire trucks red'}
print(quebra_frases.get_exclusive_chunks(samples, squash=False))
# [['light'],
#  ['maximum', 'a firetruck'],
#  ['why are fire trucks red']]
```
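The sample output above suggests that with `squash=False` the exclusive chunks come back as one list per input sample, in input order. Under that assumption (not stated explicitly in the README), each sample can be paired with the words that distinguish it:

```python
import quebra_frases

samples = ["what is the speed of light",
           "what is the maximum speed of a firetruck",
           "why are fire trucks red"]

# Sketch (assumption): squash=False appears to yield one list of exclusive
# chunks per sample, aligned with the order of `samples`.
for sample, exclusive in zip(samples,
                             quebra_frases.get_exclusive_chunks(samples, squash=False)):
    print(f"{sample!r} -> {exclusive}")
# 'what is the speed of light' -> ['light']
# 'what is the maximum speed of a firetruck' -> ['maximum', 'a firetruck']
# 'why are fire trucks red' -> ['why are fire trucks red']
```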
## Install
```bash
pip install quebra_frases
```
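A quick post-install check, reusing the `word_tokenize` call shown in Usage above (the expected output is an assumption based on that sample, not taken from the README):

```python
import quebra_frases

# Based on the word_tokenize behaviour shown above, punctuation is split off.
print(quebra_frases.word_tokenize("hello world!"))
# expected: ['hello', 'world', '!']
```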