https://github.com/hayatiyrtgl/julia_text_preprocessing_for_nlp
Tokenizer for nlp training with julia language
https://github.com/hayatiyrtgl/julia_text_preprocessing_for_nlp
julia julia-language julia-package julia-text
Last synced: 6 months ago
JSON representation
Tokenizer for nlp training with julia language
- Host: GitHub
- URL: https://github.com/hayatiyrtgl/julia_text_preprocessing_for_nlp
- Owner: HayatiYrtgl
- License: mit
- Created: 2024-05-17T11:50:49.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-17T11:51:20.000Z (about 2 years ago)
- Last Synced: 2025-03-14T10:14:26.621Z (over 1 year ago)
- Topics: julia, julia-language, julia-package, julia-text
- Language: Julia
- Homepage:
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Sure, here is a detailed README.md file for your GitHub repository based on the provided code:
```markdown
# Tokenizer and Sequence Padding in Julia
This repository provides simple functions for tokenizing text, converting text to sequences of tokens, and padding these sequences. The implementation is done in the Julia programming language.
## Functions
### `Tokenizer`
The `Tokenizer` function takes raw text as input, tokenizes it by splitting the text into words, and creates a dictionary of tokens. It returns a dictionary of tokens and the tokenized text.
#### Parameters
- `raw_text::String`: The input text to be tokenized.
#### Returns
- `token_dictionary::Dict`: A dictionary where keys are unique tokens and values are their corresponding indices.
- `text::Array{String,1}`: An array of the tokenized words.
#### Example
```julia
words = "Abandon Benevolent Catastrophe Diligent Eccentric Fascinate Generous Hilarious Innovative Juxtapose Kaleidoscope Luminous Meticulous Notorious Obsolete Phenomenon"
tok, text = Tokenizer(words)
println(tok)
println(text)
```
### `texts_to_sequence`
The `texts_to_sequence` function converts a given text to a sequence of token indices based on the provided tokenizer dictionary. It also handles unknown tokens by assigning them a default value.
#### Parameters
- `tokenizer_dictionary::Dict`: The dictionary of tokens generated by the `Tokenizer` function.
- `parsed_text::String`: The text to be converted into a sequence.
- `unknown_token::Int`: The value to be assigned to unknown tokens (default is 0).
#### Returns
- `Array{Int64,1}`: An array of token indices representing the input text.
#### Example
```julia
sequence = texts_to_sequence(tok, "abandon Catastrophe Fascinate Hilarious")
println(sequence)
```
### `pad_sequence`
The `pad_sequence` function pads a sequence of token indices to a specified length with zeros. It supports pre-padding.
#### Parameters
- `maxlen::Int`: The maximum length of the padded sequence.
- `array::Array`: The array of token indices to be padded.
- `padding::String`: The padding type, currently supports only "pre" (default is "pre").
#### Returns
- `Array{Float64,2}`: A padded array of the specified length.
#### Example
```julia
padded_sequence = pad_sequence(50, sequence)
println(padded_sequence)
```
## Example Usage
Here is an example of how to use these functions together:
```julia
words = "Abandon Benevolent Catastrophe Diligent Eccentric Fascinate Generous Hilarious Innovative Juxtapose Kaleidoscope Luminous Meticulous Notorious Obsolete Phenomenon"
tok, text = Tokenizer(words)
sequence = texts_to_sequence(tok, "abandon Catastrophe Fascinate Hilarious")
padded_sequence = pad_sequence(50, sequence)
println(padded_sequence)
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License.
```
This README file includes descriptions of each function, their parameters, return values, and examples of how to use them. It should help users understand and utilize your code effectively.