{"id":15057193,"url":"https://github.com/hayatiyrtgl/julia_text_preprocessing_for_nlp","last_synced_at":"2026-01-01T22:47:30.018Z","repository":{"id":240226612,"uuid":"802035534","full_name":"HayatiYrtgl/Julia_Text_Preprocessing_For_NLP","owner":"HayatiYrtgl","description":"Tokenizer for nlp training with julia language","archived":false,"fork":false,"pushed_at":"2024-05-17T11:51:20.000Z","size":4,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T10:14:26.621Z","etag":null,"topics":["julia","julia-language","julia-package","julia-text"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HayatiYrtgl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-17T11:50:49.000Z","updated_at":"2024-05-17T11:52:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"fc3c772f-710a-4807-9155-1722979061ac","html_url":"https://github.com/HayatiYrtgl/Julia_Text_Preprocessing_For_NLP","commit_stats":null,"previous_names":["hayatiyrtgl/julia_text_preprocessing_for_nlp"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HayatiYrtgl%2FJulia_Text_Preprocessing_For_NLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HayatiYrtgl%2FJulia_Text_Preprocessing_For_NLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HayatiYrtgl%2FJulia_Text_Preprocessing_For_NLP/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HayatiYrtgl%2FJulia_Text_Preprocessing_For_NLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HayatiYrtgl","download_url":"https://codeload.github.com/HayatiYrtgl/Julia_Text_Preprocessing_For_NLP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243558478,"owners_count":20310574,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["julia","julia-language","julia-package","julia-text"],"created_at":"2024-09-24T22:03:36.138Z","updated_at":"2026-01-01T22:47:29.948Z","avatar_url":"https://github.com/HayatiYrtgl.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tokenizer and Sequence Padding in Julia\n\nThis repository provides simple functions for tokenizing text, converting text to sequences of tokens, and padding these sequences. The implementation is done in the Julia programming language.\n\n## Functions\n\n### `Tokenizer`\n\nThe `Tokenizer` function takes raw text as input, tokenizes it by splitting the text into words, and creates a dictionary of tokens. 
It returns a dictionary of tokens and the tokenized text.\n\n#### Parameters\n\n- `raw_text::String`: The input text to be tokenized.\n\n#### Returns\n\n- `token_dictionary::Dict`: A dictionary where keys are unique tokens and values are their corresponding indices.\n- `text::Array{String,1}`: An array of the tokenized words.\n\n#### Example\n\n```julia\nwords = \"Abandon Benevolent Catastrophe Diligent Eccentric Fascinate Generous Hilarious Innovative Juxtapose Kaleidoscope Luminous Meticulous Notorious Obsolete Phenomenon\"\ntok, text = Tokenizer(words)\nprintln(tok)\nprintln(text)\n```\n\n### `texts_to_sequence`\n\nThe `texts_to_sequence` function converts a given text to a sequence of token indices based on the provided tokenizer dictionary. It also handles unknown tokens by assigning them a default value.\n\n#### Parameters\n\n- `tokenizer_dictionary::Dict`: The dictionary of tokens generated by the `Tokenizer` function.\n- `parsed_text::String`: The text to be converted into a sequence.\n- `unknown_token::Int`: The value to be assigned to unknown tokens (default is 0).\n\n#### Returns\n\n- `Array{Int64,1}`: An array of token indices representing the input text.\n\n#### Example\n\n```julia\nsequence = texts_to_sequence(tok, \"abandon Catastrophe Fascinate Hilarious\")\nprintln(sequence)\n```\n\n### `pad_sequence`\n\nThe `pad_sequence` function pads a sequence of token indices to a specified length with zeros. 
It supports pre-padding.\n\n#### Parameters\n\n- `maxlen::Int`: The maximum length of the padded sequence.\n- `array::Array`: The array of token indices to be padded.\n- `padding::String`: The padding type, currently supports only \"pre\" (default is \"pre\").\n\n#### Returns\n\n- `Array{Float64,2}`: A padded array of the specified length.\n\n#### Example\n\n```julia\npadded_sequence = pad_sequence(50, sequence)\nprintln(padded_sequence)\n```\n\n## Example Usage\n\nHere is an example of how to use these functions together:\n\n```julia\nwords = \"Abandon Benevolent Catastrophe Diligent Eccentric Fascinate Generous Hilarious Innovative Juxtapose Kaleidoscope Luminous Meticulous Notorious Obsolete Phenomenon\"\ntok, text = Tokenizer(words)\nsequence = texts_to_sequence(tok, \"abandon Catastrophe Fascinate Hilarious\")\npadded_sequence = pad_sequence(50, sequence)\nprintln(padded_sequence)\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhayatiyrtgl%2Fjulia_text_preprocessing_for_nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhayatiyrtgl%2Fjulia_text_preprocessing_for_nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhayatiyrtgl%2Fjulia_text_preprocessing_for_nlp/lists"}