{"id":20373409,"url":"https://github.com/arxiver/onepiecelang","last_synced_at":"2025-08-03T18:35:21.318Z","repository":{"id":128770820,"uuid":"314640131","full_name":"arxiver/Onepiecelang","owner":"arxiver","description":"Text segmentation solution using natural language processing. ","archived":false,"fork":false,"pushed_at":"2022-03-29T11:36:55.000Z","size":1034,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-01T04:15:15.915Z","etag":null,"topics":["bigram","bigram-model","dp","dynamic-programming","machine-intelligence","machine-learning","natural-language-processing","nlp","nlp-machine-learning","text-segmentation","unigram","unigram-model","viterbi","viterbi-algorithm","word","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arxiver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-11-20T18:40:19.000Z","updated_at":"2023-12-31T14:57:28.000Z","dependencies_parsed_at":"2023-06-19T00:58:48.284Z","dependency_job_id":null,"html_url":"https://github.com/arxiver/Onepiecelang","commit_stats":null,"previous_names":["arxiver/onepiecelang"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arxiver%2FOnepiecelang","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arxiver%2FOnepiecelang/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arxiver%2FOnepiecelang/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arxiver%2FOnepiecelang/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arxiver","download_url":"https://codeload.github.com/arxiver/Onepiecelang/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241921824,"owners_count":20042763,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigram","bigram-model","dp","dynamic-programming","machine-intelligence","machine-learning","natural-language-processing","nlp","nlp-machine-learning","text-segmentation","unigram","unigram-model","viterbi","viterbi-algorithm","word","word-segmentation"],"created_at":"2024-11-15T01:18:12.734Z","updated_at":"2025-03-04T20:44:32.272Z","avatar_url":"https://github.com/arxiver.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cbr /\u003e\n\u003cp align=\"center\"\u003e\n  \u003ch2 align=\"center\"\u003eOnepiecelang\u003c/h2\u003e\n  \u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://user-images.githubusercontent.com/39674365/122943172-13b30380-d377-11eb-93b0-80ad47788e7b.png\" height=200 width=200\u003e \u003cbr\u003e\n    ·\n    \u003ca href=\"./Report.pdf\"\u003eReport\u003c/a\u003e\n    ·\n  \u003c/p\u003e\n\u003c/p\u003e\n\nText segmentation solution using natural language processing.  \nIt is concerned with splitting text into tokens. For example, “This is a whole sentence.” can be segmented into  \n[“This”, “is”, “a”, “whole”,“sentence”, “.”].  
\n## Dataset\n\nIt is included in the folder “./data” and represents the **unigram** list of the most\nfrequent words in ascending order of rank (a word's cost grows with its position in the file).\n\n## Description\n\nThe core is divided into 5 functions:\n\n- read_unigrams\n- find_spaces\n- minimize_sentence_cost\n- clean_sentence\n- score\n\n**read_unigrams:** This function reads the data-set, which contains the most\nfrequent English words, and assigns a cost to each word.\n\n**Inputs:**\n\n- Path of the data-set file.\n\n**Returns:**\n\n- Dictionary with each word in the data-set as a key; its value is the\n    cost of the word with respect to its order in the file. The cost function is\n    implemented as follows:\n       $\\mathrm{Cost}(w_i) = \\log_{10}(i + 1)$ (1)\n- Maximum word length\n\n**Description:**\n\n- Read the file\n- Count the words\n- For each word in the file, add it as a key in the dictionary of words and set\n    its weight as given in equation (1)\n- If the length of the current word is greater than the maximum word\n    length so far, set the maximum to the current word length.\n\n\n**find_spaces:** This is the main core. It takes the user’s input (a concatenated\nsentence) and splits it into words.\n\n**Inputs:**\n\n- sentence: string, the sentence/url before processing (concatenated\n    sentence)\n- unigrams: dict, the words in the data-set with their corresponding weights\n- window_size: maximum word length in the dict\n\n**Returns:**\n\n- separated: list of strings, the words after separation\n\n**Description:**\nPreprocessing\n\n1. Remove all spaces in the sentence, e.g. sentence = “welcometomy\n    world” becomes “welcometomyworld”\n\nInitializations\n\n2. Set s_lower, a lowercase copy of the sentence, to match the\n    dictionary entries\n3. Set the weights array for the cost of the sentence at each letter index\n4. Set the index array that records, for each cost, how many letters are included in that\n    word\n5. Set an empty array for the words, to store the sentence there after\n    splitting\n\nAlgorithm\n\n6. 
For every letter in the sentence, find the minimum cost of the\n    sentence up to that letter, together with the count of preceding letters that\n    gives the whole sentence its minimum cost. The cost function is the one\n    described for read_unigrams. E.g. for sentence = “heyworld”, the first\n    time the iterator stands at ‘h’ and cost(‘h’) is the\n    minimum. At the next iteration the iterator moves to ‘e’ and checks the\n    part of the sentence within the window (max_word_length) for the\n    minimum cost, as follows:\n\n\n```\nmin(cost('h') + cost('e'), cost('he'))\n```\nThe minimum cost of that iteration is cost(“he”), and the count\nof prefix letters included is 2, and so on. This is explained\nfurther in the description of **minimize_sentence_cost**.\n\n7. After iterating over all letters of the sentence, go back from the\n    last letter, whose minimum cost is the minimum cost of the whole\n    sentence, and take the letters that correspond to its weight.\n    E.g. for sentence = “heyworld”, after step 6 the splits\n    array holds, for each letter’s minimum cost, how many\n    letters were taken. Start from the last index and backtrack until reaching the start\n    of the sentence; this yields the separated words in reverse order\n8. 
Reverse the order of the separated list and return it\n\n**minimize_sentence_cost:** Utility function for **find_spaces**, used to\nminimize the sentence cost at a certain letter index in the sentence with\nrespect to the preceding letters within the window size\n\n**Inputs:**\n\n- sentence: string, the sentence/url before processing (concatenated\n    sentence)\n- unigrams: dictionary of the words in the data-set with their corresponding\n    weights\n- i: the end pointer of the word currently being processed\n- weights: list, the cost of the sentence at each index with respect to\n    the split at each letter.\n- window_size: maximum word length in the dictionary or the data-set\n\n**Returns:**\n\n- cost_min: the cost computed for the split at this letter together with its\n    preceding letters\n- letters_count: the count of letters preceding the current letter index\n    that minimizes the cost of the sentence\n\n**Description:**\n\n\nMinimize the sentence cost at the current index by looking backwards and\ntrying to find the best word that minimizes the cost of the sentence, e.g.\nsentence: He is good, and the pointer is standing at 's'.\nThe function tries every split of the previous letters within a window of size window_size.\nHere i is the end pointer of the word currently being processed.\nWe look before it letter by letter, growing the subword incrementally, and try\nto find the subword that minimizes the sum of the cost of this word and\nthe cost of the sentence before it.\n\nInitialization\n\n1. Set the start of the window to max(0, i - window_size)\n2. Take the costs/weights array from ‘start’ to i, i.e. the costs at all previous letters in the window.\n3. Set cost_min to inf (a cost that is greater than any possible value)\n    and set letters_count = -1 (not set yet).\n\nAlgorithm\n\n4. 
From the letter at the current index (i-1), look at all letters preceding it\nwithin the window size; the cost at each of them has already been computed.\nFor each letter j preceding i:\n    - A. Wj = costs[-j-1], the cost of the sentence at this letter\n    - B. Wji = the unigram cost of the word that starts at j and\n    ends at i\n    - C. If the cost of this split (Wj + Wji) \u003c cost_min, then assign it to\n    cost_min and assign the j iterator to letters_count.\n5. Return cost_min and letters_count\n\n**clean_sentence:** This function removes any special characters or\nnumbers from the sentence. The input is assumed not to contain them, but they are removed anyway to guard against\nbad test data.\n\n**Inputs:**\n\n- sentence\n\n**Returns:**\n\n- sentence after cleaning\n\n**score:** This function is used for testing and returns a score for the test\ndataset\n\n**Inputs:**\n\n- X: list of sentences\n- Y: list of lists; for each sentence, its corresponding actual output\n    as a separated list.\n- unigrams: dictionary, the words in the data-set with their corresponding\n    weights\n- window_size: maximum word length in the dictionary\n\n**Returns:**\n\n- score: the number of correctly separated words, i.e. words in the model output that match\n    the word at the same position in the actual output list,\n    divided by the total number of tested words and multiplied by 100\n\n**Description:**\nInitialization\n\n1. Set total_words = 0 and correct = 0\n2. Loop over each example, call find_spaces on the input sentence,\n    and compare each word of the expected output list of this sentence\n    with the corresponding word of the model output.\n    Increment correct if expected word == actual word\n\n## The relationship between files\n\n“/src/Notebook.ipynb” and “/src/script.py” are equivalent, but the notebook\ncontains cells for testing and the script is for end-to-end usage.\n\nBoth of those files use “/data/unigrams.txt”,\nand Notebook.ipynb also uses “/data/test_sentences.txt”.\n“/scrapper” is only used for scraping sentences for testing. 
I included it only\nto back up my work.\n\n\n## How to run the code and required dependencies\n\nThe Python version used is 3.\nDependencies:\n\n- **numpy** for mathematical operations (needs installation)\n- **time** built-in module, used for performance measurements\n- **re** built-in module (regular expression library)\n\nThere are two ways to run the code in the “./src” directory:\n\n1. Using Jupyter Notebook; you will find instructions inside\n    **(Notebook.ipynb)**\n2. Using the Python script with the following command: **“python script.py”**\n\n## How I wrote the code and how it helps you solve the problem\n\n- I watched some of the courses in a natural language processing specialization\n- I read chapter 14 of the book Beautiful Data, which gives good intuition about the\n    problem\n- The wordninja library for Python\n- Some blogs about natural language problems and similar problems\n\n\n## Results and conclusions\n\nI scraped the test set from the internet using this sentence generator:\nhttps://randomwordgenerator.com/json/sentences.json\n\nIf you want to run the scrapper, you need to have **node** installed in your\nenvironment (the scrapper is JavaScript code).\nThen go to the scrapper folder and run the following command:\n“node index.js \u003e\u003e sentences.txt”\nEach run appends 560 new sentences to the file sentences.txt.\n\nNo. of sentences: 7840\nNo. of total words: 95648\nNo. 
of completely correct words: 91098\nScore: 95.24 %\nThe time cost is 9.6 seconds for testing the whole test set and comparing the\nmodel output with the actual output\nTotal time cost = (Cost of comparing outputs + Cost of find_spaces)\n\n**Note**\nThe test set is not 100% correct, because it was scraped as mentioned before, so it\nmay contain some outliers, which account for some of the error in the test score\n\n### Randomly picked results (correct)\n\n**Testcase (1)**\n**Sentence:** Thequickbrownfoxjumpsoverthelazydog\n**Actual output as string:** The quick brown fox jumps over the lazy dog\n**Actual output:** ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']\n**Model output:** ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']\n\n**Testcase (2)**\n**Sentence:**\nHesaidhewasnotthereyesterdayhowevermanypeoplesawhimthere\n**Actual output as string:** He said he was not there yesterday however\nmany people saw him there\n\n**Actual output:** ['He', 'said', 'he', 'was', 'not', 'there', 'yesterday', 'however',\n'many', 'people', 'saw', 'him', 'there']\n**Model output:** ['He', 'said', 'he', 'was', 'not', 'there', 'yesterday', 'however',\n'many', 'people', 'saw', 'him', 'there']\n\n**Testcase (3)**\n**Sentence:** Ashelookedoutthewindowhesawaclownwalkby\n**Actual output as string:** As he looked out the window he saw a clown\nwalk by\n**Actual output:** ['As', 'he', 'looked', 'out', 'the', 'window', 'he', 'saw', 'a',\n'clown', 'walk', 'by']\n**Model output:** ['As', 'he', 'looked', 'out', 'the', 'window', 'he', 'saw', 'a',\n'clown', 'walk', 'by']\n\n**Testcase (4)**\n**Sentence:** Thisisawholesentence.\n**Actual output as string:** This is a whole sentence.\n**Actual output:** ['This', 'is', 'a', 'whole', 'sentence', '.']\n**Model output:** ['This', 'is', 'a', 'whole', 'sentence', '.']\n\n**Testcase (5)**\n**Sentence:** Hedranklifebeforespittingitout\n**Actual output as string:** He drank life before spitting it out\n**Actual output:** 
['He', 'drank', 'life', 'before', 'spitting', 'it', 'out']\n**Model output:** ['He', 'drank', 'life', 'before', 'spitting', 'it', 'out']\n\n### Randomly picked results (incorrect)\n\n**Testcase (1)**\n**Sentence:**\nIwasveryproudofmynicknamethroughouthighschoolbuttodayIcouldntbeanydifferenttowhatmynicknamewas\n\n**Actual output:** ['I', 'was', 'very', 'proud', 'of', 'my', 'nickname', 'throughout',\n'high', 'school', 'but', 'today', 'I', 'couldnt', 'be', 'any', 'different', 'to', 'what',\n'my', 'nickname', 'was']\n\n**Model output:** ['I', 'was', 'very', 'proud', 'of', 'my', 'nickname', 'throughout',\n'highschool', 'but', 'today', 'I', 'couldnt', 'be', 'any', 'different', 'to', 'what', 'my',\n'nickname', 'was']\n\n**Notice:** “high school” is output as “highschool” because “highschool” is an entry in the unigram\nmodel.\n\n**Testcase (2)**\n**Sentence:** Shealwaysspeakstohiminaloudvoice\n**Actual output:** ['She', 'always', 'speaks', 'to', 'him', 'in', 'a', 'loud', 'voice']\n**Model output:** ['She', 'always', 'speaks', 'to', 'him', 'in', 'aloud', 'voice']\n**Notice:** The model outputs ‘aloud’ instead of ‘a loud’, again because of the unigram\nmodel. Some test examples may also be debatable, since they expect\none segmentation when there is more than one valid way to split the sentence.\n\n**Testcase (3)**\n**Sentence:** Allyouneedtodoispickupthepenandbegin\n**Actual output:** ['All', 'you', 'need', 'to', 'do', 'is', 'pick', 'up', 'the', 'pen', 'and',\n'begin']\n**Model output:** ['All', 'you', 'need', 'to', 'do', 'is', 'pickup', 'the', 'pen', 'and',\n'begin']\n\n**Notice:** As in the previous example, the model output ‘pickup’ as a single\nword instead of ‘pick up’. 
This demonstrates that most of the effect\ncomes from the unigram data-set used for building the model.\n\nMore test cases can be tried in the notebook.\n\n## Assumptions\n\nThe following list represents my assumptions.\n\n- There are no **numbers or special characters** in the url or sentence that will be\n    provided\n- The domain of the sentences/urls that will be used matches the\n    dictionary I used, i.e. the most frequent words in English\n- Word spellings are correct\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farxiver%2Fonepiecelang","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farxiver%2Fonepiecelang","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farxiver%2Fonepiecelang/lists"}