{"id":14958836,"url":"https://github.com/zayedrais/documentsearchengine","last_synced_at":"2025-05-02T12:32:04.357Z","repository":{"id":43890793,"uuid":"236357970","full_name":"zayedrais/DocumentSearchEngine","owner":"zayedrais","description":"Document Search Engine project with TF-IDF abd Google universal sentence encoder model","archived":false,"fork":false,"pushed_at":"2023-05-01T21:22:46.000Z","size":30013,"stargazers_count":53,"open_issues_count":3,"forks_count":24,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-07T02:06:58.802Z","etag":null,"topics":["data-science","deep-learning","document-search","document-similarity","juypter","machine-learning","python","python-text-analysis","semantic-search","semantic-search-engine","tensorflow","tensorflow-models","tensorflow-tutorials","text-analysis","text-search","text-semantic-similarity","tfidf","tfidf-text-analysis","tfidf-vectorizer","universal-sentence-encoder"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zayedrais.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-26T18:39:41.000Z","updated_at":"2025-01-12T04:36:16.000Z","dependencies_parsed_at":"2024-09-22T08:11:47.083Z","dependency_job_id":null,"html_url":"https://github.com/zayedrais/DocumentSearchEngine","commit_stats":{"total_commits":12,"total_committers":2,"mean_commits":6.0,"dds":0.08333333333333337,"last_synced_commit":"d5951f0e8bd17e9f974f7b5b5fd700d505123228"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zayedrais%2FDocumentSearchEngine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zayedrais%2FDocumentSearchEngine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zayedrais%2FDocumentSearchEngine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zayedrais%2FDocumentSearchEngine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zayedrais","download_url":"https://codeload.github.com/zayedrais/DocumentSearchEngine/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252038231,"owners_count":21684657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","deep-learning","document-search","document-similarity","juypter","machine-learning","python","python-text-analysis","semantic-search","semantic-search-engine","tensorflow","tensorflow-models","tensorflow-tutorials","text-analysis","text-search","text-semantic-similarity","tfidf","tfidf-text-analysis","tfidf-vectorizer","universal-sentence-encoder"],"created_at":"2024-09-24T13:18:22.772Z","updated_at":"2025-05-02T12:31:59.347Z","avatar_url":"https://github.com/zayedrais.png","language":"Jupyter Notebook","readme":"\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eIn this post, we will be building a \u003cstrong class=\"id iq\"\u003esemantic documents search engine\u003c/strong\u003e by using \u003ca class=\"bu dh iw ix iy iz\" href=\"http://qwone.com/~jason/20Newsgroups/\" target=\"_blank\" rel=\"noopener nofollow\"\u003e20newsgroup open-source dataset\u003c/a\u003e.\u003c/p\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003ePrerequisites\u003c/h1\u003e\n\u003cul class=\"\"\u003e\n\u003cli class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://www.python.org/\" target=\"_blank\" rel=\"noopener nofollow\"\u003ePython 3.5\u003c/a\u003e+\u003c/li\u003e\n\u003cli class=\"ia ib cs ax id b ie nc ig nd mg ne mi nf mk ng io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://pypi.org/project/pip/\" target=\"_blank\" rel=\"noopener nofollow\"\u003epip 19\u003c/a\u003e+ or pip3\u003c/li\u003e\n\u003cli class=\"ia ib cs ax id b ie nc ig nd mg ne mi nf mk ng io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://www.nltk.org/\" target=\"_blank\" rel=\"noopener nofollow\"\u003eNLTK\u003c/a\u003e\u003c/li\u003e\n\u003cli class=\"ia ib cs ax id b ie nc ig nd mg ne mi nf mk ng io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://scikit-learn.org/stable/\" target=\"_blank\" rel=\"noopener nofollow\"\u003eScikit-learn\u003c/a\u003e\u003c/li\u003e\n\u003cli class=\"ia ib cs ax id b ie nc ig nd mg ne mi nf mk ng io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://www.tensorflow.org\" target=\"_blank\" rel=\"noopener nofollow\"\u003eTensorFlow-GPU\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003e1. Getting Ready\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eFor this post we will need the above prerequisites\u003cstrong class=\"id iq\"\u003e,\u0026nbsp;\u003c/strong\u003eIf you do not have it yet, please make ready for it.\u003c/p\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003e2. Data collection\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eHere, we are using 20newsgroup dataset to the analysis of a text search engine giving input keywords/sentences input.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eThe 20 Newsgroups data set is a collection of approximately 11K newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003enews = pd.read_json('\u003ca class=\"bu dh iw ix iy iz\" href=\"https://raw.githubusercontent.com/zayedrais/DocumentSearchEngine/master/data/newsgroups.json\" target=\"_blank\" rel=\"noopener nofollow\"\u003ehttps://raw.githubusercontent.com/zayedrais/DocumentSearchEngine/master/data/newsgroups.json\u003c/a\u003e')\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003e2.1 data cleaning:\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eBefore going into a clean phase, we are retrieving the subject of the document from the text.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003efor i,txt in enumerate(news['content']):\u003cbr /\u003e    subject = re.findall('Subject:(.*\\n)',txt)\u003cbr /\u003e    if (len(subject) !=0):\u003cbr /\u003e        news.loc[i,'Subject'] =str(i)+' '+subject[0]\u003cbr /\u003e    else:\u003cbr /\u003e        news.loc[i,'Subject'] ='NA'\u003cbr /\u003edf_news =news[['Subject','content']]\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eNow, we are removing the unwanted data from text content and the subject of a dataset.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edf_news.content =df_news.content.replace(to_replace='from:(.*\\n)',value='',regex=True) ##remove from to email \u003cbr /\u003edf_news.content =df_news.content.replace(to_replace='lines:(.*\\n)',value='',regex=True)\u003cbr /\u003edf_news.content =df_news.content.replace(to_replace='[!\"#$%\u0026amp;\\'()*+,/:;\u0026lt;=\u0026gt;?@[\\\\]^_`{|}~]',value=' ',regex=True) #remove punctuation except\u003cbr /\u003edf_news.content =df_news.content.replace(to_replace='-',value=' ',regex=True)\u003cbr /\u003edf_news.content =df_news.content.replace(to_replace='\\s+',value=' ',regex=True)    #remove new line\u003cbr /\u003edf_news.content =df_news.content.replace(to_replace='  ',value='',regex=True)                #remove double white space\u003cbr /\u003edf_news.content =df_news.content.apply(lambda x:x.strip())  # Ltrim and Rtrim of whitespace\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003e2.2 data preprocessing\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003ePreprocessing is one of the major steps when we are dealing with any kind of text models. During this stage, we have to look at the distribution of our data, what techniques are needed and how deep we should clean.\u003c/p\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eLowercase\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eConversion the text into a lower form. i.e. \u0026lsquo;\u003cstrong class=\"id iq\"\u003eDogs\u0026rsquo;\u003c/strong\u003e into \u0026lsquo;\u003cstrong class=\"id iq\"\u003edogs\u003c/strong\u003e\u0026rsquo;\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edf_news['content']=[entry.lower() for entry in df_news['content']]\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eWord Tokenization\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eWord tokenization is the process to divide the sentence into the form of a word.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u0026ldquo;\u003cstrong class=\"id iq\"\u003eJhon is running in the track\u003c/strong\u003e\u0026rdquo; \u0026rarr; \u0026lsquo;\u003cstrong class=\"id iq\"\u003ejohn\u003c/strong\u003e\u0026rsquo;, \u0026lsquo;\u003cstrong class=\"id iq\"\u003eis\u003c/strong\u003e\u0026rsquo;, \u0026lsquo;\u003cstrong class=\"id iq\"\u003erunning\u003c/strong\u003e\u0026rsquo;, \u0026lsquo;\u003cstrong class=\"id iq\"\u003ein\u003c/strong\u003e\u0026rsquo;, \u0026lsquo;\u003cstrong class=\"id iq\"\u003ethe\u003c/strong\u003e\u0026rsquo;, \u0026lsquo;\u003cstrong class=\"id iq\"\u003etrack\u003c/strong\u003e\u0026rsquo;\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edf_news['Word tokenize']= [word_tokenize(entry) for entry in df_news.content]\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eStop words\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eStop words are the most commonly occurring words which don\u0026rsquo;t give any additional value to the document vector. in-fact removing these will increase computation and space efficiency. \u003ca class=\"bu dh iw ix iy iz\" href=\"https://www.nltk.org/\" target=\"_blank\" rel=\"noopener nofollow\"\u003eNLTK\u003c/a\u003e library has a method to download the stopwords.\u003c/p\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u nz\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"oa y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/601/1*PdgWsOM1ep9Z2rfkQ6UJZA.png\" width=\"601\" height=\"275\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eWord Lemmatization\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eLemmatisation is a way to reduce the word to root synonym of a word. Unlike Stemming, Lemmatisation makes sure that the reduced word is again a dictionary word (word present in the same language). WordNetLemmatizer can be used to lemmatize any word.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003ei.e. \u003cstrong class=\"id iq\"\u003erocks \u0026rarr;rock, better \u0026rarr;good, corpora \u0026rarr;corpus\u003c/strong\u003e\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eHere created wordLemmatizer function to remove a \u003cstrong class=\"id iq\"\u003esingle character\u003c/strong\u003e, \u003cstrong class=\"id iq\"\u003estopwords\u003c/strong\u003e and \u003cstrong class=\"id iq\"\u003elemmatize\u003c/strong\u003e the words.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun\u003cbr /\u003edef wordLemmatizer(data):\u003cbr /\u003e    tag_map = defaultdict(lambda : wn.NOUN)\u003cbr /\u003e    tag_map['J'] = wn.ADJ\u003cbr /\u003e    tag_map['V'] = wn.VERB\u003cbr /\u003e    tag_map['R'] = wn.ADV\u003cbr /\u003e    file_clean_k =pd.DataFrame()\u003cbr /\u003e    for index,entry in enumerate(data):\u003cbr /\u003e        \u003cbr /\u003e        # Declaring Empty List to store the words that follow the rules for this step\u003cbr /\u003e        Final_words = []\u003cbr /\u003e        # Initializing WordNetLemmatizer()\u003cbr /\u003e        word_Lemmatized = WordNetLemmatizer()\u003cbr /\u003e        # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.\u003cbr /\u003e        for word, tag in pos_tag(entry):\u003cbr /\u003e            # Below condition is to check for Stop words and consider only alphabets\u003cbr /\u003e            if len(word)\u0026gt;1 and word not in stopwords.words('english') and word.isalpha():\u003cbr /\u003e                word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])\u003cbr /\u003e                Final_words.append(word_Final)\u003cbr /\u003e            # The final processed set of words for each iteration will be stored in 'text_final'\u003cbr /\u003e                file_clean_k.loc[index,'Keyword_final'] = str(Final_words)\u003cbr /\u003e                file_clean_k.loc[index,'Keyword_final'] = str(Final_words)\u003cbr /\u003e                file_clean_k=file_clean_k.replace(to_replace =\"\\[.\", value = '', regex = True)\u003cbr /\u003e                file_clean_k=file_clean_k.replace(to_replace =\"'\", value = '', regex = True)\u003cbr /\u003e                file_clean_k=file_clean_k.replace(to_replace =\" \", value = '', regex = True)\u003cbr /\u003e                file_clean_k=file_clean_k.replace(to_replace ='\\]', value = '', regex = True)\u003cbr /\u003e    return file_clean_k\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eBy using this function took around \u003cstrong class=\"id iq\"\u003e13 hrs\u003c/strong\u003e time to check and lemmatize the words of 11K documents of the 20newsgroup dataset. Find below the JSON file of the lemmatized word.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://raw.githubusercontent.com/zayedrais/DocumentSearchEngine/master/data/WordLemmatize20NewsGroup.json\" target=\"_blank\" rel=\"noopener nofollow\"\u003ehttps://raw.githubusercontent.com/zayedrais/DocumentSearchEngine/master/data/WordLemmatize20NewsGroup.json\u003c/a\u003e\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003e2.3 data is ready for use\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eSee a sample of clean data-\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edf_news.Clean_Keyword[0]\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"lw lx br ly v\"\u003e\n\u003cdiv class=\"t u ob\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"oc y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/718/1*Br5cASjTPcoN0J1QhXIYuA.png\" width=\"718\" height=\"105\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003e3. Document Search engine\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eIn this post, we are using three approaches to understand text analysis.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e1.Document search engine with \u003cstrong class=\"id iq\"\u003eTF-IDF\u003c/strong\u003e\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e2.Document search engine with \u003cstrong class=\"id iq\"\u003eGoogle Universal sentence encoder\u003c/strong\u003e\u003c/p\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003e3.1 Calculating the ranking by using \u003ca class=\"bu dh iw ix iy iz\" href=\"https://en.wikipedia.org/wiki/Cosine_similarity\" target=\"_blank\" rel=\"noopener nofollow\"\u003ecosine similarity\u003c/a\u003e\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eIt is the most common metric used to calculate the similarity between document text from input keywords/sentences. Mathematically, it measures the cosine of the angle b/w two vectors projected in a multi-dimensional space.\u003c/p\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u od\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"oe y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/372/1*nJT7q9nlDWgXllSHcI4ZJA.jpeg\" width=\"372\" height=\"263\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cfigcaption class=\"bb bp ma dx mb w t u mc md aw fa\" data-selectable-paragraph=\"\"\u003eCosine Similarity b/w document to query\u003c/figcaption\u003e\n\u003c/figure\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eIn the above diagram, have 3 document vector value and one query vector in space. when we are calculating the cosine similarity b/w above 3 documents. The most similarity value will be D3 document from three documents.\u003c/p\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003e1. Document search engine with TF-IDF:\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://en.wikipedia.org/wiki/Tf%E2%80%93idf\" target=\"_blank\" rel=\"noopener nofollow\"\u003e\u003cstrong class=\"id iq\"\u003eTF-IDF\u003c/strong\u003e\u003c/a\u003e stands for \u003cstrong class=\"id iq\"\u003e\u0026ldquo;Term Frequency \u0026mdash; Inverse Document Frequency\u0026rdquo;\u003c/strong\u003e. This is a technique to calculate the weight of each word signifies the importance of the word in the document and corpus. This algorithm is mostly using for the retrieval of information and text mining field.\u003c/p\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eTerm Frequency (TF)\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eThe number of times a word appears in a document divided by the total number of words in the document. Every document has its term frequency.\u003c/p\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"lw lx br ly v\"\u003e\n\u003cdiv class=\"t u of\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"og y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/343/0*0Uzik-cTMA-i6BUt.png\" width=\"343\" height=\"121\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eInverse Data Frequency (IDF)\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eThe log of the number of documents divided by the number of documents that contain the word \u003cstrong class=\"id iq\"\u003e\u003cem class=\"ic\"\u003ew\u003c/em\u003e\u003c/strong\u003e. Inverse data frequency determines the weight of rare words across all documents in the corpus.\u003c/p\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u oh\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"oi y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/390/0*t2Uxb_43L3vjwDPm.png\" width=\"390\" height=\"123\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eLastly, the \u003cstrong class=\"id iq\"\u003eTF-IDF\u003c/strong\u003e is simply the TF multiplied by IDF.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"nk iq\"\u003eTF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)\u003c/strong\u003e\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u oj\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"ok y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/505/0*yJm1bH6Ds0vFFyhP.png\" width=\"505\" height=\"128\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eRather than manually implementing \u003ca class=\"bu dh iw ix iy iz\" href=\"http://www.tfidf.com/\" target=\"_blank\" rel=\"noopener nofollow\"\u003eTF-IDF\u003c/a\u003e ourselves, we could use the class provided by \u003ca class=\"bu dh iw ix iy iz\" href=\"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\" target=\"_blank\" rel=\"noopener nofollow\"\u003eSklearn\u003c/a\u003e.\u003c/p\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eGenerated TF-IDF by using TfidfVectorizer from Sklearn\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eImport the packages:\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eimport pandas as pd\u003cbr /\u003eimport numpy as np\u003cbr /\u003eimport os \u003cbr /\u003eimport re\u003cbr /\u003eimport operator\u003cbr /\u003eimport nltk \u003cbr /\u003efrom nltk.tokenize import word_tokenize\u003cbr /\u003efrom nltk import pos_tag\u003cbr /\u003efrom nltk.corpus import stopwords\u003cbr /\u003efrom nltk.stem import WordNetLemmatizer\u003cbr /\u003efrom collections import defaultdict\u003cbr /\u003efrom nltk.corpus import wordnet as wn\u003cbr /\u003efrom sklearn.feature_extraction.text import TfidfVectorizer\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eTF-IDF\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003efrom sklearn.feature_extraction.text import TfidfVectorizer\u003cbr /\u003eimport operator\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003e## Create Vocabulary\u003cbr /\u003evocabulary = set()\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003efor doc in df_news.Clean_Keyword:\u003cbr /\u003e    vocabulary.update(doc.split(','))\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003evocabulary = list(vocabulary)\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003e# Intializating the tfIdf model\u003cbr /\u003etfidf = TfidfVectorizer(vocabulary=vocabulary)\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003e# Fit the TfIdf model\u003cbr /\u003etfidf.fit(df_news.Clean_Keyword)\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003e# Transform the TfIdf model\u003cbr /\u003etfidf_tran=tfidf.transform(df_news.Clean_Keyword)\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eThe above code has created TF-IDF weight of the whole dataset, Now have to create a function to generate a vector for the input query.\u003c/p\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eCreate a vector for Query/search keywords\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edef gen_vector_T(tokens):\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003eQ = np.zeros((len(vocabulary)))    \u003cbr /\u003e    x= tfidf.transform(tokens)\u003cbr /\u003e    #print(tokens[0].split(','))\u003cbr /\u003e    for token in tokens[0].split(','):\u003cbr /\u003e        #print(token)\u003cbr /\u003e        try:\u003cbr /\u003e            ind = vocabulary.index(token)\u003cbr /\u003e            Q[ind]  = x[0, tfidf.vocabulary_[token]]\u003cbr /\u003e        except:\u003cbr /\u003e            pass\u003cbr /\u003e    return Q\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eCosine Similarity function for the calculation\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edef cosine_sim(a, b):\u003cbr /\u003e    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))\u003cbr /\u003e    return cos_sim\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eCosine Similarity b/w document to query function\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edef cosine_similarity_T(k, query):\u003cbr /\u003e    preprocessed_query = preprocessed_query = re.sub(\"\\W+\", \" \", query).strip()\u003cbr /\u003e    tokens = word_tokenize(str(preprocessed_query))\u003cbr /\u003e    q_df = pd.DataFrame(columns=['q_clean'])\u003cbr /\u003e    q_df.loc[0,'q_clean'] =tokens\u003cbr /\u003e    q_df['q_clean'] =wordLemmatizer(q_df.q_clean)\u003cbr /\u003e    d_cosines = []\u003cbr /\u003e    \u003cbr /\u003e    query_vector = gen_vector_T(q_df['q_clean'])\u003cbr /\u003e    for d in tfidf_tran.A:\u003cbr /\u003e        d_cosines.append(cosine_sim(query_vector, d))\u003cbr /\u003e                    \u003cbr /\u003e    out = np.array(d_cosines).argsort()[-k:][::-1]\u003cbr /\u003e    #print(\"\")\u003cbr /\u003e    d_cosines.sort()\u003cbr /\u003e    a = pd.DataFrame()\u003cbr /\u003e    for i,index in enumerate(out):\u003cbr /\u003e        a.loc[i,'index'] = str(index)\u003cbr /\u003e        a.loc[i,'Subject'] = df_news['Subject'][index]\u003cbr /\u003e    for j,simScore in enumerate(d_cosines[-k:][::-1]):\u003cbr /\u003e        a.loc[j,'Score'] = simScore\u003cbr /\u003e    return a\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eTesting the function\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003ecosine_similarity_T(10,\u0026rsquo;computer science\u0026rsquo;)\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u oq\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"or y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/429/1*iSxlgzrxMz9Epnp4WkhxBQ.png\" width=\"429\" height=\"210\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cfigcaption class=\"bb bp ma dx mb w t u mc md aw fa\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"aw iq\"\u003eResult of top 5 similarity documents for \u0026ldquo;computer science\u0026rdquo; word\u003c/strong\u003e\u003c/figcaption\u003e\n\u003c/figure\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003e2. Document search engine with Google Universal sentence encoder\u003c/h1\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eIntroduction Google USE\u003c/h2\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eThe pre-trained \u003ca class=\"bu dh iw ix iy iz\" href=\"https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html\" target=\"_blank\" rel=\"noopener nofollow\"\u003eUniversal Sentence Encoder\u003c/a\u003e is publicly available in \u003ca class=\"bu dh iw ix iy iz\" href=\"https://www.tensorflow.org/hub/\" target=\"_blank\" rel=\"noopener nofollow\"\u003eTensorflow-hub\u003c/a\u003e. It comes with two variations i.e. one trained with \u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder-large/5\" target=\"_blank\" rel=\"noopener nofollow\"\u003e\u003cstrong class=\"id iq\"\u003eTransformer encoder\u003c/strong\u003e\u003c/a\u003e and others trained with \u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder/4\" target=\"_blank\" rel=\"noopener nofollow\"\u003e\u003cstrong class=\"id iq\"\u003eDeep Averaging Network (DAN)\u003c/strong\u003e\u003c/a\u003e. They are pre-trained on a large corpus and can be used in a variety of tasks (sentimental analysis, classification and so on). The two have a trade-off of accuracy and computational resource requirement. While the one with Transformer encoder has higher accuracy, it is computationally more expensive. The one with DNA encoding is computationally less expensive and with little lower accuracy.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eHere we are using Second one DAN Universal sentence encoder as available in this URL:- \u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder/4\" target=\"_blank\" rel=\"noopener nofollow\"\u003eGoogle USE DAN Model\u003c/a\u003e\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eBoth models take a word, sentence or a paragraph as input and output a \u003cstrong class=\"id iq\"\u003e512\u003c/strong\u003e-dimensional vector.\u003c/p\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u hw\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"os y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/640/0*pGl0O-Z_Way5sT_U\" width=\"640\" height=\"189\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cfigcaption class=\"bb bp ma dx mb w t u mc md aw fa\" data-selectable-paragraph=\"\"\u003eA prototypical semantic retrieval pipeline, used for textual similarity.\u003c/figcaption\u003e\n\u003c/figure\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eBefore using the TensorFlow-hub model.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003ePrerequisite :\u003c/strong\u003e\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e!pip install --upgrade tensorflow-gpu\u003cbr /\u003e #Install TF-Hub.\u003cbr /\u003e!pip install tensorflow-hub\u003cbr /\u003e!pip install seaborn\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eNow import the packages:\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eimport pandas as pd\u003cbr /\u003eimport numpy as np\u003cbr /\u003eimport re, string\u003cbr /\u003eimport os \u003cbr /\u003eimport tensorflow as tf\u003cbr /\u003eimport tensorflow_hub as hub\u003cbr /\u003eimport matplotlib.pyplot as plt\u003cbr /\u003eimport seaborn as sns\u003cbr /\u003efrom sklearn.metrics.pairwise import linear_kernel\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eDownload the model from \u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder/4\" target=\"_blank\" rel=\"noopener nofollow\"\u003eTensorFlow-hub\u003c/a\u003e of calling direct URL:\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e! curl -L -o 4.tar.gz \"\u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder/4?tf-hub-format=compressed\" target=\"_blank\" rel=\"noopener nofollow\"\u003ehttps://tfhub.dev/google/universal-sentence-encoder/4?tf-hub-format=compressed\u003c/a\u003e\"\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003eor\u003cbr /\u003emodule_url = \"\u003ca class=\"bu dh iw ix iy iz\" href=\"https://tfhub.dev/google/universal-sentence-encoder/4\" target=\"_blank\" rel=\"noopener nofollow\"\u003ehttps://tfhub.dev/google/universal-sentence-encoder/4\u003c/a\u003e\"\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eLoad the Google DAN Universal sentence encoder\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003e#Model load through local path:\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003emodule_path =\"/home/zettadevs/GoogleUSEModel/USE_4\"\u003cbr /\u003e%time model = hub.load(module_path)\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003e#Create function for using model training\u003cbr /\u003edef embed(input):\u003cbr /\u003e    return model(input)\u003c/span\u003e\u003c/pre\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eUse Case 1:- Word semantic\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eWordMessage =[\u0026lsquo;big data\u0026rsquo;, \u0026lsquo;millions of data\u0026rsquo;, \u0026lsquo;millions of records\u0026rsquo;,\u0026rsquo;cloud computing\u0026rsquo;,\u0026rsquo;aws\u0026rsquo;,\u0026rsquo;azure\u0026rsquo;,\u0026rsquo;saas\u0026rsquo;,\u0026rsquo;bank\u0026rsquo;,\u0026rsquo;account\u0026rsquo;]\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u ot\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"ou y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/472/1*DfKpd8bPS1PA-UugtEDCFA.png\" width=\"472\" height=\"380\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eUse Case 2: Sentence Semantic\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eSentMessage =['How old are you?','what is your age?','how are you?','how you doing?']\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u ov\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"ow y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/467/1*-eUVfSMSsv8rmNoqPTmQ2w.png\" width=\"467\" height=\"375\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch2 class=\"nj hc cs ax aw ee no np nq nr ns nt nu nv nw nx ny\" data-selectable-paragraph=\"\"\u003eUse Case 3: Word, Sentence and paragram Semantic\u003c/h2\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eword ='Cloud computing'\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003eSentence = 'what is cloud computing'\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003ePara =(\"Cloud computing is the latest generation technology with a high IT infrastructure that provides us a means by which we can use and utilize the applications as utilities via the internet.\"\u003cbr /\u003e        \"Cloud computing makes IT infrastructure along with their services available 'on-need' basis.\" \u003cbr /\u003e        \"The cloud technology includes - a development platform, hard disk, computing power, software application, and database.\")\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003ePara5 =(\u003cbr /\u003e    \"Universal Sentence Encoder embeddings also support short paragraphs. \"\u003cbr /\u003e    \"There is no hard limit on how long the paragraph is. Roughly, the longer \"\u003cbr /\u003e    \"the more 'diluted' the embedding will be.\")\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003ePara6 =(\"Azure is a cloud computing platform which was launched by Microsoft in February 2010.\"\u003cbr /\u003e       \"It is an open and flexible cloud platform which helps in development, data storage, service hosting, and service management.\"\u003cbr /\u003e       \"The Azure tool hosts web applications over the internet with the help of Microsoft data centers.\")\u003cbr /\u003ecase4Message=[word,Sentence,Para,Para5,Para6]\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u ox\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"oy y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/556/1*ejcbdMkwG1nUHBMtYUyjTg.png\" width=\"556\" height=\"156\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003eTraining the model\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eHere we have trained the dataset at batch-wise because it takes a long time to execution to generate the graph of the dataset. so better to train batch-wise data.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eModel_USE= embed(df_news.content[0:2500])\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003eSave the model\u003c/strong\u003e, for reusing the model.\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eexported = tf.train.Checkpoint(v=tf.Variable(Model_USE))\u003cbr /\u003eexported.f = tf.function(\u003cbr /\u003e    lambda  x: exported.v * x,\u003cbr /\u003e    input_signature=[tf.TensorSpec(shape=None, dtype=tf.float32)])\u003c/span\u003e\u003cspan class=\"nj hc cs ax nk b bp ol om on oo op nm y nn\" data-selectable-paragraph=\"\"\u003etf.saved_model.save(exported,'/home/zettadevs/GoogleUSEModel/TrainModel')\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003eLoad the model from path:\u003c/strong\u003e\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eimported = tf.saved_model.load(\u0026lsquo;/home/zettadevs/GoogleUSEModel/TrainModel/\u0026rsquo;)\u003cbr /\u003eloadedmodel =imported.v.numpy()\u003c/span\u003e\u003c/pre\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003eFunction for document search:\u003c/strong\u003e\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003edef SearchDocument(query):\u003cbr /\u003e    q =[query]\u003cbr /\u003e    # embed the query for calcluating the similarity\u003cbr /\u003e    Q_Train =embed(q)\u003cbr /\u003e    \u003cbr /\u003e    #imported_m = tf.saved_model.load('/home/zettadevs/GoogleUSEModel/TrainModel')\u003cbr /\u003e    #loadedmodel =imported_m.v.numpy()\u003cbr /\u003e    # Calculate the Similarity\u003cbr /\u003e    linear_similarities = linear_kernel(Q_Train, con_a).flatten() \u003cbr /\u003e    #Sort top 10 index with similarity score\u003cbr /\u003e    Top_index_doc = linear_similarities.argsort()[:-11:-1]\u003cbr /\u003e    # sort by similarity score\u003cbr /\u003e    linear_similarities.sort()\u003cbr /\u003e    a = pd.DataFrame()\u003cbr /\u003e    for i,index in enumerate(Top_index_doc):\u003cbr /\u003e        a.loc[i,'index'] = str(index)\u003cbr /\u003e        a.loc[i,'File_Name'] = df_news['Subject'][index] ## Read File name with index from File_data DF\u003cbr /\u003e    for j,simScore in enumerate(linear_similarities[:-11:-1]):\u003cbr /\u003e        a.loc[j,'Score'] = simScore\u003cbr /\u003e    return a\u003c/span\u003e\u003c/pre\u003e\n\u003carticle\u003e\n\u003csection class=\"gj gk gl gm gn\"\u003e\n\u003cdiv class=\"n p\"\u003e\n\u003cdiv class=\"ai aj ak al am jw ao v\"\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003eTest the search:\u003c/strong\u003e\u003c/p\u003e\n\u003cpre class=\"lp lq lr ls lt nh ni eu\"\u003e\u003cspan class=\"nj hc cs ax nk b bp nl nm y nn\" data-selectable-paragraph=\"\"\u003eSearchDocument('computer science')\u003c/span\u003e\u003c/pre\u003e\n\u003cfigure class=\"lp lq lr ls lt gp t u paragraph-image\"\u003e\n\u003cdiv class=\"t u oz\"\u003e\n\u003cdiv class=\"gu y br gv\"\u003e\n\u003cdiv class=\"pa y\"\u003e\u003cimg class=\"is it z ab ac fu v ha fr-fic fr-dii fr-draggable\" src=\"https://miro.medium.com/max/430/1*8Btzc5dq-HlTzCaFLw_jdQ.png\" width=\"430\" height=\"211\" data-fr-image-pasted=\"true\" /\u003e\u003c/div\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/figure\u003e\n\u003ch1 class=\"hb hc cs ax aw ee ka mm kc mn mo mp mq mr ms mt hf\" data-selectable-paragraph=\"\"\u003eConclusion:\u003c/h1\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003eAt the end of this tutorial, we are concluding that \u0026ldquo;google universal sentence encoder\u0026rdquo; model is providing the semantic search result while TF-IDF model doesn\u0026rsquo;t know the meaning of the word. just giving the result based on words available on the documents.\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie mu ig mv mg mw mi mx mk my io gj\" data-selectable-paragraph=\"\"\u003e\u003ca href=\"https://medium.com/@zayedrais/build-your-semantic-document-search-engine-with-tf-idf-and-google-use-c836bf5f27fb\" target=\"_blank\" rel=\"noopener\"\u003eOriginal Post on medium\u003c/a\u003e\u003c/p\u003e\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003e\u003cstrong class=\"id iq\"\u003eSome references:\u003c/strong\u003e\u003c/p\u003e\n\u003cul class=\"\"\u003e\n\u003cli class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089\" target=\"_blank\" rel=\"noopener\"\u003eTF-IDF\u003c/a\u003e\u003c/li\u003e\n\u003cli class=\"ia ib cs ax id b ie nc ig nd mg ne mi nf mk ng io mz na nb\" data-selectable-paragraph=\"\"\u003e\u003ca class=\"bu dh iw ix iy iz\" href=\"https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html\" target=\"_blank\" rel=\"noopener nofollow\"\u003eGoogle USE\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003c/section\u003e\n\u003c/article\u003e\n\u003cdiv class=\"is gi pb ji v pi pg sf\" data-test-id=\"post-sidebar\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"ej\"\u003e\u0026nbsp;\u003c/div\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzayedrais%2Fdocumentsearchengine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzayedrais%2Fdocumentsearchengine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzayedrais%2Fdocumentsearchengine/lists"}