{"id":19224445,"url":"https://github.com/hermann-web/search-engine-with-python-nlp","last_synced_at":"2026-05-02T18:33:14.539Z","repository":{"id":113472103,"uuid":"391651575","full_name":"Hermann-web/Search-engine-with-python-nlp","owner":"Hermann-web","description":"A python search engine build with NLP methods for a django project","archived":false,"fork":false,"pushed_at":"2021-09-14T18:49:53.000Z","size":1971,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-04T21:19:57.287Z","etag":null,"topics":["cosine-similarity","document-searching","natural-language-processing","nlp","nltk","pandas","python","scikit-learn","search-engine","semantic-similarity","similarity-score","similarity-search"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hermann-web.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-01T14:35:45.000Z","updated_at":"2022-09-23T13:06:39.000Z","dependencies_parsed_at":"2023-06-28T23:01:02.614Z","dependency_job_id":null,"html_url":"https://github.com/Hermann-web/Search-engine-with-python-nlp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hermann-web%2FSearch-engine-with-python-nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hermann-web%2FSearch-engine-with-python-nlp/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/Hermann-web%2FSearch-engine-with-python-nlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hermann-web%2FSearch-engine-with-python-nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hermann-web","download_url":"https://codeload.github.com/Hermann-web/Search-engine-with-python-nlp/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240298487,"owners_count":19779283,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-similarity","document-searching","natural-language-processing","nlp","nltk","pandas","python","scikit-learn","search-engine","semantic-similarity","similarity-score","similarity-search"],"created_at":"2024-11-09T15:11:42.842Z","updated_at":"2025-10-12T14:02:08.468Z","avatar_url":"https://github.com/Hermann-web.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SearchEngine\n\n\u003cp class=\"ia ib cs ax id b ie me ig mf mg mh mi mj mk ml io gj\" data-selectable-paragraph=\"\"\u003eIn this post, we will be building a \u003cstrong class=\"id iq\"\u003esemantic documents search engine\u003c/strong\u003e\u003c/p\u003e\n\n##Prerequistes\n*   Python \u003e=3.7\n*   NLTK\n*   Pandas\n*   Scikit-learn\n\n##Prerequistes\n```\nimport re, json\nimport unicodedata, string\nimport time\nimport operator\nimport numpy as np \nimport pandas as pd\nfrom collections import Counter\n```\n```\nfrom collections import defaultdict\nimport nltk \nfrom nltk.tokenize import 
word_tokenize\nfrom nltk import pos_tag\nfrom nltk.corpus import stopwords\nfrom nltk.stem import WordNetLemmatizer\nfrom nltk.tokenize import word_tokenize\nfrom nltk.corpus import wordnet as wn\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n```\n```\nnltk.download('punkt')\nnltk.download('wordnet')\nnltk.download('averaged_perceptron_tagger')\nnltk.download('stopwords')\n```\n\n##data\nFiles used in the notebook are stored in the folder data\n\n\n# **1: Créer les keywords à partir d'une phrase en se basant sur les mots d'un dictionnaire et un corpus de texte en passant par la tokenization, la correction, la lemmatization et le removeStopWords**\n\n\n\n\n--- \n##preprocessing\n--- \n```\ndef get_dico():\n    textdir = \"liste.de.mots.francais.frgut_.txt\"\n    try:DICO = open(textdir,'r',encoding=\"utf-8\").read()\n    except: DICO = open(textdir,'r').read()\n    \n    return DICO\n\n\ndef remove_accents(input_str):\n    \"\"\"This method removes all diacritic marks from the given string\"\"\"\n    norm_txt = unicodedata.normalize('NFD', input_str)\n    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))\n    return unicodedata.normalize('NFC', shaved)\n\ndef clean_sentence(texte):\n    # Replace diacritics\n    texte = remove_accents(texte)\n    # Lowercase the document\n    texte = texte.lower()\n    # Remove Mentions\n    texte = re.sub(r'@\\w+', '', texte)\n    # Remove punctuations\n    texte = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', texte)\n    # Remove the doubled space\n    texte = re.sub(r'\\s{2,}', ' ', texte)\n    #remove whitespaces at the beginning and the end\n    texte = texte.strip()\n    \n    return texte\n\n\ndef tokenize_sentence(texte):\n        #clean the sentence \n    texte = clean_sentence(texte)\n        #tokenize \n    liste_words = texte.split()\n        #return \n    return liste_words\n\ndef strip_apostrophe(liste_words):\n    get_radical = lambda word: word.split('\\'')[-1]\n    return 
list(map(get_radical,liste_words))\n\ndef pre_process(sentence):\n    #remove '_' from the sentence \n    sentence = sentence.replace('_','')\n    \n    #get words from the sentence \n    liste_words = tokenize_sentence(sentence)\n    #drop words of 1 or 2 letters \n    liste_words = [elt for elt in liste_words if len(elt)\u003e2]\n    #keep the part of the word that follows the apostrophe\n    liste_words = strip_apostrophe(liste_words)\n    print('\\nsentence to words : ',liste_words)\n    return liste_words\n```\n--- \n## Word correction\n--- \n\n```\ndef edits1(word):\n    \"All edits that are one edit away from `word`.\"\n    letters    = 'abcdefghijklmnopqrstuvwxyz'\n    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]\n    deletes    = [L + R[1:]               for L, R in splits if R]\n    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)\u003e1]\n    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]\n    inserts    = [L + c + R               for L, R in splits for c in letters]\n    return set(deletes + transposes + replaces + inserts)\n\ndef edits2(word): \n    \"All edits that are two edits away from `word`.\"\n    return (e2 for e1 in edits1(word) for e2 in edits1(e1))\n\ndef known(words): \n    \"The subset of `words` that appear in the dictionary of WORDS.\"\n    return set(w for w in words if w in WORDS)\n    \ndef candidates(word): \n    \"Generate possible spelling corrections for word.\"\n    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])\n\n\n\n\ndef DICO_ET_CORRECTEUR():\n    \"Return the dictionary word counts and the spelling-correction function.\"\n    DICO = get_dico()\n    WORDS = Counter(pre_process(DICO)) #Counter takes the list of words and returns a dict-like object mapping each word to its count\n    #word correction\n    N = sum(WORDS.values())\n    P = lambda word: WORDS[word] / N #\"Probability of `word`.\"\n    \n    correction = lambda word: max(candidates(word), key=P) #\"Most 
probable\n    return WORDS,correction\n\nWORDS,CORRECTION = DICO_ET_CORRECTEUR()\n```\n--- \n##stopwords et stemming(premier exemple)\n--- \n\n```\n\n##stopwords #//https://www.ranks.nl/stopwords/french\nwith open('stp_words_.txt','r') as f:\n    STOPWORDS = f.read()\n\n##bdd de stemmer\nwith open(\"sample_.json\",'r',encoding='cp1252') as json_file:\n    #json_file.seek(0)\n    LISTE = json.load(json_file)\nmy_stemmer = lambda word: LISTE[word] if word in LISTE else word\n```\n\n---\n##fonction: SENTENCE_TO_CORRECT_WORDS\n--- \n```\ndef SENTENCE_TO_CORRECT_WORDS(sentence):\n    \"cette fonction retourne la liste des mots du user\"\n    print('\\n------------pre_process--------\\n')\n    liste_words = pre_process(sentence)\n    print(liste_words)\n    print('\\n------------correction--------\\n')\n    liste_words = list(map(CORRECTION,liste_words))\n    print(liste_words)\n    print('\\n------------stemming--------\\n')\n    liste_words = list(map(my_stemmer,liste_words))\n    print(liste_words)\n    print('\\n------------remove stop-words--------\\n')\n    liste_words = [elt for elt in liste_words if elt not in STOPWORDS]\n    print(liste_words)\n    print('\\n-------------------------------------\\n')\n    return liste_words\n```\n\n\n\n\n---\n##Test: SENTENCE_TO_CORRECT_WORDS\n---\n\n```\nSENTENCE_TO_CORRECT_WORDS('La PR reste au statut «\\xa0Approuve(e)\\xa0» et il n’y a pas de commande\\\"\\'')\n```\n\n\n\n---\n##Output\n---\n\n```\n------------pre_process--------\n['reste', 'statut', 'approuve', 'n’y', 'pas', 'commande']\n\n------------correction--------\n['reste', 'statut', 'approuve', 'non', 'pas', 'commande']\n\n------------stemming--------\n['rester', 'statut', 'approuver', 'non', 'pas', 'commander']\n\n------------remove stop-words--------\n['rester', 'statut', 'approuver', 'commander']\n\n-------------------------------------\n['rester', 'statut', 'approuver', 'commander']\n\n```\n\n\n\n\n\n\n\n\n---\n##**Create dataset**\n---\n\n```\ndef 
open_file(textdir):\n  found = False\n  try:texte = open(textdir,'r',encoding=\"utf-8\").read();found=True\n  except:pass\n  try: texte = open(textdir,'r').read();found=True \n  except: pass\n  if not found:\n    texte = open(textdir,'r',encoding='cp1252').read();found=True\n  return  texte\ndef add_col(df_news,titre,keywords):\n  return df_news.append(dict(zip(df_news.columns,[titre, keywords])), ignore_index=True)\n\nliste_pb = [elt for elt in open_file('liste_pb_.txt').split('\\n') if elt]\ndf_new = df_news.drop(df_news.index)\nfor i,titre in enumerate(liste_pb):\n  keywords = ','.join(SENTENCE_TO_CORRECT_WORDS(titre))\n  df_new = add_col(df_new,titre,keywords)\ndf_new.head()\n\n```\n---\n##Output\n---\n```\n \t                   Subject \t                                  Clean_Keyword\n0 \tMessage d'erreur : \"Le fournisseur ARIBA n'exi... \tmessage,erreur,fournisseur,aria,exister\n1 \tMessage d'erreur : \"Commande d’article non aut... \tmessage,erreur,commander,article,autoriser,oto\n2 \tMessage d'erreur : \"Statut utilisateur FERM ac... \tmessage,erreur,statut,utilisateur,actif,oto\n3 \tMessage d'erreur : \"Statut systeme TCLO actif ... \tmessage,erreur,statut,systeme,col,actif,nord\n4 \tMessage d'erreur \"___ Cost center change could... \tmessage,erreur,coat,centrer,changer,cold,affecter\n5 \tMessaeg d'erreur \"___ OTP change could not be ... \tmessage,erreur,otp,changer,cold,affecter\n6 \tMessaeg d'erreur \"Entrez Centre de couts\" \t        message,erreur,entrer,centrer,cout\n7 \tMessage d'erreur \"Indiquez une seule imputatio... \tmessage,erreur,indiquer,imputation,statistique\n8 \tMessage d'erreur \"Imputations CO ont des centr... \tmessage,erreur,imputation,centrer,profit\n9 \tMessage d'erreur \"Poste ___ Ordre ___ depassem... \tmessage,erreur,poster,ordre,depassement,budget\n10 \tMessage d'erreur \"Entrez une quantite de comma... 
\tmessage,erreur,entrer,quantite,commander\n11 \tMessage d'erreur \"Indiquez la quantite\" \t        message,erreur,indiquer,quantite\n12 \tMessage d'erreur \"Le prix net doit etre superi... \tmessage,erreur,prix,net,superieur\n... \t... \t...\n... \t... \t...\n... \t... \t...\n57 \tUO4-5 Commande | Envoi d'une commande manuelle \tuo4,commander,envoi,commander,manuel\n58 \tUO5-4 Reception | Anomalie workflow \tuo5,reception,anomalie,workflow\n59 \tUO5-1 Reception | Modification(s) de reception(s) \tuo5,reception,modification,reception\n60 \tUO5-2 Reception | Annulation(s) de reception(s) \tuo5,reception,annulation,reception\n61 \tUO5-3 Reception | Forcer la reception \tuo5,reception,forcer,reception\n62 \tUO3-5 Demande d'achat | Demande de support cre... \tuo3,demander,achat,demander,support,creation\n63 \tUO3-6 Demande d'achat | Demande de support mod... \tuo3,demander,achat,demander,support,modification\n64 \tUO3-7 Demande d'achat | Demande de support ann... \tuo3,demander,achat,demander,support,annulation\n65 \tUO4-2 Commande | Demande de support modificati... \tuo4,commander,demander,support,modification,co...\n```\n\n\n\n---\n##tokenize and stemming(second exemple)\n---\n\n```\n# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. 
By default it is set to Noun\ndef wordLemmatizer(data,colname):\n    tag_map = defaultdict(lambda : wn.NOUN)\n    tag_map['J'] = wn.ADJ\n    tag_map['V'] = wn.VERB\n    tag_map['R'] = wn.ADV\n    file_clean_k =pd.DataFrame()\n    for index,entry in enumerate(data):\n        \n        # Declaring Empty List to store the words that follow the rules for this step\n        Final_words = []\n        # Initializing WordNetLemmatizer()\n        word_Lemmatized = WordNetLemmatizer()\n        # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.\n        for word, tag in pos_tag(entry):\n            # Below condition is to check for Stop words and consider only alphabets\n            if len(word)\u003e1 and word not in stopwords.words('french') and word.isalpha():\n                word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])\n                Final_words.append(word_Final)\n            # The final processed set of words for each iteration will be stored in 'text_final'\n                file_clean_k.loc[index,colname] = str(Final_words)\n                file_clean_k.loc[index,colname] = str(Final_words)\n                file_clean_k=file_clean_k.replace(to_replace =\"\\[.\", value = '', regex = True)\n                file_clean_k=file_clean_k.replace(to_replace =\"'\", value = '', regex = True)\n                file_clean_k=file_clean_k.replace(to_replace =\" \", value = '', regex = True)\n                file_clean_k=file_clean_k.replace(to_replace ='\\]', value = '', regex = True)\n\n    return file_clean_k\n\n\ndef wordLemmatizer_(sentence):\n    #prendre une phrase que retourner un str (les mots sont separes par des ,)\n    preprocessed_query = preprocessed_query = re.sub(\"\\W+\", \" \", sentence).strip()\n    tokens = word_tokenize(str(preprocessed_query))\n    q_df = pd.DataFrame(columns=['q_clean'])\n    idx = 0\n    colname = 'keyword_final'\n    q_df.loc[idx,'q_clean'] =tokens\n    
print('\\n\\n---inputtoken');print(q_df.q_clean)\n    lemmas = wordLemmatizer(q_df.q_clean,colname).loc[idx,colname]\n    print('\\n\\n---outputlemma');print(lemmas)\n    return lemmas\n\n```\n\n\n\n\n# **2: Find the best-matching sentence in a list of sentences**\n---\n## method: TF-IDF\n---\nTF-IDF stands for Term Frequency - Inverse Document Frequency.\n\nTo compare the user input with the sentences stored in the database, we go through two steps:\n- Normalize the database: apply the pre-processing method to every sentence in the database. We then have, for each sentence, a list of keywords.\n- For each keyword wd and each sentence stc, compute the tfidf value from:\n    - $frqc(wd,stc)$: number of occurrences of the keyword wd in the sentence stc\n    - $doc\\_frqc(wd)$: number of sentences in which the keyword wd appears\n    - $N$: number of sentences\n\n```\ntf(wd,stc)    = frqc(wd,stc) / (sum of frqc(wd,s) over all sentences s)\nidf(wd)       = log(N / doc_frqc(wd))\ntfidf(wd,stc) = tf(wd,stc) * idf(wd)\n```\n\n\n## Example\n- st1: The computer is down\n- st2: We need to change the computers\n- st3: Changes have to be handled by the IT\n\n#### keywords per sentence\n```\nst1: [computer, down]\nst2: [need, change, computer]\nst3: [change, handle, IT]\n```\n#### vocabulary: [computer, down, need, change, handle, IT]\n\n| tf | sentence1 | sentence2 | sentence3 |\n| --- | --- | --- | --- |\n| computer | 1/2 | 1/2 | 0 |\n| down | 1 | 0 | 0 |\n| need | 0 | 1 | 0 |\n| change | 0 | 1/2 | 1/2 |\n| 
handle| 0 | 0 | 1 |\n| IT| 0 | 0 | 1 |\n\n#### idf values for the keywords\nN = number_of_sentences = 3\n\n| keyword | idf |\n| --- | --- |\n| computer | log(3/2) |\n| down | log(3/1) |\n| need | log(3/1) |\n| change | log(3/2) |\n| handle | log(3/1) |\n| IT | log(3/1) |\n\n#### example for sentence 2: computing the tfidf values of the keywords\n```\ntfidf('computer') = tf('computer', sentence2) * idf('computer') = 1/2 * log(3/2)\ntfidf('down')     = 0   * log(3/1) = 0\ntfidf('need')     = 1   * log(3/1)\ntfidf('change')   = 1/2 * log(3/2)\ntfidf('handle')   = 0   * log(3/1) = 0\ntfidf('IT')       = 0   * log(3/1) = 0\n```\n\n#### vectorisation of sentence 2\n```\nsentence2 \u003c==\u003e [0.5 * log(3/2), 0, log(3/1), 0.5 * log(3/2), 0, 
0]\n```\n\n#### vectorisation of the sentences\n```\nsentence1 \u003c==\u003e [0.5 * log(3/2), log(3/1), 0, 0, 0, 0]\nsentence2 \u003c==\u003e [0.5 * log(3/2), 0, log(3/1), 0.5 * log(3/2), 0, 0]\nsentence3 \u003c==\u003e [0, 0, 0, 0.5 * log(3/2), log(3/1), log(3/1)]\n```\n\n\n#### similarities between the user input and the sentences\n- user input: The IT have replaced all of the computers\n- keywords: ['IT', 'all', 'computer']\n- keywords found in the vocabulary: ['IT', 'computer']\n- vectorization: [1, 0, 0, 0, 0, 1]\n\n#### scores\nThe score of a sentence is the dot product of its tfidf vector with the query vector.\n```\nsentence1: score = tfidf(sentence1) * vector\n                 = [0.5 * log(3/2), log(3/1), 0, 0, 0, 0] * [1, 0, 0, 0, 0, 1]\n                 = 0.5 * log(3/2)\n\nsentence1: 0.5 * log(3/2)\nsentence2: 0.5 * log(3/2)\nsentence3: log(3/1) 
\n```\n\n\n---\n## function: cosine_similarity_T\n---\n\n\n```\ndef init(df_news):\n  # Create the vocabulary from the comma-separated keywords of every document\n  vocabulary = set()\n  for doc in df_news.Clean_Keyword:\n      vocabulary.update(doc.split(','))\n  vocabulary = list(vocabulary)\n  # Initialize the TF-IDF model on this fixed vocabulary\n  tfidf = TfidfVectorizer(vocabulary=vocabulary)\n  # Fit the TF-IDF model\n  tfidf.fit(df_news.Clean_Keyword)\n  # Transform the documents into TF-IDF vectors\n  tfidf_tran=tfidf.transform(df_news.Clean_Keyword)\n  # Share the fitted objects with the other functions through the module globals\n  globals()['vocabulary'],globals()['tfidf'],globals()['tfidf_tran'] = vocabulary,tfidf,tfidf_tran\n\n\n```\n---\n## Create a vector for Query/search keywords\n---\n\n```\ndef gen_vector_T(tokens,df_news,vocabulary,tfidf,tfidf_tran):\n  Q = np.zeros((len(vocabulary)))    \n  x= tfidf.transform(tokens)\n  for token in tokens[0].split(','):\n      try:\n          ind = vocabulary.index(token)\n          Q[ind]  = x[0, tfidf.vocabulary_[token]]\n          print(token,':',ind)\n      except (ValueError, KeyError):\n          # the token is not part of the vocabulary\n          print(token,':','not found')\n  return Q\n```\n---\n## Cosine Similarity function\n---\n\n```\ndef cosine_sim(a, b):\n    # negative sentinel values flag zero vectors, for which the cosine is undefined\n    if not np.linalg.norm(a) and not np.linalg.norm(b): return -3\n    if not np.linalg.norm(a):return -1\n    if not np.linalg.norm(b):return -2\n    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))\n    return cos_sim   \n\ndef cosine_similarity_T(k, query,df_news,vocabulary=None,tfidf=None,tfidf_tran=None,mine=True):\n    # build the model on first use, then read the fitted objects from the module globals\n    if 'tfidf_tran' not in globals():\n      init(df_news)\n    vocabulary = globals()['vocabulary']\n    tfidf = globals()['tfidf']\n    tfidf_tran = globals()['tfidf_tran']\n    q_df = pd.DataFrame(columns=['q_clean'])\n    if mine:q_df.loc[0,'q_clean'] =','.join(SENTENCE_TO_CORRECT_WORDS(query))\n    else:q_df.loc[0,'q_clean'] = wordLemmatizer_(query)\n    \n    \n    print('\\n\\n---q_df');print(q_df)\n    \n    print('\\n\\n')\n    d_cosines = []\n    query_vector = 
gen_vector_T(q_df['q_clean'],df_news,vocabulary,tfidf,tfidf_tran )\n    for d in tfidf_tran.toarray():\n        d_cosines.append(cosine_sim(query_vector, d ))\n                    \n    # indices of the k highest scores, best first\n    out = np.array(d_cosines).argsort()[-k:][::-1]\n    a = pd.DataFrame()\n    for i,index in enumerate(out):\n        a.loc[i,'index'] = str(index)\n        a.loc[i,'Subject'] = df_news['Subject'][index]\n        a.loc[i,'Score'] = d_cosines[index]\n    return a\n```\n\n\n\n---\n## Test: cosine_similarity_T\n---\n\n```\ndef test(data,sentence,init_=False,mine=True):\n  if not init_:\n    deb = time.time();print('\\n\\n###########')\n    init(data)\n    print('\\n###########init time: ', time.time()-deb)\n  deb = time.time();print('\\n\\n###########')\n  print(cosine_similarity_T(10, sentence,data,mine=mine))\n  print('\\n###########method 1 time: ', time.time()-deb)\n# alternative test queries: only the last assignment is used\nsentence = 'Message d\\'erreur \\\"La qte livree est differente de la qte facturee ; fonction impossible\"'\nsentence = 'erreur de conversion'\nsentence = 'message d\\'erreur'\nsentence = \"groupe d'acheteurs non défini\"\nsentence = \"UO4\"\nsentence = \"le fournisseur MDM n'existe pas\"\ninit(df_new) \n\ncosine_similarity_T(10,sentence,df_new )\n```\n\n\n\n---\n## Output\n---\n\n\n```\n------------pre_process--------\n['fournisseur', 'mdm', 'existe', 'pas']\n\n------------correction--------\n['fournisseur', 'mdm', 'existe', 'pas']\n\n------------stemming--------\n['fournisseur', 'mdm', 'exister', 'pas']\n\n------------remove stop-words--------\n['fournisseur', 'mdm', 'exister']\n\n-------------------------------------\n\n       index \t                 Subject \t                         Score\n0 \t19 \tMessage d'erreur \"Le fournisseur MDM___ n’exis... \t0.781490\n1 \t0 \tMessage d'erreur : \"Le fournisseur ARIBA n'exi... \t0.600296\n2 \t20 \tMessage d'erreur \"Le fournisseur MDM___ est bl... 
\t0.587467\n3 \t14 \tMessage d'erreur \"Le centre de profit __ n'exi... \t0.236420\n4 \t33 \tMessage d'erreur \"Il existe des factures pour ... \t0.214371\n5 \t53 \tMessage d'erreur \"Fournisseur non present dans... \t0.142208\n6 \t18 \tMessage d'erreur \"Validation ___ : le compte _... \t0.000000\n7 \t30 \tMessage d'erreur \"Renseigner correctement le d... \t0.000000\n8 \t29 \tMessage d'erreur \"Article ___ non gere dans la... \t0.000000\n9 \t28 \tMessage d'erreur \"Fonctions oblig. Suivantes n... \t0.000000\n... \t... \t...\n```\n\n\n\n\n```\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhermann-web%2Fsearch-engine-with-python-nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhermann-web%2Fsearch-engine-with-python-nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhermann-web%2Fsearch-engine-with-python-nlp/lists"}