{"id":18729286,"url":"https://github.com/faezeh-gholamrezaie/vectorization-techniques-tutorial","last_synced_at":"2025-10-16T22:05:16.723Z","repository":{"id":226470118,"uuid":"768770132","full_name":"faezeh-gholamrezaie/Vectorization-Techniques-tutorial","owner":"faezeh-gholamrezaie","description":"Vectorization Techniques in Natural Language Processing Tutorial for Deep Learning Researchers","archived":false,"fork":false,"pushed_at":"2024-03-16T09:30:07.000Z","size":1921,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-19T21:07:56.085Z","etag":null,"topics":["bert-embeddings","bow","cbow","doc2vec","doc2vec-word2vec","fasttext-embeddings","glove-embeddings","infersent","nlp","sentence-bert","sentence-embeddings","skipgram","text-analysis","text-classification","tf-idf","universal-sentence-encoder","vectorization","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/faezeh-gholamrezaie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-07T17:44:10.000Z","updated_at":"2024-05-10T16:43:07.000Z","dependencies_parsed_at":"2024-03-07T19:26:17.695Z","dependency_job_id":"446ebadf-829e-47d8-ba79-1d2451a8469b","html_url":"https://github.com/faezeh-gholamrezaie/Vectorization-Techniques-tutorial","commit_stats":null,"previous_names":["faezeh-gholamrezaie/different-techniques-for-sentence-semantic-similarity-in-nlp","faezeh-gholamrezaie/vectorization-techniques-in-nlp"],"tags_count":0,"template":false,"template_full_name"
:null,"purl":"pkg:github/faezeh-gholamrezaie/Vectorization-Techniques-tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faezeh-gholamrezaie%2FVectorization-Techniques-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faezeh-gholamrezaie%2FVectorization-Techniques-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faezeh-gholamrezaie%2FVectorization-Techniques-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faezeh-gholamrezaie%2FVectorization-Techniques-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/faezeh-gholamrezaie","download_url":"https://codeload.github.com/faezeh-gholamrezaie/Vectorization-Techniques-tutorial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/faezeh-gholamrezaie%2FVectorization-Techniques-tutorial/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279246646,"owners_count":26133500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-16T02:00:06.019Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-embeddings","bow","cbow","doc2vec","doc2vec-word2vec","fasttext-embeddings","glove-embeddings","infersent","nlp","sentence-bert","sentence-embeddings","skipgram","text-
analysis","text-classification","tf-idf","universal-sentence-encoder","vectorization","word2vec"],"created_at":"2024-11-07T14:26:35.215Z","updated_at":"2025-10-16T22:05:16.686Z","avatar_url":"https://github.com/faezeh-gholamrezaie.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Vectorization-Techniques-tutorial\nThis repository applies various text-embedding techniques to the problem of predicting the topic of paper abstracts drawn from two branches, \"Machine Learning in Statistics\" and \"Applied Statistics\". Each abstract comes with its category label. The task is to train a simple classifier that predicts the topic of an article from its abstract.\n\nThis repository provides a tutorial for those interested in learning about vectorization methods in Natural Language Processing (NLP) using PyTorch and TensorFlow. Most of the models in this tutorial are implemented in fewer than 100 lines of code (excluding comments and blank lines).\n\nThis code provides a solution to the \"Which Statistic?\" question in the LLM Hackathon Qualifier. 
\nProblem page: [Quera](https://quera.org/problemset/220643)\n\n# Audience:\n\n- NLP students and researchers\n- Software engineers interested in learning about vectorization methods in NLP\n- Anyone interested in deep learning and natural language processing\n\n# Getting Started:\n\nTo get started, follow these steps:\n\nClone the repository with the command `git clone https://github.com/faezeh-gholamrezaie/Vectorization-Techniques-in-NLP.git`.\n\nActivate your Python virtual environment.\n\nInstall the requirements with the command `pip install -r requirements.txt`.\n\nOpen a Jupyter Notebook in the root directory of the repository.\n\nRun the tutorials and examples in sequence.\n\n# Content:\n\n- Tutorials on various common vectorization methods in NLP\n- Implementations of different NLP models using PyTorch and TensorFlow\n- Exercises and examples for better understanding of the concepts\n\nSections:\n\n- [Load Data](#load-data)\n- [Preprocess Data](#preprocess-data)\n- [Text to Features](#text-to-features)\n- [Text vectorization](#text-vectorization)\n- [Ensemble Models](#ensemble-models)\n\n## Load Data\n\nThe dataset used in this project is available at https://quera.org/contest/assignments/4367/download_problem_initial_project/220643/.\nThe dataset consists of paper abstracts from two different branches of statistics: \"Statistics in Machine Learning\" and \"Applied Statistics\". Each abstract is labeled with its branch.\nA simple classifier has been trained to predict the topic of a paper given its abstract.\n\n## Preprocess Data\n\n- Removing extra white space from text:\n\nAll extra white space (more than one space) was removed from the text.\n- Removing all special characters from the text:\n\nAll special characters such as periods, commas, question marks, exclamation points, parentheses, etc. were removed from the text.\n- Removing all single characters from the text:\n\nAll single characters such as \"a\", \"b\", \"c\", etc. 
were removed from the text.\n- Converting text to lower case:\n\nAll letters in the text were converted to lower case.\n- Word tokenization:\n\nThe text was split into separate word tokens.\n- Lemmatization:\n\nAll words were reduced to their lemma (dictionary form).\n- Removing stop words from the text:\n\nStop words such as \"the\", \"is\", \"of\", etc. were removed from the text.\n- Removing words with a length less than 3 from the text:\n\nAll words shorter than 3 characters were removed from the text.\n\nThe purpose of these preprocessing steps is to improve the performance of the classification model by removing noise and unnecessary information from the text.\n\nNotes on preprocessing:\n\nThe preprocessing steps should be selected based on the type of data and the model being used.\n\nThere is no \"one size fits all\" approach to preprocessing.\n\nIt is important to carefully examine the data before selecting the preprocessing steps.\n\n## Text to Features\n\nFeatures are often referred to as independent variables or predictors in machine learning.\n\nFeature Engineering:\n\nIn machine learning, features are carefully selected or created from raw data to be most informative for the task at hand. This process is called feature engineering. In text classification, relevant features might include word frequency, document length, or the presence of specific keywords.\n\n## Text vectorization\n\nText data is inherently different from the numerical data that machine learning algorithms typically operate on. Text vectorization techniques aim to bridge this gap by transforming textual data into numerical vectors. These vectors represent the characteristics of the text in a way that can be understood and processed by machine learning models.\n\nThere are various techniques for text vectorization, each with its own advantages and limitations. 
The table below compares the methods used in this project:\n\n![various techniques for text vectorization](https://github.com/faezeh-gholamrezaie/Vectorization-Techniques-in-NLP/blob/main/various%20techniques%20for%20text%20vectorization.png)\n\n## Ensemble Models\n\nWe will use three different regression models:\n\nLinear Regression: A simple linear model that fits a straight line to the data.\n\nXGBoost: A more complex model that builds a gradient-boosted ensemble of decision trees.\n\nRandom Forest: A model that combines the predictions of a large number of decision trees.\n\nWe will train each of these models on the training data and then average their predictions to make a final prediction. This is known as ensemble averaging and can help to improve the accuracy of our predictions.\n\n# Resources:\n\nStanford AI Lab - CS224n: Natural Language Processing with Deep Learning: https://nlp.stanford.edu/\n\nDeep Learning for NLP book: https://www.manning.com/books/deep-learning-for-natural-language-processing\n\nPyTorch documentation: https://pytorch.org/docs/\n\nTensorFlow documentation: https://www.tensorflow.org/api_docs\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaezeh-gholamrezaie%2Fvectorization-techniques-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffaezeh-gholamrezaie%2Fvectorization-techniques-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffaezeh-gholamrezaie%2Fvectorization-techniques-tutorial/lists"}