https://github.com/vishal815/language_predictor_ml_nlp
Click below to check out the website of this ML-NLP project
coderun data-science deep-learning githubproject huggingface language language-detection language-prediction machine-learning ml nlp nlp-machine-learning open-source streamlit textanalysis tfidf-vectorizer vishal vishal-lazrus vishallazrus
- Host: GitHub
- URL: https://github.com/vishal815/language_predictor_ml_nlp
- Owner: vishal815
- Created: 2023-04-20T11:13:02.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-08T05:15:35.000Z (about 1 year ago)
- Last Synced: 2025-01-30T20:57:05.058Z (4 months ago)
- Topics: coderun, data-science, deep-learning, githubproject, huggingface, language, language-detection, language-prediction, machine-learning, ml, nlp, nlp-machine-learning, open-source, streamlit, textanalysis, tfidf-vectorizer, vishal, vishal-lazrus, vishallazrus
- Language: Jupyter Notebook
- Homepage: https://vishal815-language-predictor-ml-nlp-app-dqjsvm.streamlit.app/
- Size: 1.98 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Language_Predictor_ML_NLP
## Streamlit
Check out the hosted website [here👉](https://vishal815-language-predictor-ml-nlp-app-dqjsvm.streamlit.app/).
## Hugging Face
Check out the hosted website [here👉](https://huggingface.co/spaces/Visal9252/Languagepredictormlnlp).
## Run the code
To run the app locally: `streamlit run app.py`
## TfidfVectorizer
`TfidfVectorizer` converts each text document into a numerical vector. A term's weight grows with how often it appears in that document (term frequency) and shrinks with the number of documents in the corpus that contain it (inverse document frequency), so terms common to every document count for less.
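
As a rough illustration of the weighting, the sketch below fits `TfidfVectorizer` with default settings on a tiny corpus and compares the inverse document frequency of a term found in every document with one found in a single document. The corpus is a made-up example, not the project's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the project's dataset.
docs = ["the cat sat", "the dog ran", "the bird flew"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = vectorizer.vocabulary_  # maps each term to its column index
idf = vectorizer.idf_           # one inverse-document-frequency value per column

# "the" occurs in every document, so its IDF is lower than that of "cat",
# which occurs in only one.
print("idf(the):", idf[vocab["the"]])
print("idf(cat):", idf[vocab["cat"]])
```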
The `ngram_range` parameter in `TfidfVectorizer` sets the smallest and largest n-gram sizes to extract. An n-gram is a contiguous sequence of n items (words or characters) from a given sample of text or speech. The default, `ngram_range=(1,1)`, uses unigrams only; specifying `ngram_range=(1,2)` means that both unigrams and bigrams will be considered.
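
To make the effect of `ngram_range` concrete, this small sketch compares the vocabulary learned with `(1, 1)` against `(1, 2)` at the default word level. The sentence is an invented example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["machine learning is fun"]  # hypothetical example sentence

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(text)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(text)

# vocabulary_ maps each extracted n-gram to a column index.
print(sorted(unigrams.vocabulary_))  # single words only
print(sorted(uni_bi.vocabulary_))    # single words plus adjacent word pairs
```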
The `analyzer` parameter in `TfidfVectorizer` sets the unit from which n-grams are built. The default, `analyzer='word'`, builds them from words. By setting `analyzer='char'`, the vectorizer will generate character-level n-grams instead of word-level n-grams.
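
The difference between the two analyzers can be seen on a one-word document. This is a minimal sketch with an invented input, leaving `ngram_range` at its default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["hola"]  # hypothetical one-word document

word_level = TfidfVectorizer(analyzer="word").fit(text)
char_level = TfidfVectorizer(analyzer="char").fit(text)

print(sorted(word_level.vocabulary_))  # the whole word is one feature
print(sorted(char_level.vocabulary_))  # each distinct character is a feature
```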
`TfidfVectorizer` is provided by the `feature_extraction.text` module of the `scikit-learn` library. By specifying `ngram_range=(1,2)` together with `analyzer='char'`, the features become TF-IDF weights over single characters and pairs of adjacent characters.
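
One way these settings might fit into a language predictor is sketched below. The training sentences are invented, and `LogisticRegression` is an assumption made for illustration; the text above does not say which classifier the project uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data; the project's real dataset is not shown here.
texts = [
    "good morning, how are you today",
    "thank you very much for your help",
    "where is the train station",
    "buenos días, cómo estás hoy",
    "muchas gracias por tu ayuda",
    "dónde está la estación de tren",
]
labels = ["English", "English", "English", "Spanish", "Spanish", "Spanish"]

# Character unigrams and bigrams feed a classifier.
# LogisticRegression is an assumption; the source does not name the model.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), analyzer="char"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["gracias por todo"]))
```

With only six sentences the prediction is not reliable; the point is the shape of the pipeline, where the vectorizer and classifier are fitted and applied together.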