Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/madhups1992/ceegle-search
Finding the google search results which shows conspiracy content - NLP, Webscrapper, Flask app, KNN(From Scratch)
end-to-end-machine-learning flask flask-application from-scratch google-search-using-python html-css-javascript k-nearest-neighbours knn-classifier named-entity-recognition nlp python selenium-python tweeter-api ui webscraping youtube-search
Last synced: 28 days ago
- Host: GitHub
- URL: https://github.com/madhups1992/ceegle-search
- Owner: madhups1992
- License: lgpl-3.0
- Created: 2019-11-03T21:33:26.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-20T14:17:37.000Z (almost 5 years ago)
- Last Synced: 2024-10-11T20:02:43.753Z (28 days ago)
- Topics: end-to-end-machine-learning, flask, flask-application, from-scratch, google-search-using-python, html-css-javascript, k-nearest-neighbours, knn-classifier, named-entity-recognition, nlp, python, selenium-python, tweeter-api, ui, webscraping, youtube-search
- Language: JavaScript
- Homepage:
- Size: 3.92 MB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Ceegle-search
Finding the Google search results that show conspiracy content - NLP, web scraper, Flask app

### Objective: Highlight the conspiracy content in Google search results.
### Steps:
1) Using a web scraper, extract the title and URL of each result from 4 pages of Google search.
2) Extract information from Twitter for the search key combined with #conspiracy. [Positive labels]
3) Extract information from Twitter for the search key alone. [Negative labels]
4) The data from steps 2 and 3 is used as the training set.
5) The extracted text then has to be cleaned. This is the NLP part:
- Removing symbols, single characters, numbers, etc.
- Removing stop words.
- Running a POS (part-of-speech) tagger and keeping only the relevant nouns, verbs, and adjectives.
- Lemmatizing based on the POS tag (e.g. going, go => go).
- Unwanted words that appear only rarely always remain, so the top 30 words by frequency from both the positive and negative sets are chosen.
- The other words are removed.
- The result is converted to TF-IDF (term frequency-inverse document frequency) vectors.
6) This was converted into a dataframe to assess the frequency of each word.
7) The KNN (k-nearest neighbors) classification algorithm was implemented from scratch. The distance metric is Euclidean distance with k=7. Compared to the random forest algorithm, this seemed to work well for the prediction.
8) All the cleaning steps applied to the training set are repeated for the Google search results.
9) The KNN algorithm is then used to classify the search results. The application was captured as a GIF file for easy understanding.

#### Below is a preview of the application with search key = "pope". It identifies conspiracy content in the Google search results.
![](ceegleSearch.gif)
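
The cleaning part of step 5 could look roughly like the sketch below. It is not the repository's code: the names `STOP_WORDS`, `clean_text`, `top_words`, and `build_vocabulary` are hypothetical, the stop-word list is deliberately tiny, and POS tagging and lemmatization are left out because they need an NLP library. It also assumes "top 30" means 30 words from each label set, which the steps do not state explicitly.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (assumption, not from the repo).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in",
              "and", "on", "for", "that", "this", "it"}


def clean_text(text):
    """Lowercase, then drop URLs, symbols, numbers, single characters, stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)  # symbols and digits become spaces
    return [tok for tok in text.split()
            if len(tok) > 1 and tok not in STOP_WORDS]


def top_words(token_lists, n=30):
    """Return the n most frequent tokens across all token lists."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(n)]


def build_vocabulary(positive_docs, negative_docs, n=30):
    """Union of the top-n words of each label set (assumed reading of 'top 30')."""
    positive = top_words([clean_text(doc) for doc in positive_docs], n)
    negative = top_words([clean_text(doc) for doc in negative_docs], n)
    return sorted(set(positive) | set(negative))
```

Every token outside the returned vocabulary would then be discarded before vectorizing, which corresponds to "the other words are removed" in step 5.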
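
For the TF-IDF conversion at the end of step 5, a minimal pure-Python sketch follows. The function name `tfidf_matrix` is hypothetical, and the weighting (length-normalized term frequency times a smoothed IDF) is one common variant chosen for illustration; the README does not say which formula or library the project uses.

```python
import math


def tfidf_matrix(token_lists, vocabulary):
    """Return one TF-IDF row per document, with columns ordered like vocabulary."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each vocabulary word occurs.
    doc_freq = {word: sum(1 for tokens in token_lists if word in tokens)
                for word in vocabulary}
    # Smoothed IDF (an assumed variant); the +1 terms avoid division by zero.
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
           for word in vocabulary}
    rows = []
    for tokens in token_lists:
        total = len(tokens) or 1  # guard against empty documents
        rows.append([tokens.count(word) / total * idf[word]
                     for word in vocabulary])
    return rows
```

Each row can be loaded into a dataframe with the vocabulary as column names, which matches the per-word view described in step 6.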
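
Step 7 describes a KNN classifier written from scratch with Euclidean distance and k=7. The sketch below shows that technique under simple assumptions: feature vectors are plain lists of numbers, labels are any hashable values, and the names `euclidean_distance` and `knn_predict` are hypothetical rather than taken from the repository.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=7):
    """Predict the majority label among the k training vectors nearest to query."""
    # Pair every training point's distance to the query with its label.
    scored = [(euclidean_distance(vector, query), label)
              for vector, label in zip(train_vectors, train_labels)]
    scored.sort(key=lambda pair: pair[0])  # nearest first
    nearest_labels = [label for _, label in scored[:k]]
    # Majority vote; ties go to the label seen first among the neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]
```

With two classes, an odd k such as 7 avoids tied votes. Each Google result vector from step 8 would be passed as `query`, with the Twitter-derived vectors and their labels as the training data.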