https://github.com/madhups1992/ceegle-search

Finds Google search results that show conspiracy content - NLP, web scraper, Flask app, KNN (from scratch)

Host: GitHub
URL: https://github.com/madhups1992/ceegle-search
Owner: madhups1992
License: lgpl-3.0
Created: 2019-11-03T21:33:26.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2019-11-20T14:17:37.000Z (over 5 years ago)
Last Synced: 2025-01-15T07:19:08.696Z (5 months ago)
Topics: end-to-end-machine-learning, flask, flask-application, from-scratch, google-search-using-python, html-css-javascript, k-nearest-neighbours, knn-classifier, named-entity-recognition, nlp, python, selenium-python, tweeter-api, ui, webscraping, youtube-search
Language: JavaScript
Homepage:
Size: 3.92 MB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Ceegle-search
Finds Google search results that show conspiracy content, using NLP, a web scraper, and a Flask app.

### Objective: Highlight conspiracy content in Google search results.

### Steps:
1) A web scraper extracts the title and URL of each result from 4 pages of Google search results.
2) Tweets matching the search key together with #conspiracy are extracted from Twitter. [Positive labels]
3) Tweets matching the search key alone are extracted from Twitter. [Negative labels]
4) Together, these tweets form the training set.
5) The extracted text is then cleaned using standard NLP steps:
- Removing symbols, single characters, numbers, etc.
- Removing stop words.
- Running a POS (part-of-speech) tagger and keeping only the relevant nouns, verbs, and adjectives.
- Lemmatizing based on the POS tag (e.g. going, go => go).
- Rare, unwanted words always remain, so the 30 most frequent words from each of the positive and negative sets are chosen.
- All other words are removed.
- The remaining text is converted to TF-IDF (term frequency-inverse document frequency) vectors.
6) The vectors are stored in a dataframe to assess the frequency of each word.
7) A KNN (k-nearest neighbours) classifier was implemented from scratch, using Euclidean distance with k=7. Compared with a random forest, it seemed to work well for prediction.
8) The same cleaning process used for the training set is applied to the Google search results.
9) The KNN classifier is then used to classify the search results. A GIF recording of the application is included for easy understanding.
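
The core of steps 5 to 9 can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this repository: the function names (`top_words`, `fit_idf`, `vectorize`, `knn_predict`), the smoothed IDF formula, and the toy token lists are all hypothetical, and the input is assumed to be already cleaned and lemmatized. The toy call uses k=3 because the example training set has only six documents; the project itself uses k=7.

```python
import math
from collections import Counter

def top_words(docs, n=30):
    """Return the n most frequent tokens across a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {word for word, _ in counts.most_common(n)}

def fit_idf(docs, vocab):
    """Compute a smoothed inverse document frequency for each vocabulary word."""
    n_docs = len(docs)
    idf = {}
    for word in vocab:
        df = sum(1 for doc in docs if word in doc)
        idf[word] = math.log((1 + n_docs) / (1 + df)) + 1
    return idf

def vectorize(doc, vocab, idf):
    """Turn one tokenized document into a TF-IDF vector over the sorted vocabulary."""
    tf = Counter(doc)
    total = len(doc) or 1  # avoid division by zero for empty documents
    return [tf[word] / total * idf[word] for word in sorted(vocab)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k=7):
    """Majority vote among the k training vectors closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy, made-up token lists standing in for cleaned tweets.
pos_docs = [["secret", "coverup", "hoax"],
            ["hoax", "secret", "agenda"],
            ["coverup", "agenda", "hoax"]]   # label 1: tweets with #conspiracy
neg_docs = [["visit", "mass", "speech"],
            ["speech", "visit", "city"],
            ["mass", "city", "visit"]]       # label 0: tweets without it

# Keep the top 30 words from each class, then vectorize the training set.
vocab = top_words(pos_docs, 30) | top_words(neg_docs, 30)
train_docs = pos_docs + neg_docs
train_y = [1] * len(pos_docs) + [0] * len(neg_docs)
idf = fit_idf(train_docs, vocab)
train_X = [vectorize(doc, vocab, idf) for doc in train_docs]

# Classify a cleaned search-result title with the same vocabulary and IDF.
query = vectorize(["secret", "hoax", "speech"], vocab, idf)
print(knn_predict(train_X, train_y, query, k=3))
```

Reusing the training vocabulary and IDF values for the search results keeps both sets of vectors in the same feature space, which is what makes the Euclidean distances comparable.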

#### Below is a preview of the application with search key = "pope". It identifies conspiracy content in the Google search results.

![](ceegleSearch.gif)
