Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/devidw/google-untitled-spam-spider

A spam spider which is targeting 'Untitled' spam pages from the Google search results.
https://github.com/devidw/google-untitled-spam-spider

crawler crawling crawling-algorithm crawling-python crawling-sites crawling-tool google-untitled python python3 spam spam-detection spammer untitled untitled-spam

Last synced: about 2 months ago
JSON representation

A spam spider which is targeting 'Untitled' spam pages from the Google search results.

Awesome Lists containing this project

README

        

= Google _'Untitled'_ Spam Spider

A tiny web spider that starts crawling a website and crawls as long as it can find links on those pages, which links to similar spam pages.

This spider is targeting the 'Untitled' spam pages from the Google search results.

I wrote https://david.wolf.gdn/posts/spam/google-untitled/[several articles] about those spam pages. In which I discuss the underlying backgrounds of this spam network.

[quote, David Wolf, 'https://david.wolf.gdn/i-crawled-105009-google-untitled-spam-pages-in-7-days-and-700504-other-linked-spam-pages/[david.wolf.gdn]']
I crawled 105,009 Google 'Untitled' Spam Pages in 7 days and 700,504 other linked Spam Pages

== Usage

[source,python]
----
from google_spam_spider import GoogleSpamSpider

spider = GoogleSpamSpider(
url='http://zone-casino.fr/2hephe/torch-functional-unfold.html', # The url to start crawling
direct_spam_logs='direct_spam.log', # The file to log direct spam
external_spam_logs='external_spam.log' # The file to log external spam
)
----