https://github.com/rrmerugu/trawler

A data gathering/trawling framework to search and get information from web sources like bing
crawler-engine python search webcrawler

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/rrmerugu/trawler
Owner: rrmerugu
License: mit
Created: 2017-06-04T10:59:41.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2022-07-15T18:40:55.000Z (almost 4 years ago)
Last Synced: 2025-11-27T13:22:18.879Z (7 months ago)
Topics: crawler-engine, python, search, webcrawler
Language: Python
Homepage:
Size: 76.2 KB
Stars: 2
Watchers: 3
Forks: 2
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

# Invaana Trawler

[![Build Status](https://travis-ci.org/rrmerugu/trawler.svg?branch=master)](https://travis-ci.org/rrmerugu/trawler)

[![codecov](https://codecov.io/gh/rrmerugu/trawler/branch/master/graph/badge.svg)](https://codecov.io/gh/rrmerugu/trawler)

This is a very lightweight data-gathering framework for searching and collecting information from web sources such as Bing and Stack Overflow.

## Installation and Configuration

```bash
# install this package from PyPI
pip install trawler

# or install the latest code from GitHub
pip install git+https://github.com/rrmerugu/trawler.git#egg=trawler

# install the Selenium components, including drivers (Chrome must be installed on your machine)
npm install selenium-standalone@latest -g
selenium-standalone install  # installs the drivers
selenium-standalone start    # starts the Selenium server

# when working from a cloned copy of the repository, also install its requirements
pip install -r requirements/requirements.txt
```
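Before running a browser-based trawl, it can help to confirm that the Selenium server started above is actually reachable. The following is a minimal sketch and not part of trawler: `is_port_open` is a hypothetical helper, and port 4444 on `localhost` is an assumption about where `selenium-standalone` listens, so adjust it to match your setup.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Port 4444 is an assumption; change it if your server listens elsewhere.
    print("Selenium server reachable:", is_port_open("localhost", 4444))
```

This only checks that something accepts TCP connections on that port; it does not verify that the listener is a working Selenium server.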

## Usage

```python
from trawler import TrawlIt
from trawler.browsers.wordpress import BrowseWordPress

# Search Bing for a keyword and for keywords generated from it
trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing", method="requests")
# trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing")
trawl.run()  # gathers data for all generated keywords and saves it to MongoDB
trawl.generated_keywords  # e.g. ['learning MongoDB', 'Programming with MongoDB', 'MongoDB tutorials']
trawl.data  # access the data after the run
trawl.stop()  # do this or an idle browser instance will be left on your machine

# or search Stack Overflow
trawl = TrawlIt(kw="Python Exception Error", browser="stackoverflow")
trawl.run()  # gathers data and saves it to MongoDB
trawl.data  # access the data after the run
trawl.stop()  # do this or an idle browser instance will be left on your machine

# or search the Stack Overflow documentation
trawl = TrawlIt(kw="django", browser="stackoverflow-doc")
trawl.run()  # gathers the topics from the Stack Overflow documentation
trawl.data  # access the data after the run
trawl.stop()  # do this or an idle browser instance will be left on your machine

# or search a WordPress source
trawl = TrawlIt(kw="django", browser="wordpress")
trawl.run()  # gathers data from the WordPress source
trawl.data  # access the data after the run
trawl.stop()  # do this or an idle browser instance will be left on your machine

# or use a browser class directly, here against a specific WordPress site
stack = BrowseWordPress(max_page=1, base_url="http://econbrowser.com/")
# stack = BrowseWordPress(kw="invaana", max_page=1, base_url="http://econbrowser.com")
stack.search()
stack.data  # returns the data
```
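Because a forgotten `stop()` leaves an idle browser instance behind, one option is to wrap the object in a context manager so that `stop()` runs even if `run()` raises. This is a minimal sketch, not part of trawler: `stopping` is a hypothetical helper, and `FakeTrawl` is a hypothetical stand-in for `TrawlIt` that lets the example run without a browser or MongoDB.

```python
from contextlib import contextmanager


@contextmanager
def stopping(trawl):
    """Yield trawl and always call trawl.stop() afterwards, even on error."""
    try:
        yield trawl
    finally:
        trawl.stop()


class FakeTrawl:
    """Hypothetical stand-in for TrawlIt so the sketch runs without a browser."""

    def __init__(self):
        self.data = []
        self.stopped = False

    def run(self):
        self.data.append({"title": "example result"})

    def stop(self):
        self.stopped = True


with stopping(FakeTrawl()) as trawl:
    trawl.run()
    results = trawl.data

print(results, trawl.stopped)
```

With the real library, `FakeTrawl()` would presumably be replaced by something like `TrawlIt(kw="MongoDB", browser="bing")`, on the assumption that `stop()` can be called after `run()` as in the usage above.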

## Supported Web sources

Currently, this framework supports automating searches with:

- Bing

- Bing Images

- Bing Keywords

- Stack Overflow

- Stack Overflow Documentation

- WordPress

## Important Note

Please read and understand Bing's web-crawling policy at https://advertise.bingads.microsoft.com/en-in/resources/policies/web-crawling before using this framework. Make sure you comply with each website's privacy policies before you crawl it.
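A site's `robots.txt` is one machine-readable signal of what it permits crawlers to fetch, and checking it can complement (not replace) reading the written policies. The sketch below uses Python's standard `urllib.robotparser`; it is not part of trawler, and the `robots.txt` content, the `my-crawler` user agent, and the `example.com` URLs are all hypothetical, inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/blog/"))
print(is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/page"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of parsing an inlined string.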
