https://github.com/rrmerugu/trawler
A data gathering/trawling framework to search and get information from web sources like bing
crawler-engine python search webcrawler
- Host: GitHub
- URL: https://github.com/rrmerugu/trawler
- Owner: rrmerugu
- License: mit
- Created: 2017-06-04T10:59:41.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2022-07-15T18:40:55.000Z (almost 4 years ago)
- Last Synced: 2025-11-27T13:22:18.879Z (7 months ago)
- Topics: crawler-engine, python, search, webcrawler
- Language: Python
- Homepage:
- Size: 76.2 KB
- Stars: 2
- Watchers: 3
- Forks: 2
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# Invaana Trawler
[Build status](https://travis-ci.org/rrmerugu/trawler)
[Coverage](https://codecov.io/gh/rrmerugu/trawler)
This is a very lightweight data-gathering framework for searching and collecting information from web sources such as
Bing and Stack Overflow.
## Installation and Configuration
```bash
# install this package from PyPI
pip install trawler
# or for latest code
pip install git+https://github.com/rrmerugu/trawler.git#egg=trawler
# install Selenium components, including drivers (Chrome must be installed on your machine)
npm install selenium-standalone@latest -g
selenium-standalone install # installs the drivers
selenium-standalone start # starts the selenium server
pip install -r requirements/requirements.txt
```
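Before running a Selenium-based trawl, it can help to confirm that the Selenium server started above is reachable. The sketch below is only an illustration: the helper name `is_port_open` is hypothetical and not part of trawler, and it assumes the server listens on `localhost` port `4444`, the usual default for `selenium-standalone`. Adjust both if your setup differs.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: selenium-standalone is listening on its default port 4444.
print("Selenium server reachable:", is_port_open("localhost", 4444))
```

An open port only shows that something is listening there. It does not prove that the Selenium server is healthy.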
## Usage
```python
from trawler import TrawlIt
trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing", method="requests")
#trawl = TrawlIt(kw="MongoDB", generate_kws=True, browser="bing")
trawl.run() # gathers data for all generated keywords and saves it to MongoDB
trawl.generated_keywords # access the generated keywords ['learning MongoDB', 'Programming with MongoDB', 'MongoDB tutorials' ]
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
# or
trawl = TrawlIt(kw="Python Exception Error", browser="stackoverflow")
trawl.run() # gathers data and saves it to MongoDB
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
trawl = TrawlIt(kw="django", browser="stackoverflow-doc")
trawl.run() # this will gather the topics from the stackoverflow documentation
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
trawl = TrawlIt(kw="django", browser="wordpress")
trawl.run() # this will gather data from the WordPress source
trawl.data # access the data after the run
trawl.stop() # do this or there will be an idle browser instance left on your machine
from trawler.browsers.wordpress import BrowseWordPress
stack = BrowseWordPress(max_page=1, base_url="http://econbrowser.com/")
# stack = BrowseWordPress(kw="invaana", max_page=1, base_url="http://econbrowser.com")
stack.search()
stack.data # returns the data
```
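The examples above call `trawl.stop()` by hand. If `run()` raises, that call is skipped and the idle browser instance stays behind. One way to guard against this is a small context manager that always calls `stop()`. This is a sketch, not part of trawler: the `stopping` helper and the `FakeTrawl` stand-in are hypothetical names used so the example runs without a browser. It assumes only that the object has `run()` and `stop()` methods, as `TrawlIt` does in the usage above.

```python
from contextlib import contextmanager


@contextmanager
def stopping(trawl):
    """Yield `trawl` and always call its stop() method on exit, even after an error."""
    try:
        yield trawl
    finally:
        trawl.stop()


class FakeTrawl:
    """Hypothetical stand-in with the same run()/stop()/data shape as the usage above."""

    def __init__(self):
        self.data = []
        self.stopped = False

    def run(self):
        self.data.append("result")

    def stop(self):
        self.stopped = True


# With the real package, this would be something like:
#   with stopping(TrawlIt(kw="django", browser="stackoverflow")) as trawl: ...
with stopping(FakeTrawl()) as trawl:
    trawl.run()

print(trawl.data, trawl.stopped)
```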
## Supported Web sources
Currently, this framework supports automating searches with:
- Bing
- Bing Images
- Bing Keywords
- Stack Overflow
- Stack Overflow Documentation
- WordPress
## Important Note
Please read and understand Bing's web crawling policy at
https://advertise.bingads.microsoft.com/en-in/resources/policies/web-crawling before using this framework. Make sure
you comply with each website's privacy policy before you crawl it.
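
As one practical aid, a site's `robots.txt` can be checked before crawling. The sketch below uses Python's standard `urllib.robotparser`. It is not something trawler itself does: the `is_allowed` helper, the sample rules, and the `my-crawler` user agent are illustrative assumptions. A `robots.txt` check is only one machine-readable signal and does not replace reading the policies mentioned above.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only; in practice, fetch the site's real robots.txt.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/private/page"))
print(is_allowed(EXAMPLE_ROBOTS, "my-crawler", "https://example.com/public/page"))
```

To check a live site, `RobotFileParser` also offers `set_url()` and `read()`, which download the file instead of parsing an in-memory string.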