Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/howie6879/magic_google
Google search results crawler, get google search results that you need
- Host: GitHub
- URL: https://github.com/howie6879/magic_google
- Owner: howie6879
- Created: 2017-01-12T06:55:21.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-11-14T10:11:26.000Z (about 1 year ago)
- Last Synced: 2024-11-16T05:49:01.312Z (4 days ago)
- Topics: crawler, google, google-search, spider
- Language: Python
- Size: 39.1 KB
- Stars: 393
- Watchers: 23
- Forks: 109
- Open Issues: 3
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## magic_google
[![](https://img.shields.io/pypi/v/magic_google.svg)](https://pypi.org/project/magic-google/)
### 1. What's magic_google
magic_google is a simple Google search crawler that lets you extract whatever you need from the results pages. While crawling, be aware of Google's per-IP rate limits and the exceptions they trigger; I suggest pausing the program between requests and using proxy IPs.
PHP version: [MagicGoogle](https://github.com/howie6879/php-google)
### 2. How to Use?
Run
``` shell
# Install from PyPI
pip install magic_google
# Or install the latest version from GitHub
pip install git+https://github.com/howie6879/magic_google.git
# Or clone the repository and use the example script directly
git clone https://github.com/howie6879/magic_google.git
cd magic_google
vim google_search.py
# Or install from source
python setup.py install
```
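A quick way to confirm the install worked is to import the package and construct the crawler (a minimal check; the proxy-free constructor is the same one used in the example below):

``` python
# Minimal sanity check that magic_google is importable after installation
from magic_google import MagicGoogle

mg = MagicGoogle()  # no proxies, direct connection
print(mg)
```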
Example
``` python
from magic_google import MagicGoogle
import pprint

# Or PROXIES = None
PROXIES = [{
    'http': 'http://192.168.2.207:1080',
    'https': 'http://192.168.2.207:1080'
}]

# Or MagicGoogle()
mg = MagicGoogle(PROXIES)

# Crawl the whole results page
result = mg.search_page(query='python')

# Crawl the result URLs
for url in mg.search_url(query='python'):
    pprint.pprint(url)

# Output
# 'https://www.python.org/'
# 'https://www.python.org/downloads/'
# 'https://www.python.org/about/gettingstarted/'
# 'https://docs.python.org/2/tutorial/'
# 'https://docs.python.org/'
# 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# 'https://www.codecademy.com/courses/introduction-to-python-6WeG3/0?curriculum_id=4f89dab3d788890003000096'
# 'https://www.codecademy.com/learn/python'
# 'https://developers.google.com/edu/python/'
# 'https://learnpythonthehardway.org/book/'
# 'https://www.continuum.io/downloads'

# Get {'title', 'url', 'text'}
for i in mg.search(query='python', num=1):
    pprint.pprint(i)

# Output
# {'text': 'The official home of the Python Programming Language.',
#  'title': 'Welcome to Python .org',
#  'url': 'https://www.python.org/'}
```
See [google_search.py](./examples/google_search.py) for a complete example.

**If you need to run a large number of queries from a single IP address, I suggest pausing 5s ~ 30s between requests.**
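A minimal sketch of such a pause, assuming you loop over several queries with one `MagicGoogle` instance (the `queries` list and the exact delay range are illustrative):

``` python
import random
import time

from magic_google import MagicGoogle

mg = MagicGoogle()  # or MagicGoogle(PROXIES)
queries = ['python', 'asyncio', 'web scraping']  # example queries

results = {}
for q in queries:
    results[q] = list(mg.search_url(query=q))
    # Sleep 5-30 seconds between queries to stay under Google's per-IP limits
    time.sleep(random.uniform(5, 30))
```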
If the crawler always returns empty results, Google is most likely redirecting your requests (typically because of rate limiting), and the response looks like this:
```html
302 Moved

302 Moved
The document has moved here.
```
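One way to cope with this is to treat an empty result set as a signal to back off and retry; the sketch below assumes `mg.search` simply yields nothing when Google redirects, and the retry count and wait time are illustrative:

``` python
import time

from magic_google import MagicGoogle

mg = MagicGoogle()

def search_with_backoff(query, retries=3, wait=30):
    """Retry a query, waiting between attempts when Google returns nothing."""
    for _ in range(retries):
        results = list(mg.search(query=query))
        if results:
            return results
        # Likely rate-limited (302 redirect); wait before trying again
        time.sleep(wait)
    return []

print(search_with_backoff('python'))
```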