https://github.com/oxylabs/how-to-scrape-google-scholar
A guide for extracting titles, authors, and citations from Google Scholar using Python and Oxylabs SERP Scraper API.
- Host: GitHub
- URL: https://github.com/oxylabs/how-to-scrape-google-scholar
- Owner: oxylabs
- Created: 2024-03-07T12:23:51.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-30T12:56:05.000Z (3 months ago)
- Last Synced: 2024-12-23T04:04:02.156Z (11 days ago)
- Topics: google-scholar, google-scholar-scraper, google-scholar-scrapper, google-search-scraper, python, python-scraper, scraper-api, web-scraper, web-scraping
- Language: Python
- Homepage: https://oxylabs.io/products/scraper-api/serp
- Size: 10.7 KB
- Stars: 347
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# How to Scrape Google Scholar
[![Oxylabs promo code](https://user-images.githubusercontent.com/129506779/250792357-8289e25e-9c36-4dc0-a5e2-2706db797bb5.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
Take a look at the process of getting titles, authors, and citations from [Google Scholar](https://scholar.google.com/) using Oxylabs [SERP Scraper API](https://oxylabs.io/products/scraper-api/serp) (a part of Web Scraper API) and Python. You can get a **1-week free trial** by registering on the [dashboard](https://dashboard.oxylabs.io/).
For a detailed walkthrough with explanations and visuals, check our [blog post](https://oxylabs.io/blog/how-to-scrape-google-scholar).
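At its core, the process described above has two API-facing steps: send the Google Scholar URL to the API, then read the page HTML out of the JSON response. The following is a minimal sketch of just those two steps, mirroring the payload and response shape used in the full script below. The helper names `build_payload` and `extract_html` are hypothetical, and `sample_response` is a stand-in with the same shape the script indexes into, not real API output.

```python
# Hypothetical helpers (not part of the original script) that isolate the
# request body and the response handling used by the full code below.


def build_payload(url):
    """Return the JSON body the full script sends for a given Scholar URL."""
    return {"url": url, "source": "google"}


def extract_html(response_json):
    """Return the HTML from a response shaped like results[0]["content"]."""
    results = response_json.get("results") or []
    if not results:
        raise ValueError("response contains no results")
    return results[0]["content"]


# Stand-in response for illustration only; a real one comes from the API.
sample_response = {"results": [{"content": "<html><body>stub</body></html>"}]}

payload = build_payload("https://scholar.google.com/scholar?q=global+warming")
html = extract_html(sample_response)
print(payload["source"], len(html))
```

Keeping these steps in small functions makes it possible to check the parsing logic against saved HTML without spending API requests.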
## The complete code
```python
import requests
from bs4 import BeautifulSoup

# Replace with your Oxylabs API credentials.
USERNAME = "USERNAME"
PASSWORD = "PASSWORD"


def get_html_for_page(url):
    """Fetch the HTML of a Google Scholar URL through the API."""
    payload = {
        "url": url,
        "source": "google",
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=(USERNAME, PASSWORD),
        json=payload,
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]


def get_citations(article_id):
    """Return the citation formats listed in an article's "Cite" dialog."""
    url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite"
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    data = []
    # Each table row holds one citation style and its formatted text.
    for citation in soup.find_all("tr"):
        title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True)
        content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True)
        entry = {
            "title": title,
            "content": content,
        }
        data.append(entry)

    return data


def parse_data_from_article(article):
    """Extract title, authors, URL, and citations from one search result."""
    title_elem = article.find("h3", {"class": "gs_rt"})
    title = title_elem.get_text()
    # The first anchor is the title link; its id doubles as the article ID.
    title_anchor_elem = article.select("a")[0]
    url = title_anchor_elem["href"]
    article_id = title_anchor_elem["id"]
    authors = article.find("div", {"class": "gs_a"}).get_text()
    return {
        "title": title,
        "authors": authors,
        "url": url,
        "citations": get_citations(article_id),
    }


def get_url_for_page(url, page_index):
    """Append the result offset used for pagination."""
    return url + f"&start={page_index}"


def get_data_from_page(url):
    """Parse every article on a single results page."""
    html = get_html_for_page(url)
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.find_all("div", {"class": "gs_ri"})
    return [parse_data_from_article(article) for article in articles]


data = []
url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5"

NUM_OF_PAGES = 1
page_index = 0
for _ in range(NUM_OF_PAGES):
    page_url = get_url_for_page(url, page_index)
    entries = get_data_from_page(page_url)
    data.extend(entries)
    page_index += 10  # advance the offset to the next results page

print(data)
```
## Final word
Check our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google) for more details on the API parameters and variables used in this tutorial.
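One detail of the script worth isolating is pagination: it appends `&start=` to the search URL and advances the offset by 10 per page. The sketch below shows the same idea using only the standard library, so an existing `start` parameter is replaced rather than duplicated. `build_page_urls` and `RESULTS_PER_PAGE` are hypothetical names, and the page size of 10 is an assumption taken from the `page_index += 10` step in the script.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Assumption: one results page holds 10 entries, as implied by the script.
RESULTS_PER_PAGE = 10


def build_page_urls(base_url, num_pages):
    """Return one URL per results page, with start set to 0, 10, 20, ..."""
    parts = urlparse(base_url)
    # Drop any existing start parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "start"
    ]
    urls = []
    for page in range(num_pages):
        page_query = query + [("start", str(page * RESULTS_PER_PAGE))]
        urls.append(urlunparse(parts._replace(query=urlencode(page_query))))
    return urls


for page_url in build_page_urls(
    "https://scholar.google.com/scholar?q=global+warming&hl=en", 3
):
    print(page_url)
```

Note that `urlencode` percent-encodes characters such as commas, so a parameter like `as_sdt=0,5` would be emitted as `as_sdt=0%2C5`, which decodes to the same value.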
If you have any questions, feel free to contact us at [email protected].
Read More Google Scraping Related Repositories: [Google Sheets for Basic Web Scraping](https://github.com/oxylabs/web-scraping-google-sheets), [How to Scrape Google Shopping Results](https://github.com/oxylabs/scrape-google-shopping), [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [How To Scrape Google Images](https://github.com/oxylabs/how-to-scrape-google-images), [Scrape Google Search Results](https://github.com/oxylabs/scrape-google-python), [Scrape Google Trends](https://github.com/oxylabs/how-to-scrape-google-trends)