https://github.com/jhu-library-applications/is-this-digitized

google-books-api hathitrust internet-archive oclc


Host: GitHub
URL: https://github.com/jhu-library-applications/is-this-digitized
Owner: jhu-library-applications
Created: 2022-09-13T14:17:34.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-12-14T16:11:02.000Z (over 1 year ago)
Last Synced: 2025-02-05T17:40:05.196Z (5 months ago)
Topics: google-books-api, hathitrust, internet-archive, oclc
Language: Python
Homepage:
Size: 13.7 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Searching for digitized books by OCLC identifier

This repository has scripts to search the following websites for digitized books by their OCLC numbers.

- [Google Books](https://books.google.com/)
- [HathiTrust](https://www.hathitrust.org/)
- [Internet Archive](https://archive.org/)

***

## Data

### Formatting your OCLC numbers for searching

OCLC identifiers should be entered into a spreadsheet in a column called `oclc_id`. The identifiers should not have any prefixes such as "ocm", "on", or "(OCoLC)". Save your spreadsheet as a UTF-8 encoded CSV. It does not matter whether the identifiers are saved as integers or strings, because the scripts automatically convert them to strings.

When your CSV is ready, put it in the same folder location as the scripts below on your local system.
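
If your identifiers still carry prefixes, a small helper can clean them before you save the CSV. The following is a minimal sketch, not part of the repository: `normalize_oclc` is a hypothetical name, the "ocn" prefix is an extra assumption beyond the prefixes listed above, and dropping leading zeros is also an assumption about how you want the numbers stored.

```python
import re

# Prefixes named in the text above, plus "ocn" (an assumption). Case-insensitive.
_PREFIX = re.compile(r"^\s*(\(OCoLC\)|ocm|ocn|on)\s*", re.IGNORECASE)


def normalize_oclc(value):
    """Return an OCLC number as a bare digit string (hypothetical helper)."""
    text = str(value).strip()
    # Spreadsheets sometimes export integers as floats, e.g. 12345.0
    if re.fullmatch(r"\d+\.0+", text):
        text = text.split(".")[0]
    # Strip stacked prefixes such as "(OCoLC)ocm00012345" one at a time.
    while True:
        stripped = _PREFIX.sub("", text, count=1)
        if stripped == text:
            break
        text = stripped
    # Assumption: zero padding is not wanted in the cleaned identifier.
    return text.lstrip("0") or "0"
```

With pandas, this could be applied to the whole column with something like `df["oclc_id"] = df["oclc_id"].map(normalize_oclc)` before writing the CSV.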

### Test data

There is a folder called "test-data" in the repository with test data and results. This can help with formatting and troubleshooting the scripts on your local system.
- test.csv: A CSV with 9 items (3 items findable by OCLC number for each website). These items were selected at random.
- hathiTrustResults_test.csv: The results from running test.csv against searchHathiTrustByOCLC.py.
- googleBooksResults_test.csv: The results from running test.csv against searchGoogleBooksByOCLC.py.
- internetArchiveResults_test.csv: The results from running test.csv against searchInternetArchiveByOCLC.py.

***

## Scripts

## Requirements
- Python 3+
- [pandas](https://pandas.pydata.org/) library
- [requests](https://requests.readthedocs.io/en/latest/) library
- [internetarchive](https://archive.org/services/docs/api/internetarchive/) library

## searchGoogleBooksByOCLC.py

Setup: Searching requires a Google API key. Go to [Google's APIs & Services Credentials page](https://console.cloud.google.com/apis/credentials) and register for an API key using a Google account. Then, in the same folder as this script, create a Python file called `googleKey.py` with the following code:

```python
key='##########'
```
Be sure to add `googleKey.py` to your `.gitignore` so the key is not committed.

Search limits: Google Books limits the number of books searched per minute via its API, so the script pauses for 60 seconds after searching each set of 100 OCLC numbers. If you have 1000 OCLC identifiers to search, the script will take at least 10 minutes. I'm sure there is a better solution, but I don't know what it is. There is also a daily limit on the number of books you can search via the API, so avoid searching more than 1000 identifiers in a 24-hour period. If you exceed the limit, you will get an error; try rerunning the script the next day.
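
The pause-per-batch pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `search_in_batches` and `search_one` are hypothetical names, and `search_one` stands in for whatever function performs a single Google Books request.

```python
import time


def search_in_batches(oclc_ids, search_one, batch_size=100, pause_seconds=60,
                      sleep=time.sleep):
    """Call search_one(oclc_id) for each ID, pausing after every full batch.

    search_one is a hypothetical callable standing in for the real API request.
    sleep is injectable so the pacing can be tested without actually waiting.
    """
    results = {}
    ids = [str(i) for i in oclc_ids]
    for position, oclc_id in enumerate(ids, start=1):
        results[oclc_id] = search_one(oclc_id)
        # Pause after each full batch, but not after the final ID.
        if position % batch_size == 0 and position < len(ids):
            sleep(pause_seconds)
    return results
```

This sketch skips the pause after the final identifier, so it may finish slightly sooner than the timing described above.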

## searchHathiTrustByOCLC.py

This searches the `oclc` field in HathiTrust.
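
As a rough illustration of an OCLC lookup, the sketch below uses HathiTrust's public Bibliographic API. Whether the repository's script uses this endpoint is an assumption, and the function names are hypothetical. It uses the standard library's `urllib` to stay self-contained; the `requests` library listed in the requirements would work equally well.

```python
import json
from urllib.request import urlopen

# HathiTrust Bibliographic API, "brief" record lookup by OCLC number.
HATHI_BIB_API = "https://catalog.hathitrust.org/api/volumes/brief/oclc/{}.json"


def hathitrust_url(oclc_id):
    """Build the lookup URL for one OCLC number."""
    return HATHI_BIB_API.format(str(oclc_id).strip())


def item_urls(response):
    """Collect the item links from a decoded API response (a dict)."""
    return [item["itemURL"] for item in response.get("items", [])
            if "itemURL" in item]


def search_hathitrust(oclc_id, timeout=30):
    """Fetch the record and return its item links (empty list if none)."""
    with urlopen(hathitrust_url(oclc_id), timeout=timeout) as reply:
        return item_urls(json.load(reply))
```

An empty list means no digitized items were reported for that OCLC number.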

## searchInternetArchiveByOCLC.py

This searches two metadata fields in the Internet Archive for an OCLC number: `external-identifier` and `oclc_id`.
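
A query covering both fields might look like the sketch below. The field names come from the description above; the `urn:oclc:record:` value format and the function names are assumptions, so the script's actual query may differ.

```python
def build_ia_query(oclc_id):
    """Build a search query that checks both metadata fields."""
    oclc_id = str(oclc_id).strip()
    return (
        # Assumption: external identifiers are stored as URNs.
        f'external-identifier:"urn:oclc:record:{oclc_id}" '
        f'OR oclc_id:"{oclc_id}"'
    )


def search_internet_archive(oclc_id):
    """Return the identifiers of matching Internet Archive items."""
    # Imported lazily so the query builder works without the library installed.
    from internetarchive import search_items

    return [hit["identifier"] for hit in search_items(build_ia_query(oclc_id))]
```

Each returned identifier corresponds to an item page at `https://archive.org/details/<identifier>`.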

## combineMyResults.py

This script combines CSV results generated by running the above three scripts.
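
Combining results amounts to joining the three CSVs on `oclc_id`. The sketch below shows the idea with the standard library's `csv` module; it is not the script's actual code, `combine_results` is a hypothetical name, and the column names in the example are invented for illustration.

```python
import csv


def combine_results(files, key="oclc_id"):
    """Merge rows from several CSV sources into one dict per OCLC number.

    Each source is an open file or any iterable of CSV lines whose header
    includes the key column. Rows keep first-seen order.
    """
    combined = {}
    for handle in files:
        for row in csv.DictReader(handle):
            oclc_id = row[key].strip()
            merged = combined.setdefault(oclc_id, {key: oclc_id})
            for column, value in row.items():
                if column != key:
                    merged[column] = value
    return list(combined.values())


# Example with in-memory data (hypothetical column names):
google = ["oclc_id,google_link", "1,g1", "2,"]
hathi = ["oclc_id,hathi_link", "1,h1", "3,h3"]
rows = combine_results([google, hathi])
```

An identifier that appears in only one source keeps only that source's columns. With pandas, an outer merge on `oclc_id` would give a similar result.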
