https://github.com/davidshq/oreillyas
Little baby version of Python script that grabs a list of books available from O'Reilly Learning.
api oreilly oreilly-books
- Host: GitHub
- URL: https://github.com/davidshq/oreillyas
- Owner: davidshq
- Created: 2023-01-05T20:33:28.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-08-06T13:19:08.000Z (almost 2 years ago)
- Last Synced: 2025-04-10T16:10:47.166Z (about 1 year ago)
- Topics: api, oreilly, oreilly-books
- Language: Python
- Homepage:
- Size: 76.2 KB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# O'Reilly Learning API Scraper
Version: 0.0.3 3/8/2024
## Table of Contents
1. [Description](#description)
2. [Usage](#usage)
3. [How It Works](#how-it-works)
4. [Why It Works This Way](#why-it-works-this-way)
5. [Loading Data Into SQLite](#loading-data-into-sqlite)
6. [Loading Data Into Neo4j](#loading-data-into-neo4j)
7. [How To: Generate a Sample from JSON results](#how-to-generate-a-sample-from-json-results)
8. [Quirks](#quirks)
9. [Secondary Documentation / Scripts](#secondary-documentation--scripts)
10. [Credits](#credits)
## Description
A primitive Python script that pulls down all the available books from the
O'Reilly Learning API and saves them to a local directory as JSON.
Provides a utility to transform the JSON into a SQLite DB including preserving
many-to-many relationships.
Has several other useful scripts for transforming the database.
> NOTE: You need an authentication token from O'Reilly to pull down more
> than the first five pages of results.
## Usage
1. Clone the repository
2. Install pipenv: `pip install pipenv`
3. Install dependencies and create virtual environment: `pipenv install`
4. Activate the virtual environment: `pipenv shell`
5. Tweak any settings you want in `main.py`
6. Run the script: `python main.py`
## How It Works
It adds each page of results from the O'Reilly API to a Python dictionary
then writes that dictionary out to a JSON file.
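
As a rough illustration of that flow, here is a minimal sketch. It is not the code in `main.py`: the function names, the choice to key the dictionary by page number, and the `fake_fetch_page` stand-in (used instead of a real API call) are all assumptions made for the example.

```python
import json
from pathlib import Path


def collect_pages(fetch_page, max_pages):
    """Gather each page of results into one dictionary keyed by page number."""
    books = {}
    for page in range(max_pages):
        results = fetch_page(page)
        if not results:  # an empty page means there is nothing left to fetch
            break
        books[page] = results
    return books


def save_books(books, path):
    """Write the accumulated dictionary out as a single JSON file."""
    Path(path).write_text(json.dumps(books, indent=2), encoding="utf-8")


def fake_fetch_page(page):
    """Hypothetical stand-in for the API call: two pages of made-up records."""
    pages = [
        [{"title": "Example Book A"}, {"title": "Example Book B"}],
        [{"title": "Example Book C"}],
    ]
    return pages[page] if page < len(pages) else []


if __name__ == "__main__":
    all_books = collect_pages(fake_fetch_page, max_pages=10)
    save_books(all_books, "oreilly_example.json")
```

Note that `json.dumps` turns the integer page keys into strings, so they come back as `"0"`, `"1"`, and so on when the file is reloaded.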
## Why It Works This Way
Each page of results is its own self-contained JSON document. We could concatenate
the JSON manually, but adding each page to the dictionary is easier.
## Loading Data Into SQLite
In the `json-to-sqlite` subfolder you'll find three scripts which can be used to:
1. Add a unique integer (pid) to each book record in `oreilly.json`: `add_pid_to_json.py`
2. Create a SQLite DB and appropriate tables to contain the data from `oreilly.json`: `create_db.py`
3. Transform the JSON data from `oreilly.json` into rows of data in the new SQLite DB: `convert_json_to_tables.py`
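
The three steps above can be sketched roughly as follows. This is an illustration only, using an in-memory database: the table names, column names, and the `title` / `authors` fields are hypothetical and may not match what the scripts in `json-to-sqlite` actually create.

```python
import sqlite3

# Hypothetical records; real entries in oreilly.json carry many more fields.
books = [
    {"title": "Example Book A", "authors": ["Ada Author", "Bob Writer"]},
    {"title": "Example Book B", "authors": ["Ada Author"]},
]

# Step 1: give every record a unique integer pid.
for pid, book in enumerate(books, start=1):
    book["pid"] = pid

# Step 2: create the tables, including a junction table that holds the
# many-to-many relationship between books and authors.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (pid INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE book_authors (
        book_pid INTEGER NOT NULL REFERENCES books (pid),
        author_id INTEGER NOT NULL REFERENCES authors (id),
        PRIMARY KEY (book_pid, author_id)
    );
    """
)

# Step 3: turn each JSON record into rows.
for book in books:
    conn.execute(
        "INSERT INTO books (pid, title) VALUES (?, ?)", (book["pid"], book["title"])
    )
    for name in book["authors"]:
        # The UNIQUE constraint plus OR IGNORE keeps one row per author.
        conn.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (name,))
        (author_id,) = conn.execute(
            "SELECT id FROM authors WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO book_authors (book_pid, author_id) VALUES (?, ?)",
            (book["pid"], author_id),
        )
conn.commit()

# Count books per author by going through the junction table.
rows = conn.execute(
    """
    SELECT a.name, COUNT(*) FROM authors AS a
    JOIN book_authors AS ba ON ba.author_id = a.id
    GROUP BY a.name ORDER BY a.name
    """
).fetchall()
print(rows)
```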
### Some Useful Views
You can optionally create a set of views that may be easier to use than the raw tables. Add these views by running `/create_views.py`.
Currently this generates a view for each publisher as well as a view for publishers with various imprints.
It also generates a view of each book that includes the publisher's name.
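
To give an idea of what such a view looks like, here is a small sketch of a book view that includes the publisher's name. The schema is made up for the example (a plain `publisher_id` foreign key and a view called `books_with_publisher`); the real tables and view names created by the scripts may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        pid INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        publisher_id INTEGER REFERENCES publishers (id)
    );
    INSERT INTO publishers (id, name) VALUES (1, 'Example Press');
    INSERT INTO books (pid, title, publisher_id) VALUES (1, 'Example Book A', 1);

    -- A view that pairs each book with its publisher's name.
    CREATE VIEW books_with_publisher AS
    SELECT b.pid, b.title, p.name AS publisher_name
    FROM books AS b
    LEFT JOIN publishers AS p ON p.id = b.publisher_id;
    """
)

# The view can be queried like any table.
view_rows = conn.execute("SELECT * FROM books_with_publisher").fetchall()
print(view_rows)
```

The `LEFT JOIN` keeps books that have no matching publisher, with `publisher_name` left as `NULL`.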
### Getting counts of books by publisher
You can populate the `book_counts` column on the `publishers` table with the number of books each publisher has by running `/add_count_to_publishers.py`.
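
One way to fill such a column is a single `UPDATE` with a correlated subquery, sketched below. Only `book_counts` and `publishers` come from this README; the other table and column names are assumptions, and the actual script may compute the counts differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE publishers (
        id INTEGER PRIMARY KEY, name TEXT NOT NULL, book_counts INTEGER
    );
    CREATE TABLE books (pid INTEGER PRIMARY KEY, title TEXT, publisher_id INTEGER);
    INSERT INTO publishers (id, name) VALUES (1, 'Example Press'), (2, 'Sample House');
    INSERT INTO books (pid, title, publisher_id) VALUES
        (1, 'Example Book A', 1), (2, 'Example Book B', 1), (3, 'Example Book C', 2);
    """
)

# For every publisher row, count the books that point at it.
conn.execute(
    """
    UPDATE publishers
    SET book_counts = (
        SELECT COUNT(*) FROM books WHERE books.publisher_id = publishers.id
    )
    """
)
conn.commit()

counts = conn.execute(
    "SELECT name, book_counts FROM publishers ORDER BY id"
).fetchall()
print(counts)
```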
## Loading Data Into Neo4j
In the `json-to-neo4j` subfolder you'll find a script that can be used to load the data from `oreilly.json` into a Neo4j database.
You should have an existing Neo4j database running and have set the host and auth environment variables in the `.env` file.
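
As a rough sketch of the idea, the snippet below reads connection settings from the environment and prepares a Cypher `MERGE` statement with its parameters. Everything named here is an assumption: the variable names (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`), the node labels, the relationship type, and the `publisher` field are hypothetical, so check the `.env` file and the script in `json-to-neo4j` for the real ones.

```python
import os


def neo4j_settings(env=os.environ):
    """Read connection settings; the variable names are assumptions."""
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "auth": (env.get("NEO4J_USER", "neo4j"), env.get("NEO4J_PASSWORD", "")),
    }


# MERGE creates a node or relationship only if it does not already exist,
# so loading the same book twice does not create duplicates.
MERGE_BOOK = """
MERGE (b:Book {pid: $pid})
SET b.title = $title
MERGE (p:Publisher {name: $publisher})
MERGE (p)-[:PUBLISHED]->(b)
"""


def book_params(book):
    """Pick out the query parameters for one (hypothetical) book record."""
    return {
        "pid": book["pid"],
        "title": book["title"],
        "publisher": book["publisher"],
    }
```

With the official `neo4j` Python driver installed, these pieces would typically be passed to `GraphDatabase.driver(uri, auth=auth)` and then to `session.run(MERGE_BOOK, book_params(book))` for each record.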
## How To: Generate a Sample from JSON results
The O'Reilly API results can get quite large (well over 100 MB) and can be hard to work with in a GUI editor. You may want to run `generate_sample_from_json.py` after running `main.py`. It takes the first 400 records (the number is customizable) and places them in a separate JSON file (`oreilly_sample.json`) that still gives a good idea of what the results look like, at a more manageable size.
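
The core of that idea fits in a few lines. The sketch below is not the actual `generate_sample_from_json.py`; the function name is made up, and since the exact layout of `oreilly.json` is not described here, it handles both a top-level list and a top-level dictionary.

```python
import json
from itertools import islice
from pathlib import Path


def write_sample(source, target, count=400):
    """Copy the first `count` top-level entries of `source` into `target`."""
    data = json.loads(Path(source).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Keep the first `count` key/value pairs (dicts preserve insertion order).
        sample = dict(islice(data.items(), count))
    else:
        # Otherwise assume a top-level JSON list of records.
        sample = data[:count]
    Path(target).write_text(json.dumps(sample, indent=2), encoding="utf-8")
    return len(sample)
```

For example, `write_sample("oreilly.json", "oreilly_sample.json", count=100)` would write a smaller sample of 100 entries.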
## Quirks
### Excluding Fields
The API lets you exclude some, but not all, fields from the results it returns. For example, `archive_id` can be excluded but `num_of_followers` cannot.
You can find a complete list of the excludable fields here: https://www.oreilly.com/online-learning/integration-docs/search.html#/get~api~v2~search~5
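
Here is a sketch of how an exclusion might be added to a request URL. The endpoint URL, the `exclude_fields` parameter name, and the other query parameters are assumptions based on my reading of the documentation linked above, so verify them there before relying on this. The helper `build_search_url` is made up for the example and does not send any request.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the O'Reilly documentation linked above.
SEARCH_URL = "https://learning.oreilly.com/api/v2/search/"


def build_search_url(query, page, exclude_fields=()):
    """Build a search URL, repeating the exclusion parameter once per field."""
    params = [("query", query), ("page", page)]
    # Assumed parameter name: exclude_fields.
    params += [("exclude_fields", field) for field in exclude_fields]
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("python", 0, exclude_fields=["archive_id"]))
```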
## Secondary Documentation / Scripts
The folder `for-learning` contains some additional scripts that show me exploring the O'Reilly API. This includes `get_entire_api_response.py`, which can be used to see the entire JSON response returned by the API, in contrast to the `main.py` script, which uses only the results portion of the response.
The folder `generic-json-mapping` is essentially nothing yet. I was surprised by the lack of a generic, essentially code-free tool to convert JSON to a relational SQL DB. This is where I may eventually build something to handle that generic scenario (if it really starts to happen, it'll probably be broken out into its own repo).
There is also a `pure_sql_queries` folder which contains some SQL queries I've used to explore the data.
## Credits
I've used GitHub Copilot quite a bit in creating this project. I've noted this explicitly in some files and, while it isn't required, I'll note it here as well. I hadn't used Copilot much before, and this seemed like a good opportunity to see what I could get it to do. It can be quite frustrating at times, but I see potential.