{"id":24469204,"url":"https://github.com/davidshq/oreillyas","last_synced_at":"2025-04-13T10:24:32.389Z","repository":{"id":137395683,"uuid":"585692546","full_name":"davidshq/oreillyas","owner":"davidshq","description":"Little baby version of Python script that grabs a list of books available from O'Reilly Learning.","archived":false,"fork":false,"pushed_at":"2024-08-06T13:19:08.000Z","size":78,"stargazers_count":4,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T16:10:47.166Z","etag":null,"topics":["api","oreilly","oreilly-books"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidshq.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-05T20:33:28.000Z","updated_at":"2024-12-02T03:00:07.000Z","dependencies_parsed_at":"2024-05-01T18:20:07.773Z","dependency_job_id":"ceff4012-5969-4fcd-afba-d2550c409ff2","html_url":"https://github.com/davidshq/oreillyas","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidshq%2Foreillyas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidshq%2Foreillyas/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidshq%2Foreillyas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidshq%2Foreillyas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/davidshq","download_url":"https://codeload.github.com/davidshq/oreillyas/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248696724,"owners_count":21147169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","oreilly","oreilly-books"],"created_at":"2025-01-21T07:14:44.906Z","updated_at":"2025-04-13T10:24:32.357Z","avatar_url":"https://github.com/davidshq.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# O'Reilly Learning API Scraper\n\nVersion: 0.0.3 3/8/2024\n\n## Table of Contents\n1. [Description](#description)\n2. [Usage](#usage)\n3. [How It Works](#how-it-works)\n4. [Why It Works This Way](#why-it-works-this-way)\n5. [Loading Data Into SQLite](#loading-data-into-sqlite)\n6. [Loading Data Into Neo4j](#loading-data-into-neo4j)\n7. [Quirks](#quirks)\n8. [Secondary Documentation / Scripts](#secondary-documentation--scripts)\n9. [Credits](#credits)\n\n## Description\nA primitive Python script that pulls down all the available books from the\nO'Reilly Learning API and saves them to a local directory as JSON.\n\nProvides a utility to transform the JSON into a SQLite DB including preserving\nmany-to-many relationships.\n\nHas several other useful scripts for transforming the database.\n\n\u003e NOTE: You have to have an authentication token from O'Reilly in order to pull down more\n\u003e than the first five pages of results.\n\n## Usage\n1. Clone the repository\n2. Install pipenv: `pip install pipenv`\n3. 
Install dependencies and create virtual environment: `pipenv install`\n4. Activate the virtual environment: `pipenv shell`\n5. Tweak any settings you want in `main.py`\n6. Run the script: `python main.py`\n\n## How It Works\nIt adds each page of results from the O'Reilly API to a Python dictionary,\nthen writes that dictionary out to a JSON file.\n\n## Why It Works This Way\nEach page of results is its own self-contained JSON document. We could concatenate\nthe JSON manually, but adding it to the dictionary is easier.\n\n## Loading Data Into SQLite\nIn the `json-to-sqlite` subfolder you'll find three scripts which can be used to:\n1. Add a unique integer (pid) to each book record in `oreilly.json`: `add_pid_to_json.py`\n2. Create a SQLite DB and appropriate tables to contain the data from `oreilly.json`: `create_db.py`\n3. Transform the JSON data from `oreilly.json` into rows of data in the new SQLite DB: `convert_json_to_tables.py`\n\n### Some Useful Views\nYou can optionally create a set of views that may be easier to use than the raw tables. You can add these views by running `/create_views.py`.\n\nCurrently this generates a view for each publisher as well as a view for publishers with various imprints.\n\nIt also generates a view of each book that includes the publisher's name.\n\n### Getting counts of books by publisher\nYou can populate the `book_counts` column on the `publishers` table with the number of books each publisher has by running `/add_count_to_publishers.py`.\n\n## Loading Data Into Neo4j\nIn the `json-to-neo4j` subfolder you'll find a script that can be used to load the data from `oreilly.json` into a Neo4j database.\n\nYou should have an existing Neo4j database running and have set the host and auth environment variables in the `.env` file.\n\n## How To: Generate a Sample from JSON results\nThe O'Reilly API results can get quite large (well over 100 MB) and can be a bit hard to manipulate in a GUI editor. 
You may want to run `generate_sample_from_json.py` after running `main.py`. This will take the first 400 records (you can customize the number) and place them in a separate JSON file (`oreilly_sample.json`) that still gives a good idea of what the results are but in a more manageable size.\n\n## Quirks\n\n### Excluding Fields\nYou can exclude only some fields from the results returned by the API. For example, `archive_id` can be excluded but `num_of_followers` cannot.\n\nYou can find a complete list of the excludable fields here: https://www.oreilly.com/online-learning/integration-docs/search.html#/get~api~v2~search~5\n\n## Secondary Documentation / Scripts\n\nThe folder `for-learning` contains some additional scripts that show me exploring the O'Reilly API. This includes `get_entire_api_response.py`, which can be used to see the entire JSON response returned by the API, in contrast to the `main.py` script, which utilizes only the results portion of the response.\n\nThe folder `generic-json-mapping` is essentially nothing yet. I was surprised by the lack of a generic, essentially code-free tool to convert JSON to a relational SQL DB. This is where I may eventually build something to handle that generic scenario (if it really starts to happen it'll probably be broken out into its own repo).\n\nThere is also a `pure_sql_queries` folder which contains some SQL queries I've used to explore the data.\n\n## Credits\nI've used GitHub Copilot quite a bit in creating this project. I have noted this explicitly in some files and, while it isn't required, I'll note it here as well. I haven't messed around with it much before and this seemed like a good opportunity to see what I could get it to do. 
It can be quite frustrating at times, but I see potential.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidshq%2Foreillyas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidshq%2Foreillyas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidshq%2Foreillyas/lists"}