https://github.com/hughp/glottolog4.7-isbn-extraction
ISBN extraction from Glottolog
# Glottolog ISBN Extraction
This repo contains the 20,880 ISBNs that I extracted from the bibliography of Glottolog version 4.7. The goal is to compare access points from the perspective of scholars with those from the perspective of librarians. The comparison should also show how Library of Congress Subject Headings (LCSH) terms and Library of Congress Classification (LCC) terms/IDs are applied to linguistic and language-based resources. By comparing these data sources we gain insight into what scholars consider important as opposed to what librarians consider important, and we can further consider how information services meet consumer goals.
* Activities performed by: Hugh Paterson III
* Glottolog version used: 4.7
* Date of Extraction: 17 April 2023
* Source file: glottolog_source.bib.zip [61.4MB]
* Source file format: BibTeX
* Source location: https://glottolog.org/meta/downloads
* Method of Extraction: text mining via Linux command-line tools
* Result Files:
  * ISBNs: Glottolog-ISBN-Final.txt
  * OCLC IDs: glottolog-OCLCnumbers.txt

## Methods
The following methods were loosely followed; there was also a lot of one-off search and replace within a text editor. Ultimately I added a prefix to many ISBNs so that I could capture the last 1,000 or so, for example: `ISBN10 0` --> `ISBN-0`. ISBNs occurred in a variety of fields, including ISSN, Title, notes (various), ISBN, citation, abstract, etc. A global search for `ISBN` was refined to filter and clean out the noise. Spaces and minus signs (hyphens) were removed from the ISBNs.
Some commands used were:
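```
# Unpack the BibTeX source
$ unzip glottolog_source.bib.zip
# Collect every line that mentions an ISBN (case-insensitive)
$ grep -i "ISBN" glottolog.bib > ISBN.txt
# Sort the matches, then de-duplicate them
$ sort ISBN.txt > ISBN-sorted.txt
$ sort -u ISBN-sorted.txt > ISBN-sorted-u.txt
# Split on spaces (one token per line) and list the unique tokens
$ tr ' ' '\n' < title-test-1-minus-minuses-expeirment.txt | sort | uniq
# Same, keeping only tokens that contain "ISBN"
$ tr ' ' '\n' < title-test-1-minus-minuses-expeirment.txt | sort | uniq | grep ISBN
# Same, keeping only tokens of six or more characters
$ tr ' ' '\n' < title-test-1-minus-minuses-expeirment.txt | sort | uniq | sed -nr '/^.{6,}$/p'
# Build the final de-duplicated ISBN list
$ tr ' ' '\n' < Glottolog-ISBNs-Extracted.txt | sort | uniq > Glottolog-ISBN-Final.txt
```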
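
The steps described above (find lines mentioning an ISBN, strip spaces and hyphens, de-duplicate) could be approximated in one pass. The following is a minimal sketch and not the procedure that was actually used: the function name `extract_isbns` and both regular expressions are assumptions. It first drops `ISBN-10:` / `ISBN13` style labels so that the `10` or `13` cannot fuse with the number, which is the same problem the `ISBN10 0` --> `ISBN-0` replacement worked around. It will still miss some ISBNs and over-match some unrelated digit runs, so the output would need manual review.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original workflow):
# print the unique, normalized ISBN candidates found in the file "$1".
extract_isbns() {
  # Keep only lines that mention "ISBN" in any letter case.
  grep -i 'ISBN' "$1" |
    # Replace labels such as "ISBN-10: " or "ISBN13 " with a plain "ISBN ".
    # A separator must follow the 10/13, so a number that merely starts
    # with 10 or 13 is left alone.
    sed -E 's/[Ii][Ss][Bb][Nn][- ]?1[03][: ]+/ISBN /g' |
    # Remove a hyphen or space sitting between two digits (or before a
    # final X), repeating until no such separator is left on the line.
    sed -E -e ':a' -e 's/([0-9])[- ]([0-9X])/\1\2/' -e 'ta' |
    # Print only the matches: 13 digits starting 978/979, or 10 characters
    # (nine digits plus a digit or X).
    grep -oE '97[89][0-9]{10}|[0-9]{9}[0-9X]' |
    # Sort and de-duplicate.
    sort -u
}
```

A possible invocation would be `extract_isbns glottolog.bib > candidates.txt`, where `candidates.txt` is an illustrative file name.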
## Follow-up actions
The ISBNs are sent to OCLC with a request for the matching MARC records, and those records will then be overlaid with the BibTeX records. Analysis will ensue.
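
Before a list like this is submitted, mis-extracted numbers can be caught by verifying ISBN check digits. This README does not say that such a check was done; the sketch below is an assumption about one way to do it, and the function name `is_valid_isbn` is hypothetical. It expects already-normalized input (digits only, with an optional final `X` for ISBN-10) and uses the standard weighting: weights 10 down to 1 with a sum divisible by 11 for ISBN-10, and alternating weights 1 and 3 with a sum divisible by 10 for ISBN-13.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original workflow).
# Exit status 0 if "$1" is a check-digit-valid ISBN-10 or ISBN-13, else 1.
# Note: POSIX sh has no local variables, so the names below are global.
is_valid_isbn() {
  rest=$1
  sum=0
  case $rest in
    ''|*[!0-9X]*) return 1 ;;    # empty, or contains something other than 0-9 / X
  esac
  length=${#rest}
  case $length in
    10) weight=10 modulus=11 ;;
    13) weight=1  modulus=10 ;;
    *)  return 1 ;;              # wrong length for either ISBN form
  esac
  while [ -n "$rest" ]; do
    digit=${rest%"${rest#?}"}    # first character of $rest
    rest=${rest#?}               # everything after it
    if [ "$digit" = X ]; then
      # X stands for 10 and is only legal as the last character of an ISBN-10.
      [ "$length" -eq 10 ] && [ -z "$rest" ] || return 1
      digit=10
    fi
    sum=$((sum + digit * weight))
    if [ "$length" -eq 10 ]; then
      weight=$((weight - 1))     # ISBN-10 weights: 10, 9, ..., 1
    else
      weight=$((4 - weight))     # ISBN-13 weights alternate: 1, 3, 1, 3, ...
    fi
  done
  [ $((sum % modulus)) -eq 0 ]
}

# Possible use: keep only the lines of a list that pass the check, e.g.
#   while read -r isbn; do
#     is_valid_isbn "$isbn" && printf '%s\n' "$isbn"
#   done < Glottolog-ISBN-Final.txt
```

A check like this only confirms that a number is internally consistent; it says nothing about whether a record for that ISBN exists.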