https://github.com/rootsdev/genscrape
JavaScript library that aids in scraping person data off of genealogy websites
https://github.com/rootsdev/genscrape
Last synced: 4 months ago
JSON representation
JavaScript library that aids in scraping person data off of genealogy websites
- Host: GitHub
- URL: https://github.com/rootsdev/genscrape
- Owner: rootsdev
- License: mit
- Created: 2014-11-29T21:59:46.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2023-01-12T02:31:57.000Z (over 3 years ago)
- Last Synced: 2025-08-09T06:54:48.401Z (10 months ago)
- Language: HTML
- Size: 1.62 MB
- Stars: 46
- Watchers: 32
- Forks: 7
- Open Issues: 52
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
[](https://www.npmjs.com/package/genscrape)
[](https://travis-ci.org/rootsdev/genscrape)
[](https://coveralls.io/r/rootsdev/genscrape)
genscrape
==========
A JavaScript library that aids in scraping person data off of genealogy websites.
It is designed to be used in a browser extension. Data is output in
[GEDCOM X](http://www.gedcomx.org/) format.
## Install
```
npm install --save genscrape
```
Or use the [CDN](https://unpkg.com/#/)
```html
```
We recommend pinning CDN links to a specific version.
## Usage
Scraping is sometimes asynchronous so we chose to implement the [EventEmitter](https://nodejs.org/api/events.html)
interface using [EventEmmitter2](https://github.com/asyncly/EventEmitter2).
```js
genscrape().on('data', function(data){
// Do something with the data
});
```
That's it. Genscrape automatically detects what page you're on, looks up the
correct scraper, then does its magic.
### Events
#### data
The `data` event is fired when genscrape successfully scrapes data from the page.
Read more about the data model below.
#### noMatch
The `noMatch` event is fired when genscrape is called on a page with a URL that
it does not recognize. In other words, it does not have a parser registered for
the URL.
#### noData
The `noData` event is fired when genscrape is run on a page that a parser is
registered for but the parser was unable to find any data.
In cases of one-page apps such as the FamilySearch Family Tree and the findmypast
trees, you may see multiple `noData` and `data` events as the user browses through
the tree.
#### error
The `error` event is fired when something unexpected occurs while a scraper is
processing, such as a failed AJAX call. There is no standardized format for the
errors yet.
## Data Model
Data is returned in the [GEDCOM X JSON](https://github.com/FamilySearch/gedcomx/blob/master/specifications/json-format-specification.md)
format.
When possible, we mark one person as `principal`. This isn't possible when
viewing marriage records on sites that don't make a distinction between persons
in a record. The only currently supported site where this occurs is Open Archives.
Every GEDCOM X document has a [SourceDescription](https://github.com/FamilySearch/gedcomx/blob/master/specifications/conceptual-model-specification.md#source-description),
pointed to by the `about` property of the document, that provides a citation for
all data in the document. That SourceDescription points to an [Agent](https://github.com/FamilySearch/gedcomx/blob/master/specifications/conceptual-model-specification.md#agent),
via the `repository` property, which describes the website the data came from.
```json
{
"id": "agent",
"names": [
{
"lang": "en",
"value": "Find A Grave"
}
],
"homepage": {
"resource": "https://www.findagrave.com"
}
}
```
We populate person's IDs with IDs from the source repository.
Person IDs aren't unique between websites. Thus we add a `genscrape`
[Identifier](https://github.com/FamilySearch/gedcomx/blob/master/specifications/conceptual-model-specification.md#identifier-type)
which allow us to better compare two arbitrary records to determine if they come
from the same record on the same website.
```json
{
"id": "65630115",
"identifiers": {
"genscrape": [
"genscrape://findagrave/65630115"
]
}
}
```
## Supported Sites
* [Ancestry.com](http://ancestry.com), [Ancestry.ca](http://ancestry.ca), [Ancestry.co.uk](http://ancestry.co.uk), [Ancestry.com.au](http://ancestry.com.au), both historical records and tree profiles.
* [BillionGraves](http://billiongraves.com/)
* [Find A Grave](http://www.findagrave.com/)
* [findmypast.com](http://www.findmypast.com), [findmypast.co.uk](http://www.findmypast.co.uk), [findmypast.com.au](http://www.findmypast.com.au), [findmypast.ie](http://www.findmypast.ie), both historical records and tree profiles.
* [FamilySearch.org](https://familysearch.org), both historical records and tree profiles.
* [Genealogie Online](https://www.genealogieonline.nl)
* [MyHeritage](https://www.myheritage.com) historical records and tree profiles
* [Open Archives](https://www.openarch.nl)
* [WeRelate](http://www.werelate.org/)
* [WikiTree](http://www.wikitree.com/) person profiles