https://github.com/wintermi/get-linked-data
A command-line application that crawls a given set of URLs, scrapes the JSON Linked Data (JSON-LD) embedded in each page, and writes the extracted entries to a CSV file.
colly json-ld scraper
- Host: GitHub
- URL: https://github.com/wintermi/get-linked-data
- Owner: wintermi
- License: apache-2.0
- Created: 2023-12-04T13:52:49.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-13T12:41:24.000Z (over 1 year ago)
- Last Synced: 2025-02-16T04:36:00.549Z (over 1 year ago)
- Topics: colly, json-ld, scraper
- Language: Go
- Homepage:
- Size: 125 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Get Linked Data
## Description
A command-line application that crawls a given set of URLs, scrapes the JSON Linked Data (JSON-LD) embedded in each page, and writes the extracted entries to a CSV file.
```
USAGE:
    get-linked-data -i URL_CSV -s ELEMENT_SELECTOR -o OUTPUT_CSV -e FAILED_URL_CSV

ARGS:
  -d string
        Field Delimiter (Required) (default ",")
  -e string
        Failed Request URLs Output CSV File (Required)
  -g    Scrape Google's Cached Version Instead
  -i string
        CSV File containing URLs to Scrape (Required)
  -j string
        jq Selector
  -o string
        Output Scraped Data CSV File (Required)
  -p int
        Parallelism or Maximum allowed Concurrent Requests (default 100)
  -s string
        Element Selector (Required)
  -v    Output Verbose Detail
  -w int
        Random Wait Time in Milliseconds between Requests (default 2000)
  -x    Scrape XML not HTML
```
## Example
```
get-linked-data -i "urls.csv" -s "script#product-schema" -o "results.csv" -e "failed.csv"
```
## License
**get-linked-data** is released under the [Apache License 2.0](https://github.com/wintermi/get-linked-data/blob/main/LICENSE) unless a file header explicitly states otherwise.