Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/philippta/flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
https://github.com/philippta/flyscrape

Last synced: 1 day ago
JSON representation

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.

Awesome Lists containing this project

README

        






Flyscrape is a command-line web scraping tool designed for those without
advanced programming skills, enabling precise extraction of website data.



Installation · Documentation · Releases

## Demo



## Features

- **Standalone:** Flyscrape comes as a single binary executable.
- **jQuery-like:** Extract data from HTML pages with a familiar API.
- **Scriptable:** Use JavaScript to write your data extraction logic.
- **System Cookies:** Give Flyscrape access to your browsers cookie store.
- **Browser Mode:** Render JavaScript heavy pages using a headless Browser.
- **Nested Scraping:** Extract data from linked pages within a single scrape.

## Overview

- [Example](#example)
- [Installation](#installation)
- [Recommended](#recommended)
- [Homebrew](#homebrew)
- [Pre-compiled binary](#pre-compiled-binary)
- [Compile from source](#compile-from-source)
- [Usage](#usage)
- [Configuration](#configuration)
- [Query API](#query-api)
- [Flyscrape API](#flyscrape-api)
- [Document Parsing](#document-parsing)
- [File Downloads](#file-downloads)
- [Issues and suggestions](#issues-and-suggestions)

## Example

This example scrapes the first few pages form Hacker News, specifically the New, Show and Ask sections.

```javascript
export const config = {
urls: [
"https://news.ycombinator.com/new",
"https://news.ycombinator.com/show",
"https://news.ycombinator.com/ask",
],

// Cache request for later.
cache: "file",

// Enable JavaScript rendering.
browser: true,
headless: false,

// Follow pagination 5 times.
depth: 5,
follow: ["a.morelink[href]"],
}

export default function ({ doc, absoluteURL }) {
const title = doc.find("title");
const posts = doc.find(".athing");

return {
title: title.text(),
posts: posts.map((post) => {
const link = post.find(".titleline > a");

return {
title: link.text(),
url: link.attr("href"),
};
}),
}
}
```

```bash
$ flyscrape run hackernews.js
[
{
"url": "https://news.ycombinator.com/new",
"data": {
"title": "New Links | Hacker News",
"posts": [
{
"title": "Show HN: flyscrape - An standalone and scriptable web scraper",
"url": "https://flyscrape.com/"
},
...
]
}
}
]
```

Check out the [examples folder](examples) for more detailed examples.

## Installation

### Recommended

The easiest way to install `flyscrape` is via its install script.

```bash
curl -fsSL https://flyscrape.com/install | bash
```

### Homebrew

For macOS users `flyscrape` is also available via homebrew:

```bash
brew install flyscrape
```

### Pre-compiled binary

`flyscrape` is available for MacOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://go.dev/](https://go.dev/).

2. Install flyscrape: Open a terminal and run the following command:

```bash
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
```

## Usage

```
Usage:

flyscrape run SCRIPT [config flags]

Examples:

# Run the script.
$ flyscrape run example.js

# Set the URL as argument.
$ flyscrape run example.js --url "http://other.com"

# Enable proxy support.
$ flyscrape run example.js --proxies "http://someproxy:8043"

# Follow paginated links.
$ flyscrape run example.js --depth 5 --follow ".next-button > a"

# Set the output format to ndjson.
$ flyscrape run example.js --output.format ndjson

# Write the output to a file.
$ flyscrape run example.js --output.file results.json
```

## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For a full documentation of all configuration options, visit the [documentation page](https://flyscrape.com/docs/getting-started/).

```javascript
export const config = {
// Specify the URL to start scraping from.
url: "https://example.com/",

// Specify the multiple URLs to start scraping from. (default = [])
urls: [
"https://anothersite.com/",
"https://yetanother.com/",
],

// Enable rendering with headless browser. (default = false)
browser: true,

// Specify if browser should be headless or not. (default = true)
headless: false,

// Specify how deep links should be followed. (default = 0, no follow)
depth: 5,

// Specify the css selectors to follow. (default = ["a[href]"])
// Setting follow to [] disables automatic following.
// Can later be used with manual following.
follow: [".next > a", ".related a"],

// Specify the allowed domains. ['*'] for all. (default = domain from url)
allowedDomains: ["example.com", "anothersite.com"],

// Specify the blocked domains. (default = none)
blockedDomains: ["somesite.com"],

// Specify the allowed URLs as regex. (default = all allowed)
allowedURLs: ["/posts", "/articles/\d+"],

// Specify the blocked URLs as regex. (default = none)
blockedURLs: ["/admin"],

// Specify the rate in requests per minute. (default = no rate limit)
rate: 60,

// Specify the number of concurrent requests. (default = no limit)
concurrency: 1,

// Specify a single HTTP(S) proxy URL. (default = no proxy)
// Note: Not compatible with browser mode.
proxy: "http://someproxy.com:8043",

// Specify multiple HTTP(S) proxy URLs. (default = no proxy)
// Note: Not compatible with browser mode.
proxies: [
"http://someproxy.com:8043",
"http://someotherproxy.com:8043",
],

// Enable file-based request caching. (default = no cache)
cache: "file",

// Specify the HTTP request header. (default = none)
headers: {
"Authorization": "Bearer ...",
"User-Agent": "Mozilla ...",
},

// Use the cookie store of your local browser. (default = off)
// Options: "chrome" | "edge" | "firefox"
cookies: "chrome",

// Specify the output options.
output: {
// Specify the output file. (default = stdout)
file: "results.json",

// Specify the output format. (default = json)
// Options: "json" | "ndjson"
format: "json",
},
};

export default function ({ doc, url, absoluteURL, scrape, follow }) {
// doc
// Contains the parsed HTML document.

// url
// Contains the scraped URL.

// absoluteURL("/foo")
// Transforms a relative URL into absolute URL.

// scrape(url, function({ doc, url, absoluteURL, scrape }) {
// return { ... };
// })
// Scrapes a linked page and returns the scrape result.

// follow("/foo")
// Follows a link manually.
// Disable automatic following with `follow: []` for best results.
}
```

## Query API

```javascript
//

Hey

const el = doc.find(".element")
el.text() // "Hey"
el.html() // `
Hey
`
el.name() // div
el.attr("foo") // "bar"
el.hasAttr("foo") // true
el.hasClass("element") // true

//


    //
  • Item 1

  • //
  • Item 2

  • //
  • Item 3

  • //

const list = doc.find("ul")
list.children() // [
  • Item 1
  • ,
  • Item 2
  • ,
  • Item 3
  • ]

    const items = list.find("li")
    items.length() // 3
    items.first() //

  • Item 1

  • items.last() //
  • Item 3

  • items.get(1) //
  • Item 2

  • items.get(1).prev() //
  • Item 1

  • items.get(1).next() //
  • Item 3

  • items.get(1).parent() //
      ...

    items.get(1).siblings() // [
  • Item 1
  • ,
  • Item 2
  • ,
  • Item 3
  • ]
    items.map(item => item.text()) // ["Item 1", "Item 2", "Item 3"]
    items.filter(item => item.hasClass("a")) // [
  • Item 1
  • ]

    //


    //

    Aleph


    //

    Aleph


    //

    Beta


    //

    Beta


    //

    Gamma


    //

    Gamma


    //

    const header = doc.find("div h2")

    header.get(1).prev() //

    Aleph


    header.get(1).prevAll() // [

    Aleph

    ,

    Aleph

    ]
    header.get(1).prevUntil('div,h1,h2,h3') //

    Aleph


    header.get(1).next() //

    Beta


    header.get(1).nextAll() // [

    Beta

    ,

    Gamma

    ,

    Gamma

    ]
    header.get(1).nextUntil('div,h1,h2,h3') //

    Beta


    ```

    ## Flyscrape API

    ### Document Parsing

    ```javascript
    import { parse } from "flyscrape";

    const doc = parse(`

    bar
    `);
    const text = doc.find(".foo").text();
    ```

    ### File Downloads

    ```javascript
    import { download } from "flyscrape/http";

    download("http://example.com/image.jpg") // downloads as "image.jpg"
    download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
    download("http://example.com/image.jpg", "dir/") // downloads as "dir/image.jpg"

    // If the server offers a filename via the Content-Disposition header and no
    // destination filename is provided, Flyscrape will honor the suggested filename.
    // E.g. `Content-Disposition: attachment; filename="archive.zip"`
    download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
    ```

    ## Issues and Suggestions

    If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).