https://github.com/philippta/flyscrape

Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.
Last synced: 2 months ago
Host: GitHub
URL: https://github.com/philippta/flyscrape
Owner: philippta
License: mpl-2.0
Created: 2023-08-28T10:39:28.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2025-04-10T10:53:44.000Z (3 months ago)
Last Synced: 2025-04-13T15:08:03.887Z (3 months ago)
Language: Go
Homepage: https://flyscrape.com
Size: 14.7 MB
Stars: 1,296
Watchers: 10
Forks: 38
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

        




  

  

  








Flyscrape is a command-line web scraping tool designed for those without 
advanced programming skills, enabling precise extraction of website data. 








Installation · Documentation · Releases




## Features

- **Standalone:** Flyscrape comes as a single binary executable.

- **jQuery-like:** Extract data from HTML pages with a familiar API.

- **Scriptable:** Use JavaScript to write your data extraction logic.

- **System Cookies:** Give Flyscrape access to your browser's cookie store.

- **Browser Mode:** Render JavaScript-heavy pages using a headless browser.

- **Nested Scraping:** Extract data from linked pages within a single scrape.

## Overview

- [Example](#example)

- [Installation](#installation)

    - [Recommended](#recommended)

    - [Homebrew](#homebrew)

    - [Pre-compiled binary](#pre-compiled-binary)

    - [Compile from source](#compile-from-source)

- [Usage](#usage)

- [Configuration](#configuration)

- [Query API](#query-api)

- [Flyscrape API](#flyscrape-api)

    - [Document Parsing](#document-parsing)

    - [File Downloads](#file-downloads)

- [Issues and suggestions](#issues-and-suggestions)

## Example

This example scrapes the first few pages from Hacker News, specifically the New, Show and Ask sections.

```javascript
export const config = {
    urls: [
        "https://news.ycombinator.com/new",
        "https://news.ycombinator.com/show",
        "https://news.ycombinator.com/ask",
    ],

    // Cache request for later.
    cache: "file",

    // Enable JavaScript rendering.
    browser: true,
    headless: false,

    // Follow pagination 5 times.
    depth: 5,
    follow: ["a.morelink[href]"],
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: link.attr("href"),
            };
        }),
    }
}
```

```bash
$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/new",
    "data": {
      "title": "New Links | Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]
```

Check out the [examples folder](examples) for more detailed examples.
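Because the result is plain JSON, it can be post-processed with a few lines of JavaScript. The following is a minimal sketch, assuming the `{ url, data }` output shape shown above; the `results` sample and the `flattenPosts` helper are hypothetical and not part of Flyscrape.

```javascript
// Hypothetical sample mirroring the output shape shown above.
const results = [
    {
        url: "https://news.ycombinator.com/new",
        data: {
            title: "New Links | Hacker News",
            posts: [
                { title: "Show HN: flyscrape", url: "https://flyscrape.com/" },
            ],
        },
    },
];

// Flatten all scraped pages into one list of posts,
// remembering which page each post came from.
function flattenPosts(pages) {
    return pages.flatMap((page) =>
        page.data.posts.map((post) => ({ ...post, page: page.url }))
    );
}

console.log(flattenPosts(results));
```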

## Installation

### Recommended

The easiest way to install `flyscrape` is via its install script.

```bash
curl -fsSL https://flyscrape.com/install | bash
```

### Homebrew

For macOS users, `flyscrape` is also available via Homebrew:

```bash
brew install flyscrape
```

### Pre-compiled binary

`flyscrape` is available for macOS, Linux and Windows as a downloadable binary from the [releases page](https://github.com/philippta/flyscrape/releases).

### Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from [https://go.dev/](https://go.dev/).

2. Install flyscrape: Open a terminal and run the following command:

   ```bash
   go install github.com/philippta/flyscrape/cmd/flyscrape@latest
   ```

## Usage

```
Usage:

    flyscrape run SCRIPT [config flags]

Examples:

    # Run the script.
    $ flyscrape run example.js

    # Set the URL as argument.
    $ flyscrape run example.js --url "http://other.com"

    # Enable proxy support.
    $ flyscrape run example.js --proxies "http://someproxy:8043"

    # Follow paginated links.
    $ flyscrape run example.js --depth 5 --follow ".next-button > a"

    # Set the output format to ndjson.
    $ flyscrape run example.js --output.format ndjson

    # Write the output to a file.
    $ flyscrape run example.js --output.file results.json
```
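The two output formats differ only in framing: JSON is a single array, while NDJSON holds one JSON document per line, which is easier to stream or append to. The sketch below illustrates the difference with hypothetical records in the `{ url, data }` shape from the example above; the exact whitespace Flyscrape emits is an assumption.

```javascript
// Hypothetical records in the { url, data } shape used by the example output.
const records = [
    { url: "https://example.com/a", data: { title: "A" } },
    { url: "https://example.com/b", data: { title: "B" } },
];

// "json": one array containing every record.
const asJSON = JSON.stringify(records, null, 2);

// "ndjson": one compact JSON document per line.
const asNDJSON = records.map((record) => JSON.stringify(record)).join("\n");

// NDJSON can be read back line by line without parsing one large array.
const parsed = asNDJSON.split("\n").map((line) => JSON.parse(line));

console.log(asJSON);
console.log(asNDJSON);
```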

## Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the [documentation page](https://flyscrape.com/docs/getting-started/).

```javascript
export const config = {
    // Specify the URL to start scraping from.
    url: "https://example.com/",

    // Specify multiple URLs to start scraping from.       (default = [])
    urls: [
        "https://anothersite.com/",
        "https://yetanother.com/",
    ],

    // Enable rendering with headless browser.             (default = false)
    browser: true,

    // Specify if browser should be headless or not.       (default = true)
    headless: false,

    // Specify how deep links should be followed.          (default = 0, no follow)
    depth: 5,

    // Specify the css selectors to follow.                (default = ["a[href]"])
    // Setting follow to [] disables automatic following.
    // Can later be used with manual following.
    follow: [".next > a", ".related a"],

    // Specify the allowed domains. ['*'] for all.         (default = domain from url)
    allowedDomains: ["example.com", "anothersite.com"],

    // Specify the blocked domains.                        (default = none)
    blockedDomains: ["somesite.com"],

    // Specify the allowed URLs as regex.                  (default = all allowed)
    // Backslashes must be doubled inside string literals.
    allowedURLs: ["/posts", "/articles/\\d+"],

    // Specify the blocked URLs as regex.                  (default = none)
    blockedURLs: ["/admin"],

    // Specify the rate in requests per minute.            (default = no rate limit)
    rate: 60,

    // Specify the number of concurrent requests.          (default = no limit)
    concurrency: 1,

    // Specify a single HTTP(S) proxy URL.                 (default = no proxy)
    // Note: Not compatible with browser mode.
    proxy: "http://someproxy.com:8043",

    // Specify multiple HTTP(S) proxy URLs.                (default = no proxy)
    // Note: Not compatible with browser mode.
    proxies: [
      "http://someproxy.com:8043",
      "http://someotherproxy.com:8043",
    ],

    // Enable file-based request caching.                  (default = no cache)
    cache: "file",

    // Specify the HTTP request header.                    (default = none)
    headers: {
        "Authorization": "Bearer ...",
        "User-Agent": "Mozilla ...",
    },

    // Use the cookie store of your local browser.         (default = off)
    // Options: "chrome" | "edge" | "firefox"
    cookies: "chrome",

    // Specify the output options.
    output: {
        // Specify the output file.                        (default = stdout)
        file: "results.json",

        // Specify the output format.                      (default = json)
        // Options: "json" | "ndjson"
        format: "json",
    },
};

export default function ({ doc, url, absoluteURL, scrape, follow }) {
    // doc
    // Contains the parsed HTML document.

    // url
    // Contains the scraped URL.

    // absoluteURL("/foo")
    // Transforms a relative URL into absolute URL.

    // scrape(url, function({ doc, url, absoluteURL, scrape }) {
    //     return { ... };
    // })
    // Scrapes a linked page and returns the scrape result.

    // follow("/foo")
    // Follows a link manually.
    // Disable automatic following with `follow: []` for best results.
}
```
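The `allowedURLs` and `blockedURLs` options take regular expressions written as JavaScript strings, so escape sequences deserve some care: in a string literal, `\d` collapses to a plain `d` unless the backslash is doubled. The sketch below demonstrates this and shows one plausible way such patterns could be applied. The `isAllowed` helper is hypothetical and only approximates the described behavior; Flyscrape's actual matching rules may differ.

```javascript
// In a string literal, "\d" is just "d", so the backslash must be doubled.
const wrong = "/articles/\d+";
const right = "/articles/\\d+";

console.log(wrong); // prints /articles/d+
console.log(right); // prints /articles/\d+

// Hypothetical filter: a URL passes if it matches any allowed pattern
// (or no allowed patterns are set) and matches no blocked pattern.
function isAllowed(url, allowedURLs, blockedURLs) {
    const matches = (patterns) =>
        patterns.some((pattern) => new RegExp(pattern).test(url));
    const allowed = allowedURLs.length === 0 || matches(allowedURLs);
    return allowed && !matches(blockedURLs);
}

console.log(isAllowed("https://example.com/articles/42", [right], ["/admin"])); // true
console.log(isAllowed("https://example.com/articles/42", [wrong], ["/admin"])); // false
```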

## Query API

```javascript
// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.name()                                 // div
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li class="a">Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li class="a">Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]

// <div>
//   <h2>Aleph</h2>
//   <p>Aleph</p>
//   <h2>Beta</h2>
//   <p>Beta</p>
//   <h2>Gamma</h2>
//   <p>Gamma</p>
// </div>
const header = doc.find("div h2")

header.get(1).prev()                     // <p>Aleph</p>
header.get(1).prevAll()                  // [<p>Aleph</p>, <h2>Aleph</h2>]
header.get(1).prevUntil('div,h1,h2,h3')  // <p>Aleph</p>
header.get(1).next()                     // <p>Beta</p>
header.get(1).nextAll()                  // [<p>Beta</p>, <h2>Gamma</h2>, <p>Gamma</p>]
header.get(1).nextUntil('div,h1,h2,h3')  // <p>Beta</p>
```
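These query methods compose naturally inside a scraping function. The following is a minimal sketch of a handler that filters list items, resolves their links, and uses `scrape` to pull a detail from each linked page. The page structure, selectors, and the name `scrapeListing` are hypothetical; in a Flyscrape script this function would be the default export.

```javascript
// Hypothetical handler for a listing page whose items look like
// <li class="a"><a href="/one">One</a></li>.
function scrapeListing({ doc, absoluteURL, scrape }) {
    const items = doc.find("li");

    return {
        heading: doc.find("h1").text(),
        // Keep only items with class "a", then visit each linked page.
        featured: items
            .filter((item) => item.hasClass("a"))
            .map((item) => {
                const link = item.find("a");
                const url = absoluteURL(link.attr("href"));

                return {
                    title: link.text(),
                    url,
                    // scrape() loads the linked page and returns
                    // whatever the callback returns.
                    details: scrape(url, ({ doc }) => ({
                        summary: doc.find("p").first().text(),
                    })),
                };
            }),
    };
}
```

Nesting `scrape` calls this way keeps the data from a listing and its detail pages together in one result object instead of spreading it across separate output records.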

## Flyscrape API

### Document Parsing

```javascript
import { parse } from "flyscrape";

const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();
```

### File Downloads

```javascript
import { download } from "flyscrape/http";

download("http://example.com/image.jpg")              // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
download("http://example.com/image.jpg", "dir/")      // downloads as "dir/image.jpg"

// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, Flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
```
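The naming rules above can be summarized as a small decision function. The sketch below is a hypothetical re-implementation for illustration only, not Flyscrape's actual code; `resolveDestination` and its `suggestedName` parameter (standing in for a filename taken from the `Content-Disposition` header) are assumptions.

```javascript
// Hypothetical summary of the naming rules described above.
function resolveDestination(url, dest = "", suggestedName = "") {
    // Last path segment of the URL, e.g. "image.jpg".
    const urlName = new URL(url).pathname.split("/").pop();
    // A server-suggested name takes precedence over the URL name.
    const fallback = suggestedName || urlName;

    if (dest === "") return fallback;               // no destination given
    if (dest.endsWith("/")) return dest + fallback; // destination is a directory
    return dest;                                    // explicit file name
}

console.log(resolveDestination("http://example.com/image.jpg"));              // image.jpg
console.log(resolveDestination("http://example.com/image.jpg", "other.jpg")); // other.jpg
console.log(resolveDestination("http://example.com/image.jpg", "dir/"));      // dir/image.jpg
console.log(resolveDestination("http://example.com/generate_archive.php", "dir/", "archive.zip")); // dir/archive.zip
```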

## Issues and Suggestions

If you encounter any issues or have suggestions for improvement, please [submit an issue](https://github.com/philippta/flyscrape/issues).