Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cattoface/recursivefilescrape
Python script to download all files in a webpage in a recursive way.
- Host: GitHub
- URL: https://github.com/cattoface/recursivefilescrape
- Owner: CattoFace
- License: mit
- Created: 2022-07-27T21:29:13.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-10-27T19:30:52.000Z (about 2 years ago)
- Last Synced: 2023-06-01T09:35:12.494Z (over 1 year ago)
- Language: Python
- Size: 38.7 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Recursive File Scraper
A Python script that recursively downloads files from a webpage and from the pages it links to, usable from the command line or by importing the script.
Single-page downloading, page component filtering, and other configuration options are available.

## Setup
**Source:**
Python 3 is required to run the script.
Clone the repository, enter the directory and run the following line to install the script's dependencies:
```bash
pip install -r requirements.txt
```

**Binary:**
If a binary has been precompiled for your platform, it will be available in the releases section and no further steps are required (the most recent binaries are also available inside the bin folder).
Binaries are generated using Nuitka.
## Usage
**Command:**
Run the relevant file with any additional flags:
```bash
./recursivescrape[.py/.exe/Linux64] [flags]
```
or run the script directly with Python:
```bash
python ./recursivescrape.py [flags]
```

The available flags are:
|Flag|Description|Default|
|---|---|---|
|-h, --help|Show the help page of the program and all available flags||
|-u, --url|URL to start from. **REQUIRED**||
|-p, --download-path|Directory to download files to. Will use the current directory by default.||
|-c, --cookies|Cookie values, in JSON format. Example: {\"session\":\"12kmjyu72yberuykd57\"}|{}|
|--id|Component ID that contains the files and links to follow. By default the whole page is checked.||
|-o, --overwrite|Download and overwrite existing files. If not added, files that already exist will be skipped.|False|
|-r, --resume|Resume previous progress from PROGRESS_FILE; ignores the url and no-recursion arguments if the file is found.|False|
|-bi, --backup-interval|Save the current progress every BACKUP_INTERVAL pages; 0 disables automatic backups.|0|
|-f, --progress-file|The file to save and load progress with, relative to the download path.|progress.dat|
|-l, --dont-prevent-loops|Save memory by not remembering visited pages, at the cost of possibly checking pages multiple times; do not use this flag if the directory contains any loops. Changing this flag between resumed runs results in undefined behaviour.|False|
|-nr, --no-recursion|Only download files from the given URL and do not follow links recursively.|False|
|--concurrent|Amount of pages and files to download concurrently at most|10|
|-v, --verbose|Increase output detail. Use -vv for even more detail.||
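
For example, an initial run and a later resumed run might look like the following; the URL and download path are placeholders, and the cookie value is the one from the table above:
```bash
# First run: start from a placeholder URL, download into ./downloads,
# pass a session cookie, and back up progress every 50 pages.
python ./recursivescrape.py -u https://example.com/files/ \
    -p ./downloads -c '{"session":"12kmjyu72yberuykd57"}' -bi 50

# Resume an interrupted run from the saved progress file.
python ./recursivescrape.py -u https://example.com/files/ -r
```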
**Code:**
Place the script in the same folder as your code (or anywhere on your Python import path) and import it:
```python
import recursivescrape
```
Call the scrape function with the same options that are available as command-line flags; only root_url is strictly required:
```python
recursivescrape.scrape(
root_url: str,
download_path: str = None,
cookies: dict = {},
id: str = "",
overwrite: bool = False,
resume: bool = False,
progress_file: str = "progress.dat",
dont_prevent_loops: bool = True,
no_recursion: bool = False,
backup_interval: int = 0,
verbosity: int = 0,
concurrent: int = 10)
```
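
For instance, a minimal call mirroring the command-line example above (the URL and download path are placeholders):
```python
import recursivescrape

# Start scraping from a placeholder URL, saving into ./downloads,
# with a session cookie and a progress backup every 50 pages.
recursivescrape.scrape(
    "https://example.com/files/",
    download_path="./downloads",
    cookies={"session": "12kmjyu72yberuykd57"},
    backup_interval=50,
    verbosity=1,
)
```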
## Build Binary From Source
Run the relevant script from the bin folder:
```bash
./generateLinuxBin.sh
.\generateWindowsBin.bat
```
The script will create a venv, install all the needed packages into it, run the compile command and save the binary in the current folder.
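For reference, the compile step is a Nuitka invocation roughly along these lines; this is only a sketch, and the exact flags are defined in the generate scripts themselves:
```bash
# Approximate shape of the compile step (see the generate scripts
# in the bin folder for the actual flags used by the project).
python -m nuitka --onefile recursivescrape.py
```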
The compilation will include a few small downloads, depending on the platform.

After compilation, run the relevant clean script to remove the unneeded files:
```bash
./cleanLinuxBuildFiles.sh
.\cleanWindowsBuildFiles.bat
```

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to test changes before sending a request.
## License
[MIT](https://choosealicense.com/licenses/mit/)