https://github.com/0xibra/blackfeed

Python package that allows you easily and elastically download thousands of files concurrently.
https://github.com/0xibra/blackfeed

batch-processing concurrency downloader http python

Last synced: over 1 year ago
JSON representation

Python package that allows you easily and elastically download thousands of files concurrently.

Host: GitHub
URL: https://github.com/0xibra/blackfeed
Owner: 0xIbra
Created: 2019-12-13T06:29:54.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-04-30T16:03:57.000Z (about 4 years ago)
Last Synced: 2025-01-30T23:47:39.221Z (over 1 year ago)
Topics: batch-processing, concurrency, downloader, http, python
Language: Python
Homepage: https://pypi.org/project/blackfeed/
Size: 63.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # BlackFeed

> BlackFeed is a micro python library that allows you download and upload files concurrently.

> You can download your files locally but you can also upload them to your cloud without writing them to disk.

### Packages required

> Installed automatically with **pip**

- requests

- boto3

## Install

```bash

pip install blackfeed

```

## Usage

Download and upload files to AWS S3

**For this to work, AWS CLI must be configured**

```python

from blackfeed import Downloader

from blackfeed.adapter import S3Adapter

queue = [

    {

        'url': 'https://www.example.com/path/to/image.jpg', # Required

        'destination': 'some/key/image.jpg' # S3 key - Required 

    },{

        'url': 'https://www.example.com/path/to/image2.jpg',

        'destination': 'some/key/image2.jpg' 

    }

]

downloader = Downloader(

    S3Adapter(bucket='bucketname'),

    multi=True, # If true, uploads files to images to S3 with multithreading

    stateless=False # If set to False, it generates and stores md5 hashes of files in a file

    state_id='flux_states' # name of the file where hashes will be stored (states.txt) not required

    bulksize=200 # Number of concurrent downloads

)

downloader.process(queue)

stats = downloader.get_stats() # Returns a dict with information about the process

```

### Download files with states

Loading states can be useful if you don't want to re-download the same file twice.

```python

from blackfeed import Downloader

from blackfeed.adapter import S3Adapter

queue = [

...

]

downloader = Downloader(

    S3Adapter(bucket='bucketname'),

    multi=True,

    stateless=False,

    state_id='filename'

)

# You can add a callback function if needed

# This function will be called after each bulk is processed

def callback(responses):

    # response: {

    #    'destination': destination of the file can be local or can be S3 key,

    #    'url': URL from where the file was downloaded,

    #    'httpcode': HTTP code returned by the server,

    #    'status': True|False,

    #    'content-type': Mime type of the downloaded resource Example: image/jpeg

    # }

    # responses: response[]

    pass # Your logic

downloader.set_callback(callback)

downloader.load_states('filename') # This will load states from "filename.txt"

downloader.process(queue)

stats = downloader.get_stats() # Statistics 

```

## ElasticDownloader

> Let's you to download/retrieve files from FTP, SFTP and HTTP/S servers easily.

### Examples

#### Downloading file from FTP 

```python

from blackfeed import ElasticDownloader

uri = 'ftp://user:password@ftp.server.com/path/to/file.csv'

retriever = ElasticDownloader()

res = retriever.download(uri, localpath='/tmp/myfile.csv') # localfile is optional

# .download() function returns False if there was an error or return the local path of the downloaded file if it was a success.

print(res)

```

```bash

/tmp/myfile.csv

```

### Retrieving binary content of file from FTP

```python

from blackfeed import ElasticDownloader

uri = 'ftp://user:password@ftp.server.com/path/to/file.csv'

retriever = ElasticDownloader()

res = retriever.retrieve(uri) # Return type: io.BytesIO | False

with open('/tmp/myfile.csv', 'wb') as f:

    f.write(res.getvalue())

```

**ElasticDownloader** can handle FTP, SFTP and HTTP URIs automatically.

Use the method **download** to download file locally and use the **retrieve** method to get the binary content of a file.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/0xibra/blackfeed

Awesome Lists containing this project

README