https://github.com/waikato-datamining/simple-file-poller

Simple Python library for performing file polling.

Host: GitHub
URL: https://github.com/waikato-datamining/simple-file-poller
Owner: waikato-datamining
License: mit
Created: 2020-12-01T21:58:14.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2024-09-13T04:26:44.000Z (3 months ago)
Last Synced: 2024-10-13T08:40:57.289Z (2 months ago)
Language: Python
Homepage:
Size: 49.8 KB
Stars: 0
Watchers: 4
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.rst
- License: LICENSE

README

# simple-file-poller

Simple Python 3 library for file polling.

## Installation

Install via pip:

```bash
pip install simple-file-poller
```

## Usage

The `sfp.Poller` class is used to poll for files to process.

At a minimum, the `input_dir` and `output_dir` directories must be supplied.

You can choose between two types of polling: *simple* (the default) or *[watchdog](https://github.com/gorakhargosh/watchdog)-based* (`use_watchdog = True`). The simple approach merely checks the input directory for new files every `poll_wait` seconds. The watchdog approach reacts to *FILE_CREATED* events in the input directory to trigger the listing of files, and should be used to reduce latency within a pipeline of file-processing applications. Due to potential race conditions (e.g., when files need to be processed in pairs but the second file appears slightly after the first one has already triggered the event), some files may not get listed or processed. Hence, in watchdog mode the poller also checks every `watchdog_check_interval` seconds whether any files are still present in the input directory. Watchdog mode is only available in conjunction with `continuous` mode.

By default, input files get moved to the output directory once processed. With the `delete_input` option, you can remove them instead (e.g., if it is not necessary to keep them).

Which files are included in the poll depends on their extension (when `extensions` is `None`, all files are included).
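
The extension filter can be pictured with the following sketch. `list_files` is a hypothetical helper written for illustration only; in particular, comparing extensions case-insensitively is an assumption and may differ from what the library does.

```python
import os


def list_files(input_dir, extensions=None):
    """Return the sorted file paths in input_dir, filtered by extension.

    Hypothetical helper, not part of sfp. extensions is a list such as
    [".jpg", ".png"]; None means that every file is included.
    """
    result = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        # assumption: extensions are matched in lower case
        ext = os.path.splitext(name)[1].lower()
        if extensions is None or ext in extensions:
            result.append(path)
    return result
```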

By supplying a `check_file` method (signature: `fname:str, poller:Poller`), you can ensure that only valid files get processed. For example, with the [python-image-complete](https://pypi.org/project/python-image-complete/) library you can determine whether an image is valid, i.e., fully written. See the *Custom file check* section below for an example.

If a file fails the check, it gets put on an internal blacklist. If it fails more than `blacklist_tries` times, it gets permanently excluded from processing (either moved to the output directory or deleted).
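
The counting behind such a blacklist might look like the sketch below. The `Blacklist` class and its method names are hypothetical and do not mirror the library's internals; only the "more than `blacklist_tries` failures" rule is taken from the description above.

```python
class Blacklist:
    """Tracks how often each file has failed its check (illustrative only)."""

    def __init__(self, blacklist_tries=3):
        self.blacklist_tries = blacklist_tries
        self.failures = {}

    def record_failure(self, fname):
        """Count a failed check.

        Returns True once the file has failed more than blacklist_tries
        times and should be excluded permanently.
        """
        self.failures[fname] = self.failures.get(fname, 0) + 1
        return self.failures[fname] > self.blacklist_tries

    def record_success(self, fname):
        """Forget a file once it has passed its check."""
        self.failures.pop(fname, None)
```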

Since checks can take quite a long time, you may want to limit the batch size of files queued for processing by setting `max_files` to a value greater than 0. Otherwise, the process may appear to have stopped working when a large number of files are present in the input directory and no output is being produced.

The `process_file` method (signature: `fname:str, output_dir:str, poller:Poller`) is used for performing the actual processing of a file, e.g., applying a deep learning classification model to an image to obtain a label.
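
As an illustration of that signature, the sketch below counts the lines of an input file and writes the result to the output directory. `count_lines` is a made-up example, and returning the list of files that were written is an assumption; the bundled `dummy_file_processing` function is the place to check the expected return value.

```python
import os


def count_lines(fname, output_dir, poller):
    """Write the number of lines in fname to <name>.count in output_dir.

    Illustrative process_file candidate; the poller argument is unused here.
    """
    with open(fname, "r", encoding="utf-8", errors="replace") as f:
        num_lines = sum(1 for _ in f)
    name = os.path.splitext(os.path.basename(fname))[0]
    out_path = os.path.join(output_dir, name + ".count")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("%d\n" % num_lines)
    # assumption: the caller wants to know which files were generated
    return [out_path]
```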

By specifying `tmp_dir`, all output files get generated in that directory before being automatically moved into the actual `output_dir`. This prevents other processes that monitor or poll the output directory from springing into action before the files have been fully written.
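
The general write-then-move pattern behind `tmp_dir` can be sketched with the standard library as follows. `write_via_tmp` is a hypothetical helper, not the library's code. Note that the move is only a cheap rename when both directories live on the same file system; across file systems `shutil.move` falls back to copying.

```python
import os
import shutil


def write_via_tmp(data, name, tmp_dir, output_dir):
    """Write data to tmp_dir first, then move the finished file to output_dir.

    Hypothetical helper for illustration. Consumers watching output_dir
    only ever see the file after it has been written completely.
    """
    tmp_path = os.path.join(tmp_dir, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
    out_path = os.path.join(output_dir, name)
    shutil.move(tmp_path, out_path)
    return out_path
```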

The input directory may contain more than one file per ID (with differing file extensions). If these additional files should get moved to the output directory as well, this can be achieved with the `other_input_files` [glob](https://docs.python.org/3/library/glob.html) definition. The `{NAME}` placeholder, representing the current file being processed (without its extension), can be used in that expression. For example, when processing all `.jpg` files with the `process_file` method and the matching `.txt` and `.xml` files should get moved as well, use `["{NAME}.txt", "{NAME}.xml"]` for `other_input_files`. If you want to delete these files instead of moving them, set `delete_other_input_files` to `True`.
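
How the `{NAME}` placeholder could be expanded into concrete file names is sketched below. `find_other_input_files` is a hypothetical helper; resolving the patterns relative to the directory of the current file is an assumption made for this example.

```python
import glob
import os


def find_other_input_files(fname, other_input_files):
    """Expand {NAME} in each glob pattern and return the matching files.

    Hypothetical helper, not part of sfp. Patterns are resolved relative
    to the directory that contains fname (an assumption).
    """
    base_dir = os.path.dirname(fname)
    name = os.path.splitext(os.path.basename(fname))[0]
    result = []
    for pattern in other_input_files:
        # escape the literal parts so that only the pattern itself is globbed
        expanded = pattern.replace("{NAME}", glob.escape(name))
        full = os.path.join(glob.escape(base_dir), expanded)
        result.extend(sorted(glob.glob(full)))
    return result
```

With `a.jpg`, `a.txt`, `a.xml` and `b.txt` in the same directory, calling this for `a.jpg` with the two patterns from the paragraph above yields only `a.txt` and `a.xml`.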

The `params` object can be used to attach parameters that are used by the `check_file` and `process_file` methods. By setting attributes in this object, you avoid the danger of accidentally replacing attributes that were introduced in the Poller class at a later stage.

## Custom file check

The following example looks for JPG and PNG files in `/home/fracpete/poll/in/` and writes dummy output files to the temp directory `/home/fracpete/poll/tmp/` before moving them to `/home/fracpete/poll/out/`. A maximum of 3 files is processed at a time. It uses a custom check method to ensure that the images have been completely written to disk before attempting to process them. Though PNG and JPG images get processed, only JPG images get checked (the extensions to skip are listed in `poller.params.dont_check_ext`).

```python
import os

from sfp import Poller, Parameters, dummy_file_processing
from image_complete.auto import is_image_complete


def image_complete(fname, poller):
    # extensions listed in params.dont_check_ext are accepted without a check
    ext = os.path.splitext(fname)[1]
    if ext in poller.params.dont_check_ext:
        poller.debug("Not checking extension: %s" % ext)
        result = True
    else:
        result = is_image_complete(fname)
    poller.debug("Image complete:", fname, "->", result)
    return result


# custom parameters that are available to the check method via poller.params
params = Parameters()
params.dont_check_ext = [".png"]

p = Poller(
    input_dir="/home/fracpete/poll/in/",
    output_dir="/home/fracpete/poll/out/",
    tmp_dir="/home/fracpete/poll/tmp/",
    continuous=True,
    max_files=3,
    check_file=image_complete,
    process_file=dummy_file_processing,
    extensions=[".jpg", ".png"],
    params=params)
p.poll()
print("Stopped?", p.is_stopped())
```

**NB:** Install the [python-image-complete](https://github.com/waikato-datamining/python-image-complete) library with:

```bash
pip install python-image-complete
```

## Custom logging

By supplying a method to the `logging` option, you can customize the logging that occurs via the `info`, `debug` and `error` method calls of the Poller. The example below uses the Python logging framework.

```python
import logging

from sfp import Poller, LOGGING_TYPE_INFO, LOGGING_TYPE_DEBUG, LOGGING_TYPE_ERROR

_logger = None


def custom_logging(type, *args):
    # the comparisons below rely on the logging type being passed as the first argument
    global _logger
    # set up the logger lazily on first use
    if _logger is None:
        logging.basicConfig()
        _logger = logging.getLogger("sfp")
        _logger.setLevel(logging.DEBUG)
    msg = " ".join(str(x) for x in args)
    if type == LOGGING_TYPE_ERROR:
        _logger.error(msg)
    elif type == LOGGING_TYPE_DEBUG:
        _logger.debug(msg)
    else:
        # LOGGING_TYPE_INFO and anything else
        _logger.info(msg)


p = Poller()
# ... setting more options
p.logging = custom_logging
p.output_timestamp = False  # the Python logging framework should handle that instead
p.poll()
```