https://github.com/ruarxive/filegetter
A command-line tool to collect files from public data sources using URL patterns and config files
- Host: GitHub
- URL: https://github.com/ruarxive/filegetter
- Owner: ruarxive
- License: mit
- Created: 2022-10-05T13:48:58.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-10-05T14:16:46.000Z (about 2 years ago)
- Last Synced: 2024-08-02T16:32:12.869Z (3 months ago)
- Topics: archival, digitalpreservation
- Language: Python
- Homepage:
- Size: 143 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-digital-preservation - filegetter - A command-line tool to collect files from public data sources using URL patterns and config files (Other digital objects / Online storage)
README
============================================================================
filegetter -- a command-line tool to collect files from public data sources
============================================================================

filegetter is a command-line tool that helps download large numbers of files from URLs listed in YAML configuration files.

.. contents::
.. section-numbering::
History
=======
This tool was developed to automate file collection from datasets created by other tools.
Several examples in the ``examples`` directory show its usage in practice.

Main features
=============

* Any list of URLs supported: CSV, JSON lines or plain text
* URL prefixes supported
* Saves results to the filesystem or a ZIP container
* Stores the report as a CSV file

Installation
============

.. code-block:: bash

    # Make sure we have an up-to-date version of pip and setuptools:
    $ pip install --upgrade pip setuptools

    $ pip install --upgrade filegetter

(If ``pip`` installation fails for some reason, you can try
``easy_install filegetter`` as a fallback.)

Python version
--------------

Python version 3.6 or greater is required.

Quickstart
==========

This example archives the files of the Russian federal draft budget law for 2023-2025.

.. code-block:: bash

    $ mkdir budget2023
    $ cd budget2023

Create the file ``filegetter.cfg`` with the following content:

.. code-block:: ini

    [project]
    name = budget2023
    description = Budget of RF 2023 documents
    source = dataset.csv
    source_type = csv
    delimiter = ,

    [data]
    data_key = href

    [files]
    fetch_mode = prefix
    root_url = https://sozd.duma.gov.ru
    keys = href
    storage_mode = filepath
    transfer_ext = True

    [storage]
    storage_type = zip
    compression = True

Run the ``run`` command to collect the data. The result is stored in ``storage.zip``.

.. code-block:: bash

    $ filegetter run

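Before starting a long download, it can help to preview which URLs such a configuration would point at. The sketch below is not part of filegetter. It assumes that ``filegetter.cfg`` parses as an INI file (as its layout suggests), that ``dataset.csv`` has a header row with an ``href`` column of site-relative paths, and that prefix mode joins ``root_url`` with each value. The helper name ``preview_urls`` and the sample row are hypothetical.

.. code-block:: python

    import configparser
    import csv
    import io
    from urllib.parse import urljoin

    # Same settings as the quickstart filegetter.cfg, trimmed to what is used here.
    CONFIG_TEXT = """
    [project]
    source = dataset.csv
    source_type = csv
    delimiter = ,

    [data]
    data_key = href

    [files]
    fetch_mode = prefix
    root_url = https://sozd.duma.gov.ru
    keys = href
    """

    # Hypothetical dataset content; a real dataset.csv would come from another tool.
    SAMPLE_CSV = "title,href\nExample document,/download/example-id\n"


    def preview_urls(config_text, csv_text):
        """Return absolute URLs, assuming prefix mode joins root_url and the key value."""
        config = configparser.ConfigParser()
        config.read_string(config_text)
        delimiter = config.get("project", "delimiter", fallback=",")
        root_url = config.get("files", "root_url")
        key = config.get("data", "data_key")
        reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
        # Skip rows where the key column is missing or empty.
        return [urljoin(root_url, row[key]) for row in reader if row.get(key)]


    if __name__ == "__main__":
        for url in preview_urls(CONFIG_TEXT, SAMPLE_CSV):
            print(url)

To check a real project, the two strings can be replaced with the contents of the actual ``filegetter.cfg`` and ``dataset.csv`` files.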
Usage
=====

Synopsis:

.. code-block:: bash

    $ filegetter [flags] [command] inputfile

See also ``filegetter --help``.

Config options
==============

project
-------

* name - short name of the project
* description - text that explains the purpose of the project
* source - source data file, full or relative path
* source_type - type of the data source: csv, jsonl or list
* delimiter - delimiter character, by default a comma ','

data
----

* data_key - key with URLs or URL parts

files
-----

* fetch_mode - file fetch mode. Could be 'prefix' or 'id'. Prefix
* root_url - root URL / prefix for files
* keys - list of keys with URLs or file IDs to search for files to save
* storage_mode - how files are stored in storage/files.zip. By default 'filepath': files are stored under the same path as in the URL
* default_ext - default extension, for example jpg or csv
* transfer_ext - adds an extension to files that have none

storage
-------

* storage_type - type of local storage. 'zip', a local ZIP file, is the default
* compression - if True, a compressed ZIP file is used: less space, but more CPU time to process the data
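
The ``storage_mode``, ``default_ext`` and ``compression`` options can be illustrated with Python's standard library. This is a minimal sketch and not filegetter's own code. It assumes that 'filepath' mode mirrors the URL path inside the archive, that ``default_ext`` is appended only when a name has no extension, and that ``compression = True`` corresponds to Deflate while ``False`` means uncompressed storage. The helper names ``archive_path`` and ``build_zip`` are hypothetical.

.. code-block:: python

    import io
    import posixpath
    import zipfile
    from urllib.parse import urlparse


    def archive_path(url, default_ext=None):
        """Map a URL to a path inside the archive, mirroring the URL path."""
        path = urlparse(url).path.lstrip("/")
        # Assumption: default_ext is appended only when the name has no extension.
        if default_ext and not posixpath.splitext(path)[1]:
            path = f"{path}.{default_ext}"
        return path


    def build_zip(files, compression=True):
        """Pack {archive_path: bytes} into an in-memory ZIP and return its bytes."""
        method = zipfile.ZIP_DEFLATED if compression else zipfile.ZIP_STORED
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, mode="w", compression=method) as archive:
            for path, payload in files.items():
                archive.writestr(path, payload)
        return buffer.getvalue()


    if __name__ == "__main__":
        name = archive_path("https://sozd.duma.gov.ru/download/example-id", "pdf")
        sample = {name: b"budget " * 1000}
        print(name)
        print(len(build_zip(sample, compression=True)), "bytes compressed")
        print(len(build_zip(sample, compression=False)), "bytes stored")

Repetitive payloads like the sample shrink considerably under Deflate, while already compressed formats such as JPEG gain little, which is the space versus CPU trade-off the ``compression`` option describes.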