// Render with Asciidoctor

= findtb
Philippe Proulx
:toc:

image:https://img.shields.io/pypi/v/findtb.svg?label=Latest%20version[link="https://pypi.python.org/pypi/findtb"]

_**findtb**_ is a Python{nbsp}3 package which finds
https://en.wikipedia.org/wiki/Tar_(computing)[tarball] URLs within an
HTML (HTTP) page or FTP directory.

findtb looks for tarball URLs directly in the resource at the given URL
as well as in minor/patch version subdirectories, if any.

== Installation

Install findtb from https://pypi.org/project/findtb/[PyPI]:

----
$ sudo pip3 install findtb
----

== Usage

Use `findtb.find_tarball_urls()` to get a set of all the tarball URLs
found from a given URL:

[source,python]
----
import findtb
import pprint

# find tarball URLs
urls = findtb.find_tarball_urls('ansible',
                                'https://releases.ansible.com/ansible/')

# print URLs
pprint.pprint(urls)
----

----
{'https://releases.ansible.com/ansible/ansible-1.1.tar.gz',
 'https://releases.ansible.com/ansible/ansible-1.2.1.tar.gz',
 'https://releases.ansible.com/ansible/ansible-1.2.2.tar.gz',
 'https://releases.ansible.com/ansible/ansible-1.2.3.tar.gz',
 'https://releases.ansible.com/ansible/ansible-1.2.tar.gz',
 'https://releases.ansible.com/ansible/ansible-1.3.0.tar.gz',
...
 'https://releases.ansible.com/ansible/ansible-2.9.0rc5.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.1.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.2.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.3.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.4.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.5.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.6.tar.gz',
 'https://releases.ansible.com/ansible/ansible-2.9.7.tar.gz'}
----

The first parameter is the project's name as expected in the tarball
names. The second parameter is the URL from which to start the search.
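
Because the result is a plain set of strings, it has no inherent order. The following is a minimal sketch, using only the standard library, of one way to sort such URLs by the numeric version in the file name. The `numeric_version()` helper and its regular expression are hypothetical and not part of findtb; the sample set is copied from the output above so that the sketch runs without network access.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Sample copied from the output above (no network access needed).
urls = {
    'https://releases.ansible.com/ansible/ansible-1.2.tar.gz',
    'https://releases.ansible.com/ansible/ansible-1.2.1.tar.gz',
    'https://releases.ansible.com/ansible/ansible-1.3.0.tar.gz',
    'https://releases.ansible.com/ansible/ansible-2.9.7.tar.gz',
    'https://releases.ansible.com/ansible/ansible-1.1.tar.gz',
}

# Hypothetical pattern: a dash followed by dot-separated integers.
_VERSION_RE = re.compile(r'-(\d+(?:\.\d+)*)')


def numeric_version(url):
    """Return the dotted numeric version of the file name as a tuple of ints.

    Pre-release suffixes such as `rc5` are ignored; an empty tuple is
    returned when no version is found.
    """
    file_name = posixpath.basename(urlsplit(url).path)
    match = _VERSION_RE.search(file_name)

    if match is None:
        return ()

    return tuple(int(part) for part in match.group(1).split('.'))


# Sort from oldest to newest and pick the last one.
ordered = sorted(urls, key=numeric_version)
latest = ordered[-1]
print(latest)
```

Comparing tuples of integers avoids the classic pitfall of a plain string sort, where `1.2.10` would come before `1.2.2`. This sketch does not distinguish a release candidate from its final release.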

You can pass a `logging.Logger` object as an optional third argument to
`findtb.find_tarball_urls()`. This is useful to follow the progress, as
the search can take a while:

[source,python]
----
import findtb
import pprint
import logging

# configure logging
logging.basicConfig(style='{',
                    format='{asctime} [{name}] {{{levelname}}}: {message}')

# create logger
logger = logging.getLogger('findtb')
logger.setLevel(logging.INFO)

# find tarball URLs
urls = findtb.find_tarball_urls('glib',
                                'ftp://ftp.gnome.org/pub/gnome/sources/glib/',
                                logger)

# print URLs
pprint.pprint(urls)
----

----
2020-04-20 12:08:22,200 [findtb] {INFO}: FTP: connecting to `ftp.gnome.org`.
2020-04-20 12:08:27,908 [findtb] {INFO}: FTP: changing working directory to `/pub/gnome/sources/glib/`.
2020-04-20 12:08:28,061 [findtb] {INFO}: FTP: listing files in `/pub/gnome/sources/glib/`.
2020-04-20 12:08:28,824 [findtb] {INFO}: FTP: changing working directory to `/pub/gnome/sources/glib/2.39`.
2020-04-20 12:08:28,983 [findtb] {INFO}: FTP: listing files in `/pub/gnome/sources/glib/2.39`.
2020-04-20 12:08:29,770 [findtb] {INFO}: FTP: changing working directory to `..`.
2020-04-20 12:08:29,922 [findtb] {INFO}: FTP: changing working directory to `/pub/gnome/sources/glib/2.19`.
2020-04-20 12:08:30,079 [findtb] {INFO}: FTP: listing files in `/pub/gnome/sources/glib/2.19`.
2020-04-20 12:08:30,871 [findtb] {INFO}: FTP: changing working directory to `..`.
2020-04-20 12:08:31,027 [findtb] {INFO}: FTP: changing working directory to `/pub/gnome/sources/glib/2.60`.
2020-04-20 12:08:31,177 [findtb] {INFO}: FTP: listing files in `/pub/gnome/sources/glib/2.60`.
...
{'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.1/glib-1.1.12.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.1/glib-1.1.15.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.2/glib-1.2.0.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.2/glib-1.2.10.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.2/glib-1.2.2.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.2/glib-1.2.5.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/1.2/glib-1.2.6.tar.gz',
...
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.4.tar.bz2',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.4.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.5.tar.bz2',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.5.tar.gz',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.6.tar.bz2',
 'ftp://ftp.gnome.org/pub/gnome/sources/glib/2.9/glib-2.9.6.tar.gz'}
----

`findtb.find_tarball_urls()` is quite low-level: it returns a set of
strings. Use `findtb.find_tarballs()` to get a set of `findtb.Tarball`
objects instead.

A `findtb.Tarball` object contains useful properties such as `name`
(tarball file name) and `version` (a `packaging.version.Version`
object):

[source,python]
----
import findtb

# find tarballs
tarballs = findtb.find_tarballs('python',
                                'https://www.python.org/ftp/python/')

# print names and versions
for tarball in tarballs:
    print(tarball.name, tarball.version.release)
----

----
Python-3.7.0b2.tgz (3, 7, 0)
Python-3.4.0a1.tgz (3, 4, 0)
Python-2.4.6.tgz (2, 4, 6)
Python-3.3.4rc1.tar.xz (3, 3, 4)
Python-3.1.3rc1.tar.bz2 (3, 1, 3)
python-3.6.0b4-embed-win32.zip None
Python-3.7.0a1.tgz (3, 7, 0)
python-3.5.0rc3-embed-amd64.zip None
...
Python-2.2.1.tgz (2, 2, 1)
python-3.9.0a4-embed-win32.zip None
Python-2.7.1rc1.tgz (2, 7, 1)
Python-3.4.0a3.tgz (3, 4, 0)
Python-3.5.2.tar.xz (3, 5, 2)
Python-3.0a2.tgz (3, 0)
Python-3.6.6rc1.tgz (3, 6, 6)
----
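
As the output above shows, the set is unordered and some entries print `None` instead of a release tuple. The following sketch shows one possible way to pick the entry with the greatest release tuple while skipping those entries. It runs on stand-in objects that expose only `name` and `version.release`, with values copied from the output above; `fake_tarball()` and `newest()` are hypothetical helpers, not part of findtb, and real `findtb.Tarball` objects may need different handling.

```python
from types import SimpleNamespace


def fake_tarball(name, release):
    """Build a stand-in exposing only `name` and `version.release`."""
    return SimpleNamespace(name=name,
                           version=SimpleNamespace(release=release))


def newest(tarballs):
    """Return the entry with the greatest release tuple, or `None`.

    Entries without a release tuple are skipped.
    """
    with_release = [t for t in tarballs if t.version.release is not None]
    return max(with_release, key=lambda t: t.version.release, default=None)


# Values copied from the output above.
sample = [
    fake_tarball('Python-2.4.6.tgz', (2, 4, 6)),
    fake_tarball('Python-3.5.2.tar.xz', (3, 5, 2)),
    fake_tarball('python-3.9.0a4-embed-win32.zip', None),
    fake_tarball('Python-3.7.0b2.tgz', (3, 7, 0)),
]

best = newest(sample)
print(best.name, best.version.release)
```

Comparing only release tuples ignores pre-release markers, which is why a beta is selected here.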

Any networking/parsing error raises `findtb.Error`.
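
A caller that checks many projects may prefer to log a failure and carry on rather than stop at the first error. The following is a minimal sketch of such a wrapper; `find_urls_or_empty()` is a hypothetical helper, not part of findtb, and it assumes that `findtb.Error` derives from `Exception`.

```python
import logging

logger = logging.getLogger(__name__)


def find_urls_or_empty(name, url):
    """Return the tarball URLs found from `url`, or an empty set on failure.

    Hypothetical wrapper, not part of findtb.
    """
    # Imported here so that this module can still be imported on a
    # system where findtb is not installed.
    import findtb

    try:
        return findtb.find_tarball_urls(name, url)
    except findtb.Error as exc:
        # Networking or parsing problem: report it and return nothing.
        logger.warning('Cannot find tarballs of %s from %s: %s',
                       name, url, exc)
        return set()
```

Returning an empty set keeps the return type uniform, at the cost of making a failed search look like one that found nothing. Callers that need to tell the two apart should catch `findtb.Error` themselves.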

== Limitations

findtb is not guaranteed to work for all projects. Its very
sophisticated algorithms rely on nasty regular expressions and there are
dozens of software versioning schemes in the wild.
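
To illustrate why matching by regular expression is fragile, here is a deliberately naive sketch. The pattern and the `match_version()` helper are hypothetical and are not the expressions findtb actually uses; the file names come from the example outputs above.

```python
import re


def match_version(project_name, file_name):
    """Return the dotted numeric version of `file_name`, or `None`.

    Deliberately naive: expects exactly `NAME-X.Y.Z.tar.{gz,bz2,xz}`.
    """
    pattern = re.compile(
        r'^' + re.escape(project_name) +
        r'-(\d+(?:\.\d+)*)\.tar\.(?:gz|bz2|xz)$'
    )
    match = pattern.match(file_name)
    return match.group(1) if match else None


# Matches: plain name, numeric version, known extension.
print(match_version('glib', 'glib-2.9.4.tar.bz2'))
print(match_version('Python', 'Python-3.5.2.tar.xz'))

# Misses: different letter case in the project name.
print(match_version('python', 'Python-3.5.2.tar.xz'))

# Misses: pre-release suffix and `.tgz` extension.
print(match_version('Python', 'Python-3.7.0b2.tgz'))

# Misses: release candidate suffix.
print(match_version('Python', 'Python-3.3.4rc1.tar.xz'))
```

Each miss calls for another special case in the pattern, and every project can bring its own.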

Feel free to contribute if findtb does not work for you.