https://github.com/warrenweckesser/scarff

An ARFF file writer that handles NumPy arrays and SciPy sparse matrices.
Host: GitHub
URL: https://github.com/warrenweckesser/scarff
Owner: WarrenWeckesser
License: mit
Created: 2021-10-01T03:19:57.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-10-12T22:41:21.000Z (2 months ago)
Last Synced: 2024-10-17T08:50:28.933Z (2 months ago)
Topics: arff, python
Language: Python
Homepage:
Size: 89.8 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.rst
- License: LICENSE.txt
README

scarff
======

An ARFF file writer that handles NumPy arrays and SciPy sparse matrices.

Limitations:

* ``relational`` attributes are not supported.

* The ``dateformat`` parameter accepts a format string that defines
  the output format for ``date`` attributes.  ARFF uses the Java
  SimpleDateFormat specification for the format string, but ``savearff``
  supports only a subset of the SimpleDateFormat patterns.

* The big limitation, of course, is that this package includes only a
  writer.  It does not provide a function to read ARFF files.
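
To make the ``dateformat`` item concrete, the sketch below translates a handful of SimpleDateFormat tokens into Python ``strftime`` directives.  The helper ``simple_to_strftime`` and its token table are hypothetical and are not taken from ``scarff``; they only illustrate what supporting a subset of the patterns can look like.

```python
from datetime import datetime

# Hypothetical mapping from a few Java SimpleDateFormat tokens to
# Python strftime directives.  Illustration only; this is not the
# table used inside scarff.
_TOKEN_MAP = [
    ("yyyy", "%Y"),
    ("MM", "%m"),
    ("dd", "%d"),
    ("HH", "%H"),
    ("mm", "%M"),
    ("ss", "%S"),
]


def simple_to_strftime(pattern):
    """Translate a small subset of SimpleDateFormat into strftime."""
    out = []
    i = 0
    while i < len(pattern):
        for token, directive in _TOKEN_MAP:
            if pattern.startswith(token, i):
                out.append(directive)
                i += len(token)
                break
        else:
            ch = pattern[i]
            if ch.isalpha():
                # Letters are reserved in SimpleDateFormat, so an
                # unrecognized one is rejected instead of passed through.
                raise ValueError(f"unsupported pattern letter: {ch!r}")
            # Escape a literal percent sign for strftime.
            out.append("%%" if ch == "%" else ch)
            i += 1
    return "".join(out)


fmt = simple_to_strftime("yyyy-MM-dd HH:mm:ss")
print(fmt)                                            # %Y-%m-%d %H:%M:%S
print(datetime(2011, 5, 4, 13, 12, 4).strftime(fmt))  # 2011-05-04 13:12:04
```

Rejecting unknown letters is a design choice of this sketch: it surfaces an unsupported pattern early instead of silently writing a wrong date string.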

Examples
--------

Initial imports::

    >>> import sys
    >>> import numpy as np
    >>> from scarff import savearff

**NumPy array of integers**

``a`` is a 2-d array of integers.  The default attribute names generated
by ``savearff`` for each column are ``f0``, ``f1``, etc.  Here we
override that default and assign each column an attribute name with the
``attributes`` parameter::

    >>> a = np.array([[1, 2, 3], [9, 7, 6], [2, 2, 8], [4, 2, 3]])
    >>> savearff(sys.stdout, a, attributes=['x0', 'y0', 'z0'],
    ...          relation='points')
    @relation points
    @attribute x0 integer
    @attribute y0 integer
    @attribute z0 integer
    @data
    1,2,3
    9,7,6
    2,2,8
    4,2,3

**NumPy array with a structured dtype**

In this example, we have a structured array with a data type
that has four fields.  ``savearff`` takes the attribute names
from the names of the fields in the data type.  This example
also shows the use of a ``date`` attribute::

    >>> dt = np.dtype([('id', int), ('strength', float), ('key', 'U8'),
    ...                ('timestamp', 'datetime64[s]')])
    >>> m = np.array([(233, 1.75, 'QXX34', '2011-05-04T13:12:04'),
    ...               (154, 3.25, 'QXX99', '2011-05-04T13:47:43'),
    ...               (199, 2.16, 'QXZ55', '2011-05-04T14:41:02'),
    ...               (198, 2.32, 'QXZ59', '2011-05-04T15:28:19')], dtype=dt)
    >>> savearff(sys.stdout, m, relation='measurements',
    ...          dateformat='yyyy-MM-dd HH:mm:ss')
    @relation measurements
    @attribute id integer
    @attribute strength real
    @attribute key string
    @attribute timestamp date "yyyy-MM-dd HH:mm:ss"
    @data
    233,1.75,"QXX34","2011-05-04 13:12:04"
    154,3.25,"QXX99","2011-05-04 13:47:43"
    199,2.16,"QXZ55","2011-05-04 14:41:02"
    198,2.32,"QXZ59","2011-05-04 15:28:19"

**Nominal attributes**

ARFF files can have "nominal" attributes, in which the possible
values are restricted to a given set.  The ``nominal`` parameter
of ``savearff`` allows a column to be designated as a nominal
attribute.  The set of possible values can be derived from the
set of unique values found in the column, or can be given explicitly.

For example, here we use ``nominal={'color': True}`` to indicate that
the ``color`` attribute is nominal; the set of possible values will
be the set of unique values found in the data (in this case, ``black``,
``green`` and ``red``)::

    >>> things = [[10, 20, 'a', 'green'],
    ...           [30, 40, 'b', 'red'],
    ...           [50, 60, 'b', 'red'],
    ...           [70, 80, 'c', 'black'],
    ...           [19, 29, 'c', 'red']]
    >>> savearff(sys.stdout, things, relation='THINGS',
    ...          attributes=['x', 'y', 'code', 'color'],
    ...          nominal={'color': True})
    @relation THINGS
    @attribute x integer
    @attribute y integer
    @attribute code string
    @attribute color {black,green,red}
    @data
    10,20,"a","green"
    30,40,"b","red"
    50,60,"b","red"
    70,80,"c","black"
    19,29,"c","red"

The set of possible values can be given explicitly::

    >>> savearff(sys.stdout, things, relation='THINGS',
    ...          attributes=['x', 'y', 'code', 'color'],
    ...          nominal={'color': ['red', 'green', 'blue', 'black', 'white']})
    @relation THINGS
    @attribute x integer
    @attribute y integer
    @attribute code string
    @attribute color {red,green,blue,black,white}
    @data
    10,20,"a","green"
    30,40,"b","red"
    50,60,"b","red"
    70,80,"c","black"
    19,29,"c","red"
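
The two ways of specifying a nominal set can be sketched in plain Python.  The function ``nominal_declaration`` is hypothetical and is not part of ``scarff``.  Sorting the derived values is an assumption based on the ``{black,green,red}`` output shown above, and raising on values outside an explicit set is a choice made for this sketch, not documented ``savearff`` behavior.

```python
def nominal_declaration(name, column, values=True):
    """Build an ARFF ``@attribute`` line for a nominal column.

    ``values=True`` derives the set from the data; a list gives the
    set explicitly and its order is preserved.
    """
    if values is True:
        # Assumption: derived values are written in sorted order.
        values = sorted(set(column))
    else:
        extra = set(column) - set(values)
        if extra:
            # Choice made for this sketch; scarff may behave differently.
            raise ValueError(f"values not in nominal set: {sorted(extra)}")
    return "@attribute {} {{{}}}".format(name, ",".join(values))


colors = ["green", "red", "red", "black", "red"]
print(nominal_declaration("color", colors))
# @attribute color {black,green,red}
print(nominal_declaration("color", colors,
                          ["red", "green", "blue", "black", "white"]))
# @attribute color {red,green,blue,black,white}
```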

**SciPy sparse matrix**

SciPy is not a required dependency of ``scarff``, but ``savearff``
will recognize SciPy sparse matrices and write them to the ARFF file
using the sparse format by default::

    >>> from scipy.sparse import csc_matrix
    >>> data = [10, 20, 30, 40, 50, 60]
    >>> rows = [0, 2, 2, 3, 5, 5]
    >>> cols = [3, 1, 2, 2, 3, 4]
    >>> s = csc_matrix((data, (rows, cols)), shape=(7, 5))
    >>> s.toarray()
    array([[ 0,  0,  0, 10,  0],
           [ 0,  0,  0,  0,  0],
           [ 0, 20, 30,  0,  0],
           [ 0,  0, 40,  0,  0],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0, 50, 60],
           [ 0,  0,  0,  0,  0]])
    >>> savearff(sys.stdout, s, relation='links',
    ...          attributes=['a', 'b', 'c', 'd', 'e'])
    @relation links
    @attribute a integer
    @attribute b integer
    @attribute c integer
    @attribute d integer
    @attribute e integer
    @data
    {3 10}
    {}
    {1 20, 2 30}
    {2 40}
    {}
    {3 50, 4 60}
    {}

**Sparse format with a NumPy array**

A regular NumPy array can be written in the sparse format by giving
the argument ``fileformat='sparse'``::

    >>> sp = np.array([[0, 0, 99, 0, 0],
    ...                [29, 0, 0, 0, 19],
    ...                [0, 0, 0, 0, 0],
    ...                [0, 89, 0, 0, 0]])
    >>> savearff(sys.stdout, sp, fileformat='sparse',
    ...          relation='sparse example')
    @relation "sparse example"
    @attribute f0 integer
    @attribute f1 integer
    @attribute f2 integer
    @attribute f3 integer
    @attribute f4 integer
    @data
    {2 99}
    {0 29, 4 19}
    {}
    {1 89}
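
The mapping from a dense row to a sparse ARFF instance is simple enough to sketch directly: each nonzero entry becomes an ``index value`` pair with a zero-based column index.  The function ``sparse_row`` is a hypothetical illustration in plain Python, not the code used by ``savearff``.

```python
def sparse_row(row):
    """Format one dense row as an ARFF sparse instance."""
    # Keep only the nonzero entries, each written as "<column> <value>".
    pairs = [f"{index} {value}" for index, value in enumerate(row)
             if value != 0]
    return "{" + ", ".join(pairs) + "}"


rows = [[0, 0, 99, 0, 0],
        [29, 0, 0, 0, 19],
        [0, 0, 0, 0, 0],
        [0, 89, 0, 0, 0]]
for row in rows:
    print(sparse_row(row))
# {2 99}
# {0 29, 4 19}
# {}
# {1 89}
```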

**Missing data**

The ``missing`` parameter allows values to be specified that
correspond to missing values.  These will appear as ``?`` in the
``@data`` section of the ARFF file.

In this example, the value 999.25 indicates a missing value::

    >>> x = np.array([[1.75, 7.93, 18.31],
    ...               [2.44, 6.62, 32.11],
    ...               [2.51, 2.25, 999.25],
    ...               [2.64, 2.33, 999.25],
    ...               [2.75, 2.83, 999.25]])
    >>> savearff(sys.stdout, x, missing=[999.25], relation='readings')
    @relation readings
    @attribute f0 real
    @attribute f1 real
    @attribute f2 real
    @data
    1.75,7.93,18.31
    2.44,6.62,32.11
    2.51,2.25,?
    2.64,2.33,?
    2.75,2.83,?
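
As a rough sketch of the sentinel substitution described above, the following plain-Python helper writes ``?`` for any value listed in ``missing``.  The name ``format_row`` is hypothetical; how ``savearff`` performs the comparison internally is not shown in this README.

```python
def format_row(row, missing=()):
    """Join a row with commas, writing '?' for sentinel values."""
    return ",".join("?" if value in missing else str(value)
                    for value in row)


readings = [[1.75, 7.93, 18.31],
            [2.51, 2.25, 999.25]]
for row in readings:
    print(format_row(row, missing=[999.25]))
# 1.75,7.93,18.31
# 2.51,2.25,?
```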

**NumPy masked array**

``savearff`` recognizes NumPy masked arrays.  Masked values in
the input array will be written as ``?`` in the ``@data`` section::

    >>> flux = np.ma.masked_array([[3.4, 2.1, 0.0, 3.4],
    ...                            [3.2, 4.8, 0.5, 3.7],
    ...                            [3.3, 2.8, 0.0, 4.1]],
    ...                           mask=[[0, 0, 1, 0],
    ...                                 [0, 0, 0, 0],
    ...                                 [0, 0, 1, 0]])
    >>> flux
    masked_array(
      data=[[3.4, 2.1, --, 3.4],
            [3.2, 4.8, 0.5, 3.7],
            [3.3, 2.8, --, 4.1]],
      mask=[[False, False,  True, False],
            [False, False, False, False],
            [False, False,  True, False]],
      fill_value=1e+20)
    >>> savearff(sys.stdout, flux, relation='flux capacitance')
    @relation "flux capacitance"
    @attribute f0 real
    @attribute f1 real
    @attribute f2 real
    @attribute f3 real
    @data
    3.4,2.1,?,3.4
    3.2,4.8,0.5,3.7
    3.3,2.8,?,4.1

**NumPy array with nested data type**

This example uses a NumPy array whose structured data type contains
both a nested structure and an array-valued field.  ``savearff``
flattens the data type and derives the attribute names in the output
from the field names::

    >>> dt = np.dtype([('key', 'U4'),
    ...                ('position', [('x', np.float32), ('y', np.float32)]),
    ...                ('values', np.float32, 3)])
    >>> records = np.array([('A234', (1.9, -3.0), (6, 7, 2)),
    ...                     ('A555', (2.8, 0.6), (4, 2.5, 3)),
    ...                     ('B431', (2.7, 8.6), (4, 2.8, 0.2))], dtype=dt)
    >>> savearff(sys.stdout, records, relation='records')
    @relation records
    @attribute key string
    @attribute position.x real
    @attribute position.y real
    @attribute values_0 real
    @attribute values_1 real
    @attribute values_2 real
    @data
    "A234",1.9,-3,6,7,2
    "A555",2.8,0.6,4,2.5,3
    "B431",2.7,8.6,4,2.8,0.2

The above example demonstrates the default method for converting
structured data type field names to attribute names. ``savearff``
has several options to change how the names are generated.
For example::

    >>> savearff(sys.stdout, records, relation='records',
    ...          join='$', index_base=1, index_open='(', index_close=')')
    @relation records
    @attribute key string
    @attribute position$x real
    @attribute position$y real
    @attribute values(1) real
    @attribute values(2) real
    @attribute values(3) real
    @data
    "A234",1.9,-3,6,7,2
    "A555",2.8,0.6,4,2.5,3
    "B431",2.7,8.6,4,2.8,0.2

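The effect of ``join``, ``index_base``, ``index_open`` and ``index_close`` can be sketched without NumPy.  The function ``flatten_names`` and the nested-list description of the fields are hypothetical.  The default values used here (``'.'``, ``0``, ``'_'`` and ``''``) are inferred from the output of the two examples above, not taken from the ``savearff`` signature.

```python
def flatten_names(fields, join=".", index_base=0, index_open="_",
                  index_close="", prefix=""):
    """Flatten a nested field description into attribute names.

    Each entry of ``fields`` is ``(name, spec)``, where ``spec`` is
    None for a scalar, an int for an array of that length, or a list
    of ``(name, spec)`` pairs for a nested structure.
    """
    names = []
    for name, spec in fields:
        full = prefix + name
        if isinstance(spec, list):
            # Nested structure: recurse, joining parent and child names.
            names.extend(flatten_names(spec, join, index_base, index_open,
                                       index_close, prefix=full + join))
        elif isinstance(spec, int):
            # Array-valued field: one attribute per element.
            names.extend(f"{full}{index_open}{i + index_base}{index_close}"
                         for i in range(spec))
        else:
            names.append(full)
    return names


fields = [("key", None),
          ("position", [("x", None), ("y", None)]),
          ("values", 3)]
print(flatten_names(fields))
# ['key', 'position.x', 'position.y', 'values_0', 'values_1', 'values_2']
print(flatten_names(fields, join="$", index_base=1,
                    index_open="(", index_close=")"))
# ['key', 'position$x', 'position$y', 'values(1)', 'values(2)', 'values(3)']
```
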
**Instance weights**

The ARFF format provides the option of saving an "instance weight" with
each instance (i.e. each row) of the data.  ``savearff`` accepts a
``weights`` argument containing a sequence of numbers.  The length of
``weights`` must equal the number of rows to be written in the ``@data``
section.  The weights are written to the file as an additional column in
the ``@data`` section, with the values enclosed in curly brackets.

For example::

    >>> dt = np.dtype([('id', int), ('x', float), ('y', float)])
    >>> samples = np.array([(300, 1.5, 1.8),
    ...                     (300, 0.8, 2.4),
    ...                     (304, 2.4, 0.5),
    ...                     (304, 3.2, 0.2)], dtype=dt)
    >>> weights = np.array([2, 2, 1, 1])
    >>> savearff(sys.stdout, samples, relation='samples', weights=weights)
    @relation samples
    @attribute id integer
    @attribute x real
    @attribute y real
    @data
    300,1.5,1.8, {2}
    300,0.8,2.4, {2}
    304,2.4,0.5, {1}
    304,3.2,0.2, {1}
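
The weight column can be sketched in plain Python as follows.  The function ``rows_with_weights`` is hypothetical; it mirrors the length requirement and the ``, {w}`` suffix seen in the output above, but it is not the ``savearff`` implementation.

```python
def rows_with_weights(rows, weights):
    """Format data rows with a trailing ARFF instance weight."""
    if len(weights) != len(rows):
        raise ValueError("weights must have one entry per row")
    return [",".join(str(v) for v in row) + ", {" + str(w) + "}"
            for row, w in zip(rows, weights)]


sample_rows = [(300, 1.5, 1.8), (300, 0.8, 2.4),
               (304, 2.4, 0.5), (304, 3.2, 0.2)]
for line in rows_with_weights(sample_rows, [2, 2, 1, 1]):
    print(line)
# 300,1.5,1.8, {2}
# 300,0.8,2.4, {2}
# 304,2.4,0.5, {1}
# 304,3.2,0.2, {1}
```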