Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/warrenweckesser/scarff
An ARFF file writer that handles NumPy arrays and SciPy sparse matrices.
https://github.com/warrenweckesser/scarff
arff python
Last synced: 16 days ago
JSON representation
An ARFF file writer that handles NumPy arrays and SciPy sparse matrices.
- Host: GitHub
- URL: https://github.com/warrenweckesser/scarff
- Owner: WarrenWeckesser
- License: mit
- Created: 2021-10-01T03:19:57.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-12T22:41:21.000Z (2 months ago)
- Last Synced: 2024-10-17T08:50:28.933Z (2 months ago)
- Topics: arff, python
- Language: Python
- Homepage:
- Size: 89.8 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.rst
- License: LICENSE.txt
Awesome Lists containing this project
README
scarff
======An ARFF file writer that handles NumPy arrays and SciPy sparse matrices.
Limitations:
* ``relational`` attributes are not supported.
* The ``dateformat`` parameter accepts a format string that defines
the output format for ``date`` attributes. ARFF uses the Java
SimpleDateFormat specification for the format string. Only a subset
of the SimpleDateFormat patterns are supported by ``savearff``.
* The big limitation, of course, is that this package includes only a
writer. It does not provide a function to read ARFF files.Examples
--------Initial imports::
>>> import sys
>>> import numpy as np
>>> from scarff import savearff**NumPy array of integers**
``a`` is a 2-d array of integers. The default attribute names generated
by ``savearff`` for each column are ``f0``, ``f1``, etc. Here we
override that default and assign each column an attribute name with the
``attributes`` parameter::>>> a = np.array([[1, 2, 3], [9, 7, 6], [2, 2, 8], [4, 2, 3]])
>>> savearff(sys.stdout, a, attributes=['x0', 'y0', 'z0'],
... relation='points')
@relation points@attribute x0 integer
@attribute y0 integer
@attribute z0 integer@data
1,2,3
9,7,6
2,2,8
4,2,3**NumPy array with a structured dtype**
In this example, we have a structured array with a data type
that has four fields. ``savearff`` takes the attribute names
from the names of the fields in the data type. This example
also shows the use of a ``date`` attribute::>>> dt = np.dtype([('id', int), ('strength', float), ('key', 'U8'),
... ('timestamp', 'datetime64[s]')])
>>> m = np.array([(233, 1.75, 'QXX34', '2011-05-04T13:12:04'),
... (154, 3.25, 'QXX99', '2011-05-04T13:47:43'),
... (199, 2.16, 'QXZ55', '2011-05-04T14:41:02'),
... (198, 2.32, 'QXZ59', '2011-05-04T15:28:19')], dtype=dt)
>>> savearff(sys.stdout, m, relation='measurements',
... dateformat='yyyy-MM-dd HH:mm:ss')
@relation measurements@attribute id integer
@attribute strength real
@attribute key string
@attribute timestamp date "yyyy-MM-dd HH:mm:ss"@data
233,1.75,"QXX34","2011-05-04 13:12:04"
154,3.25,"QXX99","2011-05-04 13:47:43"
199,2.16,"QXZ55","2011-05-04 14:41:02"
198,2.32,"QXZ59","2011-05-04 15:28:19"**Nominal attributes**
ARFF files can have "nominal" attributes, in which the possible
values are restricted to a given set. The ``nominal`` parameter
of ``savearff`` allows a column to be designated as a nominal
attribute. The set of possible values can be derived from the
set of unique values found in the column, or can be given explicitly.
For example, here we use ``nominal={'color': True}`` to indicate that
the ``color`` attribute is nominal; the set of possible values will
be the set of unique values found in the data (in this case, ``black``,
``green`` and ``red``)::>>> things = [[10, 20, 'a', 'green'],
... [30, 40, 'b', 'red'],
... [50, 60, 'b', 'red'],
... [70, 80, 'c', 'black'],
... [19, 29, 'c', 'red']]
>>> savearff(sys.stdout, things, relation='THINGS',
... attributes=['x', 'y', 'code', 'color'],
... nominal={'color': True})
@relation THINGS@attribute x integer
@attribute y integer
@attribute code string
@attribute color {black,green,red}@data
10,20,"a","green"
30,40,"b","red"
50,60,"b","red"
70,80,"c","black"
19,29,"c","red"The set of possible values can be given explicitly::
>>> savearff(sys.stdout, things, relation='THINGS',
... attributes=['x', 'y', 'code', 'color'],
... nominal={'color': ['red', 'green', 'blue', 'black', 'white']})
@relation THINGS@attribute x integer
@attribute y integer
@attribute code string
@attribute color {red,green,blue,black,white}@data
10,20,"a","green"
30,40,"b","red"
50,60,"b","red"
70,80,"c","black"
19,29,"c","red"**SciPy sparse matrix**
SciPy is not a required dependency of ``scarff``, but ``savearff``
will recognize SciPy sparse matrices and write them to the ARFF file
using the sparse format by default::>>> from scipy.sparse import csc_matrix
>>> data = [10, 20, 30, 40, 50, 60]
>>> rows = [0, 2, 2, 3, 5, 5]
>>> cols = [3, 1, 2, 2, 3, 4]
>>> s = csc_matrix((data, (rows, cols)), shape=(7, 5))
>>> s.toarray()
array([[ 0, 0, 0, 10, 0],
[ 0, 0, 0, 0, 0],
[ 0, 20, 30, 0, 0],
[ 0, 0, 40, 0, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 50, 60],
[ 0, 0, 0, 0, 0]])
>>> savearff(sys.stdout, s, relation='links',
... attributes=['a', 'b', 'c', 'd', 'e'])
@relation links@attribute a integer
@attribute b integer
@attribute c integer
@attribute d integer
@attribute e integer@data
{3 10}
{}
{1 20, 2 30}
{2 40}
{}
{3 50, 4 60}
{}**Sparse format with a NumPy array**
A regular NumPy array can be written in the sparse format by giving
the argument ``fileformat='sparse'``::>>> sp = np.array([[0, 0, 99, 0, 0],
... [29, 0, 0, 0, 19],
... [0, 0, 0, 0, 0],
... [0, 89, 0, 0, 0]])
>>> savearff(sys.stdout, sp, fileformat='sparse',
... relation='sparse example')
@relation "sparse example"@attribute f0 integer
@attribute f1 integer
@attribute f2 integer
@attribute f3 integer
@attribute f4 integer@data
{2 99}
{0 29, 4 19}
{}
{1 89}**Missing data**
The ``missing`` parameter allows values to be specified that
correspond to missing values. These will appear as ``?`` in the
``@data`` section of the ARFF file.In this example, the value 999.25 indicates a missing value::
>>> x = np.array([[1.75, 7.93, 18.31],
... [2.44, 6.62, 32.11],
... [2.51, 2.25, 999.25],
... [2.64, 2.33, 999.25],
... [2.75, 2.83, 999.25]])
>>> savearff(sys.stdout, x, missing=[999.25], relation='readings')
@relation readings@attribute f0 real
@attribute f1 real
@attribute f2 real@data
1.75,7.93,18.31
2.44,6.62,32.11
2.51,2.25,?
2.64,2.33,?
2.75,2.83,?**NumPy masked array**
``savearff`` recognizes NumPy masked arrays. Masked values in
the input array will be written as ``?`` in the ``@data`` section::>>> flux = np.ma.masked_array([[3.4, 2.1, 0.0, 3.4],
... [3.2, 4.8, 0.5, 3.7],
... [3.3, 2.8, 0.0, 4.1]],
... mask=[[0, 0, 1, 0],
... [0, 0, 0, 0],
... [0, 0, 1, 0]])
>>> flux
masked_array(
data=[[3.4, 2.1, --, 3.4],
[3.2, 4.8, 0.5, 3.7],
[3.3, 2.8, --, 4.1]],
mask=[[False, False, True, False],
[False, False, False, False],
[False, False, True, False]],
fill_value=1e+20)
>>> savearff(sys.stdout, flux, relation='flux capacitance')
@relation "flux capacitance"@attribute f0 real
@attribute f1 real
@attribute f2 real
@attribute f3 real@data
3.4,2.1,?,3.4
3.2,4.8,0.5,3.7
3.3,2.8,?,4.1**NumPy array with nested data type**
This example uses a NumPy array with a structured data type with nested
and array elements in the structure. ``savearff`` flattens the data type
and derives attribute names from the structured data type; note how the
field names in the structured data type are used to produce the attribute
names in the output::>>> dt = np.dtype([('key', 'U4'),
... ('position', [('x', np.float32), ('y', np.float32)]),
... ('values', np.float32, 3)])
>>> records = np.array([('A234', (1.9, -3.0), (6, 7, 2)),
... ('A555', (2.8, 0.6), (4, 2.5, 3)),
... ('B431', (2.7, 8.6), (4, 2.8, 0.2))], dtype=dt)
>>> savearff(sys.stdout, records, relation='records')
@relation records@attribute key string
@attribute position.x real
@attribute position.y real
@attribute values_0 real
@attribute values_1 real
@attribute values_2 real@data
"A234",1.9,-3,6,7,2
"A555",2.8,0.6,4,2.5,3
"B431",2.7,8.6,4,2.8,0.2The above example demonstrates the default method for converting
structured data type field names to attribute names. ``savearff``
has several options to change how the names are generated.
For example::>>> savearff(sys.stdout, records, relation='records',
... join='$', index_base=1, index_open='(', index_close=')')
@relation records@attribute key string
@attribute position$x real
@attribute position$y real
@attribute values(1) real
@attribute values(2) real
@attribute values(3) real@data
"A234",1.9,-3,6,7,2
"A555",2.8,0.6,4,2.5,3
"B431",2.7,8.6,4,2.8,0.2**Instance weights**
The ARFF format provides the option of saving an "instance weight" with
each instance (i.e. each row) of the data. ``savearff`` accepts a
``weights`` argument containing a sequence of numbers. The length of
``weights`` must equal the number of rows to be written in the ``@DATA``
section. The weights are written to the file as an additional column in
the ``@DATA`` section, with the values enclosed in curly brackets.For example::
>>> dt = np.dtype([('id', int), ('x', float), ('y', float)])
>>> samples = np.array([(300, 1.5, 1.8),
... (300, 0.8, 2.4),
... (304, 2.4, 0.5),
... (304, 3.2, 0.2)], dtype=dt)
>>> weights = np.array([2, 2, 1, 1])
>>> savearff(sys.stdout, samples, relation='samples', weights=weights)
@relation samples@attribute id integer
@attribute x real
@attribute y real@data
300,1.5,1.8, {2}
300,0.8,2.4, {2}
304,2.4,0.5, {1}
304,3.2,0.2, {1}