https://github.com/scikit-hep/awkward-0.x
Manipulate arrays of complex data structures as easily as Numpy.
- Host: GitHub
- URL: https://github.com/scikit-hep/awkward-0.x
- Owner: scikit-hep
- License: bsd-3-clause
- Archived: true
- Created: 2018-06-12T14:00:35.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-02-08T18:33:36.000Z (over 3 years ago)
- Last Synced: 2024-09-24T23:04:16.001Z (about 2 months ago)
- Topics: analysis, apache-arrow, arrow, big-data, columnar, columnar-storage, hdf5, numpy, parquet, python, python3, root, root-cern, scikit-hep
- Language: Python
- Homepage:
- Size: 6.42 MB
- Stars: 215
- Watchers: 15
- Forks: 39
- Open Issues: 19
Metadata Files:
- Readme: README.rst
- License: LICENSE
.. image:: docs/source/logo-300px.png
This is a deprecated version of Awkward Array
=============================================

See `scikit-hep/awkward-1.0 `__ for the latest version of Awkward Array. Old and new versions are available as separate packages,
.. code-block:: bash
pip install awkward # new
pip install awkward0 # old

You can adopt the new library gradually. If you want to use some of its features without completely switching over, you can use `ak.from_awkward0 `__ and `ak.to_awkward0 `__ with the new library loaded as
.. code-block:: python
import awkward as ak
Awkward Array
=============

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3275017.svg
   :target: https://doi.org/10.5281/zenodo.3275017

.. inclusion-marker-1-do-not-remove
Manipulate arrays of complex data structures as easily as Numpy.
.. inclusion-marker-1-5-do-not-remove
Calculations with rectangular, numerical data are simpler and faster in Numpy than traditional for loops. Consider, for instance,
.. code-block:: python
all_r = []
for x, y in zip(all_x, all_y):
    all_r.append(sqrt(x**2 + y**2))

versus
.. code-block:: python
all_r = sqrt(all_x**2 + all_y**2)
Not only is the latter easier to read, it's hundreds of times faster than the for loop (and provides opportunities for hidden vectorization and parallelization). However, the Numpy abstraction stops at rectangular arrays of numbers or character strings. While it's possible to put arbitrary Python data in a Numpy array, Numpy's ``dtype=object`` is essentially a fixed-length list: data are not contiguous in memory and operations are not vectorized.
Awkward Array is a pure Python+Numpy library for manipulating complex data structures as you would Numpy arrays. Even if your data structures
* contain variable-length lists (jagged/ragged),
* are deeply nested (record structure),
* have different data types in the same list (heterogeneous),
* are masked, bit-masked, or index-mapped (nullable),
* contain cross-references or even cyclic references,
* need to be Python class instances on demand,
* are not defined at every point (sparse),
* are not contiguous in memory,
* should not be loaded into memory all at once (lazy),

this library can access them as `columnar data structures `__, with the efficiency of Numpy arrays. They may be converted from JSON or Python data, loaded from "awkd" files, `HDF5 `__, `Parquet `__, or `ROOT `__ files, or they may be views into memory buffers like `Arrow `__.
.. inclusion-marker-2-do-not-remove
Installation
============

Install Awkward Array like any other Python package:
.. code-block:: bash
pip install awkward0 # maybe with sudo or --user, or in virtualenv
The base ``awkward0`` package requires only `Numpy `__ (1.13.1+).
Recommended packages:
---------------------

- `pyarrow `__ to view Arrow and Parquet data as Awkward Arrays
- `h5py `__ to read and write Awkward Arrays in HDF5 files
- `Pandas `__ as an alternative view

.. inclusion-marker-3-do-not-remove
Questions
=========

If you have a question about how to use Awkward Array that is not answered in the document below, I recommend asking your question on `StackOverflow `__ with the ``[awkward-array]`` tag. (I get notified of questions with this tag.) Note that this tag is primarily intended for the new version of Awkward Array, so if you're using this version (Awkward 0.x), be sure to mention that.
If you believe you have found a bug in Awkward Array, post it on the `GitHub issues tab `__.
Tutorial
========

**Table of contents:**

* `Introduction <#introduction>`__
* `Overview with sample datasets <#overview-with-sample-datasets>`__
* `NASA exoplanets from a Parquet file <#nasa-exoplanets-from-a-parquet-file>`__
* `NASA exoplanets from an Arrow buffer <#nasa-exoplanets-from-an-arrow-buffer>`__
* `Relationship to Pandas <#relationship-to-pandas>`__
* `LHC data from a ROOT file <#lhc-data-from-a-root-file>`__
* `Awkward Array data model <#awkward-array-data-model>`__
* `Mutability <#mutability>`__
* `Relationship to Arrow <#relationship-to-arrow>`__
* `High-level operations common to all classes <#high-level-operations-common-to-all-classes>`__
* `Slicing with square brackets <#slicing-with-square-brackets>`__
* `Assigning with square brackets <#assigning-with-square-brackets>`__
* `Numpy-like broadcasting <#numpy-like-broadcasting>`__
* `Support for Numpy universal functions (ufuncs) <#support-for-numpy-universal-functions-ufuncs>`__
* `Global switches <#global-switches>`__
* `Generic properties and methods <#generic-properties-and-methods>`__
* `Reducers <#reducers>`__
* `Properties and methods for jaggedness <#properties-and-methods-for-jaggedness>`__
* `Properties and methods for tabular columns <#properties-and-methods-for-tabular-columns>`__
* `Properties and methods for missing values <#properties-and-methods-for-missing-values>`__
* `Functions for structure manipulation <#functions-for-structure-manipulation>`__
* `Functions for input/output and conversion <#functions-for-inputoutput-and-conversion>`__
* `High-level types <#high-level-types>`__
* `Low-level layouts <#low-level-layouts>`__
Introduction
------------

Numpy is great for exploratory data analysis because it encourages the analyst to calculate one operation at a time, rather than one datum at a time. To compute an expression like
.. math::

   \sqrt{(E_1 + E_2)^2 - (p_{x1} + p_{x2})^2 - (p_{y1} + p_{y2})^2 - (p_{z1} + p_{z2})^2}

you might first compute ``sqrt((px1 + px2)**2 + (py1 + py2)**2)`` for all data (which is a meaningful quantity: ``pt``), then compute ``sqrt(pt**2 + (pz1 + pz2)**2)`` (another meaningful quantity: ``p``), then compute the whole expression as ``sqrt((E1 + E2)**2 - p**2)``. Performing each step separately on all data lets you plot and cross-check distributions of partial computations, to discover surprises as early as possible.
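The step-by-step computation described above might look like the following in Numpy. This is only a sketch: the input arrays hold made-up values for three hypothetical two-particle events, and the name ``mass`` for the final quantity is a label chosen here (the text only calls it "the whole expression").

```python
import numpy

# Made-up momenta and energies for three hypothetical two-particle events.
px1 = numpy.array([3.0, 1.0, 0.0])
px2 = numpy.array([0.0, 2.0, 6.0])
py1 = numpy.array([0.0, 2.0, 4.0])
py2 = numpy.array([4.0, 2.0, 4.0])
pz1 = numpy.array([5.0, 0.0, 10.0])
pz2 = numpy.array([7.0, 12.0, 14.0])
E1 = numpy.array([40.0, 10.0, 20.0])
E2 = numpy.array([45.0, 10.0, 20.0])

# One operation at a time, each applied to all events at once.
pt = numpy.sqrt((px1 + px2)**2 + (py1 + py2)**2)   # first meaningful quantity
p = numpy.sqrt(pt**2 + (pz1 + pz2)**2)             # second meaningful quantity
mass = numpy.sqrt((E1 + E2)**2 - p**2)             # the whole expression

# Each intermediate array can be inspected or plotted before moving on.
print(pt)
print(p)
print(mass)
```

Because ``pt`` and ``p`` exist as full arrays, a surprise in either distribution shows up before the final expression is ever computed.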
This order of data processing is called "columnar" in the sense that a dataset may be visualized as a table in which rows are repeated measurements and columns are the different measurable quantities (same layout as `Pandas DataFrames `__). It is also called "vectorized" in that a Single (virtual) Instruction is applied to Multiple Data (virtual SIMD). Numpy can be hundreds to thousands of times faster than pure Python because it avoids the overhead of handling Python instructions in the loop over numbers. Most data processing languages (R, MATLAB, IDL, all the way back to APL) work this way: an interactive interpreter controlling fast, array-at-a-time math.
However, it's difficult to apply this methodology to non-rectangular data. If your dataset has nested structure, a different number of values per row, different data types in the same column, or cross-references or even circular references, Numpy can't help you.
If you try to make an array with non-trivial types:
.. code-block:: python3
import numpy
nested = numpy.array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}, {"x": 4, "y": 4.4}, {"x": 5, "y": 5.5}])
nested
# array([{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3},
# {'x': 4, 'y': 4.4}, {'x': 5, 'y': 5.5}], dtype=object)

Numpy gives up and returns a ``dtype=object`` array, which means Python objects and pure Python processing. You don't get the columnar operations or the performance boost.
For instance, you might want to say
.. code-block:: python3
try:
    nested + 100
except Exception as err:
    print(type(err), str(err))
# <class 'TypeError'> unsupported operand type(s) for +: 'dict' and 'int'

but there is no vectorized addition for an array of dicts because there is no addition for dicts defined in pure Python. Numpy is not using its vectorized routines; it's calling Python code on each element.
The same applies to variable-length data, such as lists of lists, where the inner lists have different lengths. This is a more serious shortcoming than the above because the list of dicts (Python's equivalent of an "`array of structs `__") could be manually reorganized into two numerical arrays, ``"x"`` and ``"y"`` (a "`struct of arrays `__"). Not so with a list of variable-length lists.
.. code-block:: python3
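# dtype=object must be requested explicitly in NumPy 1.24 and later; older versions inferred it.
varlen = numpy.array([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]], dtype=object)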
varlen
# array([list([1.1, 2.2, 3.3]), list([]), list([4.4, 5.5]), list([6.6]),
# list([7.7, 8.8, 9.9])], dtype=object)

As before, we get a ``dtype=object`` without vectorized methods.
.. code-block:: python3
try:
    varlen + 100
except Exception as err:
    print(type(err), str(err))
# <class 'TypeError'> can only concatenate list (not "int") to list

What's worse, this array looks purely numerical and could have been made by a process that was *supposed* to create equal-length inner lists.
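For the list of dicts, the manual reorganization into a struct of arrays mentioned above can be sketched with plain Numpy. The variable names (``records``, ``columns``, ``shifted``) are illustrative only and are not part of any library.

```python
import numpy

records = [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}]

# Array of structs -> struct of arrays: one Python loop per field,
# paid once when the columns are built.
columns = {
    "x": numpy.array([r["x"] for r in records]),
    "y": numpy.array([r["y"] for r in records]),
}

# From here on, arithmetic is vectorized: one Numpy call per column,
# regardless of how many records there are.
shifted = {name: column + 100 for name, column in columns.items()}

print(shifted["x"])
print(shifted["y"])
```

No comparable two-array trick exists for the variable-length lists, which is the gap the rest of this introduction addresses.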
Awkward Array provides a way of talking about these data structures as arrays.
.. code-block:: python3
import awkward0
nested = awkward0.fromiter([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}, {"x": 4, "y": 4.4}, {"x": 5, "y": 5.5}])
nested
# <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 0x7f25e80a01d0>

This ``Table`` is a columnar data structure with the same meaning as the Python data we built it with. To undo ``awkward0.fromiter``, call ``.tolist()``.
.. code-block:: python3
nested.tolist()
# [{'x': 1, 'y': 1.1},
# {'x': 2, 'y': 2.2},
# {'x': 3, 'y': 3.3},
# {'x': 4, 'y': 4.4},
# {'x': 5, 'y': 5.5}]

Values at the same position of the tree structure are contiguous in memory: this is a struct of arrays.
.. code-block:: python3
nested.contents["x"]
# array([1, 2, 3, 4, 5])

nested.contents["y"]
# array([1.1, 2.2, 3.3, 4.4, 5.5])

Having a structure like this means that we can perform vectorized operations on the whole structure with relatively few Python instructions (the number of Python instructions scales with the complexity of the data type, not with the number of values in the dataset).
.. code-block:: python3
(nested + 100).tolist()
# [{'x': 101, 'y': 101.1},
# {'x': 102, 'y': 102.2},
# {'x': 103, 'y': 103.3},
# {'x': 104, 'y': 104.4},
# {'x': 105, 'y': 105.5}]

(nested + numpy.arange(100, 600, 100)).tolist()
# [{'x': 101, 'y': 101.1},
# {'x': 202, 'y': 202.2},
# {'x': 303, 'y': 303.3},
# {'x': 404, 'y': 404.4},
# {'x': 505, 'y': 505.5}]

It's less obvious that variable-length data can be represented in a columnar format, but it can.
.. code-block:: python3
varlen = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6], [7.7, 8.8, 9.9]])
varlen
#

Unlike Numpy's ``dtype=object`` array, the inner lists are *not* Python lists and the numerical values *are* contiguous in memory. This is made possible by representing the structure (where each inner list starts and stops) in one array and the values in another.
.. code-block:: python3
varlen.counts, varlen.content
# (array([3, 0, 2, 1, 3]), array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]))

(For fast random access, the more basic representation is ``varlen.offsets``, which is in turn a special case of a ``varlen.starts, varlen.stops`` pair. These details are discussed below.)
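How counts, offsets, and a starts/stops pair relate can be sketched in plain Numpy. This is not the library's implementation; it assumes that offsets are the running total of counts with a leading zero, and ``inner_list`` is a hypothetical helper written for this illustration.

```python
import numpy

counts = numpy.array([3, 0, 2, 1, 3])
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# offsets has one more entry than counts: a running total starting at zero.
offsets = numpy.concatenate([[0], numpy.cumsum(counts)])

# starts and stops are two overlapping views of the same offsets array.
starts, stops = offsets[:-1], offsets[1:]

def inner_list(i):
    # Random access to inner list i is a plain slice (a view) of content.
    return content[starts[i]:stops[i]]

print(offsets)        # [0 3 3 5 6 9]
print(inner_list(2))  # [4.4 5.5]
print(inner_list(1))  # []
```

With only counts, finding where inner list ``i`` begins would need a sum over all earlier counts; with offsets it is a single lookup, which is why offsets suit random access better.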
A structure like this can be broadcast like Numpy with a small number of Python instructions (scales with the complexity of the data type, not the number of values).
.. code-block:: python3
varlen + 100
#

varlen + numpy.arange(100, 600, 100)
#

You can even slice this object as though it were multidimensional (each element is a tensor of the same rank, but with a different size in each dimension).
.. code-block:: python3
# Skip the first two inner lists; skip the last value in each inner list that remains.
varlen[2:, :-1]
#

The data are not rectangular, so some inner lists might not have as many elements as your selection. Don't worry: you'll get error messages.
.. code-block:: python3
try:
    varlen[:, 1]
except Exception as err:
    print(type(err), str(err))
# <class 'IndexError'> index 1 is out of bounds for jagged min size 0

Masking with the ``.counts`` is handy because all the Numpy advanced indexing rules apply (in an extended sense) to jagged arrays.
.. code-block:: python3
varlen[varlen.counts > 1, 1]
# array([2.2, 5.5, 8.8])

I've only presented the two most important Awkward Array classes, ``Table`` and ``JaggedArray`` (and not how they combine). Each class is presented in more detail below. For now, I'd just like to point out that you can make crazy complicated data structures
.. code-block:: python3
crazy = awkward0.fromiter([[1.21, 4.84, None, 10.89, None],
[19.36, [30.25]],
[{"x": 36, "y": {"z": 49}}, None, {"x": 64, "y": {"z": 81}}]
])

and they vectorize and slice as expected.
.. code-block:: python3
numpy.sqrt(crazy).tolist()
# [[1.1, 2.2, None, 3.3000000000000003, None],
# [4.4, [5.5]],
# [{'x': 6.0, 'y': {'z': 7.0}}, None, {'x': 8.0, 'y': {'z': 9.0}}]]

This is because any Awkward Array can be the content of any other Awkward Array. Like Numpy, the features of Awkward Array are simple, yet compose nicely to let you build what you need.
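Two of the jagged operations shown in this introduction, broadcasting one value per inner list and picking the second element of every list that has one, can be reproduced on the counts/content pair with a handful of Numpy calls. This is a sketch of the idea under that two-array representation, not the library's own code.

```python
import numpy

counts = numpy.array([3, 0, 2, 1, 3])
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
starts = numpy.cumsum(counts) - counts   # where each inner list begins

# Broadcasting: repeat each per-list value once per element of its inner
# list, then add to the flat content. The structure (counts) is unchanged.
per_list = numpy.arange(100, 600, 100)
broadcast_content = content + numpy.repeat(per_list, counts)

# Counterpart of varlen[varlen.counts > 1, 1]: the second element of every
# inner list that has at least two elements.
has_second = counts > 1
second = content[starts[has_second] + 1]

print(broadcast_content)
print(second)   # [2.2 5.5 8.8]
```

The number of Python-level calls here is fixed; only the array lengths grow with the dataset.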
Overview with sample datasets
-----------------------------

Many of the examples in this tutorial use ``awkward0.fromiter`` to make Awkward Arrays from lists and ``array.tolist()`` to turn them back into lists (or dicts for ``Table``, tuples for ``Table`` with anonymous fields, Python objects for ``ObjectArrays``, etc.). These should be considered slow methods, since Python instructions are executed in the loop, but that's a necessary part of examining or building Python objects.
Ideally, you'd want to get your data from a binary, columnar source and produce binary, columnar output, or convert only once and reuse the converted data. `Parquet `__ is a popular columnar format for storing data on disk and `Arrow `__ is a popular columnar format for sharing data in memory (between functions or applications). `ROOT `__ is a popular columnar format for particle physicists, and `uproot `__ natively produces Awkward Arrays from ROOT files.
`HDF5 `__ and its Python library `h5py `__ are columnar, but only for rectangular arrays, unlike the others mentioned here. Awkward Array can *wrap* HDF5 with an interpretation layer to store columnar data structures, but then the Awkward Array library would be needed to read the data back in a meaningful way. Awkward also has a native file format, ``.awkd`` files, which are simply ZIP archives of columns as binary blobs and metadata (just as Numpy's ``.npz`` is a ZIP of arrays with metadata). The HDF5, awkd, and pickle serialization procedures use the same protocol, which has backward and forward compatibility features. (Note: these storage formats are not compatible with Awkward 1.0 onward.)
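The "ZIP archive of columns" idea can be illustrated with Numpy's own ``.npz`` container. To be clear, this is *not* the ``.awkd`` format and does not use the Awkward serialization protocol; it only shows, as an analogy, that a jagged array survives a round trip as two rectangular columns inside a ZIP file.

```python
import io
import zipfile
import numpy

counts = numpy.array([3, 0, 2, 1, 3])
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# Write both columns into one in-memory .npz file.
buffer = io.BytesIO()
numpy.savez(buffer, counts=counts, content=content)

# The result is an ordinary ZIP archive with one entry per column.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

# Read the columns back and rebuild the inner lists from them.
buffer.seek(0)
with numpy.load(buffer) as loaded:
    counts2 = loaded["counts"]
    content2 = loaded["content"]

lists = [x.tolist() for x in numpy.split(content2, numpy.cumsum(counts2)[:-1])]
print(sorted(names))
print(lists)
```

A reader that knows nothing about jaggedness sees only two flat arrays, which is the sense in which an interpretation layer is needed to read such data back meaningfully.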
NASA exoplanets from a Parquet file
"""""""""""""""""""""""""""""""""""

Let's start by opening a Parquet file. Awkward reads Parquet through the `pyarrow `__ module, which is an optional dependency, so be sure you have it installed before trying the next line.
.. code-block:: python3
stars = awkward0.fromparquet("tests/samples/exoplanets.parquet")
stars
# ... ] at 0x7f25b9c67780>

(There is also an ``awkward0.toparquet`` that takes the file name and array as arguments.)
Columns are accessible with square brackets and strings
.. code-block:: python3
stars["name"]
#

or by dot-attribute (if the name doesn't have weird characters and doesn't conflict with a method or property name).
.. code-block:: python3
stars.ra, stars.dec
# (,
# )

This file contains data about extrasolar planets and their host stars. As such, it's a ``Table`` full of Numpy arrays and ``JaggedArrays``. The star attributes (``"name"``, ``"ra"`` or right ascension in degrees, ``"dec"`` or declination in degrees, ``"dist"`` or distance in parsecs, ``"mass"`` in multiples of the sun's mass, and ``"radius"`` in multiples of the sun's radius) are plain Numpy arrays and the planet attributes (``"name"``, ``"orbit"`` or orbital distance in AU, ``"eccen"`` or eccentricity, ``"period"`` or periodicity in days, ``"mass"`` in multiples of Jupiter's mass, and ``"radius"`` in multiples of Jupiter's radius) are jagged because each star may have a different number of planets.
.. code-block:: python3
stars.planet_name
#

stars.planet_period, stars.planet_orbit
# (,
# )

For large arrays, only the first and last values are printed: the second-to-last star has three planets; all the other stars shown here have one planet.
These arrays are called ``ChunkedArrays`` because the Parquet file is lazily read in chunks (Parquet's row group structure). The ``ChunkedArray`` (subdivides the file) contains ``VirtualArrays`` (read one chunk on demand), which generate the ``JaggedArrays``. This is an illustration of how each Awkward class provides one feature, and you get desired behavior by combining them.
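The division of labor described above (one object subdivides the data, another reads a piece on demand) can be sketched in a few lines. The classes ``LazyChunk`` and ``Chunked`` below are hypothetical stand-ins written for this illustration; they are not the library's ``VirtualArray`` and ``ChunkedArray``, and the fixed chunk size is a simplifying assumption.

```python
import numpy

class LazyChunk:
    """Hypothetical stand-in for a chunk that is read on first use."""
    def __init__(self, generate):
        self._generate = generate
        self._array = None
        self.loads = 0   # how many times the data were actually produced

    @property
    def array(self):
        if self._array is None:
            self._array = self._generate()
            self.loads += 1
        return self._array

class Chunked:
    """Hypothetical stand-in for an array split into lazily read chunks."""
    def __init__(self, chunks, chunk_size):
        self.chunks = chunks
        self.chunk_size = chunk_size

    def __getitem__(self, i):
        # Only the chunk that contains index i is materialized.
        which, where = divmod(i, self.chunk_size)
        return self.chunks[which].array[where]

# Three chunks of four values each; nothing is generated yet.
chunks = [LazyChunk(lambda lo=lo: numpy.arange(lo, lo + 4)) for lo in (0, 4, 8)]
data = Chunked(chunks, chunk_size=4)

value = data[9]
print(value, [c.loads for c in chunks])   # 9 [0, 0, 1]
```

Each class contributes one feature, chunking or laziness, and the combination gives lazy chunked access.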
The ``ChunkedArrays`` and ``VirtualArrays`` support the same Numpy-like access as ``JaggedArray``, so we can compute with them just as we would any other array.
.. code-block:: python3
# distance in parsecs → distance in light years
stars.dist * 3.26156
#

# for all stars, drop the first planet
stars.planet_mass[:, 1:]
#

NASA exoplanets from an Arrow buffer
""""""""""""""""""""""""""""""""""""

The pyarrow implementation of Arrow is more complete than its implementation of Parquet, so we can use more features in the Arrow format, such as nested tables.
Unlike Parquet, which is intended as a file format, Arrow is a memory format. You might get an Arrow buffer as the output of another function, through interprocess communication, from a network RPC call, a message bus, etc. Arrow can be saved as files, though this isn't common. In this case, we'll get it from a file.
.. code-block:: python3
import pyarrow
arrow_buffer = pyarrow.ipc.open_file(open("tests/samples/exoplanets.arrow", "rb")).get_batch(0)
stars = awkward0.fromarrow(arrow_buffer)
stars
# ... ] at 0x7f25b94f2518>

(There is also an ``awkward0.toarrow`` that takes an Awkward Array as its only argument, returning the relevant Arrow structure.)
This file is structured differently. Instead of jagged arrays of numbers like ``"planet_mass"``, ``"planet_period"``, and ``"planet_orbit"``, this file has a jagged table of ``"planets"``. A jagged table is a ``JaggedArray`` of ``Table``.
.. code-block:: python3
stars["planets"]
# ] [] [] ... [] [ ] []] at 0x7f25b94fb080>

Notice that the square brackets are nested, but the contents are ``<Row>`` objects. The second-to-last star has three planets, as before.
We can find the non-jagged ``Table`` in the ``JaggedArray.content``.
.. code-block:: python3
stars["planets"].content
# ... ] at 0x7f25b94f2d68>

When viewed as Python lists and dicts, the ``'planets'`` field is a list of planet dicts, each with its own fields.
.. code-block:: python3
stars[:2].tolist()
# [{'dec': 17.792868,
# 'dist': 93.37,
# 'mass': 2.7,
# 'name': '11 Com',
# 'planets': [{'eccen': 0.231,
# 'mass': 19.4,
# 'name': 'b',
# 'orbit': 1.29,
# 'period': 326.03,
# 'radius': nan}],
# 'ra': 185.179276,
# 'radius': 19.0},
# {'dec': 71.823898,
# 'dist': 125.72,
# 'mass': 2.78,
# 'name': '11 UMi',
# 'planets': [{'eccen': 0.08,
# 'mass': 14.74,
# 'name': 'b',
# 'orbit': 1.53,
# 'period': 516.21997,
# 'radius': nan}],
# 'ra': 229.27453599999998,
# 'radius': 29.79}]

Despite being packaged in an arguably more intuitive way, we can still get jagged arrays of numbers by requesting ``"planets"`` and a planet attribute (two column selections) without specifying which star or which parent.
.. code-block:: python3
stars.planets.name
#

stars.planets.mass
#

Even though the ``Table`` is hidden inside the ``JaggedArray``, its ``columns`` pass through to the top.
.. code-block:: python3
stars.columns
# ['dec', 'dist', 'mass', 'name', 'planets', 'ra', 'radius']

stars.planets.columns
# ['eccen', 'mass', 'name', 'orbit', 'period', 'radius']

For a more global view of the structures contained within one of these arrays, print out its high-level type. ("High-level" because it presents logical distinctions, like jaggedness and tables, but not physical distinctions, like chunking and virtualness.)
.. code-block:: python3
print(stars.type)
# [0, 2935) -> 'dec' -> float64
# 'dist' -> float64
# 'mass' -> float64
# 'name' -> <class 'str'>
# 'planets' -> [0, inf) -> 'eccen' -> float64
# 'mass' -> float64
# 'name' -> <class 'str'>
# 'orbit' -> float64
# 'period' -> float64
# 'radius' -> float64
# 'ra' -> float64
# 'radius' -> float64

The above should be read like a function's data type: ``argument type -> return type`` for the function that takes an index in square brackets and returns something else. For example, the first ``[0, 2935)`` means that you could put any non-negative integer less than ``2935`` in square brackets after ``stars``, like this:
.. code-block:: python3
stars[1734]
#

and get an object that would take ``'dec'``, ``'dist'``, ``'mass'``, ``'name'``, ``'planets'``, ``'ra'``, or ``'radius'`` in its square brackets. The return type depends on which of those strings you provide.
.. code-block:: python3
stars[1734]["mass"] # type is float64
# 0.54

stars[1734]["name"] # type is <class 'str'>
# 'Kepler-186'

stars[1734]["planets"]
# ] at 0x7f25b94dc438>

The planets have their own table structure:
.. code-block:: python3
print(stars[1734]["planets"].type)
# [0, 5) -> 'eccen' -> float64
# 'mass' -> float64
# 'name' -> <class 'str'>
# 'orbit' -> float64
# 'period' -> float64
# 'radius' -> float64

Notice that within the context of ``stars``, the ``planets`` could take any non-negative integer ``[0, inf)``, but for a particular star, the allowed domain is known with more precision: ``[0, 5)``. This is because ``stars["planets"]`` is a jagged array (a different number of planets for each star), but one ``stars[1734]["planets"]`` is a simple array (five planets for *this* star).

Passing a non-negative integer less than 5 to this array, we get an object that takes one of six strings: ``'eccen'``, ``'mass'``, ``'name'``, ``'orbit'``, ``'period'``, and ``'radius'``.
.. code-block:: python3
stars[1734]["planets"][4]
#

and the return type of these depends on which string you provide.
.. code-block:: python3
stars[1734]["planets"][4]["period"] # type is float
# 129.9441

stars[1734]["planets"][4]["name"] # type is <class 'str'>
# 'f'

stars[1734]["planets"][4].tolist()
# {'eccen': 0.04,
# 'mass': nan,
# 'name': 'f',
# 'orbit': 0.432,
# 'period': 129.9441,
# 'radius': 0.10400000000000001}

(Incidentally, this is a `potentially habitable exoplanet `__, the first ever discovered.)
.. code-block:: python3
stars[1734]["name"], stars[1734]["planets"][4]["name"]
# ('Kepler-186', 'f')

Some of these arguments "commute" and others don't. Dimensional axes have a particular order, so you can't request a planet by its row number before selecting a star, but you can swap a column-selection (string) and a row-selection (integer). For a rectangular table, it's easy to see how you can slice column-first or row-first, but it even works when the table is jagged.
.. code-block:: python3
stars["planets"]["name"][1734][4]
# 'f'

stars[1734]["planets"][4]["name"]
# 'f'

None of these intermediate slices actually process data, so you can slice in any order that is logically correct without worrying about performance. Projections, even multi-column projections
.. code-block:: python3
orbits = stars["planets"][["name", "eccen", "orbit", "period"]]
orbits[1734].tolist()
# [{'name': 'b', 'eccen': nan, 'orbit': 0.0343, 'period': 3.8867907},
# {'name': 'c', 'eccen': nan, 'orbit': 0.0451, 'period': 7.267302},
# {'name': 'd', 'eccen': nan, 'orbit': 0.0781, 'period': 13.342996},
# {'name': 'e', 'eccen': nan, 'orbit': 0.11, 'period': 22.407704},
# {'name': 'f', 'eccen': 0.04, 'orbit': 0.432, 'period': 129.9441}]

are a useful way to restructure data without incurring a runtime cost.
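Why a column selection and a row selection can be swapped on a jagged table is easier to see on a toy struct of arrays. The sketch below uses made-up data (three stars with two, zero, and three planets) and two hypothetical helper functions; it illustrates the principle and is not how the library implements slicing.

```python
import numpy

# Made-up jagged table: star i owns rows offsets[i]:offsets[i + 1].
offsets = numpy.array([0, 2, 2, 5])
planets = {
    "name": numpy.array(["b", "c", "b", "c", "d"]),
    "period": numpy.array([3.9, 7.3, 10.0, 20.0, 40.0]),
}

def row_then_column(star, planet, column):
    # Slice the star's rows out of every column (views), then pick a column.
    rows = {name: values[offsets[star]:offsets[star + 1]]
            for name, values in planets.items()}
    return rows[column][planet]

def column_then_row(column, star, planet):
    # Pick the column first, then slice out the star and the planet.
    return planets[column][offsets[star]:offsets[star + 1]][planet]

print(row_then_column(2, 1, "name"), column_then_row("name", 2, 1))
```

Both orders reach the same element because the row range comes from ``offsets`` alone and applies equally to every column; the star index must still come before the planet index, since the planet index is only meaningful within one star's range.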
Relationship to Pandas
""""""""""""""""""""""

Arguably, this kind of dataset could be manipulated as a `Pandas DataFrame `__ instead of Awkward Arrays. Despite the variable number of planets per star, the exoplanets dataset could be flattened into a rectangular DataFrame, in which the distinction between solar systems is represented by a two-component index (leftmost pair of columns below), a `MultiIndex `__.
.. code-block:: python3
awkward0.topandas(stars, flatten=True)[-9:]
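
====  ==  ==========  ======  ====  ====  ======  ========  ====  ========  ===========  ======  ==========  ======
\         dec         dist    mass  name  planets                                                ra          radius
----  --  ----------  ------  ----  ----  -----------------------------------------------------  ----------  ------
\                                         eccen   mass      name  orbit     period       radius
====  ==  ==========  ======  ====  ====  ======  ========  ====  ========  ===========  ======  ==========  ======
2931  0   -15.937480  3.60    0.78  49    0.1800  0.01237   101   0.538000  162.870000   NaN     26.017012   NaN
2931  1   -15.937480  3.60    0.78  49    0.1600  0.01237   102   1.334000  636.130000   NaN     26.017012   NaN
2931  2   -15.937480  3.60    0.78  49    0.0600  0.00551   103   0.133000  20.000000    NaN     26.017012   NaN
2931  3   -15.937480  3.60    0.78  49    0.2300  0.00576   104   0.243000  49.410000    NaN     26.017012   NaN
2932  0   30.245163   112.64  2.30  53    0.0310  20.60000  98    1.170000  305.500000   NaN     107.784882  26.80
2933  0   41.405460   13.41   1.30  48    0.0215  0.68760   98    0.059222  4.617033     NaN     24.199345   1.56
2933  1   41.405460   13.41   1.30  48    0.2596  1.98100   99    0.827774  241.258000   NaN     24.199345   1.56
2933  2   41.405460   13.41   1.30  48    0.2987  4.13200   100   2.513290  1276.460000  NaN     24.199345   1.56
2934  0   8.461452    56.27   2.20  55    0.0000  2.80000   98    0.680000  136.750000   NaN     298.562012  12.00
====  ==  ==========  ======  ====  ====  ======  ========  ====  ========  ===========  ======  ==========  ======
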
In this representation, each star's attributes must be duplicated for all of its planets, and it is not possible to show stars that have no planets (not present in this dataset), but the information is preserved in a way that Pandas can recognize and operate on. (For instance, ``.unstack()`` would widen each planet attribute into a separate column per planet and simplify the index to strictly one row per star.)
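The two-component index and the duplication of star attributes can both be derived from the jagged counts alone. The following Numpy sketch uses made-up numbers (three stars with one, three, and two planets) and does not call Pandas or ``awkward0.topandas``; it only shows what flattening has to compute.

```python
import numpy

# Made-up example: 3 stars with 1, 3, and 2 planets.
counts = numpy.array([1, 3, 2])
star_mass = numpy.array([2.7, 0.78, 1.3])   # one value per star
planet_period = numpy.array([326.0, 162.9, 636.1, 20.0, 4.6, 241.3])   # one per planet

starts = numpy.cumsum(counts) - counts

# First index level: which star each planet row belongs to.
star_index = numpy.repeat(numpy.arange(len(counts)), counts)

# Second index level: position of each planet within its star.
planet_index = numpy.arange(counts.sum()) - numpy.repeat(starts, counts)

# Star attributes are duplicated once per planet.
flat_star_mass = star_mass[star_index]

for row in zip(star_index.tolist(), planet_index.tolist(),
               flat_star_mass.tolist(), planet_period.tolist()):
    print(row)
```

A star with a count of zero contributes no entries to ``star_index``, which is why stars without planets cannot appear in the flattened table.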
The limitation is that only a single jagged structure can be represented by a DataFrame. The structure can be arbitrarily deep in ``Tables`` (which add depth to the column names),
.. code-block:: python3
array = awkward0.fromiter([{"a": {"b": 1, "c": {"d": [2]}}, "e": 3},
{"a": {"b": 4, "c": {"d": [5, 5.1]}}, "e": 6},
{"a": {"b": 7, "c": {"d": [8, 8.1, 8.2]}}, "e": 9}])
awkward0.topandas(array, flatten=True)

==  ==  ===  ===  ===
\       a         e
--  --  --------  ---
\       b    c
\            d
==  ==  ===  ===  ===
0   0   1    2.0  3
1   0   4    5.0  6
1   1   4    5.1  6
2   0   7    8.0  9
2   1   7    8.1  9
2   2   7    8.2  9
==  ==  ===  ===  ===

and arbitrarily deep in ``JaggedArrays`` (which add depth to the row names),
.. code-block:: python3
array = awkward0.fromiter([{"a": 1, "b": [[2.2, 3.3, 4.4], [], [5.5, 6.6]]},
{"a": 10, "b": [[1.1], [2.2, 3.3], [], [4.4]]},
{"a": 100, "b": [[], [9.9]]}])
awkward0.topandas(array, flatten=True)

==  ==  ==  ===  ===
\           a    b
==  ==  ==  ===  ===
0   0   0   1    2.2
0   0   1   1    3.3
0   0   2   1    4.4
0   2   0   1    5.5
0   2   1   1    6.6
1   0   0   10   1.1
1   1   0   10   2.2
1   1   1   10   3.3
1   3   0   10   4.4
2   1   0   100  9.9
==  ==  ==  ===  ===

and they can even have two ``JaggedArrays`` at the same level if their number of elements is the same (at all levels of depth).
.. code-block:: python3
array = awkward0.fromiter([{"a": [[1.1, 2.2, 3.3], [], [4.4, 5.5]], "b": [[1, 2, 3], [], [4, 5]]},
{"a": [[1.1], [2.2, 3.3], [], [4.4]], "b": [[1], [2, 3], [], [4]]},
{"a": [[], [9.9]], "b": [[], [9]]}])
awkward0.topandas(array, flatten=True)

==  ==  ==  ==  ===  ===
\               a    b
==  ==  ==  ==  ===  ===
0   0   0   0   1.1  1
0   0   1   1   2.2  2
0   0   2   2   3.3  3
0   2   0   0   4.4  4
0   2   1   1   5.5  5
1   0   0   0   1.1  1
1   1   0   0   2.2  2
1   1   1   1   3.3  3
1   3   0   0   4.4  4
2   1   0   0   9.9  9
==  ==  ==  ==  ===  ===

But if there are two ``JaggedArrays`` with *different* structure at the same level, a single DataFrame cannot represent them.
.. code-block:: python3
array = awkward0.fromiter([{"a": [1, 2, 3], "b": [1.1, 2.2]},
{"a": [1], "b": [1.1, 2.2, 3.3]},
{"a": [1, 2], "b": []}])
try:
    awkward0.topandas(array, flatten=True)
except Exception as err:
    print(type(err), str(err))
# this array has more than one jagged array structure

To describe data like these, you'd need two DataFrames, and any calculations involving both ``"a"`` and ``"b"`` would have to include a join on those DataFrames. Awkward Arrays are not limited in this way: the last ``array`` above is a valid Awkward Array and is useful for calculations that mix ``"a"`` and ``"b"``.
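The reason a single DataFrame cannot hold both columns becomes visible when the row index implied by each column is written out. The sketch below takes the list lengths of ``"a"`` and ``"b"`` from the last example; ``flat_index`` is a hypothetical helper, not a library function.

```python
import numpy

counts_a = numpy.array([3, 1, 2])   # lengths of the "a" lists above
counts_b = numpy.array([2, 3, 0])   # lengths of the "b" lists above

def flat_index(counts):
    # (outer, inner) index pairs that flattening one jagged column implies.
    outer = numpy.repeat(numpy.arange(len(counts)), counts)
    starts = numpy.cumsum(counts) - counts
    inner = numpy.arange(counts.sum()) - numpy.repeat(starts, counts)
    return list(zip(outer.tolist(), inner.tolist()))

index_a = flat_index(counts_a)
index_b = flat_index(counts_b)

print(index_a)
print(index_b)
print(index_a == index_b)   # False
```

Since the two columns imply different sets of rows, aligning them in one table would require a join, whereas the jagged representation keeps each column with its own counts.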
LHC data from a ROOT file
"""""""""""""""""""""""""

Particle physicists need structures like these; in fact, they have been a staple of particle physics analyses for decades. The `ROOT `__ file format was developed in the mid-1990s to serialize arbitrary C++ data structures in a columnar way (replacing ZEBRA and similar Fortran projects that date back to the 1970s). The `PyROOT `__ library dynamically wraps these objects to present them in Python, though with a performance penalty. The `uproot `__ library reads columnar data directly from ROOT files in Python without intermediary C++.
.. code-block:: python3
import uproot3
events = uproot3.open("http://scikit-hep.org/uproot3/examples/HZZ-objects.root")["events"].lazyarrays()
events
# ... ] at 0x781189cd7b70>

events.columns
# ['jetp4',
# 'jetbtag',
# 'jetid',
# 'muonp4',
# 'muonq',
# 'muoniso',
# 'electronp4',
# 'electronq',
# 'electroniso',
# 'photonp4',
# 'photoniso',
# 'MET',
# 'MC_bquarkhadronic',
# 'MC_bquarkleptonic',
# 'MC_wdecayb',
# 'MC_wdecaybbar',
# 'MC_lepton',
# 'MC_leptonpdgid',
# 'MC_neutrino',
# 'num_primaryvertex',
# 'trigger_isomu24',
# 'eventweight']

This is a typical particle physics dataset (though small!) in that it represents the momentum and energy (``"p4"`` for `Lorentz 4-momentum `__) of several different species of particles: ``"jet"``, ``"muon"``, ``"electron"``, and ``"photon"``. Each collision can produce a different number of particles in each species. Other variables, such as missing transverse energy or ``"MET"``, have one value per collision event. Events with zero particles in a species are valuable for the event-level data.
.. code-block:: python3
# The first event has two muons.
events.muonp4
#

# The first event has zero jets.
events.jetp4
#

# Every event has exactly one MET.
events.MET
#

Unlike the exoplanet data, these events cannot be represented as a DataFrame because of the different numbers of particles in each species and because zero-particle events have value. Even with just ``"muonp4"``, ``"jetp4"``, and ``"MET"``, there is no translation.
.. code-block:: python3
try:
    awkward0.topandas(events[["muonp4", "jetp4", "MET"]], flatten=True)
except Exception as err:
    print(type(err), str(err))
# this array has more than one jagged array structure

It could be described as a collection of DataFrames, in which every operation relating particles in the same event would require a join. But that would make analysis harder, not easier. An event has meaning on its own.

.. code-block:: python3

    events[0].tolist()
    # {'jetp4': [],
    #  'jetbtag': [],
    #  'jetid': [],
    #  'muonp4': [TLorentzVector(-52.899, -11.655, -8.1608, 54.779),
    #             TLorentzVector(37.738, 0.69347, -11.308, 39.402)],
    #  'muonq': [1, -1],
    #  'muoniso': [4.200153350830078, 2.1510612964630127],
    #  'electronp4': [],
    #  'electronq': [],
    #  'electroniso': [],
    #  'photonp4': [],
    #  'photoniso': [],
    #  'MET': TVector2(5.9128, 2.5636),
    #  'MC_bquarkhadronic': TVector3(0, 0, 0),
    #  'MC_bquarkleptonic': TVector3(0, 0, 0),
    #  'MC_wdecayb': TVector3(0, 0, 0),
    #  'MC_wdecaybbar': TVector3(0, 0, 0),
    #  'MC_lepton': TVector3(0, 0, 0),
    #  'MC_leptonpdgid': 0,
    #  'MC_neutrino': TVector3(0, 0, 0),
    #  'num_primaryvertex': 6,
    #  'trigger_isomu24': True,
    #  'eventweight': 0.009271008893847466}

Particle physics isn't alone in this: JSON-formatted log files from production systems and allele likelihoods in genomics are two other cases where variable-length, nested structures can help. Arbitrary data structures are useful, and working with them in columns provides a new way to do exploratory data analysis: one array at a time.
Awkward Array data model
------------------------

Awkward Array features are provided by a suite of classes that each extend Numpy arrays in one small way. These classes may then be composed to combine features.
In this sense, Numpy arrays are Awkward Array's most basic array class. A Numpy array is a small Python object that points to a large, contiguous region of memory, and, as much as possible, operations replace or change the small Python object, not the big data buffer. Therefore, many Numpy operations are *views*, rather than *in-place operations* or *copies*, leaving the original value intact but returning a new value that is linked to the original. Assigning to arrays and in-place operations are allowed, but they are more complicated to use because one must be aware of which arrays are views and which are copies.
Awkward Array's model is to treat all arrays as though they were immutable, favoring views over copies, and not providing any high-level in-place operations on low-level memory buffers (i.e. no in-place assignment).
Numpy provides complete control over the interpretation of an ``N`` dimensional array. A Numpy array has a `dtype `__ to interpret bytes as signed and unsigned integers of various bit-widths, floating-point numbers, booleans, little endian and big endian, fixed-width bytestrings (for applications such as 6-byte MAC addresses or human-readable strings with padding), or `record arrays `__ for contiguous structures. A Numpy array has a `pointer `__ to the first element of its data buffer (``array.ctypes.data``) and a `shape `__ to describe its ``N`` dimensions as a rank-``N`` tensor. Only ``shape[0]`` is the length as returned by the Python function ``len``. Furthermore, an `order `__ flag determines if rank > 1 arrays are laid out in "C" order or "Fortran" order. A Numpy array also has a `stride `__ to determine how many bytes separate one element from the next. (Data in a Numpy array need not be strictly contiguous, but they must be regular: the number of bytes separating them is a constant.) This stride may even be negative to describe a reversed view of an array, which allows any ``slice`` of an array, even those with ``skip != 1``, to be a view rather than a copy. Numpy arrays also have flags to determine whether they `own `__ their data buffer (and should therefore delete it when the Python object goes out of scope) and whether the data buffer is `writable `__.
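As a small illustration of the view mechanics described above, the following sketch uses only standard Numpy calls. The 8-byte strides assume a 64-bit integer ``dtype``, which is requested explicitly; the variable names are arbitrary.

```python
import numpy as np

a = np.arange(6, dtype=np.int64)   # one contiguous buffer of six 8-byte integers
r = a[::-1]                        # reversed view: same buffer, negative stride
s = a[::2]                         # every other element: same buffer, doubled stride
c = a[[0, 2, 4]]                   # integer-array indexing makes a copy instead

print(a.strides, r.strides, s.strides)   # (8,) (-8,) (16,)
print(np.shares_memory(a, r), np.shares_memory(a, s), np.shares_memory(a, c))
# True True False

a[0] = 99                  # an in-place change to the shared buffer...
print(r[-1], s[0], c[0])   # 99 99 0  (...is visible through views, not the copy)

m = np.zeros((3, 4))
print(len(m), m.shape)     # 3 (3, 4): only shape[0] is the length
```

The in-place assignment is shown only to make the sharing visible; it is exactly the kind of operation that Awkward Array avoids at the high level.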
The biggest restriction on this data model is that Numpy arrays are strictly rectangular. The ``shape`` and ``stride`` are constants, enforcing a regular layout. Awkward's ``JaggedArray`` is a generalization of Numpy's rank-2 arrays (that is, arrays of arrays) in that the inner arrays of a ``JaggedArray`` may all have different lengths. For higher ranks, such as arrays of arrays of arrays, put a ``JaggedArray`` inside another as its ``content``. An important special case of ``JaggedArray`` is ``StringArray``, whose ``content`` is interpreted as characters (with or without encoding), which represents an array of strings without the padding that Numpy's fixed-width strings require.
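To make the idea concrete, here is a minimal plain-Python sketch of a jagged layout: one flat ``content`` plus ``starts`` and ``stops`` that delimit each inner array. It is not the library's implementation; the helper names ``inner`` and ``to_list`` are hypothetical.

```python
# Flat storage for [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]]
content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
starts = [0, 3, 3, 5, 8]   # where each inner array begins in content
stops = [3, 3, 5, 8, 9]    # where each inner array ends (exclusive)

def inner(content, starts, stops, i):
    """Return the i-th inner array as a slice of the flat content."""
    return content[starts[i]:stops[i]]

def to_list(content, starts, stops):
    """Rebuild the nested Python lists from the flat representation."""
    return [inner(content, starts, stops, i) for i in range(len(starts))]

jagged = to_list(content, starts, stops)
print(jagged)
# [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]]

# The same layout with characters as content gives strings without padding.
strings = to_list("onetwothree", [0, 3, 6], [3, 6, 11])
print(strings)   # ['one', 'two', 'three']
```

Note that the empty inner array costs nothing in ``content``: its start and stop are simply equal.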
Although Numpy's `record arrays `__ present a buffer as a table, with differently typed, named columns, that table must be contiguous or interleaved (with non-trivial ``strides``) in memory: an `array of structs `__. Awkward's ``Table`` provides the same interface, except that each column may be anywhere in memory, stored in a ``contents`` dict mapping field names to arrays. This is a true generalization: a ``Table`` may be a wrapped view of a Numpy record array, but not vice-versa. Use a ``Table`` anywhere you'd have a record/class/struct in non-columnar data structures. A ``Table`` with anonymous (integer-valued, rather than string-valued) fields is like an array of strongly typed tuples.
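The difference between an array of structs and independent columns can be sketched in plain Python. This is only an illustration of the layout, assuming a dict named ``contents`` as described above; the ``row`` helper is hypothetical.

```python
# Columnar layout: each field is its own array and may live anywhere in memory.
contents = {
    "x": [1, 2, 3],
    "y": [1.1, 2.2, 3.3],
}

def row(contents, i):
    """Assemble one record on demand from the i-th element of every column."""
    return {name: column[i] for name, column in contents.items()}

first = row(contents, 0)
print(first)   # {'x': 1, 'y': 1.1}

# Adding a column leaves the existing columns untouched (no rows are rewritten).
x_before = contents["x"]
contents["z"] = ["one", "two", "three"]
second = row(contents, 1)
print(second)  # {'x': 2, 'y': 2.2, 'z': 'two'}
```

In an array-of-structs layout, by contrast, adding a field would mean rewriting every record.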
Numpy has a `masked array `__ module for nullable data—values that may be "missing" (like Python's ``None``). Naturally, the only kinds of arrays Numpy can mask are subclasses of its own ``ndarray``, and we need to be able to mask any Awkward Array, so the Awkward library defines its own ``MaskedArray``. Additionally, we sometimes want to mask with bits, rather than bytes (e.g. for Arrow compatibility), so there's a ``BitMaskedArray``, and sometimes we want to mask large structures without using memory for the masked-out values, so there's an ``IndexedMaskedArray`` (fusing the functionality of a ``MaskedArray`` with an ``IndexedArray``).
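The three masking strategies can be compared side by side in a plain-Python sketch. The conventions used here (``True`` or a set bit means "missing", bits are packed least-significant first, and ``-1`` means "missing" in the index) are assumptions made for the illustration, not a statement about the library's defaults.

```python
content = [1.1, 2.2, 3.3, 4.4]

# 1. Byte mask: one boolean per element.
byte_mask = [False, True, False, False]
masked = [None if m else x for x, m in zip(content, byte_mask)]

# 2. Bit mask: one bit per element, packed into an integer.
bits = 0b0010   # bit 1 is set, so element 1 is missing
bit_masked = [None if (bits >> i) & 1 else x for i, x in enumerate(content)]

# 3. Indexed mask: the content holds only the values that exist.
compact = [1.1, 3.3, 4.4]
index = [0, -1, 1, 2]
indexed_masked = [None if i < 0 else compact[i] for i in index]

print(masked)           # [1.1, None, 3.3, 4.4]
print(bit_masked)       # [1.1, None, 3.3, 4.4]
print(indexed_masked)   # [1.1, None, 3.3, 4.4]
```

All three describe the same logical array; they differ only in how much memory the mask and the masked-out values occupy.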
Numpy has no provision for an array containing different data types ("heterogeneous"), but Awkward Array has a ``UnionArray``. The ``UnionArray`` stores data for each type as separate ``contents`` and identifies the types and positions of each element in the ``contents`` using ``tags`` and ``index`` arrays (equivalent to Arrow's `dense union type `__ with ``types`` and ``offsets`` buffers). As a data type, unions are a counterpart to records or tuples (making ``UnionArray`` a counterpart to ``Table``): each record/tuple contains *all* of its ``contents`` but a union contains *any* of its ``contents``. (Note that a ``UnionArray`` may be the best way to interleave two arrays, even if they have the same type. Heterogeneity is not a necessary feature of a ``UnionArray``.)
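A dense union can be sketched with plain lists: ``tags`` says which of the ``contents`` each element comes from and ``index`` says where in that content it sits. This is an illustrative sketch, not the library's code.

```python
contents = [
    [1.1, 2.2, 3.3],    # content 0: floats
    ["one", "two"],     # content 1: strings
]
tags = [0, 1, 0, 0, 1]    # which content each element belongs to
index = [0, 0, 1, 2, 1]   # position of the element within that content

union = [contents[t][i] for t, i in zip(tags, index)]
print(union)   # [1.1, 'one', 2.2, 3.3, 'two']

# The same mechanism interleaves two arrays of the same type.
evens_odds = [[1, 3, 5], [2, 4, 6]]
interleaved = [evens_odds[t][i]
               for t, i in zip([0, 1, 0, 1, 0, 1], [0, 0, 1, 1, 2, 2])]
print(interleaved)   # [1, 2, 3, 4, 5, 6]
```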
Numpy has a ``dtype=object`` for arrays of Python objects, but Awkward's ``ObjectArray`` creates Python objects on demand from array data. A large dataset of some ``Point`` class, containing floating-point members ``x`` and ``y``, can be stored as an ``ObjectArray`` of a ``Table`` of ``x`` and ``y`` with much less memory than a Numpy array of ``Point`` objects. The ``ObjectArray`` has a ``generator`` function that produces Python objects from array elements. ``StringArray`` is also a special case of ``ObjectArray``, which instantiates variable-length character contents as Python strings.
Although an ``ObjectArray`` can save memory, creating Python objects in a loop may still use more computation time than is necessary. Therefore, Awkward Arrays can also have vectorized ``Methods``: bound functions that operate on the array data, rather than instantiating every Python object in an ``ObjectArray``. Although an ``ObjectArray`` is a good use-case for ``Methods``, any Awkward Array can have them. (The second most common case is a ``JaggedArray`` of ``ObjectArrays``.)
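The two ideas above (objects created on demand, and methods that work on whole columns) can be sketched as follows. ``LazyPoints`` and its ``magnitude`` method are hypothetical stand-ins written for this illustration; they are not part of the library.

```python
import math
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

class LazyPoints:
    """Stores columns of x and y; builds a Point only when one is requested."""

    def __init__(self, xs, ys, generator):
        self.xs, self.ys, self.generator = xs, ys, generator

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        # The object is created here, on demand, from array data.
        return self.generator(self.xs[i], self.ys[i])

    def magnitude(self):
        # Column-wise method: computed from the arrays, no Point objects made.
        return [math.hypot(x, y) for x, y in zip(self.xs, self.ys)]

points = LazyPoints([3.0, 6.0], [4.0, 8.0], Point)
print(points[0])            # Point(x=3.0, y=4.0)
print(points.magnitude())   # [5.0, 10.0]
```

With real arrays, the body of ``magnitude`` would be a single vectorized expression instead of a Python loop.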
The nesting of Awkward Arrays within Awkward Arrays need not be tree-like: they can have cross-references and cyclic references (using ordinary Python assignment). ``IndexedArray`` can aid in building complex structures: it is simply an integer ``index`` that would be applied to its ``content`` with `integer array indexing `__ to get any element. ``IndexedArray`` is the equivalent of a pointer in non-columnar data structures.
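A plain-Python sketch of this indirection, under the assumption that ``index`` holds positions in ``content``:

```python
content = ["apple", "banana", "cherry"]   # each distinct value stored once
index = [2, 0, 0, 1, 2]                   # "pointers" into content

indexed = [content[i] for i in index]
print(indexed)   # ['cherry', 'apple', 'apple', 'banana', 'cherry']
```

Because repeated values are stored once and referenced many times, the same layout also serves as a dictionary encoding.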
The counterpart of an ``IndexedArray`` is a ``SparseArray``: whereas an ``IndexedArray`` consists of pointers *to* elements of its ``content``, a ``SparseArray`` consists of pointers *from* elements of its ``content``, representing a very large array in terms of its non-zero (or non-``default``) elements. Awkward's ``SparseArray`` is a `coordinate format (COO) `__, one-dimensional array.
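A one-dimensional coordinate-format lookup can be sketched in a few lines. The names ``length``, ``index``, ``content``, and ``default`` are chosen for this illustration, and the sketch assumes ``index`` is sorted so that a binary search can be used.

```python
from bisect import bisect_left

length = 10             # logical length of the array
index = [2, 7]          # sorted positions that hold a non-default value
content = [3.14, 2.71]  # the values at those positions
default = 0.0

def get(i):
    """Return the logical element i without materializing the whole array."""
    if not 0 <= i < length:
        raise IndexError(i)
    pos = bisect_left(index, i)
    if pos < len(index) and index[pos] == i:
        return content[pos]
    return default

dense = [get(i) for i in range(length)]
print(dense)
# [0.0, 0.0, 3.14, 0.0, 0.0, 0.0, 0.0, 2.71, 0.0, 0.0]
```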
Another limitation of Numpy is that arrays cannot span multiple memory buffers. Awkward's ``ChunkedArray`` represents a single logical array made of physical ``chunks`` that may be anywhere in memory. A ``ChunkedArray``'s ``chunksizes`` may be known or unknown. One application of ``ChunkedArray`` is to append data to an array without allocating on every call: ``AppendableArray`` allocates memory in equal-sized chunks.
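Indexing a chunked array amounts to finding the right chunk and then the position within it. The following plain-Python sketch assumes the chunk sizes are known; the ``offsets`` list and the ``get`` helper are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

chunks = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]   # separate buffers, unequal sizes
offsets = [0] + list(accumulate(len(c) for c in chunks))   # [0, 3, 5, 9]

def get(i):
    """Translate a logical index into (chunk, position) and fetch the value."""
    if not 0 <= i < offsets[-1]:
        raise IndexError(i)
    which = bisect_right(offsets, i) - 1
    return chunks[which][i - offsets[which]]

logical = [get(i) for i in range(offsets[-1])]
print(logical)   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The chunks are never concatenated; only the small ``offsets`` list is needed to present them as one array.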
Another application of ``ChunkedArray`` is to lazily load data in chunks. Awkward's ``VirtualArray`` calls its ``generator`` function to materialize an array when needed, and a ``ChunkedArray`` of ``VirtualArrays`` is a classic lazy-loading array, used to gradually read Parquet and ROOT files. In most libraries, lazy-loading is not a part of the data but a feature of the reading interface. Nesting virtualness makes it possible to load ``Tables`` within ``Tables``, where even the columns of the inner ``Tables`` are on-demand.
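Lazy loading can be sketched as a wrapper that calls its ``generator`` the first time data are needed and caches the result. The ``Lazy`` class and ``load_chunk`` function below are hypothetical stand-ins for illustration; a real generator would read from a file.

```python
class Lazy:
    """Materializes its array by calling generator() on first access."""

    def __init__(self, generator):
        self.generator = generator
        self._array = None

    @property
    def array(self):
        if self._array is None:
            self._array = self.generator()
        return self._array

    def __getitem__(self, i):
        return self.array[i]

calls = []   # records which chunks were actually loaded

def load_chunk(n):
    def generate():
        calls.append(n)                       # stands in for reading a file
        return list(range(3 * n, 3 * n + 3))
    return generate

lazy_chunks = [Lazy(load_chunk(n)) for n in range(3)]
print(calls)               # []  (nothing has been loaded yet)
print(lazy_chunks[1][0])   # 3   (loads chunk 1 only)
print(lazy_chunks[1][2])   # 5   (served from the cached chunk)
print(calls)               # [1]
```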
For more details, see `array classes `__.
* `Jaggedness `__

  * `JaggedArray `__
  * `Helper functions `__

* `Product types `__

  * `Table `__

* `Sum types `__

  * `UnionArray `__

* `Option types `__

  * `MaskedArray `__
  * `BitMaskedArray `__
  * `IndexedMaskedArray `__

* `Indirection `__

  * `IndexedArray `__
  * `SparseArray `__
  * `Helper functions `__

* `Opaque objects `__

  * `Mix-in Methods `__
  * `ObjectArray `__
  * `StringArray `__

* `Non-contiguousness `__

  * `ChunkedArray `__
  * `AppendableArray `__

* `Laziness `__

  * `VirtualArray `__

Mutability
""""""""""

Awkward Arrays are considered immutable in the sense that elements of the data cannot be modified in-place. That is, assignment with square brackets at an integer index raises an error. Awkward does not prevent the underlying Numpy arrays from being modified in-place, though that can lead to confusing results; the behavior is left undefined. The reason for this omission in functionality is that the internal representation of columnar data structures is more constrained than their non-columnar counterparts: some in-place modifications can't be defined, and others have surprising side-effects.
However, the Python objects representing Awkward Arrays can be changed in-place. Each class has properties defining its structure, such as ``content``, and these may be replaced at any time. (Replacing properties does not change values in any Numpy arrays.) In fact, this is the only way to build cyclic references: an object in Python must be assigned to a name before that name can be used as a reference.
Awkward Arrays are appendable, but only through ``AppendableArray``, and ``Table`` columns may be added, changed, or removed. The only use of square-bracket assignment (i.e. ``__setitem__``) is to modify ``Table`` columns.
Awkward Arrays produced by an external program may grow continuously, as long as more deeply nested arrays are filled first. That is, the ``content`` of a ``JaggedArray`` must be updated before updating its structure arrays (``starts`` and ``stops``). The definitions of Awkward Array validity allow for nested elements with no references pointing at them ("unreachable" elements), but not for references pointing to a nested element that doesn't exist.
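The fill-order rule can be illustrated with a plain-Python validity check on a jagged layout. The ``is_valid`` function is a hypothetical simplification of the validity rules described above: every ``starts``/``stops`` pair must point inside ``content``, while unreferenced content is allowed.

```python
content = [1.1, 2.2, 3.3]
starts, stops = [0], [3]    # one inner array: [1.1, 2.2, 3.3]

def is_valid(content, starts, stops):
    """References may not point past the end of content."""
    return len(starts) == len(stops) and all(
        0 <= a <= b <= len(content) for a, b in zip(starts, stops))

# Right order: fill the nested content first...
content.extend([4.4, 5.5])
after_content = is_valid(content, starts, stops)     # unreachable tail is fine

# ...then update the structure that refers to it.
starts.append(3)
stops.append(5)
after_structure = is_valid(content, starts, stops)

# Wrong order: structure that refers to content not yet written.
too_early = is_valid(content, starts + [5], stops + [7])

print(after_content, after_structure, too_early)   # True True False
```

A reader that sees the arrays at any moment between the two steps of the right order still finds a valid (if slightly stale) array.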
Relationship to Arrow
"""""""""""""""""""""

`Apache Arrow `__ is a cross-language, columnar memory format for complex data structures. There is intentionally a high degree of overlap between Awkward Array and Arrow. But whereas Arrow's focus is data portability, Awkward's focus is computation: it would not be unusual to get data from Arrow, compute something with Awkward Array, then return it to another Arrow buffer. For this reason, ``awkward0.fromarrow`` is a zero-copy view. Awkward's data representation is broader than Arrow's, so ``awkward0.toarrow`` does, in general, perform a copy.
The main difference between Awkward Array and Arrow is that Awkward Array does not require all arrays to be included within a contiguous memory buffer, though libraries like `pyarrow `__ relax this criterion while building a compliant Arrow buffer. This restriction does imply that Arrow cannot encode cross-references or cyclic dependencies.
Arrow also doesn't have the luxury of relying on Numpy to define its `primitive arrays `__, so it has a fixed endianness, has no regular tensors without expressing it as a jagged array, and requires 32-bit integers for indexing, instead of taking whatever integer type a user provides.
`Nullability `__ is an optional property of every data type in Arrow, but it's a structure element in Awkward. Similarly, `dictionary encoding `__ is built into Arrow as a fundamental property, but it would be built from an ``IndexedArray`` in Awkward. Chunking and lazy-loading are supported by readers such as `pyarrow `__, but they're not part of the Arrow data model.
The following list translates Awkward Array classes and features to their Arrow counterparts, if possible.
* ``JaggedArray``: Arrow's `list type `__.
* ``Table``: Arrow's `struct type `__, though columns can be added to or removed from Awkward ``Tables`` whereas Arrow is strictly immutable.
* ``BitMaskedArray``: every data type in Arrow potentially has a `null bitmap `__, though it's an explicit array structure in Awkward. (Arrow has no counterpart for Awkward's ``MaskedArray`` or ``IndexedMaskedArray``.)
* ``UnionArray``: directly equivalent to Arrow's `dense union `__. Arrow also has a `sparse union `__, which Awkward Array only has as a ``UnionArray.fromtags`` constructor that builds the dense union on the fly from a sparse union.
* ``ObjectArray`` and ``Methods``: no counterpart because Arrow must be usable in any language.
* ``StringArray``: "string" is a logical type built on top of Arrow's `list type `__.
* ``IndexedArray``: no counterpart (though its role in building `dictionary encoding `__ is built into Arrow as a fundamental property).
* ``SparseArray``: no counterpart.
* ``ChunkedArray``: no counterpart (though a reader may deal with non-contiguous data).
* ``AppendableArray``: no counterpart; Arrow is strictly immutable.
* ``VirtualArray``: no counterpart (though a reader may lazily load data).

High-level operations: common to all classes
--------------------------------------------

There are three levels of abstraction in Awkward Array: high-level operations for data analysis, low-level operations for engineering the structure of the data, and implementation details. Implementation details are handled in the usual way for Python: if exposed at all, class, method, and function names begin with underscores and are not guaranteed to be stable from one release to the next.
The distinction between high-level operations and low-level operations is more subtle and developed as Awkward Array was put to use. Data analysts care about the logical structure of the data—whether it is jagged, what the column names are, whether certain values could be ``None``, etc. Data engineers (or an analyst in "engineering mode") care about contiguousness, how much data are in memory at a given time, whether strings are dictionary-encoded, whether arrays have unreachable elements, etc. The dividing line is between high-level types and low-level array layout (both of which are defined in their own sections below). The following Awkward classes have the same high-level type as their content:
* ``IndexedArray`` because indirection to type ``T`` has type ``T``,
* ``SparseArray`` because a lookup of elements with type ``T`` has type ``T``,
* ``ChunkedArray`` because the chunks, which must have the same type as each other, collectively have that type when logically concatenated,
* ``AppendableArray`` because it's a special case of ``ChunkedArray``,
* ``VirtualArray`` because it produces an array of a given type on demand,
* ``UnionArray`` has the same type as its ``contents`` *only if* all ``contents`` have the same type as each other.

All other classes, such as ``JaggedArray``, have a logically distinct type from their contents.
This section describes a suite of operations that are common to all Awkward classes. For some high-level types, the operation is meaningless or results in an error, such as the jagged ``counts`` of an array that is not jagged at any level, or the ``columns`` of an array that contains no tables, but the operation has a well-defined action on every array class. To use these operations, you do need to understand the high-level type of your data, but not whether it is wrapped in an ``IndexedArray``, a ``SparseArray``, a ``ChunkedArray``, an ``AppendableArray``, or a ``VirtualArray``.
Slicing with square brackets
""""""""""""""""""""""""""""

The primary operation for all classes is slicing with square brackets. This is the operation defined by Python's ``__getitem__`` method. It is so basic that high-level types are defined in terms of what they return when a scalar argument is passed in square brackets.
Just as Numpy's slicing reproduces but generalizes Python sequence behavior, Awkward Array reproduces (most of) `Numpy's slicing behavior `__ and generalizes it in certain cases. An integer argument, a single slice argument, a single Numpy array-like of booleans or integers, and a tuple of any of the above are handled just as in Numpy. Awkward Array does not handle ellipsis (because the depth of an Awkward Array can be different on different branches of a ``Table`` or ``UnionArray``) or ``None`` (because it's not always possible to insert a ``newaxis``). Numpy `record arrays `__ accept a string or sequence of strings as a column argument if it is the only argument, not in a tuple with other types. Awkward Array accepts a string or sequence of strings if it contains a ``Table`` at some level.
An integer argument selects one element from the top-level array (starting at zero), changing the type by decreasing rank or jaggedness by one level.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]])
    a[0]
    # array([1.1, 2.2, 3.3])

Negative indexes count backward from the last element,

.. code-block:: python3

    a[-1]
    # array([9.9])

and the index (after translating negative indexes) must be at least zero and less than the length of the top-level array.

.. code-block:: python3

    try:
        a[-6]
    except Exception as err:
        print(type(err), str(err))
    # index -6 is out of bounds for axis 0 with size 5

A slice selects a range of elements from the top-level array, maintaining the array's type. The first index is the inclusive starting point (starting at zero) and the second index is the exclusive endpoint.

.. code-block:: python3

    a[2:4]
    # JaggedArray: [[4.4, 5.5], [6.6, 7.7, 8.8]]

Python's slice syntax (above) or literal ``slice`` objects may be used.

.. code-block:: python3

    a[slice(2, 4)]
    # JaggedArray: [[4.4, 5.5], [6.6, 7.7, 8.8]]

Negative indexes count backward from the last element and endpoints may be omitted.

.. code-block:: python3

    a[-2:]
    # JaggedArray: [[6.6, 7.7, 8.8], [9.9]]

Start and endpoints beyond the array are not errors: they are truncated.

.. code-block:: python3

    a[2:100]
    # JaggedArray: [[4.4, 5.5], [6.6, 7.7, 8.8], [9.9]]

A skip value (third index of the slice) sets the stride for indexing, allowing you to skip elements, and this skip can be negative. It cannot, however, be zero.

.. code-block:: python3

    a[::-1]
    # JaggedArray: [[9.9], [6.6, 7.7, 8.8], [4.4, 5.5], [], [1.1, 2.2, 3.3]]

A Numpy array-like of booleans with the same length as the array may be used to filter elements. Numpy has a specialized `numpy.compress `__ function for this operation, but the only way to get it in Awkward Array is through square brackets.

.. code-block:: python3

    a[[True, True, False, True, False]]
    # JaggedArray: [[1.1, 2.2, 3.3], [], [6.6, 7.7, 8.8]]

A Numpy array-like of integers (of any length) may be used to select a collection of indexes. Numpy has a specialized `numpy.take `__ function for this operation, but the only way to get it in Awkward Array is through square brackets. Negative indexes and repeated elements are handled in the same way as Numpy.

.. code-block:: python3

    a[[-1, 0, 1, 2, 2, 2]]
    # JaggedArray: [[9.9], [1.1, 2.2, 3.3], [], [4.4, 5.5], [4.4, 5.5], [4.4, 5.5]]

A tuple of length ``N`` applies selections to the first ``N`` levels of rank or jaggedness. Our example array has only two levels, so we can apply two kinds of indexes.

.. code-block:: python3

    a[2:, 0]
    # array([4.4, 6.6, 9.9])

    a[[True, False, True, True, False], ::-1]
    # JaggedArray: [[3.3, 2.2, 1.1], [5.5, 4.4], [8.8, 7.7, 6.6]]

    a[[0, 3, 0], 1::]
    # JaggedArray: [[2.2, 3.3], [7.7, 8.8], [2.2, 3.3]]

As described in Numpy's `advanced indexing `__, advanced indexes (boolean or integer arrays) are broadcast and iterated as one:

.. code-block:: python3

    a[[0, 3], [True, False, True]]
    # array([1.1, 8.8])

Awkward Array has two extensions beyond Numpy, both of which affect only jagged data. If an array is jagged and a jagged array of booleans with the same structure (same length at all levels) is passed in square brackets, only the inner arrays are filtered.

.. code-block:: python3

    a    = awkward0.fromiter([[  1.1,   2.2,  3.3], [], [ 4.4,  5.5], [ 6.6,  7.7,   8.8], [  9.9]])
    mask = awkward0.fromiter([[False, False, True], [], [True, True], [True, True, False], [False]])
    a[mask]
    # JaggedArray: [[3.3], [], [4.4, 5.5], [6.6, 7.7], []]

Similarly, if an array is jagged and a jagged array of integers with the same outer length is passed in square brackets, only the inner arrays are filtered, duplicated, or rearranged.

.. code-block:: python3

    a     = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]])
    index = awkward0.fromiter([[2, 2, 2, 2], [], [1, 0], [2, 1, 0], []])
    a[index]
    # JaggedArray: [[3.3, 3.3, 3.3, 3.3], [], [5.5, 4.4], [8.8, 7.7, 6.6], []]

Although all of the above use a ``JaggedArray`` as an example, the principles are general: you should get analogous results with jagged tables, masked jagged arrays, etc. Non-jagged arrays only support Numpy-like slicing.
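For readers who think in loops, the two jagged extensions described above correspond to the following plain-Python list comprehensions. This is only an equivalent formulation on nested lists, not how the library computes the result.

```python
a = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8], [9.9]]
mask = [[False, False, True], [], [True, True], [True, True, False], [False]]
index = [[2, 2, 2, 2], [], [1, 0], [2, 1, 0], []]

# Jagged boolean mask: filter within each inner array, keep the outer length.
filtered = [[x for x, keep in zip(inner, flags) if keep]
            for inner, flags in zip(a, mask)]
print(filtered)   # [[3.3], [], [4.4, 5.5], [6.6, 7.7], []]

# Jagged integer index: pick, repeat, or reorder within each inner array.
picked = [[inner[i] for i in positions]
          for inner, positions in zip(a, index)]
print(picked)     # [[3.3, 3.3, 3.3, 3.3], [], [5.5, 4.4], [8.8, 7.7, 6.6], []]
```

In both cases the outer list keeps its length of five; only the inner arrays change.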
If an array contains a ``Table``, it can be selected with a string or a sequence of strings, just like Numpy `record arrays `__.

.. code-block:: python3

    a = awkward0.fromiter([{"x": 1, "y": 1.1, "z": "one"}, {"x": 2, "y": 2.2, "z": "two"}, {"x": 3, "y": 3.3, "z": "three"}])
    a
    # <Table [<Row 0> <Row 1> <Row 2>] at 0x7811883930f0>

    a["x"]
    # array([1, 2, 3])

    a[["z", "y"]].tolist()
    # [{'z': 'one', 'y': 1.1}, {'z': 'two', 'y': 2.2}, {'z': 'three', 'y': 3.3}]

Like Numpy, integer indexes and string indexes commute if the integer index corresponds to a structure outside the ``Table`` (this condition is always met for Numpy record arrays).

.. code-block:: python3

    a["y"][1]
    # 2.2

    a[1]["y"]
    # 2.2

    a = awkward0.fromiter([[{"x": 1, "y": 1.1, "z": "one"}, {"x": 2, "y": 2.2, "z": "two"}], [], [{"x": 3, "y": 3.3, "z": "three"}]])
    a
    # <JaggedArray [[<Row 0> <Row 1>] [] [<Row 2>]] at 0x781188407358>

    a["y"][0][1]
    # 2.2

    a[0]["y"][1]
    # 2.2

    a[0][1]["y"]
    # 2.2

but not

.. code-block:: python3

    a = awkward0.fromiter([{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.1, 2.2]}, {"x": 3, "y": [3.1, 3.2, 3.3]}])
    a
    # <Table [<Row 0> <Row 1> <Row 2>] at 0x7811883934a8>

    a["y"][2][1]
    # 3.2

    a[2]["y"][1]
    # 3.2

    try:
        a[2][1]["y"]
    except Exception as err:
        print(type(err), str(err))
    # no column named '_util_isstringslice'

because

.. code-block:: python3

    a[2].tolist()
    # {'x': 3, 'y': [3.1, 3.2, 3.3]}

cannot take a ``1`` argument before ``"y"``.
Just as integer indexes can be alternated with string/sequence of string indexes, so can slices, arrays, and tuples of slices and arrays.

.. code-block:: python3

    a["y"][:, 0]
    # array([1.1, 2.1, 3.1])

Generally speaking, string and sequence of string indexes are *column* indexes, while all other types are *row* indexes.
Assigning with square brackets
""""""""""""""""""""""""""""""

As discussed above, Awkward Arrays are generally immutable with few exceptions. Row assignment is only possible via appending to an ``AppendableArray``. Column assignment, reassignment, and deletion are in general allowed. Columns are assigned and reassigned by assigning to a square bracket expression. This operation is defined by Python's ``__setitem__`` method. Columns are deleted by applying the ``del`` statement to a square bracket expression. This operation is defined by Python's ``__delitem__`` method.
Since only columns can be changed, only strings and sequences of strings are allowed as indexes.

.. code-block:: python3

    a = awkward0.fromiter([[{"x": 1, "y": 1.1, "z": "one"}, {"x": 2, "y": 2.2, "z": "two"}], [], [{"x": 3, "y": 3.3, "z": "three"}]])
    a
    # <JaggedArray [[<Row 0> <Row 1>] [] [<Row 2>]] at 0x7811883905c0>

    a["a"] = awkward0.fromiter([[100, 200], [], [300]])
    a.tolist()
    # [[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100},
    #   {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200}],
    #  [],
    #  [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300}]]

    del a["a"]
    a.tolist()
    # [[{'x': 1, 'y': 1.1, 'z': 'one'}, {'x': 2, 'y': 2.2, 'z': 'two'}],
    #  [],
    #  [{'x': 3, 'y': 3.3, 'z': 'three'}]]

    a[["a", "b"]] = awkward0.fromiter([[{"first": 100, "second": 111}, {"first": 200, "second": 222}], [], [{"first": 300, "second": 333}]])
    a.tolist()
    # [[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111},
    #   {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222}],
    #  [],
    #  [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333}]]

Note that the names of the columns on the right-hand side of the assignment are irrelevant; we're setting two columns, so there need to be two columns on the right. Columns can be anonymous:

.. code-block:: python3

    a[["a", "b"]] = awkward0.Table(awkward0.fromiter([[100, 200], [], [300]]), awkward0.fromiter([[111, 222], [], [333]]))
    a.tolist()
    # [[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111},
    #   {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222}],
    #  [],
    #  [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333}]]

Another thing to note is that the structure (lengths at all levels of jaggedness) must match if the depth is the same.

.. code-block:: python3

    try:
        a["c"] = awkward0.fromiter([[100, 200, 300], [400], [500, 600]])
    except Exception as err:
        print(type(err), str(err))
    # cannot broadcast JaggedArray to match JaggedArray with a different counts

But if the right-hand side is shallower and can be *broadcasted* to the left-hand side, it will be. (See below for broadcasting.)

.. code-block:: python3

    a["c"] = awkward0.fromiter([100, 200, 300])
    a.tolist()
    # [[{'x': 1, 'y': 1.1, 'z': 'one', 'a': 100, 'b': 111, 'c': 100},
    #   {'x': 2, 'y': 2.2, 'z': 'two', 'a': 200, 'b': 222, 'c': 100}],
    #  [],
    #  [{'x': 3, 'y': 3.3, 'z': 'three', 'a': 300, 'b': 333, 'c': 300}]]

Numpy-like broadcasting
"""""""""""""""""""""""

In assignments and mathematical operations between higher-rank and lower-rank arrays, Numpy repeats values in the lower-rank array to "fit," if possible, before applying the operation. This is called `broadcasting `__. For example,

.. code-block:: python3

    numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + 100
    # array([[101.1, 102.2, 103.3],
    #        [104.4, 105.5, 106.6]])

Singletons are also expanded to fit.

.. code-block:: python3

    numpy.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]) + numpy.array([[100], [200]])
    # array([[101.1, 102.2, 103.3],
    #        [204.4, 205.5, 206.6]])

Awkward Arrays have the same feature, but this has particularly useful effects for jagged arrays. In an operation involving two arrays of different depths of jaggedness, the shallower one expands to fit the deeper one.

.. code-block:: python3

    awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]) + awkward0.fromiter([100, 200, 300])
    # JaggedArray: [[101.1, 102.2, 103.3], [], [304.4, 305.5]]

Note that the ``100`` was broadcasted to all three of the elements of the first inner array, ``200`` was broadcasted to no elements in the second inner array (because the second inner array is empty), and ``300`` was broadcasted to both elements of the third inner array.
This is the columnar equivalent to accessing a variable defined outside of an inner loop.

.. code-block:: python3

    jagged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
    flat = [100, 200, 300]
    for i in range(3):
        for j in range(len(jagged[i])):
            # j varies in this loop, but i is constant
            print(i, j, jagged[i][j] + flat[i])
    # 0 0 101.1
    # 0 1 102.2
    # 0 2 103.3
    # 2 0 304.4
    # 2 1 305.5

Many translations of non-columnar code to columnar code have this form. It's often surprising to users that they don't have to do anything special to get this feature (e.g. ``cross``).
Support for Numpy universal functions (ufuncs)
""""""""""""""""""""""""""""""""""""""""""""""

Numpy's key feature of array-at-a-time programming is mainly provided by "universal functions" or "ufuncs." This is a special class of function that applies a scalars → scalar kernel independently to aligned elements of internal arrays to return a same-shape output array. That is, for a scalars → scalar function ``f(x1, ..., xN) → y``, the ufunc takes ``N`` input arrays of the same ``shape`` and returns one output array with that ``shape`` in which ``output[i] = f(input1[i], ..., inputN[i])`` for all ``i``.

.. code-block:: python3

    # N = 1
    numpy.sqrt(numpy.array([1, 4, 9, 16, 25]))
    # array([1., 2., 3., 4., 5.])

    # N = 2
    numpy.add(numpy.array([[1.1, 2.2], [3.3, 4.4]]), numpy.array([[100, 200], [300, 400]]))
    # array([[101.1, 202.2],
    #        [303.3, 404.4]])

Keep in mind that a ufunc is not simply a function that has this property, but a specially named class, deriving from a type in the Numpy library.

.. code-block:: python3

    numpy.sqrt, numpy.add
    # (<ufunc 'sqrt'>, <ufunc 'add'>)

    isinstance(numpy.sqrt, numpy.ufunc), isinstance(numpy.add, numpy.ufunc)
    # (True, True)

This class of functions can be overridden, and Awkward Array overrides them to recognize and properly handle Awkward Arrays.

.. code-block:: python3

    numpy.sqrt(awkward0.fromiter([[1, 4, 9], [], [16, 25]]))
    # JaggedArray: [[1.0, 2.0, 3.0], [], [4.0, 5.0]]

    numpy.add(awkward0.fromiter([[[1.1], 2.2], [], [3.3, None]]), awkward0.fromiter([[[100], 200], [], [None, 300]]))
    # JaggedArray: [[[101.1], 202.2], [], [None, None]]

Only the primary action of the ufunc (``ufunc.__call__``) has been overridden; methods like ``ufunc.at``, ``ufunc.reduce``, and ``ufunc.reduceat`` are not supported. Also, the in-place ``out`` parameter is not supported because Awkward Array data cannot be changed in-place.
For Awkward Arrays, the input arguments to a ufunc must all have the same structure or, if shallower, be broadcastable to the deepest structure. (See above for "broadcasting.") The scalar function is applied to elements at the same positions within this structure from different input arrays. The output array has this structure, populated by return values of the scalar function.
* Rectangular arrays must have the same shape, just as in Numpy. A scalar can be broadcasted (expanded) to have the same shape as the arrays.
* Jagged arrays must have the same number of elements in corresponding inner arrays. A rectangular array with the same outer shape (i.e. containing scalars instead of inner arrays) can be broadcasted to inner arrays with the same lengths.
* Tables must have the same sets of columns (though not necessarily in the same order). There is no broadcasting of missing columns.
* Missing values (``None`` from ``MaskedArrays``) transform to missing values in every ufunc. That is, ``None + 5`` is ``None``, ``None + None`` is ``None``, etc.
* Different data types (through a ``UnionArray``) must be compatible at every site where values are included in the calculation. For instance, input arrays may contain tables with different sets of columns, but all inputs at index ``i`` must have the same sets of columns as each other:

.. code-block:: python3

    numpy.add(awkward0.fromiter([{"x": 1, "y": 1.1}, {"y": 1.1, "z": 100}]),
              awkward0.fromiter([{"x": 3, "y": 3.3}, {"y": 3.3, "z": 300}])).tolist()
    # [{'x': 4, 'y': 4.4}, {'y': 4.4, 'z': 400}]

Unary and binary operations on Awkward Arrays, such as ``-x``, ``x + y``, and ``x**2``, are actually Numpy ufuncs, so all of the above applies to them as well (such as broadcasting the scalar ``2`` in ``x**2``).
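The structure rules listed above (matching lengths, broadcasting of shallower arguments, and missing values propagating as ``None``) can be sketched as a recursive function on nested Python lists. ``apply_binary`` is a hypothetical helper written for this illustration; it handles only lists, scalars, and ``None``, and it is not how the library implements ufuncs.

```python
import operator

def apply_binary(f, x, y):
    """Apply the scalar function f at matching positions of x and y."""
    if x is None or y is None:
        return None                       # missing in, missing out
    x_is_list, y_is_list = isinstance(x, list), isinstance(y, list)
    if x_is_list and y_is_list:
        if len(x) != len(y):
            raise ValueError("structures do not match")
        return [apply_binary(f, xi, yi) for xi, yi in zip(x, y)]
    if x_is_list:
        return [apply_binary(f, xi, y) for xi in x]   # broadcast scalar y
    if y_is_list:
        return [apply_binary(f, x, yi) for yi in y]   # broadcast scalar x
    return f(x, y)

# A shallower argument is broadcast into the deeper one.
broadcast = apply_binary(operator.add, [[1, 2, 3], [], [4, 5]], [100, 200, 300])
print(broadcast)   # [[101, 102, 103], [], [304, 305]]

# Missing values stay missing.
with_none = apply_binary(operator.add, [[1, 2], [], [3, None]], [[10, 20], [], [None, 30]])
print(with_none)   # [[11, 22], [], [None, None]]
```

Integers are used here so that the printed results are exact; the same function works with floats.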
Remember that only ufuncs have been overridden by Awkward Array: other Numpy functions such as ``numpy.concatenate`` are ignorant of Awkward Arrays and will attempt to convert them to Numpy first. In some cases, that may be what you want, but in many, especially any cases involving jagged arrays, it will be a major loss of performance and functionality: jagged arrays turn into Numpy ``dtype=object`` arrays containing Numpy arrays, which can amount to a very large number of Python objects and which don't behave as multidimensional arrays.
You can check to see if a function from Numpy is a ufunc with ``isinstance``.

.. code-block:: python3

    isinstance(numpy.concatenate, numpy.ufunc)
    # False

and you can prevent accidental conversions to Numpy by setting ``allow_tonumpy`` to ``False``, either on one array or globally on a whole class of Awkward Arrays. (See "global switches" below.)

.. code-block:: python3

    x = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    y = awkward0.fromiter([[6.6, 7.7, 8.8], [9.9]])
    numpy.concatenate([x, y])
    # array([array([1.1, 2.2, 3.3]), array([], dtype=float64),
    #        array([4.4, 5.5]), array([6.6, 7.7, 8.8]), array([9.9])],
    #       dtype=object)

    x.allow_tonumpy = False
    try:
        numpy.concatenate([x, y])
    except Exception as err:
        print(type(err), str(err))
    # awkward0.array.base.AwkwardArray.allow_tonumpy is False; refusing to convert to Numpy

Global switches
"""""""""""""""

The ``AwkwardArray`` abstract base class has the following switches to turn off sometimes-undesirable behavior. These switches could be set on the ``AwkwardArray`` class itself, affecting all Awkward Arrays, or they could be set on a particular class like ``JaggedArray`` to only affect ``JaggedArray`` instances, or they could be set on a particular instance, to affect only that instance.
* ``allow_tonumpy`` (default is ``True``); if ``False``, forbid any action that would convert an Awkward Array into a Numpy array (with a likely loss of performance and functionality).
* ``allow_iter`` (default is ``True``); if ``False``, forbid any action that would iterate over an Awkward Array in Python (except printing a few elements as part of its string representation).
* ``check_prop_valid`` (default is ``True``); if ``False``, skip the single-property validity checks in array constructors and when setting properties.
* ``check_whole_valid`` (default is ``True``); if ``False``, skip the whole-array validity checks that are typically called before methods that need them.

.. code-block:: python3

    awkward0.AwkwardArray.check_prop_valid
    # True

    awkward0.JaggedArray.check_whole_valid
    # True

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    numpy.array(a)
    # array([array([1.1, 2.2, 3.3]), array([], dtype=float64),
    #        array([4.4, 5.5])], dtype=object)

    a.allow_tonumpy = False
    try:
        numpy.array(a)
    except Exception as err:
        print(type(err), str(err))
    # awkward0.array.base.AwkwardArray.allow_tonumpy is False; refusing to convert to Numpy

    list(a)
    # [array([1.1, 2.2, 3.3]), array([], dtype=float64), array([4.4, 5.5])]

    a.allow_iter = False
    try:
        list(a)
    except Exception as err:
        print(type(err), str(err))
    # awkward0.array.base.AwkwardArray.allow_iter is False; refusing to iterate

    a

Generic properties and methods
""""""""""""""""""""""""""""""

All Awkward Arrays have the following properties and methods.

* ``type``: the high-level type of the array. (See below for a detailed description of high-level types.)

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    b = awkward0.fromiter([[1.1, 2.2, None, 3.3, None],
                           [4.4, [5.5]],
                           [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]
                          ])

    a.type
    # ArrayType(3, inf, dtype('float64'))

    print(a.type)
    # [0, 3) -> [0, inf) -> float64

    b.type
    # ArrayType(3, inf, OptionType(UnionType(dtype('float64'), ArrayType(inf, dtype('float64')), TableType(x=dtype('int64'), y=TableType(z=dtype('int64'))))))

    print(b.type)
    # [0, 3) -> [0, inf) -> ?((float64 |
    #                          [0, inf) -> float64 |
    #                          'x' -> int64
    #                          'y' -> 'z' -> int64 ))

* ``layout``: the low-level layout of the array. (See below for a detailed description of low-level layouts.)

.. code-block:: python3

    a.layout
    #  layout
    # [ ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
    # [  0] ndarray(shape=3, dtype=dtype('int64'))
    # [  1] ndarray(shape=3, dtype=dtype('int64'))
    # [  2] ndarray(shape=5, dtype=dtype('float64'))

    b.layout
    #  layout
    # [           ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
    # [            0] ndarray(shape=3, dtype=dtype('int64'))
    # [            1] ndarray(shape=3, dtype=dtype('int64'))
    # [            2] IndexedMaskedArray(mask=layout[2, 0], content=layout[2, 1], maskedwhen=-1)
    # [         2, 0] ndarray(shape=10, dtype=dtype('int64'))
    # [         2, 1] UnionArray(tags=layout[2, 1, 0], index=layout[2, 1, 1], contents=[layout[2, 1, 2], layout[2, 1, 3], layout[2, 1, 4]])
    # [      2, 1, 0] ndarray(shape=7, dtype=dtype('uint8'))
    # [      2, 1, 1] ndarray(shape=7, dtype=dtype('int64'))
    # [      2, 1, 2] ndarray(shape=4, dtype=dtype('float64'))
    # [      2, 1, 3] JaggedArray(starts=layout[2, 1, 3, 0], stops=layout[2, 1, 3, 1], content=layout[2, 1, 3, 2])
    # [   2, 1, 3, 0] ndarray(shape=1, dtype=dtype('int64'))
    # [   2, 1, 3, 1] ndarray(shape=1, dtype=dtype('int64'))
    # [   2, 1, 3, 2] ndarray(shape=1, dtype=dtype('float64'))
    # [      2, 1, 4] Table(x=layout[2, 1, 4, 0], y=layout[2, 1, 4, 1])
    # [   2, 1, 4, 0] ndarray(shape=2, dtype=dtype('int64'))
    # [   2, 1, 4, 1] Table(z=layout[2, 1, 4, 1, 0])
    # [2, 1, 4, 1, 0] ndarray(shape=2, dtype=dtype('int64'))

* ``dtype``: the Numpy dtype that this array would have if cast as a Numpy array. Numpy dtypes cannot fully specify Awkward Arrays: use the ``type`` for an analyst-friendly description of the data type or ``layout`` for details about how the arrays are represented.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a.dtype   # the closest Numpy dtype to a jagged array is dtype=object ('O')
    # dtype('O')

    numpy.array(a)
    # array([array([1.1, 2.2, 3.3]), array([], dtype=float64),
    #        array([4.4, 5.5])], dtype=object)

* ``shape``: the Numpy shape that this array would have if cast as a Numpy array. This only specifies the first regular dimensions, not any jagged dimensions or regular dimensions nested within Awkward structures. The Python length (``__len__``) of the array is the first element of this ``shape``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a.shape
    # (3,)

    len(a)
    # 3

The following ``JaggedArray`` has two fixed-size dimensions at the top, followed by a jagged dimension inside of that. The shape only represents the first few dimensions.

.. code-block:: python3

    a = awkward0.JaggedArray.fromcounts([[3, 0], [2, 4]], [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
    a

    a.shape
    # (2, 2)

    len(a)
    # 2

    print(a.type)
    # [0, 2) -> [0, 2) -> [0, inf) -> float64

Also, a dimension can effectively be fixed-size, but represented by a ``JaggedArray``. The ``shape`` does not encompass any dimensions represented by a ``JaggedArray``.

.. code-block:: python3

    # Same structure, but it's JaggedArrays all the way down.
    b = a.structure1d()
    b

    b.shape
    # (2,)

* ``size``: the product of ``shape``, as in Numpy.

.. code-block:: python3

    a.shape
    # (2, 2)

    a.size
    # 4

* ``nbytes``: the total number of bytes in all memory buffers referenced by the array, not including bytes in Python objects (which are Python-implementation dependent, not even available in PyPy). Same as the Numpy property of the same name.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a.nbytes
    # 72

    a.offsets.nbytes + a.content.nbytes
    # 72

* ``tolist()``: converts the array into Python objects: ``lists`` for arrays, ``dicts`` for table rows, ``tuples`` for table rows with anonymous fields and a ``rowname`` of ``"tuple"``, ``None`` for missing data, and Python objects from ``ObjectArrays``. This is an approximate inverse of ``awkward0.fromiter``.

.. code-block:: python3

    awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]).tolist()
    # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

    awkward0.fromiter([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}]).tolist()
    # [{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3}]

    awkward0.Table.named("tuple", [1, 2, 3], [1.1, 2.2, 3.3]).tolist()
    # [(1, 1.1), (2, 2.2), (3, 3.3)]

    awkward0.fromiter([[1.1, 2.2, None], [], [None, 3.3]]).tolist()
    # [[1.1, 2.2, None], [], [None, 3.3]]

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def __repr__(self):
            return f"Point({self.x}, {self.y})"

    a = awkward0.fromiter([[Point(1, 1.1), Point(2, 2.2), Point(3, 3.3)], [], [Point(4, 4.4), Point(5, 5.5)]])
    a

    a.tolist()
    # [[Point(1, 1.1), Point(2, 2.2), Point(3, 3.3)],
    #  [],
    #  [Point(4, 4.4), Point(5, 5.5)]]

* ``valid(exception=False, message=False)``: manually invoke the whole-array validity checks on the top-level array (not recursively). With the default options, this function returns ``True`` if valid and ``False`` if not. If ``exception=True``, it returns nothing on success and raises the appropriate exception on failure. If ``message=True``, it returns ``None`` on success and the error string on failure. (TODO: ``recursive=True``?)

.. code-block:: python3

    a = awkward0.JaggedArray.fromcounts([3, 0, 2], [1.1, 2.2, 3.3, 4.4])  # content array is too short
    a.valid()
    # False

    try:
        a.valid(exception=True)
    except Exception as err:
        print(type(err), str(err))
    # maximum offset 5 is beyond the length of the content (4)

    a.valid(message=True)
    # ": maximum offset 5 is beyond the length of the content (4)"

* ``astype(dtype)``: convert *nested Numpy arrays* into the given type while maintaining Awkward structure.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a.astype(numpy.int32)

* ``regular()``: convert the Awkward Array into a Numpy array and (unlike ``numpy.array(awkward_array)``) raise an error if it cannot be faithfully represented.

.. code-block:: python3

    # This JaggedArray happens to have equal-sized inner arrays.
    a = awkward0.fromiter([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6], [7.7, 8.8, 9.9]])
    a

    a.regular()
    # array([[1.1, 2.2, 3.3],
    #        [4.4, 5.5, 6.6],
    #        [7.7, 8.8, 9.9]])

    # This one does not.
    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a

    try:
        a.regular()
    except Exception as err:
        print(type(err), str(err))
    # jagged array is not regular: different elements have different counts

* ``copy(optional constructor arguments...)``: copy an Awkward Array object, non-recursively and without copying memory buffers, possibly replacing some of its parameters. If the class is an Awkward subclass or has mix-in methods, they are propagated to the copy.

.. code-block:: python3

    class Special:
        def get(self, index):
            try:
                return self[index]
            except IndexError:
                return None

    JaggedArrayMethods = awkward0.Methods.mixin(Special, awkward0.JaggedArray)

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
    a.__class__ = JaggedArrayMethods
    a

    a.get(2)
    # array([4.4, 5.5])

    a.get(3)

    b = a.copy(content=[100, 200, 300, 400, 500])
    b

    b.get(2)
    # array([400, 500])

    b.get(3)

Internally, all the methods that return views of the array (like slicing) use ``copy`` to retain the special methods.

.. code-block:: python3

    c = a[1:]
    c

    c.get(1)
    # array([4.4, 5.5])

    c.get(2)

* ``deepcopy(optional constructor arguments...)``: like ``copy``, except that it recursively copies all internal structure, including memory buffers associated with Numpy arrays.

.. code-block:: python3

    b = a.deepcopy(content=[100, 200, 300, 400, 500])
    b

    # Modify the structure of a (not recommended; this is a demo).
    a.starts[0] = 1
    a

    # But b is not modified. (If it were, it would start with 200.)
    b

* ``empty_like(optional constructor arguments...)``
* ``zeros_like(optional constructor arguments...)``
* ``ones_like(optional constructor arguments...)``: recursively copies structure, replacing contents with new uninitialized buffers, new buffers full of zeros, or new buffers full of ones. Not usually used in analysis, but needed for implementation.

.. code-block:: python3

    d = a.zeros_like()
    d

    e = a.ones_like()
    e

Reducers
""""""""

All Awkward Arrays also have a complete set of reducer methods. Reducers can be found in Numpy as well (as array methods and as free-standing functions), but they're not called out as a special class the way that universal functions ("ufuncs") are. Reducers decrease the rank or jaggedness of an array by one dimension, replacing subarrays with scalars. Examples include ``sum``, ``min``, and ``max``, but any monoid (associative operation with an identity) can be a reducer.

In Awkward Array, reducers are only array methods (not free-standing functions) and unlike Numpy, they do not take an ``axis`` parameter. When a reducer is called at any level, it reduces the innermost dimension. (Since outer dimensions can be jagged, this is the only dimension that can be meaningfully reduced.)
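
One way to picture "reducing the innermost dimension" with plain Numpy is sketched below. It assumes a jagged array stored as a flat ``content`` buffer plus an ``offsets`` array marking inner-array boundaries; the function name ``jagged_reduce`` is hypothetical, and this is not necessarily how Awkward Array implements its reducers. Each inner array collapses to one scalar, and empty inner arrays receive the operation's identity.

```python
import numpy

def jagged_reduce(ufunc, identity, offsets, content):
    """Reduce each inner array content[offsets[i]:offsets[i+1]] to a scalar."""
    offsets = numpy.asarray(offsets)
    content = numpy.asarray(content)
    starts, stops = offsets[:-1], offsets[1:]
    nonempty = stops > starts

    # Start from the identity so that empty inner arrays already hold their answer.
    out = numpy.full(len(starts), identity, dtype=content.dtype)

    # reduceat reduces from each given index up to the next one (or to the end),
    # so passing only the starts of non-empty inner arrays covers every element once.
    if nonempty.any():
        out[nonempty] = ufunc.reduceat(content[:offsets[-1]], starts[nonempty])
    return out

# Logically [[1, 2, 3], [], [4, 5, 6], [7, 8, 9, 10]]
offsets = numpy.array([0, 3, 3, 6, 10])
content = numpy.arange(1, 11, dtype=numpy.int64)

sums = jagged_reduce(numpy.add, 0, offsets, content)
products = jagged_reduce(numpy.multiply, 1, offsets, content)
print(sums)
print(products)
```

Swapping in a different associative ufunc and its identity (for example ``numpy.multiply`` with ``1``) changes the reducer without changing the traversal.
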

.. code-block:: python3

    a = awkward0.fromiter([[[[1, 2], [3]], [[4, 5]]], [[[], [6, 7, 8, 9]]]])
    a

    a.sum()

    a.sum().sum()

    a.sum().sum().sum()
    # array([15, 30])

    a.sum().sum().sum().sum()
    # 45

In the following example, "the deepest axis" of different fields in the table are at different depths: singly jagged in ``"x"`` and doubly jagged in ``"y"``. The ``sum`` reduces each depth by one, producing a flat array in ``"x"`` and a singly jagged array in ``"y"``.

.. code-block:: python3

    a = awkward0.fromiter([{"x": [], "y": [[0.1, 0.2], [], [0.3]]}, {"x": [1, 2, 3], "y": [[0.4], [], [0.5, 0.6]]}])
    a.tolist()
    # [{'x': [], 'y': [[0.1, 0.2], [], [0.3]]},
    #  {'x': [1, 2, 3], 'y': [[0.4], [], [0.5, 0.6]]}]

    a.sum().tolist()
    # [{'x': 0, 'y': [0.3, 0.0, 0.3]},
    #  {'x': 6, 'y': [0.4, 0.0, 1.1]}]

This sum cannot be reduced again because ``"x"`` is not jagged (would reduce to a scalar) and ``"y"`` is (would reduce to an array). The result cannot be scalar in one field (a single row, not a collection) and an array in another field (a collection).

.. code-block:: python3

    try:
        a.sum().sum()
    except Exception as err:
        print(type(err), str(err))
    # some Table columns are jagged and others are not

A table can be reduced if all of its fields are jagged or if all of its fields are not jagged; here's an example of the latter.

.. code-block:: python3

    a = awkward0.fromiter([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
    a.tolist()
    # [{'x': 1, 'y': 1.1}, {'x': 2, 'y': 2.2}, {'x': 3, 'y': 3.3}]

    a.sum()

The resulting object is a scalar row; for your convenience, it has been labeled with the reducer that produced it.

.. code-block:: python3

    isinstance(a.sum(), awkward0.Table.Row)
    # True

``UnionArrays`` are even more constrained: they can only be reduced if they have primitive (Numpy) type.

.. code-block:: python3

    a = awkward0.fromiter([1, 2, 3, {"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}])
    a

    try:
        a.sum()
    except Exception as err:
        print(type(err), str(err))
    # cannot reduce a UnionArray of non-primitive type

    a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],
                                     [numpy.array([1, 2, 3], dtype=numpy.int32),
                                      numpy.array([4, 5], dtype=numpy.float64)])
    a

    a.sum()
    # 15.0

In all reducers, ``NaN`` in floating-point arrays and ``None`` in ``MaskedArrays`` are skipped, so these reducers are more like ``numpy.nansum``, ``numpy.nanmax``, and ``numpy.nanmin``, but generalized to all nullable types.

.. code-block:: python3

    a = awkward0.fromiter([[[[1.1, numpy.nan], [2.2]], [[None, 3.3]]], [[[], [None, numpy.nan, None]]]])
    a

    a.sum()

    a = awkward0.fromiter([[{"x": 1, "y": 1.1}, None, {"x": 3, "y": 3.3}], [], [{"x": 4, "y": numpy.nan}]])
    a.tolist()
    # [[{'x': 1, 'y': 1.1}, None, {'x': 3, 'y': 3.3}], [], [{'x': 4, 'y': nan}]]

    a.sum().tolist()
    # [{'x': 4, 'y': 4.4}, {'x': 0, 'y': 0.0}, {'x': 4, 'y': 0.0}]

The following reducers are defined as methods on all Awkward Arrays.

* ``reduce(ufunc, identity)``: generic reducer, calls ``ufunc.reduceat`` and returns ``identity`` for empty arrays.

.. code-block:: python3

    # numba.vectorize makes new ufuncs (requires type signatures and a kernel function)
    import numba
    @numba.vectorize([numba.int64(numba.int64, numba.int64)])
    def sum_mod_10(x, y):
        return (x + y) % 10

    a = awkward0.fromiter([[1, 2, 3], [], [4, 5, 6], [7, 8, 9, 10]])
    a.sum()
    # array([ 6, 0, 15, 34])

    a.reduce(sum_mod_10, 0)
    # array([6, 0, 5, 4])

    # Missing (None) values are ignored.
    a = awkward0.fromiter([[1, 2, None, 3], [], [None, None, None], [7, 8, 9, 10]])
    a.reduce(sum_mod_10, 0)
    # array([6, 0, 0, 4])

* ``any()``: boolean reducer, returns ``True`` if any (logical or) of the elements of an array are ``True``, returns ``False`` for empty arrays.

.. code-block:: python3

    a = awkward0.fromiter([[False, False], [True, True], [True, False], []])
    a.any()
    # array([False, True, True, False])

    # Missing (None) values are ignored.
    a = awkward0.fromiter([[False, None], [True, None], [None]])
    a.any()
    # array([False, True, False])

* ``all()``: boolean reducer, returns ``True`` if all (logical and) of the elements of an array are ``True``, returns ``True`` for empty arrays.

.. code-block:: python3

    a = awkward0.fromiter([[False, False], [True, True], [True, False], []])
    a.all()
    # array([False, True, False, True])

    # Missing (None) values are ignored.
    a = awkward0.fromiter([[False, None], [True, None], [None]])
    a.all()
    # array([False, True, True])

* ``count()``: returns the (integer) number of elements in an array, skipping ``None`` and ``NaN``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])
    a.count()
    # array([2, 0, 1])

* ``count_nonzero()``: returns the (integer) number of non-zero elements in an array, skipping ``None`` and ``NaN``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None, 0], [], [3.3, numpy.nan, 0]])
    a.count_nonzero()
    # array([2, 0, 1])

* ``sum()``: returns the sum of each array, skipping ``None`` and ``NaN``, returning 0 for empty arrays.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])
    a.sum()
    # array([3.3, 0. , 3.3])

* ``prod()``: returns the product (multiplication) of each array, skipping ``None`` and ``NaN``, returning 1 for empty arrays.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])
    a.prod()
    # array([2.42, 1. , 3.3 ])

* ``min()``: returns the minimum number in each array, skipping ``None`` and ``NaN``, returning infinity or the largest possible integer for empty arrays. (Note that Numpy raises errors for empty arrays.)

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])
    a.min()
    # array([1.1, inf, 3.3])

    a = awkward0.fromiter([[1, 2, None], [], [3]])
    a.min()
    # array([ 1, 9223372036854775807, 3])

The identity of minimization is ``inf`` for floating-point values and ``9223372036854775807`` for ``int64`` because minimization with any other value would return the other value. This is more convenient for data analysts than raising an error because empty inner arrays are common.

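
The identity values quoted above can be checked with plain Numpy. The sketch below (variable names are illustrative) shows that minimizing against the identity returns the other operand, that Numpy raises on an empty array by default, and that supplying the identity as Numpy's ``initial`` argument gives the same result described here for empty inner arrays.

```python
import numpy

# Identities for minimization: combining them with any value returns that value.
float_identity = numpy.inf
int_identity = numpy.iinfo(numpy.int64).max

print(numpy.minimum(float_identity, 3.3))
print(numpy.minimum(int_identity, numpy.int64(-7)))

empty = numpy.array([], dtype=numpy.float64)

# Plain Numpy refuses to take the minimum of an empty array...
try:
    empty.min()
    numpy_raised = False
except ValueError:
    numpy_raised = True

# ...unless it is given a starting value, which then plays the role of the identity.
min_of_empty = numpy.min(empty, initial=float_identity)
print(numpy_raised, min_of_empty)
```
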
* ``max()``: returns the maximum number in each array, skipping ``None`` and ``NaN``, returning negative infinity or the smallest possible integer for empty arrays. (Note that Numpy raises errors for empty arrays.)

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, None], [], [3.3, numpy.nan]])
    a.max()
    # array([ 2.2, -inf, 3.3])

    a = awkward0.fromiter([[1, 2, None], [], [3]])
    a.max()
    # array([ 2, -9223372036854775808, 3])

The identity of maximization is ``-inf`` for floating-point values and ``-9223372036854775808`` for ``int64`` because maximization with any other value would return the other value. This is more convenient for data analysts than raising an error because empty inner arrays are common.

Note that the maximization-identity for unsigned types is ``0``.

.. code-block:: python3

    a = awkward0.JaggedArray.fromcounts([3, 0, 2], numpy.array([1.1, 2.2, 3.3, 4.4, 5.5], dtype=numpy.uint16))
    a

    a.max()
    # array([3, 0, 5], dtype=uint16)

Functions like mean and standard deviation aren't true reducers because they're not associative (``mean(mean(x1, x2, x3), mean(x4, x5))`` is not equal to ``mean(mean(x1, x2), mean(x3, x4, x5))``). However, they're useful methods that exist on all Awkward Arrays, defined in terms of reducers.

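
The non-associativity is easy to confirm numerically. This plain-Numpy sketch (with made-up data) averages partial means under two different splits, then shows that the partial sums and counts, which are true reducers, combine to the correct overall mean.

```python
import numpy

x = numpy.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Averaging partial means depends on where the data are split.
split_3_2 = numpy.mean([x[:3].mean(), x[3:].mean()])
split_2_3 = numpy.mean([x[:2].mean(), x[2:].mean()])
print(split_3_2, split_2_3, x.mean())

# Sum and count are true reducers: partial results combine consistently,
# so the mean can be recovered from them for any split.
total = x[:3].sum() + x[3:].sum()
count = len(x[:3]) + len(x[3:])
mean_from_reducers = total / count
print(mean_from_reducers)
```
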
* ``moment(n, weight=None)``: returns the ``n``th moment of each array (a floating-point value), skipping ``None`` and ``NaN``, returning ``NaN`` for empty arrays. If ``weight`` is given, it is taken as an array of weights, which may have the same structure as the ``array`` or be broadcastable to it, though any broadcasted weights would have no effect on the moment.

.. code-block:: python3

    a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])
    a.moment(1)
    # array([2. , nan, 4.5])

    a.moment(2)
    # array([ 4.66666667, nan, 20.5 ])

Here is the first moment (mean) with a weight broadcasted from a scalar and from a non-jagged array, to show how it doesn't affect the result. The moment is calculated over an inner array, so if a constant value is broadcasted to all elements of that inner array, they all get the same weight.

.. code-block:: python3

    a.moment(1)
    # array([2. , nan, 4.5])

    a.moment(1, 100)
    # array([2. , nan, 4.5])

    a.moment(1, numpy.array([100, 200, 300]))
    # array([2. , nan, 4.5])

Only when the weight varies across an inner array does it have an effect.

.. code-block:: python3

    a.moment(1, awkward0.fromiter([[1, 10, 100], [], [0, 100]]))
    # array([2.89189189, nan, 5. ])

* ``mean(weight=None)``: returns the mean of each array (a floating-point value), skipping ``None`` and ``NaN``, returning ``NaN`` for empty arrays, using optional ``weight`` as above.

.. code-block:: python3

    a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])
    a.mean()
    # array([2. , nan, 4.5])

* ``var(weight=None, ddof=0)``: returns the variance of each array (a floating-point value), skipping ``None`` and ``NaN``, returning ``NaN`` for empty arrays, using optional ``weight`` as above. The ``ddof`` or "Delta Degrees of Freedom" replaces a divisor of ``N`` (count or sum of weights) with a divisor of ``N - ddof``, following ``numpy.var``.

.. code-block:: python3

    a = awkward0.fromiter([[1, 2, 3], [], [4, 5]])
    a.var()
    # array([0.66666667, nan, 0.25 ])

    a.var(ddof=1)
    # array([1. , nan, 0.5])

* ``std(weight=None, ddof=0)``: returns the standard deviation of each array, the square root of the variance described above.

.. code-block:: python3

    a.std()
    # array([0.81649658, nan, 0.5 ])

    a.std(ddof=1)
    # array([1. , nan, 0.70710678])

Properties and methods for jaggedness
"""""""""""""""""""""""""""""""""""""

All Awkward Arrays have these methods, but they provide information about the first nested ``JaggedArray`` within a structure. If, for instance, the ``JaggedArray`` is within some structure that doesn't affect high-level type (e.g. ``IndexedArray``, ``ChunkedArray``, ``VirtualArray``), then the methods are passed through to the ``JaggedArray``. If it's nested within something that does change type, but can meaningfully pass on the call, such as ``MaskedArray``, then that's what they do. If, however, it reaches a ``Table``, which may have some jagged columns and some non-jagged columns, the propagation stops.

* ``counts``: Numpy array of the number of elements in each inner array of the shallowest ``JaggedArray``. The ``counts`` may have rank > 1 if there are any fixed-size dimensions before the ``JaggedArray``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
    a.counts
    # array([3, 0, 2, 4])

    # MaskedArrays return -1 for missing values.
    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], None, [6.6, 7.7, 8.8, 9.9]])
    a.counts
    # array([ 3, 0, -1, 4])

A missing inner array (counts is ``-1``) is distinct from an empty inner array (counts is ``0``), but if you want to ensure that you're working with data that have at least ``N`` elements, ``counts >= N`` works.

.. code-block:: python3

    a.counts >= 1
    # array([ True, False, False, True])

    a[a.counts >= 1]

    # UnionArrays return -1 for non-jagged arrays mixed with jagged arrays.
    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], 999, [6.6, 7.7, 8.8, 9.9]])
    a.counts
    # array([ 3, 0, -1, 4])

    # Same for tabular data, regardless of whether they contain nested jagged arrays.
    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], {"x": 1, "y": [1.1, 1.2, 1.3]}, [6.6, 7.7, 8.8, 9.9]])
    a.counts
    # array([ 3, 0, -1, 4])

Note! This means that pure ``Tables`` will always return ``-1`` for counts, regardless of what they contain.

.. code-block:: python3

    a = awkward0.fromiter([{"x": [], "y": []}, {"x": [1], "y": [1.1]}, {"x": [1, 2], "y": [1.1, 2.2]}])
    a.counts
    # array([-1, -1, -1])

If all of the columns of a ``Table`` are ``JaggedArrays`` with the same structure, you probably want to zip them into a single ``JaggedArray``.

.. code-block:: python3

    b = awkward0.JaggedArray.zip(x=a.x, y=a.y)
    b

    b.counts
    # array([0, 1, 2])

* ``flatten(axis=0)``: removes one level of structure (losing information about boundaries between inner arrays) at a depth of jaggedness given by ``axis``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
    a.flatten()
    # array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

Unlike a ``JaggedArray``'s ``content``, which is part of its low-level layout, ``flatten()`` performs a high-level logical operation. Here's an example of the distinction.

.. code-block:: python3

    # JaggedArray with an unusual but valid structure.
    a = awkward0.JaggedArray([3, 100, 0, 6], [6, 100, 2, 10],
                             [4.4, 5.5, 999, 1.1, 2.2, 3.3, 6.6, 7.7, 8.8, 9.9, 123])
    a

    a.flatten()   # gives you a logically flattened array
    # array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

    a.content     # gives you an internal structure component of the array
    # array([ 4.4, 5.5, 999. , 1.1, 2.2, 3.3, 6.6, 7.7, 8.8,
    #         9.9, 123. ])

In many cases, the output of ``flatten()`` corresponds to the output of ``content``, but be aware of the difference and use the one you want.

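
To make the distinction concrete, the logical flattening of the example above can be imitated with plain Numpy by walking the ``starts`` and ``stops`` arrays and slicing ``content``. This is only a sketch of the idea, not Awkward Array's actual implementation; it reuses the same numbers so the unreachable values (``999`` and ``123``) visibly drop out.

```python
import numpy

starts = numpy.array([3, 100, 0, 6])
stops = numpy.array([6, 100, 2, 10])
content = numpy.array([4.4, 5.5, 999, 1.1, 2.2, 3.3, 6.6, 7.7, 8.8, 9.9, 123])

# Each inner array is content[start:stop]; an out-of-range empty slice such as
# content[100:100] is simply an empty array.
flattened = numpy.concatenate([content[i:j] for i, j in zip(starts, stops)])
print(flattened)

# Elements of content that no inner array refers to never appear in the result.
unreferenced = sorted(set(content.tolist()) - set(flattened.tolist()))
print(unreferenced)
```
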
With ``flatten(axis=1)``, we can internally flatten nested ``JaggedArrays``.

.. code-block:: python3

    a = awkward0.fromiter([[[1.1, 2.2], [3.3]], [], [[4.4, 5.5]], [[6.6, 7.7, 8.8], [], [9.9]]])
    a

    a.flatten(axis=0)

    a.flatten(axis=1)

Even if a ``JaggedArray``'s inner structure is due to a fixed-shape Numpy array, the ``axis`` parameter propagates down and does the right thing.

.. code-block:: python3

    a = awkward0.JaggedArray.fromcounts(numpy.array([3, 0, 2]),
                                        numpy.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]))
    a

    type(a.content)
    # numpy.ndarray

    a.flatten(axis=1)

But, unlike Numpy, we can't ask for an ``axis`` starting from the other end (with a negative index). The "deepest array" is not a well-defined concept for Awkward Arrays.

.. code-block:: python3

    try:
        a.flatten(axis=-1)
    except Exception as err:
        print(type(err), str(err))
    # axis must be a non-negative integer (can't count from the end)

    a = awkward0.fromiter([[[1.1, 2.2], [3.3]], [], None, [[6.6, 7.7, 8.8], [], [9.9]]])
    a

    a.flatten(axis=1)

* ``pad(length, maskedwhen=True, clip=False)``: ensures that each inner array has at least ``length`` elements by filling in the empty spaces with ``None`` (i.e. by inserting a ``MaskedArray`` layer). The ``maskedwhen`` parameter determines whether ``mask[i] == True`` means the element is ``None`` (``maskedwhen=True``) or not ``None`` (``maskedwhen=False``). Setting ``maskedwhen`` doesn't change the logical meaning of the array. If ``clip=True``, then the inner arrays will have exactly ``length`` elements (by clipping the ones that are too long). Even though this results in regular sizes, they are still represented by a ``JaggedArray``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
    a

    a.pad(3)

    a.pad(3, maskedwhen=False)

    a.pad(3, clip=True)

If you want to get rid of the ``MaskedArray`` layer, replace ``None`` with some value.

.. code-block:: python3

    a.pad(3).fillna(-999)

If you want to make an effectively regular array into a real Numpy array, use ``regular``.

.. code-block:: python3

    a.pad(3, clip=True).fillna(0).regular()
    # array([[1.1, 2.2, 3.3],
    #        [0. , 0. , 0. ],
    #        [4.4, 5.5, 0. ],
    #        [6.6, 7.7, 8.8]])

If a ``JaggedArray`` is nested within some other type, ``pad`` will propagate down to it.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3], [], None, [4.4, 5.5], None])
    a

    a.pad(3)

    a = awkward0.Table(x=[[1, 1], [2, 2], [3, 3], [4, 4]],
                       y=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]))
    a.tolist()
    # [{'x': [1, 1], 'y': [1.1, 2.2, 3.3]},
    #  {'x': [2, 2], 'y': []},
    #  {'x': [3, 3], 'y': [4.4, 5.5]},
    #  {'x': [4, 4], 'y': [6.6, 7.7, 8.8, 9.9]}]

    a.pad(3).tolist()
    # [{'x': [1, 1, None], 'y': [1.1, 2.2, 3.3]},
    #  {'x': [2, 2, None], 'y': [None, None, None]},
    #  {'x': [3, 3, None], 'y': [4.4, 5.5, None]},
    #  {'x': [4, 4, None], 'y': [6.6, 7.7, 8.8, 9.9]}]

    a.pad(3, clip=True).tolist()
    # [{'x': [1, 1, None], 'y': [1.1, 2.2, 3.3]},
    #  {'x': [2, 2, None], 'y': [None, None, None]},
    #  {'x': [3, 3, None], 'y': [4.4, 5.5, None]},
    #  {'x': [4, 4, None], 'y': [6.6, 7.7, 8.8]}]

If you pass a ``pad`` through a ``Table``, be sure that every field in each record is a nested array (and therefore can be padded).

.. code-block:: python3

    a = awkward0.Table(x=[1, 2, 3, 4],
                       y=awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]))
    a.tolist()
    # [{'x': 1, 'y': [1.1, 2.2, 3.3]},
    #  {'x': 2, 'y': []},
    #  {'x': 3, 'y': [4.4, 5.5]},
    #  {'x': 4, 'y': [6.6, 7.7, 8.8, 9.9]}]

    try:
        a.pad(3)
    except Exception as err:
        print(type(err), str(err))
    # pad cannot be applied to scalars

The same goes for ``UnionArrays``.

.. code-block:: python3

    a = awkward0.fromiter([[1.1, 2.2, 3.3, [1, 2, 3]], [], [4.4, 5.5, [4, 5]]])
    a

    a.pad(5)

    a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],
                                     [awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),
                                      awkward0.fromiter([[100, 101], [102]])])
    a

    a.pad(3)

    a = awkward0.UnionArray.fromtags([0, 0, 0, 1, 1],
                                     [awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]]),
                                      awkward0.fromiter([100, 200])])
    a

    try:
        a.pad(3)
    except Exception as err:
        print(type(err), str(err))
    # pad cannot be applied to scalars

The general behavior of ``pad`` is to replace the shallowest ``JaggedArray`` with a ``JaggedArray`` containing a ``MaskedArray``. The one exception to this type signature is that ``StringArrays`` are padded with characters.

.. code-block:: python3
a = awkward0.fromiter(["one", "two", "three"])
a
#