Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/enricoschumann/tsdb

A terribly-simple data base for time series
https://github.com/enricoschumann/tsdb

csv r time-series tsdb

Last synced: about 2 months ago
JSON representation

A terribly-simple data base for time series

Awesome Lists containing this project

README

        

#+TITLE: tsdb: a terribly-simple database for time series
#+AUTHOR: Enrico Schumann
#+OPTIONS: toc:nil
#+BIND: org-latex-default-packages-alist nil
#+BIND: org-use-sub-superscripts {}
#+PROPERTY: tangle yes
#+PROPERTY: header-args :comments link
#+PROPERTY: header-args:R :session *R*
#+PROPERTY: header-args :eval never-export
# ------------------ LATEX ------------------
#+LATEX_CLASS: scrartcl
#+LATEX_CLASS_OPTIONS: [a4paper,fontsize=11pt]
#+LATEX_HEADER: \addtokomafont{disposition}{\rmfamily}
#+LATEX_HEADER: \addtokomafont{descriptionlabel}{\rmfamily}
#+LATEX_HEADER: \setlength{\parindent}{0em}
#+LATEX_HEADER: \setlength{\parskip}{2ex plus0.5ex minus0.5ex}
#+LATEX_HEADER: \newcommand{\pmwr}{\textsc{pm}w\textsc{r}}
#+LATEX_HEADER: \newcommand{\pl}{\textsc{pl}}
#+LATEX_HEADER: \newcommand{\R}{\textsf{R}}
#+LATEX_HEADER: \usepackage[backend=bibtex,citestyle=authoryear]{biblatex}
#+LATEX_HEADER: \addbibresource{Library.bib}
#+LATEX_HEADER: \usepackage[left=3cm,right=5cm,top=2cm,bottom=4cm,twoside]{geometry}
#+LATEX_HEADER: \usepackage[libertine]{newtxmath}
#+LATEX_HEADER: \usepackage{fontspec}
#+LATEX_HEADER: \setmainfont{Linux Libertine O}
#+LATEX_HEADER: \setmonofont[Scale=0.91]{inconsolata}
#+LATEX_HEADER: \usepackage{graphicx}
#+LATEX_HEADER: \usepackage[dvipsnames]{xcolor}
#+LATEX_HEADER: \definecolor{grey20}{gray}{0.20}
#+LATEX_HEADER: \definecolor{grey30}{gray}{0.30}
#+LATEX_HEADER: \definecolor{grey40}{gray}{0.40}
#+LATEX_HEADER: \definecolor{grey90}{gray}{0.90}
#+LATEX_HEADER: \definecolor{grey96}{gray}{0.96}
#+LATEX_HEADER: \usepackage{listings}
#+LATEX_HEADER: \lstset{language=R,basicstyle=\ttfamily,frame=single,commentstyle=\ttfamily\color{OliveGreen},
#+LATEX_HEADER: numberstyle=\ttfamily\footnotesize\color{gray},stringstyle=\ttfamily\color{blue},
#+LATEX_HEADER: backgroundcolor=\color{grey96},rulecolor=\color{grey90},showstringspaces=false,
#+LATEX_HEADER: }
#+LATEX_HEADER: \lstnewenvironment{results}
#+LATEX_HEADER: {\lstset{basicstyle=\ttfamily\color{grey30},backgroundcolor={},frame=single,numbers=none,showstringspaces=false,rulecolor=\color{grey96}}}{}
#+LATEX_HEADER: \usepackage{mdframed}
#+LATEX_HEADER: \newenvironment{FAQ}
#+LATEX_HEADER: {\begin{mdframed}}{\end{mdframed}}
#+LATEX_HEADER: \newenvironment{FAA}
#+LATEX_HEADER: {\begin{mdframed}}{\end{mdframed}}
#+LATEX_HEADER: \usepackage{makeidx}\makeindex
#+LATEX_HEADER: \usepackage[hidelinks]{hyperref}
# ------------------ HTML ------------------
#+HTML_HEAD:
#+HTML_HEAD:
#+HTML_HEAD: html,body {
#+HTML_HEAD: font-family: sans-serif;
#+HTML_HEAD: padding: 0;
#+HTML_HEAD: margin: 0;
#+HTML_HEAD: }
#+HTML_HEAD: body {
#+HTML_HEAD: line-height: 1.45;
#+HTML_HEAD: }
#+HTML_HEAD: #content {
#+HTML_HEAD: font-family: serif;
#+HTML_HEAD: border: 1px solid #eeeeee;
#+HTML_HEAD: border-radius: 3px;
#+HTML_HEAD: color: #222222; width: 100%;
#+HTML_HEAD: width: 700px;
#+HTML_HEAD: padding-top: 2ex;
#+HTML_HEAD: padding: 1em;
#+HTML_HEAD: margin: 0.5em;
#+HTML_HEAD: margin-left: auto;margin-right: auto;
#+HTML_HEAD: }
#+HTML_HEAD: @media (max-width: 700px) {
#+HTML_HEAD: html,body,#content {
#+HTML_HEAD: width: 95%;
#+HTML_HEAD: }
#+HTML_HEAD: }
#+HTML_HEAD: .example {
#+HTML_HEAD: border: 1px solid rgb(240,240,240);
#+HTML_HEAD: padding: 4px;
#+HTML_HEAD: color: rgb(110,110,110);
#+HTML_HEAD: overflow: auto;
#+HTML_HEAD: }
#+HTML_HEAD: .src {
#+HTML_HEAD: border: 1px solid rgb(240,240,240);
#+HTML_HEAD: color: rgb(30,30,30);
#+HTML_HEAD: background-color: rgb(230,230,230);
#+HTML_HEAD: padding: 4px;
#+HTML_HEAD: overflow: auto;
#+HTML_HEAD: }
#+HTML_HEAD: .src:hover {
#+HTML_HEAD: background-color: rgb(240,240,240);
#+HTML_HEAD: padding: 4px;
#+HTML_HEAD: }
#+HTML_HEAD: dt {
#+HTML_HEAD: font-weight: bold;
#+HTML_HEAD: }
#+HTML_HEAD: li {
#+HTML_HEAD: margin-bottom: 0.5ex;
#+HTML_HEAD: }
#+HTML_HEAD: code {
#+HTML_HEAD: font-size: 115%;
#+HTML_HEAD: color: rgb(60,60,60);
#+HTML_HEAD: }
#+HTML_HEAD: .org-right {
#+HTML_HEAD: text-align: right;
#+HTML_HEAD: }
#+HTML_HEAD: nav ul {
#+HTML_HEAD: list-style-type: none;
#+HTML_HEAD: }
#+HTML_HEAD:

#+BEGIN_SRC R :results none :exports none
options(useFancyQuotes=FALSE)
#+END_SRC

* About tsdb

A terribly-simple database for numeric time series,
written purely in R, so no external database-software
is needed. Series are stored in plain-text files (the
most-portable and enduring file type) in CSV
format. Timestamps are encoded using R's native numeric
representation for 'Date'/'POSIXct', which makes them
fast to parse, but keeps them accessible with other
software. The package provides tools for saving and
updating series in this standardised format, for
retrieving and joining data, for summarising files and
directories, and for coercing series from and to other
data types (such as 'zoo' series).

** Good things about tsdb

- no setup needed, no system dependencies
(i.e. external software, such as a database)
- completely portable; moving from one computer to
another requires no effort other than copying the
files (the only thing to take care of is file
encoding if non-ASCII column names are used)
- data usable by other software

** When you need another database

- tsdb is potentially slow
- no multi-user support; no access-rights management
(other than that provided by the OS)
- no network protocols

* Using tsdb

** Writing data

We first load the package.

#+BEGIN_SRC R :session *R* :results none :exports code
library("tsdb")
#+END_SRC

Start by creating time-series data.
#+BEGIN_SRC R :session *R* :results output :exports both
library("zoo")
z <- zoo(1:5, as.Date("2016-1-1") + 0:4)
z
#+END_SRC

#+RESULTS:
: 2016-01-01 2016-01-02 2016-01-03 2016-01-04 2016-01-05
: 1 2 3 4 5

To store these data, we need to enforce a consistent
format, which the functions =ts_table= and
=as.ts_table= do.

#+BEGIN_SRC R :session *R* :results output :exports both
ts <- as.ts_table(z, columns = "A")
ts
#+END_SRC

#+RESULTS:
: 5 rows [2016-01-01 -> 2016-01-05]: A

Note that we had to provide a column name (=A=) for the
data. This is not optional. It is one of the things
that =ts_table= enforces. Another is that timestamps
need to be of class =Date= or =POSIXct=.

To store the data to a file, use =write_ts_table=. The
function will take a directory and file name as
arguments, which mimics the hierarchy of databases and
tables in a classical database.
#+BEGIN_SRC R :session *R* :results none :exports code
write_ts_table(ts, dir = "~/tsdb/daily", file = "example1")
#+END_SRC

The written file will look like this:
# +INCLUDE: ~/tsdb/daily/example1 example

#+BEGIN_EXAMPLE
"timestamp","A"
16801,1
16802,2
16803,3
16804,4
16805,5
#+END_EXAMPLE

You may notice that the dates have been replaced by
numbers. The mapping between these numbers and calendar
times is described later, when we discuss the
representation of timestamps. (But if you can't wait:
it is the number of days since 1 January 1970.)

Let us write a second file. This time, we use
=ts_table= directly.

#+BEGIN_SRC R :session *R* :results output :exports both
x <- array(1:20, dim = c(10, 2))
colnames(x) <- c("A", "B")
x
#+END_SRC

#+RESULTS:
#+begin_example
A B
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
#+end_example

#+BEGIN_SRC R :session *R* :results output :exports both
ts_table(x, timestamp = as.Date("2016-1-1") + 0:9)
#+END_SRC

#+RESULTS:
: 10 rows [2016-01-01 -> 2016-01-10]: A, B

We can also explicitly specify the column names, which
will override the column names of the data. In fact,
this is the preferred way, since it makes things more
explicit (which usually means safer).
#+BEGIN_SRC R :session *R* :results output :exports both
ts <- ts_table(x, timestamp = as.Date("2016-1-1") + 0:9,
columns = c("B", "A"))
ts
#+END_SRC

#+RESULTS:
: 10 rows [2016-01-01 -> 2016-01-10]: B, A

We write the data to a file =example2=.
#+BEGIN_SRC R :session *R* :results none :exports code
write_ts_table(ts, dir = "~/tsdb/daily", file = "example2")
#+END_SRC

The written file looks like this:
# +INCLUDE: ~/tsdb/daily/example2 example

#+BEGIN_EXAMPLE
"timestamp","B","A"
16801,1,11
16802,2,12
16803,3,13
16804,4,14
16805,5,15
16806,6,16
16807,7,17
16808,8,18
16809,9,19
16810,10,20
#+END_EXAMPLE

** TODO Reading data

Use the function =read_ts_tables=.

#+name: read1
#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables("example1", dir = "~/tsdb/daily", columns = "A")
#+END_SRC

The default return value is a list with components
=data=, =timestamp=, =columns= and =file.path=.
#+RESULTS: read1
#+begin_example
$data
A
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5

$timestamp
[1] "2016-01-01" "2016-01-02" "2016-01-03" "2016-01-04" "2016-01-05"

$columns
[1] "A"

$file.path
[1] "~/tsdb/daily/example1::A"
#+end_example

More convenient may be to specify a =return.class=.
#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables("example1", dir = "~/tsdb/daily", columns = "A",
return.class = "zoo")
#+END_SRC

#+RESULTS:
: ~/tsdb/daily/example1::A
: 2016-01-01 1
: 2016-01-02 2
: 2016-01-03 3
: 2016-01-04 4
: 2016-01-05 5

#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables("example1", dir = "~/tsdb/daily", columns = "A",
return.class = "data.frame")
#+END_SRC

#+RESULTS:
: timestamp ~/tsdb/daily/example1::A
: 1 2016-01-01 1
: 2 2016-01-02 2
: 3 2016-01-03 3
: 4 2016-01-04 4
: 5 2016-01-05 5

With =tsdb= before version 0.7, =read_ts_tables= would
per default only have returned values for non-weekend
days. (=tsdb= was written with financial data in mind,
and on weekends there are no prices.) This behaviour is
controlled by argument =drop.weekends=, which defaults
to =FALSE=.

#+BEGIN_SRC R :session *R* :results output :exports both
weekdays(as.Date("2016-1-1")+0:4)
#+END_SRC

#+RESULTS:
: [1] "Friday" "Saturday" "Sunday" "Monday" "Tuesday"

To obtain data for weekends as well, specify the
argument =drop.weekends=.
#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables("example1", dir = "~/tsdb/daily",
columns = "A",
return.class = "data.frame",
drop.weekends = TRUE)
#+END_SRC

#+RESULTS:
: timestamp ~/tsdb/daily/example1::A
: 1 2016-01-01 1
: 2 2016-01-04 4
: 3 2016-01-05 5

You may have noticed a small difference in the names of
the functions for reading and writing. We always write
a single table, but we read tables.

#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables(c("example1", "example2"),
dir = "~/tsdb/daily",
columns = "A",
return.class = "data.frame")
#+END_SRC

#+RESULTS:
#+begin_example
timestamp ~/tsdb/daily/example1::A ~/tsdb/daily/example2::A
1 2016-01-01 1 11
2 2016-01-02 2 12
3 2016-01-03 3 13
4 2016-01-04 4 14
5 2016-01-05 5 15
6 2016-01-06 NA 16
7 2016-01-07 NA 17
8 2016-01-08 NA 18
9 2016-01-09 NA 19
10 2016-01-10 NA 20
#+end_example

The column names of the returned object consist of the
filepaths and the column, which may be more information
than we actually want. The argument =column.name=
specifies the format; its default is
=%dir%/%file%::%column%=.
#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables(c("example1", "example2"),
dir = "~/tsdb/daily",
columns = "A",
return.class = "data.frame",
column.name = "%file%/%column%")
#+END_SRC

#+RESULTS:
#+begin_example
timestamp example1/A example2/A
1 2016-01-01 1 11
2 2016-01-02 2 12
3 2016-01-03 3 13
4 2016-01-04 4 14
5 2016-01-05 5 15
6 2016-01-06 NA 16
7 2016-01-07 NA 17
8 2016-01-08 NA 18
9 2016-01-09 NA 19
10 2016-01-10 NA 20
#+end_example

Missing values are by default set to =NA=. That happens
even for missing columns, with a warning though.
#+BEGIN_SRC R :session *R* :results output :exports both
read_ts_tables(c("example1", "example2"),
dir = "~/tsdb/daily",
columns = c("A", "B"),
return.class = "data.frame",
column.name = "%file%/%column%")
#+END_SRC

#+RESULTS:
#+begin_example
timestamp example1/A example1/B example2/A example2/B
1 2016-01-01 1 NA 11 1
2 2016-01-02 2 NA 12 2
3 2016-01-03 3 NA 13 3
4 2016-01-04 4 NA 14 4
5 2016-01-05 5 NA 15 5
6 2016-01-06 NA NA 16 6
7 2016-01-07 NA NA 17 7
8 2016-01-08 NA NA 18 8
9 2016-01-09 NA NA 19 9
10 2016-01-10 NA NA 20 10
Warning message:
In read_ts_tables(c("example1", "example2"), dir = "~/tsdb/daily", :
columns missing
#+end_example

* How tsdb works

** ts_tables

tsdb works with /time-series tables/ (objects of
class =ts_table=). A =ts_table= is a numeric matrix,
so there is always a =dim= attribute. For a
time-series table =x=, you get the number of
observations with =dim(x)[1L]=.

Attached to this matrix are several attributes:

- timestamp :: a vector: the numeric representation of
the timestamp
- t.type :: character: the class of the original
timestamp, either =Date= or =POSIXct=
- columns :: a character vector that provides the
columns names

There may be other attributes as well, but these three
are always present.

A =ts_table= is not meant as a time-series class. For
most computations (plotting, calculation of statistics,
etc), the =ts_table= must first be coerced to =zoo=,
=xts=, a data-frame or a similar data
structure. Methods that perform such coercions are
responsible for converting the numeric timestamp vector
to an actual timestamp. For this, they may use the
function =ttime=, whose pronounciation may remind you
of a hot beverage, but whose name really stands for
=translate time=.

** The file format
:PROPERTIES:
:CUSTOM_ID: file-format
:END:

=tsdb= can store and load time-series data. The format
it uses is plain CSV. A sample file may look as
follows:

#+BEGIN_EXAMPLE
"timestamp","close"
17131,11
17132,12
17133,13
17134,14
17135,15
#+END_EXAMPLE

Thus, the file has a header line that gives the
names of the columns, with the first column always
being named =timestamp=.

The advantage of this plain format is that the data
are in no way dependent on =tsdb=. The files can be
used and manipulated by other software as well.

** Timestamps
:PROPERTIES:
:CUSTOM_ID: timestamps
:END:

Two types of timestamps are supported: =Date= and
=POSXIct=. As part of a =ts_table=, timestamps are
always stored in their numeric representation: daily
timestamps are represented as the number of days
since 1 Jan 1970; intraday timestamps are the number
of seconds since 1 Jan 1970.