# Tasrif

A Python framework for processing wearable data in the health domain.

Tasrif is a library for processing eHealth data. It provides:

- A pipeline DSL for chaining together commonly used processing operations on time-series eHealth
data, such as resampling, normalization, etc.
- DataReaders for reading eHealth datasets such as
[MyHeartCounts](https://www.synapse.org/?source=post_page---------------------------#!Synapse:syn11269541/wiki/), [SleepHealth](https://www.synapse.org/#!Synapse:syn18492837/wiki/) and data from FitBit devices.

## Installation

Please follow the steps below to install Tasrif.

First, create a virtual environment using [venv](https://docs.python.org/3/library/venv.html) on a Linux machine, or on Windows via [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install):

```bash
# Create a virtual environment
python3 -m venv tasrif-env

# Activate the virtual environment
source tasrif-env/bin/activate

# Upgrade pip
(tasrif-env) pip install --upgrade pip
```

Then, install Tasrif either from PyPI

```bash
(tasrif-env) pip install tasrif
```

or install from source

```bash
(tasrif-env) git clone https://github.com/qcri/tasrif
(tasrif-env) cd tasrif
(tasrif-env) pip install -e .
```

> **Important installation note**: one of Tasrif's dependencies is [pyjq](https://pypi.org/project/pyjq/), which requires gcc and related build tools to be installed on your local machine. Specifically, `pyjq` requires autoconf, automake, libtool, python3-devel.x86_64, python3-tkinter, python-pip, jq, and awscli. See more about this [issue here](https://github.com/duo-labs/cloudmapper/issues/301). To avoid the hassle of installing these libraries, we recommend running Tasrif with Docker.

If no installation errors occur, see the [Quick start by usecase](#quick-start-by-usecase) section to start using Tasrif.

## Running Tasrif with Docker

To avoid the hassle of installing Tasrif, you can use it in a Docker container that launches a local Jupyter notebook.
Make sure you have Docker installed on your operating system before running the following commands.

```bash
cd tasrif
docker build -t tasrif .
docker run -i -p 8888:8888 tasrif
```

You can mount a local directory into the container with the following command, replacing `<local_dir>` with the directory you want to mount:

```bash
docker run -i -v <local_dir>:/home/mnt -p 8888:8888 tasrif
```

After running the container, visit `http://127.0.0.1:8888/` in your preferred browser to work with the Jupyter notebook.

### Note on feature extraction using Tasrif

Due to some outdated internal Tasrif dependencies on PyPI, we have decided to place those dependencies in `requirements.txt`. Once those packages are updated on PyPI, we will move them back to `setup.py`. The current `requirements.txt` specifies the dependency links directly from GitHub. If you plan to use the `TSFreshFeatureExtractorOperator` or `CalculateTimeseriesPropertiesOperator` operators, you will need the [TSFresh](https://github.com/blue-yonder/tsfresh) and [Kats](https://github.com/facebookresearch/Kats) packages installed, which can be done by running the following command:

```bash
(tasrif-env) MINIMAL_KATS=1 pip install -r requirements.txt
```

Note that `MINIMAL_KATS=1` is set in the installation command to minimally install Kats. See [requirements.txt](requirements.txt) for details.

## Features

### Pipeline DSL

Tasrif provides a variety of processing operators that can be chained together in a pipeline. The
operators take [Pandas](https://pandas.pydata.org/)
[DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) as both input and output.

For example, consider the `AggregateOperator`:

```python
>>> import pandas as pd
>>> from tasrif.processing_pipeline.custom import AggregateOperator
>>> from tasrif.processing_pipeline.pandas import DropNAOperator

>>> df0 = pd.DataFrame([
...     ['Doha', 25, 30],
...     ['Doha', 17, 50],
...     ['Dubai', 20, 40],
...     ['Dubai', 21, 42]],
...     columns=['city', 'min_temp', 'max_temp'])

>>> operator = AggregateOperator(
...     groupby_feature_names="city",
...     aggregation_definition={"min_temp": ["mean", "std"]})

>>> df0 = operator.process(df0)

>>> df0
[    city  min_temp_mean  min_temp_std
0   Doha           21.0      5.656854
1  Dubai           20.5      0.707107]
```

Operators are meant to be used as part of a pipeline, where they can be chained together for
sequential processing of data:

```python
>>> import pandas as pd
>>> from tasrif.processing_pipeline import SequenceOperator
>>> from tasrif.processing_pipeline.custom import AggregateOperator, CreateFeatureOperator
>>> from tasrif.processing_pipeline.pandas import ConvertToDatetimeOperator, SortOperator

>>> df0 = pd.DataFrame([
...     ['15-07-2021', 'Doha', 25, 30],
...     ['16-07-2021', 'Doha', 17, 50],
...     ['15-07-2021', 'Dubai', 20, 40],
...     ['16-07-2021', 'Dubai', 21, 42]],
...     columns=['date', 'city', 'min_temp', 'max_temp'])

>>> pipeline = SequenceOperator([
...     ConvertToDatetimeOperator(feature_names=["date"]),
...     CreateFeatureOperator(
...         feature_name='avg_temp',
...         feature_creator=lambda df: (df['min_temp'] + df['max_temp'])/2),
...     SortOperator(by='avg_temp')
... ])

>>> pipeline.process(df0)
[        date   city  min_temp  max_temp  avg_temp
0 2021-07-15   Doha        25        30      27.5
2 2021-07-15  Dubai        20        40      30.0
3 2021-07-16  Dubai        21        42      31.5
1 2021-07-16   Doha        17        50      33.5]
```

### DataReaders

Tasrif also comes with DataReader classes for importing various eHealth datasets into pipelines.
These readers preprocess the raw data and convert them into DataFrames for downstream processing in a pipeline.

Supported datasets include:
- [MyHeartCounts](https://www.synapse.org/?source=post_page---------------------------#!Synapse:syn11269541/wiki/)
- [SleepHealth](https://www.synapse.org/#!Synapse:syn18492837/wiki/)
- [Zenodo FitBit](https://zenodo.org/record/53894)
- Export data from FitBit devices
- Export data from Withings devices
- ...and more

DataReaders can be used by treating them as source operators in a pipeline:

```python
from tasrif.processing_pipeline import SequenceOperator
from tasrif.processing_pipeline.pandas import DropNAOperator, SetIndexOperator
from tasrif.data_readers.my_heart_counts import DayOneSurveyDataset

day_one_survey_path = '<path to the MyHeartCounts DayOneSurvey data>'

pipeline = SequenceOperator([
    DayOneSurveyDataset(day_one_survey_path),
    DropNAOperator(),
    SetIndexOperator('healthCode'),
])

pipeline.process()
```

## Quick start by usecase

- [Reading data](#reading-data)
- [Compute statistics](#compute-statistics)
- [Extract features from existing columns](#extract-features-from-existing-columns)
- [Filter data](#filter-data)
- [Wrangle data](#wrangle-data)
- [Test prepared data](#test-prepared-data)
- [Create a pipeline to link the operators](#create-a-pipeline-to-link-the-operators)
- [Debug your pipeline](#debug-your-pipeline)
- [Define a custom operator](#define-a-custom-operator)
- [Other references](#other-references)

### Reading data

Reading a single CSV file:

```python
from tasrif.processing_pipeline.pandas import ReadCsvOperator

operator = ReadCsvOperator('examples/quick_start/csvs/participant1.csv')
df = operator.process()[0]
```

Reading multiple CSVs in a folder:

```python
from tasrif.processing_pipeline.custom import ReadCsvFolderOperator

operator = ReadCsvFolderOperator(name_pattern='examples/quick_start/csvs/*.csv')
df = operator.process()[0]
```

By default, `ReadCsvFolderOperator` concatenates the CSVs into one dataframe. If you would like to work on the CSVs separately, you can pass the argument `concatenate=False` to `ReadCsvFolderOperator`, which returns a Python generator that iterates over the CSVs, as sketched below.
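A small sketch of consuming the generator (assuming it yields one dataframe per CSV):

```python
from tasrif.processing_pipeline.custom import ReadCsvFolderOperator

# concatenate=False returns a generator instead of a single dataframe
operator = ReadCsvFolderOperator(name_pattern='examples/quick_start/csvs/*.csv',
                                 concatenate=False)
generator = operator.process()[0]

# Iterate over the CSVs one dataframe at a time
for csv_df in generator:
    print(csv_df.head())
```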

Reading CSVs referenced by a column in dataframe `df`:

```python
import pandas as pd
from tasrif.processing_pipeline.custom import ReadNestedCsvOperator

df = pd.DataFrame({"name": ['Alfred', 'Roy'],
"age": [43, 32],
"csv_files_column": ['participant1.csv', 'participant2.csv']})

operator = ReadNestedCsvOperator(folder_path='examples/quick_start/csvs/',
field='csv_files_column')
generator = operator.process(df)[0]

for record, details in generator:
print(record)
print(details)
```

Reading JSON files referenced by a column in dataframe `df`:

```python
import pandas as pd
from tasrif.processing_pipeline.custom import IterateJsonOperator

df = pd.DataFrame({"name": ['Alfred', 'Roy'],
"age": [43, 32],
"json_files_column": ['participant1.json', 'participant2.json']})

operator = IterateJsonOperator(folder_path='examples/quick_start/jsons/',
field='json_files_column',
pipeline=None)
generator = operator.process(df)[0]

for record, details in generator:
print(record)
print(details)
```

### Compute statistics

Compute quick statistics using `StatisticsOperator`, which reports row counts, missing values, duplicate rows, and more.

```python
import pandas as pd
from tasrif.processing_pipeline.custom import StatisticsOperator

df = pd.DataFrame([
    ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
    ['2020-02-20', 1000, 1800, 1], ['2020-02-21', 5000, 2100, 1], ['2020-02-22', 10000, 2400, 1],
    ['2020-02-20', 0, 1600, 2], ['2020-02-21', 4000, 2000, 2], ['2020-02-22', 11000, 2400, 2],
    ['2020-02-20', None, 2000, 3], ['2020-02-21', 0, 2700, 3], ['2020-02-22', 15000, 3100, 3]],
    columns=['Day', 'Steps', 'Calories', 'PersonId'])

filter_features = {
    'Steps': lambda x: x > 0
}

sop = StatisticsOperator(participant_identifier='PersonId',
                         date_feature_name='Day',
                         filter_features=filter_features)
sop.process(df)[0]
```

Or use `ParticipationOverviewOperator` to see statistics per participant, as shown below. Pass the argument `overview_type="date_vs_features"` to compute statistics per date instead.

```python
from tasrif.processing_pipeline.custom import ParticipationOverviewOperator

sop = ParticipationOverviewOperator(participant_identifier='PersonId',
                                    date_feature_name='Day',
                                    overview_type='participant_vs_features')
sop.process(df)[0]
```
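For the per-date view, the same call can be made with the alternative overview type (a sketch reusing `df` from above):

```python
from tasrif.processing_pipeline.custom import ParticipationOverviewOperator

# Same operator, aggregated per date rather than per participant
sop = ParticipationOverviewOperator(participant_identifier='PersonId',
                                    date_feature_name='Day',
                                    overview_type='date_vs_features')
sop.process(df)[0]
```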

Use `AggregateOperator` if you require specific statistics for some columns

```python
from tasrif.processing_pipeline.custom import AggregateOperator

operator = AggregateOperator(groupby_feature_names="PersonId",
                             aggregation_definition={"Steps": ["mean", "std"],
                                                     "Calories": ["sum"]})
operator.process(df)[0]
```

### Extract features from existing columns

Convert time columns into cyclical features, which machine learning models can capture more effectively:

```python
from tasrif.processing_pipeline.custom import EncodeCyclicalFeaturesOperator
from tasrif.processing_pipeline.pandas import ReadCsvOperator

df = ReadCsvOperator('examples/quick_start/steps_per_day.csv',
                     parse_dates=['Date']).process()[0]

operator = EncodeCyclicalFeaturesOperator(date_feature_name="Date",
                                          category_definition="day")
operator.process(df)[0]
```

Extract time-series features using `CalculateTimeseriesPropertiesOperator`, which internally calls the `kats` package:

```python
from tasrif.processing_pipeline.kats import CalculateTimeseriesPropertiesOperator
from tasrif.processing_pipeline.pandas import ReadCsvOperator

df = ReadCsvOperator('examples/quick_start/long_ts.csv',
                     parse_dates=['Date']).process()[0]

operator = CalculateTimeseriesPropertiesOperator(date_feature_name="Date", value_column='Steps')
operator.process(df)[0]
```

Extract features using the `tsfresh` package:

```python
from tasrif.processing_pipeline.custom import SlidingWindowOperator
from tasrif.processing_pipeline.pandas import ReadCsvOperator
from tasrif.processing_pipeline.tsfresh import TSFreshFeatureExtractorOperator

df = ReadCsvOperator('examples/quick_start/cgm.csv',
                     parse_dates=['dateTime']).process()[0]

op = SlidingWindowOperator(winsize="1h15t",
                           time_col="dateTime",
                           label_col="CGM",
                           participant_identifier="patientID")

df_timeseries, df_labels, df_label_time, df_pids = op.process(df)[0]

op = TSFreshFeatureExtractorOperator(seq_id_col="seq_id",
                                     date_feature_name='dateTime',
                                     value_col='CGM')
features = op.process(df_timeseries)[0]
features.dropna(axis=1)
```

Note that `TSFreshFeatureExtractorOperator` requires a `seq_id` column, which indicates the entity that each time series belongs to. Features are extracted individually for each entity (id), and the resulting feature matrix contains one row per id. The column can be created manually, as sketched below, or via `SlidingWindowOperator`.
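For instance, a minimal sketch of creating `seq_id` manually, assuming each participant in the `cgm.csv` example above forms a single sequence:

```python
# Hypothetical: treat each participant as one entity, so the resulting
# feature matrix will contain one row per patientID
df['seq_id'] = df['patientID']

op = TSFreshFeatureExtractorOperator(seq_id_col="seq_id",
                                     date_feature_name='dateTime',
                                     value_col='CGM')
features = op.process(df)[0]
```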

### Filter data

Filter rows, days, or participants with a custom condition using `FilterOperator`:

```python
from tasrif.processing_pipeline.pandas import ReadCsvOperator
from tasrif.processing_pipeline.custom import FilterOperator

df = ReadCsvOperator('examples/quick_start/filter_example.csv',
                     parse_dates=['Hours']).process()[0]

operator = FilterOperator(participant_identifier="Id",
                          date_feature_name="Hours",
                          epoch_filter=lambda df: df['Steps'] > 10,
                          day_filter={
                              "column": "Hours",
                              "filter": lambda x: x.count() < 10,
                              "consecutive_days": (7, 12)  # min 7, max 12 consecutive days
                          },
                          filter_type="include")

operator.process(df)[0]
```

### Wrangle data

Add a column using `CreateFeatureOperator`

```python
import pandas as pd
from pandas import Timestamp
from tasrif.processing_pipeline.custom import CreateFeatureOperator

df = pd.DataFrame([
    [Timestamp('2016-12-31 00:00:00'), Timestamp('2017-01-01 09:03:00'), 5470, 2968, 1],
    [Timestamp('2017-01-01 00:00:00'), Timestamp('2017-01-01 23:44:00'), 9769, 2073, 1],
    [Timestamp('2017-01-02 00:00:00'), Timestamp('2017-01-02 16:54:00'), 9444, 2883, 1],
    [Timestamp('2017-01-03 00:00:00'), Timestamp('2017-01-05 22:49:00'), 20064, 2287, 1],
    [Timestamp('2017-01-04 00:00:00'), Timestamp('2017-01-06 07:27:00'), 16771, 2716, 1]],
    columns=['startTime', 'endTime', 'steps', 'calories', 'personId'])

operator = CreateFeatureOperator(
    feature_name="duration",
    feature_creator=lambda df: df['endTime'] - df['startTime'])

operator.process(df)[0]
```

Upsample or downsample date features using `ResampleOperator`. The first argument, `rule`, can be minutes (`min`), hours (`H`), days (`D`), and more. See the details of resampling [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html).

```python
from tasrif.processing_pipeline.pandas import ReadCsvOperator
from tasrif.processing_pipeline.custom import ResampleOperator

df = ReadCsvOperator('examples/quick_start/sleep.csv',
                     parse_dates=['timestamp'],
                     index_col=['timestamp']).process()[0]

op = ResampleOperator('D', {'sleep_level': 'mean'})
op.process(df)
```

Note that, currently, the dataframe index has to be of type `DatetimeIndex` for `ResampleOperator` to be applied correctly.
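A minimal sketch of preparing such an index when the timestamp lives in a regular column (assuming `SetIndexOperator`, used in the DataReaders example above, is importable from `tasrif.processing_pipeline.pandas`):

```python
from tasrif.processing_pipeline import SequenceOperator
from tasrif.processing_pipeline.pandas import ReadCsvOperator, ConvertToDatetimeOperator, SetIndexOperator
from tasrif.processing_pipeline.custom import ResampleOperator

# Read sleep.csv without index_col, then promote 'timestamp' to a DatetimeIndex
pipeline = SequenceOperator([
    ReadCsvOperator('examples/quick_start/sleep.csv'),
    ConvertToDatetimeOperator(feature_names=['timestamp']),
    SetIndexOperator('timestamp'),
    ResampleOperator('D', {'sleep_level': 'mean'}),
])
pipeline.process()
```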
Set the start hour of the day to a custom hour using `SetStartHourOfDayOperator`:

```python
from tasrif.processing_pipeline.pandas import ReadCsvOperator
from tasrif.processing_pipeline.custom import SetStartHourOfDayOperator

df = ReadCsvOperator('examples/quick_start/filter_example.csv',
                     parse_dates=['Hours']).process()[0]

operator = SetStartHourOfDayOperator(date_feature_name='Hours',
                                     participant_identifier='Id',
                                     shift=6)
operator.process(df)[0]
```

A new column `shifted_time_col` will be created. This can be useful if you want to calculate statistics over a redefined day instead of midnight-to-midnight (e.g. 8:00 AM to 8:00 AM), as sketched below.
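As a hedged sketch (assuming `shifted_time_col` in the output above is a datetime column), daily statistics over the redefined day can then be computed with plain Pandas:

```python
# With shift=6, grouping by the date part of shifted_time_col yields
# 6:00 AM-to-6:00 AM "days" per participant
shifted_df = operator.process(df)[0]
daily_steps = shifted_df.groupby(
    ['Id', shifted_df['shifted_time_col'].dt.date]
)['Steps'].sum()
```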

Concatenate multiple dataframes or a generator of dataframes using `ConcatOperator`:

```python
import pandas as pd
from pandas import Timestamp
from tasrif.processing_pipeline.pandas import ConcatOperator

df = pd.DataFrame([
    [Timestamp('2016-12-31 00:00:00'), Timestamp('2017-01-01 09:03:00'), 5470, 2968, 1],
    [Timestamp('2017-01-01 00:00:00'), Timestamp('2017-01-01 23:44:00'), 9769, 2073, 1],
    [Timestamp('2017-01-02 00:00:00'), Timestamp('2017-01-02 16:54:00'), 9444, 2883, 1],
    [Timestamp('2017-01-03 00:00:00'), Timestamp('2017-01-05 22:49:00'), 20064, 2287, 1],
    [Timestamp('2017-01-04 00:00:00'), Timestamp('2017-01-06 07:27:00'), 16771, 2716, 1]],
    columns=['startTime', 'endTime', 'steps', 'calories', 'personId'])

df1 = df.copy()
df2 = df.copy()

concatenated_df = ConcatOperator().process(df1, df2)[0]
```
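For the generator case, a sketch assuming `ReadCsvFolderOperator` with `concatenate=False` yields one dataframe per CSV (see [Reading data](#reading-data)); here the generator is unpacked into individual arguments:

```python
from tasrif.processing_pipeline.pandas import ConcatOperator
from tasrif.processing_pipeline.custom import ReadCsvFolderOperator

# Assumed: the generator yields one dataframe per CSV file
generator = ReadCsvFolderOperator(name_pattern='examples/quick_start/csvs/*.csv',
                                  concatenate=False).process()[0]

concatenated_df = ConcatOperator().process(*generator)[0]
```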

Normalize selected columns

```python
import pandas as pd
from tasrif.processing_pipeline.custom import NormalizeOperator
from tasrif.processing_pipeline.custom import NormalizeTransformOperator

df = pd.DataFrame([
    [1, "2020-05-01 00:00:00", 10],
    [1, "2020-05-01 01:00:00", 15],
    [1, "2020-05-01 03:00:00", 23],
    [2, "2020-05-02 00:00:00", 17],
    [2, "2020-05-02 01:00:00", 11]],
    columns=['logId', 'timestamp', 'sleep_level'])

op = NormalizeOperator('all', 'minmax', {'feature_range': (0, 2)})
output = op.process(df)
```

Use the fitted normalizer on different data via `NormalizeTransformOperator`:

```python
trained_model = output[0][1]

op = NormalizeTransformOperator('all', trained_model)

output = op.process(df)
output
```

Use `AggregateActivityDatesOperator` to view each participant's start and end dates in a
dataframe that has a date column per row per participant.

```python
import pandas as pd

from tasrif.processing_pipeline.custom import AggregateActivityDatesOperator
from tasrif.processing_pipeline.pandas import ReadCsvOperator

reader = ReadCsvOperator('examples/quick_start/activity_long.csv')
df = reader.process()[0]

operator = AggregateActivityDatesOperator(date_feature_name="date",
                                          participant_identifier=['Id', 'logId'])
df = operator.process(df)[0]
df
```

You can use `JqOperator` to process JSON data:

```python
from tasrif.processing_pipeline.custom import JqOperator

df = [
    {
        "date": "2020-01-01",
        "sleep": [
            {
                "sleep_data": [
                    {"level": "rem", "minutes": 180},
                    {"level": "deep", "minutes": 80},
                    {"level": "light", "minutes": 300}
                ]
            }
        ]
    },
    {
        "date": "2020-01-02",
        "sleep": [
            {
                "sleep_data": [
                    {"level": "rem", "minutes": 280},
                    {"level": "deep", "minutes": 60},
                    {"level": "light", "minutes": 200}
                ]
            }
        ]
    }
]

op = JqOperator("map({date, sleep: .sleep[].sleep_data})")

op.process(df)
```

### Test prepared data

See if your prepared data can act as an input to a machine learning model

```python
import pandas as pd
from tasrif.processing_pipeline.custom import LinearFitOperator

df = pd.DataFrame([
    [1, "2020-05-01 00:00:00", 10, 'poor'],
    [1, "2020-05-01 01:00:00", 15, 'poor'],
    [1, "2020-05-01 03:00:00", 23, 'good'],
    [2, "2020-05-02 00:00:00", 17, 'good'],
    [2, "2020-05-02 01:00:00", 11, 'poor']],
    columns=['logId', 'timestamp', 'sleep_level', 'sleep_quality'])

op = LinearFitOperator(feature_names='sleep_level',
                       target='sleep_quality',
                       target_type='categorical')
op.process(df)
```

### Create a pipeline to link the operators

Chain operators using `SequenceOperator`

```python
import pandas as pd
from tasrif.processing_pipeline import SequenceOperator
from tasrif.processing_pipeline.custom import AggregateOperator, CreateFeatureOperator, SetStartHourOfDayOperator
from tasrif.processing_pipeline.pandas import ConvertToDatetimeOperator, SortOperator, ReadCsvOperator

df = ReadCsvOperator('examples/quick_start/cgm.csv').process()[0]

pipeline = SequenceOperator([
    ConvertToDatetimeOperator(feature_names=["dateTime"]),
    SetStartHourOfDayOperator(date_feature_name='dateTime',
                              participant_identifier='patientID',
                              shift=6),
    SortOperator(by='dateTime'),
    AggregateOperator(groupby_feature_names="patientID",
                      aggregation_definition={"CGM": ["mean", "std"]})
])

pipeline.process(df)
```

### Debug your pipeline

Tasrif contains observers under `tasrif/processing_pipeline/observers/` that are useful for seeing how operators change your data. For instance, you can print the head of the processed dataframe after every operator by passing observers to the `observers` argument of `SequenceOperator`.

```python
import pandas as pd
from tasrif.processing_pipeline.pandas import RenameOperator
from tasrif.processing_pipeline.observers import FunctionalObserver, LoggingObserver, GroupbyLoggingObserver
from tasrif.processing_pipeline import SequenceOperator, Observer

df = pd.DataFrame([
    [1, "2020-05-01 00:00:00", 1],
    [1, "2020-05-01 01:00:00", 1],
    [1, "2020-05-01 03:00:00", 2],
    [2, "2020-05-02 00:00:00", 1],
    [2, "2020-05-02 01:00:00", 1]],
    columns=['logId', 'timestamp', 'sleep_level'])

pipeline = SequenceOperator([RenameOperator(columns={"timestamp": "time"}),
                             RenameOperator(columns={"time": "time_difference"})],
                            observers=[LoggingObserver("head,tail")])
result = pipeline.process(df)[0]
result
```

### Define a custom operator

Users can inherit from `MapProcessingOperator` to quickly build their own custom operators that perform map-like operations.

```python
from tasrif.processing_pipeline.map_processing_operator import MapProcessingOperator

class SizeOperator(MapProcessingOperator):
    def _processing_function(self, df):
        return df.size
```
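The custom operator can then be invoked like any built-in operator; a small sketch (assuming `process` returns one result per input dataframe):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# _processing_function is applied to the input dataframe
SizeOperator().process(df)  # expected output: [6]
```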

### Other references

- You may examine `tasrif/processing_pipeline/test_scripts/` for other minimal examples of Tasrif's operators.
- Operators wrapping common Pandas functions can be found under `tasrif/processing_pipeline/pandas/`.

## Documentation

Tasrif's official documentation is hosted here: [https://tasrif.qcri.org](https://tasrif.qcri.org)

You can build the docs locally after installing the dependencies in `setup.py` and
`requirements.txt`:

```bash
cd docs
make html
```

You can then browse through them by opening `docs/build/html/index.html` in a browser.

## Contributing

This project is much stronger with your collaboration. Be part of it!

Thank you to all our amazing contributors!