{"id":13571083,"url":"https://github.com/firmai/deltapy","last_synced_at":"2025-10-24T00:39:49.680Z","repository":{"id":40440080,"uuid":"253993655","full_name":"firmai/deltapy","owner":"firmai","description":"DeltaPy - Tabular Data Augmentation (by @firmai)","archived":false,"fork":false,"pushed_at":"2023-09-19T11:11:53.000Z","size":1543,"stargazers_count":543,"open_issues_count":2,"forks_count":55,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-05-05T02:51:39.798Z","etag":null,"topics":["augmentation","data-augmentation","data-science","feature-engineering","feature-extraction","finance","machine-learning","tabular-data","time-series"],"latest_commit_sha":null,"homepage":"https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3582219","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/firmai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-04-08T05:27:53.000Z","updated_at":"2025-04-15T07:19:36.000Z","dependencies_parsed_at":"2023-12-11T21:48:31.981Z","dependency_job_id":null,"html_url":"https://github.com/firmai/deltapy","commit_stats":{"total_commits":39,"total_committers":5,"mean_commits":7.8,"dds":0.3846153846153846,"last_synced_commit":"05c8a6440440bcc5ee1d051d7dfeee70807329de"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firmai%2Fdeltapy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firmai%2Fdeltapy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firmai%2Fdeltapy/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/firmai%2Fdeltapy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/firmai","download_url":"https://codeload.github.com/firmai/deltapy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252741726,"owners_count":21797074,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["augmentation","data-augmentation","data-science","feature-engineering","feature-extraction","finance","machine-learning","tabular-data","time-series"],"created_at":"2024-08-01T14:00:58.257Z","updated_at":"2025-10-24T00:39:49.608Z","avatar_url":"https://github.com/firmai.png","language":"Jupyter Notebook","readme":"\n# DeltaPy⁠⁠ — Tabular Data Augmentation \u0026 Feature Engineering\n\n[![Downloads](https://pepy.tech/badge/deltapy)](https://pepy.tech/project/deltapy)\n\n[![DOI](https://zenodo.org/badge/253993655.svg)](https://zenodo.org/badge/latestdoi/253993655)\n\n---------\n\nFinance Quant Machine Learning\n------------------\n- [ML-Quant.com](https://www.ml-quant.com/)  -  Automated Research Repository \n\n### Introduction\n\nTabular augmentation is a new experimental space that makes use of novel and traditional data generation and synthesisation techniques to improve model prediction success. It is in essence a process of modular feature engineering and observation engineering while emphasising the order of augmentation to achieve the best predicted outcome from a given information set. 
DeltaPy was created with finance applications in mind, but it can be broadly applied to any data-rich environment.\n\nTo take full advantage of tabular augmentation for time-series you would perform the techniques in the following order: **(1) transforming**, **(2) interacting**, **(3) mapping**, **(4) extracting**, and **(5) synthesising**. What follows is a practical example of how the above methodology can be used. The purpose here is to establish a framework for table augmentation and to point and guide the user to existing packages.\n\nFor most readers the [Colab Notebook](https://colab.research.google.com/drive/1-uJqGeKZfJegX0TmovhsO90iasyxZYiT) format might be preferred. I have enabled comments if you want to ask questions or address any issues you uncover. For anything pressing use the issues tab. Also have a look at the [SSRN report](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3582219) for a more succinct overview. \n\n![](assets/Tabularz.png)\n\nData augmentation can be defined as any method that could increase the size or improve the quality of a dataset by generating new features or instances without the collection of additional data-points. Data augmentation is of particular importance in image classification tasks where additional data can be created by cropping, padding, or flipping existing images.\n\nTabular cross-sectional and time-series prediction tasks can also benefit from augmentation. Here we divide tabular augmentation into columnar and row-wise methods. Row-wise methods are further divided into extraction and data synthesisation techniques, whereas columnar methods are divided into transformation, interaction, and mapping methods.  \n\nSee the [Skeleton Example](#example) for a combination of multiple methods that leads to a halving of the mean squared error. 
\n\n\n#### Installation \u0026 Citation\n----------\n```\npip install deltapy\n```\n\n```\n@software{deltapy,\n  title = {{DeltaPy}: Tabular Data Augmentation},\n  author = {Snow, Derek},\n  url = {https://github.com/firmai/deltapy/},\n  version = {0.1.0},\n  date = {2020-04-11},\n}\n```\n\n```\n Snow, Derek, DeltaPy: A Framework for Tabular Data Augmentation in Python (April 22, 2020). Available at SSRN: https://ssrn.com/abstract=3582219\n```\n\n\n### Function Glossary\n---------------\n\n**Transformation**\n```python\ndf_out = transform.robust_scaler(df.copy(), drop=[\"Close_1\"]); df_out.head()\ndf_out = transform.standard_scaler(df.copy(), drop=[\"Close\"]); df_out.head()           \ndf_out = transform.fast_fracdiff(df.copy(), [\"Close\",\"Open\"],0.5); df_out.head()\ndf_out = transform.windsorization(df.copy(),\"Close\",para,strategy='both'); df_out.head()\ndf_out = transform.operations(df.copy(),[\"Close\"]); df_out.head()\ndf_out = transform.triple_exponential_smoothing(df.copy(),[\"Close\"], 12, .2,.2,.2,0); \ndf_out = transform.naive_dec(df.copy(), [\"Close\",\"Open\"]); df_out.head()\ndf_out = transform.bkb(df.copy(), [\"Close\"]); df_out.head()\ndf_out = transform.butter_lowpass_filter(df.copy(),[\"Close\"],4); df_out.head()\ndf_out = transform.instantaneous_phases(df.copy(), [\"Close\"]); df_out.head()\ndf_out = transform.kalman_feat(df.copy(), [\"Close\"]); df_out.head()\ndf_out = transform.perd_feat(df.copy(),[\"Close\"]); df_out.head()\ndf_out = transform.fft_feat(df.copy(), [\"Close\"]); df_out.head()\ndf_out = transform.harmonicradar_cw(df.copy(), [\"Close\"],0.3,0.2); df_out.head()\ndf_out = transform.saw(df.copy(),[\"Close\",\"Open\"]); df_out.head()\ndf_out = transform.modify(df.copy(),[\"Close\"]); df_out.head()\ndf_out  = transform.prophet_feat(df.copy().reset_index(),[\"Close\",\"Open\"],\"Date\", \"D\"); df_out.head()\n```\n**Interaction**\n```python\ndf_out = interact.lowess(df.copy(), [\"Open\",\"Volume\"], df[\"Close\"], f=0.25, iter=3); 
df_out.head()\ndf_out = interact.autoregression(df.copy()); df_out.head()\ndf_out = interact.muldiv(df.copy(), [\"Close\",\"Open\"]); df_out.head()\ndf_out = interact.decision_tree_disc(df.copy(), [\"Close\"]); df_out.head()\ndf_out = interact.quantile_normalize(df.copy(), drop=[\"Close\"]); df_out.head()\ndf_out = interact.tech(df.copy()); df_out.head()\ndf_out = interact.genetic_feat(df.copy()); df_out.head()\n```\n**Mapping**\n```python\ndf_out = mapper.pca_feature(df.copy(),variance_or_components=0.80,drop_cols=[\"Close_1\"]); df_out.head()\ndf_out = mapper.cross_lag(df.copy()); df_out.head()\ndf_out = mapper.a_chi(df.copy()); df_out.head()\ndf_out = mapper.encoder_dataset(df.copy(), [\"Close_1\"], 15); df_out.head()\ndf_out = mapper.lle_feat(df.copy(),[\"Close_1\"],4); df_out.head()\ndf_out = mapper.feature_agg(df.copy(),[\"Close_1\"],4 ); df_out.head()\ndf_out = mapper.neigh_feat(df.copy(),[\"Close_1\"],4 ); df_out.head()\n```\n\n**Extraction**\n```python\nextract.abs_energy(df[\"Close\"])\nextract.cid_ce(df[\"Close\"], 
True)\nextract.mean_abs_change(df[\"Close\"])\nextract.mean_second_derivative_central(df[\"Close\"])\nextract.variance_larger_than_standard_deviation(df[\"Close\"])\nextract.var_index(df[\"Close\"].values,var_index_param)\nextract.symmetry_looking(df[\"Close\"])\nextract.has_duplicate_max(df[\"Close\"])\nextract.partial_autocorrelation(df[\"Close\"])\nextract.augmented_dickey_fuller(df[\"Close\"])\nextract.gskew(df[\"Close\"])\nextract.stetson_mean(df[\"Close\"])\nextract.length(df[\"Close\"])\nextract.count_above_mean(df[\"Close\"])\nextract.longest_strike_below_mean(df[\"Close\"])\nextract.wozniak(df[\"Close\"])\nextract.last_location_of_maximum(df[\"Close\"])\nextract.fft_coefficient(df[\"Close\"])\nextract.ar_coefficient(df[\"Close\"])\nextract.index_mass_quantile(df[\"Close\"])\nextract.number_cwt_peaks(df[\"Close\"])\nextract.spkt_welch_density(df[\"Close\"])\nextract.linear_trend_timewise(df[\"Close\"])\nextract.c3(df[\"Close\"])\nextract.binned_entropy(df[\"Close\"])\nextract.svd_entropy(df[\"Close\"].values)\nextract.hjorth_complexity(df[\"Close\"])\nextract.max_langevin_fixed_point(df[\"Close\"])\nextract.percent_amplitude(df[\"Close\"])\nextract.cad_prob(df[\"Close\"])\nextract.zero_crossing_derivative(df[\"Close\"])\nextract.detrended_fluctuation_analysis(df[\"Close\"])\nextract.fisher_information(df[\"Close\"])\nextract.higuchi_fractal_dimension(df[\"Close\"])\nextract.petrosian_fractal_dimension(df[\"Close\"])\nextract.hurst_exponent(df[\"Close\"])\nextract.largest_lyauponov_exponent(df[\"Close\"])\nextract.whelch_method(df[\"Close\"])\nextract.find_freq(df[\"Close\"])\nextract.flux_perc(df[\"Close\"])\nextract.range_cum_s(df[\"Close\"])\nextract.structure_func(df[\"Close\"])\nextract.kurtosis(df[\"Close\"])\nextract.stetson_k(df[\"Close\"])\n```\n\nTest sets should ideally not be preprocessed together with the training data, because doing so amounts to peeking ahead: information from the test set leaks into training. 
The preprocessing parameters should be identified on the training set and then applied to the test set, i.e., the test set should not have an impact on the transformation applied. As an example, you would learn the parameters of PCA decomposition on the training set and then apply the parameters to both the train and the test set. \n\nThe benefit of pipelines becomes clear when one wants to apply multiple augmentation methods. A pipeline makes it easy to learn the parameters and then apply them widely. For the most part, this notebook does not concern itself with 'peeking ahead' or pipelines; for some functions, one might have to restructure the code and make use of open source packages to create your preferred solution.\n\n\nDocumentation by Example\n-----------------\n\n**Notebook Dependencies**\n\n\n```python\npip install deltapy\n```\n\n\n```python\npip install pykalman\npip install tsaug\npip install ta\npip install pandasvault\npip install gplearn\npip install seasonal\n```\n\n### Data and Package Load\n\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom deltapy import transform, interact, mapper, extract \nimport warnings\nwarnings.filterwarnings('ignore')\n\ndef data_copy():\n  df = pd.read_csv(\"https://github.com/firmai/random-assets-two/raw/master/numpy/tsla.csv\")\n  df[\"Close_1\"] = df[\"Close\"].shift(-1)\n  df = df.dropna()\n  df[\"Date\"] = pd.to_datetime(df[\"Date\"])\n  df = df.set_index(\"Date\")\n  return df\ndf = data_copy(); df.head()\n```\n\nSome of these categories are fluid and some techniques could fit into multiple buckets. This is an attempt to find an exhaustive number of techniques, but not an exhaustive list of implementations of the techniques. 
For example, there are thousands of ways to smooth a time-series, but we have only included one or two techniques of interest under each category.\n\n### **(1) [\u003cfont color=\"black\"\u003eTransformation:\u003c/font\u003e](#transformation)**\n-----------------\n1. Scaling/Normalisation\n2. Standardisation\n3. Differencing\n4. Capping\n5. Operations\n6. Smoothing\n7. Decomposing\n8. Filtering\n9. Spectral Analysis\n10. Waveforms\n11. Modifications\n12. Rolling\n13. Lagging\n14. Forecast Model\n\n### **(2) [\u003cfont color=\"black\"\u003eInteraction:\u003c/font\u003e](#interaction)**\n-----------------\n1. Regressions\n2. Operators\n3. Discretising\n4. Normalising\n5. Distance\n6. Speciality\n7. Genetic\n\n### **(3) [\u003cfont color=\"black\"\u003eMapping:\u003c/font\u003e](#mapping)**\n-----------------\n1. Eigen Decomposition\n2. Cross Decomposition\n3. Kernel Approximation\n4. Autoencoder\n5. Manifold Learning\n6. Clustering\n7. Neighbouring\n\n### **(4) [\u003cfont color=\"black\"\u003eExtraction:\u003c/font\u003e](#extraction)**\n-----------------\n1. Energy\n2. Distance\n3. Differencing\n4. Derivative\n5. Volatility\n6. Shape\n7. Occurrence\n8. Autocorrelation\n9. Stochasticity\n10. Averages\n11. Size\n12. Count\n13. Streaks\n14. Location\n15. Model Coefficients\n16. Quantile\n17. Peaks\n18. Density\n19. Linearity\n20. Non-linearity\n21. Entropy\n22. Fixed Points\n23. Amplitude\n24. Probability\n25. Crossings\n26. Fluctuation\n27. Information\n28. Fractals\n29. Exponent\n30. Spectral Analysis\n31. Percentile\n32. Range\n33. Structural\n34. Distribution\n\n\n\u003ca name=\"transformation\"\u003e\u003c/a\u003e\n\n## **(1) Transformation**\n\nHere a transformation is any method that takes a single feature as input and produces one or more new features. Transformations can be applied to cross-section and time-series data. Some transformations are exclusive to time-series data (smoothing, filtering), but a handful of functions apply to both. 
\n\nWhere a time series method has a centred mean, or is forward-looking, there is a need to recalculate the outputted time series on a running basis to ensure that information from the future does not leak into the model. The last value of this recalculated series or an extracted feature from this series can then be used as a running value that is only backward looking, satisfying the no 'peeking ahead' rule. \n\nThere are some packages in Python that dynamically create time series and extract their features, but none that incorporates the dynamic creation of a time series in combination with a wide application of a prespecified list of extractions. Because this technique is expensive, we have a preference for models that only take historical data into account. \n\nIn this section we will include a list of all types of transformations, those that only use present information (operations), those that incorporate all values (interpolation methods), those that only include past values (smoothing functions), and those that incorporate a subset window of lagging and leading values (select filters). Only those that use historical values or are turned into prediction methods can be used out of the box. The entire time series can be used in the model development process for historical value methods, and only the forecasted values can be used for prediction models. \n\nCurve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a \"smooth\" function is constructed that approximately fits the data. When using an interpolation method, e.g., a cubic spline, you are taking future information into account. You can use interpolation methods to forecast into the future (extrapolation), and then use those forecasts in a training set. Or you could recalculate the interpolation for each time step and then extract features out of that series (extraction method). 
Interpolation and other forward-looking methods can be used if they are turned into prediction problems; the forecasted values can then be trained and tested on, and the fitted data can be disregarded. In the list presented below, the first five methods can be used for cross-section and time-series data; the time-series-only methods follow after that.\n\n#### **(1) Scaling/Normalisation**\n\nThere is a multitude of scaling methods available. Scaling generally gets applied to the entire dataset and is especially necessary for certain algorithms. K-means makes use of Euclidean distance, hence the need for scaling. For PCA, because we are trying to identify the features with maximum variance, we also need scaling. Similarly, we need scaled features for gradient descent. Algorithms that rely on neither distances nor gradient-based optimisation, such as tree-based models, are largely unaffected by feature scaling. Available methods include range scalers such as the minimum-maximum scaler and the maximum absolute scaler, as well as standardisation methods such as the standard scaler. The example used here is the robust scaler. Normalisation is a good technique when you don't know the distribution of the data. Scaling looks into the future, so parameters have to be learned on a training set and then applied to a test set.\n\n(i) Robust Scaler\n\nScaling according to the interquartile range, making it robust to outliers.\n\n\n```python\ndef robust_scaler(df, drop=None,quantile_range=(25, 75) ):\n    if drop:\n      keep = df[drop]\n      df = df.drop(drop, axis=1)\n    center = np.median(df, axis=0)\n    quantiles = np.percentile(df, quantile_range, axis=0)\n    scale = quantiles[1] - quantiles[0]\n    df = (df - center) / scale\n    if drop:\n      df = pd.concat((keep,df),axis=1)\n    return df\n\ndf_out = transform.robust_scaler(df.copy(), drop=[\"Close_1\"]); df_out.head()\n```\n\n#### **(2) Standardisation**\n\nStandardisation is often more effective when the attribute itself is Gaussian. 
It is also useful to apply the technique when the model you want to use assumes Gaussian distributions, as linear regression, logistic regression, and linear discriminant analysis do. For most applications, standardisation is recommended.\n\n(i) Standard Scaler\n\nStandardise features by removing the mean and scaling to unit variance.\n\n\n```python\ndef standard_scaler(df,drop ):\n    if drop:\n      keep = df[drop]\n      df = df.drop(drop, axis=1)\n    mean = np.mean(df, axis=0)\n    scale = np.std(df, axis=0)\n    df = (df - mean) / scale  \n    if drop:\n      df = pd.concat((keep,df),axis=1)\n    return df\n\n\ndf_out = transform.standard_scaler(df.copy(), drop=[\"Close\"]); df_out.head()           \n```\n\n#### **(3) Differencing**\n\nComputing the differences between consecutive observations, normally used to obtain a stationary time series.\n\n(i) Fractional Differencing\n\nFractional differencing allows us to achieve stationarity while maintaining the maximum amount of memory compared to integer differencing.\n\n\n```python\nimport pylab as pl\n\ndef fast_fracdiff(x, cols, d):\n    for col in cols:\n      T = len(x[col])\n      np2 = int(2 ** np.ceil(np.log2(2 * T - 1)))\n      k = np.arange(1, T)\n      b = (1,) + tuple(np.cumprod((k - d - 1) / k))\n      z = (0,) * (np2 - T)\n      z1 = b + z\n      z2 = tuple(x[col]) + z\n      dx = pl.ifft(pl.fft(z1) * pl.fft(z2))\n      x[col+\"_frac\"] = np.real(dx[0:T])\n    return x \n  \ndf_out = transform.fast_fracdiff(df.copy(), [\"Close\",\"Open\"],0.5); df_out.head()\n```\n\n#### **(4) Capping**\n\nAny method that sets a floor and a cap on a feature's value. Capping can affect the distribution of data, so it should not be exaggerated. 
One can cap values by using the average, by using the max and min values, or by an arbitrary extreme value.\n\n\n\n\n(i) Winsorisation\n\nThe transformation of features by limiting extreme values in the statistical data, replacing them with a certain percentile value to reduce the effect of possibly spurious outliers.\n\n\n```python\ndef outlier_detect(data,col,threshold=1,method=\"IQR\"):\n  \n    if method == \"IQR\":\n      IQR = data[col].quantile(0.75) - data[col].quantile(0.25)\n      Lower_fence = data[col].quantile(0.25) - (IQR * threshold)\n      Upper_fence = data[col].quantile(0.75) + (IQR * threshold)\n    if method == \"STD\":\n      Upper_fence = data[col].mean() + threshold * data[col].std()\n      Lower_fence = data[col].mean() - threshold * data[col].std()   \n    if method == \"OWN\":\n      Upper_fence = data[col].mean() + threshold * data[col].std()\n      Lower_fence = data[col].mean() - threshold * data[col].std() \n    if method ==\"MAD\":\n      median = data[col].median()\n      median_absolute_deviation = np.median([np.abs(y - median) for y in data[col]])\n      modified_z_scores = pd.Series([0.6745 * (y - median) / median_absolute_deviation for y in data[col]])\n      outlier_index = np.abs(modified_z_scores) \u003e threshold\n      print('Num of outlier detected:',outlier_index.value_counts()[1])\n      print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))\n      return outlier_index, (median_absolute_deviation, median_absolute_deviation)\n\n    para = (Upper_fence, Lower_fence)\n    tmp = pd.concat([data[col]\u003eUpper_fence,data[col]\u003cLower_fence],axis=1)\n    outlier_index = tmp.any(axis=1)\n    print('Num of outlier detected:',outlier_index.value_counts()[1])\n    print('Proportion of outlier detected',outlier_index.value_counts()[1]/len(outlier_index))\n    \n    return outlier_index, para\n\ndef windsorization(data,col,para,strategy='both'):\n    \"\"\"\n    top-coding \u0026 bottom coding 
(capping the maximum of a distribution at an arbitrarily set value,vice versa)\n    \"\"\"\n\n    data_copy = data.copy(deep=True)  \n    if strategy == 'both':\n        data_copy.loc[data_copy[col]\u003epara[0],col] = para[0]\n        data_copy.loc[data_copy[col]\u003cpara[1],col] = para[1]\n    elif strategy == 'top':\n        data_copy.loc[data_copy[col]\u003epara[0],col] = para[0]\n    elif strategy == 'bottom':\n        data_copy.loc[data_copy[col]\u003cpara[1],col] = para[1]  \n    return data_copy\n\n_, para = transform.outlier_detect(df, \"Close\")\ndf_out = transform.windsorization(df.copy(),\"Close\",para,strategy='both'); df_out.head()\n\n```\n\n#### **(5) Operations**\n\n\nOperations here are treated like traditional transformations. An operation is the replacement of a variable by a function of that variable. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship.\n\n(i) Power, Log, Reciprocal, Square Root\n\n\n```python\ndef operations(df,features):\n  df_new = df[features]\n  df_new = df_new - df_new.min()\n\n  sqr_name = [str(fa)+\"_POWER_2\" for fa in df_new.columns]\n  log_p_name = [str(fa)+\"_LOG_p_one_abs\" for fa in df_new.columns]\n  rec_p_name = [str(fa)+\"_RECIP_p_one\" for fa in df_new.columns]\n  sqrt_name = [str(fa)+\"_SQRT_p_one\" for fa in df_new.columns]\n\n  df_sqr = pd.DataFrame(np.power(df_new.values, 2),columns=sqr_name, index=df.index)\n  df_log = pd.DataFrame(np.log(df_new.add(1).abs().values),columns=log_p_name, index=df.index)\n  df_rec = pd.DataFrame(np.reciprocal(df_new.add(1).values),columns=rec_p_name, index=df.index)\n  df_sqrt = pd.DataFrame(np.sqrt(df_new.abs().add(1).values),columns=sqrt_name, index=df.index)\n\n  dfs = [df, df_sqr, df_log, df_rec, df_sqrt]\n\n  df=  pd.concat(dfs, axis=1)\n\n  return df\n\ndf_out = transform.operations(df.copy(),[\"Close\"]); df_out.head()\n```\n\n#### **(6) Smoothing**\n\nHere we maintain that any method that has a component of historical 
averaging is a smoothing method, such as a simple moving average or single, double and triple exponential smoothing. Because they only look backwards, these are causal filters; they are also popular in signal processing, where exponential smoothing is called an IIR filter and a moving average an FIR filter with equal weighting factors.\n\n(i) Triple Exponential Smoothing (Holt-Winters Exponential Smoothing)\n\nThe Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level $ℓ_t$, one for the trend $b_t$, and one for the seasonal component $s_t$. This particular version is performed by looking at the last 12 periods. For that reason, the first 12 records should be disregarded because they can't make use of the required window size for a fair calculation. The calculation is such that values are still provided for those periods based on whatever data might be available. \n\n\n```python\ndef initial_trend(series, slen):\n    sum = 0.0\n    for i in range(slen):\n        sum += float(series[i+slen] - series[i]) / slen\n    return sum / slen\n\ndef initial_seasonal_components(series, slen):\n    seasonals = {}\n    season_averages = []\n    n_seasons = int(len(series)/slen)\n    # compute season averages\n    for j in range(n_seasons):\n        season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))\n    # compute initial values\n    for i in range(slen):\n        sum_of_vals_over_avg = 0.0\n        for j in range(n_seasons):\n            sum_of_vals_over_avg += series[slen*j+i]-season_averages[j]\n        seasonals[i] = sum_of_vals_over_avg/n_seasons\n    return seasonals\n\ndef triple_exponential_smoothing(df,cols, slen, alpha, beta, gamma, n_preds):\n    for col in cols:\n      result = []\n      seasonals = initial_seasonal_components(df[col], slen)\n      for i in range(len(df[col])+n_preds):\n          if i == 0: # initial values\n              smooth = df[col][0]\n              trend = 
initial_trend(df[col], slen)\n              result.append(df[col][0])\n              continue\n          if i \u003e= len(df[col]): # we are forecasting\n              m = i - len(df[col]) + 1\n              result.append((smooth + m*trend) + seasonals[i%slen])\n          else:\n              val = df[col][i]\n              last_smooth, smooth = smooth, alpha*(val-seasonals[i%slen]) + (1-alpha)*(smooth+trend)\n              trend = beta * (smooth-last_smooth) + (1-beta)*trend\n              seasonals[i%slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i%slen]\n              result.append(smooth+trend+seasonals[i%slen])\n      df[col+\"_TES\"] = result\n    #print(seasonals)\n    return df\n\ndf_out= transform.triple_exponential_smoothing(df.copy(),[\"Close\"], 12, .2,.2,.2,0); df_out.head()\n```\n\n#### **(7) Decomposing**\n\nDecomposition procedures are used in time series to describe the trend and seasonal factors in a time series. More extensive decompositions might also include long-run cycles, holiday effects, day of week effects and so on. Here, we’ll only consider trend and seasonal decompositions. A naive decomposition makes use of moving averages; other decomposition methods are available that make use of LOESS.\n\n(i) Naive Decomposition\n\nThe base trend takes historical information into account and establishes moving averages; it does not have to be linear. To estimate the seasonal component for each season, simply average the detrended values for that season. If the seasonal variation looks constant, we should use the additive model. If the magnitude is increasing as a function of time, we will use multiplicative. Because the task here is predictive in nature, we use a one-sided moving average rather than a two-sided centred average. 
\n\n\n```python\nimport statsmodels.api as sm\n\ndef naive_dec(df, columns, freq=2):\n  for col in columns:\n    decomposition = sm.tsa.seasonal_decompose(df[col], model='additive', freq = freq, two_sided=False)\n    # one output column per component, so that none overwrites another\n    df[col+\"_NDDT\"] = decomposition.trend\n    df[col+\"_NDDS\"] = decomposition.seasonal\n    df[col+\"_NDDR\"] = decomposition.resid\n  return df\n\ndf_out = transform.naive_dec(df.copy(), [\"Close\",\"Open\"]); df_out.head()\n```\n\n#### **(8) Filtering**\n\nIt is often useful to either low-pass filter (smooth) time series in order to reveal low-frequency features and trends, or to high-pass filter (detrend) time series in order to isolate high frequency transients (e.g. storms). Low-pass filters use historical values; high-pass filters detrend with low-pass filters, so they also indirectly use historical values.\n\nThere are a few filters available, closely associated with decompositions and smoothing functions. The Hodrick-Prescott filter separates a time-series $y_t$ into a trend $τ_t$ and a cyclical component $ζ_t$. The Christiano-Fitzgerald filter is a generalization of the Baxter-King filter and can be seen as a weighted moving average.\n\n(i) Baxter-King Bandpass\n\nThe Baxter-King filter is intended to explicitly deal with the periodicity of the business cycle. By applying their band-pass filter to a series, they produce a new series that does not contain fluctuations at frequencies higher or lower than those of the business cycle. The parameters are arbitrarily chosen. This method uses a centred moving average that has to be changed to a lagged moving average before it can be used as an input feature. 
The maximum period of oscillation should be used as the point to truncate the dataset, as that part of the time series does not incorporate all the required datapoints.\n\n\n```python\nimport statsmodels.api as sm\n\ndef bkb(df, cols):\n  for col in cols:\n    df[col+\"_BPF\"] = sm.tsa.filters.bkfilter(df[[col]].values, 2, 10, len(df)-1)\n  return df\n\ndf_out = transform.bkb(df.copy(), [\"Close\"]); df_out.head()\n```\n\n(ii) Butter Lowpass (IIR Filter Design)\n\nThe Butterworth filter is a type of signal processing filter designed to have a frequency response as flat as possible in the passband. Like other filters, the first few values have to be disregarded for accurate downstream prediction. Instead of disregarding these values on a per case basis, they can be disregarded in one chunk once the database of transformed features has been developed.\n\n\n```python\nfrom scipy import signal, integrate\ndef butter_lowpass(cutoff, fs=20, order=5):\n    nyq = 0.5 * fs\n    normal_cutoff = cutoff / nyq\n    b, a = signal.butter(order, normal_cutoff, btype='low', analog=False)\n    return b, a\n    \ndef butter_lowpass_filter(df,cols, cutoff, fs=20, order=5):\n    b, a = butter_lowpass(cutoff, fs, order=order)\n    for col in cols:\n      df[col+\"_BUTTER\"] = signal.lfilter(b, a, df[col])\n    return df\n\ndf_out = transform.butter_lowpass_filter(df.copy(),[\"Close\"],4); df_out.head()\n```\n\n(iii) Hilbert Transform Angle\n\nThe Hilbert transform is a time-domain to time-domain transformation which shifts the phase of a signal by 90 degrees. 
It is also a centred measure and would be difficult to use in a time series prediction setting, unless it is recalculated on a per step basis or transformed to be based on historical values only.\n\n\n```python\nfrom scipy import signal\nimport numpy as np\n\ndef instantaneous_phases(df,cols):\n    for col in cols:\n      df[col+\"_HILLB\"] = np.unwrap(np.angle(signal.hilbert(df[col], axis=0)), axis=0)\n    return df\n\ndf_out = transform.instantaneous_phases(df.copy(), [\"Close\"]); df_out.head()\n```\n\n(iv) Unscented Kalman Filter\n\n\nThe Kalman filter is better suited for estimating things that change over time. The most tangible example is tracking moving objects. A Kalman filter will be very close to the actual trajectory because it weights the most recent measurement more heavily than the older ones. The Unscented Kalman Filter (UKF) is a model-based technique that recursively estimates the states (and with some modifications also parameters) of a nonlinear, dynamic, discrete-time system. The UKF is based on the typical prediction-correction style methods. The Kalman Smoother incorporates future values; the Filter doesn't and can be used for online prediction. The normal Kalman filter is a forward filter in the sense that it makes a forecast of the current state using only current and past observations, whereas the smoother is based on computing a suitable linear combination of two filters, which are run in forward and backward directions. 
\n\n\n\n\n```python\nfrom pykalman import UnscentedKalmanFilter\n\ndef kalman_feat(df, cols):\n  for col in cols:\n    ukf = UnscentedKalmanFilter(lambda x, w: x + np.sin(w), lambda x, v: x + v, observation_covariance=0.1)\n    (filtered_state_means, filtered_state_covariances) = ukf.filter(df[col])\n    (smoothed_state_means, smoothed_state_covariances) = ukf.smooth(df[col])\n    df[col+\"_UKFSMOOTH\"] = smoothed_state_means.flatten()\n    df[col+\"_UKFFILTER\"] = filtered_state_means.flatten()\n  return df \n\ndf_out = transform.kalman_feat(df.copy(), [\"Close\"]); df_out.head()\n```\n\n#### **(9) Spectral Analysis**\n\nThere are a range of functions for spectral analysis. You can use periodograms and the welch method to estimate the power spectral density. You can also use the welch method to estimate the cross power spectral density. Other techniques include spectograms, Lomb-Scargle periodograms and, short time fourier transform.\n\n(i) Periodogram\n\nThis returns an array of sample frequencies and the power spectrum of x, or the power spectral density of x.\n\n\n```python\nfrom scipy import signal\ndef perd_feat(df, cols):\n  for col in cols:\n    sig = signal.periodogram(df[col],fs=1, return_onesided=False)\n    df[col+\"_FREQ\"] = sig[0]\n    df[col+\"_POWER\"] = sig[1]\n  return df\n\ndf_out = transform.perd_feat(df.copy(),[\"Close\"]); df_out.head()\n```\n\n(ii) Fast Fourier Transform\n\nThe FFT, or fast fourier transform is an algorithm that essentially uses convolution techniques to efficiently find the magnitude and location of the tones that make up the signal of interest. We can often play with the FFT spectrum, by adding and removing successive tones (which is akin to selectively filtering particular tones that make up the signal), in order to obtain a smoothed version of the underlying signal. This takes the entire signal into account, and as a result has to be recalculated on a running basis to avoid peaking into the future. 
\n\n\n\n```python\ndef fft_feat(df, cols):\n  for col in cols:\n    fft_df = np.fft.fft(np.asarray(df[col].tolist()))\n    fft_df = pd.DataFrame({'fft':fft_df})\n    df[col+'_FFTABS'] = fft_df['fft'].apply(lambda x: np.abs(x)).values\n    df[col+'_FFTANGLE'] = fft_df['fft'].apply(lambda x: np.angle(x)).values\n  return df \n\ndf_out = transform.fft_feat(df.copy(), [\"Close\"]); df_out.head()\n```\n\n#### **(10) Waveforms**\n\nThe waveform of a signal is the shape of its graph as a function of time.\n\n(i) Continuous Wave Radar\n\n\n```python\nfrom scipy import signal\ndef harmonicradar_cw(df, cols, fs,fc):\n    for col in cols:\n      ttxt = f'CW: {fc} Hz'\n      #%% input\n      t = df[col]\n      tx = np.sin(2*np.pi*fc*t)\n      _,Pxx = signal.welch(tx,fs)\n      #%% diode\n      d = (signal.square(2*np.pi*fc*t))\n      d[d\u003c0] = 0.\n      #%% output of diode\n      rx = tx * d\n      df[col+\"_HARRAD\"] = rx.values\n    return df\n\ndf_out = transform.harmonicradar_cw(df.copy(), [\"Close\"],0.3,0.2); df_out.head()\n```\n\n(ii) Saw Tooth\n\nReturn a periodic sawtooth or triangle waveform.\n\n\n```python\ndef saw(df, cols):\n  for col in cols:\n    df[col+\" SAW\"] = signal.sawtooth(df[col])\n  return df\n\ndf_out = transform.saw(df.copy(),[\"Close\",\"Open\"]); df_out.head()\n```\n\n##### **(9) Modifications**\n\nA range of modification usually applied ot images, these values would have to be recalculate for each time-series. 
\n\n(i) Various Techniques\n\n\n```python\nfrom tsaug import *\ndef modify(df, cols):\n  for col in cols:\n    series = df[col].values\n    df[col+\"_magnify\"], _ = magnify(series, series)\n    df[col+\"_affine\"], _ = affine(series, series)\n    df[col+\"_crop\"], _ = crop(series, series)\n    df[col+\"_cross_sum\"], _ = cross_sum(series, series)\n    df[col+\"_resample\"], _ = resample(series, series)\n    df[col+\"_trend\"], _ = trend(series, series)\n\n    df[col+\"_random_affine\"], _ = random_time_warp(series, series)\n    df[col+\"_random_crop\"], _ = random_crop(series, series)\n    df[col+\"_random_cross_sum\"], _ = random_cross_sum(series, series)\n    df[col+\"_random_sidetrack\"], _ = random_sidetrack(series, series)\n    df[col+\"_random_time_warp\"], _ = random_time_warp(series, series)\n    df[col+\"_random_magnify\"], _ = random_magnify(series, series)\n    df[col+\"_random_jitter\"], _ = random_jitter(series, series)\n    df[col+\"_random_trend\"], _ = random_trend(series, series)\n  return df\n\ndf_out = transform.modify(df.copy(),[\"Close\"]); df_out.head()\n```\n\n#### **(11) Rolling**\n\nFeatures that are calculated on a rolling basis over fixed window size.\n\n(i) Mean, Standard Deviation\n\n\n```python\ndef multiple_rolling(df, windows = [1,2], functions=[\"mean\",\"std\"], columns=None):\n  windows = [1+a for a in windows]\n  if not columns:\n    columns = df.columns.to_list()\n  rolling_dfs = (df[columns].rolling(i)                                    # 1. Create window\n                  .agg(functions)                                # 1. Aggregate\n                  .rename({col: '{0}_{1:d}'.format(col, i)\n                                for col in columns}, axis=1)  # 2. 
Rename columns\n                for i in windows)                                # For each window\n  df_out = pd.concat((df, *rolling_dfs), axis=1)\n  da = df_out.iloc[:,len(df.columns):]\n  da = [col[0] + \"_\" + col[1] for col in  da.columns.to_list()]\n  df_out.columns = df.columns.to_list() + da \n\n  return  df_out                      # 3. Concatenate dataframes\n\ndf_out = transform.multiple_rolling(df, columns=[\"Close\"]); df_out.head()\n```\n\n#### **(12) Lagging**\n\nLagged values from existing features.\n\n(i) Single Steps\n\n\n```python\ndef multiple_lags(df, start=1, end=3,columns=None):\n  if not columns:\n    columns = df.columns.to_list()\n  lags = range(start, end+1)  # Lags from start to end, inclusive.\n\n  df = df.assign(**{\n      '{}_t_{}'.format(col, t): df[col].shift(t)\n      for t in lags\n      for col in columns\n  })\n  return df\n\ndf_out = transform.multiple_lags(df, start=1, end=3, columns=[\"Close\"]); df_out.head()\n```\n\n#### **(13) Forecast Model**\n\nThere are a range of time series models that can be implemented, like AR, MA, ARMA, ARIMA, SARIMA, SARIMAX, VAR, VARMA, VARMAX, SES, and HWES. The models can be divided into autoregressive models and smoothing models. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. Each method might require specific tuning and parameters to suit your prediction task. You need to drop a certain amount of historical data that you use during the fitting stage. Models that take seasonality into account need more training data.\n\n\n(i) Prophet\n\nProphet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality. You can apply additive models to your training data but also interactive models like deep learning models. 
The problem is that because these models have learned from future observations, there would thus be a need to recalculate the time series on a running basis, or to only include the predicted as opposed to fitted values in future training and test sets. In this example, I train on 150 data points to illustrate how the remaining 100 or so datapoints can be used in a new prediction problem. You can plot with ```df_out[\"Close_PROPHET\"].plot()``` to see the effect.\n\n\n```python\nimport pandas as pd\nfrom fbprophet import Prophet\n\ndef prophet_feat(df, cols,date, freq,train_size=150):\n  def prophet_dataframe(df): \n    df.columns = ['ds','y']\n    return df\n\n  def original_dataframe(df, freq, name):\n    prophet_pred = pd.DataFrame({\"Date\" : df['ds'], name : df[\"yhat\"]})\n    prophet_pred = prophet_pred.set_index(\"Date\")\n    #prophet_pred.index.freq = pd.tseries.frequencies.to_offset(freq)\n    return prophet_pred[name].values\n\n  for col in cols:\n    model = Prophet(daily_seasonality=True)\n    fb = model.fit(prophet_dataframe(df[[date, col]].head(train_size)))\n    forecast_len = len(df) - train_size\n    future = model.make_future_dataframe(periods=forecast_len,freq=freq)\n    future_pred = model.predict(future)\n    df[col+\"_PROPHET\"] = list(original_dataframe(future_pred,freq,col))\n  return df\n\ndf_out  = transform.prophet_feat(df.copy().reset_index(),[\"Close\",\"Open\"],\"Date\", \"D\"); df_out.head()\n```\n\n\u003ca name=\"interaction\"\u003e\u003c/a\u003e\n\n## **(2) Interaction**\n\nInteractions are defined as methods that require more than one feature to create an additional feature. 
Here we include normalising and discretising techniques that are non-feature specific. Almost all of these methods can be applied to cross-sectional data. The only methods that are time-specific are the technical features in the speciality section and the autoregression model.\n\n#### **(1) Regression**\n\nRegression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables. \n\n(i) Lowess Smoother\n\nThe lowess smoother is a robust locally weighted regression. The function fits a nonparametric regression curve to a scatterplot.\n\n\n```python\nfrom math import ceil\nimport numpy as np\nfrom scipy import linalg\nimport math\n\ndef lowess(df, cols, y, f=2. / 3., iter=3):\n    for col in cols:\n      x = df[col].values  # plain array, so indexing below is positional\n      n = len(x)\n      r = int(ceil(f * n))\n      h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]\n      w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)\n      w = (1 - w ** 3) ** 3\n      yest = np.zeros(n)\n      delta = np.ones(n)\n      for iteration in range(iter):\n          for i in range(n):\n              weights = delta * w[:, i]\n              b = np.array([np.sum(weights * y), np.sum(weights * y * x)])\n              A = np.array([[np.sum(weights), np.sum(weights * x)],\n                            [np.sum(weights * x), np.sum(weights * x * x)]])\n              beta = linalg.solve(A, b)\n              yest[i] = beta[0] + beta[1] * x[i]\n\n          residuals = y - yest\n          s = np.median(np.abs(residuals))\n          delta = np.clip(residuals / (6.0 * s), -1, 1)\n          delta = (1 - delta ** 2) ** 2\n      df[col+\"_LOWESS\"] = yest\n\n    return df\n\ndf_out = interact.lowess(df.copy(), [\"Open\",\"Volume\"], df[\"Close\"], f=0.25, iter=3); df_out.head()\n```\n\n(ii) Autoregression\n\nAutoregression is a time series model that uses observations from 
previous time steps as input to a regression equation to predict the value at the next time step.\n\n\n```python\nimport numpy as np\nimport pandas as pd\nfrom statsmodels.tsa.ar_model import AR  # deprecated in favour of AutoReg in newer statsmodels releases\nfrom timeit import default_timer as timer\ndef autoregression(df, drop=None, settings={\"autoreg_lag\":4}):\n\n    autoreg_lag = settings[\"autoreg_lag\"]\n    if drop:\n      keep = df[drop]\n      df = df.drop([drop],axis=1)  # stay a DataFrame so df.values works below\n\n    n_channels = df.shape[0]\n    t = timer()\n    channels_regg = np.zeros((n_channels, autoreg_lag + 1))\n    for i in range(0, n_channels):\n        fitted_model = AR(df.values[i, :]).fit(autoreg_lag)\n        # TODO: This is not the same as Matlab's for some reasons!\n        # kk = ARMAResults(fitted_model)\n        # autore_vals, dummy1, dummy2 = arburg(x[i, :], autoreg_lag) # This looks like Matlab's but slow\n        channels_regg[i, 0: len(fitted_model.params)] = np.real(fitted_model.params)\n\n    for i in range(channels_regg.shape[1]):\n      df[\"LAG_\"+str(i+1)] = channels_regg[:,i]\n    \n    if drop:\n      df = pd.concat((keep,df),axis=1)\n\n    t = timer() - t\n    return df\n\ndf_out = interact.autoregression(df.copy()); df_out.head()\n```\n\n####  **(2) Operator**\n\nLooking at the interaction between different features. 
Here the methods employed are multiplication and division.\n\n(i) Multiplication and Division\n\n\n```python\ndef muldiv(df, feature_list):\n  for feat in feature_list:\n    for feat_two in feature_list:\n      if feat==feat_two:\n        continue\n      else:\n       df[feat+\"/\"+feat_two] = df[feat]/(df[feat_two]-df[feat_two].min()) # denominator shifted by its minimum (the minimum row itself yields inf/NaN)\n       df[feat+\"_X_\"+feat_two] = df[feat]*(df[feat_two])\n\n  return df\n\ndf_out = interact.muldiv(df.copy(), [\"Close\",\"Open\"]); df_out.head()\n```\n\n#### **(3) Discretising**\n\nIn statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables to discretized or nominal attributes.\n\n(i) Decision Tree Discretiser\n\nThe first method applied here is a supervised discretiser. Discretisation with Decision Trees consists of using a decision tree to identify the optimal splitting points that would determine the bins or contiguous intervals.\n\n\n```python\nfrom sklearn.tree import DecisionTreeRegressor\n\ndef decision_tree_disc(df, cols, depth=4 ):\n  for col in cols:\n    df[col +\"_m1\"] = df[col].shift(1)\n    df = df.iloc[1:,:]\n    tree_model = DecisionTreeRegressor(max_depth=depth,random_state=0)\n    tree_model.fit(df[col +\"_m1\"].to_frame(), df[col])\n    df[col+\"_Disc\"] = tree_model.predict(df[col +\"_m1\"].to_frame())\n  return df\n\ndf_out = interact.decision_tree_disc(df.copy(), [\"Close\"]); df_out.head()\n```\n\n#### **(4) Normalising**\n\nNormalising normally pertains to the scaling of data. 
There are many methods available; interacting normalising methods make use of all the features' attributes to do the scaling.\n\n(i) Quantile Normalisation\n\nIn statistics, quantile normalization is a technique for making two distributions identical in statistical properties.\n\n\n```python\nimport numpy as np\nimport pandas as pd\n\ndef quantile_normalize(df, drop):\n\n    if drop:\n      keep = df[drop]\n      df = df.drop(drop,axis=1)\n\n    #compute rank\n    dic = {}\n    for col in df:\n      dic.update({col : sorted(df[col])})\n    sorted_df = pd.DataFrame(dic)\n    rank = sorted_df.mean(axis = 1).tolist()\n    #sort\n    for col in df:\n        t = np.searchsorted(np.sort(df[col]), df[col])\n        df[col] = [rank[i] for i in t]\n    \n    if drop:\n      df = pd.concat((keep,df),axis=1)\n    return df\n\ndf_out = interact.quantile_normalize(df.copy(), drop=[\"Close\"]); df_out.head()\n```\n\n#### **(5) Distance**\n\nThere are multiple types of distance functions like Euclidean, Mahalanobis, and Minkowski distance. 
Here we use a contrived example of a location-based haversine distance.\n\n(i) Haversine Distance\n\nThe Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere.\n\n\n```python\nfrom math import sin, cos, sqrt, atan2, radians\ndef haversine_distance(row, lon=\"Open\", lat=\"Close\"):\n    c_lat,c_long = radians(52.5200), radians(13.4050)\n    R = 6373.0\n    # use the column names passed in rather than hard-coded ones\n    long = radians(row[lon])\n    lat = radians(row[lat])\n    \n    dlon = long - c_long\n    dlat = lat - c_lat\n    a = sin(dlat / 2)**2 + cos(lat) * cos(c_lat) * sin(dlon / 2)**2\n    c = 2 * atan2(sqrt(a), sqrt(1 - a))\n    \n    return R * c\n\ndf_out['distance_central'] = df.apply(interact.haversine_distance,axis=1); df_out.head()\n```\n\n#### **(6) Speciality**\n\n(i) Technical Features\n\nTechnical indicators are heuristic or mathematical calculations based on the price, volume, or open interest of a security or contract used by traders who follow technical analysis. 
By analyzing historical data, technical analysts use indicators to predict future price movements.\n\n\n```python\nimport ta\n\ndef tech(df):\n  return ta.add_all_ta_features(df, open=\"Open\", high=\"High\", low=\"Low\", close=\"Close\", volume=\"Volume\")\n  \ndf_out = interact.tech(df.copy()); df_out.head()\n```\n\n#### **(7) Genetic**\n\nGenetic programming has shown promise in constructing features by using original features to form high-level ones that can help algorithms achieve better performance.\n\n(i) Symbolic Transformer\n\n\nA symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship.\n\n\n```python\ndf.head()\n```\n\n\n```python\nfrom gplearn.genetic import SymbolicTransformer\n\ndef genetic_feat(df, num_gen=20, num_comp=10):\n  function_set = ['add', 'sub', 'mul', 'div',\n                  'sqrt', 'log', 'abs', 'neg', 'inv','tan']\n\n  gp = SymbolicTransformer(generations=num_gen, population_size=200,\n                          hall_of_fame=100, n_components=num_comp,\n                          function_set=function_set,\n                          parsimony_coefficient=0.0005,\n                          max_samples=0.9, verbose=1,\n                          random_state=0, n_jobs=6)\n\n  gen_feats = gp.fit_transform(df.drop(\"Close_1\", axis=1), df[\"Close_1\"]); df.iloc[:,:8]\n  gen_feats = pd.DataFrame(gen_feats, columns=[\"gen_\"+str(a) for a in range(gen_feats.shape[1])])\n  gen_feats.index = df.index\n  return pd.concat((df,gen_feats),axis=1)\n\ndf_out = interact.genetic_feat(df.copy()); df_out.head()\n```\n\n\u003ca name=\"mapping\"\u003e\u003c/a\u003e\n\n## **(3) Mapping**\n\nMethods that help with the summarisation of features by remapping them to achieve some aim like the maximisation of variability or class separability. 
These methods tend to be unsupervised, but can also take a supervised form.\n\n#### **(1) Eigen Decomposition**\n\nEigendecomposition or sometimes spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Some examples are LDA and PCA.\n\n(i) Principal Component Analysis\n\nPrincipal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.\n\n\n```python\nimport pandas as pd\nfrom sklearn.decomposition import PCA, KernelPCA, IncrementalPCA\n\ndef pca_feature(df, memory_issues=False,mem_iss_component=False,variance_or_components=0.80,n_components=5 ,drop_cols=None, non_linear=True):\n    \n  if non_linear:\n    pca = KernelPCA(n_components = n_components, kernel='rbf', fit_inverse_transform=True, random_state = 33, remove_zero_eig= True)\n  else:\n    if memory_issues:\n      if not mem_iss_component:\n        raise ValueError(\"If you have memory issues, you have to preselect mem_iss_component\")\n      pca = IncrementalPCA(mem_iss_component)\n    else:\n      if variance_or_components\u003e1:\n        pca = PCA(n_components=variance_or_components) \n      else: # automated selection based on variance\n        pca = PCA(n_components=variance_or_components,svd_solver=\"full\") \n  if drop_cols:\n    X_pca = pca.fit_transform(df.drop(drop_cols,axis=1))\n    return pd.concat((df[drop_cols],pd.DataFrame(X_pca, columns=[\"PCA_\"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)),axis=1)\n\n  else:\n    X_pca = pca.fit_transform(df)\n    return pd.DataFrame(X_pca, columns=[\"PCA_\"+str(i+1) for i in range(X_pca.shape[1])],index=df.index)\n\n\n  return df\n\ndf_out = mapper.pca_feature(df.copy(), variance_or_components=0.9, n_components=8,non_linear=False)\n```\n\n#### **(2) Cross Decomposition**\n\nThese families of algorithms are useful to find linear relations between 
two multivariate datasets.\n\n(i) Canonical Correlation Analysis\n\nCanonical-correlation analysis (CCA) is a way of inferring information from cross-covariance matrices.\n\n\n```python\nfrom sklearn.cross_decomposition import CCA\n\ndef cross_lag(df, drop=None, lags=1, components=4 ):\n\n  if drop:\n    keep = df[drop]\n    df = df.drop([drop],axis=1)\n\n  df_2 = df.shift(lags)\n  df = df.iloc[lags:,:]\n  df_2 = df_2.dropna().reset_index(drop=True)\n\n  cca = CCA(n_components=components)\n  cca.fit(df_2, df)\n\n  X_c, df_2 = cca.transform(df_2, df)\n  df_2 = pd.DataFrame(df_2, index=df.index)\n  df_2 = df_2.add_prefix('crd_')  # prefix the CCA scores, not the original frame\n\n  if drop:\n    df = pd.concat([keep,df,df_2],axis=1)\n  else:\n    df = pd.concat([df,df_2],axis=1)\n  return df\n\ndf_out = mapper.cross_lag(df.copy()); df_out.head()\n```\n\n#### **(3) Kernel Approximation**\n\nFunctions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines.\n\n(i) Additive Chi2 Kernel\n\nComputes the additive chi-squared kernel between observations in X and Y. The chi-squared kernel is computed between each pair of rows in X and Y. 
X and Y have to be non-negative.\n\n\n```python\nfrom sklearn.kernel_approximation import AdditiveChi2Sampler\n\ndef a_chi(df, drop=None, lags=1, sample_steps=2 ):\n\n  if drop:\n    keep = df[drop]\n    df = df.drop([drop],axis=1)\n\n  df_2 = df.shift(lags)\n  df = df.iloc[lags:,:]\n  df_2 = df_2.dropna().reset_index(drop=True)\n\n  chi2sampler = AdditiveChi2Sampler(sample_steps=sample_steps)\n\n  df_2 = chi2sampler.fit_transform(df_2, df[\"Close\"])\n\n  df_2 = pd.DataFrame(df_2, index=df.index)\n  df_2 = df_2.add_prefix('achi_')  # prefix the sampled features, not the original frame\n\n  if drop:\n    df = pd.concat([keep,df,df_2],axis=1)\n  else:\n    df = pd.concat([df,df_2],axis=1)\n  return df\n\ndf_out = mapper.a_chi(df.copy()); df_out.head()\n```\n\n#### **(4) Autoencoder**\n\nAn autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore noise.\n\n(i) Feed Forward\n\nThe simplest form of an autoencoder is a feedforward, non-recurrent neural network similar to single layer perceptrons that participate in multilayer perceptrons.\n\n\n```python\nfrom sklearn.preprocessing import minmax_scale\nimport tensorflow as tf\nimport numpy as np\n\ndef encoder_dataset(df, drop=None, dimesions=20):\n\n  if drop:\n    train_scaled = minmax_scale(df.drop(drop,axis=1).values, axis = 0)\n  else:\n    train_scaled = minmax_scale(df.values, axis = 0)\n\n  # define the number of encoding dimensions\n  encoding_dim = dimesions\n  # define the number of features\n  ncol = train_scaled.shape[1]\n  input_dim = tf.keras.Input(shape = (ncol, ))\n\n  # Encoder Layers\n  encoded1 = tf.keras.layers.Dense(3000, activation = 'relu')(input_dim)\n  encoded2 = tf.keras.layers.Dense(2750, activation = 'relu')(encoded1)\n  encoded3 = tf.keras.layers.Dense(2500, activation = 'relu')(encoded2)\n  encoded4 = tf.keras.layers.Dense(750, 
activation = 'relu')(encoded3)\n  encoded5 = tf.keras.layers.Dense(500, activation = 'relu')(encoded4)\n  encoded6 = tf.keras.layers.Dense(250, activation = 'relu')(encoded5)\n  encoded7 = tf.keras.layers.Dense(encoding_dim, activation = 'relu')(encoded6)\n\n  encoder = tf.keras.Model(inputs = input_dim, outputs = encoded7)\n  encoded_input = tf.keras.Input(shape = (encoding_dim, ))\n\n  encoded_train = pd.DataFrame(encoder.predict(train_scaled),index=df.index)\n  encoded_train = encoded_train.add_prefix('encoded_')\n  if drop:\n    encoded_train = pd.concat((df[drop],encoded_train),axis=1)\n\n  return encoded_train\n\ndf_out = mapper.encoder_dataset(df.copy(), [\"Close_1\"], 15); df_out.head()\n```\n\n\n```python\ndf_out.head()\n```\n\n#### **(5) Manifold Learning**\n\nManifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. \n\n(i) Local Linear Embedding\n\nLocally Linear Embedding is a method of non-linear dimensionality reduction. It tries to reduce these n-Dimensions while trying to preserve the geometric features of the original non-linear feature structure.\n\n\n```python\nfrom sklearn.manifold import LocallyLinearEmbedding\n\ndef lle_feat(df, drop=None, components=4):\n\n  if drop:\n    keep = df[drop]\n    df = df.drop(drop, axis=1)\n\n  embedding = LocallyLinearEmbedding(n_components=components)\n  em = embedding.fit_transform(df)\n  df = pd.DataFrame(em,index=df.index)\n  df = df.add_prefix('lle_')\n  if drop:\n    df = pd.concat((keep,df),axis=1)\n  return df\n\ndf_out = mapper.lle_feat(df.copy(),[\"Close_1\"],4); df_out.head()\n\n```\n\n#### **(6) Clustering**\n\nMost clustering techniques start with a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together with some measure. 
Although these clustering techniques are typically used for observations, they can also be used for feature dimensionality reduction, especially hierarchical clustering techniques. \n\n(i) Feature Agglomeration\n\nFeature agglomeration uses clustering to group together features that look very similar, thus decreasing the number of features.\n\n\n```python\nimport numpy as np\nfrom sklearn import datasets, cluster\n\ndef feature_agg(df, drop=None, components=4):\n\n  if drop:\n    keep = df[drop]\n    df = df.drop(drop, axis=1)\n\n  components = min(df.shape[1]-1,components)\n  agglo = cluster.FeatureAgglomeration(n_clusters=components)\n  agglo.fit(df)\n  df = pd.DataFrame(agglo.transform(df),index=df.index)\n  df = df.add_prefix('feagg_')\n\n  if drop:\n    return pd.concat((keep,df),axis=1)\n  else:\n    return df\n\n\ndf_out = mapper.feature_agg(df.copy(),[\"Close_1\"],4 ); df_out.head()\n```\n\n#### **(7) Neighbouring**\n\nNeighbouring points can be calculated using distance metrics like Hamming, Manhattan, and Minkowski distance. 
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.\n\n(i) Nearest Neighbours\n\nUnsupervised learner for implementing neighbor searches.\n\n\n```python\nfrom sklearn.neighbors import NearestNeighbors\n\ndef neigh_feat(df, drop, neighbors=6):\n  \n  if drop:\n    keep = df[drop]\n    df = df.drop(drop, axis=1)\n\n  components = min(df.shape[0]-1,neighbors)\n  neigh = NearestNeighbors(n_neighbors=neighbors)\n  neigh.fit(df)\n  neigh = neigh.kneighbors()[0]\n  df = pd.DataFrame(neigh, index=df.index)\n  df = df.add_prefix('neigh_')\n\n  if drop:\n    return pd.concat((keep,df),axis=1)\n  else:\n    return df\n\n  return df\n\ndf_out = mapper.neigh_feat(df.copy(),[\"Close_1\"],4 ); df_out.head()\n\n```\n\n\u003ca name=\"extraction\"\u003e\u003c/a\u003e\n\n## **(4) Extraction**\n\nWhen working with extraction, you have to decide the size of the time series history to take into account when calculating a collection of walk-forward feature values. To facilitate our extraction, we use an excellent package called TSfresh, and also some of its default features. For completeness, we also include 12 or so custom features to be added to the extraction pipeline.\n\nThe *time series* methods in the transformation section and the interaction section are similar to the methods we will uncover in the extraction section; however, for transformation and interaction methods the output is an entirely new time series, whereas extraction methods take as input multiple constructed time series and extract a singular value from each time series to reconstruct an entirely new time series. 
\n\nSome methods naturally fit better in one format over another, e.g., lags are too expensive for extraction; time series decomposition only has to be performed once, because it has a low level of 'leakage' so is better suited to transformation; and forecast methods attempt to predict multiple future training samples, so won't work with extraction that only delivers one value per time series. Furthermore, all non time-series (cross-sectional) transformation and interaction techniques cannot make use of extraction, as it is solely a time-series method. \n\nLastly, when we want to double apply specific functions we can apply them as a transformation/interaction; then all the extraction methods can be applied to this feature as well. For example, if we calculate a smoothing function (transformation) then all other extraction functions (median, entropy, linearity etc.) can now be applied to that smoothing function, including the application of the smoothing function itself, e.g., a double smooth, double lag, double filter etc. So separating these methods out gives us great flexibility.\n\nDecorator\n\n\n```python\ndef set_property(key, value):\n    \"\"\"\n    This method returns a decorator that sets the property key of the function to value\n    \"\"\"\n    def decorate_func(func):\n        setattr(func, key, value)\n        if func.__doc__ and key == \"fctype\":\n            func.__doc__ = func.__doc__ + \"\\n\\n    *This function is of type: \" + value + \"*\\n\"\n        return func\n    return decorate_func\n```\n\n#### **(1) Energy**\n\nYou can calculate the linear, non-linear and absolute energy of a time series. In signal processing, the energy $E_S$ of a continuous-time signal $x(t)$ is defined as the area under the squared magnitude of the considered signal. 
Mathematically, $E_{S}=\\langle x(t), x(t)\\rangle=\\int_{-\\infty}^{\\infty}|x(t)|^{2} d t$\n\n(i) Absolute Energy\n\nReturns the absolute energy of the time series, which is the sum over the squared values.\n\n\n```python\n#-\u003e In Package\ndef abs_energy(x):\n\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    return np.dot(x, x)\n\nextract.abs_energy(df[\"Close\"])\n```\n\n#### **(2) Distance**\n\nHere we widely define distance measures as those that take a difference between attributes or series of datapoints.\n\n(i) Complexity-Invariant Distance\n\nThis function calculates an estimate of the complexity of a time series.\n\n\n```python\n#-\u003e In Package\ndef cid_ce(x, normalize):\n\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    if normalize:\n        s = np.std(x)\n        if s!=0:\n            x = (x - np.mean(x))/s\n        else:\n            return 0.0\n\n    x = np.diff(x)\n    return np.sqrt(np.dot(x, x))\n\nextract.cid_ce(df[\"Close\"], True)\n```\n\n#### **(3) Differencing**\n\nMany alternatives to differencing exist; one can for example take the difference of every other value, take the squared difference, take the fractional difference, or, like our example, take the mean absolute difference.\n\n(i) Mean Absolute Change\n\nReturns the mean over the absolute differences between subsequent time series values.\n\n\n```python\n#-\u003e In Package\ndef mean_abs_change(x):\n    return np.mean(np.abs(np.diff(x)))\n\nextract.mean_abs_change(df[\"Close\"])\n```\n\n#### **(4) Derivative**\n\nFeatures where the emphasis is on the rate of change. 
\n\n(i) Mean Central Second Derivative\n\nReturns the mean value of a central approximation of the second derivative.\n\n\n```python\n#-\u003e In Package\ndef _roll(a, shift):\n    if not isinstance(a, np.ndarray):\n        a = np.asarray(a)\n    idx = shift % len(a)\n    return np.concatenate([a[-idx:], a[:-idx]])\n\ndef mean_second_derivative_central(x):\n\n    diff = (_roll(x, 1) - 2 * np.array(x) + _roll(x, -1)) / 2.0\n    return np.mean(diff[1:-1])\n\nextract.mean_second_derivative_central(df[\"Close\"])\n```\n\n#### **(5) Volatility**\n\nVolatility is a statistical measure of the dispersion of a time-series.\n\n(i) Variance Larger than Standard Deviation\n\n\n```python\n#-\u003e In Package\ndef variance_larger_than_standard_deviation(x):\n\n    y = np.var(x)\n    return y \u003e np.sqrt(y)\n\nextract.variance_larger_than_standard_deviation(df[\"Close\"])\n```\n\n(ii) Variability Index\n\nVariability Index is a way to measure how smooth or 'variable' a time series is.\n\n\n```python\nvar_index_param = {\"Volume\":df[\"Volume\"].values, \"Open\": df[\"Open\"].values}\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef var_index(time,param=var_index_param):\n    final = []\n    keys = []\n    for key, magnitude in param.items():\n      w = 1.0 / np.power(np.subtract(time[1:], time[:-1]), 2)\n      w_mean = np.mean(w)\n\n      N = len(time)\n      sigma2 = np.var(magnitude)\n\n      S1 = sum(w * (magnitude[1:] - magnitude[:-1]) ** 2)\n      S2 = sum(w)\n\n      eta_e = (w_mean * np.power(time[N - 1] -\n                time[0], 2) * S1 / (sigma2 * S2 * N ** 2))\n      final.append(eta_e)\n      keys.append(key)\n    return {\"Interact__{}\".format(k): eta_e for eta_e, k in zip(final,keys) }\n\nextract.var_index(df[\"Close\"].values,var_index_param)\n```\n\n#### **(6) Shape**\n\nFeatures that emphasise a particular shape not ordinarily considered as a distribution statistic. 
This extends to derivations of the original time series too, for example a feature looking at the sinusoidal shape of an autocorrelation plot.\n\n(i) Symmetrical\n\nBoolean variable denoting if the distribution of x looks symmetric.\n\n\n```python\n#-\u003e In Package\ndef symmetry_looking(x, param=[{\"r\": 0.2}]):\n\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    mean_median_difference = np.abs(np.mean(x) - np.median(x))\n    max_min_difference = np.max(x) - np.min(x)\n    return [(\"r_{}\".format(r[\"r\"]), mean_median_difference \u003c (r[\"r\"] * max_min_difference))\n            for r in param]\n            \nextract.symmetry_looking(df[\"Close\"])\n```\n\n#### **(7) Occurrence**\n\nLooking at the occurrence and reoccurrence of defined values.\n\n(i) Has Duplicate Max\n\n\n```python\n#-\u003e In Package\ndef has_duplicate_max(x):\n    \"\"\"\n    Checks if the maximum value of x is observed more than once\n\n    :param x: the time series to calculate the feature of\n    :type x: numpy.ndarray\n    :return: the value of this feature\n    :return type: bool\n    \"\"\"\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    return np.sum(x == np.max(x)) \u003e= 2\n\nextract.has_duplicate_max(df[\"Close\"])\n```\n\n#### **(8) Autocorrelation**\n\nAutocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. 
\n\n(i) Partial Autocorrelation\n\nPartial autocorrelation is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.\n\n\n```python\n#-\u003e In Package\nfrom statsmodels.tsa.stattools import acf, adfuller, pacf\n\ndef partial_autocorrelation(x, param=[{\"lag\": 1}]):\n\n    # Check the difference between demanded lags by param and possible lags to calculate (depends on len(x))\n    max_demanded_lag = max([lag[\"lag\"] for lag in param])\n    n = len(x)\n\n    # Check if list is too short to make calculations\n    if n \u003c= 1:\n        pacf_coeffs = [np.nan] * (max_demanded_lag + 1)\n    else:\n        if (n \u003c= max_demanded_lag):\n            max_lag = n - 1\n        else:\n            max_lag = max_demanded_lag\n        pacf_coeffs = list(pacf(x, method=\"ld\", nlags=max_lag))\n        pacf_coeffs = pacf_coeffs + [np.nan] * max(0, (max_demanded_lag - max_lag))\n\n    return [(\"lag_{}\".format(lag[\"lag\"]), pacf_coeffs[lag[\"lag\"]]) for lag in param]\n\nextract.partial_autocorrelation(df[\"Close\"])\n```\n\n#### **(9) Stochasticity**\n\nStochastic refers to a randomly determined process. Any features trying to capture stochasticity by degree or type are included under this branch.\n\n(i) Augmented Dickey Fuller\n\nThe Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.\n\n\n```python\n#-\u003e In Package\ndef augmented_dickey_fuller(x, param=[{\"attr\": \"teststat\"}]):\n\n    res = None\n    try:\n        res = adfuller(x)\n    except LinAlgError:\n        res = np.NaN, np.NaN, np.NaN\n    except ValueError: # occurs if sample size is too small\n        res = np.NaN, np.NaN, np.NaN\n    except MissingDataError: # is thrown for e.g. 
inf or nan in the data\n        res = np.NaN, np.NaN, np.NaN\n\n    return [('attr_\"{}\"'.format(config[\"attr\"]),\n                  res[0] if config[\"attr\"] == \"teststat\"\n             else res[1] if config[\"attr\"] == \"pvalue\"\n             else res[2] if config[\"attr\"] == \"usedlag\" else np.NaN)\n            for config in param]\n\nextract.augmented_dickey_fuller(df[\"Close\"])\n```\n\n#### **(10) Averages**\n\n(i) Median of Magnitudes Skew\n\n\n```python\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef gskew(x):\n    interpolation=\"nearest\"\n    median_mag = np.median(x)\n    F_3_value = np.percentile(x, 3, interpolation=interpolation)\n    F_97_value = np.percentile(x, 97, interpolation=interpolation)\n\n    skew = (np.median(x[x \u003c= F_3_value]) +\n            np.median(x[x \u003e= F_97_value]) - 2 * median_mag)\n\n    return skew\n\nextract.gskew(df[\"Close\"])\n```\n\n(ii) Stetson Mean\n\nAn iteratively weighted mean used in the Stetson variability index\n\n\n```python\nstestson_param = {\"weight\":100., \"alpha\":2., \"beta\":2., \"tol\":1.e-6, \"nmax\":20}\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef stetson_mean(x, param=stestson_param):\n    \n    weight= stestson_param[\"weight\"]\n    alpha= stestson_param[\"alpha\"]\n    beta = stestson_param[\"beta\"]\n    tol= stestson_param[\"tol\"]\n    nmax= stestson_param[\"nmax\"]\n    \n    \n    mu = np.median(x)\n    for i in range(nmax):\n        resid = x - mu\n        resid_err = np.abs(resid) * np.sqrt(weight)\n        weight1 = weight / (1. 
+ (resid_err / alpha)**beta)\n        weight1 /= weight1.mean()\n        diff = np.mean(x * weight1) - mu\n        mu += diff\n        if (np.abs(diff) \u003c tol*np.abs(mu) or np.abs(diff) \u003c tol):\n            break\n\n    return mu\n\nextract.stetson_mean(df[\"Close\"])\n```\n\n#### **(11) Size**\n\n(i) Length\n\n\n```python\n#-\u003e In Package\ndef length(x):\n    return len(x)\n    \nextract.length(df[\"Close\"])\n```\n\n#### **(12) Count**\n\n(i) Count Above Mean\n\nReturns the number of values in x that are higher than the mean of x.\n\n\n```python\n#-\u003e In Package\ndef count_above_mean(x):\n    m = np.mean(x)\n    return np.where(x \u003e m)[0].size\n\nextract.count_above_mean(df[\"Close\"])\n```\n\n#### **(13) Streaks**\n\n(i) Longest Strike Below Mean\n\nReturns the length of the longest consecutive subsequence in x that is smaller than the mean of x.\n\n\n```python\n#-\u003e In Package\nimport itertools\ndef get_length_sequences_where(x):\n\n    if len(x) == 0:\n        return [0]\n    else:\n        res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1]\n        return res if len(res) \u003e 0 else [0]\n\ndef longest_strike_below_mean(x):\n\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    return np.max(get_length_sequences_where(x \u003c= np.mean(x))) if x.size \u003e 0 else 0\n\nextract.longest_strike_below_mean(df[\"Close\"])\n```\n\n(ii) Wozniak\n\nThis is an astronomical feature: we count the number of runs of three consecutive data points that are brighter or fainter than $2σ$ and normalize the number by $N−2$.\n\n\n```python\nwoz_param = [{\"consecutiveStar\": n} for n in [2, 4]]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef wozniak(magnitude, param=woz_param):\n\n    iters = []\n    for consecutiveStar in [stars[\"consecutiveStar\"] for stars in param]:\n      N = len(magnitude)\n      if N \u003c consecutiveStar:\n          return 0\n      sigma = 
np.std(magnitude)\n      m = np.mean(magnitude)\n      count = 0\n\n      for i in range(N - consecutiveStar + 1):\n          flag = 0\n          for j in range(consecutiveStar):\n              if(magnitude[i + j] \u003e m + 2 * sigma or\n                  magnitude[i + j] \u003c m - 2 * sigma):\n                  flag = 1\n              else:\n                  flag = 0\n                  break\n          if flag:\n              count = count + 1\n      iters.append(count * 1.0 / (N - consecutiveStar + 1))\n\n    return [(\"consecutiveStar_{}\".format(config[\"consecutiveStar\"]), iters[en] )  for en, config in enumerate(param)]\n\nextract.wozniak(df[\"Close\"])\n```\n\n#### **(14) Location**\n\n(i) Last Location of Maximum\n\nReturns the relative last location of the maximum value of x.\n\n\n```python\n#-\u003e In Package\ndef last_location_of_maximum(x):\n\n    x = np.asarray(x)\n    return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) \u003e 0 else np.NaN\n\nextract.last_location_of_maximum(df[\"Close\"])\n```\n\n#### **(15) Model Coefficients**\n\nAny coefficients that are obtained from a model and that might help in the prediction problem. 
For example, here we might include the coefficients of the polynomial $h(x)$ that has been fitted to the deterministic dynamics of a Langevin model.\n\n(i) FFT Coefficient\n\nCalculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input.\n\n\n```python\n#-\u003e In Package\ndef fft_coefficient(x, param = [{\"coeff\": 10, \"attr\": \"real\"}]):\n\n    assert min([config[\"coeff\"] for config in param]) \u003e= 0, \"Coefficients must be positive or zero.\"\n    assert set([config[\"attr\"] for config in param]) \u003c= set([\"imag\", \"real\", \"abs\", \"angle\"]), \\\n        'Attribute must be \"real\", \"imag\", \"angle\" or \"abs\"'\n\n    fft = np.fft.rfft(x)\n\n    def complex_agg(x, agg):\n        if agg == \"real\":\n            return x.real\n        elif agg == \"imag\":\n            return x.imag\n        elif agg == \"abs\":\n            return np.abs(x)\n        elif agg == \"angle\":\n            return np.angle(x, deg=True)\n\n    res = [complex_agg(fft[config[\"coeff\"]], config[\"attr\"]) if config[\"coeff\"] \u003c len(fft)\n           else np.NaN for config in param]\n    # pair each requested coefficient with its own value\n    index = [('coeff_{}__attr_\"{}\"'.format(config[\"coeff\"], config[\"attr\"]), value) for config, value in zip(param, res)]\n    return index\n\nextract.fft_coefficient(df[\"Close\"])\n```\n\n(ii) AR Coefficient\n\nThis feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.\n\n\n```python\n#-\u003e In Package\nfrom statsmodels.tsa.ar_model import AR\n\ndef ar_coefficient(x, param=[{\"coeff\": 5, \"k\": 5}]):\n\n    calculated_ar_params = {}\n\n    x_as_list = list(x)\n    calculated_AR = AR(x_as_list)\n\n    res = {}\n\n    for parameter_combination in param:\n        k = parameter_combination[\"k\"]\n        p = parameter_combination[\"coeff\"]\n\n        column_name = \"k_{}__coeff_{}\".format(k, p)\n\n        if k not in calculated_ar_params:\n            try:\n                calculated_ar_params[k] = 
calculated_AR.fit(maxlag=k, solver=\"mle\").params\n            except (LinAlgError, ValueError):\n                calculated_ar_params[k] = [np.NaN]*k\n\n        mod = calculated_ar_params[k]\n\n        if p \u003c= k:\n            try:\n                res[column_name] = mod[p]\n            except IndexError:\n                res[column_name] = 0\n        else:\n            res[column_name] = np.NaN\n\n    return [(key, value) for key, value in res.items()]\n\nextract.ar_coefficient(df[\"Close\"])\n```\n\n#### **(16) Quantiles**\n\nThis includes finding normal quantile values in the series, but also quantile derived measures like change quantiles and index max quantiles. \n\n(i) Index Mass Quantile\n\nThe relative index $i$ where $q\\%$ of the mass of the time series $x$ lie left of $i$\n.\n\n\n```python\n#-\u003e In Package\ndef index_mass_quantile(x, param=[{\"q\": 0.3}]):\n\n    x = np.asarray(x)\n    abs_x = np.abs(x)\n    s = sum(abs_x)\n\n    if s == 0:\n        # all values in x are zero or it has length 0\n        return [(\"q_{}\".format(config[\"q\"]), np.NaN) for config in param]\n    else:\n        # at least one value is not zero\n        mass_centralized = np.cumsum(abs_x) / s\n        return [(\"q_{}\".format(config[\"q\"]), (np.argmax(mass_centralized \u003e= config[\"q\"])+1)/len(x)) for config in param]\n\nextract.index_mass_quantile(df[\"Close\"])\n```\n\n#### **(17) Peaks**\n\n(i) Number of CWT Peaks\n\nThis feature calculator searches for different peaks in x.\n\n\n```python\nfrom scipy.signal import cwt, find_peaks_cwt, ricker, welch\n\ncwt_param = [ka for ka in [2,6,9]]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef number_cwt_peaks(x, param=cwt_param):\n\n    return [(\"CWTPeak_{}\".format(n), len(find_peaks_cwt(vector=x, widths=np.array(list(range(1, n + 1))), wavelet=ricker))) for n in param]\n\nextract.number_cwt_peaks(df[\"Close\"])\n```\n\n#### **(18) Density**\n\nThe density, and more specifically 
the power spectral density of the signal describes the power present in the signal as a function of frequency, per unit frequency.\n\n(i) Cross Power Spectral Density\n\nThis feature calculator estimates the cross power spectral density of the time series $x$ at different frequencies.\n\n\n```python\n#-\u003e In Package\ndef spkt_welch_density(x, param=[{\"coeff\": 5}]):\n    freq, pxx = welch(x, nperseg=min(len(x), 256))\n    coeff = [config[\"coeff\"] for config in param]\n    indices = [\"coeff_{}\".format(i) for i in coeff]\n\n    if len(pxx) \u003c= np.max(coeff):  # There are fewer data points in the time series than requested coefficients\n\n        # filter coefficients that are not contained in pxx\n        reduced_coeff = [coefficient for coefficient in coeff if len(pxx) \u003e coefficient]\n        not_calculated_coefficients = [coefficient for coefficient in coeff\n                                       if coefficient not in reduced_coeff]\n\n        # Fill up the rest of the requested coefficients with np.NaNs\n        return zip(indices, list(pxx[reduced_coeff]) + [np.NaN] * len(not_calculated_coefficients))\n    else:\n        return pxx[coeff].ravel()[0]\n\nextract.spkt_welch_density(df[\"Close\"])\n```\n\n#### **(19) Linearity**\n\nAny measure of linearity that might make use of something like the linear least-squares regression for the values of the time series. 
The regression can be run against the sequence from 0 to the length of the time series minus one, among many other alternatives.\n\n(i) Linear Trend Time Wise\n\nCalculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one.\n\n\n```python\nfrom scipy.stats import linregress\n\n#-\u003e In Package\ndef linear_trend_timewise(x, param= [{\"attr\": \"pvalue\"}]):\n\n    ix = x.index\n\n    # Get differences between each timestamp and the first timestamp in seconds.\n    # Then convert to hours and reshape for linear regression\n    times_seconds = (ix - ix[0]).total_seconds()\n    times_hours = np.asarray(times_seconds / float(3600))\n\n    linReg = linregress(times_hours, x.values)\n\n    return [(\"attr_\\\"{}\\\"\".format(config[\"attr\"]), getattr(linReg, config[\"attr\"]))\n            for config in param]\n\nextract.linear_trend_timewise(df[\"Close\"])\n```\n\n#### **(20) Non-Linearity**\n\n(i) Schreiber Non-Linearity\n\n\n```python\n#-\u003e In Package\ndef c3(x, lag=3):\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    n = x.size\n    if 2 * lag \u003e= n:\n        return 0\n    else:\n        return np.mean((_roll(x, 2 * -lag) * _roll(x, -lag) * x)[0:(n - 2 * lag)])\n\nextract.c3(df[\"Close\"])\n```\n\n#### **(21) Entropy**\n\nAny feature looking at the complexity of a time series. This is typically used in medical signal disciplines (EEG, EMG). There are multiple types of measures like spectral entropy, permutation entropy, sample entropy, approximate entropy, Lempel-Ziv complexity and others. 
This includes entropy measures and their derivations.\n\n(i) Binned Entropy\n\nBins the values of x into max_bins equidistant bins and returns the entropy of the resulting bin probabilities.\n\n\n```python\n#-\u003e In Package\ndef binned_entropy(x, max_bins=10):\n    if not isinstance(x, (np.ndarray, pd.Series)):\n        x = np.asarray(x)\n    hist, bin_edges = np.histogram(x, bins=max_bins)\n    probs = hist / x.size\n    probs = probs[probs != 0]  # drop empty bins, since log(0) is undefined\n    return -np.sum(probs * np.log(probs))\n\nextract.binned_entropy(df[\"Close\"])\n```\n\n(ii) SVD Entropy\n\nSVD entropy is an indicator of the number of eigenvectors that are needed for an adequate explanation of the data set.\n\n\n```python\nsvd_param = [{\"Tau\": ta, \"DE\": de}\n                      for ta in [4] \n                      for de in [3,6]]\n                      \ndef _embed_seq(X,Tau,D):\n  N =len(X)\n  if D * Tau \u003e N:\n      print(\"Cannot build such a matrix, because D * Tau \u003e N\")\n      exit()\n  if Tau\u003c1:\n      print(\"Tau has to be at least 1\")\n      exit()\n  Y= np.zeros((N - (D - 1) * Tau, D))\n\n  for i in range(0, N - (D - 1) * Tau):\n      for j in range(0, D):\n          Y[i][j] = X[i + j * Tau]\n  return Y                     \n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef svd_entropy(epochs, param=svd_param):\n    axis=0\n    \n    final = []\n    for par in param:\n\n      def svd_entropy_1d(X, Tau, DE):\n          Y = _embed_seq(X, Tau, DE)\n          W = np.linalg.svd(Y, compute_uv=0)\n          W /= sum(W)  # normalize singular values\n          return -1 * np.sum(W * np.log(W))\n\n      Tau = par[\"Tau\"]\n      DE = par[\"DE\"]\n\n      final.append(np.apply_along_axis(svd_entropy_1d, axis, epochs, Tau, DE).ravel()[0])\n\n\n    return [(\"Tau_\\\"{}\\\"__De_{}\\\"\".format(par[\"Tau\"], par[\"DE\"]), final[en]) for en, par in enumerate(param)]\n\nextract.svd_entropy(df[\"Close\"].values)\n```\n\n(iii) Hjorth\n\nThe Complexity parameter represents the change in frequency. 
The parameter compares the signal's similarity to a pure sine wave, where the value converges to 1 if the signal is more similar. \n\n\n```python\ndef _hjorth_mobility(epochs):\n    diff = np.diff(epochs, axis=0)\n    sigma0 = np.std(epochs, axis=0)\n    sigma1 = np.std(diff, axis=0)\n    return np.divide(sigma1, sigma0)\n\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef hjorth_complexity(epochs):\n    diff1 = np.diff(epochs, axis=0)\n    diff2 = np.diff(diff1, axis=0)\n    sigma1 = np.std(diff1, axis=0)\n    sigma2 = np.std(diff2, axis=0)\n    return np.divide(np.divide(sigma2, sigma1), _hjorth_mobility(epochs))\n\nextract.hjorth_complexity(df[\"Close\"])\n```\n\n#### **(22) Fixed Points**\n\nFixed points and equilibria as identified from fitted models.\n\n(i) Langevin Fixed Points\n\nLargest fixed point of dynamics $max\\ {h(x)=0}$ estimated from polynomial $h(x)$ which has been fitted to the deterministic dynamics of Langevin model\n\n\n```python\n#-\u003e In Package\ndef _estimate_friedrich_coefficients(x, m, r):\n    assert m \u003e 0, \"Order of polynomial need to be positive integer, found {}\".format(m)\n    df = pd.DataFrame({'signal': x[:-1], 'delta': np.diff(x)})\n    try:\n        df['quantiles'] = pd.qcut(df.signal, r)\n    except ValueError:\n        return [np.NaN] * (m + 1)\n\n    quantiles = df.groupby('quantiles')\n\n    result = pd.DataFrame({'x_mean': quantiles.signal.mean(), 'y_mean': quantiles.delta.mean()})\n    result.dropna(inplace=True)\n\n    try:\n        return np.polyfit(result.x_mean, result.y_mean, deg=m)\n    except (np.linalg.LinAlgError, ValueError):\n        return [np.NaN] * (m + 1)\n\n\ndef max_langevin_fixed_point(x, r=3, m=30):\n    coeff = _estimate_friedrich_coefficients(x, m, r)\n\n    try:\n        max_fixed_point = np.max(np.real(np.roots(coeff)))\n    except (np.linalg.LinAlgError, ValueError):\n        return np.nan\n\n    return 
max_fixed_point\n\nextract.max_langevin_fixed_point(df[\"Close\"])\n```\n\n#### **(23) Amplitude**\n\nFeatures derived from peaked values in either the positive or negative direction. \n\n(i) Willison Amplitude\n\nThis feature is defined as the amount of times that the change in the signal amplitude exceeds a threshold.\n\n\n```python\nwill_param = [ka for ka in [0.2,3]]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef willison_amplitude(X, param=will_param):\n  return [(\"Thresh_{}\".format(n),np.sum(np.abs(np.diff(X)) \u003e= n)) for n in param]\n\nextract.willison_amplitude(df[\"Close\"])\n```\n\n(ii) Percent Amplitude\n\nReturns the largest distance from the median value, measured\n    as a percentage of the median\n\n\n```python\nperc_param = [{\"base\":ba, \"exponent\":exp} for ba in [3,5] for exp in [-0.1,-0.2]]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef percent_amplitude(x, param =perc_param):\n    final = []\n    for par in param:\n      linear_scale_data = par[\"base\"] ** (par[\"exponent\"] * x)\n      y_max = np.max(linear_scale_data)\n      y_min = np.min(linear_scale_data)\n      y_med = np.median(linear_scale_data)\n      final.append(max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med)))\n\n    return [(\"Base_{}__Exp{}\".format(pa[\"base\"],pa[\"exponent\"]),fin) for fin, pa in zip(final,param)]\n\nextract.percent_amplitude(df[\"Close\"])\n```\n\n#### **(24) Probability**\n\n(i) Cadence Probability\n\nGiven the observed distribution of time lags cads, compute the probability that the next observation occurs within time minutes of an arbitrary epoch.\n\n\n```python\n#-\u003e fixes required\nimport scipy.stats as stats\n\ncad_param = [0.1,1000, -234]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef cad_prob(cads, param=cad_param):\n    return [(\"time_{}\".format(time), stats.percentileofscore(cads, float(time) / (24.0 * 60.0)) / 100.0) for 
time in param]\n    \nextract.cad_prob(df[\"Close\"])\n```\n\n#### **(25) Crossings**\n\nCalculates the crossing of the series with other defined values or series.\n\n(i) Zero Crossing Derivative\n\nThe positioning of the edge point is located at the zero crossing of the first derivative of the filter.\n\n\n```python\nzero_param = [0.01, 8]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef zero_crossing_derivative(epochs, param=zero_param):\n    diff = np.diff(epochs)\n    norm = diff-diff.mean()\n    return [(\"e_{}\".format(e), np.apply_along_axis(lambda epoch: np.sum(((epoch[:-5] \u003c= e) \u0026 (epoch[5:] \u003e e))), 0, norm).ravel()[0]) for e in param]\n\nextract.zero_crossing_derivative(df[\"Close\"])\n```\n\n#### **(26) Fluctuations**\n\nThese features are again from medical signal sciences, but under this category we would include values such as fluctuation based entropy measures, fluctuation of correlation dynamics, and co-fluctuations.\n\n(i) Detrended Fluctuation Analysis (DFA)\n\nDFA Calculate the Hurst exponent using DFA analysis.\n\n\n```python\nfrom scipy.stats import kurtosis as _kurt\nfrom scipy.stats import skew as _skew\nimport numpy as np\n\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef detrended_fluctuation_analysis(epochs):\n    def dfa_1d(X, Ave=None, L=None):\n        X = np.array(X)\n\n        if Ave is None:\n            Ave = np.mean(X)\n\n        Y = np.cumsum(X)\n        Y -= Ave\n\n        if L is None:\n            L = np.floor(len(X) * 1 / (\n                    2 ** np.array(list(range(1, int(np.log2(len(X))) - 4))))\n                            )\n            \n        F = np.zeros(len(L))  # F(n) of different given box length n\n\n        for i in range(0, len(L)):\n            n = int(L[i])  # for each box length L[i]\n            if n == 0:\n                print(\"time series is too short while the box length is too big\")\n                print(\"abort\")\n        
        exit()\n            for j in range(0, len(X), n):  # for each box\n                if j + n \u003c len(X):\n                    c = list(range(j, j + n))\n                    # coordinates of time in the box\n                    c = np.vstack([c, np.ones(n)]).T\n                    # the value of data in the box\n                    y = Y[j:j + n]\n                    # add residue in this box\n                    F[i] += np.linalg.lstsq(c, y, rcond=None)[1]\n            F[i] /= ((len(X) / n) * n)\n        F = np.sqrt(F)\n\n        stacked = np.vstack([np.log(L), np.ones(len(L))])\n        stacked_t = stacked.T\n        Alpha = np.linalg.lstsq(stacked_t, np.log(F), rcond=None)\n\n        return Alpha[0][0]\n\n    return np.apply_along_axis(dfa_1d, 0, epochs).ravel()[0]\n\nextract.detrended_fluctuation_analysis(df[\"Close\"])\n```\n\n#### **(27) Information**\n\nClosely related to entropy and complexity measures. Any measure that attempts to measure the amount of information from an observable variable is included here.\n\n(i) Fisher Information\n\nFisher information is a statistical information concept distinct from, and earlier than, Shannon information in communication theory.\n\n\n```python\ndef _embed_seq(X, Tau, D):\n\n    shape = (X.size - Tau * (D - 1), D)\n    strides = (X.itemsize, Tau * X.itemsize)\n    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)\n\nfisher_param = [{\"Tau\":ta, \"DE\":de} for ta in [3,15] for de in [10,5]]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef fisher_information(epochs, param=fisher_param):\n    def fisher_info_1d(a, tau, de):\n        # taken from pyeeg improvements\n\n        mat = _embed_seq(a, tau, de)\n        W = np.linalg.svd(mat, compute_uv=False)\n        W /= sum(W)  # normalize singular values\n        FI_v = (W[1:] - W[:-1]) ** 2 / W[:-1]\n        return np.sum(FI_v)\n\n    return [(\"Tau_{}__DE_{}\".format(par[\"Tau\"], 
par[\"DE\"]),np.apply_along_axis(fisher_info_1d, 0, epochs, par[\"Tau\"], par[\"DE\"]).ravel()[0]) for par in param]\n\nextract.fisher_information(df[\"Close\"])\n```\n\n#### **(28) Fractals**\n\nIn mathematics, more specifically in fractal geometry, a fractal dimension is a ratio providing a statistical index of complexity comparing how detail in a pattern (strictly speaking, a fractal pattern) changes with the scale at which it is measured.\n\n(i) Higuchi Fractal\n\nCompute a Higuchi Fractal Dimension of a time series.\n\n\n```python\nhig_para = [{\"Kmax\": 3},{\"Kmax\": 5}]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef higuchi_fractal_dimension(epochs, param=hig_para):\n    def hfd_1d(X, Kmax):\n        \n        L = []\n        x = []\n        N = len(X)\n        for k in range(1, Kmax):\n            Lk = []\n            for m in range(0, k):\n                Lmk = 0\n                for i in range(1, int(np.floor((N - m) / k))):\n                    Lmk += abs(X[m + i * k] - X[m + i * k - k])\n                Lmk = Lmk * (N - 1) / np.floor((N - m) / float(k)) / k\n                Lk.append(Lmk)\n            L.append(np.log(np.mean(Lk)))\n            x.append([np.log(float(1) / k), 1])\n\n        (p, r1, r2, s) = np.linalg.lstsq(x, L, rcond=None)\n        return p[0]\n    \n    return [(\"Kmax_{}\".format(config[\"Kmax\"]), np.apply_along_axis(hfd_1d, 0, epochs, config[\"Kmax\"]).ravel()[0] ) for  config in param]\n    \nextract.higuchi_fractal_dimension(df[\"Close\"])\n```\n\n(ii) Petrosian Fractal\n\nCompute a Petrosian Fractal Dimension of a time series.\n\n\n```python\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef petrosian_fractal_dimension(epochs):\n    def pfd_1d(X, D=None):\n        # taken from pyeeg\n        \"\"\"Compute Petrosian Fractal Dimension of a time series from either two\n        cases below:\n            1. X, the time series of type list (default)\n            2. 
D, the first order differential sequence of X (if D is provided,\n               recommended to speed up)\n        In case 1, D is computed using Numpy's difference function.\n        To speed up, it is recommended to compute D before calling this function\n        because D may also be used by other functions whereas computing it here\n        again will slow down.\n        \"\"\"\n        if D is None:\n            D = np.diff(X)\n            D = D.tolist()\n        N_delta = 0  # number of sign changes in derivative of the signal\n        for i in range(1, len(D)):\n            if D[i] * D[i - 1] \u003c 0:\n                N_delta += 1\n        n = len(X)\n        # PFD = log10(n) / (log10(n) + log10(n / (n + 0.4 * N_delta)))\n        return np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * N_delta)))\n    return np.apply_along_axis(pfd_1d, 0, epochs).ravel()[0]\n\nextract.petrosian_fractal_dimension(df[\"Close\"])\n```\n\n#### **(29) Exponent**\n\n(i) Hurst Exponent\n\nThe Hurst exponent is used as a measure of long-term memory of time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.\n\n\n```python\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef hurst_exponent(epochs):\n    def hurst_1d(X):\n\n        X = np.array(X)\n        N = X.size\n        T = np.arange(1, N + 1)\n        Y = np.cumsum(X)\n        Ave_T = Y / T\n\n        S_T = np.zeros(N)\n        R_T = np.zeros(N)\n        for i in range(N):\n            S_T[i] = np.std(X[:i + 1])\n            X_T = Y - T * Ave_T[i]\n            R_T[i] = np.ptp(X_T[:i + 1])\n\n        for i in range(1, len(S_T)):\n            if np.diff(S_T)[i - 1] != 0:\n                break\n        for j in range(1, len(R_T)):\n            if np.diff(R_T)[j - 1] != 0:\n                break\n        k = max(i, j)\n        assert k \u003c 10, \"rethink it!\"\n\n        R_S = R_T[k:] / S_T[k:]\n        R_S = np.log(R_S)\n\n        n = np.log(T)[k:]\n        A = np.column_stack((n, 
np.ones(n.size)))\n        [m, c] = np.linalg.lstsq(A, R_S, rcond=None)[0]\n        H = m\n        return H\n    return np.apply_along_axis(hurst_1d, 0, epochs).ravel()[0]\n\nextract.hurst_exponent(df[\"Close\"])\n```\n\n(ii) Largest Lyauponov Exponent\n\nIn mathematics the Lyapunov exponent or Lyapunov characteristic exponent of a dynamical system is a quantity that characterizes the rate of separation of infinitesimally close trajectories.\n\n\n```python\ndef _embed_seq(X, Tau, D):\n    shape = (X.size - Tau * (D - 1), D)\n    strides = (X.itemsize, Tau * X.itemsize)\n    return np.lib.stride_tricks.as_strided(X, shape=shape, strides=strides)\n\nlyaup_param = [{\"Tau\":4, \"n\":3, \"T\":10, \"fs\":9},{\"Tau\":8, \"n\":7, \"T\":15, \"fs\":6}]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef largest_lyauponov_exponent(epochs, param=lyaup_param):\n    def LLE_1d(x, tau, n, T, fs):\n\n        Em = _embed_seq(x, tau, n)\n        M = len(Em)\n        A = np.tile(Em, (len(Em), 1, 1))\n        B = np.transpose(A, [1, 0, 2])\n        square_dists = (A - B) ** 2  # square_dists[i,j,k] = (Em[i][k]-Em[j][k])^2\n        D = np.sqrt(square_dists[:, :, :].sum(axis=2))  # D[i,j] = ||Em[i]-Em[j]||_2\n\n        # Exclude elements within T of the diagonal\n        band = np.tri(D.shape[0], k=T) - np.tri(D.shape[0], k=-T - 1)\n        band[band == 1] = np.inf\n        neighbors = (D + band).argmin(axis=0)  # nearest neighbors more than T steps away\n\n        # in_bounds[i,j] = (i+j \u003c= M-1 and i+neighbors[j] \u003c= M-1)\n        inc = np.tile(np.arange(M), (M, 1))\n        row_inds = (np.tile(np.arange(M), (M, 1)).T + inc)\n        col_inds = (np.tile(neighbors, (M, 1)) + inc.T)\n        in_bounds = np.logical_and(row_inds \u003c= M - 1, col_inds \u003c= M - 1)\n        # Uncomment for old (miscounted) version\n        # in_bounds = numpy.logical_and(row_inds \u003c M - 1, col_inds \u003c M - 1)\n        row_inds[~in_bounds] = 0\n        
col_inds[~in_bounds] = 0\n\n        # neighbor_dists[i,j] = ||Em[i+j]-Em[i+neighbors[j]]||_2\n        neighbor_dists = np.ma.MaskedArray(D[row_inds, col_inds], ~in_bounds)\n        J = (~neighbor_dists.mask).sum(axis=1)  # number of in-bounds indices by row\n        # Set invalid (zero) values to 1; log(1) = 0 so sum is unchanged\n\n        neighbor_dists[neighbor_dists == 0] = 1\n\n        # !!! this fixes the divide by zero in log error !!!\n        neighbor_dists.data[neighbor_dists.data == 0] = 1\n\n        d_ij = np.sum(np.log(neighbor_dists.data), axis=1)\n        mean_d = d_ij[J \u003e 0] / J[J \u003e 0]\n\n        x = np.arange(len(mean_d))\n        X = np.vstack((x, np.ones(len(mean_d)))).T\n        [m, c] = np.linalg.lstsq(X, mean_d, rcond=None)[0]\n        Lexp = fs * m\n        return Lexp\n\n    return [(\"Tau_{}__n_{}__T_{}__fs_{}\".format(par[\"Tau\"], par[\"n\"], par[\"T\"], par[\"fs\"]), np.apply_along_axis(LLE_1d, 0, epochs, par[\"Tau\"], par[\"n\"], par[\"T\"], par[\"fs\"]).ravel()[0]) for par in param]\n  \nextract.largest_lyauponov_exponent(df[\"Close\"])\n```\n\n#### **(30) Spectral Analysis**\n\nSpectral analysis is analysis in terms of a spectrum of frequencies or related quantities such as energies, eigenvalues, etc.\n\n(i) Welch Method\n\nWelch's method is an approach for spectral density estimation. 
It is used in physics, engineering, and applied mathematics for estimating the power of a signal at different frequencies.\n\n\n```python\nfrom scipy import signal, integrate\n\nwhelch_param = [100,200]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef whelch_method(data, param=whelch_param):\n\n  final = []\n  for Fs in param:\n    f, pxx = signal.welch(data, fs=Fs, nperseg=1024)\n    d = {'psd': pxx, 'freqs': f}\n    df = pd.DataFrame(data=d)\n    dfs = df.sort_values(['psd'], ascending=False)\n    rows = dfs.iloc[:10]\n    final.append(rows['freqs'].mean())\n  \n  return [(\"Fs_{}\".format(pa),fin) for pa, fin in zip(param,final)]\n\nextract.whelch_method(df[\"Close\"])\n```\n\n\n```python\n#-\u003e Basically same as above\nfreq_param = [{\"fs\":50, \"sel\":15},{\"fs\":200, \"sel\":20}]\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef find_freq(serie, param=freq_param):\n\n    final = []\n    for par in param:\n      fft0 = np.fft.rfft(serie*np.hanning(len(serie)))\n      freqs = np.fft.rfftfreq(len(serie), d=1.0/par[\"fs\"])\n      fftmod = np.array([np.sqrt(fft0[i].real**2 + fft0[i].imag**2) for i in range(0, len(fft0))])\n      d = {'fft': fftmod, 'freq': freqs}\n      df = pd.DataFrame(d)\n      hop = df.sort_values(['fft'], ascending=False)\n      rows = hop.iloc[:par[\"sel\"]]\n      final.append(rows['freq'].mean())\n\n    return [(\"Fs_{}__sel{}\".format(pa[\"fs\"],pa[\"sel\"]),fin) for pa, fin in zip(param,final)]\n\nextract.find_freq(df[\"Close\"])\n```\n\n#### **(31) Percentile**\n\n(i) Flux Percentile\n\nFlux (or radiant flux) is the total amount of energy that crosses a unit area per unit time. Flux is an astronomical value, measured in joules per square metre per second (joules/m2/s), or watts per square metre. 
Here we provide the ratio of flux percentiles: the 40th-60th percentile range divided by the 5th-95th percentile range.\n\n\n```python\n#-\u003e In Package\n\nimport math\ndef flux_perc(magnitude):\n    sorted_data = np.sort(magnitude)\n    lc_length = len(sorted_data)\n\n    F_60_index = int(math.ceil(0.60 * lc_length))\n    F_40_index = int(math.ceil(0.40 * lc_length))\n    F_5_index = int(math.ceil(0.05 * lc_length))\n    F_95_index = int(math.ceil(0.95 * lc_length))\n\n    F_40_60 = sorted_data[F_60_index] - sorted_data[F_40_index]\n    F_5_95 = sorted_data[F_95_index] - sorted_data[F_5_index]\n    F_mid20 = F_40_60 / F_5_95\n\n    return {\"FluxPercentileRatioMid20\": F_mid20}\n\nextract.flux_perc(df[\"Close\"])\n```\n\n#### **(32) Range**\n\n(i) Range of Cumulative Sum\n\n\n```python\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef range_cum_s(magnitude):\n    sigma = np.std(magnitude)\n    N = len(magnitude)\n    m = np.mean(magnitude)\n    s = np.cumsum(magnitude - m) * 1.0 / (N * sigma)\n    R = np.max(s) - np.min(s)\n    return {\"Rcs\": R}\n\nextract.range_cum_s(df[\"Close\"])\n```\n\n#### **(33) Structural**\n\nStructural features are potential placeholders for future research.\n\n(i) Structure Function\n\nThe structure function of rotation measures (RMs) contains information on electron density and magnetic field fluctuations when used in astronomy. 
It becomes a custom feature when used with your own unique time series data.\n\n\n```python\nfrom scipy.interpolate import interp1d\n\nstruct_param = {\"Volume\":df[\"Volume\"].values, \"Open\": df[\"Open\"].values}\n\n@set_property(\"fctype\", \"combiner\")\n@set_property(\"custom\", True)\ndef structure_func(time, param=struct_param):\n\n      dict_final = {}\n      for key, magnitude in param.items():\n        dict_final[key] = []\n        Nsf, Np = 100, 100\n        sf1, sf2, sf3 = np.zeros(Nsf), np.zeros(Nsf), np.zeros(Nsf)\n        f = interp1d(time, magnitude)\n\n        time_int = np.linspace(np.min(time), np.max(time), Np)\n        mag_int = f(time_int)\n\n        for tau in np.arange(1, Nsf):\n            sf1[tau - 1] = np.mean(\n                np.power(np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 1.0))\n            sf2[tau - 1] = np.mean(\n                np.abs(np.power(\n                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 2.0)))\n            sf3[tau - 1] = np.mean(\n                np.abs(np.power(\n                    np.abs(mag_int[0:Np - tau] - mag_int[tau:Np]), 3.0)))\n        sf1_log = np.log10(np.trim_zeros(sf1))\n        sf2_log = np.log10(np.trim_zeros(sf2))\n        sf3_log = np.log10(np.trim_zeros(sf3))\n\n        if len(sf1_log) and len(sf2_log):\n            m_21, b_21 = np.polyfit(sf1_log, sf2_log, 1)\n        else:\n\n            m_21 = np.nan\n\n        if len(sf1_log) and len(sf3_log):\n            m_31, b_31 = np.polyfit(sf1_log, sf3_log, 1)\n        else:\n\n            m_31 = np.nan\n\n        if len(sf2_log) and len(sf3_log):\n            m_32, b_32 = np.polyfit(sf2_log, sf3_log, 1)\n        else:\n\n            m_32 = np.nan\n        dict_final[key].append(m_21)\n        dict_final[key].append(m_31)\n        dict_final[key].append(m_32)\n\n      return [(\"StructureFunction_{}__m_{}\".format(key, name), li)  for key, lis in dict_final.items() for name, li in zip([21,31,32], lis)]\n\nstruct_param = 
{\"Volume\":df[\"Volume\"].values, \"Open\": df[\"Open\"].values}\n\nextract.structure_func(df[\"Close\"],struct_param)\n```\n\n#### **(34) Distribution**\n\n(i) Kurtosis\n\n\n```python\n#-\u003e In Package\ndef kurtosis(x):\n\n    if not isinstance(x, pd.Series):\n        x = pd.Series(x)\n    return pd.Series.kurtosis(x)\n\nextract.kurtosis(df[\"Close\"])\n```\n\n(ii) Stetson Kurtosis\n\n\n```python\n@set_property(\"fctype\", \"simple\")\n@set_property(\"custom\", True)\ndef stetson_k(x):\n    \"\"\"A robust kurtosis statistic.\"\"\"\n    n = len(x)\n    x0 = stetson_mean(x, 1./20**2)\n    delta_x = np.sqrt(n / (n - 1.)) * (x - x0) / 20\n    ta = 1. / 0.798 * np.mean(np.abs(delta_x)) / np.sqrt(np.mean(delta_x**2))\n    return ta\n  \nextract.stetson_k(df[\"Close\"])\n```\n\n## **(5) Synthesise**\n\nTime-series synthesisation (TSS) happens before the feature extraction step, and cross-sectional synthesisation (CSS) happens after it. For now I only include a CSS package; in the future, I plan to develop this section further. This area still has a lot of performance and stability issues. 
In the future it might be a more viable candidate to improve prediction.\n\n\n```python\nfrom lightgbm import LGBMRegressor\nfrom sklearn.metrics import mean_squared_error\n\ndef model(df_final):\n  model = LGBMRegressor()\n  # The first 40% of rows form the test set; the remaining rows (NaNs dropped) form the training set\n  test =  df_final.head(int(len(df_final)*0.4))\n  train = df_final[~df_final.isin(test)].dropna()\n  model = model.fit(train.drop([\"Close_1\"],axis=1),train[\"Close_1\"])\n  preds = model.predict(test.drop([\"Close_1\"],axis=1))\n  val = mean_squared_error(test[\"Close_1\"],preds)\n  return val\n```\n\n\n```python\npip install ctgan\n```\n\n\n```python\nfrom ctgan import CTGANSynthesizer\n\n#discrete_columns = [\"\"]\nctgan = CTGANSynthesizer()\nctgan.fit(df,epochs=10) #15\n```\n\nRandom Benchmark\n\n\n```python\nnp.random.seed(1)\ndf_in = df.copy()\ndf_in[\"Close_1\"] = np.random.permutation(df_in[\"Close_1\"].values)\nmodel(df_in)\n```\n\nGenerated Performance\n\n\n```python\ndf_gen = ctgan.sample(len(df_in)*100)\nmodel(df_gen)\n```\n\nAs expected, a cross-sectional technique does not work well on time-series data; other methods will be investigated in the future.\n\n\u003ca name=\"example\"\u003e\u003c/a\u003e\n\n## **(6) Skeleton Example**\n\nHere I will apply tabular augmentation methods to a small dataset with single-digit features and around 250 instances. This is not necessarily the best-sized dataset to highlight the performance of tabular augmentation, as some methods, like extraction, would be overkill and lead to dimensionality problems. It is also good to know that there is a close to infinite number of ways to perform these augmentation methods. In the future, automated augmentation methods can guide the experiment process. 
\n\nThe approach taken in this skeleton is to develop running models that are tested after each augmentation to highlight which methods might work well on this particular dataset. The metric we will use is mean squared error. In this implementation we do not have special hold-out sets.\n\nThe above framework of implementation will be consulted, but you still have to be strategic about when you apply which function, and you have to make sure that you are processing your data with appropriate techniques (drop null values, fill null values) at the appropriate time.\n\n#### **Validation**\n\nDevelop Model and Define Metric\n\n\n```python\nfrom lightgbm import LGBMRegressor\nfrom sklearn.metrics import mean_squared_error\n\ndef model(df_final):\n  model = LGBMRegressor()\n  # The first 40% of rows form the test set; the remaining rows (NaNs dropped) form the training set\n  test =  df_final.head(int(len(df_final)*0.4))\n  train = df_final[~df_final.isin(test)].dropna()\n  model = model.fit(train.drop([\"Close_1\"],axis=1),train[\"Close_1\"])\n  preds = model.predict(test.drop([\"Close_1\"],axis=1))\n  val = mean_squared_error(test[\"Close_1\"],preds)\n  return val\n```\n\nReload Data\n\n\n```python\ndf = data_copy()\n```\n\n\n```python\nmodel(df)\n```\n\n    302.61676570345287\n\n\n\n**(1) (7) (i) Transformation - Decomposition - Naive**\n\n\n```python\n## If Inferred Seasonality is Too Large Default to Five\nseasons = transform.infer_seasonality(df[\"Close\"],index=0) \ndf_out = transform.naive_dec(df.copy(), [\"Close\",\"Open\"], freq=5)\nmodel(df_out) #improvement\n```\n\n\n\n\n    274.34477082783525\n\n\n\n**(1) (8) (i) Transformation - Filter - Baxter-King-Bandpass**\n\n\n```python\ndf_out = transform.bkb(df_out, [\"Close\",\"Low\"])\ndf_best = df_out.copy()\nmodel(df_out) #improvement\n```\n\n\n\n\n    267.1826850968307\n\n\n\n**(1) (3) (i) Transformation - Differentiation - 
Fractional**\n\n\n```python\ndf_out = transform.fast_fracdiff(df_out, [\"Close_BPF\"],0.5)\nmodel(df_out) #null\n```\n\n\n\n\n    267.7083192402742\n\n\n\n**(1) (1) (i) Transformation - Scaling - Robust Scaler**\n\n\n```python\ndf_out = df_out.dropna()\ndf_out = transform.robust_scaler(df_out, drop=[\"Close_1\"])\nmodel(df_out) #noisy\n```\n\n\n\n\n    270.96980399571214\n\n\n\n**(2) (2) (i) Interactions - Operator - Multiplication/Division**\n\n\n```python\ndf_out.head()\n```\n\n\n\n\n\n\u003ctable class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eClose_1\u003c/th\u003e\n      \u003cth\u003eHigh\u003c/th\u003e\n      \u003cth\u003eLow\u003c/th\u003e\n      \u003cth\u003eOpen\u003c/th\u003e\n      \u003cth\u003eClose\u003c/th\u003e\n      \u003cth\u003eVolume\u003c/th\u003e\n      \u003cth\u003eAdj Close\u003c/th\u003e\n      \u003cth\u003eClose_NDDT\u003c/th\u003e\n      \u003cth\u003eClose_NDDS\u003c/th\u003e\n      \u003cth\u003eClose_NDDR\u003c/th\u003e\n      \u003cth\u003eOpen_NDDT\u003c/th\u003e\n      \u003cth\u003eOpen_NDDS\u003c/th\u003e\n      \u003cth\u003eOpen_NDDR\u003c/th\u003e\n      \u003cth\u003eClose_BPF\u003c/th\u003e\n      \u003cth\u003eLow_BPF\u003c/th\u003e\n      \u003cth\u003eClose_BPF_frac\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eDate\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n      
\u003cth\u003e\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2019-01-08\u003c/th\u003e\n      \u003ctd\u003e338.529999\u003c/td\u003e\n      \u003ctd\u003e1.018413\u003c/td\u003e\n      \u003ctd\u003e0.964048\u003c/td\u003e\n      \u003ctd\u003e1.096600\u003c/td\u003e\n      \u003ctd\u003e1.001175\u003c/td\u003e\n      \u003ctd\u003e-0.162616\u003c/td\u003e\n      \u003ctd\u003e1.001175\u003c/td\u003e\n      \u003ctd\u003e0.832297\u003c/td\u003e\n      \u003ctd\u003e0.834964\u003c/td\u003e\n      \u003ctd\u003e1.335433\u003c/td\u003e\n      \u003ctd\u003e0.758743\u003c/td\u003e\n      \u003ctd\u003e0.691596\u003c/td\u003e\n      \u003ctd\u003e2.259884\u003c/td\u003e\n      \u003ctd\u003e-2.534142\u003c/td\u003e\n      \u003ctd\u003e-2.249135\u003c/td\u003e\n      \u003ctd\u003e-3.593612\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2019-01-09\u003c/th\u003e\n      \u003ctd\u003e344.970001\u003c/td\u003e\n      \u003ctd\u003e1.012068\u003c/td\u003e\n      \u003ctd\u003e1.023302\u003c/td\u003e\n      \u003ctd\u003e1.011466\u003c/td\u003e\n      \u003ctd\u003e1.042689\u003c/td\u003e\n      \u003ctd\u003e-0.501798\u003c/td\u003e\n      \u003ctd\u003e1.042689\u003c/td\u003e\n      \u003ctd\u003e0.908963\u003c/td\u003e\n      \u003ctd\u003e-0.165036\u003c/td\u003e\n      \u003ctd\u003e1.111346\u003c/td\u003e\n      \u003ctd\u003e0.835786\u003c/td\u003e\n      \u003ctd\u003e0.333361\u003c/td\u003e\n      \u003ctd\u003e1.129783\u003c/td\u003e\n      \u003ctd\u003e-3.081959\u003c/td\u003e\n      \u003ctd\u003e-2.776302\u003c/td\u003e\n      \u003ctd\u003e-2.523465\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2019-01-10\u003c/th\u003e\n      \u003ctd\u003e347.260010\u003c/td\u003e\n      \u003ctd\u003e1.035581\u003c/td\u003e\n      \u003ctd\u003e1.027563\u003c/td\u003e\n      \u003ctd\u003e0.996969\u003c/td\u003e\n      
\u003ctd\u003e1.126762\u003c/td\u003e\n      \u003ctd\u003e-0.367576\u003c/td\u003e\n      \u003ctd\u003e1.126762\u003c/td\u003e\n      \u003ctd\u003e1.029347\u003c/td\u003e\n      \u003ctd\u003e2.120026\u003c/td\u003e\n      \u003ctd\u003e0.853697\u003c/td\u003e\n      \u003ctd\u003e0.907588\u003c/td\u003e\n      \u003ctd\u003e0.000000\u003c/td\u003e\n      \u003ctd\u003e0.533777\u003c/td\u003e\n      \u003ctd\u003e-2.052768\u003c/td\u003e\n      \u003ctd\u003e-2.543449\u003c/td\u003e\n      \u003ctd\u003e-0.747382\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2019-01-11\u003c/th\u003e\n      \u003ctd\u003e334.399994\u003c/td\u003e\n      \u003ctd\u003e1.073153\u003c/td\u003e\n      \u003ctd\u003e1.120506\u003c/td\u003e\n      \u003ctd\u003e1.098313\u003c/td\u003e\n      \u003ctd\u003e1.156658\u003c/td\u003e\n      \u003ctd\u003e-0.586571\u003c/td\u003e\n      \u003ctd\u003e1.156658\u003c/td\u003e\n      \u003ctd\u003e1.109144\u003c/td\u003e\n      \u003ctd\u003e-5.156051\u003c/td\u003e\n      \u003ctd\u003e0.591990\u003c/td\u003e\n      \u003ctd\u003e1.002162\u003c/td\u003e\n      \u003ctd\u003e-0.666639\u003c/td\u003e\n      \u003ctd\u003e0.608516\u003c/td\u003e\n      \u003ctd\u003e-0.694642\u003c/td\u003e\n      \u003ctd\u003e-0.831670\u003c/td\u003e\n      \u003ctd\u003e0.414063\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2019-01-14\u003c/th\u003e\n      \u003ctd\u003e344.429993\u003c/td\u003e\n      \u003ctd\u003e0.999627\u003c/td\u003e\n      \u003ctd\u003e1.056991\u003c/td\u003e\n      \u003ctd\u003e1.102135\u003c/td\u003e\n      \u003ctd\u003e0.988773\u003c/td\u003e\n      \u003ctd\u003e-0.541752\u003c/td\u003e\n      \u003ctd\u003e0.988773\u003c/td\u003e\n      \u003ctd\u003e1.107633\u003c/td\u003e\n      \u003ctd\u003e0.000000\u003c/td\u003e\n      \u003ctd\u003e-0.660350\u003c/td\u003e\n      \u003ctd\u003e1.056302\u003c/td\u003e\n      \u003ctd\u003e-0.915491\u003c/td\u003e\n      
\u003ctd\u003e0.263025\u003c/td\u003e\n      \u003ctd\u003e-0.645590\u003c/td\u003e\n      \u003ctd\u003e-0.116166\u003c/td\u003e\n      \u003ctd\u003e-0.118012\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\n\n\n```python\ndf_out = interact.muldiv(df_out, [\"Close\",\"Open_NDDS\",\"Low_BPF\"]) \nmodel(df_out) #noisy\n```\n\n\n\n\n    285.6420643864313\n\n\n\n\n```python\ndf_r = df_out.copy()\n```\n\n**(2) (6) (i) Interactions - Speciality - Technical**\n\n\n```python\nimport ta\ndf = interact.tech(df)\ndf_out = pd.merge(df_out,  df.iloc[:,7:], left_index=True, right_index=True, how=\"left\")\n```\n\n**Clean Dataframe and Metric**\n\n\n```python\n\"\"\"Drop columns where the share of missing values is above a threshold\"\"\"\ndf_out = df_out.dropna(thresh = len(df_out)*0.95, axis = \"columns\") \ndf_out = df_out.dropna()\ndf_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)\nclose = df_out[\"Close\"].copy()\ndf_d = df_out.copy()\nmodel(df_out) #improve\n```\n\n\n\n\n    592.52971755184\n\n\n\n**(3) (1) (i) Mapping - Eigen Decomposition - PCA**\n\n\n\n```python\nfrom sklearn.decomposition import PCA, IncrementalPCA, KernelPCA\n\ndf_out = transform.robust_scaler(df_out, drop=[\"Close_1\"])\n```\n\n\n```python\ndf_out = df_out.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)\ndf_out = mapper.pca_feature(df_out, drop_cols=[\"Close_1\"], variance_or_components=0.9, n_components=8,non_linear=False)\n```\n\n\n```python\nmodel(df_out) #noisy but not too bad given the 10 fold dimensionality reduction\n```\n\n\n\n\n    687.158330455884\n\n\n\n**(4) Extracting**\n\nFirst, I show the functions that have been added to the DeltaPy fork of tsfresh. You have to add your own personal adjustments based on the features you would like to construct. I am using self-developed features, but you can also use TSFresh's community functions.\n\n*The following files have been appropriately amended (get in contact for advice)*\n1. 
https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/settings.py\n1. https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/feature_calculators.py\n1. https://github.com/firmai/tsfresh/blob/master/tsfresh/feature_extraction/extraction.py\n\n**(4) (10) (i) Extracting - Averages - GSkew**\n\n\n```python\nextract.gskew(df_out[\"PCA_1\"])\n```\n\n\n\n\n    -0.7903067336449059\n\n\n\n**(4) (21) (ii) Extracting - Entropy - SVD Entropy**\n\n\n```python\nsvd_param = [{\"Tau\": ta, \"DE\": de}\n                      for ta in [4] \n                      for de in [3,6]]\n\nextract.svd_entropy(df_out[\"PCA_1\"],svd_param)\n```\n\n\n\n\n    [('Tau_\"4\"__De_3\"', 0.7234823323374294),\n     ('Tau_\"4\"__De_6\"', 1.3014347840145244)]\n\n\n\n**(4) (13) (ii) Extracting - Streaks - Wozniak**\n\n\n```python\nwoz_param = [{\"consecutiveStar\": n} for n in [2, 4]]\n\nextract.wozniak(df_out[\"PCA_1\"],woz_param)\n```\n\n\n\n\n    [('consecutiveStar_2', 0.012658227848101266), ('consecutiveStar_4', 0.0)]\n\n\n\n**(4) (28) (i) Extracting - Fractal - Higuchi**\n\n\n```python\nhig_param = [{\"Kmax\": 3},{\"Kmax\": 5}]\n\nextract.higuchi_fractal_dimension(df_out[\"PCA_1\"],hig_param)\n```\n\n\n\n\n    [('Kmax_3', 0.577913816027104), ('Kmax_5', 0.8176960510304725)]\n\n\n\n**(4) (5) (ii) Extracting - Volatility - Variability Index**\n\n\n```python\nvar_index_param = {\"Volume\":df[\"Volume\"].values, \"Open\": df[\"Open\"].values}\n\nextract.var_index(df[\"Close\"].values,var_index_param)\n```\n\n\n\n\n    {'Interact__Open': 0.00396022538846289,\n     'Interact__Volume': 0.20550155114176533}\n\n\n\n**Time Series Extraction**\n\n\n```python\npip install git+https://github.com/firmai/tsfresh.git\n```\n\n```python\n#Construct the preferred input dataframe.\nfrom tsfresh.utilities.dataframe_functions import roll_time_series\ndf_out[\"ID\"] = 0\nperiods = 30\ndf_out = df_out.reset_index()\ndf_ts = roll_time_series(df_out,\"ID\",\"Date\",None,1,periods)\ncounts = 
df_ts['ID'].value_counts()\ndf_ts = df_ts[df_ts['ID'].isin(counts[counts \u003e periods].index)]\n```\n\n\n```python\n#Perform extraction\nfrom tsfresh.feature_extraction import extract_features, CustomFCParameters\nsettings_dict = CustomFCParameters()\nsettings_dict[\"var_index\"] = {\"PCA_1\":None, \"PCA_2\": None}\ndf_feat = extract_features(df_ts.drop([\"Close_1\"],axis=1),default_fc_parameters=settings_dict,column_id=\"ID\",column_sort=\"Date\")\n```\n\n    Feature Extraction: 100%|██████████| 5/5 [00:10\u003c00:00,  2.14s/it]\n\n\n\n```python\n# Cleaning operations\nimport pandasvault as pv\ndf_feat2 = df_feat.copy()\ndf_feat = df_feat.dropna(thresh = len(df_feat)*0.50, axis = \"columns\")\ndf_feat_cons = pv.constant_feature_detect(data=df_feat,threshold=0.9)\ndf_feat = df_feat.drop(df_feat_cons, axis=1)\ndf_feat = df_feat.ffill()\ndf_feat = pd.merge(df_feat,df[[\"Close_1\"]],left_index=True,right_index=True,how=\"left\")\nprint(df_feat.shape)\nmodel(df_feat) #noisy\n```\n\n    7  variables are found to be almost constant\n    (208, 48)\n    2064.7813982935995\n\n\n\n\n```python\nfrom tsfresh import select_features\nfrom tsfresh.utilities.dataframe_functions import impute\n\nimpute(df_feat)\ndf_feat_2 = select_features(df_feat.drop([\"Close_1\"],axis=1),df_feat[\"Close_1\"],fdr_level=0.05)\ndf_feat_2[\"Close_1\"] = df_feat[\"Close_1\"]\nmodel(df_feat_2) #improvement (but not an augmentation method)\n```\n\n\n\n\n    1577.5273071299482\n\n\n\n**(3) (6) (i) Feature Agglomeration; \u0026nbsp; (1)(2)(i) Standard Scaler.**\n\nAs in this step, after (1), (2), (3), (4) and (5), you can often circle back to the initial steps to normalise the data and reduce its dimensionality for the final model. 
\n\n\n```python\nimport numpy as np\nfrom sklearn import datasets, cluster\n\ndef feature_agg(df, drop, components):\n  components = min(df.shape[1]-1,components)\n  agglo = cluster.FeatureAgglomeration(n_clusters=components,)\n  df = df.drop(drop,axis=1)\n  agglo.fit(df)\n  df = pd.DataFrame(agglo.transform(df))\n  df = df.add_prefix('fe_agg_')\n\n  return df\n\ndf_final = transform.standard_scaler(df_feat_2, drop=[\"Close_1\"])\ndf_final = mapper.feature_agg(df_final,[\"Close_1\"],4)\ndf_final.index = df_feat.index\ndf_final[\"Close_1\"] = df_feat[\"Close_1\"]\nmodel(df_final) #noisy\n```\n\n\n\n\n    1949.89085894338\n\n\n\n**Final Model** After Applying 13 Arbitrary Augmentation Techniques\n\n\n```python\nmodel(df_final) #improvement\n```\n\n\n\n\n    1949.89085894338\n\n\n\n**Original Model** Before Augmentation\n\n\n```python\ndf_org = df.iloc[:,:7][df.index.isin(df_final.index)]\nmodel(df_org)\n```\n\n\n\n\n    389.783990984133\n\n\n\n**Best Model** After Developing 8 Augmenting Features\n\n\n```python\ndf_best = df_best.replace([np.inf, -np.inf], np.nan).ffill().fillna(0)\nmodel(df_best)\n```\n\n\n\n\n    267.1826850968307\n\n\n\n**Commentary**\n\nThere are countless ways in which the current model can be improved. This can become an automated process in which all techniques are tested against a hold-out set. For example, we can perform the operation below, and even though it improves the score here, there is a need for more robust tests. The skeleton example above is not meant to highlight the performance of the package. It simply serves as an example of how one can go about applying augmentation methods. 
\n\nQuite naturally, this example suffers from dimensionality issues, with array shapes reaching ```(208, 48)```. Furthermore, you would need a sample that is at least 50-100 times larger before machine learning methods start to make sense.\n\nNonetheless, in this example, *Transformation, Interactions* and *Mappings* (applied to extraction output) performed fairly well. *Extraction* augmentation was overkill, but created a reasonable model when dimensionally reduced. A better selection from the 50+ augmentation methods, and a better order of augmentation, could further improve the outcome if robustly tested against development sets.\n\n\n[[1]](https://colab.research.google.com/drive/1tstO4fja9wRWjkPgjxRYr9MEvYXTx7fA) DeltaPy Development \n","funding_links":[],"categories":["📦 Packages","Jupyter Notebook"],"sub_categories":["Python"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffirmai%2Fdeltapy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffirmai%2Fdeltapy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffirmai%2Fdeltapy/lists"}