{"id":19410867,"url":"https://github.com/d3group/dddex","last_synced_at":"2025-04-24T10:33:05.355Z","repository":{"id":61357486,"uuid":"550105787","full_name":"d3group/dddex","owner":"d3group","description":"The package 'data-driven density estimation x' (dddex) turns any standard point forecasting model into an estimator of the underlying conditional density ","archived":false,"fork":false,"pushed_at":"2024-10-09T23:19:36.000Z","size":45369,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-10-12T13:55:56.618Z","etag":null,"topics":["data-science","density-estimation","operations-research"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/d3group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-10-12T08:03:10.000Z","updated_at":"2024-10-09T23:19:40.000Z","dependencies_parsed_at":"2023-12-03T16:35:43.892Z","dependency_job_id":null,"html_url":"https://github.com/d3group/dddex","commit_stats":{"total_commits":136,"total_committers":1,"mean_commits":136.0,"dds":0.0,"last_synced_commit":"ecc109e1de3b4e2ed317dbab8c1951e0a5601e5d"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d3group%2Fdddex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d3group%2Fdddex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d3group%2Fdddex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/d3group%2Fdddex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/d3group","download_url":"https://codeload.github.com/d3group/dddex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223949048,"owners_count":17230227,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","density-estimation","operations-research"],"created_at":"2024-11-10T12:18:21.881Z","updated_at":"2024-11-10T12:18:22.722Z","avatar_url":"https://github.com/d3group.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"dddex: Data-Driven Density Estimation x\n================\n\n\u003c!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! --\u003e\n\n## Install\n\n``` sh\npip install dddex\n```\n\n## What is dddex?\n\nThe package name `dddex` stands for *Data-Driven Density Estimation x*.\nNew approaches are being implemented for estimating conditional\ndensities without any parametric assumption about the underlying\ndistribution. All those approaches take an arbitrary point forecaster as\ninput and turn it into a new object that outputs an estimation of the\nconditional density based on the point predictions of the original point\nforecaster. The *x* in the name emphasizes that the approaches can be\napplied to any point forecaster. 
In this package several approaches are\nimplemented via the following classes:\n\n- [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\n- [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn)\n- [`LevelSetKDEx_NN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_nn)\n- [`LevelSetKDEx_multivariate`](https://kaiguender.github.io/dddex/levelsetkdex_multivariate.html#levelsetkdex_multivariate)\n\nIn the following we are going to work exclusively with the class\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nbecause the most important methods are essentially the same across all\nclasses. All\nmodels can be run easily with only a few lines of code and are designed\nto be compatible with the well-known *Scikit-Learn* framework.\n\n## How to use: LevelSetKDEx\n\nTo ensure compatibility with Scikit-Learn, the class\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nimplements the usual `fit` and `predict` methods. As the purpose of these classes\nis to compute estimations of conditional densities, the `predict` method\noutputs p-quantiles rather than point forecasts.\n\nOur choice of class names is meant to indicate the\nunderlying models: The name *LevelSet* stems from the fact that the\nmethods operate under the assumption that the values of point\nforecasts generated by the same point forecaster can be interpreted as a\nsimilarity measure between samples. 
*KDE* is short for *Kernel Density\nEstimator* and the *x* yet again signals that the classes can be\ninitialized with any point forecasting model.\n\nIn the following, we demonstrate how to use the class\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nto compute estimations of the conditional densities and quantiles for\nthe [Yaz Data\nSet](https://opimwue.github.io/ddop/modules/auto_generated/ddop.datasets.load_yaz.html#ddop.datasets.load_yaz).\nAs explained above,\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nis always based on a point forecaster that is specified by the\nuser. In our example we use the well-known `LGBMRegressor` from LightGBM as the\nunderlying point predictor.\n\n``` python\nfrom dddex.levelSetKDEx_univariate import LevelSetKDEx, LevelSetKDEx_kNN, LevelSetKDEx_NN\nfrom dddex.levelSetKDEx_multivariate import LevelSetKDEx_multivariate\n\nfrom dddex.loadData import loadDataYaz\nfrom lightgbm import LGBMRegressor\n```\n\n``` python\ndataYaz, XTrain, yTrain, XTest, yTest = loadDataYaz(returnXY = True)\nLGBM = LGBMRegressor(n_jobs = 1)\n```\n\nThere are three parameters for\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex):\n\n- **estimator**: A point forecasting model that must have a `predict`\n  method.\n- **binSize**: The number of training samples considered to compute the\n  conditional densities (for more details, see *To be written*).\n- **weightsByDistance**: If *False*, all considered training samples are\n  weighted equally. 
If *True*, training samples are weighted by the\n  inverse of the distance of their respective point forecast to the\n  point forecast of the test sample at hand.\n\n``` python\nLSKDEx = LevelSetKDEx(estimator = LGBM, \n                      binSize = 100,\n                      weightsByDistance = False)\n```\n\nThere is no need to run `fit` on the point forecasting model before\ninitializing *LevelSetKDEx*, because the `fit` method of\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nautomatically checks whether the provided model has already been fitted\nand runs the respective `fit` method of the point forecaster if\nneeded.\n\nIt should be noted that running `fit` for the *LevelSetKDEx* approaches\ntakes exceptionally little time even for datasets with $\u003e10^6$ samples\n(provided, of course, that the underlying point forecasting model has\nbeen fitted beforehand).\n\n``` python\nLSKDEx.fit(X = XTrain, y = yTrain)\n```\n\nTo compute conditional densities for the test samples, we\nsimply run the `getWeights` method.\n\n``` python\nconditionalDensities = LSKDEx.getWeights(X = XTest,\n                                         outputType = 'summarized')\n\nprint(f\"probabilities: {conditionalDensities[0][0]}\")\nprint(f\"demand values: {conditionalDensities[0][1]}\")\n```\n\n    probabilities: [0.49 0.01 0.21 0.01 0.16 0.07 0.04 0.01]\n    demand values: [0.         0.01075269 0.04       0.04878049 0.08       0.12\n     0.16       0.2       ]\n\nHere, `conditionalDensities` is a list whose elements correspond to the\nsamples specified via `X`. Every element is a tuple whose first\nentry contains the probabilities and whose second entry contains the corresponding\ndemand values (side note: The demand values have been scaled to lie in\n$[0, 1]$). 
In the above example, we can see that our model\nestimates that for the first test sample the demand will be 0 with a\nprobability of 49%.\n\nAs the input argument *outputType* of `getWeights` suggests, we can\noutput the conditional density estimations in various forms.\nThere are currently 5 output types specifying what the output\nfor each sample looks like:\n\n- **all**: An array with the same length as the number of training\n  samples. Each entry represents the probability of the corresponding training\n  sample.\n- **onlyPositiveWeights**: A tuple. The first element of the tuple\n  represents the probabilities and the second one the indices of the\n  corresponding training samples. Only probabilities greater than zero are\n  returned. Note: This is the most memory and computationally efficient\n  output type.\n- **summarized**: A tuple. The first element of the tuple represents the\n  probabilities and the second one the corresponding values of `yTrain`.\n  The probabilities corresponding to identical values of `yTrain` are\n  aggregated.\n- **cumulativeDistribution**: A tuple. The first element of the tuple\n  represents the cumulative probabilities and the second one the corresponding\n  values of `yTrain`.\n- **cumulativeDistributionSummarized**: A tuple. The first element of\n  the tuple represents the cumulative probabilities and the second one the\n  corresponding values of `yTrain`. The probabilities corresponding to\n  identical values of `yTrain` are aggregated.\n\nFor example, by setting\n`outputType = 'cumulativeDistributionSummarized'` we can compute an\nestimation of the conditional cumulative distribution function for each\nsample. 
Below, we can see that our model predicts the demand of the\nfirst sample to be less than or equal to 0.16 with a probability of 99%.\n\n``` python\ncumulativeDistributions = LSKDEx.getWeights(X = XTest,\n                                            outputType = 'cumulativeDistributionSummarized')\n\nprint(f\"cumulated probabilities: {cumulativeDistributions[0][0]}\")\nprint(f\"demand values: {cumulativeDistributions[0][1]}\")\n```\n\n    cumulated probabilities: [0.49 0.5  0.71 0.72 0.88 0.95 0.99 1.  ]\n    demand values: [0.         0.01075269 0.04       0.04878049 0.08       0.12\n     0.16       0.2       ]\n\nWe can also compute estimations of quantiles using the `predict` method.\nThe parameter *probs* specifies the quantiles we want to predict.\n\n``` python\npredRes = LSKDEx.predict(X = XTest,\n                         outputAsDf = True, \n                         probs = [0.1, 0.5, 0.75, 0.99])\nprint(predRes.iloc[0:6, :].to_markdown())\n```\n\n    |    |       0.1 |       0.5 |   0.75 |   0.99 |\n    |---:|----------:|----------:|-------:|-------:|\n    |  0 | 0         | 0.0107527 |   0.08 |   0.16 |\n    |  1 | 0         | 0.08      |   0.12 |   0.2  |\n    |  2 | 0.04      | 0.0967742 |   0.12 |   0.24 |\n    |  3 | 0.056338  | 0.12      |   0.16 |   0.28 |\n    |  4 | 0.04      | 0.0967742 |   0.12 |   0.24 |\n    |  5 | 0.0666667 | 0.16      |   0.2  |   0.32 |\n\n## How to tune binSize parameter of LevelSetKDEx\n\n`dddex` also comes with the class `QuantileCrossValidation` that allows\ntuning quantile predictors in an efficient manner. The class is\ndesigned in a very similar fashion to the cross-validation classes of\nScikit-Learn. As such, at first\n[`QuantileCrossValidation`](https://kaiguender.github.io/dddex/crossvalidation.html#quantilecrossvalidation) is\ninitialized with all the settings for the cross-validation.\n\n- **quantileEstimator**: A model that must have a `set_params`, `fit`\n  and `predict` method. 
Additionally, the `predict` method must (!) have\n  a function argument called `probs` that allows specifying which\n  quantiles to predict.\n- **cvFolds**: An iterable yielding (train, test) splits as arrays of\n  indices.\n- **parameterGrid**: The candidate parameter values to evaluate. Must be a\n  dict.\n- **probs**: The probabilities for which quantiles are computed and\n  evaluated.\n- **refitPerProb**: If True, for every probability a fitted copy of\n  *quantileEstimator* with the best parameter setting for the respective\n  p-quantile is stored in the attribute *bestEstimator_perProb*.\n- **n_jobs**: How many cross-validation split results to compute in\n  parallel.\n\nAfter specifying the settings, `fit` has to be called to compute the\nresults of the cross-validation. The performance of every parameter\nsetting is evaluated by computing the relative reduction of the\npinball loss in comparison to the quantile estimations generated by\n*SAA* (Sample Average Approximation) for every quantile.\n\n``` python\nfrom dddex.crossValidation import groupedTimeSeriesSplit, QuantileCrossValidation\n\ndataTrain = dataYaz[dataYaz['label'] == 'train']\ncvFolds = groupedTimeSeriesSplit(data = dataTrain, \n                                 kFolds = 3,\n                                 testLength = 28,\n                                 groupFeature = 'id',\n                                 timeFeature = 'dayIndex')\n\nLSKDEx = LevelSetKDEx(estimator = LGBM)\nparamGrid = {'binSize': [20, 100, 400, 1000],\n             'weightsByDistance': [True, False]}\n\nCV = QuantileCrossValidation(quantileEstimator = LSKDEx,\n                             parameterGrid = paramGrid,\n                             cvFolds = cvFolds,\n                             probs = [0.01, 0.25, 0.5, 0.75, 0.99],\n                             refitPerProb = True,\n                             n_jobs = 3)\n\nCV.fit(X = XTrain, y = yTrain)\n```\n\nThe best value for *binSize* can either be computed for every 
quantile\nseparately or for all quantiles at once by computing the average cost\nreduction over all quantiles.\n\n``` python\nprint(f\"Best binSize over all quantiles: {CV.bestParams}\")\nCV.bestParams_perProb\n```\n\n    Best binSize over all quantiles: {'binSize': 1000, 'weightsByDistance': False}\n\n    {0.01: {'binSize': 1000, 'weightsByDistance': False},\n     0.25: {'binSize': 20, 'weightsByDistance': False},\n     0.5: {'binSize': 100, 'weightsByDistance': False},\n     0.75: {'binSize': 100, 'weightsByDistance': False},\n     0.99: {'binSize': 1000, 'weightsByDistance': False}}\n\nThe exact results are also stored as attributes. The easiest way to view\nthe results is via `cvResults`, which holds the average results\nover all cross-validation folds:\n\n``` python\nprint(CV.cvResults.to_markdown())\n```\n\n    |               |    0.01 |     0.25 |      0.5 |     0.75 |    0.99 |\n    |:--------------|--------:|---------:|---------:|---------:|--------:|\n    | (20, True)    | 3.79553 | 0.946626 | 0.89631  | 0.974659 | 2.98365 |\n    | (20, False)   | 3.23956 | 0.849528 | 0.808262 | 0.854069 | 2.46195 |\n    | (100, True)   | 3.11384 | 0.92145  | 0.871266 | 0.922703 | 2.22249 |\n    | (100, False)  | 1.65191 | 0.857026 | 0.803632 | 0.835323 | 1.81003 |\n    | (400, True)   | 2.57563 | 0.908214 | 0.851471 | 0.900311 | 2.03445 |\n    | (400, False)  | 1.64183 | 0.860281 | 0.812806 | 0.837641 | 1.57534 |\n    | (1000, True)  | 2.34575 | 0.893628 | 0.843721 | 0.888143 | 1.82368 |\n    | (1000, False) | 1.54641 | 0.869606 | 0.854369 | 0.88065  | 1.52644 |\n\nNote that values greater than 1\nimply that the respective model performed worse than SAA. This is, of\ncourse, simply due to the fact that we didn’t tune the hyperparameters\nof the underlying `LGBMRegressor` point predictor and instead used the\ndefault parameter values. The
The\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)classes\nare able to produce highly accurate density estimations, but are\nobviously not able to turn a terrible point predictor into a highly\nperformant conditional density estimator. The performance of the\nunderlying point predictor and the constructed\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nmodel go hand in hand.\n\nWe can also access the results for every fold separately via\n`cv_results_raw`, which is a list with one entry per fold:\n\n``` python\nCV.cvResults_raw\n```\n\n    [                               0.01      0.25      0.50      0.75      0.99\n     binSize weightsByDistance                                                  \n     20      True               3.730363  0.977152  0.949944  1.093261  4.590650\n             False              3.068598  0.854633  0.855041  0.953362  3.663885\n     100     True               3.359961  0.945510  0.922778  1.027477  3.475501\n             False              1.626054  0.871327  0.833379  0.907911  2.591117\n     400     True               2.663854  0.928036  0.907505  0.995238  3.149022\n             False              1.732673  0.860440  0.828015  0.890643  2.190292\n     1000    True               2.463221  0.914308  0.897978  0.979345  2.753553\n             False              1.464534  0.873277  0.858563  0.891858  1.830334,\n                                    0.01      0.25      0.50      0.75      0.99\n     binSize weightsByDistance                                                  \n     20      True               4.725018  0.958236  0.891472  0.914408  2.253200\n             False              4.157297  0.841141  0.795929  0.830544  1.883320\n     100     True               3.687090  0.933531  0.876655  0.875718  1.551640\n             False              1.752709  0.862970  0.812126  0.819613  1.416013\n     400     True               3.061210  
0.920190  0.851794  0.873496  1.464974\n             False              2.085622  0.887758  0.839370  0.859290  1.296445\n     1000    True               2.784076  0.903801  0.840009  0.856845  1.381658\n             False              1.767468  0.869484  0.860893  0.876293  1.464460,\n                                    0.01      0.25      0.50      0.75      0.99\n     binSize weightsByDistance                                                  \n     20      True               2.931208  0.904490  0.847513  0.916307  2.107091\n             False              2.492787  0.852811  0.773815  0.778301  1.838642\n     100     True               2.294471  0.885308  0.814365  0.864913  1.640339\n             False              1.576956  0.836781  0.765390  0.778446  1.422947\n     400     True               2.001828  0.876417  0.795114  0.832198  1.489340\n             False              1.107203  0.832645  0.771034  0.762992  1.239275\n     1000    True               1.789944  0.862776  0.793177  0.828237  1.335825\n             False              1.407221  0.866058  0.843651  0.873799  1.284521]\n\nThe models with the best *binSize* parameter are automatically computed\nwhile running `fit` and can be accessed via `bestEstimatorLSx`. 
If\n`refitPerProb = True`, the fitted models are stored per probability in the\nattribute `bestEstimator_perProb`, a dictionary whose\nkeys are the probabilities specified via the parameter *probs*.\n\n``` python\nLSKDEx_best99 = CV.bestEstimator_perProb[0.99]\npredRes = LSKDEx_best99.predict(X = XTest,\n                                probs = 0.99)\nprint(predRes.iloc[0:6, ].to_markdown())\n```\n\n    |    |   0.99 |\n    |---:|-------:|\n    |  0 |   0.32 |\n    |  1 |   0.32 |\n    |  2 |   0.32 |\n    |  3 |   0.32 |\n    |  4 |   0.32 |\n    |  5 |   0.32 |\n\n## Benchmarks: Random Forest wSAA\n\nThe `dddex` package also contains useful non-parametric benchmark models\nto compare the performance of the\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nmodels to other state-of-the-art non-parametric models capable of\ngenerating conditional density estimations. In a [meta-analysis\nconducted by S. Butler et\nal.](https://ml-eval.github.io/assets/pdf/ICLR22_Workshop_ML_Eval_DDNV.pdf)\nthe most performant model has been found to be [weighted sample average\napproximation\n(wSAA)](https://pubsonline.informs.org/doi/10.1287/mnsc.2018.3253) based\non *Random Forest*. This model has been implemented in a Scikit-Learn\nfashion as well.\n\n``` python\nfrom dddex.wSAA import RandomForestWSAA\nRF = RandomForestWSAA()\n```\n\n[`RandomForestWSAA`](https://kaiguender.github.io/dddex/wsaa.html#randomforestwsaa)\nis a class derived from the original `RandomForestRegressor` class from\nScikit-Learn that has been extended to be able to generate conditional\ndensity estimations in the manner described by Bertsimas et al. 
in their\npaper [*From Predictive to prescriptive\nanalytics*](https://pubsonline.informs.org/doi/10.1287/mnsc.2018.3253).\nThe *Random Forest* modell is being fitted in exactly the same way as\nthe original *RandomForestRegressor*:\n\n``` python\nRF.fit(X = XTrain, y = yTrain)\n```\n\nIdentical to the\n[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)\nand\n[`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn)\nclasses, an identical method called `getWeights` and `predict`are\nimplemented to compute conditional density estimations and quantiles.\nThe output is the same as before.\n\n``` python\nconditionalDensities = RF.getWeights(X = XTest,\n                                     outputType = 'summarized')\n\nprint(f\"probabilities: {conditionalDensities[0][0]}\")\nprint(f\"demand values: {conditionalDensities[0][1]}\")\n```\n\n    probabilities: [0.08334138 0.17368071 0.2987331  0.10053752 0.1893534  0.09121861\n     0.04362338 0.0145119  0.005     ]\n    demand values: [0.   
0.04 0.08 0.12 0.16 0.2  0.24 0.28 0.32]\n\n``` python\npredRes = RF.predict(X = XTest,\n                     probs = [0.01, 0.5, 0.99],\n                     outputAsDf = True)\nprint(predRes.iloc[0:6, :].to_markdown())\n```\n\n    |    |   0.01 |   0.5 |   0.99 |\n    |---:|-------:|------:|-------:|\n    |  0 |      0 |  0.08 |   0.28 |\n    |  1 |      0 |  0.12 |   0.32 |\n    |  2 |      0 |  0.12 |   0.32 |\n    |  3 |      0 |  0.12 |   0.32 |\n    |  4 |      0 |  0.12 |   0.32 |\n    |  5 |      0 |  0.2  |   0.4  |\n\nThe original `predict` method of the `RandomForestRegressor` has been\nrenamed to `pointPredict`:\n\n``` python\nRF.pointPredict(X = XTest)[0:6]\n```\n\n    array([0.1064    , 0.1184    , 0.1324    , 0.1324    , 0.1364    ,\n           0.18892683])\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd3group%2Fdddex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fd3group%2Fdddex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd3group%2Fdddex/lists"}