{"id":16542113,"url":"https://github.com/bcebere/genentech-404-challenge","last_synced_at":"2026-05-13T02:12:05.554Z","repository":{"id":130016135,"uuid":"556637511","full_name":"bcebere/genentech-404-challenge","owner":"bcebere","description":"6th place entry for the Genentech – 404 Challenge","archived":false,"fork":false,"pushed_at":"2022-11-16T10:03:51.000Z","size":4783,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-14T10:17:27.768Z","etag":null,"topics":["automl","data-imputation","imputation-methods","kaggle-competition","tabular-data"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bcebere.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-10-24T08:18:56.000Z","updated_at":"2022-11-16T10:09:17.000Z","dependencies_parsed_at":"2023-12-18T15:13:12.460Z","dependency_job_id":"a449a954-e0cd-40f9-855f-641fd7462849","html_url":"https://github.com/bcebere/genentech-404-challenge","commit_stats":{"total_commits":33,"total_committers":1,"mean_commits":33.0,"dds":0.0,"last_synced_commit":"14d01c2ae9dcd610c340aaee86c9203ec8c071cf"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcebere%2Fgenentech-404-challenge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcebere%2Fgenentech-404-challenge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcebere%2Fgenentech-404-challenge/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bcebere%2Fgenentech-404-challenge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bcebere","download_url":"https://codeload.github.com/bcebere/genentech-404-challenge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241794136,"owners_count":20021192,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","data-imputation","imputation-methods","kaggle-competition","tabular-data"],"created_at":"2024-10-11T18:56:40.501Z","updated_at":"2026-05-13T02:12:00.523Z","avatar_url":"https://github.com/bcebere.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Genentech 404 Challenge (Kaggle) - 6th place entry\n\n\n- Competition: https://www.kaggle.com/competitions/genentech-404-challenge\n- Solution notebook: [Notebook](https://github.com/bcebere/genentech-404-challenge/blob/main/solution.ipynb)\n\n\nChallenge review\n================================\n\nGiven a clinical time series dataset with missing values, complete the\nempty values. 
The patients are indexed by the id, and for each patient,\nthe temporal index is given by \\\"VISCODE\\\".\n\n-   Although this is a time series problem, several patients have a single\n    visit (255 in the training set), which should be handled as a\n    static/horizontal imputation problem.\n\n-   Features \\\"PTGENDER\\_num\\\", \\\"PTEDUCAT\\\", \\\"APOE4\\\" are constant for\n    each patient and need special handling - we should not have multiple\n    values by \\\"RID\\_HASH\\\".\n\n-   There is a correlation between \\\"VISCODE\\\" and \\\"AGE\\\", which\n    must be respected. More precisely, 1 \\\"VISCODE\\\" step maps to 1\n    month. \\\"VISCODE\\\" = 6 maps to 6 months after the baseline visit.\n\n-   The rest of the features are temporal, depending on the\n    \\\"RID\\_HASH\\\" and \\\"VISCODE\\\"/\\\"AGE\\\".\n\n-   \\\"CDRSB\\\" is a multiple of $0.5$.\n\n-   \\\"DX\\_NUM\\\" and \\\"MMSE\\\" must be integers.\n\n-   \\\"ADAS13\\\" is a multiple of $1/3$.\n\nBuilding blocks for the solution\n================================\n\nThe solution tries to construct an iterative imputation method in 2\ndimensions - both static/horizontal and temporal/vertical imputation.\n\nWe first describe each building block and, finally, how the components\ninteract.\n\nDataset augmentation\n--------------------\n\n#### Data preprocessing\n\n1.  Sort the dataset by the tuple (\\\"RID\\_HASH\\\", \\\"VISCODE\\\"). This\n    speeds up the processing.\n\n2.  Augment each row with some temporal details: **total known visits**\n    and **last known visit** for the current patient.\n\n3.  Using a MinMaxScaler, scale the columns \\\"MMSE\\\", \\\"ADAS13\\\",\n    \\\"Ventricles\\\", \\\"Hippocampus\\\", \\\"WholeBrain\\\", \\\"Entorhinal\\\",\n    \\\"Fusiform\\\", \\\"MidTemp\\\". 
While this was initially intended for the neural\n    net experiments, it also helped some linear models in the\n    longitudinal imputation.\n\nWhile the visits are not complete in the test set, adding the total\nvisits and last known visit improved the public score.\n\n#### Other failed approaches\n\nOther attempts to augment the dataset that failed include:\n\n-   Learn a latent space using an RNN, LSTM, or Transformer - from each\n    row and its missingness mask - and append it to each row.\n\n-   Add details related to trends from \\[Ventricles,\n    Hippocampus, WholeBrain, Entorhinal, Fusiform, MidTemp\\].\n\nWhile these approaches might help in a forecasting problem, they didn't\nimprove the imputation error here. The latent space isn't trivial to\nmodel with missing values, and the neural net seemed to overfit the\n\\\"dev\\_set\\\", leading to a poor score on the test data. The trends\ndidn't help either, as the missingness mangled them.\n\nWhile additional augmentations would be ideal, \\\"total\\_visits\\\" and\n\\\"last\\_visit\\\" were the only ones that did not hurt the public test\nscore.\n\nStatic features\n---------------\n\nWhile trivial, the first step of the imputation is a good sanity check.\n\nFor each static feature: \\[\\\"PTGENDER\\_num\\\", \\\"PTEDUCAT\\\", \\\"APOE4\\\"\\],\nwe propagate the existing values for a patient to all the time points.\n\nFor the \\\"AGE\\\" feature, we propagate the value using the \\\"VISCODE\\\"\nvalue. More precisely, for observations $i$ and $i + 1$ for a patient,\nwe use the formula\n$\\text{AGE}[i + 1] = (\\text{VISCODE}[i + 1] - \\text{VISCODE}[i]) / 12 + \\text{AGE}[i]$.\n\nIterative Horizontal (visit-wise) imputation using [HyperImpute](https://github.com/vanderschaarlab/hyperimpute)\n-------------------------------------------\n\nNext, we need a method for imputing a single visit. Given a single visit\nrow, with some missing values, we should be able to impute the missing\nvalues from the observed ones. 
This step ignores the temporal setup. The\nmain benefits of the horizontal imputation are:\n\n1.  It addresses the patients with single visits, whose missing values\n    cannot be inferred from other visits.\n\n2.  It handles the static features of each patient - (\\\"PTGENDER\\_num\\\",\n    \\\"PTEDUCAT\\\", \\\"APOE4\\\").\n\n3.  It can be used as a seed for the longitudinal imputer, by imputing a\n    single visit - the easiest to impute statically. Starting from that,\n    the longitudinal imputer can deduce the other temporal values.\n\nFor the task, we are using [**HyperImpute**](https://github.com/vanderschaarlab/hyperimpute), an iterative imputation algorithm, which\ngeneralizes MICE and missForest, by allowing any type of base\nlearner (not just linear/random forest), and which uses AutoML to tune\nthe models by column and by iteration. The figure below shows the high-level\narchitecture of the iterative method, which we'll extend to the\nlongitudinal dimension for the current dataset.\n\n![image](img/figure.png)\n\nThe pool of **base learners** for HyperImpute is:\n\n-   *classifiers*: \\\"xgboost\\\", \\\"catboost\\\", \\\"logistic\\_regression\\\",\n    \\\"random\\_forest\\\", \\\"lgbm\\\".\n\n-   *regressors*: \\\"xgboost\\_regressor\\\", \\\"catboost\\_regressor\\\",\n    \\\"linear\\_regression\\\", \\\"random\\_forest\\_regressor\\\",\n    \\\"lgbm\\_regressor\\\".\n\nFor selecting a classifier, the internal optimizer searches for the\nmaximal **AUCROC** score obtained on the observed data (data without any\nmissing values). For selecting a regressor, the optimizer searches for\nthe maximal **R2 score** obtained on the observed data.\n\n#### Model search\n\nThe models are evaluated for each column and for each imputation\niteration. The evaluation is performed on observed data, and for each\ntarget column, we use the rest of the features from each patient. 
For\nexample, for target \\\"AGE\\\", we benchmark a model trained on the patient\nfeatures without the \\\"AGE\\\" column.\n\n\nIterative Longitudinal imputation\n---------------------------------\n\nFor patients with multiple visits, it is important that the imputation\nis constrained by the other visits.\n\nHyperImpute is dedicated to static/instance-wise imputation and cannot\nbe used directly for time series imputation. However, we can use its\nbase learners for this task by preprocessing the data.\n\n#### Preprocessing\n\nTwo sets of estimators are trained, the **forward** and **reverse**\nimputers, one for each direction in which we try to approximate a\nparticular feature.\n\nFor every target feature:\n\n-   We retrieve the previous visit, given the direction we work in. For\n    the direction *forward*, the previous visit is the previous\n    \\\"VISCODE\\\". For the direction *reverse*, the previous visit is the\n    next \\\"VISCODE\\\".\n\n-   We generate a training set consisting of the previous visit and the\n    current visit without the target feature.\n\n-   We evaluate how well the current feature value can be predicted\n    from the previous visit plus the other features from the same\n    timestamp.\n\n-   The \\\"prepare\\_temporal\\_data\\\" function from the notebooks\n    implements this preprocessing.\n\n#### Model Search\n\nThe search pool consists of \\\"XGBoost\\\", \\\"CatBoost\\\", \\\"LGBM\\\",\n\\\"Random forest\\\", \\\"KNearestNeighbor\\\" and linear models. For each\nfeature, the objective task is to predict the next (for forward imputers)\nor the previous (for the reverse imputers) value of a column. Similar to\nthe Horizontal imputation setup, we run the AutoML logic on the\npreprocessed data, selecting the optimal AUCROC for classifiers and the\noptimal R2 score for regressors.\n\n\nPutting the pieces together\n===========================\n\nThe complete imputation algorithm is:\n\n**Steps:**\n\n1.  
**Constants imputation:** Impute the constants \\[\\\"PTGENDER\\_num\\\",\n    \\\"PTEDUCAT\\\", \\\"APOE4\\\"\\] and the \\\"AGE\\\" from the existing observed\n    data.\n\n2.  **Longitudinal imputation loop.**\n    1.  Create $X_{dummy}$, an imputed support version of $X$, by\n        propagating the last valid observation forward (using pandas'\n        ffill) or using the next valid observation to fill the gap (using\n        bfill). Where ffill/bfill cannot cover all the missingness,\n        HyperImpute is used.\n\n    2.  For each missing value from a column $C$:\n\n        1.  If a previous value was observed in column $C$, we use the\n            longitudinal imputers from the **forward** set to fill the\n            value. If any of the imputer's input values are missing,\n            they are taken from $X_{dummy}$.\n\n        2.  If a future value was observed in column $C$, we use the\n            longitudinal imputers from the **reverse** set to fill the\n            value. If any of the imputer's input values are missing,\n            they are taken from $X_{dummy}$.\n\n        3.  If both previous and future values are observed for column\n            $C$, we use both forward and reverse imputers, and average\n            their output.\n\n        4.  If the longitudinal imputers cannot fill any other values,\n            break. Else, continue.\n\n        5.  After this step, any feature that is still missing is\n            completely unobserved for that patient. Otherwise, it would\n            have been filled in the longitudinal loop.\n\n3.  **Horizontal imputation.** We run the horizontal\n    imputation to fill in the missing values. For each patient, we merge\n    the horizontal imputation into the rows with the *fewest missing\n    values*. We are not using the full horizontal imputation because it\n    could miss some temporal patterns. Instead, we use it to seed a\n    second longitudinal imputation loop. 
In particular, this step\n    imputes all the patients with a single visit.\n\n4.  **Second Constants imputation:** Impute the constants and the\n    \\\"AGE\\\", seeded by the horizontal imputation.\n\n5.  **Second Longitudinal imputation loop.** Run the exact same\n    longitudinal imputation loop, seeded by the horizontal imputation.\n\n6.  **$X_{imputed}$ output** : $X$ is completely imputed at this point.\n\n**Submission dataset.** For the final submission, a few extra steps are\nexecuted:\n\n1.  Unscale the scaled features\n2.  Review the Data sanity checks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcebere%2Fgenentech-404-challenge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcebere%2Fgenentech-404-challenge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcebere%2Fgenentech-404-challenge/lists"}