{"id":19448660,"url":"https://github.com/neurodata/dirtysecrets","last_synced_at":"2026-02-07T16:34:01.017Z","repository":{"id":140561590,"uuid":"320432866","full_name":"neurodata/dirtysecrets","owner":"neurodata","description":"dirty secrets of data science","archived":false,"fork":false,"pushed_at":"2021-01-08T16:25:57.000Z","size":31,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-24T21:10:59.857Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neurodata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-11T01:20:10.000Z","updated_at":"2022-12-06T18:46:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"9ccf683f-3373-46e1-93de-40c1c702b506","html_url":"https://github.com/neurodata/dirtysecrets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neurodata/dirtysecrets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurodata%2Fdirtysecrets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurodata%2Fdirtysecrets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurodata%2Fdirtysecrets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurodata%2Fdirtysecrets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neurodata","download_url":"https://cod
eload.github.com/neurodata/dirtysecrets/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurodata%2Fdirtysecrets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29199846,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T16:28:23.579Z","status":"ssl_error","status_checked_at":"2026-02-07T16:28:22.566Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T16:28:01.803Z","updated_at":"2026-02-07T16:34:01.000Z","avatar_url":"https://github.com/neurodata.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# dirtysecrets\ndirty secrets of data science\n\n\n## 0 preliminaries\n\n1. Data Science: The Three Cultures (https://docs.google.com/document/d/12XqXrTDH8jwFkmVb7vGB-KUDIfNmAtftPPtqaMr6DbE/edit?usp=sharing, and reference this: http://www.stat.columbia.edu/~gelman/research/published/gelman_breiman.pdf)\n2. data -\u003e data | data -\u003e objects | objects -\u003e understanding\n3. graphs \u003e images \u003e vectors\n4. hypothesis testing \u003c classification \u003c ranking \u003c regression \u003c density estimation \u003c model selection\n5. no free lunch \u0026 arbitrary slow convergence theorem\n6. bias/variance trade-off\n    4. 
the picture from DGL\n    1. MSE\n    2. for classification\n    3. bias/complexity\n6. the desiderata of learning methods (https://bitsandbrains.io/2018/09/24/modeling-desiderata.html)\n7. would you look at it (datasaurus, same with graphs, also other gifs)?\n\n## 1 hypothesis testing\n\n1. statistical theory is fundamentally flawed (https://plato.stanford.edu/entries/statistics/)\n    1. unresolved problems with classical stats\n    2. unresolved problems with bayesian stats (http://www.stat.columbia.edu/~gelman/research/published/holes.pdf)\n3. t-tests are not what you want (MlqE figure)\n4. p-values are fine (p-value paper)\n5. pre-registration is good\n8. hypothesis testing is estimating mutual information (https://bitsandbrains.io/2018/09/29/categories-of-testing.html)\n13. your permutation test makes some assumptions\n\n\n## 2 estimation\n\n12. you didn't want the mean or variance anyway (robust statistics, show generalization error for robust estimators of mean and covariance)\n15. high-dimensional statistics is impossible and counterintuitive - you need to embed in a low dimensional space (implicitly or explicitly)\n        1. curse of dimensionality\n        2. grazing goat\n        3. relative volume of sphere and cube\n        4. therefore, when p\u003en, we implicitly or explicitly embed in low dimensions\n16. we have no model of vision, audition, or anything but Euclidean\n13. estimation vs prediction \u0026 necessary vs sufficient \u0026 global vs local (mike's paper)\n12. all learning is sequential (ali's paper)\n\n## 3 unsupervised learning\n\n5. you didn't estimate the true number of clusters\n        1. show histogram that could be 1 or 2 gaussians with different means\n        2. show histogram that could be 1 or 2 gaussians with different variances\n        3. george and foster paper intuition, with examples showing AIC/BIC/DIC all get different answers\n        4. two truths\n6. kmeans is secretly gaussian mixture modeling (jk-means paper)\n9. 
you didn't do better than PCA (papadimitriou LSI, eckart young)\n10. manifold learning is nonsense (urerf)\n11. operating in the lower dimensional space isn't always better (On the Power of Likelihood Ratio Tests in Dimension-Restricted Submodels)\n\n## 4 supervised learning\n\n6. sparse models don't work (leekasso vs lasso, and https://projecteuclid.org/euclid.aos/1509436830)\n7. your deep net is not a universal function approximator, and even if it was, that's not what you wanted anyway (cybenko, hornik, consistency/generalization vs function estimation/interpolation)\n16. we have no model of vision, audition, or anything but Euclidean\n1. just use random forest (sporf)\n1. partition and vote (mml)\n\n## 5 existing stuff doesn't work (https://bitsandbrains.io/2019/11/21/new-method.html)\n\n1. build compelling evidence that existing tools don't work well/easily\n2. pseudocode your new plan carefully\n3. build compelling evidence that your new thing does work in simulated settings where their thing doesn't work\n3. build compelling evidence that your new thing does *not* work in simulated settings where their thing does work\n4. build compelling evidence that your new thing is useful *scientifically*\n5. FIRM guidelines (https://bitsandbrains.io/2018/10/21/numerical-packages.html)\n\n## 6 communicating data science\n\n1. how to write a paper (https://bitsandbrains.io/2019/02/10/how-to-write-a-paper.html)\n1. how to make a figure (https://bitsandbrains.io/2018/09/08/figures.html)\n1. how to write a paragraph (https://bitsandbrains.io/2018/10/14/paragraphs.html)\n1. words/phrases to avoid (https://bitsandbrains.io/2018/10/14/words.html)\n2. how to review a paper\n1. how to write a grant (https://bitsandbrains.io/2018/10/14/structuring-a-grant.html)\n1. how to make slides (https://bitsandbrains.io/2018/09/04/slides.html)\n1. how to choose a project (https://bitsandbrains.io/2018/08/31/sig-and-feas.html)\n1. 
how to get into grad school (https://bitsandbrains.io/2018/10/21/getting-into-grad-school.html)\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneurodata%2Fdirtysecrets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneurodata%2Fdirtysecrets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneurodata%2Fdirtysecrets/lists"}