{"id":28449131,"url":"https://github.com/sodascience/social_science_inferences_with_llms","last_synced_at":"2026-01-30T19:02:56.632Z","repository":{"id":282637454,"uuid":"890407688","full_name":"sodascience/social_science_inferences_with_llms","owner":"sodascience","description":"Addressing LLM-related measurement error in social science modeling research.","archived":false,"fork":false,"pushed_at":"2025-05-08T11:53:56.000Z","size":94,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-01T03:41:35.496Z","etag":null,"topics":["data-collection","inference","large-language-models","llms"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sodascience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-18T14:14:11.000Z","updated_at":"2025-06-27T11:12:03.000Z","dependencies_parsed_at":"2025-07-01T03:35:08.616Z","dependency_job_id":"f35d2a7a-f33d-435a-8345-fbef7ea61421","html_url":"https://github.com/sodascience/social_science_inferences_with_llms","commit_stats":null,"previous_names":["sodascience/social_science_inferences_with_llms"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sodascience/social_science_inferences_with_llms","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsocial_science_inferences_with_llms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsocial_science_inferenc
es_with_llms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsocial_science_inferences_with_llms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsocial_science_inferences_with_llms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sodascience","download_url":"https://codeload.github.com/sodascience/social_science_inferences_with_llms/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodascience%2Fsocial_science_inferences_with_llms/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28917454,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-30T16:37:38.804Z","status":"ssl_error","status_checked_at":"2026-01-30T16:37:37.878Z","response_time":66,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-collection","inference","large-language-models","llms"],"created_at":"2025-06-06T14:06:46.182Z","updated_at":"2026-01-30T19:02:56.611Z","avatar_url":"https://github.com/sodascience.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Addressing LLM-related Measurement Error in Social Science Modeling Research\n\n## 
Workshop\n[Link](https://sodascience.github.io/workshop_llm_data_collection/) to slides, tutorials (`R` and `Python`) and data.\n\n## Abstract\n\nWith the advent of large language models (LLMs), the collection of measurements related to \u003cins\u003esocial science constructs\u003c/ins\u003e (e.g., personality traits, political attitudes, human values) has become easier, faster and more affordable. These measurements are subsequently used for \u003cins\u003emodelling of societal and group processes\u003c/ins\u003e that social scientists typically engage in, where \u003cins\u003einferences from samples to populations are also made\u003c/ins\u003e. Valid modelling and inferences, however, require high-quality measurements or, at the very least, \u003cins\u003emethods to deal with the presence of measurement error\u003c/ins\u003e. Just like traditional questionnaire-based measurements, LLM-based measurements have been shown to suffer from validity and reliability issues. \n\nWhile there is an abundance of research literature on dealing with measurement error, it focuses on questionnaire-based measurement error. \u003cins\u003eIt is relatively new to social scientists how to deal with measurement issues arising from LLMs\u003c/ins\u003e. \n\nThis project has three primary objectives. \n\nFirst, we \u003cins\u003ereview existing literature\u003c/ins\u003e to identify methods for addressing LLM-related measurement error in social science modelling. \nSecond, we conduct simulation studies to compare existing methods.\nLastly, we synthesise these findings with existing measurement modelling literature to propose \u003cins\u003ea practical framework\u003c/ins\u003e for making valid social science inferences using LLM-based measurements. 
By bridging the gap between LLM prediction capabilities and social science inference requirements, our framework aims to enhance the reliability and validity of social science research outcomes in the era of LLMs.\n\n## Literature Overview\nCurrent literature can be sorted into four groups: \n\n1. Inferences with LLM-based predictions;\n2. Inferences with general machine learning-based predictions;\n3. Inferences with general measurement error in the social sciences;\n4. Others, such as missing data imputation, conformal prediction, semi-supervised learning.\n\nExisting proposed methods can be distinguished based on whether the LLM- or machine learning-based predictions are made on the `predictors`, the `outcome variable` or `both` that are to be used in downstream modelling (typically with regression models). \n\n### Inferences with LLMs\n\n| Year | Title | Predicted Variable(s) | \n| --- | --- | --- |\n| 2023 | [Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models](https://openreview.net/pdf?id=e8RZwixcE4) | Outcome |\n| 2024 | [Inference for Regression with Variables Generated from Unstructured Data](https://arxiv.org/pdf/2402.15585) | Predictor |\n| 2024 | [From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsies](https://openreview.net/pdf?id=QbCHlIqbDJ) | Outcome |\n| 2024 | [Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses](https://naokiegami.com/paper/dsl_ss.pdf) | Predictor and outcome |\n\n### Inferences with Machine Learning Predictions\n\n| Year | Title | Predicted Variable(s) | \n| --- | --- | --- |\n| 2020 | [Methods for correcting inference based on outcomes predicted by machine 
learning](https://www.pnas.org/doi/full/10.1073/pnas.2001238117?gad_source=1\u0026gclid=CjwKCAiAxqC6BhBcEiwAlXp45xykgurcH-QuopXIjbAOtssXUZoCauzjRRTmmd-Ud3FFmJp3RhODIBoCgUsQAvD_BwE) | Outcome |\n| 2022 | [How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do about It](https://ideas.repec.org/p/osf/socarx/453jk.html) | Predictor or outcome |\n| 2023 | [Prediction-powered inference](https://www.science.org/doi/10.1126/science.adi6000) | Outcome |\n| 2024 | [PPI++: Efficient Prediction-Powered Inference](https://arxiv.org/abs/2311.01453) | Outcome |\n| 2024 | [Cross-prediction-powered inference](https://www.pnas.org/doi/abs/10.1073/pnas.2322083121?gad_source=1\u0026gclid=CjwKCAiAxqC6BhBcEiwAlXp455jkwwIzsaI_14eWknuE5daWeUS4TGu8V--VwXJf9bGEUJ5vJodv7BoCEGEQAvD_BwE) | Outcome |\n| 2024 | [A Note on the Prediction-Powered Bootstrap](https://arxiv.org/abs/2405.18379) | Outcome |\n| 2024 | [Assumption-Lean and Data-Adaptive Post-Prediction Inference](https://arxiv.org/abs/2311.14220) | Predictor and outcome |\n| 2024 | [ipd: An R Package for Conducting Inference on Predicted Data](https://arxiv.org/abs/2410.09665) | Outcome |\n| 2024 | [Task-Agnostic Machine-Learning-Assisted Inference](https://arxiv.org/abs/2405.20039) | Outcome |\n| 2024 | [Prediction De-Correlated Inference: A safe approach for post-prediction inference](https://arxiv.org/abs/2312.06478) | Outcome |\n\n### Inferences with Measurement Error\nTBA.\n\n### Other Approaches\ne.g., Missing data imputation, semi-supervised learning, conformal prediction. 
\nTBA.\n\n## Datasets\nTBA.\n\n## Software Packages\n| Name | Method | Language | Estimators | Predicted Variables |\n|----|----|----|----|----|\n| [PostPI](https://github.com/leekgroup/postpi) | Post-Prediction Inference | R | Means, quantiles and GLMs | Outcome | \n| [PPI, PPI++, Cross-PPI, PPBoot](https://github.com/aangelopoulos/ppi_py) | Prediction-powered inference and its extensions | Python | Any arbitrary estimator | Outcome | \n| [PSPA](https://github.com/qlu-lab/pspa) | PoSt-Prediction Adaptive inference | R | Means, quantiles, linear regression, logistic regression | Predictor and outcome | \n| [ipd](https://github.com/ipd-tools/ipd) | Implementations of PostPI, PPI, PPI++ and PSPA | R | Means, quantiles, linear regression, logistic regression | Outcome | \n| [PSPS](https://github.com/qlu-lab/psps) | PoSt-Prediction Summary-statistics-based (PSPS) inference | R and Python | M-estimators | Outcome | \n| [DSL](https://naokiegami.com/dsl/) | Design-based Supervised Learning | R | Moment-based estimators | Predictor and outcome | \n\n## Simulation Studies\nTBA.\n\n## Contact\n\u003cimg src=\"./img/soda_logo.png\" alt=\"SoDa logo\" width=\"250px\"/\u003e\n\nThis project is developed and maintained by the [ODISSEI Social Data\nScience (SoDa)](https://odissei-soda.nl) team.\n\nDo you have questions, suggestions, or remarks? File an issue in the\nissue tracker or feel free to contact the team at [`odissei-soda.nl`](https://odissei-soda.nl)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fsocial_science_inferences_with_llms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsodascience%2Fsocial_science_inferences_with_llms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodascience%2Fsocial_science_inferences_with_llms/lists"}