https://github.com/sodascience/social_science_inferences_with_llms

Addressing LLM-related measurement error in social science modeling research.
https://github.com/sodascience/social_science_inferences_with_llms

data-collection inference large-language-models llms

Last synced: 5 months ago
JSON representation

Addressing LLM-related measurement error in social science modeling research.

Host: GitHub
URL: https://github.com/sodascience/social_science_inferences_with_llms
Owner: sodascience
License: mit
Created: 2024-11-18T14:14:11.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-05-08T11:53:56.000Z (about 1 year ago)
Last Synced: 2025-07-01T03:41:35.496Z (12 months ago)
Topics: data-collection, inference, large-language-models, llms
Homepage:
Size: 91.8 KB
Stars: 7
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Addressing LLM-related Measurement Error in Social Science Modeling Research

## Workshop

[Link](https://sodascience.github.io/workshop_llm_data_collection/) to slides, tutorials (`R` and `Python`) and data.

## Abstract

With the advent of large language models (LLMs), the collection of measurements related to  social science constructs (e.g., personality traits, political attitudes, human values) has become easier, faster and more affordable. These measurements are subsequently used for modelling of societal and group processes that social scientists typically engage in, where inferences from samples to populations are also made. Valid modelling and inferences, however, requires high-quality measurements or at the very least, methods to deal with the presence of measurement error. Just like traditional questionnaire-based measurements, LLM-based measurements have been shown to suffer from validity and reliability issues. 

While there is an abundance of research literature in dealing with measurement error, they focus on questionnaire-based measurement error. It is relatively new to social scientists how to deal with measurement issues arising from LLMs. 

This project has three primary objectives. 

First, we review existing literature to identify methods for addressing LLM-related measurement error in social science modelling. 

Second, we conduct simulation studies to compare existing methods.

Lastly, we synthesise these findings with existing measurement modelling literature to propose a practical framework for making valid social science inferences using LLM-based measurements. By bridging the gap between LLM prediction capabilities and social science inference requirements, our framework aims to enhance the reliability and validity of social science research outcomes in the era of LLMs.

## Literature Overview

Current literature can be sorted into four groups: 

1. Inferences with LLM-based predictions;

2. Inferences with general machine learning-based predictions;

3. Inferences with general measurement error in the social sciences;

4. Others, such as missing data imputation, conformal prediction, semi-supervised learning.

Existing proposed methods can be distinguished based on whether the LLM- or machine learning-based predictions are made on the `predictors`, the `outcome variable` or `both` that are to be used in downstream modelling (typically with regression models). 

### Inferences with LLMs

| Year | Title | Predicted Variable(s) | 

| --- | --- | --- |

| 2023 | [Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models](https://openreview.net/pdf?id=e8RZwixcE4) | Outcome |

| 2024 | [Inference for Regression with Variables Generated from Unstructured Data](https://arxiv.org/pdf/2402.15585) | Predictor |

| 2024 | [From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsies](https://openreview.net/pdf?id=QbCHlIqbDJ) | Outcome |

| 2024 | [Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses](https://naokiegami.com/paper/dsl_ss.pdf) | Predictor and outcome |

### Inferences with Machine Learning Predictions

| Year | Title | Predicted Variable(s) | 

| --- | --- | --- |

| 2020 | [Methods for correcting inference based on outcomes predicted by machine learning](https://www.pnas.org/doi/full/10.1073/pnas.2001238117?gad_source=1&gclid=CjwKCAiAxqC6BhBcEiwAlXp45xykgurcH-QuopXIjbAOtssXUZoCauzjRRTmmd-Ud3FFmJp3RhODIBoCgUsQAvD_BwE) | Outcome |

| 2022 | [How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do about It](https://ideas.repec.org/p/osf/socarx/453jk.html) | Predictor or outcome |

| 2023 | [Prediction-powered inference](https://www.science.org/doi/10.1126/science.adi6000) | Outcome |

| 2024 | [PPI++: Efficient Prediction-Powered Inference](https://arxiv.org/abs/2311.01453) | Outcome |

| 2024 | [Cross-prediction-powered inference](https://www.pnas.org/doi/abs/10.1073/pnas.2322083121?gad_source=1&gclid=CjwKCAiAxqC6BhBcEiwAlXp455jkwwIzsaI_14eWknuE5daWeUS4TGu8V--VwXJf9bGEUJ5vJodv7BoCEGEQAvD_BwE) | Outcome |

| 2024 | [A Note on the Prediction-Powered Bootstrap](https://arxiv.org/abs/2405.18379) | Outcome |

| 2024 | [Assumption-Lean and Data-Adaptive Post-Prediction Inference](https://arxiv.org/abs/2311.14220) | Predictor and outcome |

| 2024 | [ipd: An R Package for Conducting Inference on Predicted Data](https://arxiv.org/abs/2410.09665) | Outcome |

| 2024 | [Task-Agnostic Machine-Learning-Assisted Inference](https://arxiv.org/abs/2405.20039) | Outcome |

| 2024 | [Prediction De-Correlated Inference: A safe approach for post-prediction inference](https://arxiv.org/abs/2312.06478) | Outcome |

### Inferences with Measurement Error

TBA.

### Other Approaches

e.g., Missing data imputation, semi-supervised learning, conformal prediction. 

TBA.

## Datasets

TBA.

## Software Packages

| Name | Method | Language | Estimators | Predicted Variables |

|----|----|----|----|----|

| [PostPI](https://github.com/leekgroup/postpi) | Post-Prediction Inference | R | Means, quantitles and GLMs | Outcome | 

| [PPI, PPI++, Cross-PPI, PPBoot](https://github.com/aangelopoulos/ppi_py) | Prediction-powered inference and its extensions | Python | Any arbitrary estimator | Outcome | 

| [PSPA](https://github.com/qlu-lab/pspa) | PoSt-Prediction Adaptive inference | R | Means, quantiles, linear regression, logistic regression | Predictor and outcome | 

| [ipd](https://github.com/ipd-tools/ipd) | Implemented PostPI, PPI, PPI++ and PSPA | R | Means, quantiles, linear regression, logistic regression | Outcome | 

| [PSPS](https://github.com/qlu-lab/psps) | PoSt-Prediction Summary-statistics-based (PSPS) inference | R and Python | M-estimators | Outcome | 

| [DSL](https://naokiegami.com/dsl/) | Design-based Supervised Learning | R | Moment-based estimators | Predictor and outcome | 

## Simulation Studies

TBA.

## Contact



This project is developed and maintained by the [ODISSEI Social Data

Science (SoDa)](https://odissei-soda.nl) team.

Do you have questions, suggestions, or remarks? File an issue in the

issue tracker or feel free to contact the team at [`odissei-soda.nl`](https://odissei-soda.nl)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sodascience/social_science_inferences_with_llms

Awesome Lists containing this project

README