https://github.com/impresso/fair-sentence-transformers
Positionally fairer SentenceTransformers. Replication Code for preprint "Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias". All resources are publicly available and open-source, hoping to facilitate further research in Information Representation Fairness.
https://github.com/impresso/fair-sentence-transformers
Last synced: about 2 months ago
JSON representation
Positionally fairer SentenceTransformers. Replication Code for preprint "Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias". All resources are publicly available and open-source, hoping to facilitate further research in Information Representation Fairness.
- Host: GitHub
- URL: https://github.com/impresso/fair-sentence-transformers
- Owner: impresso
- License: agpl-3.0
- Created: 2025-04-29T15:08:23.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-04-01T09:28:39.000Z (2 months ago)
- Last Synced: 2026-04-01T11:35:48.935Z (2 months ago)
- Language: Python
- Homepage:
- Size: 12.2 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# [Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias](https://arxiv.org/abs/2601.16934) - FairSentenceTransformers and Replication 
---
## Overview
This repository accompanies our [preprint(under review)](https://arxiv.org/abs/2601.16934) with source code for the examination of positional bias in long documents of multilingual embedding models and the fair-sentence-transformers extension.
---
## Table of Contents
- [Overview](#overview)
- [Fair Sentence Transformers](#fair-sentence-transformers)
- [Datasets](#datasets)
- [Repository Structure](#repository-structure)
- [Reproducing the Experiments](#reproducing-the-experiments)
- [Citation](#citation)
- [About Impresso](#about-impresso)
- [License](#license)
---
## Fair Sentence Transformers
We introduce an inference-time attention calibration method, implemented as an extension of Sentence Transformers called Fair Sentence Transformers. This tool aims to:
1. Provide a Wrapper Class for inference-time calibration techniques that improve fairness in embedding models.
2. Support existing and future embedding model releases through generic implementations configurable to each model's attributes.
# Datasets
We provide a multilingual comparable Wikipedia dataset for examining positional bias (see Section 3). The dataset includes six languages: English, Hindi, German, Italian, Korean, and Chinese.
Access the dataset: https://huggingface.co/datasets/impresso-project/wiki_comparable_corpus_en_de_hi_it_ko_zh
For instructions on loading and using the dataset, refer to the README on Hugging Face.
### Setup and Example use:
```bash
poetry install
```
```python
from src.fair_sentence_transformers.core.fair_sentence_transformer import FairSentenceTransformer
input_texts = [
"What is the capital of Switzerland?",
"How to make an omelette?",
"Wie viele Einwohner hat Deutschland?",
]
model_name_or_path = "Alibaba-NLP/gte-multilingual-base"
model = FairSentenceTransformer(model_name_or_path)
# Standard SentenceTransformer embeddings
embeddings = model.encode(input_texts) # shape: (3, 768)
# Fair SentenceTransformer embeddings
fair_embeddings = model.encode_positionally_fair(
input_texts,
calib_strength=0.5,
calib_basket_size=128,
calib_layers=6,
) # shape: (3, 768)
```
pip install coming soon
### Supported Models and Methods
**encode_positionally_fair** - Inference-time attention calibration to ensure fair representation of input from all positions.
**Tested Models:**
- [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- [ibm-granite/granite-embedding-278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual), [107M version](ibm-granite/granite-embedding-107m-multilingual)
- Qwen3-Embedding Family: [0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), [4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B), [8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)
- microsoft/harrier-oss-v1: [0.6B](https://huggingface.co/microsoft/harrier-oss-v1-0.6b), 270M(WIP), 27B(WIP)
**Extensibility:** Our implementation can support additional models with a small configuration and quick test of new additions. Feel free put a pull request to add support to your favourite model or simply reach out to andrianos.michail@cl.uzh.ch
---
## Reproducing the Experiments
To fully replicate our results, please follow the instructions decipited in our Replication readme (REPL_README.md)
## Citation
If you use these resources, please cite our preprint:
```bibtex
@misc{schuhmacher2026informationrepresentationfairnesslongdocument,
title={Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias},
author={Elias Schuhmacher and Andrianos Michail and Juri Opitz and Rico Sennrich and Simon Clematide},
year={2026},
eprint={2601.16934},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.16934},
}
```
## About Impresso
### Impresso project
[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
### Copyright
Copyright (C) 2026 The Impresso team.
### License
This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.
---