https://github.com/impresso/fair-sentence-transformers

Positionally fairer SentenceTransformers. Replication code for the preprint "Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias". All resources are publicly available and open source, in the hope of facilitating further research on information representation fairness.
Host: GitHub
URL: https://github.com/impresso/fair-sentence-transformers
Owner: impresso
License: agpl-3.0
Created: 2025-04-29T15:08:23.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2026-04-01T09:28:39.000Z (2 months ago)
Last Synced: 2026-04-01T11:35:48.935Z (2 months ago)
Language: Python
Homepage:
Size: 12.2 MB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# [Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias](https://arxiv.org/abs/2601.16934) - FairSentenceTransformers and Replication ![License: AGPLV3+](https://img.shields.io/badge/License-AGPLV3+-brightgreen.svg)

---

## Overview

This repository accompanies our [preprint (under review)](https://arxiv.org/abs/2601.16934). It contains the source code for examining positional bias in the long-document embeddings of multilingual embedding models, together with the fair-sentence-transformers extension.

---

## Table of Contents

- [Overview](#overview)

- [Fair Sentence Transformers](#fair-sentence-transformers)

- [Datasets](#datasets)

- [Repository Structure](#repository-structure)

- [Reproducing the Experiments](#reproducing-the-experiments)

- [Citation](#citation)

- [About Impresso](#about-impresso)

- [License](#license)

---

## Fair Sentence Transformers

We introduce an inference-time attention calibration method, implemented as an extension of Sentence Transformers called Fair Sentence Transformers. This tool aims to:

1. Provide a wrapper class for inference-time calibration techniques that improve fairness in embedding models.

2. Support existing and future embedding model releases through generic implementations configurable to each model's attributes.

## Datasets

We provide a multilingual comparable Wikipedia dataset for examining positional bias (see Section 3 of the preprint). The dataset covers six languages: English, Hindi, German, Italian, Korean, and Chinese.

Access the dataset: https://huggingface.co/datasets/impresso-project/wiki_comparable_corpus_en_de_hi_it_ko_zh

For instructions on loading and using the dataset, refer to the README on Hugging Face.
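As a starting point, the dataset can presumably be fetched with the Hugging Face `datasets` library. The sketch below is a minimal example under that assumption: the helper names `list_configs` and `load_corpus` are hypothetical, and the available configuration and split names are not stated here, so they should be looked up (for example with `list_configs()`) or taken from the dataset README.

```python
"""Sketch: fetch the comparable Wikipedia corpus from the Hugging Face Hub."""

# Dataset identifier taken from the access link above.
DATASET_ID = "impresso-project/wiki_comparable_corpus_en_de_hi_it_ko_zh"


def list_configs(dataset_id: str = DATASET_ID):
    """Return the configuration names the dataset exposes on the Hub."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(dataset_id)


def load_corpus(config=None, split=None, dataset_id: str = DATASET_ID):
    """Download the corpus, or reuse the locally cached copy.

    `config` and `split` are passed straight to `datasets.load_dataset`.
    Their valid values are NOT documented in this README (assumption:
    check the dataset card). With `split=None`, all splits are returned.
    """
    from datasets import load_dataset

    return load_dataset(dataset_id, name=config, split=split)


# Example (requires network access and `pip install datasets`):
#   print(list_configs())
#   corpus = load_corpus()
```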

### Setup and Example Use

```bash
poetry install
```

```python
from src.fair_sentence_transformers.core.fair_sentence_transformer import FairSentenceTransformer

input_texts = [
    "What is the capital of Switzerland?",
    "How to make an omelette?",
    "Wie viele Einwohner hat Deutschland?",
]

model_name_or_path = "Alibaba-NLP/gte-multilingual-base"
model = FairSentenceTransformer(model_name_or_path)

# Standard SentenceTransformer embeddings
embeddings = model.encode(input_texts)  # shape: (3, 768)

# Fair SentenceTransformer embeddings
fair_embeddings = model.encode_positionally_fair(
    input_texts,
    calib_strength=0.5,
    calib_basket_size=128,
    calib_layers=6,
)  # shape: (3, 768)
```

Installation via `pip install` is coming soon.
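To get a feel for how much calibration changes each embedding, one option is to compare the standard and calibrated outputs text by text. The sketch below is an illustration only: `cosine_similarity` and `rowwise_similarity` are hypothetical helpers, not part of this package, and they accept any pair of equally long sequences of vectors, such as the rows of `embeddings` and `fair_embeddings` from the example above.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rowwise_similarity(
    standard: Sequence[Sequence[float]],
    fair: Sequence[Sequence[float]],
) -> List[float]:
    """Similarity between the i-th standard and the i-th calibrated embedding."""
    if len(standard) != len(fair):
        raise ValueError("both inputs must contain the same number of embeddings")
    return [cosine_similarity(s, f) for s, f in zip(standard, fair)]


# Example with the variables from the snippet above:
#   rowwise_similarity(embeddings, fair_embeddings)
```

Values close to 1.0 would indicate that calibration barely moved an embedding; short inputs such as the three example sentences may well change less than long documents.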

### Supported Models and Methods

**encode_positionally_fair** - Inference-time attention calibration to ensure fair representation of input from all positions.
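One rough way to check whether all positions are represented is to embed a long document built from several segments and compare that embedding with the standalone embedding of each segment. The sketch below illustrates the idea only and is not the evaluation protocol of the preprint: `positional_similarity_profile` is a hypothetical helper, and `toy_front_biased_embed` is a stand-in embedder that deliberately down-weights later tokens so the example runs without a model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Embedder = Callable[[List[str]], Sequence[Vector]]


def _cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def positional_similarity_profile(
    segments: List[str], embed: Embedder, separator: str = " "
) -> List[float]:
    """Similarity of the full-document embedding to each segment, in document order."""
    document = separator.join(segments)
    # Embed the document and all standalone segments in a single call.
    vectors = embed([document] + list(segments))
    doc_vec, segment_vecs = vectors[0], vectors[1:]
    return [_cosine(doc_vec, seg_vec) for seg_vec in segment_vecs]


def toy_front_biased_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder (NOT a real model): weighted counts over a tiny
    vocabulary, where a token at position p contributes 1 / (p + 1)."""
    vocab = ["alpha", "beta", "gamma"]
    vectors = []
    for text in texts:
        vec = [0.0] * len(vocab)
        for position, token in enumerate(text.split()):
            if token in vocab:
                vec[vocab.index(token)] += 1.0 / (position + 1)
        vectors.append(vec)
    return vectors


# With the toy embedder, similarity drops for later segments.
profile = positional_similarity_profile(
    ["alpha", "beta", "gamma"], toy_front_biased_embed
)
```

With a real model, `embed` could be `model.encode`, or a small wrapper around `model.encode_positionally_fair` with the calibration arguments fixed. A flatter profile would then suggest that positions are represented more evenly.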

**Tested Models:**

- [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)

- [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)

- [ibm-granite/granite-embedding-278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual), [107M version](https://huggingface.co/ibm-granite/granite-embedding-107m-multilingual)

- Qwen3-Embedding Family: [0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), [4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B), [8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)

- microsoft/harrier-oss-v1: [0.6B](https://huggingface.co/microsoft/harrier-oss-v1-0.6b), 270M (WIP), 27B (WIP)

**Extensibility:** Our implementation can support additional models with a small configuration change and a quick test of the new addition. Feel free to open a pull request to add support for your favourite model, or simply reach out to andrianos.michail@cl.uzh.ch.

---

## Reproducing the Experiments

To fully replicate our results, please follow the instructions in our replication README (`REPL_README.md`).

## Citation

If you use these resources, please cite our preprint:

```bibtex
@misc{schuhmacher2026informationrepresentationfairnesslongdocument,
      title={Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias},
      author={Elias Schuhmacher and Andrianos Michail and Juri Opitz and Rico Sennrich and Simon Clematide},
      year={2026},
      eprint={2601.16934},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.16934},
}
```

## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2026 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---
