{"id":38544670,"url":"https://github.com/impresso/impresso-text-embedder","last_synced_at":"2026-01-17T07:17:12.164Z","repository":{"id":255272090,"uuid":"849046800","full_name":"impresso/impresso-text-embedder","owner":"impresso","description":"multilingual text vectorizer for semantic search and comparison","archived":false,"fork":false,"pushed_at":"2025-11-27T12:52:45.000Z","size":270,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-11-29T12:57:30.193Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/impresso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-08-28T22:02:54.000Z","updated_at":"2025-11-18T15:56:08.000Z","dependencies_parsed_at":"2024-08-29T01:00:17.246Z","dependency_job_id":"8e27bd51-8c4b-4c77-aa66-993fbc02ab7e","html_url":"https://github.com/impresso/impresso-text-embedder","commit_stats":null,"previous_names":["impresso/impresso-text-embedder"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/impresso/impresso-text-embedder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impresso%2Fimpresso-text-embedder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impresso%2Fimpresso-text-embedder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
impresso%2Fimpresso-text-embedder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impresso%2Fimpresso-text-embedder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/impresso","download_url":"https://codeload.github.com/impresso/impresso-text-embedder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impresso%2Fimpresso-text-embedder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28503381,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-17T07:17:12.080Z","updated_at":"2026-01-17T07:17:12.134Z","avatar_url":"https://github.com/impresso.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Impresso Multilingual Text Embedder\n\nThis repository offers tools for embedding texts in multiple languages with an efficient workflow. It uses the `transformers` library by Hugging Face and the `make` tool to manage large datasets. 
`Make` ensures robust and incremental processing, allowing you to handle new data, resume tasks, and run processes across different machines, all while avoiding redundant work.\n\n## Features\n\n- **Efficient Storage Management:** Minimal local storage is required because the data needed for each year of a newspaper is downloaded on the fly and truncated after uploading.\n- **Parallel Processing:** Processes run in parallel to optimize throughput.\n- **Selective Processing:** Only the necessary processing steps are executed, ensuring efficiency by not reprocessing existing outputs on S3.\n- **S3 Integration:** Integration with S3 for storing and resuming processing. The\n  system ensures no overwriting of files or partial uploads due to interruptions. It is\n  also possible to run everything locally without S3.\n- **Custom Embedding Options:** Flexible configurations via normal environment variables or make variables, including the ability to specify model versions and filter text data.\n\n### Missing Features\n\n- Batch processing of texts is not yet implemented. This will be added in a future\n  release.\n- Installing a specialized xformer implementation for sparse attention inference is not yet\n  implemented. This should be added in a future release for faster inference.\n\n## Concepts\n\n### Storage Locations\n\n- **Local Storage:** Temporary disk space used for processing tasks. This disk space can\n  be fast storage.\n- **S3 Storage:** Permanent storage where final results are stored. Processing can be resumed from this storage.\n\n### File Stamps\n\nTo manage dependencies, local file stamps are used:\n\n- **Input Stamps (`.stamp`):** Indicate the status of input files on S3.\n- **Output Stamps (no extension or `.done`):** Indicate the completion status of output\n  files on S3. Make uses these to determine if a file needs to be processed or not.\n\nLocal stamps help `make` determine which files need to be processed or skipped. 
The helper script `lib/sync_s3_filestamps.py` manages these stamps by syncing them with S3.\n\n### File Organization\n\nThe processing follows a structured organization:\n\n- **S3 Buckets:** Data is organized by processing steps and, in some cases, versions.\n- **Build Directory:** A local mirror of the S3 storage, structured similarly for consistency.\n\n```plaintext\n# Example directory structure\nBUILD_DIR/BUCKET/NEWSPAPER/\u003cNEWSPAPER-YEAR\u003e.jsonl.bz2\nBUILD_DIR/BUCKET/PROCESSING_TYPE/VERSION/NEWSPAPER/\u003cNEWSPAPER-YEAR\u003e.jsonl.bz2\n```\n\n## Setup\n\n1. **Clone the Repository:**\n\n   ```bash\n   git clone git@github.com:impresso/impresso-text-embedder.git\n   cd impresso-text-embedder\n   ```\n\n2. **Configure S3 Credentials:**\n   Copy the `dotenv.sample` file to `.env`. Modify the `.env` file to include your AWS credentials:\n\n   ```plaintext\n   SE_ACCESS_KEY=\u003cyour-access-key\u003e\n   SE_SECRET_KEY=\u003cyour-secret-key\u003e\n   SE_HOST_URL=\u003cyour host name\u003e\n   ```\n\n3. **Install Dependencies:**\n   Ensure `make`, `python3`, and `pip3` are installed. For GPU support in PyTorch, go to\n   https://pytorch.org/get-started/locally/ and get the installation command for your system. Run:\n\n   ```bash\n   pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124\n   pip3 install -r requirements.txt\n\n   # or pipenv install\n   ```\n\n4. **Set Up Environment and Make Variables:**\n\n   ```bash\n   cp dotenv.sample .env  # edit .env with your S3 credentials\n   cp local.config.sample.mk local.config.mk  # edit local.config.mk with your local settings\n   ```\n\n5. **Set Up Directories and Model:**\n   Create the necessary directories, generate the list of newspapers to process, and download the Hugging Face model:\n   ```bash\n   make setup\n   ```\n\n## Usage\n\n### Makefile Targets\n\n```bash\nmake help\n```\n\n### Running the Embedder\n\n1. 
**Process a single newspaper:**\n   You can specify a list of newspapers to process using the `NEWSPAPER_LIST_FILE` variable. The default list is generated automatically from the S3 bucket:\n\n   ```bash\n   make newspaper\n   ```\n\n2. **Parallel Processing of each newspaper:**\n   To process newspapers in parallel, use:\n\n   ```bash\n   make each\n   ```\n\n## Data flow overview:\n\n```mermaid\nflowchart LR\n    %% Nodes\n    %% External Entities\n\n\n\n    subgraph cluster_local [\"Local Machine\"]\n        style cluster_local fill:#FFF3E0,stroke:#FF6F00,stroke-width:1px\n\n        %% Processes\n        F{{\"Text Embedding Processor\"}}\n\n        %% Data Stores\n        B[(\"Local Rebuilt Data\")]\n        E[(\"Text Embedding Model\")]\n        C[(\"Processed Output Data\")]\n\n\n    end\n    subgraph cluster_s3 [\"S3 Storage\"]\n        style cluster_s3 fill:#E0F7FA,stroke:#0097A7,stroke-width:1px\n        A[/\"Rebuilt Data\"/]\n        D[/\"Processed Data\"/]\n\n\n                %% Data Flows\n        A --\u003e|Sync Input| B\n        B --\u003e|Data| F\n        E --\u003e|Model| F\n        F --\u003e|Output| C\n        C --\u003e|Upload Output| D\n        D --\u003e|Sync Output| C\n    end\n\n```\n\n## About\n\n### Impresso\n\n[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an\ninterdisciplinary research project that aims to develop and consolidate tools for\nprocessing and exploring large collections of media archives across modalities, time,\nlanguages and national borders. The first project (2017-2021) was funded by the Swiss\nNational Science Foundation under grant\nNo. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)\nby the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)\nand the Luxembourg National Research Fund under grant No. 17498891.\n\n### Copyrights\n\nCopyright (C) 2018-2024 The Impresso team.  
\nContributors to this program include: [Simon Clematide](https://github.com/simon-clematide)\n\n### License\n\nThis program is provided as open source under\nthe [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)\nv3 or later.\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true\" width=\"350\" alt=\"Impresso Project Logo\"/\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimpresso%2Fimpresso-text-embedder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimpresso%2Fimpresso-text-embedder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimpresso%2Fimpresso-text-embedder/lists"}