https://github.com/eellak/glossapi
Greek textual data - Datasets in the Greek language
dataset llm opensource
- Host: GitHub
- URL: https://github.com/eellak/glossapi
- Owner: eellak
- Created: 2023-06-28T08:51:18.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-03-27T12:22:09.000Z (10 months ago)
- Last Synced: 2025-03-29T16:01:51.369Z (10 months ago)
- Topics: dataset, llm, opensource
- Language: Python
- Homepage: https://huggingface.co/glossAPI
- Size: 161 MB
- Stars: 100
- Watchers: 20
- Forks: 14
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# GlossAPI
[glossapi on PyPI](https://pypi.org/project/glossapi/)
A library for processing texts in Greek and other languages, developed by [Open Technologies Alliance (GFOSS)](https://gfoss.eu/).
## Features
- **PDF Processing**: Extract text content from academic PDFs with structure preservation
- **Quality Control**: Filter and cluster documents based on extraction quality
- **Section Extraction**: Identify and extract academic sections from documents
- **Section Classification**: Classify sections using machine learning models
- **Greek Language Support**: Specialized processing for Greek academic texts
- **Metadata Handling**: Process academic texts with accompanying metadata
- **Customizable Annotation**: Map section titles to standardized categories
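As an illustration of the last feature, mapping section titles to standardized categories can be pictured as a dictionary lookup on a normalized title. The sketch below is not GlossAPI's implementation: the helper names `normalize_title` and `categorize`, the accent- and case-insensitive matching, and the `"text"` fallback are all assumptions made for the example.

```python
import unicodedata

# Hypothetical mapping from section-title labels to standardized categories.
ANNOTATION_MAPPING = {
    "Κεφάλαιο": "chapter",  # Greek for "chapter"
}


def normalize_title(title: str) -> str:
    """Strip whitespace and accents, then casefold, so 'ΚΕΦΑΛΑΙΟ' matches 'Κεφάλαιο'."""
    decomposed = unicodedata.normalize("NFD", title.strip())
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def categorize(title: str, mapping=ANNOTATION_MAPPING, default: str = "text") -> str:
    """Return the category for a title, or `default` when no label matches."""
    normalized = {normalize_title(label): cat for label, cat in mapping.items()}
    return normalized.get(normalize_title(title), default)
```

Normalizing both the labels and the incoming title keeps the mapping itself short: one entry covers capitalized, upper-case, and unaccented spellings of the same heading.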
## Installation
```bash
pip install glossapi
```
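To confirm that the installation succeeded, the installed version can be read from the package metadata with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, not part of GlossAPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version("glossapi") or "glossapi is not installed")
```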
## Usage
The recommended way to use GlossAPI is through the `Corpus` class, which provides a complete pipeline for processing academic documents:
```python
from glossapi import Corpus
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    # metadata_path="/path/to/metadata.parquet",  # Optional
    # annotation_mapping={
    #     # Key: a label from the document_type column;
    #     # value: the text type to annotate it as ('chapter' or 'text' for now)
    #     'Κεφάλαιο': 'chapter',  # 'Κεφάλαιο' is Greek for 'chapter'
    #     # Add more mappings as needed
    # }
)

# Step 1: Extract documents (with quality control)
corpus.extract()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
```
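Because each step works on the output of the previous one, it can help to run the three steps through a small wrapper that logs timings and stops at the first failure. The sketch below is an assumption-laden convenience, not part of GlossAPI: `run_pipeline` is a hypothetical helper, and it assumes only that the object exposes the `extract()`, `section()`, and `annotate()` methods shown above and that they can be called without arguments.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Stage method names taken from the usage example above, in pipeline order.
STAGES = ("extract", "section", "annotate")


def run_pipeline(corpus, stages=STAGES):
    """Call each stage method on `corpus` in order.

    Returns a dict mapping stage name to elapsed seconds. If a stage raises,
    the error is logged and re-raised, so later stages never run on
    incomplete output.
    """
    timings = {}
    for name in stages:
        start = time.perf_counter()
        try:
            getattr(corpus, name)()
        except Exception:
            logger.exception("Stage %r failed; remaining stages skipped", name)
            raise
        timings[name] = time.perf_counter() - start
        logger.info("Stage %r finished in %.2f s", name, timings[name])
    return timings
```

With a `Corpus` built as in the example above, `run_pipeline(corpus)` replaces the three separate method calls.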
## License
This project is licensed under the [European Union Public Licence 1.2 (EUPL 1.2)](https://interoperable-europe.ec.europa.eu/collection/eupl/eupl-text-eupl-12).