https://github.com/eellak/glossapi

Greek text data: datasets in the Greek language

Host: GitHub
URL: https://github.com/eellak/glossapi
Owner: eellak
Created: 2023-06-28T08:51:18.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2025-03-27T12:22:09.000Z (10 months ago)
Last Synced: 2025-03-29T16:01:51.369Z (10 months ago)
Topics: dataset, llm, opensource
Language: Python
Homepage: https://huggingface.co/glossAPI
Size: 161 MB
Stars: 100
Watchers: 20
Forks: 14
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# GlossAPI

[![PyPI Status](https://img.shields.io/pypi/v/glossapi?logo=pypi)](https://pypi.org/project/glossapi/)

A library for processing texts in Greek and other languages, developed by the [Open Technologies Alliance (GFOSS)](https://gfoss.eu/).

## Features

- **PDF Processing**: Extract text content from academic PDFs with structure preservation

- **Quality Control**: Filter and cluster documents based on extraction quality

- **Section Extraction**: Identify and extract academic sections from documents

- **Section Classification**: Classify sections using machine learning models

- **Greek Language Support**: Specialized processing for Greek academic texts

- **Metadata Handling**: Process academic texts with accompanying metadata

- **Customizable Annotation**: Map section titles to standardized categories
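
As an illustration of the annotation idea, the sketch below maps raw labels to a small set of standardized categories with a plain dictionary lookup. It is not GlossAPI's implementation: `standardize_label`, `ANNOTATION_MAPPING`, and the fallback category `'text'` are assumptions made for this example. Only the `'Κεφάλαιο'` to `'chapter'` pair appears in this README.

```python
# Hypothetical mapping from raw labels to standardized categories.
ANNOTATION_MAPPING = {
    "Κεφάλαιο": "chapter",
}


def standardize_label(label, mapping=ANNOTATION_MAPPING, default="text"):
    """Return the standardized category for `label`, or `default` if unmapped."""
    # Strip surrounding whitespace so " Κεφάλαιο " matches the same key.
    return mapping.get(label.strip(), default)


print(standardize_label("Κεφάλαιο"))  # chapter
print(standardize_label("Άρθρο"))     # text (not in the mapping, so the default applies)
```

Keeping the mapping as data rather than code means new labels can be added without touching the lookup logic.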

## Installation

```bash
pip install glossapi
```

## Usage

The recommended way to use GlossAPI is through the `Corpus` class, which provides a complete pipeline for processing academic documents:

```python
import logging

from glossapi import Corpus

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    # metadata_path="/path/to/metadata.parquet",  # Optional
    # annotation_mapping={
    #     # Key: a label from the document_type column.
    #     # Value: the text type to annotate ('chapter' or 'text' for now).
    #     'Κεφάλαιο': 'chapter',
    #     # Add more mappings as needed
    # }
)

# Step 1: Extract documents (with quality control)
corpus.extract()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
```
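
Because the three steps run in a fixed order, it can be convenient to wrap them in a small helper that logs progress and stops at the first failure. The following is a minimal sketch and not part of GlossAPI: `run_pipeline` and `PIPELINE_STEPS` are hypothetical names, and the only assumption about the corpus object is that it exposes the no-argument `extract()`, `section()`, and `annotate()` methods shown above.

```python
import logging

logger = logging.getLogger(__name__)

# Method names taken from the usage example above, in the order they run.
PIPELINE_STEPS = ("extract", "section", "annotate")


def run_pipeline(corpus, steps=PIPELINE_STEPS):
    """Call each step on `corpus` in order and return the completed step names.

    Hypothetical helper: it only assumes `corpus` has no-argument methods
    named after each entry in `steps`.
    """
    completed = []
    for name in steps:
        logger.info("Running step: %s", name)
        getattr(corpus, name)()  # an exception here aborts the remaining steps
        completed.append(name)
    return completed
```

With the `corpus` object created above, this would be called as `run_pipeline(corpus)`. Passing a shorter `steps` tuple, such as `("extract",)`, runs only part of the pipeline.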

## License

This project is licensed under the [European Union Public Licence 1.2 (EUPL 1.2)](https://interoperable-europe.ec.europa.eu/collection/eupl/eupl-text-eupl-12).
