https://github.com/os-climate/osc-transformer-based-extractor
Data Extraction: Transformer Based Extractor Tool
- Host: GitHub
- URL: https://github.com/os-climate/osc-transformer-based-extractor
- Owner: os-climate
- License: apache-2.0
- Created: 2023-12-22T13:26:05.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-06T13:56:12.000Z (8 days ago)
- Last Synced: 2024-11-06T14:44:12.778Z (8 days ago)
- Language: Python
- Size: 14.4 MB
- Stars: 0
- Watchers: 15
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE.txt
- Authors: AUTHORS.rst

#############################################
OSC Transformer Based Extractor
#############################################

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

***********************************
OS-Climate Data Extraction Tool
***********************************

This project provides a CLI tool and Python scripts to train Transformer models (via Hugging Face) for two primary tasks:

1. **Relevance Detection**: Determines if a question-context pair is relevant.
2. **KPI Detection**: Fine-tunes models to extract key performance indicators (KPIs) from datasets like annual reports and perform inference.

Quick Start
^^^^^^^^^^^^^

To install the tool, use pip:

.. code-block:: shell

   $ pip install osc-transformer-based-extractor

After installation, you can access the CLI tool with:

.. code-block:: shell

   $ osc-transformer-based-extractor

This command will show the available commands and help via Typer, our CLI library.
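
For scripted workflows, the same help output can be requested from Python. The sketch below is illustrative and not part of the tool: it assumes the ``osc-transformer-based-extractor`` executable is on ``PATH`` after installation and relies on the ``--help`` option that Typer adds to every command. The helper names ``help_command`` and ``show_help`` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional

# Executable name as installed by the pip command in this README.
CLI_NAME = "osc-transformer-based-extractor"


def help_command(*subcommands: str) -> List[str]:
    """Build the argument list for `<cli> [subcommands...] --help`."""
    return [CLI_NAME, *subcommands, "--help"]


def show_help(*subcommands: str) -> Optional[str]:
    """Return the CLI help text, or None if the tool is not on PATH."""
    if shutil.which(CLI_NAME) is None:
        return None
    result = subprocess.run(
        help_command(*subcommands),
        capture_output=True,
        text=True,
        check=False,  # do not raise here; the caller can inspect the text
    )
    return result.stdout


if __name__ == "__main__":
    text = show_help()
    print(text if text is not None else f"{CLI_NAME} not found on PATH")
```

Passing a sub-command name, for example ``show_help("relevance-detector")``, would request the help for that command group instead of the top-level help.
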

Commands and Workflow
^^^^^^^^^^^^^^^^^^^^^^^

1. Relevance Detection
--------------------------

**Fine-tuning the Model:**

Assume your project structure looks like this:

.. code-block:: text

   project/
   │
   ├── kpi_mapping.csv
   ├── training_data.csv
   ├── data/
   │   └── (JSON files for inference)
   ├── model/
   │   └── (Model-related files)
   ├── saved__model/
   │   └── (Output from training)
   ├── output/
   │   └── (Results from inference)

Use the following command to fine-tune the model:

.. code-block:: shell

   $ osc-transformer-based-extractor relevance-detector fine-tune \
     --data_path "project/training_data.csv" \
     --model_name "bert-base-uncased" \
     --num_labels 2 \
     --max_length 128 \
     --epochs 3 \
     --batch_size 16 \
     --output_dir "project/saved__model/" \
     --save_steps 500

**Running Inference:**

.. code-block:: shell

   $ osc-transformer-based-extractor relevance-detector perform-inference \
     --folder_path "project/data/" \
     --kpi_mapping_path "project/kpi_mapping.csv" \
     --output_path "project/output/" \
     --model_path "project/model/" \
     --tokenizer_path "project/model/" \
     --threshold 0.5

2. KPI Detection
---------------------

The KPI detection functionality includes **fine-tuning** and **inference**.

**Fine-tuning the KPI Model:**

Assume your project structure looks like this:

.. code-block:: text

   project/
   │
   ├── kpi_mapping.csv
   ├── training_data.csv
   │
   ├── model/
   │   └── (model-related files, e.g., tokenizer, config, checkpoints)
   │
   ├── saved__model/
   │   └── (Folder to store output from fine-tuning)
   │
   ├── output/
   │   └── (output files, e.g., inference_results.xlsx)

.. code-block:: shell

   $ osc-transformer-based-extractor kpi-detection fine-tune \
     --data_path "project/training_data.csv" \
     --model_name "bert-base-uncased" \
     --max_length 128 \
     --epochs 3 \
     --batch_size 16 \
     --learning_rate 5e-5 \
     --output_dir "project/saved__model/" \
     --save_steps 500

**Performing Inference:**

.. code-block:: shell

   $ osc-transformer-based-extractor kpi-detection inference \
     --data_file_path "project/data/input_dataset.csv" \
     --output_path "project/output/inference_results.xlsx" \
     --model_path "project/model/"

Training Data Requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Relevance Detection Training File:

The training file should have the following columns:

- ``Question``
- ``Context``
- ``Label``

Example:

.. list-table:: Training Data Example
   :header-rows: 1

   * - Question
     - Context
     - Label
   * - What is the company name?
     - The Company is exposed to a risk...
     - 0

2. KPI Detection Training File:

For KPI detection, the dataset should have these additional columns:

.. list-table:: KPI Detection Training Example
   :header-rows: 1

   * - Question
     - Context
     - Label
     - Company
     - Source File
     - KPI ID
     - Year
     - Answer
     - Data Type
   * - What is the company name?
     - ...
     - 0
     - NOVATEK
     - 04_NOVATEK_AR_2016_ENG_11.pdf
     - 0
     - 2016
     - PAO NOVATEK
     - TEXT

3. KPI Mapping File:

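As a rough sketch of how a mapping file with these columns could be read, the snippet below parses a CSV using only the standard library. It is not the tool's own loader: the column names come from the KPI mapping example in this section, while the comma-separated ``sectors`` format, the ``TRUE``/``FALSE`` spelling of ``add_year``, and the helper name ``load_kpi_mapping`` are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List

# Column names taken from the KPI mapping example in this README.
EXPECTED_COLUMNS = ["kpi_id", "question", "sectors", "add_year", "kpi_category"]


def load_kpi_mapping(csv_file) -> List[Dict[str, object]]:
    """Parse an open CSV file into one dict per KPI (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        rows.append(
            {
                "kpi_id": row["kpi_id"].strip(),
                "question": row["question"].strip(),
                # Assumption: sectors are comma-separated, e.g. "OG, CM, CU".
                "sectors": [s.strip() for s in row["sectors"].split(",") if s.strip()],
                # Assumption: add_year is spelled TRUE or FALSE.
                "add_year": row["add_year"].strip().upper() == "TRUE",
                "kpi_category": row["kpi_category"].strip(),
            }
        )
    return rows


# In-memory sample mirroring the README's example row.
SAMPLE = io.StringIO(
    "kpi_id,question,sectors,add_year,kpi_category\n"
    '1,In which year was the annual report...,"OG, CM, CU",FALSE,TEXT\n'
)

if __name__ == "__main__":
    for entry in load_kpi_mapping(SAMPLE):
        print(entry)
```

A real mapping file would be opened with ``open(path, newline="")`` and passed to the same function; failing early on missing columns is a design choice here so that a malformed file is caught before inference starts.
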
.. list-table:: KPI Mapping File Example
   :header-rows: 1

   * - kpi_id
     - question
     - sectors
     - add_year
     - kpi_category
   * - 1
     - In which year was the annual report...
     - OG, CM, CU
     - FALSE
     - TEXT

Developer Notes
^^^^^^^^^^^^^^^^^

Local Development
----------------------

Clone the repository:

.. code-block:: shell

   $ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We use **pdm** for package management and **tox** for testing.

1. Install ``pdm``:

   .. code-block:: shell

      $ pip install pdm

2. Sync dependencies:

   .. code-block:: shell

      $ pdm sync

3. Add new packages (e.g., numpy):

   .. code-block:: shell

      $ pdm add numpy

4. Run ``tox`` for linting and testing:

   .. code-block:: shell

      $ pip install tox
      $ tox -e lint
      $ tox -e test

Contributing
^^^^^^^^^^^^^^

We welcome contributions! Please fork the repository and submit a pull request.
Ensure you sign off each commit with the **Developer Certificate of Origin (DCO)**.
Read more: http://developercertificate.org/.

Governance Transition
^^^^^^^^^^^^^^^^^^^^^^^^

On June 26, 2024, the **Linux Foundation** announced the merger of **FINOS** with OS-Climate.
Projects are now transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_.

Shields
^^^^^^^^^

|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
   :alt: An OS-Climate Project
   :target: https://os-climate.org/

.. |osc-climate-slack| image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
   :alt: Join OS-Climate on Slack
   :target: https://os-climate.slack.com

.. |osc-climate-github| image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
   :alt: Source code on GitHub
   :target: https://github.com/ModeSevenIndustrialSolutions/osc-data-extractor

.. |pypi| image:: https://img.shields.io/pypi/v/osc-data-extractor.svg
   :alt: PyPI package
   :target: https://pypi.org/project/osc-data-extractor/

.. |build-status| image:: https://api.cirrus-ci.com/github/os-climate/osc-data-extractor.svg?branch=main
   :alt: Build Status
   :target: https://cirrus-ci.com/github/os-climate/osc-data-extractor

.. |pdm| image:: https://img.shields.io/badge/PDM-Project-purple
   :alt: Built using PDM
   :target: https://pdm-project.org/latest/

.. |PyScaffold| image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
   :alt: Project generated with PyScaffold
   :target: https://pyscaffold.org/