https://github.com/opencitations/oc_sparql
This repository contains the SPARQL service for OpenCitations
- Host: GitHub
- URL: https://github.com/opencitations/oc_sparql
- Owner: opencitations
- Created: 2025-02-07T09:59:14.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-11T14:52:18.000Z (6 months ago)
- Last Synced: 2025-12-12T18:43:54.549Z (6 months ago)
- Language: JavaScript
- Size: 6.34 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OpenCitations SPARQL Service
This repository contains the SPARQL service for OpenCitations, allowing users to query the OpenCitations datasets using SPARQL.
## Overview
The service provides two main SPARQL endpoints:
- **Index endpoint** (`/index`): For querying the OpenCitations Index database
- **Meta endpoint** (`/meta`): For querying the OpenCitations Meta database
## Features
- SPARQL query interface powered by YASQE/YASR
- Support for both GET and POST SPARQL queries
- SPARQL Update queries are not permitted
- Request logging
- Docker deployment ready
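As a rough illustration of the GET and POST support listed above, the following sketch builds both request types with the Python standard library. The full endpoint URL (the `https` scheme plus the `/meta` path), the sample query, and the JSON `Accept` header are assumptions made for the example, not values confirmed by this repository.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed public URL: the BASE_URL shown in this README plus the /meta path.
META_ENDPOINT = "https://sparql.opencitations.net/meta"

# Illustrative read-only query; SPARQL Update is not permitted by the service.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_get_request(endpoint: str, query: str) -> Request:
    """Encode the query in the URL's `query` parameter."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    headers = {"Accept": "application/sparql-results+json"}
    return Request(url, headers=headers, method="GET")


def build_post_request(endpoint: str, query: str) -> Request:
    """Send the query as a URL-encoded form body."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")
```

Either request object can be passed to `urllib.request.urlopen` to run the query. POST is usually the safer choice for long queries, since it avoids URL length limits.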
## Configuration
### Environment Variables
The service reads the following environment variables. When set, they take precedence over the corresponding values defined in `conf.json`:
- `BASE_URL`: Base URL for the SPARQL endpoint
- `LOG_DIR`: Directory path where log files will be stored
- `SPARQL_ENDPOINT_INDEX`: URL for the index SPARQL endpoint
- `SPARQL_ENDPOINT_META`: URL for the meta SPARQL endpoint
- `SYNC_ENABLED`: Enable/disable static files synchronization (default: false)
For instance:
```env
BASE_URL=sparql.opencitations.net
LOG_DIR=/home/dir/log/
SPARQL_ENDPOINT_INDEX=http://qlever-service.default.svc.cluster.local:7011
SPARQL_ENDPOINT_META=http://virtuoso-service.default.svc.cluster.local:8890/sparql
SYNC_ENABLED=true
```
> **Note**: When running with Docker, environment variables always override the corresponding values in `conf.json`. If an environment variable is not set, the application will fall back to the values defined in `conf.json`.
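The precedence rule described above can be sketched as follows. This is a minimal illustration, not the application's actual loader: the function names are hypothetical, and the assumption that `conf.json` stores each setting under the lowercase variable name is made only for the example.

```python
import json
import os
from typing import Mapping, Optional

ENV_NAMES = (
    "BASE_URL",
    "LOG_DIR",
    "SPARQL_ENDPOINT_INDEX",
    "SPARQL_ENDPOINT_META",
    "SYNC_ENABLED",
)


def load_conf(path: str = "conf.json") -> dict:
    """Read the JSON configuration file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_settings(conf: Mapping[str, str],
                     env: Optional[Mapping[str, str]] = None) -> dict:
    """Return each setting, preferring the environment over conf.json."""
    env = os.environ if env is None else env
    settings = {}
    for name in ENV_NAMES:
        # Hypothetical convention: conf.json uses the lowercase key.
        fallback = conf.get(name.lower())
        settings[name] = env.get(name, fallback)
    return settings
```

Calling `resolve_settings(load_conf())` would then yield the environment value where one is set and the `conf.json` value otherwise.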
### Static Files Synchronization
The application can synchronize static files from a GitHub repository. This configuration is managed in `conf.json`:
```json
{
  "oc_services_templates": "https://github.com/opencitations/oc_services_templates",
  "sync": {
    "folders": [
      "static",
      "html-template/common"
    ],
    "files": [
      "test.txt"
    ]
  }
}
```
- `oc_services_templates`: The GitHub repository URL to sync files from
- `sync.folders`: List of folders to synchronize
- `sync.files`: List of individual files to synchronize
When static sync is enabled (via `--sync-static` or `SYNC_ENABLED=true`), the application will:
1. Clone the specified repository
2. Copy the specified folders and files
3. Keep the local static files up to date
> **Note**: Make sure the specified folders and files exist in the source repository.
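The clone-and-copy steps above might look roughly like the sketch below. The function names are hypothetical and the repository's real sync code may differ; in particular, the shallow clone and the decision to report missing entries rather than fail are choices made for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, List


def clone_repo(repo_url: str, target: Path) -> None:
    """Shallow-clone the template repository (requires git on PATH)."""
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, str(target)],
        check=True,
    )


def copy_synced_paths(src_root: Path, dst_root: Path,
                      folders: Iterable[str],
                      files: Iterable[str]) -> List[str]:
    """Copy the listed folders and files; return entries missing in the source."""
    missing = []
    for folder in folders:
        src = src_root / folder
        if not src.is_dir():
            missing.append(folder)
            continue
        # dirs_exist_ok lets repeated syncs overwrite an existing local copy.
        shutil.copytree(src, dst_root / folder, dirs_exist_ok=True)
    for name in files:
        src = src_root / name
        if not src.is_file():
            missing.append(name)
            continue
        dst = dst_root / name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return missing
```

Returning the missing entries makes the situation in the note above visible: a folder or file listed in `conf.json` but absent from the source repository shows up in the result instead of being silently skipped.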
## Running Options
### Local Development
For local development and testing, the application uses the built-in web.py HTTP server.
The application supports the following command line arguments:
- `--sync-static`: Synchronize static files at startup and enable periodic sync (every 30 minutes)
- `--port PORT`: Specify the port to run the application on (default: 8080)
Examples:
```bash
# Run with default settings
python3 sparql_oc.py
# Run with static sync enabled
python3 sparql_oc.py --sync-static
# Run on custom port
python3 sparql_oc.py --port 8085
# Run with both options
python3 sparql_oc.py --sync-static --port 8085
```
The Docker container enables static sync by default by setting `SYNC_ENABLED=true`, which has the same effect as passing `--sync-static`.
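For reference, the two command line options could be declared with `argparse` along these lines. This is a sketch under the defaults stated above; the helper names and the `SYNC_INTERVAL_SECONDS` constant are hypothetical and not taken from `sparql_oc.py`.

```python
import argparse
from typing import Optional, Sequence

# 30 minutes, the periodic sync interval stated above.
SYNC_INTERVAL_SECONDS = 30 * 60


def build_parser() -> argparse.ArgumentParser:
    """Declare the --sync-static flag and the --port option."""
    parser = argparse.ArgumentParser(description="OpenCitations SPARQL service")
    parser.add_argument(
        "--sync-static",
        action="store_true",
        help="synchronize static files at startup and every 30 minutes",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8080,
        help="port to run the application on (default: 8080)",
    )
    return parser


def parse_cli(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the given arguments, or sys.argv when argv is None."""
    return build_parser().parse_args(argv)
```

With this declaration, `parse_cli(["--sync-static", "--port", "8085"])` yields a namespace whose `sync_static` is `True` and whose `port` is `8085`.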
### Production Deployment (Docker)
When running in Docker/Kubernetes, the application uses **Gunicorn** as the WSGI HTTP server for better performance and concurrency handling:
- **Server**: Gunicorn with gevent workers
- **Workers**: 2 concurrent worker processes
- **Worker Type**: gevent (async), so each worker can serve many requests concurrently
- **Timeout**: 1200 seconds (to handle long-running SPARQL queries)
- **Connections per worker**: 800 simultaneous connections
The Docker container automatically uses Gunicorn and is configured with static sync enabled by default.
> **Note**: The application code automatically detects the execution environment. When run with `python3 sparql_oc.py`, it uses the built-in web.py server. When run with Gunicorn (as in Docker), it uses the WSGI interface.
You can customize the Gunicorn server configuration by modifying the `gunicorn.conf.py` file.
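A `gunicorn.conf.py` matching the values listed above might look like the following. This is an illustrative reconstruction, not a copy of the repository's file; the `bind` address is an assumption based on port 8080 being the one the container exposes.

```python
# gunicorn.conf.py (illustrative; the repository's actual file may differ)

bind = "0.0.0.0:8080"      # assumption: listen on the exposed port 8080
workers = 2                # concurrent worker processes
worker_class = "gevent"    # async workers; requires the gevent package
worker_connections = 800   # simultaneous connections per worker
timeout = 1200             # seconds, to allow long-running SPARQL queries
```

With these settings the service can hold up to 2 × 800 = 1600 simultaneous connections. Raising `workers` or `worker_connections` increases that ceiling at the cost of more memory and more load on the backing triplestores.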
### Dockerfile
You can change these variables in the Dockerfile:
```dockerfile
# Base image: Python slim for a lightweight container
FROM python:3.11-slim
# Define environment variables with default values
# These can be overridden during container runtime
ENV BASE_URL="sparql.opencitations.net" \
    LOG_DIR="/mnt/log_dir/oc_sparql" \
    SPARQL_ENDPOINT_INDEX="http://qlever-service.default.svc.cluster.local:7011" \
    SPARQL_ENDPOINT_META="http://virtuoso-service.default.svc.cluster.local:8890/sparql" \
    SYNC_ENABLED="true"
# Ensure Python output is unbuffered
ENV PYTHONUNBUFFERED=1
# Install system dependencies required for Python package compilation
RUN apt-get update && \
    apt-get install -y \
    git \
    python3-dev \
    build-essential
# Set the working directory for our application
WORKDIR /website
# Clone only the main branch from the repository
# The dot at the end means clone into the current directory
RUN git clone --single-branch --branch main https://github.com/opencitations/oc_sparql .
# Install Python dependencies from requirements.txt
RUN pip install -r requirements.txt
# Expose the port that our service will listen on
EXPOSE 8080
# Start the application with gunicorn for production
CMD ["gunicorn", "-c", "gunicorn.conf.py", "sparql_oc:application"]