https://github.com/ozefe/yoktez

Typed Python client for searching, fetching metadata, and downloading theses from the National Thesis Center of Turkey (YÖK Ulusal Tez Merkezi)

academic-project api-client api-wrapper httpx-client thesis ulusal-tez-merkezi web-scraping

# yoktez

Typed Python client for the [National Thesis Center of Turkey](https://tez.yok.gov.tr/UlusalTezMerkezi/).

`yoktez` wraps the YOK NTC JSP/AJAX surface behind a single synchronous `Client` with frozen-dataclass return types, a deterministic exception hierarchy, and bilingual-aware fields. It is built for application and CLI developers who need a typed surface and a small install footprint without writing bespoke scraping code for each project.

## Installation

```bash
pip install yoktez
```

Requires Python 3.14+.

## Quickstart

```python
"""End-to-end yoktez quickstart: search -> metadata -> assets.

Demonstrates the typical three-call flow without writing files to disk.

Run with: `python examples/quickstart.py`
"""

from yoktez import AssetStatus, Client

_QUERY = "yapay zeka"

with Client() as client:
    results = client.search.simple(_QUERY)
    print(f"{results.total} matches for {_QUERY!r}")

    thesis = results[0]
    print(f" title: {thesis.title}")
    print(f" author: {thesis.author}")
    print(f" year: {thesis.year}")
    print(f" keys: {thesis.registration_no} / {thesis.thesis_no}")

    metadata = client.metadata.get(thesis)
    print(f" advisor: {metadata.supervisor}")
    if metadata.affiliation is not None:
        print(f" uni: {metadata.affiliation.university}")
    if metadata.keywords is not None:
        print(f" tags: {len(metadata.keywords)} keywords")

    assets = client.assets.get(thesis)
    print(f" status: {assets.status.name}")
    if assets.status is AssetStatus.AVAILABLE:
        print(f" pdf_key: {assets.pdf_key}")
```

Sample output:

```text
6841 matches for 'yapay zeka'
title: Kimya eğitiminde yapay zekâ araştırmalarına ilişkin bir meta-sentez çalışması
author: MURAT EBUBEKİR YAYLA
year: 2026
keys: nslbSyAODG1_FIruL8qUAA / THvIvDpZXvJIiHZpuqpKVw
advisor: PROF. DR. MUSA ÜCE
uni: MARMARA ÜNİVERSİTESİ
tags: 5 keywords
status: AVAILABLE
pdf_key: 5T1_CZ5-UGb9QCmoURec4AbpuuyvqUeed_1PcCh_6DVZ4b1fbX7Gcu-DQFLIcE11
```

## Features

- **Four search modes:** `simple`, `advanced`, `detail`, and `recent` from a single `client.search` namespace, all returning a sliceable `SearchResults` carrying the database-wide match total alongside the result window.
- **Structured metadata:** `client.metadata.get(thesis)` returns a typed `ThesisMetadata` with bilingual keywords (`Bilingual(raw, tr, en)`), a tiered `Affiliation`, and pre-formatted citation strings (APA / IEEE / MLA / Chicago / Harvard).
- **Two-step asset download:** `client.assets.get(thesis)` resolves to one of `AVAILABLE` / `UNDER_EMBARGO` / `NO_PERMIT` / `PREPARING` before any bytes move; the available branch exposes a `pdf_key` (and optional `appendix_key`) to feed `download_pdf` / `download_appendix`.
- **Catalog lookups:** `client.lookups` covers universities (TR / INT), institutes, divisions, subjects, departments, sections, and keywords, with per-instance memoization and an explicit `refresh()`.
- **Typed value objects:** every returned record is a `@dataclass(frozen=True, slots=True)`; values are immutable, hashable where field types allow, and ship with `py.typed` for downstream type checkers.
- **Sync-only, thread-friendly:** no `async`/`await` surface; the recommended concurrency pattern is one `Client` per thread.
- **Small dependency surface:** `httpx`, `beautifulsoup4`, and `lxml`. No Rust core, no auth, no hidden state.

## Usage

All snippets assume `with Client() as client:` for deterministic cleanup of the underlying HTTP connection pool.

### Search

Simple search by free text, optionally narrowed to a single field:

```python
from yoktez import Client, SearchField

with Client() as client:
    results = client.search.simple("yapay zeka", field=SearchField.ABSTRACT)

    print(f"{results.total} matches")
    for thesis in results[:5]:
        print(thesis.year, thesis.title)
```

Advanced search joins up to three terms with boolean operators:

```python
from yoktez import AdvancedOperator, Client, MatchType

with Client() as client:
    results = client.search.advanced(
        "sosyal",
        term2="medya",
        op1=AdvancedOperator.AND,
        match=MatchType.INCLUDES,
    )
```

Detail search accepts the full filter surface; enum-shaped parameters also accept the member name as a string or the raw int code:

```python
from yoktez import Client, ThesisType

with Client() as client:
    unis = client.lookups.universities()
    results = client.search.detail(
        university=unis[0],
        year_min=2020,
        year_max=2025,
        degree_type=ThesisType.MASTER,  # also accepts "MASTER" or 1
    )
```
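
The coerce-on-input behavior for enum-shaped parameters can be pictured with a small offline sketch. This is not the library's code: `DemoThesisType` and `coerce_code` are hypothetical names, and only the `MASTER` = `1` pairing is taken from the comment in the snippet above; the second member is an assumption added for illustration.

```python
from enum import IntEnum


class DemoThesisType(IntEnum):
    """Hypothetical enum; only the MASTER = 1 pairing comes from the README."""

    MASTER = 1
    DOCTORATE = 2  # assumed member and code, for illustration only


def coerce_code(value: "DemoThesisType | str | int") -> int:
    """Normalize an enum member, a member name, or a raw int to a wire code."""
    if isinstance(value, str):
        try:
            return int(DemoThesisType[value])
        except KeyError:
            raise ValueError(f"unknown member name: {value!r}") from None
    if isinstance(value, int):
        # Covers enum members (IntEnum subclasses int) and raw codes alike.
        # Unknown ints pass through, so codes added server-side still work.
        return int(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


codes = [coerce_code(v) for v in (DemoThesisType.MASTER, "MASTER", 1, 99)]
print(codes)
```

Passing unknown integers through unchanged is the part that lets a caller use a newly introduced server code before the enum catches up; a misspelled name still fails loudly.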

Recently added theses (server-fixed 15-day window):

```python
from yoktez import Client

with Client() as client:
    results = client.search.recent()
```

### Metadata

```python
from yoktez import Client

with Client() as client:
    thesis = client.search.simple("makine öğrenmesi")[0]
    metadata = client.metadata.get(thesis)

    if metadata.affiliation is not None:
        print(metadata.affiliation.university)
    if metadata.keywords:
        print(metadata.keywords[0].tr, "=", metadata.keywords[0].en)
    if metadata.references is not None:
        print(metadata.references.apa)
```

### Assets (two-step download)

```python
from yoktez import AssetStatus, Client

with Client() as client:
    thesis = client.search.simple("yapay zeka")[0]
    assets = client.assets.get(thesis)

    if assets.status is AssetStatus.AVAILABLE and assets.pdf_key is not None:
        client.assets.download_pdf(assets.pdf_key, "thesis.pdf")

    if assets.appendix_key is not None:
        client.assets.download_appendix(assets.appendix_key, "thesis-ek.rar")
```

`download_pdf` and `download_appendix` accept a filesystem path (`Path` or `str`, opened and closed for you) or a pre-opened binary file-like (written to but not closed — ownership stays with the caller).
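
The path-versus-stream ownership rule can be illustrated offline with a small sketch. `write_payload` is a hypothetical helper, not part of yoktez, and the byte string stands in for a downloaded PDF; the sketch only shows the contract described above, not how the library implements it.

```python
import io
import os
import tempfile
from pathlib import Path
from typing import BinaryIO


def write_payload(payload: bytes, dest: "str | os.PathLike[str] | BinaryIO") -> int:
    """Write to a path (opened and closed here) or to a caller-owned stream."""
    if isinstance(dest, (str, os.PathLike)):
        # Path-like destination: this function owns the handle end to end.
        with open(dest, "wb") as handle:
            return handle.write(payload)
    # File-like destination: write only; closing remains the caller's job.
    return dest.write(payload)


payload = b"%PDF-1.7 demo"

# Caller-owned in-memory stream: still open after the call.
buffer = io.BytesIO()
written = write_payload(payload, buffer)
still_open = not buffer.closed

# Path destination: the file is created, written, and closed for us.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "thesis.pdf"
    write_payload(payload, target)
    on_disk = target.read_bytes()

print(written, still_open, on_disk == payload)
```

Handing over an `io.BytesIO` in this way is useful when the bytes should go to an upload or a hash function rather than to disk.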

### Lookups

```python
from yoktez import Client, UniversitySource

with Client() as client:
    unis = client.lookups.universities(UniversitySource.TR)
    institutes = client.lookups.institutes(unis[0])
    divisions = client.lookups.divisions(unis[0], institutes[0])

    # Bulk catalogs; keywords() also accepts group / language / first_letter / search.
    keywords = client.lookups.all_keywords()
```

Every `client.lookups.*` call is memoized on the `Client` instance. Call `client.lookups.refresh()` to clear the cache if YOKSIS IDs are suspected to have rotated.
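
As a rough mental model of per-instance memoization with an explicit refresh, consider the sketch below. `LookupCache` and `fake_fetch` are hypothetical and say nothing about how yoktez stores its cache internally; they only show the observable behavior: a repeated call does not refetch until the cache is cleared.

```python
from collections.abc import Callable


class LookupCache:
    """Hypothetical per-instance memoizer; not yoktez's implementation."""

    def __init__(self, fetch: Callable[[str], list[str]]) -> None:
        self._fetch = fetch
        self._cache: dict[str, list[str]] = {}

    def get(self, catalog: str) -> list[str]:
        # Fetch once per catalog name, then serve the stored list.
        if catalog not in self._cache:
            self._cache[catalog] = self._fetch(catalog)
        return self._cache[catalog]

    def refresh(self) -> None:
        # Drop everything so the next get() goes back to the source.
        self._cache.clear()


calls: list[str] = []


def fake_fetch(catalog: str) -> list[str]:
    """Stand-in for a network request; records each invocation."""
    calls.append(catalog)
    return [f"{catalog}-1", f"{catalog}-2"]


lookups = LookupCache(fake_fetch)
lookups.get("universities")
lookups.get("universities")  # served from the cache, no second fetch
fetches_before_refresh = len(calls)

lookups.refresh()
lookups.get("universities")  # cache was cleared, so this fetches again
fetches_after_refresh = len(calls)

print(fetches_before_refresh, fetches_after_refresh)
```

Because the cache lives on the instance, two separate clients would each pay for their own first fetch.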

### HTTP client configuration

`Client` accepts keyword-only overrides for the underlying `httpx.Client`:

```python
from yoktez import Client

with Client(timeout=60, retries=5, user_agent="my-app/1.0") as client:
    ...
```

For full control, inject a pre-built `httpx.Client` via `http_client=`. Ownership stays with the caller; `Client.close()` is a no-op for an injected client:

```python
import httpx
from yoktez import Client

http = httpx.Client(timeout=30.0, follow_redirects=True)
try:
    with Client(http_client=http) as client:
        ...
finally:
    http.close()
```

## Concurrency

`yoktez.Client` is single-threaded by design: create one per thread and never share an instance across threads. The library ships no concurrency primitives; threading strategy is the caller's choice.

## Design principles

- **Synchronous-only API:** Sync is sufficient for YOK NTC's IO patterns; an async surface would double the API and complicate testing for no proven benefit. Concurrency strategy belongs to the caller, and `examples/multithreaded_pool.py` demonstrates the one-`Client`-per-thread pattern.
- **Frozen-dataclass value objects:** Every returned record is `@dataclass(frozen=True, slots=True)`: stdlib-only, immutable, hashable where field types allow, and light on memory because `slots=True` drops the per-instance `__dict__`.
- **Coerce-on-input enum handling:** Enum-shaped parameters accept the matching `Enum` member, its name (e.g., `"MASTER"`), or its raw int code; the raw-`int` passthrough tolerates new YOK NTC codes the library hasn't yet enumerated, so wire-side additions don't gate a release.
- **Two-step download flow:** `client.assets.get(...)` resolves status first; `download_pdf` and `download_appendix` run only on the available branch. Honest to the underlying YOK NTC flow, and lets callers inspect embargo dates and appendix availability before committing to a second request.
- **Hierarchical logger naming:** Every sub-package logs under its own `yoktez.`-prefixed logger (`yoktez.http`, `yoktez.search`, `yoktez.lookups`, `yoktez.assets`). Operators can silence the high-volume HTTP DEBUG channel while preserving the rarer parser WARNING channels; a single `logging.getLogger("yoktez").setLevel(...)` still catches every child through parent propagation.
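
The logger hierarchy can be tuned with the standard `logging` module alone. The sketch below uses only the logger names listed above; the chosen levels are an example configuration, not a recommendation from the library.

```python
import logging

# Verbose by default for everything under the "yoktez" namespace...
logging.getLogger("yoktez").setLevel(logging.DEBUG)
# ...except the high-volume HTTP channel, which is raised to WARNING.
logging.getLogger("yoktez.http").setLevel(logging.WARNING)

http_log = logging.getLogger("yoktez.http")
search_log = logging.getLogger("yoktez.search")

# Child loggers without their own level inherit the nearest ancestor's level.
http_debug = http_log.isEnabledFor(logging.DEBUG)
search_debug = search_log.isEnabledFor(logging.DEBUG)

print(http_debug, search_debug)
```

A handler attached to the root logger or to `yoktez` is still needed for records to be written anywhere; the levels above only decide which records get that far.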

## Limitations

`yoktez` is intentionally narrow. The following are out of scope and will not land in this package:

- **No async API:** Synchronous code throughout; no `async def`, no asyncio surface.
- **No multi-threaded helper functions:** Concurrency strategy is the caller's choice.
- **No authentication or login flows (e-Devlet):** Anonymous public-data access only; features requiring login (favorites, history) are excluded.
- **No bypassing access restrictions:** Embargoed and no-permit theses surface their state via `AssetStatus` and the matching exception types; the library does not attempt to circumvent these.
- **No data hosting or mirroring:** The library fetches on demand; no bundled snapshots of the YOK NTC database.
- **No CLI shipped from this package:** A separate package may add one later — out of scope here.

## License

MIT — see [`LICENSE`](LICENSE).