{"id":35619763,"url":"https://github.com/scieloorg/usage","last_synced_at":"2026-05-05T00:02:28.681Z","repository":{"id":232569103,"uuid":"758690335","full_name":"scieloorg/usage","owner":"scieloorg","description":"This repository contains the code for the SciELO Usage application, which is a tool for managing and analyzing SciELO usage data.","archived":false,"fork":false,"pushed_at":"2024-06-13T00:43:24.000Z","size":10186,"stargazers_count":1,"open_issues_count":10,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-06-13T23:29:27.918Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scieloorg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-16T21:17:10.000Z","updated_at":"2024-06-12T16:51:31.000Z","dependencies_parsed_at":"2024-06-01T21:14:09.540Z","dependency_job_id":"754268c9-882a-4178-9ec2-a8391b25379f","html_url":"https://github.com/scieloorg/usage","commit_stats":null,"previous_names":["scieloorg/usage"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/scieloorg/usage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fusage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fusage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fusage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fusage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scieloorg","download_url":"https://codeload.github.com/scieloorg/usage/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scieloorg%2Fusage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28214414,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2026-01-05T02:00:06.358Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-05T06:02:39.642Z","updated_at":"2026-05-05T00:02:28.674Z","avatar_url":"https://github.com/scieloorg.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SciELO Usage Metrics Pipeline\n\nA modernized platform for processing and indexing SciELO usage logs into OpenSearch, adhering to COUNTER R5.1 standards.\n\n## Quick Start (Dev Installation)\n\nTo build and run the application locally:\n\n1. `make build compose=local.yml`\n2. `make django_migrate`\n3. `make django_createsuperuser`\n4. `make up`\n\nThe application will be accessible at [http://localhost:8009/admin](http://localhost:8009/admin).\n\n---\n\n## Key Commands\n\nAll commands run inside Docker via the `local.yml` compose file unless noted.\n\n```bash\nmake build                           # build images\nmake up                              # start all services (django, postgres, redis, celery worker+beat, mailhog)\nmake django_shell                    # Django shell via docker compose\nmake django_test                     # run full test suite (pytest)\nmake django_fast                     # tests with --failfast\nmake django_migrate                  # apply migrations\nmake django_makemigrations           # generate new migrations\nmake django_createsuperuser          # create Wagtail admin user\nmake logs                            # follow all service logs\nmake ps                              # list compose services\nmake django_bash                     # open a bash shell in the django container\nmake django_compilemessages          # compile translation files\n```\n\n**Run a single test file/path:**\n```bash\ndocker compose -f local.yml run --rm django pytest path/to/test_file.py\n```\n\n## Architecture \u0026 Data Pipeline\n\n### Apps\n\n| App | Purpose |\n|---|---|\n| `log_manager` | Log file discovery, validation, and status tracking |\n| `log_manager_config` | Collection-specific configuration (paths, emails, expected logs/day) |\n| `metrics` | Daily metric jobs, OpenSearch export, COUNTER R5.1 aggregation |\n| `document` | Unified metadata model for articles, books, chapters, datasets, and preprints |\n| `source` | Journal, book, preprint server, and data repository metadata |\n| `reports` | Weekly, monthly, and yearly log processing reports |\n| `resources` | Robot user-agent patterns and GeoIP MMDB management |\n| `tracker` | Discarded line tracking and error logging |\n| `core` | Wagtail pages, users, shared utilities, and external API collectors |\n| `collection` | SciELO collection management |\n\n### Core Collectors (`core/collectors/`)\n\n| Collector | Source |\n|---|---|\n| `articlemeta.py` | ArticleMeta REST/Thrift API |\n| `opac.py` | SciELO OPAC endpoint |\n| `preprints.py` | SciELO Preprints OAI-PMH |\n| `dataverse.py` | SciELO Data (Dataverse) |\n| `scielo_books.py` | SciELO Books CouchDB changes feed |\n\n### Log Ingestion Pipeline\n\nThe ingestion is fully automated via the **`[Log Pipeline] Daily Routine (Auto)`** task. It follows a strictly ordered sequence using Celery Chords:\n\n- **Search**: Scans configured directories for new `.log` or `.gz` files.\n- **Validate**: Performs statistical sampling to ensure log integrity and detect the usage date.\n- **Parse**: Extracts metrics using `scielo_usage_counter`, performs URL translation, and aggregates data.\n- **Export**: Pushes results to OpenSearch using idempotent upsert scripts.\n\n### Metadata Synchronization\n\nMetadata is kept in sync with SciELO sources (ArticleMeta, OPAC, Books, etc.) via the **`[Metadata] Daily Sync Routine (Auto)`** task, which runs parallel workers to ensure documents and sources are always up to date.\n\n## Supported Log Formats\n\n| Format | Description |\n|---|---|\n| NCSA Extended | Standard Apache combined log format with optional domain prefix and IP list fields. |\n| BunnyCDN | Pipe-delimited format with Unix timestamps (7 or 10 digits), country codes, and request IDs. |\n\n## Environment Variables\n\nRuntime configuration is loaded from `.envs/.local/` or `.envs/.production/` through the Compose files.\n\n### Core Services\n\n| Variable | Default | Description |\n|---|---|---|\n| `OPENSEARCH_URL` | `http://localhost:9200/` | OpenSearch cluster URL |\n| `OPENSEARCH_INDEX_NAME` | `usage` | OpenSearch index prefix |\n| `OPENSEARCH_BASIC_AUTH` | `admin:admin` | OpenSearch basic auth credentials |\n| `OPENSEARCH_VERIFY_CERTS` | `False` | Verify SSL certificates for OpenSearch connections |\n| `USE_LOCAL_SCIELO_LIBS` | `0` | Mount local `scielo_log_validator` and `scielo_usage_counter` repos for development |\n| `DJANGO_SETTINGS_MODULE` | `config.settings.local` | Django settings module |\n| `REDIS_URL` | — | Redis connection URL for Celery |\n\n### Collector Endpoints\n\n| Variable | Default | Description |\n|---|---|---|\n| `ARTICLEMETA_COLLECT_URL` | `http://articlemeta.scielo.org/api/v1/article/counter_dict` | ArticleMeta counter metadata endpoint |\n| `ARTICLEMETA_MAX_RETRIES` | `5` | ArticleMeta retry attempts |\n| `ARTICLEMETA_SLEEP_TIME` | `30` | Delay between ArticleMeta retries, in seconds |\n| `OPAC_ENDPOINT` | `https://www.scielo.br/api/v1/counter_dict` | OPAC counter metadata endpoint |\n| `OPAC_MAX_RETRIES` | `5` | OPAC retry attempts |\n| `OPAC_SLEEP_TIME` | `30` | Delay between OPAC retries, in seconds |\n| `OAI_PMH_PREPRINT_ENDPOINT` | `https://preprints.scielo.org/index.php/scielo/oai` | SciELO Preprints OAI-PMH endpoint |\n| `OAI_METADATA_PREFIX` | `oai_dc` | OAI-PMH metadata prefix |\n| `OAI_PMH_MAX_RETRIES` | `5` | OAI-PMH retry attempts |\n| `DATAVERSE_ENDPOINT` | `https://data.scielo.org/api` | SciELO Data Dataverse API endpoint |\n| `DATAVERSE_ROOT_COLLECTION` | `scielodata` | Dataverse root collection alias |\n| `DATAVERSE_SLEEP_TIME` | `30` | Dataverse request timeout/retry delay, in seconds |\n| `SCIELO_BOOKS_BASE_URL` | `http://localhost:5984` | SciELO Books CouchDB base URL |\n| `SCIELO_BOOKS_DB_NAME` | `scielobooks_1a` | SciELO Books CouchDB database name |\n| `SCIELO_BOOKS_TIMEOUT` | `60` | SciELO Books request timeout, in seconds |\n| `SCIELO_BOOKS_LIMIT` | `1000` | SciELO Books changes-feed page size |\n\n## OpenSearch Storage Strategy\n\nThe OpenSearch export keeps monthly usage documents with nested daily metrics, while index names depend on collection size:\n\n- **Large and xlarge collections**: annual indices, such as `usage_monthly_scl_2024` and `usage_yearly_scl_2024`.\n- **Small collections**: stable collection indices, such as `usage_monthly_books` and `usage_yearly_books`.\n- **One Document per Month**: Each document/PID has one monthly document per metric scope.\n- **Daily Nested Metrics**: Daily granularity is preserved inside each monthly document using a `daily_metrics` object.\n- **Atomic Upserts**: Data is merged using OpenSearch **Painless Scripts**, allowing multiple logs for the same day/month to be processed without data duplication or loss.\n\n## Management \u0026 Monitoring\n\nAll pipelines can be monitored through the **Wagtail Admin**:\n\n- **Log Manager**: Monitor the status of individual log files (`QUEUED`, `PARSING`, `PROCESSED`).\n- **Daily Metric Jobs**: Track the history of daily processing and OpenSearch export attempts.\n- **Log Config**: Manage collection-specific settings, log paths, and notification emails.\n\nInternally, log file statuses are stored as short codes such as `QUE`, `PAR`, and `PRO`, with labels displayed in the admin.\n\n### Useful Commands\n\n- `make django_shell`: Access the Django interactive shell.\n- `make django_bash`: Open a bash shell in the Django container.\n- `make logs`: Follow Docker Compose logs.\n- `make ps`: Show running services.\n- `docker compose -f local.yml run --rm django pytest path/to/test_file.py`: Run a single test file or path.\n- `docker logs -f scielo_usage_local_celeryworker`: Monitor real-time task execution.\n\n## Dependencies\n\n- [scielo_log_validator](https://github.com/scieloorg/scielo_log_validator) — log file validation\n- [scielo_usage_counter](https://github.com/scieloorg/scielo_usage_counter) — COUNTER R5.1 metrics extraction\n- [device_detector](https://github.com/thinkwelltwd/device_detector) — client name/version detection\n- [opensearch-py](https://github.com/opensearch-project/opensearch-py) — OpenSearch client\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fusage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscieloorg%2Fusage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscieloorg%2Fusage/lists"}