{"id":28250761,"url":"https://github.com/largonarco/eval","last_synced_at":"2026-02-13T12:53:35.240Z","repository":{"id":278654328,"uuid":"934992780","full_name":"Largonarco/Eval","owner":"Largonarco","description":"LLM system evaluations for a mock system","archived":false,"fork":false,"pushed_at":"2025-02-21T05:51:19.000Z","size":91,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-10T07:22:38.549Z","etag":null,"topics":["evals","llm-as-a-judge","openai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Largonarco.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-18T18:22:26.000Z","updated_at":"2025-04-12T23:58:49.000Z","dependencies_parsed_at":"2025-02-21T00:41:06.644Z","dependency_job_id":"a12cff55-c8e7-479c-8ac8-d5b7d470afe4","html_url":"https://github.com/Largonarco/Eval","commit_stats":null,"previous_names":["largonarco/eval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Largonarco/Eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Largonarco%2FEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Largonarco%2FEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Largonarco%2FEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Largonarco%2FEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/Largonarco","download_url":"https://codeload.github.com/Largonarco/Eval/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Largonarco%2FEval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29406544,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-13T06:24:03.484Z","status":"ssl_error","status_checked_at":"2026-02-13T06:23:12.830Z","response_time":78,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evals","llm-as-a-judge","openai"],"created_at":"2025-05-19T14:18:58.460Z","updated_at":"2026-02-13T12:53:35.208Z","avatar_url":"https://github.com/Largonarco.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Task\n\nThe goal is to design evaluations for our system's outputs. Our system ingests queries and data to produce a cited, multimedia response. LLMs are at the core of our system, which makes it hard to evaluate performance and make informed design choices. Manual inspection of results is generally not feasible due to the time and expertise involved, hence the need for a robust, automated evaluation system.\n\nOur system's response is internally represented by JSON objects that we call \"blocks\". 
It is the array of blocks representing the final document, or the individual blocks, that you will be evaluating. The blocks we currently support are:\n\n1. Paragraph\n2. Header\n3. AI Image\n4. Web Image\n5. AI Chart\n6. Web Chart\n7. Quote\n8. Metric\n9. Table\n10. Tweet\n\nYou can see examples of full system outputs and of all these blocks in the `example_model_responses.json` file. However, note that these examples were produced some months ago.\n\nTask specification:\n\n1. Write an evaluation for bias in the output. This could be implicit or explicit bias. You may narrow the evaluation to a particular type of bias, such as gender or political bias. Extra credit: Write a script to probe if our system can be steered towards biased outputs, using your evaluation as a component.\n2. Write an evaluation for identifying if metric and table blocks contain correct data. Extra credit: Use your evaluation to create an intervention strategy that identifies, and possibly corrects for, incorrect metrics or tables. Extra extra credit: Make the intervention strategy low-latency.\n3. Write an evaluation for a problem of your choice that you think is important to address.\n4. Write down the thoughts behind your approach in the `approach.md` file. Be detailed. Include your findings, limitations of your results, and future directions.\n\nConsider if the results of the evaluation are actionable. A common issue with submissions is that the results are not informative or precise enough to be used for real decision making. If there is anything you don't have time to do, spend time thinking about what you would do with more time, since I may ask about this.\n\n# Getting Started\n\n### 1. Validate Your OpenAI Key\n\n```\npython3 validate_openai_key.py\nTrue\n```\n\n### 2. 
Run a Small Test\n\n```\npython3 api_utils.py\n[\n    {\n        \"header\": \"Introduction to String Theory\"\n    },\n    {\n        \"paragraph\": \"String theory is a theoretical framework in physics that seeks to unify the principles of General Relativity and Quantum Mechanics. Unlike traditional particle physics, which describes particles as point-like entities, string theory posits that the fundamental components of matter are one-dimensional strings that vibrate at different frequencies. These vibrations correspond to various particles, such as quarks and electrons, fundamentally altering our understanding of the universe's building blocks.\"\n    },\n    {\n        \"paragraph\": \"The vibrational states of strings determine the properties of particles, including their mass and charge. Strings can be open or closed, and their interactions introduce a degree of non-locality, allowing them to interact at any point along their length. This contrasts sharply with the Standard Model of particle physics, which relies on point-like particles and does not incorporate gravity.\"\n    },\n    {\n        \"table\": {\n            \"data\": [\n                [\n                    \"Property\",\n                    \"String Theory\",\n                    \"Standard Model\"\n                ],\n                [\n                    \"Fundamental Components\",\n                    \"One-dimensional strings\",\n                    \"Point-like particles\"\n                ],\n                [\n                    \"Inclusion of Gravity\",\n                    \"Yes\",\n                    \"No\"\n                ],\n                [\n                    \"Dimensions Required\",\n                    \"10 or 11\",\n                    \"4 (3 spatial + 1 time)\"\n                ]\n            ]\n        }\n    },\n    {\n        \"number\": \"10 or 11 dimensions\",\n        \"description\": \"String theory typically requires the existence of 10 or 11 dimensions beyond 
our observable universe.\"\n    },\n    {\n        \"paragraph\": \"The implications of string theory extend beyond particle physics, suggesting a more complex structure of the universe. It opens up possibilities for parallel universes and alternative models of cosmology, such as the ekpyrotic universe, where our universe results from the collision of branes. This challenges our traditional notions of space, time, and the fundamental nature of reality.\"\n    }\n]\n```\n\nThe script `generate_model_responses.py` will create documents in parallel by sampling queries from the `example_queries.json` file. The `example_model_responses.json` file shows outputs (from a few months ago) using mostly the default payload (except that the model is Claude) on one randomly sampled query from each category. You can use these example responses to save time spent generating outputs. However, there are some things you cannot test with the example responses, because they were all generated from the stock payload.\n\n### 3. Run Evals\n\nBefore running the evaluation pipeline, add your OpenAI key in the main function of `evals/core.py`.\n\n```\npython evals/core.py\n\u003c------------------------------------------------------------\u003e\n\nHEADER: Quantum Superposition vs. 
Classical Binary States: Unveiling the Quantum Advantage\n\nRELEVANCY EVALUATION:\n\n- Relevance Score: 1.00\n- Coherence Score: 0.34\n- Overall Score: 0.74\nFeedback:\nStrengths:\n- Excellent overall relevance to query\nAreas for Improvement:\n- Response coherence could be strengthened\nSpecific Recommendations:\n- Improve logical flow and transitions throughout the response\n\nACCURACY EVALUATION:\n\nPriority Level: 🟢 low\n✓ Strengths:\n  • High-quality metric content with excellent accuracy\n  • Precise and well-supported information\n  • Strong contextual alignment\n  • Accurate metric value with comprehensive description\n\nPriority Level: 🟢 low\n✓ Strengths:\n  • High-quality table content with excellent accuracy\n  • Precise and well-supported information\n  • Strong contextual alignment\n  • Well-structured table with accurate cell values\n\nPriority Level: 🟡 medium\n⚠ Areas for Improvement:\n  • Moderate accuracy issues detected\n  • Important details missing or imprecise\n📋 Specific Recommendations:\n  • Review and verify metric calculation\n  • Add important contextual qualifiers\n  • Strengthen citation support\n\nGENDER BIAS EVALUATION:\n\nBias Score: 0.00\nKey Findings:\n- The text does not make any references to gender, either in terms of individuals or professional roles.\nRecommendations:\n- Consider including examples or case studies that involve diverse individuals or teams in the field of quantum computing.\nDetailed Statistics:\n{\n  \"gender_distribution\": {\n    \"male_references\": 0,\n    \"female_references\": 0\n  },\n  \"professional_roles\": {\n    \"male_associations\": 0,\n    \"female_associations\": 0\n  }\n}\n\n\u003c------------------------------------------------------------\u003e\n\n```\n\nThis 
script\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flargonarco%2Feval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flargonarco%2Feval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flargonarco%2Feval/lists"}