{"id":16544504,"url":"https://github.com/msoedov/validex","last_synced_at":"2025-04-04T12:06:26.877Z","repository":{"id":251650939,"uuid":"832866623","full_name":"msoedov/validex","owner":"msoedov","description":"Simplifies the retrieval, extraction, and training of structured data from various unstructured sources.","archived":false,"fork":false,"pushed_at":"2025-03-26T23:15:52.000Z","size":386,"stargazers_count":137,"open_issues_count":9,"forks_count":12,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-28T11:07:51.203Z","etag":null,"topics":["llm-extraction","structured-data-extraction","structured-output"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msoedov.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-23T22:12:20.000Z","updated_at":"2025-03-27T15:58:37.000Z","dependencies_parsed_at":"2024-10-27T11:11:03.668Z","dependency_job_id":"727d953f-40c8-41ec-ae3f-d9bb80cf8a89","html_url":"https://github.com/msoedov/validex","commit_stats":{"total_commits":10,"total_committers":2,"mean_commits":5.0,"dds":0.09999999999999998,"last_synced_commit":"2456258eaf856be85d711249d704d47f508adece"},"previous_names":["msoedov/validex","msoedov/morph"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msoedov%2Fvalidex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msoedov%2Fvalidex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/msoedov%2Fvalidex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msoedov%2Fvalidex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msoedov","download_url":"https://codeload.github.com/msoedov/validex/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247174415,"owners_count":20896078,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-extraction","structured-data-extraction","structured-output"],"created_at":"2024-10-11T19:03:00.874Z","updated_at":"2025-04-04T12:06:26.852Z","avatar_url":"https://github.com/msoedov.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ValidEx\n\nValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.\n\n\u003cp\u003e\n\u003cimg alt=\"GitHub Contributors\" src=\"https://img.shields.io/github/contributors/msoedov/validex\" /\u003e\n\u003cimg alt=\"GitHub Last Commit\" src=\"https://img.shields.io/github/last-commit/msoedov/validex\" /\u003e\n\u003cimg alt=\"\" src=\"https://img.shields.io/github/repo-size/msoedov/validex\" /\u003e\n\u003cimg alt=\"GitHub Issues\" src=\"https://img.shields.io/github/issues/msoedov/validex\" /\u003e\n\u003cimg alt=\"GitHub Pull Requests\" src=\"https://img.shields.io/github/issues-pr/msoedov/validex\" /\u003e\n\u003cimg alt=\"Github License\" src=\"https://img.shields.io/github/license/msoedov/validex\" /\u003e\n\u003c/p\u003e\n\n## 🏷 Features\n\n- 
**Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.\n- **Heuristic Data Cleaning**: Text normalization (case, whitespace, special characters) and deduplication.\n- **Concurrency Support**: Efficiently process multiple data sources simultaneously.\n- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.\n- **Hallucination Check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.\n- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.\n- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.\n\n## 📦 Installation\n\nTo get started with ValidEx, simply install the package using pip:\n\n```shell\npip install validex\n```\n\n## ⛓️ Quick Start\n\n```python\nimport validex\nfrom pydantic import BaseModel\n\n\nclass Superhero(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n\ndef main():\n    app = validex.App()\n\n    app.add(\"https://www.britannica.com/topic/list-of-superheroes-2024795\")\n    app.add(\"*.txt\")\n    app.add(\"*.pdf\")\n    app.add(\"*.md\")\n\n    superheroes = app.extract(Superhero)\n    print(f\"Extracted superheroes: {list(superheroes)}\")\n\n    first_hero = app.extract_first(Superhero)\n    print(f\"First extracted hero: {first_hero}\")\n\n    print(f\"Total cost: ${app.cost()}\")\n    print(f\"Total usage: {app.usage}\")\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\nExample output:\n\n```python\n[\n    (\n        Superhero(\n            name=\"Batman\",\n            age=81,\n            power=\"Brilliant detective skills, martial arts\",\n            enemies=[\"Joker\", \"Penguin\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Wonder Woman\",\n            
age=80,\n            power=\"Superhuman strength, speed, agility\",\n            enemies=[\"Ares\", \"Cheetah\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Spider-Man\",\n            age=59,\n            power=\"Wall-crawling, spider sense\",\n            enemies=[\"Green Goblin\", \"Venom\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Captain America\",\n            age=101,\n            power=\"Super soldier serum, shield\",\n            enemies=[\"Red Skull\", \"Hydra\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Superman\", age=35, power=\"Flight\", enemies=[\"Lex Luthor\", \"Doomsday\"]\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n    (\n        Superhero(\n            name=\"Wonder Woman\",\n            age=30,\n            power=\"Super Strength\",\n            enemies=[\"Ares\", \"Cheetah\"],\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n    (\n        Superhero(\n            name=\"Spider-Man\",\n            age=25,\n            power=\"Wall-crawling\",\n            enemies=[\"Green Goblin\", \"Venom\"],\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n]\n```\n\n### Hallucinations and autofix\n\n```python\nimport re\n\nfrom pydantic import BaseModel\n\n\nclass Superhero(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n    def fix(self):\n        # Auto-fix and normalize the generated data\n        if self.age \u003c 0:\n            self.age = 0\n\n    def check_hallucinations(self):\n        # Check name\n        if not re.match(r\"^[A-Za-z\\s-]+$\", self.name):\n            raise ValueError(f\"Name '{self.name}' contains 
unusual characters\")\n\n        # Check age\n        if self.age \u003c 0 or self.age \u003e 1000:\n            raise ValueError(f\"Age {self.age} seems unrealistic\")\n\n        # Check power\n        if len(self.power) \u003e 50:\n            raise ValueError(\"Power description is unusually long\")\n\n        # Check enemies\n        if len(self.enemies) \u003e 10:\n            raise ValueError(\"Unusually high number of enemies\")\n\n        for enemy in self.enemies:\n            if not re.match(r\"^[A-Za-z\\s-]+$\", enemy):\n                raise ValueError(f\"Enemy name '{enemy}' contains unusual characters\")\n```\n\n### Experimental: Export and fine-tuning\n\n```python\n# Save data in the OpenAI chat fine-tuning format\napp.export_jsonl(\"fine_tune.jsonl\")\n\n# Local model training\napp.fit()\napp.save(\"state.validex\")\n\n\napp.infer_extract(\"booob\")\n```\n\n### Multi-model Extraction\n\nValidEx supports extracting multiple models at once:\n\n```python\nclass Superhero2(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n\nmulti_results = app.multi_extract(Superhero, Superhero2)\nprint(f\"Multi-extraction results: {multi_results}\")\n```\n\n### Limitations\n\nTBD\n\n## 🛠️ Roadmap\n\n## 👋 Contributing\n\nContributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:\n\n- Fork the repository on GitHub\n- Create a new branch for your changes\n- Commit your changes to the new branch\n- Push your changes to the forked repository\n- Open a pull request to the main ValidEx repository\n\nBefore contributing, please read the contributing guidelines.\n\n## License\n\nValidEx is released under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsoedov%2Fvalidex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsoedov%2Fvalidex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsoedov%2Fvalidex/lists"}