{"id":31727254,"url":"https://github.com/liyakhathshaik/datascout.jl","last_synced_at":"2025-10-09T06:19:44.888Z","repository":{"id":310808829,"uuid":"1041181037","full_name":"liyakhathshaik/DataScout.jl","owner":"liyakhathshaik","description":"This is a julia package","archived":false,"fork":false,"pushed_at":"2025-08-21T03:45:57.000Z","size":239,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-28T07:21:36.790Z","etag":null,"topics":["data","datascout","julia"],"latest_commit_sha":null,"homepage":"https://liyakhathshaik.github.io/DataScout.jl/","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liyakhathshaik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-20T05:28:01.000Z","updated_at":"2025-08-21T03:46:00.000Z","dependencies_parsed_at":"2025-08-20T11:47:35.767Z","dependency_job_id":"740b1840-8730-4cbb-b830-55a262e7dfa3","html_url":"https://github.com/liyakhathshaik/DataScout.jl","commit_stats":null,"previous_names":["liyakhathshaik/datascout.jl"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/liyakhathshaik/DataScout.jl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liyakhathshaik%2FDataScout.jl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liyakhathshaik%2FDataScout.jl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liyakhathshaik%2FDataScout.jl/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/liyakhathshaik%2FDataScout.jl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liyakhathshaik","download_url":"https://codeload.github.com/liyakhathshaik/DataScout.jl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liyakhathshaik%2FDataScout.jl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279000850,"owners_count":26082950,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","datascout","julia"],"created_at":"2025-10-09T06:19:43.505Z","updated_at":"2025-10-09T06:19:44.882Z","avatar_url":"https://github.com/liyakhathshaik.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataScout.jl 🔍\n\n[![CI](https://github.com/liyakhathshaik/DataScout.jl/actions/workflows/CI.yml/badge.svg)](https://github.com/liyakhathshaik/DataScout.jl/actions/workflows/CI.yml)\n[![codecov](https://codecov.io/gh/liyakhathshaik/DataScout.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/liyakhathshaik/DataScout.jl)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\u003e A unified Julia package for searching scientific repositories, libraries, and 
databases\n\nDataScout.jl provides a consistent interface to search across multiple scientific and academic data sources. Whether you're looking for research papers, books, datasets, or web content, DataScout.jl makes it easy to find what you need.\n\n## Features\n\n- **Unified Interface**: Search multiple sources with a single, consistent API\n- **Multiple Sources**: Support for 11+ different data sources\n- **Rate Limiting**: Built-in rate limiting to respect API limits\n- **Error Handling**: Robust error handling and retry mechanisms\n- **Flexible Configuration**: Easy API key management and customization\n- **Type Safety**: Full Julia type annotations and safety\n\n## Installation\n\n```julia\nusing Pkg\nPkg.add(\"DataScout\")\n```\n\nOr from the Julia REPL:\n\n```julia\n] add DataScout\n```\n\n## Quick Start\n\n```julia\nusing DataScout\n\n# Search Wikipedia\nresults = search(\"Julia programming language\", source=:wikipedia, max_results=5)\n\n# Search academic papers (requires API key)\nset_api_key!(:core, \"your-core-api-key\")\npapers = search(\"machine learning\", source=:core, max_results=10)\n\n# Search books\nbooks = search(\"data science\", source=:openlibrary, max_results=5)\n\n# View results\nprintln(results)\n```\n\nTip: An empty query returns an empty DataFrame (no error), so you can safely pass user input without pre-validation.\n\n## Supported Sources\n\n| Source | Symbol | API Key Required | Description |\n|--------|--------|------------------|-------------|\n| [CORE](https://core.ac.uk/) | `:core` | ✅ | Academic papers and research articles |\n| [OpenAlex](https://openalex.org/) | `:openalex` | ❌ | Scholarly works and publications |\n| [Zenodo](https://zenodo.org/) | `:zenodo` | ❌ | Research data and publications |\n| [Figshare](https://figshare.com/) | `:figshare` | ❌ | Research outputs and datasets |\n| [Project Gutenberg](https://www.gutenberg.org/) | `:gutenberg` | ❌ | Free ebooks |\n| [Open Library](https://openlibrary.org/) | 
`:openlibrary` | ❌ | Books and library catalog |\n| [Wikipedia](https://wikipedia.org/) | `:wikipedia` | ❌ | Encyclopedia articles |\n| [DuckDuckGo](https://duckduckgo.com/) | `:duckduckgo` | ❌ | Web search results |\n| [SearxNG](https://searx.space/) | `:searxng` | ❌ | Privacy-focused search |\n| [Whoogle](https://github.com/benbusby/whoogle-search) | `:whoogle` | ❌ | Privacy-focused Google search |\n| [Internet Archive](https://archive.org/) | `:internetarchive` | ❌ | Digital library and archives |\n\nWhy DataScout: unified schema across sources, built-in retries and rate limiting, simple API key management, and strong error handling.\n\n## Configuration\n\n### API Keys\n\nSome services require API keys for access:\n\n```julia\n# Set API keys\nset_api_key!(:core, \"your-core-api-key\")\nset_api_key!(:openalex, \"your-openalex-api-key\")  # Optional\n\n# Get API keys\napi_key = get_api_key(:core)\n```\n\nAPI keys are stored in `~/.datascout/config.toml` and persist between sessions.\n\n### Custom Instances\n\nFor services like SearxNG and Whoogle, you can specify custom instances:\n\n```julia\n# Using environment variables\nENV[\"SEARXNG_INSTANCE\"] = \"https://my-searxng-instance.com\"\nENV[\"WHOOGLE_INSTANCE\"] = \"https://my-whoogle-instance.com\"\n\n# Or pass as parameters\nresults = search(\"query\", source=:searxng, instance=\"https://custom-instance.com\")\n```\n\n## Usage Examples\n\n### Basic Search\n\n```julia\nusing DataScout\nusing DataFrames  # nrow comes from DataFrames\n\n# Simple Wikipedia search\nresults = search(\"quantum computing\", source=:wikipedia)\nprintln(\"Found $(nrow(results)) results\")\nprintln(results.title[1])  # First result title\nprintln(results.url[1])    # First result URL\n```\n\n### Academic Research\n\n```julia\n# Search for academic papers\nset_api_key!(:core, \"your-api-key\")\npapers = search(\"climate change\", source=:core, max_results=20)\n\n# Keep only results that list authors\npapers_with_authors = filter(row -\u003e !ismissing(row.authors), papers)\n\n# Display results\nfor i in 1:min(5, 
nrow(papers))\n    println(\"Title: $(papers.title[i])\")\n    println(\"Authors: $(papers.authors[i])\")\n    println(\"URL: $(papers.url[i])\")\n    println(\"---\")\nend\n```\n\n### Multi-Source Search\n\n```julia\nusing DataScout\nusing DataFrames  # provides DataFrame() and vcat(...; cols=:union)\n\nfunction search_multiple_sources(query, sources=[:wikipedia, :openalex, :zenodo])\n    all_results = DataFrame()\n    \n    for source in sources\n        try\n            results = search(query, source=source, max_results=5)\n            all_results = vcat(all_results, results, cols=:union)\n        catch e\n            @warn \"Failed to search $source: $e\"\n        end\n    end\n    \n    return all_results\nend\n\n# Search across multiple sources\nresults = search_multiple_sources(\"artificial intelligence\")\n```\n\n### Real-World Use Cases by Source\n\n- **CORE (`:core`)**: literature reviews, academic search portals, or internal tools where PDF links and authors are important.\n  - Why: high-quality academic index with direct download URLs.\n  - How:\n    ```julia\n    set_api_key!(:core, \"YOUR_CORE_KEY\")\n    df = search(\"graph neural networks\", source=:core, max_results=5)\n    ```\n\n- **OpenAlex (`:openalex`)**: topic exploration, citation-based workflows, profile building.\n  - Why: rich scholarly metadata; DOI normalization to `https://doi.org/...`.\n  - How:\n    ```julia\n    df = search(\"federated learning\", source=:openalex, max_results=5)\n    ```\n\n- **Zenodo (`:zenodo`)**: dataset discovery, research artifacts in pipelines.\n  - Why: research outputs with persistent identifiers.\n  - How:\n    ```julia\n    df = search(\"climate dataset\", source=:zenodo, max_results=5)\n    ```\n\n- **Figshare (`:figshare`)**: media, datasets, and supplementary materials.\n  - Why: broad research outputs beyond papers.\n  - How:\n    ```julia\n    df = search(\"microscopy\", source=:figshare, max_results=5)\n    ```\n\n- **Project Gutenberg (`:gutenberg`)**: classic texts for NLP experiments and demos.\n  - Why: public-domain ebooks at 
scale.\n  - How:\n    ```julia\n    df = search(\"sherlock holmes\", source=:gutenberg, max_results=5)\n    ```\n\n- **Open Library (`:openlibrary`)**: bibliographic enrichment and library apps.\n  - Why: book metadata for integrations and lookups.\n  - How:\n    ```julia\n    df = search(\"data visualization\", source=:openlibrary, max_results=5)\n    ```\n\n- **Wikipedia (`:wikipedia`)**: quick encyclopedic lookups in UIs or chatbots.\n  - Why: broad coverage, fast responses.\n  - How:\n    ```julia\n    df = search(\"julia programming language\", source=:wikipedia, max_results=5)\n    ```\n\n- **DuckDuckGo (`:duckduckgo`)**: general web results with privacy focus.\n  - Why: augment academic results with broader context.\n  - How:\n    ```julia\n    df = search(\"reproducible research tooling\", source=:duckduckgo, max_results=5)\n    ```\n\n- **SearxNG (`:searxng`)**: meta-search with custom instances for enterprise.\n  - Why: configurable, self-hostable meta search.\n  - How:\n    ```julia\n    ENV[\"SEARXNG_INSTANCE\"] = \"https://searx.example.org\"\n    df = search(\"open data portals\", source=:searxng, max_results=5)\n    ```\n\n- **Whoogle (`:whoogle`)**: privacy-preserving Google front-end.\n  - Why: keep queries private while leveraging Google results.\n  - How:\n    ```julia\n    ENV[\"WHOOGLE_INSTANCE\"] = \"https://whoogle.example.org\"\n    df = search(\"state of the art summarization\", source=:whoogle, max_results=5)\n    ```\n\n- **Internet Archive (`:internetarchive`)**: archives, media, and historical documents.\n  - Why: rich historical datasets and media collections.\n  - How:\n    ```julia\n    df = search(\"old computing magazines\", source=:internetarchive, max_results=5)\n    ```\n\n## Result Format\n\nAll search functions return a `DataFrame` with the following columns:\n\n- `title::Union{String, Missing}` - Title of the result\n- `url::Union{String, Missing}` - URL to access the resource\n- `authors::Union{Vector{String}, Missing}` - 
List of authors (when available)\n- `source::Union{String, Missing}` - Source name\n- `id::Union{String, Missing}` - Unique identifier from the source\n\n## Error Handling\n\nDataScout.jl includes comprehensive error handling:\n\n```julia\n# Graceful handling of network errors\ntry\n    results = search(\"test query\", source=:core)\ncatch e\n    @error \"Search failed: $e\"\n    results = DataFrame()  # Empty results\nend\n\n# Built-in retry mechanism for transient failures\n# Automatic rate limiting to respect API limits\n# Detailed error logging for debugging\n```\n\nBehavioral guarantees:\n- Empty queries return an empty standardized DataFrame (no exceptions).\n- All sources normalize to the same schema.\n- Transient failures are retried with exponential backoff.\n\n## Performance and Rate Limiting\n\nDataScout.jl automatically handles rate limiting for each service:\n\n- **CORE**: 0.3 seconds between requests (default)\n- **Other services**: 0.5 seconds between requests (default)\n- **Retry mechanism**: 3 attempts with exponential backoff\n- **Configurable**: Adjust rate limits in `~/.datascout/config.toml`\n\nRate-limiting state is persisted in `~/.datascout/state.toml` to smooth behavior across sessions.\n\n## Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n### Development Setup\n\n```bash\ngit clone https://github.com/liyakhathshaik/DataScout.jl.git\ncd DataScout.jl\njulia --project=. -e 'using Pkg; Pkg.instantiate()'\njulia --project=. 
-e 'using Pkg; Pkg.test()'\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- Thanks to all the open data providers and APIs that make this package possible\n- Inspired by the need for unified access to scientific literature and data\n- Built with ❤️ for the Julia community\n\n## Support\n\n- 📖 [Documentation](https://github.com/liyakhathshaik/DataScout.jl)\n- 🐛 [Issue Tracker](https://github.com/liyakhathshaik/DataScout.jl/issues)\n- 💬 [Discussions](https://github.com/liyakhathshaik/DataScout.jl/discussions)\n\n---\n\n**DataScout.jl** - Making scientific data discovery simple and unified! 🔍✨","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliyakhathshaik%2Fdatascout.jl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliyakhathshaik%2Fdatascout.jl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliyakhathshaik%2Fdatascout.jl/lists"}