{"id":41234322,"url":"https://github.com/DFKI/leechcrawler","last_synced_at":"2026-02-01T10:00:42.370Z","repository":{"id":3257990,"uuid":"4296332","full_name":"DFKI/leechcrawler","owner":"DFKI","description":"Incremental crawling capabilities for Apache Tika. Crawl content out of e.g. file systems, http(s) sources (webcrawling) imap(s) servers or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.","archived":false,"fork":false,"pushed_at":"2025-09-15T14:56:23.000Z","size":99799,"stargazers_count":8,"open_issues_count":3,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2026-01-14T18:37:06.564Z","etag":null,"topics":["crawling","extraction","incremental","metadata","tika"],"latest_commit_sha":null,"homepage":"https://github.com/DFKI/leechcrawler","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DFKI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"supporters.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2012-05-11T11:49:36.000Z","updated_at":"2025-09-15T14:56:26.000Z","dependencies_parsed_at":"2024-04-15T15:48:31.042Z","dependency_job_id":"ce349aac-3431-48dd-b305-da733f0a584e","html_url":"https://github.com/DFKI/leechcrawler","commit_stats":null,"previous_names":["leechcrawler/leech"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DFKI/leechcrawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DFKI
%2Fleechcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DFKI%2Fleechcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DFKI%2Fleechcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DFKI%2Fleechcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DFKI","download_url":"https://codeload.github.com/DFKI/leechcrawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DFKI%2Fleechcrawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28975278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T09:57:52.632Z","status":"ssl_error","status_checked_at":"2026-02-01T09:57:49.143Z","response_time":56,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","extraction","incremental","metadata","tika"],"created_at":"2026-01-23T01:00:32.313Z","updated_at":"2026-02-01T10:00:42.364Z","avatar_url":"https://github.com/DFKI.png","language":"Java","funding_links":[],"categories":["Public Groups and Projects on GitHub.com"],"sub_categories":[],"readme":"LeechCrawler\n=====\n\nIncremental crawling capabilities for Apache Tika. Crawl content out of e.g. 
file systems, http(s) sources (webcrawling) or imap(s) servers. LeechCrawler offers additional Tika parsers providing these crawling capabilities.  \nIt is the RDF-free successor of Aperture from the DFKI GmbH Knowledge Management group. If you would like to start a project with us, feel free to [contact us](https://github.com/leechcrawler/leech/blob/master/people.md)!\n\nLeechCrawler is published under the [3-Clause BSD License](http://opensource.org/licenses/BSD-3-Clause), Owner/Organization: [DFKI GmbH](http://www.dfki.de), 2013.\n\nThe key goals of LeechCrawler:\n* Ease of use - crawl a data source with a few lines of code.\n* Low learning curve - Leech integrates seamlessly into the Tika world.\n* Extensibility - write your own crawlers, support new data source protocols and plug them in by simply adding your jar to the classpath.\n* All parsing capabilities from Apache Tika are supported, including your own parsers.\n* Incremental crawling (a second run processes only what has changed in the data source since the last crawl). 
Offered for existing and new crawlers.\n* Easily create Lucene and Solr indices.\n\n***\n**[How to start](https://github.com/leechcrawler/leech/blob/master/how2start.md) | [Code snippets / Examples](https://github.com/leechcrawler/leech/blob/master/codeSnippets.md) | [Extending LeechCrawler](https://github.com/leechcrawler/leech/blob/master/extending.md) | [Mailing list](https://github.com/leechcrawler/leech/blob/master/mailinglist.md) | [People/Legal Information](https://github.com/leechcrawler/leech/blob/master/people.md) | [Supporters](https://github.com/leechcrawler/leech/blob/master/supporters.md) | [Data Protection](https://github.com/leechcrawler/leech/blob/master/dataprotection.md)**\n***\nCrawl something incrementally in 1 minute:\n\n    String strSourceUrl = \"URL4FileOrDirOrWebsiteOrImapfolderOrImapmessageOrSomething\";\n\n    Leech leech = new Leech();\n    CrawlerContext crawlerContext = new CrawlerContext();\n    crawlerContext.setIncrementalCrawlingHistoryPath(\"./history/4SourceUrl\");\n    leech.parse(strSourceUrl, new DataSinkContentHandlerAdapter()\n    {\n        public void processNewData(Metadata metadata, String strFulltext)\n        {\n            System.out.println(\"Extracted metadata:\\n\" + metadata + \"\\nExtracted fulltext:\\n\" + strFulltext);\n        }\n        public void processModifiedData(Metadata metadata, String strFulltext)\n        {\n        }\n        public void processRemovedData(Metadata metadata)\n        {\n        }\n        public void processErrorData(Metadata metadata)\n        {\n        }\n    }, crawlerContext.createParseContext());\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDFKI%2Fleechcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDFKI%2Fleechcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDFKI%2Fleechcrawler/lists"}