{"id":13735128,"url":"https://github.com/nsoft/jesterj","last_synced_at":"2026-01-16T06:58:36.821Z","repository":{"id":21243166,"uuid":"24558581","full_name":"nsoft/jesterj","owner":"nsoft","description":"Document Ingestion Framework for Search Systems","archived":false,"fork":false,"pushed_at":"2024-05-01T00:42:05.000Z","size":6911,"stargazers_count":35,"open_issues_count":23,"forks_count":34,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-08-04T03:04:37.924Z","etag":null,"topics":["elasticsearch","java","search","solr"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"patreon":"needhamsoftware"}},"created_at":"2014-09-28T12:06:27.000Z","updated_at":"2024-05-01T00:42:11.000Z","dependencies_parsed_at":"2024-05-01T01:42:34.463Z","dependency_job_id":"596616bf-dd3f-472b-be07-65cdf0c666c3","html_url":"https://github.com/nsoft/jesterj","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsoft%2Fjesterj","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsoft%2Fjesterj/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsoft%2Fjesterj/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nsoft%2Fjesterj/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/nsoft","download_url":"https://codeload.github.com/nsoft/jesterj/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224727173,"owners_count":17359532,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elasticsearch","java","search","solr"],"created_at":"2024-08-03T03:01:03.272Z","updated_at":"2026-01-16T06:58:36.782Z","avatar_url":"https://github.com/nsoft.png","language":"Java","funding_links":["https://patreon.com/needhamsoftware"],"categories":["Projects"],"sub_categories":[],"readme":"JesterJ\n=======\nA highly flexible, scalable, fault-tolerant document ingestion system designed for search.\n\n[![License](https://img.shields.io/badge/license-Apache%202.0-B70E23.svg?style=plastic)](http://www.opensource.org/licenses/Apache-2.0)\n[![Build Status](https://github.com/nsoft/jesterj/actions/workflows/gradle.yml/badge.svg)](https://github.com/nsoft/jesterj/actions)\n\nBuilds are run on infrastructure kindly donated by [\u003cimg align=\"top\" src=\"https://crave.io/wp-content/uploads/2022/09/Crave_logo_black_bg-e1663023213710.png\" alt=\"\" width=\"100px\" height=\"26px\"\u003e](https://crave.io/)\n\n## The problem\nFrequently, search projects start by feeding a few documents manually to a search engine, often via the \"just for testing\" built in processing features of Solr such as [SolrCell](https://solr.apache.org/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html) or [post.jar](https://solr.apache.org/guide/6_6/post-tool.html#simpleposttool).\nThese features are documented and 
included in order to help the user get a feel for what they can do with Solr with a minimum of painful setup.\n\nThis is good and that's how it should be for first explorations. **Unfortunately it's also a potential trap.**\n\nAll too often, users who don't know any better, and are perhaps misled by the fact that these interfaces are documented in the reference manual (and assume anything documented must be \"the right way\" to do it) continue developing their search system by automating the use of those same interfaces.\nIn fairness to those users, some older versions of the Solr Ref guide failed to identify the \"just for testing\" nature of the interface, sometimes because it took a while for the community to realize the pitfalls associated with it.\n\nUnfortunately, large-scale ingestion of documents for search is non-trivial and those indexing interfaces are not meant for production use.\nThe usual result is that it works \"ok\" for a small test corpus and then becomes unstable on a larger production corpus.\nThe code written to feed into such interfaces often needs to be repeated for several types of documents or for various document formats, and can easily lead to duplication and cut and paste copying of common functionality.\nAlso, after investing substantial engineering to get such solutions working on a large corpus, the next thing they discover is that they have no way to recover if indexing fails part way through.\nIn the worst cases the failure is related to the size of the corpus and the failures become increasingly common as the corpus grows until the chance of completing an indexing run is small and the system eventually cannot be indexed or upgraded at all if the problem is allowed to fester.\nThe result is a terrible, painful and potentially expensive set of growing pains.\n\n## JesterJ's solution\n\nJesterJ endeavors to make it easy to start with a robust full-featured indexing infrastructure, so that you don't have to re-invent the wheel.\nJesterJ 
is meant to be a system you won't need to abandon until you are working with extremely large numbers of documents (and hopefully by that point you are already making good profits that can pay for a large custom solution!).\nA variety of re-usable processing components are provided and writing your own custom processors is as simple as implementing a 4 method interface following some simple guidelines.\n\nOften the first version of a system for indexing documents into Solr or another search engine is fairly linear and straightforward, but as time passes features and enhancements often add complexity.\nOther times, the system is complex from the very start, possibly because search is being added to an existing system.\nJesterJ is designed to handle complex indexing scenarios.\nConsider the following hypothetical indexing workflow:\n\n![Complex Processing](https://raw.githubusercontent.com/nsoft/jesterj/79ed481c7c0b98469e3e41c96b92170837a26130/code/examples/routing/complex-routing.png)\n\nJesterJ handles such scenarios with a single centralized processing plan, and will ensure that if the system is unplugged, you won't get a second message about an order received. The default mode for JesterJ is to ensure at-most-once delivery for steps that are not marked safe or idempotent. Safe steps do not have external effects, and idempotent steps may be repeated en-route to the final processing end point.\n\nSee the [website](http://www.jesterj.org) and the [documentation](https://github.com/nsoft/jesterj/wiki/Documentation) for more info\n\n# Getting Started\n\nPlease see the [documentation in the wiki](https://github.com/nsoft/jesterj/wiki/Documentation)\n\n# Project Status\n\n**Current release**: 1.0-Beta3. This is the best version to use, and should be mostly functional. 
(known issue: https://github.com/nsoft/jesterj/issues/189)\n\n**Next Release:** 1.0-Beta4 will be published soon. If no serious issues are found within two weeks, 1.0 will be released.\n\n\nNOTE: The current code and the upcoming 1.0 release target any design and load that can be serviced by a single machine.\nJesterJ is explicitly designed to take advantage of machines with many processors.\nYou can design your plan with duplicates of your slowest step to alleviate bottlenecks. Each duplicate implies an additional thread working on that step.\nAutomatic scaling of threads is planned for 1.1 and scaling across many machines is a key priority for the 2.x releases. As always, if you want these features sooner, please start a discussion and contribute a PR if you are able!\n\n\n\n# JDK versions\n\nPresently only JDK 11 has been tested regularly. Any distribution of JDK 11 should work. Support for Java 17 and future LTS versions is planned for future releases.\n\n# Discord Server\n\nDiscuss features, ask questions etc on Discord: https://discord.gg/RmdTYvpXr9\n\n## Features:\n\nIn this release we have the following features\n\n* Ability to visualize the structure of your plan (.dot or .png format: [example from unit tests here](https://tinyurl.com/22k7tu74) )\n* Simple filesystem scanner for locally mounted drives (replacement for post.jar)\n* JDBC scanner (replacement for Data Import Handler!)\n* Scanners can remember what documents they've seen (or not, boolean flag)\n* Scanners can recognize updated content (or not, boolean flag)\n* Send to Solr processor with tunable batch sizes\n* Tika processor to extract content from Word/PDF/xml/html, etc (Replacement for SolrCell!)\n* Stax extract processor for dissecting xml documents directly.\n* Copy field processor to rename source fields to desired index field\n* Regexp replace processor to edit field content, or drop fields that don't match\n* Split field processor to split delimited values for multi-value fields\n* 
Drop field processor to get rid of annoying excess fields.\n* Field template processor for composing field content using a velocity template\n* URL encode processor to encode the value of a field and make it safe for use in URLs\n* Fetch URL processor for acquiring or enhancing content by contacting other systems\n* Log and drop processor for when you identify an invalid document\n* Date Reformat processor, because dates, formatting... always. (*sigh*)\n* Human Readable File Size processor\n* Solr sender to send documents to solr in batches.\n* Pre-Analyze processor to move Solr analysis workload out of Solr (just give it your schema.xml!)\n* Embedded Cassandra server (no need to install cassandra yourself!)\n* Cassandra config and data location configurable, defaults to `~/.jj/cassandra`\n* Support for fault tolerance writing status change events to the embedded cassandra server\n* Initial API/process for user written document processors. (see [documentation](https://github.com/nsoft/jesterj/wiki/Documentation))\n* 60% test coverage (jacoco)\n* Simple, single java file to configure everything, non-java programmers need only follow a simple example (for use cases not requiring custom code)\n* If you DO need custom code that code can be packaged as an [uno-jar](https://github.com/nsoft/uno-jar) to provide all required dependencies and escape from any library versions that JesterJ uses! You only have to deal with your OWN jar hell, not ours! Of course, you can also just rely on whatever we already provide too. 
The classloaders for custom code prefer your uno-jar and then default back to whatever JesterJ has available on its classpath.\n* Runnable example to [execute a plan](https://github.com/nsoft/jesterj/blob/master/code/ingest/README.md) that scans a filesystem, and indexes the documents in solr.\n\n\n## TODO for 1.0 final release\n * [Remaining issues](https://github.com/nsoft/jesterj/issues?q=is%3Aopen+is%3Aissue+milestone%3A1.0)\n * Beta release, testing.\n\nRelease 1.0 is intended to be usable for single node systems, and therefore suitable for use on small to medium-sized projects (tens of millions or maybe low hundreds of millions of documents).\n\n## Road Map\n\nThe best guess at any time of what will be in future releases is given by the milestone filters [on our issues page](https://github.com/nsoft/jesterj/issues)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnsoft%2Fjesterj","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnsoft%2Fjesterj","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnsoft%2Fjesterj/lists"}