{"id":18555729,"url":"https://github.com/romeh/failover-singlejob-ignite","last_synced_at":"2025-04-10T00:30:38.983Z","repository":{"id":137544033,"uuid":"110570777","full_name":"Romeh/failover-singlejob-ignite","owner":"Romeh","description":"a demo for how implement a failover guarantee for single compute task in apache ignite as for single job in the same primary node no failover will be covered  ","archived":false,"fork":false,"pushed_at":"2018-12-17T13:23:56.000Z","size":90,"stargazers_count":2,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-24T13:51:17.494Z","etag":null,"topics":["big-data","caching","ignite","in-memory-computations","nosql-database"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Romeh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-11-13T16:12:11.000Z","updated_at":"2021-12-23T08:07:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"37188995-a1af-4a4e-bd66-80ba90906c7b","html_url":"https://github.com/Romeh/failover-singlejob-ignite","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Romeh%2Ffailover-singlejob-ignite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Romeh%2Ffailover-singlejob-ignite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Romeh%2Ffailover-singlejob-ignite/releases","ma
nifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Romeh%2Ffailover-singlejob-ignite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Romeh","download_url":"https://codeload.github.com/Romeh/failover-singlejob-ignite/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248134844,"owners_count":21053546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","caching","ignite","in-memory-computations","nosql-database"],"created_at":"2024-11-06T21:27:44.208Z","updated_at":"2025-04-10T00:30:38.971Z","avatar_url":"https://github.com/Romeh.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"### How to guarantee your single computation task to be finished in case of node crash in apache Ignite \n\u003c/br\u003e\n\nHow to guarantee your single computation task is guaranteed to failover in case\nof node failures in apache Ignite ?\n\n\n![](https://cdn-images-1.medium.com/max/800/1*Yclh5mXd8QfJu3AYMEfBWQ.png)\n\nAs you know failover support in apache ignite for computation tasks is only\ncovered for map reduce jobs where slave nodes will do computations then reduce\nback to the master node , and in case of any failure in slave nodes where slave\njobs are executing , then it that failed slave job will fail over to another\nnode to continue execution .\n\nOk what about if I need to execute just single computation task and I need to\n have failover guarantee due may be it is a critical task that do financial data\nmodification or must 
finished task in an acceptable status (Success or Failure)\n, how we can do that ? it is not supported out of the box by Ignite but we can\nhave a small design extension using Ignite APIs to cover the same , HOW ?\n\n![Alt text](/config/igniteFailOver.jpg?raw=true \"Overview design\")\n\n**Here is the main steps from the overview above via the following flow :**\n\n\u003e 1- You need to create 2 partitioned caches , one for single jobs reference and\n\u003e one for node Ids reference , you should make those caches backed by persistence\nstore in production if you need to survive total grid crash\n\n\u003e 2- Define jobs cache after put interceptor to set the node id which is the\n\u003e primary owner and triggerer of that compute task \n\nHow the ignite jobs cache interceptor is implemented :\n```java\n\npublic class JobsInterceptor extends CacheInterceptorAdapter\u003cString, Job\u003e {\n\n    @IgniteInstanceResource\n    Ignite ignite;\n\n\n    @Nullable@Override\n    public void onAfterPut(Cache.Entry\u003cString, Job\u003e entry) {\n        // sample sensitive computation task\n        QueryTask queryTask=new QueryTask();\n        // get current node reference to get its node id\n        ClusterNode clusterNode = ignite.cluster().localNode();\n        System.out.println(\"intercepting for job action triggering and setting node id : \"+ clusterNode.id().toString());\n        //store node id in the job wrapper object\n        entry.getValue().setNodeId(clusterNode.id().toString());\n        //create async computation with specific timeout with affinity to the jobs data cache to have collocated computation\n        ignite.compute().withTimeout(5500)\n                .affinityRunAsync(ICEP_JOBS.name(),entry.getKey(),\n                        ()-\u003equeryTask.execute(entry.getValue().getRequest()));\n    }\n\n}\n```\n\u003e 3- Define nodes cache interceptor to intercept after put actions so it can query\n\u003e for all pending jobs for that node id then submit 
them again into the compute\ngrid with affinity \n```java\npublic class NodesInterceptor extends CacheInterceptorAdapter\u003cString, String\u003e {\n\n    @IgniteInstanceResource\n    Ignite ignite;\n    private transient IgniteCache\u003cString, Job\u003e jobs;\n    private final String sql = \"nodeId = ?\";\n    private transient SqlQuery\u003cString, Job\u003e affinityKeyRequestSqlQuery;\n\n\n    @Nullable@Override\n    public void onAfterPut(Cache.Entry\u003cString, String\u003e entry) {\n        // sample compute task that can be sensitive and it need to have fail over support\n        QueryTask task = new QueryTask();\n        // get partitioned jobs cache reference\n        jobs = ignite.cache(ICEP_JOBS.name());\n        // get the current local node reference\n        ClusterNode clusterNode = ignite.cluster().localNode();\n        System.out.println(\"intercepting for Node failure and retry from node id : \"+ clusterNode.id().toString()+\" to node id : \"+entry.getValue());\n\n        // Create query to get pending jobs for that node id and submit them again\n        affinityKeyRequestSqlQuery= new SqlQuery\u003c\u003e(Job.class, sql);\n        affinityKeyRequestSqlQuery.setArgs(entry.getValue());\n        jobs.query(affinityKeyRequestSqlQuery).forEach(affinityKeyJobEntry -\u003e {\n            System.out.println(\"found a pending jobs for node id: \"+entry.getValue() +\" and job id: \"+affinityKeyJobEntry.getKey());\n            // submit again the jobs for re-execution\n            ignite.compute().withTimeout(5500)\n                    .affinityRunAsync(ICEP_JOBS.name(),affinityKeyJobEntry.getKey(),\n                            ()-\u003etask.execute(affinityKeyJobEntry.getValue().request));\n\n        });\n    }\n}\n```\n\n\n\n\u003e 4- Enable event listening for node left and node removal in the grid to\n\u003e intercept node failure\n\n**Then let us run the show , imagine you have data and compute grid of 2 server\nnodes :**\n\n\u003e a- you trigger 
a job in node 1 which will do sensitive action like financial\n\u003e action and you need to be sure it is finished with a valid state whatever the\ncase \n\n\u003e b- what if that primary node 1 crashed , what will happen to that compute task ,\n\u003e without the extension highlighted above it will disappear with the wind \n\n\u003e c- but with that failover small extension , Node 2 . will catch an event that\n\u003e Node 1 left , then it will query jobs cache for all jobs that has that node id\nand resubmit them again for computation , optimal case if you have idempotent\nactions so it can be executed multiple times or use job checkpointing for saving\nthe execution state to resume from the last saved point \n\n**Testing flow :**\n\n1- first run the first ignite server node with that code commented out :\n```java\npublic class NodeApp {\n\n    public static void main(String[] args) throws Exception {\n        // just for demo and test purpose , you should design more generic bootstrap logic to start your node\n        Ignite ignite = Ignition.start(\"config/igniteFailOver.xml\");\n        try {\n\n            IgniteCache\u003cString, Job\u003e cache = ignite.cache(CacheNames.ICEP_JOBS.name());\n            // enable that ONLY for one node and after you start see the system outs , you can kill that node to see the fail over logic in the second node\n            System.out.println(\"start of jobs creation\");\n          /* for (int i = 0; i \u003c= 25; i++) {\n               String key = i + \"Key\";\n                // start creating jobs by inserting them into the\n                cache.put(key\n                        , Job.builder().nodeId(ignite.cluster().localNode().id().toString()).\n                                request(Request.builder().requestID(key).modifiedTimestamp(System.currentTimeMillis()).build()).\n                                build());\n            }*/\n            // listen globally for all nodes failed or removed events\n            
ignite.events().localListen(event -\u003e {\n                DiscoveryEvent discoveryEvent = (DiscoveryEvent) event;\n                System.out.println(\"Received Node event [evt=\" + discoveryEvent.name() +\n                        \", nodeID=\" + discoveryEvent.eventNode() + ']');\n\n                ignite.compute().runAsync(() -\u003e {\n                    IgniteCache\u003cString, String\u003e nodes = ignite.cache(CacheNames.ICEP_NODES.name());\n                    String failedNodeId = discoveryEvent.eventNode().id().toString();\n                    // only one NODE will manage to insert successfully as it it is an atomic operation and thread safe\n                    nodes.withExpiryPolicy(new CreatedExpiryPolicy(Duration.ONE_HOUR)).putIfAbsent(failedNodeId, failedNodeId);\n                });\n\n                return true;\n\n            }, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);\n\n\n        } catch (Exception e) {\n            // just for test , do not do that in production code\n            e.printStackTrace();\n        }\n\n    }\n}\n\n```\n2- then run the second server node but before doing it , uncomment the\nhighlighted code above which simulate creating now jobs for computation by\ninserting them into the jobs cache\n\n3- once you run the second node , after 5 seconds kill it by shutting it down\nonce you see it started to submit jobs from the code you just uncommented, like:\n\n\u003e intercepting for job action triggering and setting node id :\n\u003e f0920c5b-3655–4e85-aa60-f763a9eb1111\u003cbr\u003e Executing computation logic for the\nrequest0Key\n\n4- you will see in the first still running node a message that highlight it\nreceived and event about the removal of the second node which from it , it will\nfetch the node id , then insert it on the failed nodes cache where its cache\ninterceptor will intercept the after put action , use the node id and query in\njobs cache for still pending jobs that has the same node id and resubmit 
them\nagain for execution in the compute grid. This is how the unfinished jobs submitted by the\ncrashed primary node are recovered:\n\n\u003e Received Node event [evt=NODE_LEFT, nodeID=TcpDiscoveryNode\n\u003e [id=2da3e806-72e3-415b-acd3-07b7da0eabe0, addrs=[0:0:0:0:0:0:0:1%lo0, 127.0.0.1,\n192.168.1.169], sockAddrs=[/192.168.1.169:47501, /0:0:0:0:0:0:0:1%lo0:47501,\n/127.0.0.1:47501], discPort=47501, order=2, intOrder=2,\nlastExchangeTime=1510666504589, loc=false, ver=2.3.1#20171031-sha1:d2c82c3c,\nisClient=false]]\n\nYou will then see it fetch the pending jobs and submit them again. For\nexample, the IDEA console shows:\n\n\u003e found a pending jobs for node id: c2a32b7d-1420-4e1a-8ca2-b7080e91dc22 and job\n\u003e id: 19Key\u003cbr\u003e Executing the expiry post action for the request19Key\n\n\u003cbr\u003e \n#### **References :**\n\n* Apache Ignite :\n[https://apacheignite.readme.io/docs](https://apacheignite.readme.io/docs)\n\n\u003cbr\u003e \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromeh%2Ffailover-singlejob-ignite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fromeh%2Ffailover-singlejob-ignite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fromeh%2Ffailover-singlejob-ignite/lists"}