{"id":16139432,"url":"https://github.com/networkop/ecmp-conntrack","last_synced_at":"2025-04-06T17:43:19.070Z","repository":{"id":83590170,"uuid":"371383318","full_name":"networkop/ecmp-conntrack","owner":"networkop","description":"Maglev-style LB on Arista","archived":false,"fork":false,"pushed_at":"2021-05-27T13:50:01.000Z","size":109,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-12T23:45:15.731Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/networkop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-27T13:28:59.000Z","updated_at":"2022-04-25T07:09:43.000Z","dependencies_parsed_at":"2023-07-08T01:46:44.379Z","dependency_job_id":null,"html_url":"https://github.com/networkop/ecmp-conntrack","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/networkop%2Fecmp-conntrack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/networkop%2Fecmp-conntrack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/networkop%2Fecmp-conntrack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/networkop%2Fecmp-conntrack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/networkop","download_url":"https://codeload.github.com/networkop/ecmp-conntrack/tar.g
z/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247526675,"owners_count":20953141,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-09T23:49:03.429Z","updated_at":"2025-04-06T17:43:19.049Z","avatar_url":"https://github.com/networkop.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Resilient ECMP plus\nImproving resilient ECMP with directflow connection tracking\n\n## Problem statement\nResilient ECMP has one major disadvantage that makes it difficult to use for client-facing services. When nexthops are added or removed, the resilient ECMP table is updated with those changes immediately. This resets every existing TCP connection that gets remapped to a new nexthop or was using a removed one. This repo contains the setup instructions and code required to build a proof-of-concept demo. The demo shows how directflow can be used for NAT-like connection tracking, so that existing TCP sessions \"stick\" to their old nexthops even when nexthops are added or removed. This makes it possible to gradually drain existing servers before taking them down for maintenance, or to introduce new servers to scale capacity horizontally.\n\n## Demo setup\n\n### 1. Build the physical topology\n\nBased on the following diagram:\n\n![](img/demo.png)\n\n### 3. Set up the server side\nTwo servers, S1 and S2, will run a simple web server and [gocast](https://github.com/mayuresh82/gocast) to advertise a /32 anycast route to the load-balancing switch (LB) over BGP. 
Repeat the following steps on both S1 and S2:\n\n```bash\nbash\nsudo systemctl start docker\n\ndocker run -d -p 80:80 --name $(hostname) --hostname $(hostname) containous/whoami\n\nmkdir -p gocast\n# write the gocast config (overwrite rather than append, so re-running this step is safe)\ncat \u003c\u003c EOF \u003e gocast/gocast.yaml\nagent:\n  # http server listen addr\n  listen_addr: :8181\n  # Interval for health check\n  monitor_interval: 10s\n  # Time to flush out inactive apps\n  cleanup_timer: 15m\n\nbgp:\n  local_as: 65100\n  remote_as: 65000\n  # override the peer IP to use instead of auto discovering\n  peer_ip: 5.5.5.5\n  communities:\n    - 65100:100\n  origin: igp\nEOF\n\ndocker run -d --name bgp --cap-add=NET_ADMIN --net=host -v $(pwd)/gocast/:/conf mayuresh82/gocast -v 9 --config=/conf/gocast.yaml\n```\n\n### 4. Install df-agent\nDF-agent is the process responsible for provisioning directflow rules on the LB. It can run anywhere, as long as it is reachable from the LB switch's default VRF. In this demo it is installed on both S1 and S2 and is reachable on the anycast IP advertised by gocast. Run the following on both S1 and S2:\n\n```\ndocker run -d --name df-agent --cap-add=NET_ADMIN --net=host networkop/df-agent\n```\n\n### 5. Enable BGP anycast\nFinally, tell gocast to advertise 1.1.1.1/32 towards the LB. 
Run this on both S1 and S2:\n\n```\ncurl \"http://127.0.0.1:8181/register?name=traefik\u0026vip=1.1.1.1/32\u0026monitor=port:tcp:80\"\n```\n\nThe expected output on s7058 (LB) is:\n\n```\ns7058#sh ip bgp 1.1.1.1/32\nBGP routing table information for VRF default\nRouter identifier 5.5.5.5, local AS number 65000\nBGP routing table entry for 1.1.1.1/32\n Paths: 2 available\n  65100\n    23.23.23.3 from 23.23.23.3 (23.23.23.3)\n      Origin IGP, metric 0, localpref 100, IGP metric 1, weight 0, received 00:01:07 ago, valid, external, ECMP head, ECMP, best, ECMP contributor\n      Community: 65100:100\n      Rx SAFI: Unicast\n  65100\n    34.34.34.4 from 34.34.34.4 (34.34.34.4)\n      Origin IGP, metric 0, localpref 100, IGP metric 1, weight 0, received 00:00:12 ago, valid, external, ECMP, ECMP contributor\n      Community: 65100:100\n      Rx SAFI: Unicast\n```\n\n## Problem demonstration\n\nThe following steps reproduce the default (undesired) behaviour of resilient ECMP.\n\n### 1. Identify two flows that take different ECMP paths\n\nRun these commands on the LB. The port numbers may differ, as long as they produce two different output interfaces:\n\n```text\ns7058#show load-balance destination ip ingress-interface Et47 src-ipv4-address 12.12.12.1 dst-ipv4-address 1.1.1.1 ip-protocol 6 src-l4-port 30002 dst-l4-port 80\nOutput Interface: Ethernet43\ns7058#show load-balance destination ip ingress-interface Et47 src-ipv4-address 12.12.12.1 dst-ipv4-address 1.1.1.1 ip-protocol 6 src-l4-port 30001 dst-l4-port 80\nOutput Interface: Ethernet42\n```\n\n![](img/ecmp.png)\n\n### 2. Establish a TCP session to S1\n\ns7154 plays the role of the client (C1):\n\n```\ns7154#bash nc -v -p 30002  1.1.1.1 80\nNcat: Version 6.40 ( http://nmap.org/ncat )\nNcat: Connected to 1.1.1.1:80.\n```\n\n### 3. 
Simulate the failure of S1\n\nStop advertising the anycast IP from S1:\n\n```bash\n[admin@s7022 ~]$ curl \"http://127.0.0.1:8181/unregister?name=traefik\u0026vip=1.1.1.1/32\u0026monitor=port:tcp:80\"\n```\n\nFrom C1's open netcat session, request the home HTTP page:\n\n```\nGET / HTTP/1.0\nHost: localhostNcat: Connection reset by peer.\n% 'nc -v -p 30002 1.1.1.1 80' returned error code: 1\n```\n\nWhen S1 fails, all of its existing flows are redirected to S2. S2 has no state for them, so it immediately answers with a TCP RST.\n\n![](img/ecmp-failure.png)\n\n### 4. Open a new netcat session\n\nThis time the new session lands on s7021 (S2), since it is the only server left:\n\n```\ns7154#bash nc -v -p 30002  1.1.1.1 80\nNcat: Version 6.40 ( http://nmap.org/ncat )\nNcat: Connected to 1.1.1.1:80.\n```\n\n### 5. Simulate S1 recovery\n\nStart advertising the anycast IP again:\n\n```bash\n[admin@s7022 ~]$ curl \"http://127.0.0.1:8181/register?name=traefik\u0026vip=1.1.1.1/32\u0026monitor=port:tcp:80\"\n```\n\nFrom C1's open netcat session, request the home HTTP page:\n\n```\nGET / HTTP/1.0\nHost: localhostNcat: Connection reset by peer.\n% 'nc -v -p 30002 1.1.1.1 80' returned error code: 1\n```\n\nWhen a new server is added to the ECMP nexthop group, some of the existing sessions are redirected to it. That server immediately responds with a TCP RST.\n\n![](img/ecmp-recovery.png)\n\nThis happens because S1 has no state for the TCP session that was established with S2, while S2 still considers the session established:\n\n```\n[admin@s7021 ~]$ netstat -an | grep 1.1.1.1\ntcp6       0      0 1.1.1.1:80              12.12.12.1:30002        ESTABLISHED\n```\n\n## Problem solution demonstration\n\nTo let DF-agent act on all incoming TCP sessions, we need to mirror their packets to DF-agent's anycast IP over a GRE monitor session. 
Run the following on the LB:\n\n```\ns7058(config)#monitor session TCP-SYN destination tunnel mode gre source 5.5.5.5 destination 1.1.1.1\ns7058(config)#monitor session TCP-SYN source Ethernet47 rx ip access-group ACL-TCP-SYN\n```\n\n![](img/df.png)\n\nFirst, let's see how DF connection tracking allows a server to be taken out of service gracefully.\n\n### 1. Establish a TCP session to S1\n\n```\ns7154#bash nc -v -p 30002  1.1.1.1 80\nNcat: Version 6.40 ( http://nmap.org/ncat )\nNcat: Connected to 1.1.1.1:80.\n```\n\n### 2. Gracefully shut down S1\n\nInstead of withdrawing the anycast route, we set the local preference of the route received from S1 to 1:\n\n```text\ns7058#sh run sec PL|RMAP\nip prefix-list PL-S1-NEXTHOP seq 10 permit 34.34.34.4/32\n!\nroute-map RMAP-TRAEFIK-IN permit 10\n   match ip next-hop prefix-list PL-S1-NEXTHOP\n   set local-preference 1\n!\nrouter bgp 65000\n   neighbor TRAEFIK route-map RMAP-TRAEFIK-IN in\n```\n\nThis should result in only S2 being selected as the best path:\n\n```text\ns7058#sh ip bgp 1.1.1.1\nBGP routing table information for VRF default\nRouter identifier 5.5.5.5, local AS number 65000\nBGP routing table entry for 1.1.1.1/32\n Paths: 2 available\n  65100\n    23.23.23.3 from 23.23.23.3 (23.23.23.3)\n      Origin IGP, metric 0, localpref 100, IGP metric 1, weight 0, received 00:11:41 ago, valid, external, best\n      Community: 65100:100\n      Rx SAFI: Unicast\n  65100\n    34.34.34.4 from 34.34.34.4 (34.34.34.4)\n      Origin IGP, metric 0, localpref 1, IGP metric 1, weight 0, received 00:03:13 ago, valid, external\n      Community: 65100:100\n      Rx SAFI: Unicast\n```\n\nFrom C1's open netcat session, request the home HTTP page:\n\n```\nGET / HTTP/1.0\n\nHTTP/1.0 200 OK\nDate: Mon, 01 Apr 2019 15:02:34 GMT\nContent-Length: 122\nContent-Type: text/plain; charset=utf-8\n\nHostname: s7022\nIP: 127.0.0.1\nIP: 172.17.0.2\nGET / HTTP/1.1\nHost: \nUser-Agent: Go-http-client/1.1\n\n```\n\nThe original flow is still pinned by directflow to its 
old ECMP nexthop (S1, 34.34.34.4):\n\n```\ns7058(config-router-bgp)#show directflow flows\nFlow 12_12_12_1-30002-1_1_1_1-80:\n  persistent: False\n  priority: 0\n  priorityGroupType: default\n  tableType: ifp\n  hard timeout: 0\n  idle timeout: 300\n  match:\n    Ethernet type: IPv4\n    source IPv4 address: 12.12.12.1/255.255.255.255\n    destination IPv4 address: 1.1.1.1/255.255.255.255\n    IPv4 protocol: TCP\n    source TCP/UDP port or ICMP type: 30002\n    destination TCP/UDP port or ICMP type: 80\n  actions:\n    output nexthop: 34.34.34.4\n  source: config\n  matched: 5 packets, 367 bytes\n```\n\n### 3. Establish a new TCP session to S2\n\nOnce the DF flow has been idle for 300 seconds (its idle timeout), it expires. Another session from the same source port then lands on S2:\n\n```\ns7154#bash nc -v -p 30002  1.1.1.1 80\nNcat: Version 6.40 ( http://nmap.org/ncat )\nNcat: Connected to 1.1.1.1:80.\n```\n\n### 4. Gracefully recover S1\n\nRemove the route-map to restore the ECMP group:\n\n```\ns7058(config-router-bgp)#no neighbor TRAEFIK route-map RMAP-TRAEFIK-IN in\n```\n\nFrom C1's open netcat session, request the home HTTP page:\n\n```\nGET / HTTP/1.0\n\nHTTP/1.0 200 OK\nDate: Mon, 01 Apr 2019 15:14:27 GMT\nContent-Length: 122\nContent-Type: text/plain; charset=utf-8\n\nHostname: s7021\nIP: 127.0.0.1\nIP: 172.17.0.2\nGET / HTTP/1.1\nHost: \nUser-Agent: Go-http-client/1.1\nConnection: close\n```\n\nThe session is still pinned to S2 by a non-persistent directflow entry:\n\n```\ns7058(config-router-bgp)#show directflow flows\nFlow 12_12_12_1-30002-1_1_1_1-80:\n  persistent: False\n  priority: 0\n  priorityGroupType: default\n  tableType: ifp\n  hard timeout: 0\n  idle timeout: 300\n  match:\n    Ethernet type: IPv4\n    source IPv4 address: 12.12.12.1/255.255.255.255\n    destination IPv4 address: 1.1.1.1/255.255.255.255\n    IPv4 protocol: TCP\n    source TCP/UDP port or ICMP type: 30002\n    destination TCP/UDP port or ICMP type: 80\n  actions:\n    output nexthop: 23.23.23.3\n  source: config\n  
matched: 7 packets, 509 bytes\n```\n\n## Notes\n\nThe DF-agent code has not been thoroughly tested. It can be run directly as a Python script or as a Docker container, which can be built with `./build.sh`.\n\nAll startup configs have been anonymized with [netconan](https://pypi.org/project/netconan/).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworkop%2Fecmp-conntrack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetworkop%2Fecmp-conntrack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworkop%2Fecmp-conntrack/lists"}