Synchronizing kernel routing tables
In a functioning cluster, the primary unit keeps the kernel routing tables (also called the forwarding information base, or FIB) of all subordinate units up to date and synchronized with its own. Because of these routing table updates, after a failover the new primary unit does not have to populate its kernel routing table before it can route traffic. This gives the new primary unit time to rebuild its regular routing table after a failover.
Use the following command to view the regular routing table. This table contains all of the configured routes, routes acquired from dynamic routing protocols, and so on. This routing table is not synchronized, so on subordinate units this command does not produce the same output as on the primary unit.
get router info routing-table
Use the following command to view the kernel routing table (FIB). This is the list of resolved routes actually being used by the FortiOS kernel. The output of this command should be the same on the primary unit and the subordinate units.
get router info kernel
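One way to confirm that the kernel routing tables match is to capture the output of get router info kernel on each unit and compare the captures line by line. The following Python sketch assumes the output has already been saved as text. The function name fib_differences and the sample route lines are illustrative placeholders, not real FortiOS output.

```python
def fib_differences(primary_output: str, subordinate_output: str):
    """Return (only_on_primary, only_on_subordinate) as sorted lists of lines."""
    def normalize(text):
        # Ignore blank lines and surrounding whitespace; line order is not significant.
        return {line.strip() for line in text.splitlines() if line.strip()}

    primary = normalize(primary_output)
    subordinate = normalize(subordinate_output)
    return sorted(primary - subordinate), sorted(subordinate - primary)


# Placeholder captures (hypothetical route lines, one route per line).
primary_capture = """
route-a 10.0.0.0/24 via port1
route-b 10.0.1.0/24 via port2
"""
subordinate_capture = """
route-b 10.0.1.0/24 via port2
route-a 10.0.0.0/24 via port1
"""

missing, extra = fib_differences(primary_capture, subordinate_capture)
print("synchronized" if not (missing or extra) else "out of sync")
```

Any line reported in either list is a route that appears on one unit but not the other and is worth investigating.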
This section describes how clusters handle dynamic routing failover and how to use CLI commands to control the timing of the routing table updates that the primary unit sends to the subordinate units.
Configuring graceful restart for dynamic routing failover
When an HA failover occurs, neighbor routers will detect that the cluster has failed and remove it from the network until the routing topology stabilizes. During this time, the routers may stop sending IP packets to the cluster, and communication sessions that would normally be processed by the cluster may time out or be dropped. Also, the new primary unit will not receive routing updates and so will not be able to build and maintain its routing database.
You can configure graceful restart (also called nonstop forwarding, or NSF) as described in RFC 3623 (Graceful OSPF Restart) to solve the problem of dynamic routing failover. If graceful restart is enabled on neighbor routers, they will keep sending packets to the cluster following the HA failover instead of removing it from the network.
The neighboring routers assume that the cluster is experiencing a graceful restart.
After the failover, the new primary unit can continue to process communication sessions using the synchronized routing data received from the failed primary unit before the failover. This gives the new primary unit time to update its routing table after the failover.
You can use the following commands to enable graceful restart or NSF on Cisco routers:
router ospf 1
 log-adjacency-changes
 nsf ietf helper strict-lsa-checking
If the cluster is running BGP, use the following command to enable graceful restart for BGP:
config router bgp
    set graceful-restart enable
end
You can also add BGP neighbors and configure the cluster unit to notify these neighbors that it supports graceful restart.
config router bgp
    config neighbor
        edit <neighbor_address_Ipv4>
            set capability-graceful-restart enable
    end
end
If the cluster is running OSPF, use the following command to enable graceful restart for OSPF:
config router ospf
    set restart-mode graceful-restart
end
To make sure the new primary unit keeps its synchronized routing data long enough to acquire new routing data, you should also increase the HA route time to live, route wait, and route hold values to 60 using the following CLI command:
config system ha
    set route-ttl 60
    set route-wait 60
    set route-hold 60
end
Controlling how the FGCP synchronizes kernel routing table updates
You can use the following commands to control some of the timing settings that the FGCP uses when synchronizing routing updates from the primary unit to subordinate units and maintaining routes on the primary unit after a failover.
config system ha
    set route-hold <hold_integer>
    set route-ttl <ttl_integer>
    set route-wait <wait_integer>
end
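When these values are generated by a script, it can help to check them against the documented ranges before pasting them into the CLI. The following Python sketch is one possible approach. The helper name render_ha_route_timers is hypothetical, and the ranges and defaults are the ones stated in the subsections that follow (route-ttl 5 to 3600, route-hold and route-wait 0 to 3600 seconds).

```python
# Ranges and defaults in seconds, as stated in this section.
RANGES = {
    "route-hold": (0, 3600),
    "route-ttl": (5, 3600),
    "route-wait": (0, 3600),
}
DEFAULTS = {"route-hold": 10, "route-ttl": 10, "route-wait": 0}


def render_ha_route_timers(**overrides):
    """Validate timer values and return the matching 'config system ha' block."""
    values = dict(DEFAULTS)
    for name, value in overrides.items():
        option = name.replace("_", "-")  # route_ttl -> route-ttl
        if option not in RANGES:
            raise ValueError(f"unknown option: {option}")
        low, high = RANGES[option]
        if not low <= value <= high:
            raise ValueError(f"{option} must be between {low} and {high}, got {value}")
        values[option] = value
    lines = ["config system ha"]
    lines += [f"    set {option} {values[option]}" for option in sorted(values)]
    lines.append("end")
    return "\n".join(lines)


print(render_ha_route_timers(route_ttl=60, route_wait=60, route_hold=60))
```

Options that are not overridden are written out with their default values, so the generated block always sets all three timers explicitly.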
Change how long routes stay in a cluster unit routing table
Change the route-ttl time to control how long routes remain in a cluster unit routing table. The time to live range is 5 to 3600 seconds. The default time to live is 10 seconds.
The time to live controls how long routes remain active in a cluster unit routing table after the cluster unit becomes a primary unit. To maintain communication sessions after a cluster unit becomes a primary unit, routes remain active in the routing table for the route time to live while the new primary unit acquires new routes.
By default, route-ttl is set to 10, which may mean that only a few routes will remain in the routing table after a failover. Normally, keeping route-ttl at 10 or reducing the value to 5 is acceptable because acquiring new routes usually occurs very quickly, especially if graceful restart is enabled, so acquiring new routes causes only a minor delay.
If the primary unit needs to acquire a very large number of routes, or if there is a delay in acquiring all routes for other reasons, the primary unit may not be able to maintain all communication sessions.
If you find that communication sessions are lost after a failover, you can increase the route time to live so that the primary unit can use synchronized routes that are already in the routing table instead of waiting to acquire new routes.
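The trade-off can be illustrated with a deliberately simplified model: synchronized routes stay active for route-ttl seconds after the failover, and the new primary unit needs some time to acquire its own routes. The Python sketch below assumes all new routes become available at a single moment, which is a simplification for illustration only. The function name and the 25-second reacquisition time are hypothetical.

```python
def uncovered_seconds(route_ttl: int, reacquire_seconds: float) -> float:
    """Seconds after a failover with neither synchronized nor new routes.

    Simplified model (an assumption, not FGCP internals): synchronized routes
    stay active for route_ttl seconds, and all new routes become available
    reacquire_seconds after the failover.
    """
    return float(max(0, reacquire_seconds - route_ttl))


# Compare the default route-ttl with a larger value for a slow reacquisition.
for ttl in (10, 60):
    print(f"route-ttl {ttl}: {uncovered_seconds(ttl, reacquire_seconds=25)} s without routes")
```

In this model any route-ttl at or above the reacquisition time removes the gap, which is why a larger value helps when acquiring routes is slow.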
Change the time between routing updates
Change the route-hold time to change the time that the primary unit waits between sending routing table updates to subordinate units. The route hold range is 0 to 3600 seconds. The default route hold time is 10 seconds.
To avoid flooding routing table updates to subordinate units, set route-hold to a relatively long time to prevent subsequent updates from occurring too quickly. Flooding routing table updates can affect cluster performance if a great deal of routing information is synchronized between cluster units. Increasing the time between updates means that this data exchange will not have to happen so often.
The route-hold time should be coordinated with the route-wait time.
Change the time the primary unit waits after receiving a routing update
Change the route-wait time to change how long the primary unit waits after receiving routing updates before sending the updates to the subordinate units. For quick routing table updates to occur, set route-wait to a relatively short time so that the primary unit does not hold routing table changes for too long before updating the subordinate units.
The route-wait range is 0 to 3600 seconds. The default route-wait is 0 seconds.
Normally, because the route-wait time is 0 seconds the primary unit sends routing table updates to the subordinate units every time its routing table changes.
Once a routing table update is sent, the primary unit waits the route-hold time before sending the next update.
Usually routing table updates are sporadic. Subordinate units should receive these changes as soon as possible, so route-wait is set to 0 seconds. route-hold can be set to a relatively long time because normally the next route update would not occur for a while.
In some cases, routing table updates can occur in bursts. A large burst of routing table updates can occur if a router or a link on a network fails or changes. When a burst of routing table updates occurs, the primary unit could flood the subordinate units with routing table updates. Flooding routing table updates can affect cluster performance if a great deal of routing information is synchronized between cluster units. Setting route-wait to a longer time reduces the frequency of additional updates and prevents flooding of routing table updates.
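The interaction between route-wait and route-hold during a burst can be sketched with a small simulation. The Python model below is an assumption based only on the behavior described in this section, not the FGCP implementation: after a routing table change the primary unit waits route-wait seconds before sending, after a send it waits at least route-hold seconds before the next one, and changes that arrive before a scheduled send are batched into it. The function name and the burst timestamps are hypothetical.

```python
def update_send_times(change_times, route_wait=0, route_hold=10):
    """Return the times (in seconds) at which the primary would send updates.

    Simplified model: wait route_wait after a change before sending, and wait
    at least route_hold after one send before the next. Changes that arrive
    at or before an already scheduled send are batched into that send.
    """
    sends = []
    pending_send = None  # time of the most recently scheduled send, if any
    for change in sorted(change_times):
        if pending_send is not None and change <= pending_send:
            continue  # batched into the send that is already scheduled
        earliest = change + route_wait
        if sends:
            earliest = max(earliest, sends[-1] + route_hold)
        pending_send = earliest
        sends.append(earliest)
    return sends


# A burst of six changes in five seconds, then one isolated change.
burst = [0, 1, 2, 3, 4, 5, 30]
print(update_send_times(burst, route_wait=0, route_hold=0))
print(update_send_times(burst, route_wait=0, route_hold=10))
print(update_send_times(burst, route_wait=5, route_hold=10))
```

In this model, with both timers at 0 every change triggers its own update (seven sends). With route-hold at 10 the burst collapses to three sends, and adding a route-wait of 5 reduces it to two, at the cost of delaying the first update by five seconds.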