====== High Availability ======

**Remember**, when deploying VM HA in VMware, you need to update the security settings on the port group used by the Infoblox VMs to accept "MAC address changes" and "Forged transmits". This is so that VMware allows the VMs to have multiple MAC addresses per vNIC (which Infoblox HA requires). Documentation [[https://docs.infoblox.com/space/nios90/280765644/About+HA+Pairs|here]] and [[https://www.edge-cloud.net/2013/05/21/infoblox-vnios-ha-pair-vip-unreachable-when-deployed-on-vsphere|more data here]].

General blog article [[https://blogs.infoblox.com/company/power-of-three-for-low-cost-ha-business-continuity/|here]] on using a standalone appliance and an HA pair.

===== Changing HA Pair Types =====

Cutting over from HA physical to HA virtual: when I cut the passive node over to vNIOS, the Member Type did not change to Virtual NIOS. After I cut over the second member of the HA pair, the Member Type changed to Virtual NIOS without intervention.

===== DFP =====

When using DFP, NIOS uses the LAN1 port to establish DoT on TCP 443 to Infoblox Anycast. This is true EVEN IF THE NIOS is HA - NIOS will not use the HA VIP for TCP 443. However, any plaintext queries will come from the HA VIP.

===== LACP =====

NIOS does not support LACP. In addition, for bonding LAN1/LAN2, NIOS only supports mode 1 (active-backup) bonding, so only one NIC is "active" at a time. No protocol is used to achieve this - NIOS simply transmits on one interface, and the switch's CAM table is updated to point at the active port.

===== Cloud Support =====

As of NIOS 9.0.4, HA is supported in AWS, Azure, and GCP.

* HA is not supported on the TE-926 appliance on Azure only, because the underlying Azure VM doesn't have enough network interfaces.
* HA is not supported on the TE-825 appliance on Azure or GCP, because the underlying Azure/GCP VM doesn't have enough network interfaces.
Documentation on [[https://docs.infoblox.com/space/vniosazure/636026896/Deploying+the+vNIOS+Instance+with+High+Availability|Azure HA]].

===== Change IP Settings =====

If you edit the subnet mask or default gateway of the VIP, or of either of the HA ports, or of either of the LAN ports of an HA pair, both members will do a product restart (not a full reboot) at the same time when you save your changes.

You can edit the MGMT interface of one node in an HA pair. It will reboot that node but not the other node of the HA pair.

===== Make Standalone =====

If you take an HA member and make it standalone, the active appliance will set its LAN1 interface IP to the current HA VIP address. If MGMT is used, that will stay the same. The device will then reboot. The other device will keep its LAN1 and MGMT IP addresses, its DNS name, and its local admin accounts, but will be made into a standalone device.

===== Proximity =====

NIOS HA pairs are designed to be deployed next to each other in adjacent racks. Deploying an HA pair over two separate sites (i.e. between two DCs/data centres) connected with dark fibre is not supported. It may well work, but it is bad practice because of the risk of split-brain should anything happen to the fibre. Examples of fibre cuts:

* 2025-09-21 [[https://www.nbcdfw.com/news/local/dfw-love-field-airport-delays-cancellations-cable-lines-friday-american-southwest/3921370/|Texas Airports Impacted]]

As per the [[https://docs.infoblox.com/space/nios90/1432819381/Planning+for+an+HA+Pair|Infoblox Documentation]]: Infoblox uses VRRP advertisements for the active and passive HA design. Therefore, all HA pairs must be located **in the same location**, connected to the highly available switching infrastructure. Any other deployment is not supported without a written agreement with Infoblox. Contact Infoblox Technical Support for more information about other deployment support.
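The "multiple MAC addresses per vNIC" requirement mentioned at the top comes from VRRP: the active node transmits using the shared VRRP virtual MAC in addition to the vNIC's own MAC, which vSphere only permits once "Forged transmits" and "MAC address changes" are accepted. As a minimal Python sketch, the IPv4 virtual MAC format is defined in RFC 5798 as 00:00:5E:00:01:{VRID} (the VRID value below is just an example, not taken from any real deployment):

```python
def vrrp_virtual_mac(vrid: int) -> str:
    """Return the IPv4 VRRP virtual MAC for a given VRID (RFC 5798)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be 1-255")
    # The last octet of the virtual MAC is the VRID in hex.
    return f"00:00:5e:00:01:{vrid:02x}"

# e.g. an HA pair configured with VRID 143:
print(vrrp_virtual_mac(143))  # → 00:00:5e:00:01:8f
```

This is the second MAC that the port group has to tolerate on the active node's vNIC, and it moves to the partner on failover.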
===== HA Failover on DNS Nameservers =====

From the [[https://docs.infoblox.com/space/nios90/280765644/About+HA+Pairs#HA-failover-on-DNS-Nameservers|documentation]]:

When an HA failover occurs on NIOS, there is an approximate 4-5 second interval in which the network is adjusted for the new active node and the new passive node. During this failover period, the active node becomes unresponsive. After the new active node comes up on the network, the DNS service loads all Response Policy Zone (RPZ) files if RPZ is configured. The larger the RPZ files, the longer they take to load, and the longer it takes the DNS service to start serving DNS. For example, on a TE-1425 with RPZs that contain 15 million resource records, it can take approximately one and a half minutes to start serving DNS. If your nameserver uses Grid replication to keep internal zones up to date and is not configured to use RPZ, then the delay before the DNS service starts serving DNS is slightly longer than the HA failover itself.

===== LAN2 =====

The IP will float between the two LAN2 interfaces, but a network failure on one of the LAN2 interfaces will not cause a failover. Only LAN1/HA are guarded for failover. E.g. if LAN1 is for production and LAN2 is for an OOB network, and LAN2 on the active node fails, there is no failover and the OOB network loses access to services on LAN2.

===== NSX =====

The only time I saw a customer deploy NIOS HA on NSX, they had to bypass NSX and expose the VM to ESXi directly because they couldn't get "Forged Transmits" enabled on NSX 4.11.

* The port group is on an NSX "Segment", which has no option for forged transmits.
* "MAC address changes" are allowed in NSX but called "MAC address learning".
* "Forged transmits" are not allowed on NSX, so the customer had to get the VMs working directly with ESXi.
Without "Forged Transmits", everything would work for a minute and then stop for four hours.

===== KB Article =====

* When does an HA failover occur? [[https://support.infoblox.com/s/article/6589|KB Article]]
* High Availability (HA) and network usage [[https://support.infoblox.com/s/article/High-Availability-HA-and-network-usage|KB Article]]

===== HA Priority =====

"The priority is based on system status and events" is a generic statement meaning the status of each node and the events that occur while an active node is being elected. The following start-up behaviours illustrate the VRRP priorities.

* **Case A:** Both starting at the same time, both previously non-active. Both nodes wait 3 to 12 seconds (depending on arping), listening; if nothing is received, they go active with a priority penalty of 1. If both go active, both send VRRP advertisements on their HA ports using the shared VRRP MAC address, and their priorities keep increasing or decreasing depending on whether they receive VRRP packets on their LAN port. If one of the two nodes has been active longer than the other, its priority will stay higher than its partner's for 3 minutes.
* **Case B:** Both starting at the same time, both previously active. Similar to case A, except they wait 3 to 9 seconds and take a priority penalty of 171. If both nodes end up restarting due to a dual-active condition, both will then have been previously passive, which is case A again.
* **Case C:** Both starting at the same time, one previously active, the other previously non-active. The previously active node goes active after 3 to 9 seconds with a priority penalty of 171. The passive node then has 3 seconds (after the initial 0-9) to receive an advertisement. If it doesn't, it goes active, but it will go passive (restart) if it receives an advertisement within the next 3 minutes, before its priority can grow to 180.
* **Case D:** Node starting while the other node is active, and the starting node was previously active. Assuming the running node is up and healthy, it will be active with priority 180, so the joining node listens for 3 to 9 seconds; if it receives a VRRP advertisement within this time, it goes passive. If not, it goes active with its priority set to 171; if it then receives an advertisement from the other node within the next 10 seconds, it goes passive, and we have case E. If not, we end up dual active.
* **Case E:** Node starting while the other node is active, and the starting node was previously passive. Assuming the running node is up and healthy, it will be active with priority 180, so the joining node listens for 3 to 12 seconds; if it receives a VRRP advertisement within this time, it goes passive. If not, it goes active with priority 1; if it then receives an advertisement from the other node within the next 3 minutes, it goes passive, and we have case E again. If not, we end up dual active.
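The start-up cases above can be condensed into a small Python sketch. This only encodes the constants quoted in this section (priority 180 for a healthy active node, 171 and 1 as the start-up priorities after the respective penalties, and the 3-9 / 3-12 second listen windows); the real NIOS election state machine is not public, and the function and constant names here are mine:

```python
import random

# Constants quoted in the cases above; everything else is illustrative.
PRIO_HEALTHY_ACTIVE = 180  # steady-state priority of a healthy active node
PRIO_PREV_ACTIVE = 171     # start-up priority of a previously active node
PRIO_PREV_PASSIVE = 1      # start-up priority of a previously passive node

def listen_window(previously_active: bool) -> float:
    """Seconds a booting node listens for VRRP advertisements before deciding."""
    # Previously active nodes wait 3-9 s; previously passive nodes 3-12 s.
    return random.uniform(3, 9) if previously_active else random.uniform(3, 12)

def startup_role(previously_active: bool, advert_heard: bool):
    """Initial role of a booting node, as (role, starting_priority).

    advert_heard: True if a VRRP advertisement from the partner arrived
    during the listen window (i.e. the partner is already active).
    """
    if advert_heard:
        return ("passive", None)
    prio = PRIO_PREV_ACTIVE if previously_active else PRIO_PREV_PASSIVE
    return ("active", prio)

# Case D: joining node was previously active, partner is already advertising:
print(startup_role(True, True))    # → ('passive', None)
# Case E fallback: previously passive, no advertisement heard in the window:
print(startup_role(False, False))  # → ('active', 1)
```

The asymmetry in the starting priorities (171 versus 1) is what lets a previously active node win the election quickly, while the subsequent grace periods (10 seconds and 3 minutes in the text) give the pair time to back out of a dual-active condition.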