First Rule Of Upgrading NIOS - Read the release notes. Then read them again. Understand what changes happen with the code and figure out if this affects your deployment of NIOS. We cannot stress this single point enough.
First Rule Of Upgrading NIOS - See the first rule of upgrading NIOS.
Official upgrade documentation here.
Remember, upgrading NIOS will empty the recycle bin.
NOTE: When you have installed a hotfix bundle/collective hotfix (e.g. CHF 8.6.3.2), make sure you perform a product restart (or full reboot) on the systems to ensure the fix is fully implemented. If you forget and then try to upgrade to another version of NIOS (e.g. 9.0.1), you can (albeit very unlikely) run into issues.
NOTE: From NIOS 9.0.6 onwards, upgrade status logs are captured in the Grid Master log files. You can view these logs using the show log debug follow /UPGRADE_STATUS/ CLI command.
You may need to increase the session timeout limit for your user account if you are having issues uploading code to the GM prior to an upgrade. If the timeout limit is too low, the session can expire mid-upload and break it.
Support Articles:
NOTE: When NIOS upgrades, the recycle bin is emptied.
NOTE: Do not leave the default group empty. If you do, the upgrade may never (officially) finish.
show version and show upgrade_history)
The downgrade procedure is for single independent appliances only. Infoblox does not support software downgrades for Grid members, but you can revert to the previous NIOS release on a Grid Master.
After you complete the downgrade procedure, all data in the database is lost. The downgrade process does not preserve data but does preserve license information and basic network settings.
SSH into GM and disable TLS 1.0 and TLS 1.1
set ssl_tls_settings override
set ssl_tls_protocols disable TLSv1.0
set ssl_tls_protocols disable TLSv1.1
You will need to restart the GUI manually. Navigate to the Grid tab → Grid Manager tab → Members tab, select the member checkbox, expand the Toolbar, and click Control → Restart GUI
You may also get the following error logs in the GM syslog based on one or more of the Trusted Root CA in your CA store in NIOS
Upgrade check failed, SKI doesn't exist in CA-certificate subject=
Remember, once 9.0.3 has been distributed, you can install the CHF2 for 9.0.3 so that it is applied to the new partition. This means the CHF is already in place on 9.0.3 as soon as 9.0.3 is installed. Verify with
show upgrade_history
You should install Hotfix-NIOS-98022 BEFORE upgrading to NIOS 9.0 (but AFTER distribution of NIOS 9.0.x code) to ensure that all OpenVPN connections (Grid communication) are using a correct certificate. Failure to do this can result in members going offline (not connecting to the GM) and/or the GM entering a reboot loop. From NIOS 9.0.6 onwards, Upgrade Test and Upgrade will fail if OpenVPN certificates are not correct. More details here.
Consider setting the following after upgrading to 9.0 to ensure that DNS restarts don't take longer. named_max_exit_wait: by default, NIOS waits until the exit happens. This setting caps the wait (e.g. 3 or 5 seconds).
In NIOS 9.0 and higher, if you use LDAP authentication and you need the LDAP connection to egress the MGMT interface, you must put a static route on the NIOS box to force the traffic to use the MGMT interface. This is because in NIOS 9.0.0, LDAP requests to the LDAP server and Active Directory server cannot be sent using the MGMT IP address, because OpenLDAP version 2.4.49 (Ubuntu) removed the options of binding the source IP address on the client. Therefore, an LDAP request or an Active Directory authentication request is always sent through the LAN IP address, even though you have enabled the Connect through Management Interface option.
For 9.0.x upgrades, read the release notes very carefully. Specifically, this is paraphrased from the “Upgrade Guidelines” section:
Subject Key Identifier field
basicConstraints marked a critical extension (RFC 5280 - 4.2.1.5)
keyUsage extension field (RFC 5280)
md5WithRSAEncryption or sha1WithRSAEncryption ciphers
?acs formatting was used for IP and FQDN. Now in NIOS 9.x the /metadata must be used for FQDN.
HOWEVER, remember that, if you are using the reporting server, you can't have your HTTPS certificates use a greater key size than 2048 because of a limitation of Java.
You must remove DNS Unbound for all external syslog servers if configured explicitly because NIOS 9.0 does not support unbound.
If you are running a NIOS member in Azure, and you have “Accelerated Networking” (Azure’s SRIOV) enabled for any of the network interfaces of the member at an Azure level (configured in the Azure portal, NOT the NIOS UI), then you MUST disable “Accelerated Networking” IF you are installing NIOS 9.0.0, 9.0.1, 9.0.2, 9.0.3, or 9.0.4. Accelerated Networking / SRIOV introduces extra NICs to the VM, and these NICs have the same MAC addresses as the regular NICs. The duplicate MAC addresses cause issues for these NIOS versions (observed on 9.0.3): NIOS can no longer map the interfaces. NIOS 8.6.x instances don't see the Azure SRIOV NICs, so the extra NICs do not show up and cause issues. NIOS 9.0.5 fixes this problem.
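As a quick pre-flight check, the version rule above can be written down as a small helper. This is only a sketch of the rule as stated on this page; the function names are made up for illustration, and it says nothing about how to query Azure for the NIC setting.

```python
# Versions this page lists as affected by the Azure Accelerated Networking issue.
AFFECTED_VERSIONS = {(9, 0, 0), (9, 0, 1), (9, 0, 2), (9, 0, 3), (9, 0, 4)}


def parse_version(text):
    """Turn '9.0.3' or '9.0.3-50212-ee11d5834df9' into (9, 0, 3)."""
    base = text.split("-", 1)[0]
    return tuple(int(part) for part in base.split("."))


def must_disable_accelerated_networking(target_version):
    """True if the target NIOS version is one of the affected releases."""
    return parse_version(target_version) in AFFECTED_VERSIONS


if __name__ == "__main__":
    for version in ("8.6.4", "9.0.3-50212-ee11d5834df9", "9.0.5"):
        print(version, must_disable_accelerated_networking(version))
```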
If your NIOS is authoritative DNS and uses DNSSEC to sign those zones, you MUST get rid of the following algorithms/ciphers
Switch to algorithms
(ISC has removed support for algorithms 1(RSA-MD5), 3(DSA), and 6(DSA-NSEC3-SHA1) in BIND 9.16)
https://datatracker.ietf.org/doc/html/rfc8624#section-3.1
The following table lists the implementation recommendations for DNSKEY algorithms
https://datatracker.ietf.org/doc/html/rfc8624#section-3.3
SHA-1 is still widely used for Delegation Signer (DS) records, so validators MUST implement validation, but it MUST NOT be used to generate new DS and CDS records (see “Operational Considerations” for caveats when upgrading from the SHA-1 to SHA-256 DS algorithm.)
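To find zones still signed with one of the algorithms ISC removed, something like the following could be run over DNSKEY records exported from the Grid or fetched with dig. It is a rough sketch: the input is assumed to be one record per line in standard presentation format, and the sample records are invented.

```python
# Algorithms this page quotes as removed in BIND 9.16.
REMOVED_IN_BIND_9_16 = {1: "RSA-MD5", 3: "DSA", 6: "DSA-NSEC3-SHA1"}


def removed_algorithms(dnskey_lines):
    """Return (owner, algorithm, name) for each DNSKEY using a removed algorithm."""
    findings = []
    for line in dnskey_lines:
        fields = line.split()
        if "DNSKEY" not in fields:
            continue
        type_index = fields.index("DNSKEY")
        if len(fields) < type_index + 4:
            continue  # truncated record, nothing to inspect
        # DNSKEY RDATA order: flags, protocol, algorithm, public key.
        algorithm = int(fields[type_index + 3])
        if algorithm in REMOVED_IN_BIND_9_16:
            findings.append((fields[0], algorithm, REMOVED_IN_BIND_9_16[algorithm]))
    return findings


if __name__ == "__main__":
    sample = [
        "example.com. 3600 IN DNSKEY 257 3 8 AwEAAaFakeKey",  # invented record
        "old.example. 3600 IN DNSKEY 256 3 6 AwEAAaFakeKey",  # invented record
    ]
    for owner, number, name in removed_algorithms(sample):
        print(f"{owner} uses algorithm {number} ({name})")
```

Note that the protocol field is always 3, so the check has to read the third RDATA field rather than search the line for a bare number.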
As per this document, NIOS 9.0.1+ changes the way Available memory is calculated: DB Cache/Huge Pages are now excluded from Available memory. This results in a more accurate calculation of the Used Memory in comparison to previous versions, although it might appear that memory usage is now higher.
Infoblox utilizes Huge Pages for Database (DB) Cache. Before NIOS 9.0.1, these DB Cache/Huge Pages were treated as Cached and were included in the Available memory calculation, resulting in the Available memory being high, although the huge pages were being used by the DB Cache.
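A worked example may help explain why graphs jump after the upgrade. The figures and the formula below are illustrative assumptions (a simplified "used = total minus available" model with invented numbers), not the exact calculation NIOS performs.

```python
def used_percent(total_mb, free_mb, cached_mb, hugepages_mb, exclude_hugepages):
    """Simplified model: used% = (total - available) / total.

    Before 9.0.1 the huge pages backing the DB cache counted as available;
    from 9.0.1 they are excluded from available memory.
    """
    available = free_mb + cached_mb
    if not exclude_hugepages:
        available += hugepages_mb
    return round(100 * (total_mb - available) / total_mb, 1)


if __name__ == "__main__":
    # Invented figures for a 16 GB appliance.
    figures = dict(total_mb=16384, free_mb=2048, cached_mb=4096, hugepages_mb=6144)
    print("pre-9.0.1 view :", used_percent(**figures, exclude_hugepages=False))
    print("9.0.1+ view    :", used_percent(**figures, exclude_hugepages=True))
```

With these made-up numbers the same appliance reports 25.0% used under the old accounting and 62.5% under the new one, with no change in actual workload.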
When upgrading to NIOS 9.0.2 or higher, the upgrade check will fail if you have imported Trusted Root CAs that don't have the SKI (Subject Key Identifier) field. For example, Root CAs generated on a Palo Alto Networks firewall won't have this field, but certificates generated on a Linux CLI will. You will get the following error on the console and in syslog. The solution is to delete the trusted root CA (note: you don't have to replace live HTTPS certificates even if those certificates are signed by a CA with no SKI).
2023-12-02 14:38:45 GMT syslog CRITICAL root[2179522] Upgrade check failed, SKI doesn't exist in CA-certificate
2023-12-02 14:38:45 GMT syslog CRITICAL root[2179557] Grid not compatible with 9.0.3-50212-ee11d5834df9 release due to unsupported hardware or incompatible configuration setting
After distributing the code, I then found the following:
CHECKING RFC 5280 compliance for /storage/etc/security/certs/aaa_ca_cert.pem
Upgrade check failed, certificate violates RFC 5280 serial=44D33399A222D0AAA4FFF1118666444CCCEEE111 : Root and Subordinate CA certificate keyUsage extension MUST be present.
CA certificate check failed
test upgrade failure
RFC 5280 is here. In my case, I told NIOS to regenerate a self-signed cert for the GUI to allow me to get past the issue.
Another RFC 5280 issue encountered is
CHECKING RFC 5280 compliance for /storage/etc/security/certs/aaa_ca_cert.pem
Upgrade check failed, certificate violates RFC 5280 serial=44D33399A222D0AAA4FFF1118666444CCCEEE111 : basicConstraints MUST appear as a critical extension.
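It can save a failed upgrade test to look at your CA certificates before distributing the code. The sketch below scans the text produced by `openssl x509 -in ca.pem -noout -text` for the items the errors above complain about (SKI, critical basicConstraints, keyUsage) plus the MD5/SHA-1 signature algorithms. It is a rough approximation written for this page, not the check NIOS itself runs, and the function name is made up.

```python
import re

WEAK_SIGNATURES = ("md5WithRSAEncryption", "sha1WithRSAEncryption")


def check_ca_text(openssl_text):
    """Return a list of likely problems found in `openssl x509 -text` output."""
    problems = []
    if "X509v3 Subject Key Identifier" not in openssl_text:
        problems.append("no Subject Key Identifier (SKI)")
    # openssl prints "critical" on the same line as the extension name.
    if not re.search(r"X509v3 Basic Constraints:[ \t]*critical", openssl_text):
        problems.append("basicConstraints missing or not marked critical")
    if "X509v3 Key Usage" not in openssl_text:
        problems.append("no keyUsage extension")
    algorithms = set(re.findall(r"Signature Algorithm:\s*(\S+)", openssl_text))
    for weak in WEAK_SIGNATURES:
        if weak in algorithms:
            problems.append(f"weak signature algorithm {weak}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: openssl x509 -in ca.pem -noout -text | python3 check_ca.py
    for problem in check_ca_text(sys.stdin.read()):
        print("WARNING:", problem)
```

An empty result only means none of these particular patterns matched; the Upgrade Test in the Grid remains the authoritative check.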
The following commands are available from NIOS 9.0 onwards
set enable_strict_ca_cert_check
set disable_strict_ca_cert_check
show strict_ca_cert_check
You may also find that the HTTPS certificate that came with NIOS has expired.
(on NIOS 8.6.2)
crit Upgrade check failed, Apache certificate has expired on Jun 12 10:40:50 2023 GMT
crit Grid not compatible with 9.0.3-50212-ee11d5834df9 release due to unsupported hardware or incompatible configuration
You can recreate this certificate or go to 8.6.4 first and then 9.0.3.
Note that NIOS 8.6.2 generates a SHA-256/RSA 2048-bit certificate that is valid for 365 days and has the FQDN and IP of the GM in the SAN field.
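Since the regenerated certificate is only valid for 365 days, it is worth tracking the expiry date. The expiry string in the error above uses the same format as a certificate's notAfter field, which Python's standard `ssl.cert_time_to_seconds` can parse. A small sketch, where the helper name and the regular expression are illustrative:

```python
import re
import ssl
from datetime import datetime, timezone

EXPIRY_PATTERN = re.compile(r"expired on (\w{3}\s+\d+ \d\d:\d\d:\d\d \d{4} GMT)")


def expiry_from_log(line):
    """Extract the certificate expiry time (UTC) from an upgrade-check log line."""
    match = EXPIRY_PATTERN.search(line)
    if match is None:
        return None
    seconds = ssl.cert_time_to_seconds(match.group(1))
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    line = ("crit Upgrade check failed, Apache certificate has expired on "
            "Jun 12 10:40:50 2023 GMT")
    expiry = expiry_from_log(line)
    print("expired:", expiry.isoformat())
    print("days since expiry:", (datetime.now(timezone.utc) - expiry).days)
```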
Note: 9.0.5 introduces changes to Threat Insight. If you use Threat Insight, you may see some CPU usage increase.
When applying a hotfix, support will send you a zip with instructions, a hotfix file and a “revert” hotfix file.
Go to Grid > Upgrade and then select the member you want to apply the hot fix to. Then click the down arrow by “Apply Hotfix” and choose “To selected Grid Members”. You then get to select and upload the hotfix file. Uploading the hotfix file causes it to be installed more or less instantly. Good practice is to then manually reboot the appliance. You will then see the hot fix details listed in the Hotfix column.
For primary and secondary DNS servers, the zones will stay in sync while the members are on different versions: notifications and updates keep working. However, you will not be able to manually create records while the Grid is running two different NIOS versions.
Change the default “Revert Window Time” from 24 hours to 1 hour by running the command “set default_revert_window 1” on the Grid Master.
Putting each member in its own group helps with sequencing (what triggers next box in a sequence)
Upgrade policy timer is hard coded as 10 minutes. When upgrade process starts for a member in an upgrade group, grid master starts a timer for 10 minutes and prints the following debug log:
If the member completes the upgrade and joins successfully within 10 minutes, grid master logs the following and then proceeds with next member in the current upgrade group or to the next upgrade group:
However, if the member does not complete upgrade within 10 minutes, grid master skips this member and then proceeds with next member in the current upgrade group or to the next upgrade group:
Scheduled upgrades can occur over a period of 9 days as per the docs.
Did we ever get a reason why this revert window even exists? What does it do differently than a regular revert?
I believe it holds the grid in that “you can only change some things” interim state (so that it can work with the members at different revs) to allow for the per-member revert. The revert window allows you to do single node reverts.
The revert window applies to each individual member: after the member completes its upgrade, it has 24 hours (default) to 48 hours (max) to revert (there is still data loss). The “individual revert” is intended for troubleshooting and less disruption (instead of reverting the entire Grid).
Then there's reverting the entire Grid. There is no time limit for this: everything reboots and boots back into the old partition (old software), and all changes are lost.
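When planning a change window it helps to know exactly when the per-member revert option disappears. The arithmetic is trivial, but a sketch makes the rule explicit. The 24-hour default and 48-hour maximum come from the text above; the function names and the lower bound check are assumptions for illustration.

```python
from datetime import datetime, timedelta

MAX_WINDOW_HOURS = 48  # maximum stated on this page


def revert_deadline(upgrade_completed_at, window_hours=24):
    """Latest time at which a single member can still be reverted."""
    if not 0 < window_hours <= MAX_WINDOW_HOURS:
        raise ValueError("revert window must be between 1 and 48 hours")
    return upgrade_completed_at + timedelta(hours=window_hours)


def can_revert_member(upgrade_completed_at, now, window_hours=24):
    """True while the member is still inside its revert window."""
    return now <= revert_deadline(upgrade_completed_at, window_hours)


if __name__ == "__main__":
    done = datetime(2023, 9, 28, 12, 30)
    print("default window ends:", revert_deadline(done))
    print("1-hour window ends :", revert_deadline(done, window_hours=1))
```

Shortening the window to 1 hour, as suggested earlier on this page, simply moves the deadline forward; a full Grid revert is not bound by it.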
set grid_upgrade forced_upgrade
shortens the upgrade window (“everybody forget about the schedule and upgrade now!”)
Do not use this command unless you know what you are doing and you have engaged Infoblox technical support.
set grid_upgrade forced_end
Make sure that the default group has at least one member associated with it, otherwise the appliance displays that the upgrade process is still in progress even though it is complete. To avoid this, you can either use the Infoblox > set grid_upgrade forced_end command to stop the upgrade process or keep at least one member in the default group.
Note: Using the command will force all upgrade groups to end the upgrade immediately. All members of incomplete groups will be logged off the Grid to perform an auto-sync of software with the Grid. This operation should only be used in an emergency to end a scheduled upgrade, as it will result in a member service outage until the operation is completed.
During an upgrade, you have the option to select an upgrade group and click “Upgrade Now”. This will tell NIOS to start the upgrade on all members of the group simultaneously and immediately.
– Note from Infoblox Community user: Probably my bad, but when an upgrade group is set to sequential it does not mean the node will upgrade one after the other i.e. waiting until one has finished to start the next upgrade….it means that the node upgrades get kicked off a minute or so apart from each other, so there is a huge overlap
This caused downtime because several nodes that are each other's fallback were offline at the same time.
But worse in my opinion, even HA-clusters now have a downtime during the failover from the node running the old version to the node running the newly installed version.
It seems that the node running the old version starts the failover as soon as it detects the other node running a higher version, but does not take into account that this new node is not yet ready to handle traffic. So the old node goes offline while the new one is still in a slow process of starting BIND. This resulted in a downtime for DNS of 3 to 5 minutes.
if any grid member fails to upgrade within 10 minutes, the next one goes.
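The effect of the 10-minute policy timer on a sequential group can be modelled in a few lines. This is a toy model of the behaviour described on this page (the GM waits for a member to join, or moves on after 10 minutes), with invented member names and durations. It is not how NIOS implements it.

```python
TIMER_MINUTES = 10  # hard-coded policy timer described on this page


def simulate_group(upgrade_minutes):
    """Model a sequential upgrade group.

    upgrade_minutes maps member name -> minutes that member needs.
    Returns (member, start_minute, outcome) in upgrade order, where outcome is
    'joined' (finished within the timer) or 'skipped' (GM moved on).
    """
    clock = 0
    timeline = []
    for member, needed in upgrade_minutes.items():
        if needed <= TIMER_MINUTES:
            timeline.append((member, clock, "joined"))
            clock += needed
        else:
            timeline.append((member, clock, "skipped"))
            clock += TIMER_MINUTES
    return timeline


if __name__ == "__main__":
    group = {"ns1": 6, "ns2": 14, "ns3": 8}  # invented durations
    for member, start, outcome in simulate_group(group):
        print(f"t+{start:>2} min  {member}: {outcome}")
```

In this example ns2 starts at minute 6 and needs until minute 20, but ns3 is started at minute 16, so two members are down at once. That is the kind of overlap to plan for when members act as each other's fallback.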
Upgrades can be automated via API.
Ideally, define your upgrade & distribution groups in advance, then you just
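As a rough illustration of the API route, the sketch below builds (but does not send) a WAPI request that calls the `upgrade` function on the grid object. The action names are the ones visible in the log excerpts on this page. The WAPI version, host name, object reference and credentials are placeholders, and the exact function and parameters should be confirmed against the WAPI guide for your NIOS version before use.

```python
import base64
import json
import urllib.request

# Action values that appear in the GridUpgrade log lines on this page.
ACTIONS_SEEN_IN_LOGS = {"DISTRIBUTION_START", "UPGRADE_TEST_START"}


def build_upgrade_request(gm_host, grid_ref, action, username, password,
                          wapi_version="v2.12"):
    """Build a POST request for the grid object's upgrade function.

    grid_ref is the _ref returned by a GET on the grid object.
    wapi_version is an assumption; use the version your Grid supports.
    """
    if action not in ACTIONS_SEEN_IN_LOGS:
        raise ValueError(f"unexpected action: {action}")
    url = f"https://{gm_host}/wapi/{wapi_version}/{grid_ref}?_function=upgrade"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"action": action}).encode(),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    request = build_upgrade_request(
        "gm.example.com", "grid/b25lLmNsdXN0ZXIkMA:Infoblox",
        "UPGRADE_TEST_START", "admin", "changeme")
    print(request.get_method(), request.full_url)
    print(request.data.decode())
```

Sending the request (for example with `urllib.request.urlopen`) is deliberately left out so the sketch can be reviewed and tested without touching a live Grid.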
2023-12-03 12:23:38.623Z daemon NOTICE httpd[] [username]: Called - GridUpgrade: Args action="DISTRIBUTION_START"
2023-09-28 12:14:27 BST daemon INFO distribute_upgrade_file[] Started Grid Master distribution
2023-09-28 12:14:27 BST daemon INFO distribute_upgrade_file[] Completed Grid Master distribution. Begin distribution on members.
2023-09-28 12:14:17 BST daemon INFO systemd[11702] mnt_storage.mount: Succeeded.
2023-09-28 12:14:17 BST daemon INFO systemd[1] mnt_storage.mount: Succeeded.
2023-09-28 12:14:17 BST user NOTICE debug_umount[] umount < 76320 /bin/bash /infoblox/common/bin/debug_umount /mnt_storage < 5686 /infoblox/one/bin/controld < 4887 /infoblox/one/bin/process_manager < 4574 /infoblox/one/bin/clusterd
2023-09-28 12:14:17 BST daemon INFO systemd[11702] mnt.mount: Succeeded.
2023-09-28 12:14:17 BST daemon INFO systemd[1] mnt.mount: Succeeded.
2023-09-28 12:14:17 BST user NOTICE debug_umount[] umount < 76307 /bin/bash /infoblox/common/bin/debug_umount /mnt < 5686 /infoblox/one/bin/controld < 4887 /infoblox/one/bin/process_manager < 4574 /infoblox/one/bin/clusterd
2023-09-28 12:14:17 BST user INFO distribute_upgrade_file[] Grid Distribution Completed
2023-09-28 12:14:14 BST daemon INFO INFOBLOX-Grid[]
Errors
2023-09-28 12:14:14 BST syslog CRITICAL root[] Upgrade check failed, SKI doesn't exist in CA-certificate subject=CN = Name of CA
2023-09-28 12:14:14 BST syslog CRITICAL root[] Grid not compatible with 9.0.4-52074-0a9ee839965f release due to unsupported hardware or incompatible configuration setting
2023-09-28 12:16:01 BST daemon INFO systemd[1] mnt.mount: Succeeded.
2023-09-28 12:16:01 BST daemon INFO systemd[11702] mnt.mount: Succeeded.
2023-09-28 12:16:01 BST user NOTICE debug_umount[] umount < 81317 /bin/bash /usr/bin/umount /mnt < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:16:01 BST daemon INFO systemd[1] mnt_storage.mount: Succeeded.
2023-09-28 12:16:01 BST daemon INFO systemd[11702] mnt_storage.mount: Succeeded.
2023-09-28 12:16:01 BST user NOTICE debug_umount[] umount < 81305 /bin/bash /usr/bin/umount /mnt_storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:16:01 BST daemon INFO systemd[1] mnt-nios-rootfs-storage.mount: Succeeded.
2023-09-28 12:16:01 BST daemon INFO systemd[11702] mnt-nios-rootfs-storage.mount: Succeeded.
2023-09-28 12:16:01 BST user NOTICE debug_umount[] umount < 81293 /bin/bash /usr/bin/umount /mnt/nios/rootfs/storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:16:01 BST daemon INFO systemd[11702] mnt-nios-rootfs-proc.mount: Succeeded.
2023-09-28 12:16:01 BST daemon INFO systemd[1] mnt-nios-rootfs-proc.mount: Succeeded.
2023-09-28 12:16:01 BST user NOTICE debug_umount[] umount < 81251 /bin/bash /usr/bin/umount /mnt/nios/rootfs/proc < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:16:01 BST daemon INFO systemd[1] mnt-nios-rootfs-sys.mount: Succeeded.
2023-09-28 12:16:01 BST daemon INFO systemd[11702] mnt-nios-rootfs-sys.mount: Succeeded.
2023-09-28 12:16:00 BST user NOTICE debug_mount[] mount < 81161 /bin/bash /usr/bin/mount -t proc proc /mnt/nios/rootfs/proc < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:15:59 BST user NOTICE debug_mount[] mount < 81126 /bin/bash /usr/bin/mount -o bind /mnt_storage /mnt/nios/rootfs/storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:15:59 BST kern kernel info [ 1843.313284] EXT4-fs (sda6): mounted filesystem with ordered data mode. Opts: (null)
2023-09-28 12:15:59 BST user NOTICE debug_mount[] mount < 81113 /bin/bash /usr/bin/mount -t ext4,ext3 /dev/sda6 /mnt_storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:15:59 BST kern kernel info [ 1843.249086] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
2023-09-28 12:15:59 BST user NOTICE debug_mount[] mount < 81088 /bin/bash /usr/bin/mount -t ext4,ext3 /dev/sda2 /mnt < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
2023-09-28 12:15:59 BST daemon NOTICE httpd[] 2023-09-28 11:15:59.739Z [username]: Called - GridUpgrade: Args action="UPGRADE_TEST_START"
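Most of the syslog output around a distribution or upgrade test is mount/umount noise. A small filter can pull out the two things that matter: which GridUpgrade actions were called and which upgrade checks failed. The sketch below assumes one log entry per line and relies only on the message texts shown in the excerpts above; the function name is made up.

```python
import re

ACTION = re.compile(r'GridUpgrade: Args action="(?P<action>[A-Z_]+)"')
CHECK_FAILED = re.compile(r"Upgrade check failed, (?P<reason>.+)$")
NOT_COMPATIBLE = re.compile(r"Grid not compatible with (?P<release>\S+) release")


def summarize(lines):
    """Collect upgrade actions, failed checks and rejected releases from syslog lines."""
    summary = {"actions": [], "failures": [], "rejected_releases": []}
    for line in lines:
        line = line.rstrip()
        if (match := ACTION.search(line)):
            summary["actions"].append(match.group("action"))
        if (match := CHECK_FAILED.search(line)):
            summary["failures"].append(match.group("reason"))
        if (match := NOT_COMPATIBLE.search(line)):
            summary["rejected_releases"].append(match.group("release"))
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python3 upgrade_summary.py < exported_syslog.txt
    for key, values in summarize(sys.stdin).items():
        print(key, values)
```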