
NIOS Upgrade

First Rule Of Upgrading NIOS - Read the release notes. Then read them again. We cannot stress this single point enough.

Official upgrade documentation here.

Notes

Remember, upgrading NIOS will empty the recycle bin.

NOTE: When you have installed a hotfix bundle/collective hotfix (e.g. CHF 8.6.3.2), make sure you perform a product restart (or full reboot) on the systems to ensure the fix is fully implemented. If you forget and then try to upgrade to another version of NIOS (e.g. 9.0.1), you can run into issues (albeit very unlikely).

You may need to increase the session time out limit for your user account if you are having issues uploading code to the GM prior to an upgrade. If the time out limit is too low, the time out can break the upload.

Support Articles:

https://support.infoblox.com/s/article/9179 (Step by step explanation of NIOS upgrade process)
https://support.infoblox.com/s/article/Checklists-before-attempting-a-NIOS-upgrade-activity
https://support.infoblox.com/s/article/89 (Upgrade Groups)

NOTE: When NIOS upgrades, the recycle bin is emptied.

NOTE: Do not leave the default group empty. If you do, the upgrade may never (officially) finish.

Upgrade Plan

NEVER press the “Upgrade” button. Always use Upgrade Schedules because that is the only way to use the upgrade groups (and ensure service continuity).
ALWAYS check to see if any member of the Grid has a hotfix installed. Hotfixes are not copied over during upgrade. You will need to check whether the new code has the fix 'as standard' or whether you need to reapply the same hotfix or install a new one (check with support). (show version and show upgrade_history)
Recycle bin will empty at upgrade so check it and see if anything is needed
Make an external Grid backup before the update
Take support bundle of the GM just before the actual upgrade
Check Upgrade Groups (membership and order)
Check status of any vDiscovery tasks (if used)
Check status of NTP
Check DHCP FO state (if DHCP used)
Check CPU and RAM (RAM usage will 'appear' to increase when going from 8.6 to 9.0 because page files are represented as used RAM).
If you use the DNS Forwarding Proxy (DFP) or you have linked the GM/GMC to the Infoblox Portal, make sure that they are all showing as healthy in the Infoblox Portal. If they are not healthy, there may be a communication problem and that may cause problems after upgrade.
Check your account on the Infoblox Support Portal. Make sure that the phone number listed is correct and works internationally. In many cases, support try and contact you on this number but can't get through because the number is listed incorrectly.
Check reporting server to see what the current usage trends are (e.g. is DNS traffic distributed equally across all DNS servers, etc)
If you are using ILOM, check that it works. (Physical appliances only)
Raise a preemptive support ticket. Also upload a small file to show that you can (some customers have traffic inspection security systems that can interfere with the upload mechanism)
Read release notes - SERIOUSLY, read them carefully. This is where you will find details of changes to default behaviour, notes for upgrades, etc. Official upgrade documentation is now here.
Have a set of tests for validating services before and after upgrade (e.g. DNS recursion, DHCP, etc)
Where possible, upload, distribute and test the upgrade BEFORE the actual change window. Ideally two or more weeks before the upgrade window if you have a lot of process for change control. This gives you time to get support for any issues with those steps. This reduces the risk of issues impacting the upgrade. (e.g. code refusing to distribute because of a configuration error or the test failing because of a configuration error). Infoblox users have had change windows run out of time when they encountered issues at the Distribute or Test stage and didn't have enough time to get to the root of the problem and fix it (which meant having to schedule another change window).
If you are running any Grid member as a virtual appliance, make sure that you have access to the console of that VM (e.g. VMware, AWS, etc). If you do not have access, make sure you know who does and that they are available during the upgrade window. Scenario: member goes down for a reboot after upgrade and doesn't come back. You will need console access to see what is wrong and engage support (and/or just reboot the appliance). What if you have to re-deploy the VM, do you know how?
If you are running any Grid member as a physical appliance, make sure that you know exactly where it is physically located (site, room, rack, U, etc). Make sure you have easy access to it (e.g. pre-request a data center access pass 'just-in-case') or make sure you know what local-hands are available to access the device. e.g. if it doesn't come back after a reboot, physically rebooting may be necessary, and using a console cable to read off the console may also be necessary.
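
The "set of tests for validating services" item above can be as simple as a script that records lookups before the upgrade and compares them afterwards. The following is a minimal sketch under stated assumptions, not an Infoblox tool: the function names are hypothetical, and it uses the system resolver, so it tests whichever DNS servers the host running it is configured to use rather than a specific Grid member.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional

Answer = Optional[List[str]]


def resolve(name: str) -> Answer:
    """Resolve a name via the system resolver; None means the lookup failed."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return None
    return sorted({info[4][0] for info in infos})


def snapshot(names: Iterable[str],
             resolver: Callable[[str], Answer] = resolve) -> Dict[str, Answer]:
    """Record the current answer for each test name."""
    return {name: resolver(name) for name in names}


def regressions(before: Dict[str, Answer], after: Dict[str, Answer]) -> List[str]:
    """List names that worked before the upgrade but fail or differ now."""
    problems = []
    for name, old in before.items():
        new = after.get(name)
        if old is not None and new is None:
            problems.append(f"{name}: resolved before upgrade, fails now")
        elif old is not None and new != old:
            problems.append(f"{name}: answer changed from {old} to {new}")
    return problems
```

Run snapshot() against a list of representative names before the change window, keep the result, and run it again after the upgrade; an empty list from regressions() means the sampled lookups behave as before. DHCP and other services need their own checks.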

Downgrades

The downgrade procedure is for single independent appliances only. Infoblox does not support software downgrades for Grid members, but you can revert to the previous NIOS release on a Grid Master.

After you complete the downgrade procedure, all data in the database is lost. The downgrade process does not preserve data but does preserve license information and basic network settings.

Upgrades to NIOS 9.0

Remember, when 9.0.3 has been distributed, you can then install the CHF2 for 9.0.3 to install it on the new partition. This means that CHF3 will be installed on 9.0.3 as soon as 9.0.3 is installed. Verify with

show upgrade_history

You should install Hotfix-NIOS-98022 BEFORE upgrading to NIOS 9.0 (but AFTER distribution of NIOS 9.0.x code) to ensure that all OpenVPN connections (Grid communication) are using a correct certificate. Failure to do this can result in members going offline (not connecting to GM) and/or GM entering a reboot loop. From NIOS 9.0.6 onwards, Upgrade Test and Upgrade will fail if OpenVPN certificates are not correct. More details here.

In NIOS 9.0 and higher, if you use LDAP authentication and you need the LDAP connection to egress the MGMT interface, you must put a static route on the NIOS box to force the traffic to use the MGMT interface. This is because in NIOS 9.0.0, LDAP requests to the LDAP server and Active Directory server cannot be sent using the MGMT IP address, because OpenLDAP version 2.4.49 (Ubuntu) removed the options of binding the source IP address on the client. Therefore, an LDAP request or an Active Directory authentication request is always sent through the LAN IP address, even though you have enabled the Connect through Management Interface option.

For 9.0.x upgrades, read the release notes very carefully. Specifically, this is paraphrased from the “Upgrade Guidelines” section:

For NIOS members in Azure, you MUST disable “Accelerated Networking” for the member in the Azure portal (or the member will lose network connectivity after upgrade), but only if going to NIOS 9.0.0-9.0.4. This is because NIOS 9.0.5 fixes the problem by introducing proper support for ADP. KB Article
Root and Subordinate CA certificates MUST have Subject Key Identifier field
Root and Subordinate CA certificates MUST have basicConstraints marked as a critical extension (RFC 5280 - 4.2.1.5)
Root and Subordinate CA certificates MUST have the keyUsage extension field (RFC 5280)
Root and Subordinate CA certificates MUST NOT have md5WithRSAEncryption or sha1WithRSAEncryption ciphers
HTTPS certificates and CA certificates must be “in date” according to the certificates' “Not Before” and “Not After” dates.
HTTPS certificates and CA certificates cannot use MD5 hash or SHA1 hash.
HTTPS certificates and CA certificates must have key size of 2048 or higher.
If you use SAML to access NIOS, read this documentation page. Previously, in NIOS 8.x releases ?acs formatting was used for IP and FQDN. Now in NIOS 9.x the /metadata must be used for FQDN.

HOWEVER, remember that, if you are using the reporting server, you can't have your HTTPS certificates use a greater key size than 2048 because of a limitation of Java.
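
The certificate requirements above lend themselves to a pre-flight check. The sketch below is illustrative only: it assumes you have already extracted the relevant fields from each CA certificate with a tool of your choice, and the dictionary keys (has_ski, basic_constraints_critical, and so on) are hypothetical names, not anything NIOS produces.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Signature algorithms the release notes say CA certificates must not use.
WEAK_SIGNATURES = {"md5WithRSAEncryption", "sha1WithRSAEncryption"}


def ca_cert_problems(cert: Dict, now: datetime) -> List[str]:
    """Return the listed requirements that this CA certificate fails."""
    problems = []
    if not cert.get("has_ski"):
        problems.append("missing Subject Key Identifier")
    if not cert.get("basic_constraints_critical"):
        problems.append("basicConstraints not marked critical")
    if not cert.get("has_key_usage"):
        problems.append("missing keyUsage extension")
    if cert.get("signature_algorithm") in WEAK_SIGNATURES:
        problems.append("weak signature algorithm")
    if cert.get("key_bits", 0) < 2048:
        problems.append("key smaller than 2048 bits")
    if not (cert["not_before"] <= now <= cert["not_after"]):
        problems.append("outside Not Before / Not After dates")
    return problems
```

An empty list means the certificate passes the checks sketched here; the NIOS upgrade test remains the authority on what actually blocks an upgrade.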

You must remove DNS Unbound for all external syslog servers if configured explicitly because NIOS 9.0 does not support unbound.

If you are running a NIOS member in Azure, and if you have “Accelerated Networking” (Azure’s SRIOV) enabled for any of the network interfaces of the member at an Azure level (configured in Azure portal - NOT the NIOS UI) then you MUST disable “Accelerated Networking” IF you are installing NIOS 9.0.0, 9.0.1, 9.0.2, 9.0.3, or 9.0.4. Accelerated Networking / SRIOV introduces extra NICs to the VM - NICs that have the same MAC address as the regular NICs. The duplicate MAC addresses cause issues for NIOS 9.0.3, so that NIOS can no longer map the interfaces. NIOS 8.6.x instances don't have the Azure SRIOV NICs so the extra NICs do not show up and cause issues. NIOS 9.0.5 fixes this problem.

If your NIOS is authoritative DNS and uses DNSSEC to sign those zones, you MUST get rid of the following algorithms/ciphers

RSAMD5(1),
DSA(3),
DSA-NSEC3-SHA1(6)
digest type SHA-1(1) for DS record.

Switch to algorithms

RSASHA1(5),
RSASHA1-NSEC3-SHA1(7),
RSASHA256(8), (recommended)
RSASHA512(10),
ECDSAP256SHA256(13), (recommended)
ECDSAP384SHA384(14) for child zones
Use digest type SHA-256(2) for generating DS records.

(ISC has removed support for algorithms 1(RSA-MD5), 3(DSA), and 6(DSA-NSEC3-SHA1) in BIND 9.16)
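
If you have an inventory of signed zones and the DNSKEY algorithm numbers they use, flagging the ones that need re-signing is a simple lookup. This is a minimal sketch: the input mapping is an assumed export format, and the zone names in the usage note are placeholders.

```python
from typing import Dict, List

# Algorithm numbers that must be removed before upgrading, per the list above.
MUST_REMOVE = {1: "RSAMD5", 3: "DSA", 6: "DSA-NSEC3-SHA1"}


def zones_to_resign(zone_algorithms: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Map each affected zone to the unsupported algorithms it still uses."""
    flagged = {}
    for zone, algorithms in zone_algorithms.items():
        bad = [MUST_REMOVE[number] for number in algorithms if number in MUST_REMOVE]
        if bad:
            flagged[zone] = bad
    return flagged
```

For example, zones_to_resign({"old.example": [3, 8]}) reports that old.example still uses DSA and needs moving to one of the supported algorithms.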

https://datatracker.ietf.org/doc/html/rfc8624#section-3.1

The following table lists the implementation recommendations for DNSKEY algorithms

RSAMD5 – MUST NOT be used for signing or validation
DSA – MUST NOT be used for signing or validation
DSA-NSEC3-SHA1 – MUST NOT be used for signing or validation

https://datatracker.ietf.org/doc/html/rfc8624#section-3.3

SHA-1 – MUST NOT be used for DNSSEC delegation (generating new DS records); validators MUST still implement it for validation

SHA-1 is still widely used for Delegation Signer (DS) records, so validators MUST implement validation, but it MUST NOT be used to generate new DS and CDS records (see “Operational Considerations” for caveats when upgrading from the SHA-1 to SHA-256 DS algorithm.)

As per this document, NIOS 9.0.1+ changes the way NIOS accounts for Available memory: DB Cache/Huge Pages are now excluded from Available memory. This results in a more accurate calculation of the Used Memory in comparison to previous versions, although it might appear that memory usage is now higher.

Infoblox utilizes Huge Pages for Database (DB) Cache. Before NIOS 9.0.1, these DB Cache/Huge Pages were treated as Cached and were included in the Available memory calculation, resulting in the Available memory being high, although the huge pages were being used by the DB Cache.
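
A worked example makes the apparent jump easier to explain to whoever watches the monitoring graphs. The model below is a deliberate simplification (available = free + cached, with the DB cache counted inside cached), and the figures in the usage note are invented for illustration, not taken from a real appliance.

```python
def used_percent(total_gib: float, free_gib: float, cached_gib: float,
                 db_cache_gib: float, exclude_db_cache: bool) -> float:
    """Used memory as a percentage of total under a simplified model.

    cached_gib includes the DB cache (huge pages). When exclude_db_cache is
    True, the DB cache is no longer counted as available, mirroring the
    9.0.1+ behaviour described above.
    """
    available = free_gib + cached_gib
    if exclude_db_cache:
        available -= db_cache_gib
    return 100.0 * (total_gib - available) / total_gib
```

With 32 GiB total, 4 GiB free and 20 GiB cached (12 GiB of which is DB cache), the old accounting reports 25% used and the new accounting reports 62.5% used, even though nothing about the workload has changed.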

When upgrading to NIOS 9.0.2 or higher, it won't work if you have imported Trusted Root CA's that don't have the SKI field. e.g. Root CA's generated on a Palo Alto Networks firewall won't have this field but certificates generated on a Linux CLI will. You will get the following error on the console and in syslog. The solution is to delete the trusted root CA (note: you don't have to replace live HTTPS certificates even if those certificates are signed by CA with no SKI)

2023-12-02 14:38:45 GMT
	syslog
	CRITICAL
	root[2179522]
	Upgrade check failed, SKI doesn't exist in CA-certificate
		
		
2023-12-02 14:38:45 GMT
	syslog
	CRITICAL
	root[2179557]
	Grid not compatible with 9.0.3-50212-ee11d5834df9 release due to unsupported hardware or incompatible configuration setting

After distributing the code, I then found the following:

CHECKING RFC 5280 compliance for /storage/etc/security/certs/aaa_ca_cert.pem
Upgrade check failed, certificate violates RFC 5280
serial=44D33399A222D0AAA4FFF1118666444CCCEEE111 : Root and Subordinate CA certificate keyUsage extension MUST be present.
CA certificate check failed test upgrade failure

RFC 5280 is here. In my case, I told NIOS to regenerate a self-signed cert for the GUI to allow me to get past the issue.

Another RFC 5280 issue encountered is

CHECKING RFC 5280 compliance for /storage/etc/security/certs/aaa_ca_cert.pem
Upgrade check failed, certificate violates RFC 5280
serial=44D33399A222D0AAA4FFF1118666444CCCEEE111 : basicConstraints MUST appear as a critical extension.
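
When several certificates fail at once, it helps to pull the failures out of the test output into a list. This sketch assumes the output has exactly the shape of the two samples above (a "CHECKING RFC 5280 compliance for" line followed by "serial=... : reason" lines); other NIOS versions may word it differently.

```python
import re
from typing import List, Tuple

CHECKING = re.compile(r"^CHECKING RFC 5280 compliance for (\S+)")
FAILURE = re.compile(r"^serial=([0-9A-Fa-f]+)\s*:\s*(.+)$")


def rfc5280_failures(output: str) -> List[Tuple[str, str, str]]:
    """Return (certificate file, serial, reason) for each reported failure."""
    current_file = "unknown"
    failures = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        checking = CHECKING.match(line)
        if checking:
            current_file = checking.group(1)
            continue
        failure = FAILURE.match(line)
        if failure:
            failures.append((current_file, failure.group(1), failure.group(2)))
    return failures
```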

The following command is available from NIOS 9.0 onwards

set disable_strict_ca_cert_check

show strict_ca_cert_check

You may also find that the HTTPS certificate that came with NIOS has expired.

(on NIOS 8.6.2)
crit Upgrade check failed, Apache certificate has expired on Jun 12 10:40:50 2023 GMT
crit Grid not compatible with 9.0.3-50212-ee11d5834df9 release due to unsupported hardware or incompatible configuration

You can recreate this certificate or go to 8.6.4 first and then 9.0.3.

Note that NIOS 8.6.2 generates a SHA-256/RSA 2048-bit certificate that is valid for 365 days and has the FQDN and IP of the GM in the SAN field.

Note: 9.0.5 introduces changes to Threat Insight. If you use Threat Insight, you may see some CPU usage increase.

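OpenSSH version (and, where noted, Ubuntu base release) per NIOS release:

8.6 = OpenSSH 7.7p1
9.0.3 = Ubuntu 20.04 with OpenSSH 8.2p1
9.0.4 = Ubuntu 22.04 with OpenSSH 8.9p1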

NIOS Hotfix

When applying a hotfix, support will send you a zip with instructions, a hotfix file and a “revert” hotfix file.

Go to Grid > Upgrade and then select the member you want to apply the hot fix to. Then click the down arrow by “Apply Hotfix” and choose “To selected Grid Members”. You then get to select and upload the hotfix file. Uploading the hotfix file causes it to be installed more or less instantly. Good practice is to then manually reboot the appliance. You will then see the hot fix details listed in the Hotfix column.

For primary and secondary DNS servers, the zones will stay in sync (notifications and updates) while the members are on different versions. However, you will not be able to manually create records while two different NIOS versions are running in one Grid.

Set the default “Revert Window Time” from 24 hours to 1 hour by running in Grid Master the command “set default_revert_window 1”

Upgrade Timings

Putting each member in its own group helps with sequencing (what triggers next box in a sequence)

Upgrade policy timer is hard coded as 10 minutes. When upgrade process starts for a member in an upgrade group, grid master starts a timer for 10 minutes and prints the following debug log:

[2017/07/05 07:55:14.803] (16277 /infoblox/one/bin/clusterd) upgrade_policy.c:478 __should_wait_for_upgrade_set(): Start upgrade policy timer 10 mins to x.x.x.x ## x.x.x.x is the member IP address ##

If the member completes upgrade and joins successfully within 10 minutes, grid master logs the following and then proceeds with next member in the current upgrade group or to the next upgrade group:

[2017/07/05 08:00:13.739] (16277 /infoblox/one/bin/clusterd) upgrade_policy.c:1010 cd_upgrade_node_upgrade_timeout(): Node x.x.x.x finished upgrade

However, if the member does not complete upgrade within 10 minutes, grid master skips this member and then proceeds with next member in the current upgrade group or to the next upgrade group:

[2017/07/05 08:05:14.803] (16277 /infoblox/one/bin/clusterd) upgrade_policy.c:1005 cd_upgrade_node_upgrade_timeout(): Timeout waiting for x.x.x.x to upgrade ## x.x.x.x is the member IP address ##
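
To see after the fact which members finished inside the 10 minute timer and which were skipped, the three debug messages above can be paired up per member. This is a sketch that assumes the log lines keep exactly the format shown; the function name is hypothetical.

```python
import re
from datetime import datetime
from typing import Dict

STAMP = r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]"
START = re.compile(STAMP + r".*Start upgrade policy timer 10 mins to (\S+)")
DONE = re.compile(STAMP + r".*Node (\S+) finished upgrade")
TIMEOUT = re.compile(STAMP + r".*Timeout waiting for (\S+) to upgrade")


def _timestamp(text: str) -> datetime:
    return datetime.strptime(text, "%Y/%m/%d %H:%M:%S.%f")


def upgrade_outcomes(log_text: str) -> Dict[str, str]:
    """Map each member IP to 'finished in Ns' or 'timed out'."""
    started: Dict[str, datetime] = {}
    outcomes: Dict[str, str] = {}
    for line in log_text.splitlines():
        match = START.search(line)
        if match:
            started[match.group(2)] = _timestamp(match.group(1))
            continue
        match = DONE.search(line)
        if match and match.group(2) in started:
            elapsed = _timestamp(match.group(1)) - started[match.group(2)]
            outcomes[match.group(2)] = f"finished in {elapsed.total_seconds():.0f}s"
            continue
        match = TIMEOUT.search(line)
        if match:
            outcomes[match.group(2)] = "timed out"
    return outcomes
```

Members reported as "timed out" were skipped by the Grid Master and are worth checking first.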

Upgrade Revert Window

Scheduled upgrades can occur over a period of up to 9 days, as per the docs.

Did we ever get a reason why this revert window even exists? What does it do differently than a regular revert?

I believe it holds the grid in that “you can only change some things” interim state (so that it can work with the members at different revs) to allow for the per-member revert. The revert window allows you to do single node reverts.

The revert window is applied to each individual member. After the member completes its upgrade, it has 24 hours (default) to 48 hours (max) to revert (there is still data loss). The “individual revert” is intended for troubleshooting and less disruption (instead of reverting the entire Grid).

Then there's reverting the entire Grid. There's no time limit for this: everything reboots and boots back into the old partition (old software), and all changes are lost.
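
Because the window runs per member from the moment that member finishes upgrading, it is worth writing down each member's deadline during a staggered upgrade. A minimal sketch, using the 24 hour default and 48 hour maximum quoted above; the 1 hour lower bound is an assumption based on the "set default_revert_window 1" example elsewhere on this page.

```python
from datetime import datetime, timedelta

DEFAULT_HOURS = 24
MAX_HOURS = 48
MIN_HOURS = 1  # assumption: smallest value used on this page


def revert_deadline(upgrade_completed: datetime,
                    window_hours: int = DEFAULT_HOURS) -> datetime:
    """Latest time at which this member can still be individually reverted."""
    if not MIN_HOURS <= window_hours <= MAX_HOURS:
        raise ValueError(f"window must be {MIN_HOURS}-{MAX_HOURS} hours")
    return upgrade_completed + timedelta(hours=window_hours)
```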

set grid_upgrade forced_upgrade

shortens the upgrade window (“everybody forget about the schedule and upgrade now!”)

Force Upgrade End

Do not use this command unless you know what you are doing and you have engaged Infoblox technical support.

set grid_upgrade forced_end

Upgrade Groups

Make sure that the default group has at least one member associated with it, otherwise the appliance displays that the upgrade process is still in progress even though it is complete. To avoid this, you can either use the Infoblox > set grid_upgrade forced_end command to stop the upgrade process or keep at least one member in the default group.

Note: Using the command will force all upgrade groups to end the upgrade immediately. All members of incomplete groups will be logged off the Grid to perform an auto-sync of software with the Grid. This operation should only be used in an emergency situation to end a scheduled upgrade, as it will result in a member service outage until the operation is completed.

Automating Upgrades

Upgrades can be automated via API.

Ideally, define your upgrade & distribution groups in advance, then you just

Upload image to Grid
Set distribution schedule
Set upgrade schedule
Monitor it

You can upload the image via fileop via uploadinit and set_upgrade_file
You can directly call grid functions upgrade, upgrade_group_now, member_upgrade
There's the ability to set/run etc… distribution_schedule
There's the ability to set/run upgradeschedule
You can even “monitor” the progress via WAPI
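
As a starting point, the upload steps can be expressed as WAPI URLs before any HTTP client is involved. This is a sketch under stated assumptions: the host name and WAPI version are placeholders, only the fileop function names mentioned above are used, and the exact request bodies (for example the token that uploadinit returns and set_upgrade_file expects) should be confirmed against the WAPI documentation for your NIOS version.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode


def wapi_url(host: str, version: str, obj: str,
             function: Optional[str] = None) -> str:
    """Build a WAPI URL, optionally calling a function on the object."""
    base = f"https://{host}/wapi/v{version}/{obj}"
    return base + ("?" + urlencode({"_function": function}) if function else "")


def upload_plan(host: str, version: str) -> List[Tuple[str, str]]:
    """HTTP method and URL for the two fileop calls around the image upload."""
    return [
        # Step 1: ask the Grid Master where to upload the image.
        ("POST", wapi_url(host, version, "fileop", "uploadinit")),
        # Step 2 (after uploading the file to the location from step 1):
        # register the uploaded image as the upgrade file.
        ("POST", wapi_url(host, version, "fileop", "set_upgrade_file")),
    ]
```

Keeping URL construction separate from the HTTP calls makes it easy to review exactly what a script will hit before pointing it at a production Grid Master.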

Logs Generated When Distributing Code

2023-12-03 12:23:38.623Z
	daemon
	NOTICE
	httpd[]
	[username]: Called - GridUpgrade: Args action="DISTRIBUTION_START"

2023-09-28 12:14:27 BST
	daemon
	INFO
	distribute_upgrade_file[]
	Started Grid Master distribution

2023-09-28 12:14:27 BST
	daemon
	INFO
	distribute_upgrade_file[]
	Completed Grid Master distribution. Begin distribution on members.

2023-09-28 12:14:17 BST
	daemon
	INFO
	systemd[11702]
	mnt_storage.mount: Succeeded.
 
2023-09-28 12:14:17 BST
	daemon
	INFO
	systemd[1]
	mnt_storage.mount: Succeeded.

2023-09-28 12:14:17 BST
	user
	NOTICE
	debug_umount[]
	umount < 76320 /bin/bash /infoblox/common/bin/debug_umount /mnt_storage < 5686 /infoblox/one/bin/controld < 4887 /infoblox/one/bin/process_manager < 4574 /infoblox/one/bin/clusterd

2023-09-28 12:14:17 BST
	daemon
	INFO
	systemd[11702]
	mnt.mount: Succeeded.

2023-09-28 12:14:17 BST
	daemon
	INFO
	systemd[1]
	mnt.mount: Succeeded.

2023-09-28 12:14:17 BST
	user
	NOTICE
	debug_umount[]
	umount < 76307 /bin/bash /infoblox/common/bin/debug_umount /mnt < 5686 /infoblox/one/bin/controld < 4887 /infoblox/one/bin/process_manager < 4574 /infoblox/one/bin/clusterd

2023-09-28 12:14:17 BST
	user
	INFO
	distribute_upgrade_file[]
	Grid Distribution Completed

2023-09-28 12:14:14 BST
	daemon
	INFO
	INFOBLOX-Grid[]

Errors

2023-09-28 12:14:14 BST
	syslog
	CRITICAL
	root[]
	Upgrade check failed, SKI doesn't exist in CA-certificate subject=CN = Name of CA

2023-09-28 12:14:14 BST
	syslog
	CRITICAL
	root[]
	Grid not compatible with 9.0.4-52074-0a9ee839965f release due to unsupported hardware or incompatible configuration setting

Logs Generated When Testing Upgrade

2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[1]
	mnt.mount: Succeeded.
 	
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[11702]
	mnt.mount: Succeeded.
 
	
2023-09-28 12:16:01 BST
	user
	NOTICE
	debug_umount[]
	umount < 81317 /bin/bash /usr/bin/umount /mnt < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
 
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[1]
	mnt_storage.mount: Succeeded.
 	
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[11702]
	mnt_storage.mount: Succeeded.
 	
	
2023-09-28 12:16:01 BST
	user
	NOTICE
	debug_umount[]
	umount < 81305 /bin/bash /usr/bin/umount /mnt_storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
 
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[1]
	mnt-nios-rootfs-storage.mount: Succeeded.
 
		
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[11702]
	mnt-nios-rootfs-storage.mount: Succeeded.
 	
2023-09-28 12:16:01 BST
	user
	NOTICE
	debug_umount[]
	umount < 81293 /bin/bash /usr/bin/umount /mnt/nios/rootfs/storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[11702]
	mnt-nios-rootfs-proc.mount: Succeeded.
	
2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[1]
	mnt-nios-rootfs-proc.mount: Succeeded.

2023-09-28 12:16:01 BST
	user
	NOTICE
	debug_umount[]
	umount < 81251 /bin/bash /usr/bin/umount /mnt/nios/rootfs/proc < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file

2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[1]
	mnt-nios-rootfs-sys.mount: Succeeded.

2023-09-28 12:16:01 BST
	daemon
	INFO
	systemd[11702]
	mnt-nios-rootfs-sys.mount: Succeeded.

2023-09-28 12:16:00 BST
	user
	NOTICE
	debug_mount[]
	mount < 81161 /bin/bash /usr/bin/mount -t proc proc /mnt/nios/rootfs/proc < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file

2023-09-28 12:15:59 BST
	user
	NOTICE
	debug_mount[]
	mount < 81126 /bin/bash /usr/bin/mount -o bind /mnt_storage /mnt/nios/rootfs/storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file

2023-09-28 12:15:59 BST
	kern
	kernel
	info [ 1843.313284] EXT4-fs (sda6): mounted filesystem with ordered data mode. Opts: (null)

2023-09-28 12:15:59 BST
	user
	NOTICE
	debug_mount[]
	mount < 81113 /bin/bash /usr/bin/mount -t ext4,ext3 /dev/sda6 /mnt_storage < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file

2023-09-28 12:15:59 BST
	kern
	kernel
	info [ 1843.249086] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)

2023-09-28 12:15:59 BST
	user
	NOTICE
	debug_mount[]
	mount < 81088 /bin/bash /usr/bin/mount -t ext4,ext3 /dev/sda2 /mnt < 81029 /bin/bash /infoblox/one/bin/test_upgrade /storage/upgrade_bin_file

2023-09-28 12:15:59 BST
	daemon
	NOTICE
	httpd[]
	2023-09-28 11:15:59.739Z [username]: Called - GridUpgrade: Args action="UPGRADE_TEST_START"
