NIOS DNS
Authoritative zones are queryable from anywhere by default, even if the ACL is set to “none”. Recursion behaves differently.
When NIOS is the primary server for a large number of secondary servers (e.g. 40 secondary servers), it can be better to make the NIOS box that the other (e.g. BIND) devices connect to a “transfer hub”. Not an official term, but specifically we mean that it gets its own copy of the zone data from the primary DNS server via zone transfer rather than Grid transfer. This is because zone-transfer data is stored in RAM and not in the database.
Zone transfers of a zone are automatically allowed from the IP addresses of the NS servers specified in that zone, i.e. if you have external BIND servers that are named as external secondaries, those IP addresses can automatically perform zone transfers.
Blacklists - Don't require extra licences
NXDomain - Requires “Query Redirection” licence.
Blacklists - create a blacklist definition and then import the data for the list from CSV. Don't bother adding a “_new_domain_name” column, as the redirect IP is set under Grid DNS Properties > Blacklist.
header-blacklistrule,parent*,domain_name*,action*
BlacklistRule,my-blacklist,allowed.foo.com,PASS
BlacklistRule,my-blacklist,foo.com,REDIRECT
BlacklistRule,my-blacklist,bar.com,REDIRECT
NXDomain - Install licence
- Pass: The DNS member resolves the query and forwards the response to the DNS client, even if it is an NXDOMAIN response.
- Modify: The DNS member resolves the query and forwards the response to the DNS client, only if it is not an NXDOMAIN response. But if the member receives an NXDOMAIN response, it sends the client a synthesized response that includes predefined IP addresses.
- Redirect: The DNS member does not resolve the query. Instead, it sends the client a synthesized response that includes predefined IP addresses.
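As an illustration of how the three actions differ, here is a toy model. This is not a NIOS API; the function name and the redirect IP are invented for the sketch.

```python
# Illustrative model of the three NXDOMAIN-redirection actions
# (Pass / Modify / Redirect). Hypothetical names, not a NIOS API.

REDIRECT_IPS = ["10.0.0.53"]  # predefined IPs in the synthesized response

def handle_query(action, upstream_answer):
    """upstream_answer is a list of IPs, or None for an NXDOMAIN result."""
    if action == "Redirect":
        # Never resolves the query upstream: always synthesize.
        return REDIRECT_IPS
    # Pass and Modify both resolve the query first.
    if upstream_answer is None:  # NXDOMAIN
        return None if action == "Pass" else REDIRECT_IPS
    return upstream_answer

print(handle_query("Pass", None))             # None (NXDOMAIN passed through)
print(handle_query("Modify", None))           # ['10.0.0.53'] (synthesized)
print(handle_query("Modify", ["1.2.3.4"]))    # ['1.2.3.4'] (real answer kept)
print(handle_query("Redirect", ["1.2.3.4"]))  # ['10.0.0.53'] (always synthesized)
```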
Concurrent Queries
“Limit number of recursive clients to”. The max setting is 40,000 regardless of model type. This is very high. If you are hitting logs saying that your limit is exceeded and your limit is over 15,000, something else is likely wrong unless you are an ISP.
Docs here.
PTR Reverse Zones
Unless you are delegating authority for a specific child PTR domain, keep everything in one PTR zone. Build all the parent-level RFC 6303 zones internally and no child zones.
A clean reverse space is self-documenting and easier to troubleshoot.
Don't go below /16 for reverse zones. Unless you are delegating the space to another administrator on other systems, there is nothing to gain by going down to the /24 level.
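The octet-boundary logic above maps a prefix directly onto a reverse zone name. A quick stdlib-only sketch (the function name is my own) of which in-addr.arpa zone an octet-aligned IPv4 prefix implies:

```python
import ipaddress

def reverse_zone(cidr):
    """Return the in-addr.arpa zone name for an octet-aligned IPv4 prefix."""
    net = ipaddress.ip_network(cidr)
    if net.prefixlen % 8 != 0:
        raise ValueError("reverse zones cut cleanly only on /8, /16, /24 boundaries")
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"

print(reverse_zone("10.20.0.0/16"))   # 20.10.in-addr.arpa
print(reverse_zone("10.20.30.0/24"))  # 30.20.10.in-addr.arpa
```

Keeping everything at the /16 level means one zone (e.g. 20.10.in-addr.arpa) instead of up to 256 /24 child zones.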
NOTE: For some NIOS customers, the reverse space becomes “dirty” any time they create a network and the wizard gives them the option to build the reverse zone as well. Often they don't think about this, or even understand the question, and select the option. This can create the odd issue of black-holing portions of the reverse space (because these sub-zones are often not associated with ANY name servers).
On Microsoft there is a Group Policy to make Windows devices register dynamically the PTR record in addition to A record.
Local Computer Policy > Computer Configuration > Administrative Templates > Network > DNS Client > Register PTR Records
Performance
Things that can massively impact the performance of a DNS server:
- QPS (queries per second) goes up. More queries to process.
- CHR (cache hit ratio) goes down. Queries are not being answered from cache/memory, so more work is needed to resolve them.
One thing that decreases CHR is when you add extra DNS suffixes to endpoints. Endpoints shouldn't have more than three DNS suffixes and never more than five.
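The suffix problem above compounds because an unqualified, non-existent name is retried against every suffix in the search list. A simplified model (real resolvers also apply ndots rules, so treat this as a sketch):

```python
# Why long suffix search lists hurt CHR: each lookup of an unqualified
# name fans out into one query per suffix, mostly producing NXDOMAINs.

def queries_generated(name, suffixes):
    if name.endswith("."):  # fully qualified: a single query
        return [name]
    return [f"{name}.{s}" for s in suffixes] + [name]

suffixes = ["corp.example", "dc1.corp.example", "dc2.corp.example",
            "lab.example", "old.example"]  # five suffixes: already too many
print(len(queries_generated("intranet", suffixes)))  # 6 queries for one lookup
```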
Cache Layer vs Auth Layer
Back in the old days, the auth layer would be separated from cache because auth config files would be updated often but caching config wouldn't (very little config to change). Thus, a bad config update to auth would break not just auth but also cache. By having them on separate appliances, you could make “most” DNS resolution more bulletproof against auth misconfiguration. (E.g. back in the very early days of GSLB in the industry, it was possible to add a tab character in the GSLB FQDN, which then broke BIND.)
DNSSEC
Notes on DNSSEC validation when forwarding from one NIOS box to another DNS server.
Return Minimal Responses
Grid > Data Management > DNS > Members/Servers > [Edit Member] > General > “Return minimal responses”
The option “Return Minimal Responses” should generally be disabled for external facing DNS servers.
It has been seen that enabling “Return Minimal Responses” can cause issues when Microsoft clients query NIOS, which has a forwarder to a Microsoft Active Directory domain controller.
This means it returns:
;; ANSWER SECTION:
_mssms_mp_swa._tcp.domain.internal.local. 14400 IN SRV 0 0 80 domaincontrollerhostname.domain.internal.local.
Instead of:
;; ANSWER SECTION:
_mssms_mp_swa._tcp.domain.internal.local. 14400 IN SRV 0 0 80 domaincontrollerhostname.domain.internal.local.

;; ADDITIONAL SECTION:
domaincontrollerhostname.domain.internal.local. 1200 IN A 1.2.3.4
domaincontrollerhostname.domain.internal.local. 1200 IN AAAA 2002:2002:2002::2002:2002
That extra bit is needed by the Microsoft clients so “Return Minimal Responses” had to be disabled.
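A toy model of the effect (this is an illustration, not NIOS or BIND code): the answer section survives, but the additional-section glue that the Microsoft clients rely on is dropped.

```python
# Illustrative model of "Return minimal responses" applied to the SRV
# lookup above: the ANSWER section is kept, the ADDITIONAL glue is not.

def apply_minimal_responses(response, minimal=True):
    if not minimal:
        return response
    trimmed = dict(response)
    trimmed["additional"] = []  # drop A/AAAA glue for the SRV target
    return trimmed

response = {
    "answer": ["_mssms_mp_swa._tcp.domain.internal.local. 14400 IN SRV 0 0 80 "
               "domaincontrollerhostname.domain.internal.local."],
    "additional": ["domaincontrollerhostname.domain.internal.local. 1200 IN A 1.2.3.4"],
}
print(apply_minimal_responses(response)["additional"])                 # []
print(apply_minimal_responses(response, minimal=False)["additional"])  # glue kept
```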
General Design
If CHR is 40% or less, look for top offenders (e.g. a SIEM or mail server) and consider giving them their own DNS servers, or turn off the feature on the offending server (e.g. stop PTR lookups).
BIND - max recursive clients defaults to 1,000. Increase it to 5,000. If you have to increase this, never increase it beyond the max QPS. If you have to increase beyond 5,000 (the max is 40,000), either the QPS has massively increased or the CHR has fallen. Find the top offenders using the reporting server. The reporting server doesn't need query logging to show top query clients and top query domains.
If BIND isn't running, NTP isn't available on anycast. NTP just runs on all operational interfaces, and route withdrawal is tied to the BIND service. Thus: no BIND, no NTP on the anycast address.
Global Forwarders. Don't use more than 4. Possibly 6 at a push.
Don't assume that OSes will always use the first DNS server and will only try the second or third if the first one fails.
Three DNS servers can be configured on Linux. Three DNS servers can be configured on Windows using DHCP.
Disable Cache
You can't disable the cache on NIOS, but you can set the TTL to 0 under Grid DNS Properties > General > Advanced > Max Cache TTL (set to 0).
DNS Tombstone
When showing the “capacity” of a member, you may see the entry “bind_tombstone”. The zone-maintenance phase of NIOS's OneDB AZD (augmented zone data) handling creates a timestamped tombstone record when a DNS record in a multi-master zone is deleted locally or by DB replication on the Grid Manager.
Anycast
Remember, for Anycast, you need to setup the Anycast IP on the member, then edit the member's DNS properties and configure it to accept queries on the Anycast IP (General > Basic tab). It may then take a minute to apply.
Don't forget, in order to make sure that traffic is actually distributed across all members of the anycast group, the router must be configured with ECMP and load balance using an IP hash of source IP and source port.
- OSPF advertises the route when the DNS service starts. The start DNS command creates an interface and starts the OSPF daemon.
- OSPF stops advertising the route when the DNS service stops. The stop DNS command stops the OSPF daemon and deletes the interface.
- The NIOS application does not support route flapping. For example, temporary DNS downtime, such as a restart, does not stop and re-instate the OSPF advertisement.
- The OSPF advertisement stops if DNS service is down for more than 40 seconds.
BGP is preferable over OSPF as it can be more finely manipulated.
show bgp config
show ospf config
Query logs will show the Anycast IP the query was aimed at. The following has 192.168.77.53 as the Anycast IP and 192.168.1.4 as the client IP.
client @0x7f7b7912ae28 192.168.1.4#60352 (www.google.com): query: www.google.com IN A +E(0)K (192.168.77.53)
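If you need to pull the client IP, queried name, and anycast destination out of such lines in bulk, a rough parser might look like this. The regex is my own and is tuned only to the sample line above, not to every BIND query-log variant.

```python
import re

# Hypothetical parser for the BIND-style query-log line above: extracts
# client IP/port, queried name, query type, and the trailing (anycast)
# destination IP.
LOG_RE = re.compile(
    r"client \S+ (?P<ip>[\d.]+)#(?P<port>\d+) \((?P<qname>[^)]+)\): "
    r"query: \S+ IN (?P<qtype>\S+) \S+ \((?P<dest>[\d.]+)\)"
)

line = ("client @0x7f7b7912ae28 192.168.1.4#60352 (www.google.com): "
        "query: www.google.com IN A +E(0)K (192.168.77.53)")
m = LOG_RE.match(line)
print(m.group("ip"), m.group("qname"), m.group("dest"))
# 192.168.1.4 www.google.com 192.168.77.53
```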
You should not hand out multiple anycast addresses to a client that go to the same end member. If that member has an issue (or there is a routing issue), then none of the addresses the client is trying to use will work until (or if?) the routing protocol re-converges. If you want multiple Anycast IP addresses, use an “A side, B side” layout, i.e. if there are two members per data centre, one Anycast IP goes on the first member and the second Anycast IP goes on the second member. Replicate across all DCs.
When using Anycast you should also enable BFD and enable the DNS Health Check Monitor (see the Infoblox documentation for both).
HA and Anycast are not mutually exclusive. You might want to HA your DNS Anycast boxes to assist with NIOS upgrades. It also improves geo resiliency. If you have an HA box in USA, EMEA, and APAC, anycast will keep DNS available, but a failure in one geo means clients in that geo now have higher DNS latency. Of course, if you are doing HA, it means you could instead deploy two standalone boxes per geo to reduce that risk. However, you should architect so that a single box per geo can cope with the load of that geo (and at least one other).
If you only have two or three DNS servers, don't bother with anycast (probably). Just specify two or three DNS servers to the clients. Linux and Windows can handle this (use DHCP if required).
Don't put a second anycast IP on all the same Anycast servers. E.g. say you had two DNS servers in AMER, two in EMEA, and two in APAC. You would put one Anycast IP on the first server in every geo and a second Anycast IP on the second server in every geo. Make sure that the two Anycast IP addresses cannot be summarised into the same route internally. Exactly how far apart the two IPs should be depends on how far you summarise the routes internally.
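The “A side, B side” layout above can be sketched as a simple assignment. The member names and IPs here are made up (the IPs are from documentation ranges):

```python
# Sketch of the A-side/B-side anycast layout: IP A on the first member
# in every geo, IP B on the second, so losing one member never takes
# both anycast addresses away from a client. Names/IPs are invented.

ANYCAST_A, ANYCAST_B = "192.0.2.53", "198.51.100.53"
members = {
    "AMER": ["amer-dns-1", "amer-dns-2"],
    "EMEA": ["emea-dns-1", "emea-dns-2"],
    "APAC": ["apac-dns-1", "apac-dns-2"],
}

assignment = {}
for geo, (first, second) in members.items():
    assignment[first] = ANYCAST_A   # A side
    assignment[second] = ANYCAST_B  # B side

print(assignment["amer-dns-1"], assignment["amer-dns-2"])
# 192.0.2.53 198.51.100.53
```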
If you have one anycast IP and want to use a secondary, non-anycast IP as backup, make sure that the secondary IP is NEVER in the same DC as the endpoints given that secondary IP. For example, with three data centres, if endpoints in DC1 use the DC1-local DNS IP and the Anycast IP, then if anything happens to the DNS server in DC1, routing won't update immediately (BFD can help keep route convergence to a few seconds), so both the primary and secondary DNS queries will go to the same (faulty) DNS server in DC1, causing an outage for endpoints in DC1. This is why DC1 endpoints should get the Anycast IP plus a local IP from another data centre.
LAN2
To get a NIOS appliance to receive DNS queries on LAN1 but then send queries (i.e. recursion) out of LAN2 (e.g. in a bridged DMZ where LAN1 = internal network and LAN2 = external network): under the member properties in the Grid, go to General > Basic and toggle “Send queries from” to the LAN2 interface, and “Send notify messages and zone transfer requests from” to the LAN2 interface.
Fault Tolerant Caching
Infoblox recommend this. In a lab it added about 10% extra memory usage.
ISPs might implement this to help mitigate the end-user impact of incidents such as the Facebook BGP/DNS outage in October 2021 (i.e. continue serving cached responses during a massive authoritative failure).
External DNS
To hide the private IP of the LAN1 interface when NIOS is externally facing:
Data Management→DNS→Members, edit member, Views.
Click on “Interface IP Address” for the view, change it to “Other IP Address”, then type in the IP you want published for glue in the view for the member.
Or you can make the NIOS entries in the Name Server Group to be “Stealth” and then add the external IP addresses as External Secondaries.
DNS Views
Multiple views on a member, fine. Looping/Forwarding between views is not fine. Possible and, in some cases, necessary, but not fine. It also means that the NIOS member probably cannot use itself as a resolver because it will often match the “wrong” view.
Alias
Remember, when putting an alias record on an authoritative DNS server (i.e. CNAME for APEX of domain), the DNS server will need to be able to resolve the place it is pointing to. This does not mean you have to enable recursion on the DNS server but the server itself will need to resolve the name. (e.g. management layer)
GSS-TSIG
When you have multiple AD domains in a Forest, you need to delegate the underscore zones and then enable GSS-TSIG updates at each delegated zone. This is exactly the same way it works in MSFT DNS. You CAN, but should NOT, enable GSS-TSIG on the “parent” AD zone. MSFT does this by default; it's a best practice to NOT allow that. Instead, you would use ACLs from server networks to allow servers to do updates. All client updates should be done only by the DHCP server or by some form of automation. If clients are doing the update, they can only update their own zone based on the AD domain the system belongs to. It's not possible for one client to update a different AD domain since (1) its credentials won't allow it and (2) it only uses the domain name for which it is a member.
Forwarders
There is no single answer to the question “How long will NIOS take to fall back to root hints once a global forwarder fails?”. It depends on how many forwarders are configured: more forwarders means more servers to try before falling back to root. It also depends on the BIND version; more modern BIND uses RTT tracking, which affects the overall time. Finally, there are EDNS0 back-off mechanics at play, where it will try increasing timeouts (from memory, something like 1.6s, 3.2s, 6.4s, 9s, until it hits the default maximum, recalled as about 30s total). So there is no really easy answer for how long before any given server falls back to root.
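A back-of-envelope model of the worst case, using the approximate timeouts quoted above. These figures are recollections in the text, not authoritative BIND constants, so treat the output as an order-of-magnitude estimate only.

```python
# Rough worst-case model: time spent retrying dead forwarders (with the
# approximate EDNS0 back-off schedule above) before falling back to root.

def time_to_fallback(num_forwarders, timeouts=(1.6, 3.2, 6.4, 9.0), cap=30.0):
    """Worst-case seconds burned on unresponsive forwarders."""
    total = 0.0
    for _ in range(num_forwarders):
        for t in timeouts:
            total += t
            if total >= cap:   # overall back-off cap reached
                return cap
    return round(total, 1)

print(time_to_fallback(1))  # 20.2 (one dead forwarder)
print(time_to_fallback(4))  # 30.0 (hits the ~30s cap)
```

This is also a concrete reason for the earlier advice to keep the global forwarder list short: every extra forwarder adds to the worst-case delay.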
In general, if the architecture requires you to use forwarders to another system, tick the “Use Forwarders Only” button. Otherwise, don't use forwarders at all.
If you have forwarders configured and you have NOT ticked the “Use Forwarders Only” button, that is called “Forward First”.
If you have “Use Forwarders Only”/Forward only selected then NIOS may still query root hints.
- A server that is recursive, and has DNSSEC trust anchors installed for the root domain, and is performing DNSSEC validation, will reach out to root hints regardless of the forwarding configuration.
- A server that is recursive but NOT set to “forward only” will look up root information from time to time to make sure it has the latest, rather than simply relying on the built-in hints it starts up with. If you want to kill off queries to the root, enable forward only.
- An authoritative only server will reach out to root hints if there are ALIAS records hosted.
Forwarders Order
If a global forwarder is set and a member overrides it, the member setting takes priority.
In general, don't mix forwarders and views, but it could also make sense to enable forwarding at the global level and disable it in an external view.
Limit Recursive Clients
“Limit number of recursive clients to”
By default this is 1,000 for all models.
If you get the “no more recursive clients: quota reached” error, a starting point for the setting is 10% of the appliance's QPS rating. The idea is that if the appliance was recently powered up, you don't want a huge volume of clients hitting it all at once and overpowering it.
It's the number of clients in the queue for a response. DNS over UDP is stateless, so every query is treated as a unique client.
There is a recurring syslog message that gives insight into how many recursive clients you have at that point in time, along with a maximum seen and the number of times the hard limit (1000 by default) has been exceeded. Do a search for “recursion client quota” in the syslog on the member.
I generally use the syslog message as a guide. If it has hit the limit, then increase the limit by 1,000 and see if it is hit or if the max stays below the limit.
The most critical thing to know is that there is ALWAYS a limit; even if the box is not checked, that limit is 1,000. You can adjust the limit, but you can't eliminate it.
If you need to increase, do so 1k at a time.
KB Article - If you want to increase the number of outstanding recursive queries on the recursive name server, confirm that you have adequate memory available for that number of outstanding recursive queries and for other services that are configured on the same server. Every recursive query can take about 20 kilobytes.
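Using the ~20 KB-per-outstanding-query figure from the KB article, a quick sanity check on memory headroom before raising the limit (a rough estimate, not an official sizing formula):

```python
# Back-of-envelope memory cost of the recursive-clients limit, at
# roughly 20 KB per outstanding recursive query (per the KB article).

def recursion_memory_mb(recursive_clients, kb_per_query=20):
    return recursive_clients * kb_per_query / 1024

print(recursion_memory_mb(1000))   # 19.53125  (default limit: ~20 MB)
print(recursion_memory_mb(40000))  # 781.25    (hard max: ~780 MB)
```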
TTL
Don't set the minimum TTL to 0. It can cause issues when resolving a CNAME in the zone: the recursive server has to chase the whole CNAME chain, and the 0-TTL record falls out of cache before the recursive NS can return the answer to the client, so it SERVFAILs.
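A toy illustration of that failure mode, using a dict as a stand-in cache (this is a simplified model, not resolver code):

```python
import time

# Toy cache: a record stored with TTL 0 is already expired by the time
# the resolver tries to stitch the CNAME chain together.
cache = {}

def cache_put(name, value, ttl):
    cache[name] = (value, time.monotonic() + ttl)

def cache_get(name):
    value, expires = cache[name]
    return value if time.monotonic() < expires else None

cache_put("www.example.com", "cdn.example.net", 0)  # minimum TTL of 0
cache_put("cdn.example.net", "1.2.3.4", 60)

# By the time the chain is chased, the TTL-0 link is gone:
print(cache_get("www.example.com"))  # None -> chain broken -> SERVFAIL
print(cache_get("cdn.example.net"))  # 1.2.3.4
```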
NIOS SYSLOG
This section highlights some logs you can see in NIOS when testing DNS.
Query logging via syslog is highly discouraged, as it has a performance impact on QPS of up to 90%.
Use of Data Connector typically has an impact on QPS of around 40%. Each single Data Connector appliance supports processing of up to 47,000 QPS.
Updating Authoritative Zone
For when we update a zone in NIOS that NIOS is authoritative for. Notice the last log is a notification sent out to all secondary servers.
- Facility: daemon
- Level: Info
- Server: named
- Message: zone test.corp/IN: ZRQ applying transaction 135.
- Message: zone test.corp/IN: ZRQ applied ADD for 'update': 28800 IN A 9.8.7.6 (none).
- Message: zone test.corp/IN: ZRQ applied transaction 135 with SOA serial 3. Zone version is now 1.
- Message: zone test.corp/IN: sending notifies (serial 2)
Initial Zone Transfer
For when NIOS is authoritative for a DNS zone and a new secondary server runs its initial zone transfer.
- Facility: daemon
- Level: Info
- Server: named
- Message: client @0x7f16562ed3a8 192.168.53.53#36591 (bloxer.corp): query: test.corp IN AXFR -T (192.168.11.153)
- Message: client @0x7f16562ed3a8 192.168.53.53#36591 (bloxer.corp): transfer of 'test.corp/IN': AXFR started (serial 2)
- Message: client @0x7f16562ed3a8 192.168.53.53#36591 (bloxer.corp): transfer of 'test.corp/IN': AXFR ended: 1 messages, 5 records, 210 bytes, 0.001 secs (210000 bytes/sec) (serial 2)
Incremental Update Logs
For when NIOS is authoritative for a DNS zone and an existing secondary server runs an incremental zone transfer to ensure it has the latest data.
- Facility: daemon
- Level: Info
- Server: named
- Message: client @0x7f418f21ce78 192.168.53.53#36019 (bloxer.corp): query: test.corp IN IXFR -T (192.168.11.153)
- Message: client @0x7f418f21ce78 192.168.53.53#36019 (bloxer.corp): transfer of 'test.corp/IN': IXFR started (serial 2 → 3)
- Message: client @0x7f418f21ce78 192.168.53.53#36019 (bloxer.corp): transfer of 'test.corp/IN': IXFR ended: 1 messages, 5 records, 250 bytes, 0.001 secs (250000 bytes/sec) (serial 3)
Same again, but this time with TSIG enabled (generate the key with the tsig-keygen command).
- Message: client @0x7f00972e9c08 192.168.53.53#51821/key mytsigkey (bloxer.corp): query: bloxer.corp IN IXFR -ST (192.168.1.1)
- Message: client @0x7f00972e9c08 192.168.53.53#51821/key mytsigkey (bloxer.corp): transfer of 'bloxer.corp/IN': IXFR started: TSIG mytsigkey (serial 3 → 4)
- Message: client @0x7f00972e9c08 192.168.53.53#51821/key mytsigkey (bloxer.corp): transfer of 'bloxer.corp/IN': IXFR ended: 1 messages, 5 records, 329 bytes, 0.001 secs (329000 bytes/sec) (serial 4)
Query Logs
- Client = 192.168.99.1
- DNS Server = 192.168.11.53
Query Log (TCP query which adds the T to TK)
client @0x7f80a84d0b78 192.168.99.1#56606 (example.com): query: example.com IN A +E(0)TK (192.168.11.53)
client @0x7f80a84d0b78 192.168.99.1#56606 (www.example.com): query: www.example.com IN A +E(0)TK (192.168.11.53)
Response Log
12-Mar-2024 08:34:25.811 client 192.168.99.1#56606: TCP: query: example.com IN A response: NOERROR +EV example.com. 3600 IN A 1.2.3.4;
12-Mar-2024 08:38:15.449 client 192.168.99.1#32884: UDP: query: www.example.com IN A response: NOERROR +EV www.example.com. 3483 IN CNAME example.com.; example.com. 3370 IN A 1.2.3.4;
client @0x7f80a8016fe8 192.168.12.153#45686 (www.sam.com): query: www.sam.com IN A +E(0)DC (192.168.11.156)
12-Mar-2024 08:42:58.412 client 192.168.12.153#45686: UDP: query: www.sam.com IN A response: NOERROR +EDV www.sam.com. 60 IN CNAME proxy-ssl.webflow.com.; proxy-ssl.webflow.com. 25 IN CNAME proxy-ssl-geo.webflow.com.; proxy-ssl-geo.webflow.com. 4 IN A 34.249.200.254; proxy-ssl-geo.webflow.com. 4 IN A 63.35.51.142; proxy-ssl-geo.webflow.com. 4 IN A 52.17.119.105;
