Troubleshooting node network configuration - Kubernetes NMState | Networking

Troubleshooting an incorrect node network configuration policy configuration
Troubleshooting DNS connectivity issues in a disconnected environment
- Configuring the bind9 DNS named server
- Configuring the dnsmasq DNS server

If the node network configuration encounters an issue, the policy is automatically rolled back and the enactments report failure. This includes issues such as:

The configuration fails to be applied on the host.
The host loses connection to the default gateway.
The host loses connection to the API server.

Troubleshooting an incorrect node network configuration policy configuration

You can apply changes to the node network configuration across your entire cluster by applying a node network configuration policy. If you apply an incorrect configuration, you can use the following example to troubleshoot and correct the failed node network policy.

In this example, a Linux bridge policy is applied to an example cluster that has three control plane nodes and three compute nodes. The policy fails to be applied because it references an incorrect interface. To find the error, investigate the available NMState resources. You can then update the policy with the correct configuration.

Procedure

Create a policy and apply it to your cluster. The following example creates a simple bridge on the ens01 interface:

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens01-bridge-testfail
spec:
  desiredState:
    interfaces:
      - name: br1
        description: Linux bridge with the wrong port
        type: linux-bridge
        state: up
        ipv4:
          dhcp: true
          enabled: true
        bridge:
          options:
            stp:
              enabled: false
          port:
            - name: ens01

$ oc apply -f ens01-bridge-testfail.yaml

Example output

nodenetworkconfigurationpolicy.nmstate.io/ens01-bridge-testfail created

Verify the status of the policy by running the following command:
```
$ oc get nncp
```
The output shows that the policy failed:
Example output
```
NAME                    STATUS
ens01-bridge-testfail   FailedToConfigure
```
However, the policy status alone does not indicate if it failed on all nodes or a subset of nodes.

List the node network configuration enactments to see if the policy was successful on any of the nodes. If the policy failed for only a subset of nodes, it suggests that the problem is with a specific node configuration. If the policy failed on all nodes, it suggests that the problem is with the policy.

$ oc get nnce

The output shows that the policy failed on all nodes:

Example output

NAME                                         STATUS
control-plane-1.ens01-bridge-testfail        FailedToConfigure
control-plane-2.ens01-bridge-testfail        FailedToConfigure
control-plane-3.ens01-bridge-testfail        FailedToConfigure
compute-1.ens01-bridge-testfail              FailedToConfigure
compute-2.ens01-bridge-testfail              FailedToConfigure
compute-3.ens01-bridge-testfail              FailedToConfigure

View one of the failed enactments and look at the traceback. The following command uses the output tool jsonpath to filter the output:

$ oc get nnce compute-1.ens01-bridge-testfail -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'

This command returns a large traceback that has been edited for brevity:

Example output

error reconciling NodeNetworkConfigurationPolicy at desired state apply: , failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1' ''
...
libnmstate.error.NmstateVerificationError:
desired
=======
---
name: br1
type: linux-bridge
state: up
bridge:
  options:
    group-forward-mask: 0
    mac-ageing-time: 300
    multicast-snooping: true
    stp:
      enabled: false
      forward-delay: 15
      hello-time: 2
      max-age: 20
      priority: 32768
  port:
  - name: ens01
description: Linux bridge with the wrong port
ipv4:
  address: []
  auto-dns: true
  auto-gateway: true
  auto-routes: true
  dhcp: true
  enabled: true
ipv6:
  enabled: false
mac-address: 01-23-45-67-89-AB
mtu: 1500

current
=======
---
name: br1
type: linux-bridge
state: up
bridge:
  options:
    group-forward-mask: 0
    mac-ageing-time: 300
    multicast-snooping: true
    stp:
      enabled: false
      forward-delay: 15
      hello-time: 2
      max-age: 20
      priority: 32768
  port: []
description: Linux bridge with the wrong port
ipv4:
  address: []
  auto-dns: true
  auto-gateway: true
  auto-routes: true
  dhcp: true
  enabled: true
ipv6:
  enabled: false
mac-address: 01-23-45-67-89-AB
mtu: 1500

difference
==========
--- desired
+++ current
@@ -13,8 +13,7 @@
       hello-time: 2
       max-age: 20
       priority: 32768
-  port:
-  - name: ens01
+  port: []
 description: Linux bridge with the wrong port
 ipv4:
   address: []
  line 651, in _assert_interfaces_equal\n    current_state.interfaces[ifname],\nlibnmstate.error.NmstateVerificationError:

The NmstateVerificationError lists the desired policy configuration, the current configuration of the policy on the node, and the difference highlighting the parameters that do not match. In this example, the port is included in the difference, which suggests that the problem is the port configuration in the policy.

To ensure that the policy is configured properly, view the network configuration for one or all of the nodes by requesting the NodeNetworkState object. The following command returns the network configuration for the control-plane-1 node:
```
$ oc get nns control-plane-1 -o yaml
```
The output shows that the interface name on the nodes is ens1 but the failed policy incorrectly uses ens01:
Example output
```
   - ipv4:
...
      name: ens1
      state: up
      type: ethernet
```

Correct the error by editing the existing policy:

$ oc edit nncp ens01-bridge-testfail

...
          port:
            - name: ens1

Save the policy to apply the correction.

Check the status of the policy to ensure it updated successfully:

$ oc get nncp

Example output

NAME                    STATUS
ens01-bridge-testfail   SuccessfullyConfigured

The updated policy is successfully configured on all nodes in the cluster.

Troubleshooting DNS connectivity issues in a disconnected environment

If you experience DNS connectivity issues when configuring nmstate in a disconnected environment, you can configure the DNS server to resolve the list of name servers for the domain root-servers.net.

Ensure that the DNS server includes a name server (NS) entry for the root-servers.net zone. The DNS server does not need to forward a query to an upstream resolver, but the server must return a correct answer for the NS query.

Configuring the bind9 DNS named server

For a cluster configured to query a bind9 DNS server, you can add the root-servers.net zone to a configuration file that contains at least one NS record. For example you can use the /var/named/named.localhost as a zone file that already matches this criteria.

Procedure

Add the root-servers.net zone at the end of the /etc/named.conf configuration file by running the following command:

$ cat >> /etc/named.conf <<EOF
zone "root-servers.net" IN {
    	type master;
    	file "named.localhost";
};
EOF

Restart the named service by running the following command:
```
$ systemctl restart named
```

Confirm that the root-servers.net zone is present by running the following command:

$ journalctl -u named|grep root-servers.net

Example output

Jul 03 15:16:26 rhel-8-10 bash[xxxx]: zone root-servers.net/IN: loaded serial 0
Jul 03 15:16:26 rhel-8-10 named[xxxx]: zone root-servers.net/IN: loaded serial 0

Verify that the DNS server can resolve the NS record for the root-servers.net domain by running the following command:

$ host -t NS root-servers.net. 127.0.0.1

Example output

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.53
Aliases:
root-servers.net name server root-servers.net.

Configuring the dnsmasq DNS server

If you are using dnsmasq as the DNS server, you can delegate resolution of the root-servers.net domain to another DNS server, for example, by creating a new configuration file that resolves root-servers.net using a DNS server that you specify.

Create a configuration file that delegates the domain root-servers.net to another DNS server by running the following command:
```
$ echo 'server=/root-servers.net/<DNS_server_IP>'> /etc/dnsmasq.d/delegate-root-servers.net.conf
```
Restart the dnsmasq service by running the following command:
```
$ systemctl restart dnsmasq
```

Confirm that the root-servers.net domain is delegated to another DNS server by running the following command:

$ journalctl -u dnsmasq|grep root-servers.net

Example output

Jul 03 15:31:25 rhel-8-10 dnsmasq[1342]: using nameserver 192.168.1.1#53 for domain root-servers.net

Verify that the DNS server can resolve the NS record for the root-servers.net domain by running the following command:

$ host -t NS root-servers.net. 127.0.0.1

Example output

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases:
root-servers.net name server root-servers.net.