We had to fix the problem that cert-manager was no longer been able to issue new certificates. Specifically, we wanted to deploy a certificate for a new Grafana Kubernetes ingress resource. cert-manager's log always stated DNS update failed: dns: bad authentication. In the end, we fixed the issue by forcing cert-manager to use the parent DNS server and not the the DNS server for the subzone. It is always DNS, isn't it? But how do you find the root cause for such a problem?

Context

We had

  1. one Kubernetes cluster,
  2. one VM with Windows Server ($WINDOWS_SERVER_IP) hosting the DNS zone $DOMAIN.$TLD
  3. and one VM with a FreeIPA instance ($FREEIPA_IP) for issuing certificates via ACME DNS-01. It also handled the DNS zone nix.$DOMAIN.$TLD. cert-manager had been configured with a ClusterIssuer to request new certificates from FreeIPA via DNS-01. It was basically the same configuration as it is described in https://blog.hamzahkhan.com/using-freeipa-ca-as-an-acme-provider-for-cert-manager/.

context.png

Understanding the bigger picture

Writing some sentences and presenting the solution does not really help to understand the process and problem. When we started looking at the problem, we did not had any clue why it occurred, where to look at first and who the primary suspect was.

cert-manager

cert-manager creates a few other Kubernetes resources when a new Certificate has to be issued. In particular, these are CertificateRequest, Order and Challenge:

cert-manager-ressources.png

In our case, the created Certificate resource looked like this:

$ kubectl describe certificate tls-grafana-ingress

# ...

Spec:
  Dns Names:
    grafana.nix.$DOMAIN.$TLD
  Issuer Ref:
    Group:      cert-manager.io
    Kind:       ClusterIssuer
    Name:       ipa
  Secret Name:  tls-grafana-ingress
  Usages:
    digital signature
    key encipherment
Status:
  Conditions:
    Last Transition Time:        2024-12-31T12:58:51Z
    Message:                     Issuing certificate as Secret does not exist
    Observed Generation:         1
    Reason:                      DoesNotExist
    Status:                      True
    Type:                        Issuing
    Last Transition Time:        2024-12-31T12:58:51Z
    Message:                     Issuing certificate as Secret does not exist
    Observed Generation:         1
    Reason:                      DoesNotExist
    Status:                      False
    Type:                        Ready
  Next Private Key Secret Name:  tls-grafana-ingress-rnvcb
Events:
  Type    Reason     Age    From                                       Message
  ----    ------     ----   ----                                       -------
  Normal  Issuing    7m39s  cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  7m39s  cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "tls-grafana-ingress-rnvcb"
  Normal  Requested  7m39s  cert-manager-certificates-request-manager  Created new CertificateRequest resource "tls-grafana-ingress-1"
  

With a new Certificate resource, also a new CertificateRequest resource tls-grafana-ingress-1 was created:

$ kubectl describe certificaterequest tls-grafana-ingress-1

# ...

Status:
  Conditions:
    Last Transition Time:  2024-12-31T12:58:51Z
    Message:               Certificate request has been approved by cert-manager.io
    Reason:                cert-manager.io
    Status:                True
    Type:                  Approved
    Last Transition Time:  2024-12-31T12:58:51Z
    Message:               Waiting on certificate issuance from order monitoring/tls-grafana-ingress-1-3955588093: "pending"
    Reason:                Pending
    Status:                False
    Type:                  Ready
Events:
  Type    Reason              Age   From                                                Message
  ----    ------              ----  ----                                                -------
  Normal  WaitingForApproval  8m    cert-manager-certificaterequests-issuer-ca          Not signing CertificateRequest until it is Approved
  Normal  WaitingForApproval  8m    cert-manager-certificaterequests-issuer-acme        Not signing CertificateRequest until it is Approved
  Normal  WaitingForApproval  8m    cert-manager-certificaterequests-issuer-venafi      Not signing CertificateRequest until it is Approved
  Normal  WaitingForApproval  8m    cert-manager-certificaterequests-issuer-selfsigned  Not signing CertificateRequest until it is Approved
  Normal  WaitingForApproval  8m    cert-manager-certificaterequests-issuer-vault       Not signing CertificateRequest until it is Approved
  Normal  cert-manager.io     8m    cert-manager-certificaterequests-approver           Certificate request has been approved by cert-manager.io
  Normal  OrderCreated        8m    cert-manager-certificaterequests-issuer-acme        Created Order resource monitoring/tls-grafana-ingress-1-3955588093
  Normal  OrderPending        8m    cert-manager-certificaterequests-issuer-acme        Waiting on certificate issuance from order monitoring/tls-grafana-ingress-1-3955588093: ""

Based upon that CertificateRequest the Order resource with name tls-grafana-ingress-1-3955588093 was created. cert-manager waits for the update of that Order and then also updates the CertificateRequest. The created Order resource contained the following data:

$ kubectl describe order tls-grafana-ingress-1-3955588093

# ...

Status:
  Authorizations:
    Challenges:
      Token:        zMKia_X79XF3ZOeYDdDTKg
      Type:         dns-01
      URL:          https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/9iyTCxwYUQ
      Token:        zMKia_X79XF3ZOeYDdDTKg
      Type:         http-01
      URL:          https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/7Qfn2uUedB
    Identifier:     grafana.nix.kle.internal
    Initial State:  pending
    URL:            https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/wdkOkgTB9l
    Wildcard:       false
  Finalize URL:     https://freeipa.nix.$DOMAIN.$TLD/acme/rest/order/hvIOtDLIpv/finalize
  State:            pending
  URL:              https://freeipa.nix.$DOMAIN.$TLD/acme/rest/order/hvIOtDLIpv
Events:
  Type    Reason   Age    From                 Message
  ----    ------   ----   ----                 -------
  Normal  Created  9m19s  cert-manager-orders  Created Challenge resource "tls-grafana-ingress-1-3955588093-332340918" for domain "grafana.nix.$DOMAIN.$TLD"

The Order resource cointains data which is provided by FreeIPA. More on this later. In response to the order, the Challenge resource got also created:

$ kubectl describe challenge tls-grafana-ingress-1-3955588093-332340918

# ...
Spec:
  Authorization URL:  https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/jScPlE8k45
  Dns Name:           grafana.nix.kle.internal
  Issuer Ref:
    Group:  cert-manager.io
    Kind:   ClusterIssuer
    Name:   ipa
  Key:      <somekey>
  Solver:
    # ommitted
  Token:     zMKia_X79XF3ZOeYDdDTKg
  Type:      DNS-01
  URL:       https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/Kt2qFTwEic
  Wildcard:  false
Status:
  Presented:   false
  Processing:  true
  Reason:      DNS update failed: dns: bad authentication
  State:       pending
Events:
  Type     Reason        Age                  From                     Message
  ----     ------        ----                 ----                     -------
  Normal   Started       5m22s                cert-manager-challenges  Challenge scheduled for processing
  Warning  PresentError  17s (x7 over 5m22s)  cert-manager-challenges  Error presenting challenge: DNS update failed: dns: bad authentication

The Challenge resource contains the status of the ACME challenge. In our case, DNS-01 was used as a method for providing the challenge. But the challenge could not be presented. Updating the DNS TXT record failed, with the reason DNS update failed: dns: bad authentication. Because of this, the status Presented was false and State was pending. No challenge has been written to the DNS.

Where did that log message dns: bad authentication came from? After looking through cert-manager's source code, that line appeared:

import (
	"fmt"
	"strings"
	"time"
	# !!!
	"github.com/miekg/dns"

	"github.com/cert-manager/cert-manager/internal/apis/certmanager/validation/util"
	logf "github.com/cert-manager/cert-manager/pkg/logs"
)

# ...

	c := new(dns.Client)
	c.TsigProvider = tsigHMACProvider(r.tsigSecret)
	// TSIG authentication / msg signing
	if len(r.tsigKeyName) > 0 && len(r.tsigSecret) > 0 {
		m.SetTsig(dns.Fqdn(r.tsigKeyName), r.tsigAlgorithm, 300, time.Now().Unix())
		c.TsigSecret = map[string]string{dns.Fqdn(r.tsigKeyName): r.tsigSecret}
	}

	// Send the query
	reply, _, err := c.Exchange(m, r.nameserver)
	if err != nil {
		# !!!
		return fmt.Errorf("DNS update failed: %v", err)
	}
	if reply != nil && reply.Rcode != dns.RcodeSuccess {
		return fmt.Errorf("DNS update failed. Server replied: %s", dns.RcodeToString[reply.Rcode])
	}

# ...

Jumping over to the miekg/dns, we have that line:

ErrAuth          error = &Error{err: "bad authentication"}             // ErrAuth indicates an error in the TSIG authentication.

and only one reference in tsig.go:

# ...

	if dh, off, err = unpackMsgHdr(msg, off); err != nil {
		return nil, nil, err
	}
	if dh.Arcount == 0 {
		return nil, nil, ErrNoSig
	}

	// Rcode, see msg.go Unpack()
	if int(dh.Bits&0xF) == RcodeNotAuth {
		# !!!
		return nil, nil, ErrAuth
	}

# ...

Looking through that source code did not show any direct clue. But we were able to verify that we did not have any other useful debug or tracing options which we could enable in cert-manager or miekg/dns.

FreeIPA

The FreeIPA project is not one big binary but consists upon multiple tools and services. In our case it is important that FreeIPA uses BIND9 and the bind-dyndb-ldap project. All DNS records are stored by BIND9 inside FreeIPA's LDAP backend. The second important part is, that FreeIPA uses the Dogtag Certificate System project to deal with all the certificate and also ACME stuff.

Dogtag Certificate System

The Dogtag Certificate System responds to cert-manager's requests and checks the status of the challenges. For the ACME responder, log files are available at /var/log/pki/pki-tomcat/acme:

2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating new order
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Header: {"alg":"RS256","kid":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/acct/xxxx","nonce":"6RRxVLv4Namm7iXZVeMFRA","url":"https://freeipa.nix.$DOMAIN.$TLD/acme/new-order"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Valid nonce: 6RRxVLv4Namm7iXZVeMFRA
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Account ID: xxxx
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Searching acmeAccountId=xxxx,ou=accounts,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Validating JWS
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Payload: {"identifiers":[{"type":"dns","value":"grafana.nix.$DOMAIN.$TLD"}]}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Token: zMKia_X79XF3ZOeYDdDTKg
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Generating authorization for each identifiers
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Identifier dns: grafana.nix.$DOMAIN.$TLD
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: - authorization ID: 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating DNS-01 challenge: XTasV3w4FL
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO:   - challenge ID: XTasV3w4FL
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating HTTP-01 challenge: Pwqb8MoYtP
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO:   - challenge ID: Pwqb8MoYtP
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeAuthorizationId=5ByG8qiANj,ou=authorizations,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeChallengeId=XTasV3w4FL,ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeChallengeId=Pwqb8MoYtP,ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Order ID: Dfn1ncvr2H
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeOrderId=Dfn1ncvr2H,ou=orders,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Created nonce: {"id":"plwKogvf8gK6rmT1mD4Srg","expires":"2025-01-01T14:21:30+01:00"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Checking authorization 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Header: {"alg":"RS256","kid":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/acct/xxxx","nonce":"plwKogvf8gK6rmT1mD4Srg","url":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/5ByG8qiANj"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Valid nonce: plwKogvf8gK6rmT1mD4Srg
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Account ID: xxxx
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching acmeAccountId=xxxx,ou=accounts,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Validating JWS
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Payload:
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching acmeAuthorizationId=5ByG8qiANj,ou=authorizations,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Valid authorization: 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Authorization status: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Challenges:
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: - dns-01: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: - http-01: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Created nonce: {"id":"jdKl78gL1Ud9o-VtMCa5OQ","expires":"2025-01-01T14:21:30+01:00"}
2025-01-01 13:51:31 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-8] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca

At least from that log file, everything worked as expected. cert-manager contacted successfully the Dogtag ACME endpoint via HTTP. Dogtag stored the challenge in the LDAP database. Every few seconds, it checked the DNS TXT record _acme-challenge for the expected token.

BIND9

We enabled the debug log levels for BIND9's different channels in /etc/named/ipa-logging-ext.conf by setting the severity to debug 3:

channel update {
        file "data/update.log" versions 10 size 10M;
        # !!! set it from `info` to `debug 3`
        severity debug 3;
        print-time yes;
        print-severity yes;
};

After a systemctl restart named, the logs in /var/named/data contained more information. In particular, /var/named/data/update.log contained the following information:

2024 16:28:29.332 info: client @0x7ff15584a168 https://$FREEIPA_IP/#56589/key acme-update: update failed: $DOMAIN.$TLD: not authoritative for update zone (NOTAUTH)

We also checked the UDP traffic with tcpdump to make sure that the log entry really contained $DOMAIN.$TLD and not nix.$DOMAIN.$TLD.

Instead of using nix.$DOMAIN.$TLD, $DOMAIN.$TLD had been tried to be updated by cert-manager. The log entry had been written, before cert-manager received the RcodeNotAuth response.

cert-manager-and-freeipa.png

Putting the pieces together

It was not instantly visible nor understandable why $DOMAIN.$TLD had been used for ACME requests. After having looked through the source code segments and log files, the understanding of the big picture was much clearer. That allowed us to understand the bigger picture.

cert-manager was able to successfully contact FreeIPA for the ACME challenge. Also both parties, FreeIPA and cert-manager, exchanged valid datasets. That data was stored in the LDAP database on the FreeIPA and in the corresponding Kubernetes resources on cert-manager's side. cert-manager did sent the DNS update request to the right DNS server but with the wrong DNS zone. The communication flow of the different moving parts should look like this:

But the part with contacting the Windows Server DNS was completely missing. This led us to the conclusion that is somehow a misconfiguration of cert-manager. In our context, the specific documentation Delegated Domains for DNS01 of cert-manager' was relevant. We re-created the ACME challenge CNAME in Windows DNS server.


_acme-challenge.$DOMAIN.$TLD IN CNAME _acme-challenge.nix.$DOMAIN.$TLD.

cert-manager still contacted the DNS server of the sub zone nix.$DOMAIN.$TLD, even when the Kubernetes nodes and also rke2-coredns used the DNS of the primary zone. This is still pretty unsatisfying but we do not know, why it used the wrong DNS - even after various reboots of the cert-manager pods. We looked for some configuration options to force cert-manager using a specific DNS and also stumbled upon the section Setting Nameservers for DNS01 Self Check. We added the following

extraArgs:
  - --dns01-recursive-nameservers-only
  - --dns01-recursive-nameservers=$WINDOWS_SERVER_IP:53

in cert-manager's Helm values.yaml.

After redeploying the pod, the whole communication flow looked like the sequence diagram. cert-manager now updated the _acme-challenge in the correct DNS zone and received a valid certificate from FreeIPA.