We had to fix the problem that cert-manager
was no longer been able to issue new certificates. Specifically, we wanted to deploy a certificate for a new Grafana Kubernetes ingress resource. cert-manager
's log always stated DNS update failed: dns: bad authentication
. In the end, we fixed the issue by forcing cert-manager
to use the parent DNS server and not the the DNS server for the subzone. It is always DNS, isn't it? But how do you find the root cause for such a problem?
Context
We had
- one Kubernetes cluster,
- one VM with Windows Server (
$WINDOWS_SERVER_IP
) hosting the DNS zone$DOMAIN.$TLD
- and one VM with a FreeIPA instance (
$FREEIPA_IP
) for issuing certificates via ACME DNS-01. It also handled the DNS zonenix.$DOMAIN.$TLD
.cert-manager
had been configured with aClusterIssuer
to request new certificates from FreeIPA via DNS-01. It was basically the same configuration as it is described in https://blog.hamzahkhan.com/using-freeipa-ca-as-an-acme-provider-for-cert-manager/.
Understanding the bigger picture
Writing some sentences and presenting the solution does not really help to understand the process and problem. When we started looking at the problem, we did not had any clue why it occurred, where to look at first and who the primary suspect was.
cert-manager
cert-manager
creates a few other Kubernetes resources when a new Certificate
has to be issued. In particular, these are CertificateRequest
, Order
and Challenge
:
In our case, the created Certificate
resource looked like this:
$ kubectl describe certificate tls-grafana-ingress
# ...
Spec:
Dns Names:
grafana.nix.$DOMAIN.$TLD
Issuer Ref:
Group: cert-manager.io
Kind: ClusterIssuer
Name: ipa
Secret Name: tls-grafana-ingress
Usages:
digital signature
key encipherment
Status:
Conditions:
Last Transition Time: 2024-12-31T12:58:51Z
Message: Issuing certificate as Secret does not exist
Observed Generation: 1
Reason: DoesNotExist
Status: True
Type: Issuing
Last Transition Time: 2024-12-31T12:58:51Z
Message: Issuing certificate as Secret does not exist
Observed Generation: 1
Reason: DoesNotExist
Status: False
Type: Ready
Next Private Key Secret Name: tls-grafana-ingress-rnvcb
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Issuing 7m39s cert-manager-certificates-trigger Issuing certificate as Secret does not exist
Normal Generated 7m39s cert-manager-certificates-key-manager Stored new private key in temporary Secret resource "tls-grafana-ingress-rnvcb"
Normal Requested 7m39s cert-manager-certificates-request-manager Created new CertificateRequest resource "tls-grafana-ingress-1"
With a new Certificate
resource, also a new CertificateRequest
resource tls-grafana-ingress-1
was created:
$ kubectl describe certificaterequest tls-grafana-ingress-1
# ...
Status:
Conditions:
Last Transition Time: 2024-12-31T12:58:51Z
Message: Certificate request has been approved by cert-manager.io
Reason: cert-manager.io
Status: True
Type: Approved
Last Transition Time: 2024-12-31T12:58:51Z
Message: Waiting on certificate issuance from order monitoring/tls-grafana-ingress-1-3955588093: "pending"
Reason: Pending
Status: False
Type: Ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitingForApproval 8m cert-manager-certificaterequests-issuer-ca Not signing CertificateRequest until it is Approved
Normal WaitingForApproval 8m cert-manager-certificaterequests-issuer-acme Not signing CertificateRequest until it is Approved
Normal WaitingForApproval 8m cert-manager-certificaterequests-issuer-venafi Not signing CertificateRequest until it is Approved
Normal WaitingForApproval 8m cert-manager-certificaterequests-issuer-selfsigned Not signing CertificateRequest until it is Approved
Normal WaitingForApproval 8m cert-manager-certificaterequests-issuer-vault Not signing CertificateRequest until it is Approved
Normal cert-manager.io 8m cert-manager-certificaterequests-approver Certificate request has been approved by cert-manager.io
Normal OrderCreated 8m cert-manager-certificaterequests-issuer-acme Created Order resource monitoring/tls-grafana-ingress-1-3955588093
Normal OrderPending 8m cert-manager-certificaterequests-issuer-acme Waiting on certificate issuance from order monitoring/tls-grafana-ingress-1-3955588093: ""
Based upon that CertificateRequest
the Order
resource with name tls-grafana-ingress-1-3955588093
was created. cert-manager
waits for the update of that Order
and then also updates the CertificateRequest
. The created Order
resource contained the following data:
$ kubectl describe order tls-grafana-ingress-1-3955588093
# ...
Status:
Authorizations:
Challenges:
Token: zMKia_X79XF3ZOeYDdDTKg
Type: dns-01
URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/9iyTCxwYUQ
Token: zMKia_X79XF3ZOeYDdDTKg
Type: http-01
URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/7Qfn2uUedB
Identifier: grafana.nix.kle.internal
Initial State: pending
URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/wdkOkgTB9l
Wildcard: false
Finalize URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/order/hvIOtDLIpv/finalize
State: pending
URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/order/hvIOtDLIpv
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 9m19s cert-manager-orders Created Challenge resource "tls-grafana-ingress-1-3955588093-332340918" for domain "grafana.nix.$DOMAIN.$TLD"
The Order
resource cointains data which is provided by FreeIPA. More on this later.
In response to the order, the Challenge
resource got also created:
$ kubectl describe challenge tls-grafana-ingress-1-3955588093-332340918
# ...
Spec:
Authorization URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/jScPlE8k45
Dns Name: grafana.nix.kle.internal
Issuer Ref:
Group: cert-manager.io
Kind: ClusterIssuer
Name: ipa
Key: <somekey>
Solver:
# ommitted
Token: zMKia_X79XF3ZOeYDdDTKg
Type: DNS-01
URL: https://freeipa.nix.$DOMAIN.$TLD/acme/rest/chall/Kt2qFTwEic
Wildcard: false
Status:
Presented: false
Processing: true
Reason: DNS update failed: dns: bad authentication
State: pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 5m22s cert-manager-challenges Challenge scheduled for processing
Warning PresentError 17s (x7 over 5m22s) cert-manager-challenges Error presenting challenge: DNS update failed: dns: bad authentication
The Challenge
resource contains the status of the ACME challenge. In our case, DNS-01 was used as a method for providing the challenge. But the challenge could not be presented. Updating the DNS TXT record failed, with the reason DNS update failed: dns: bad authentication
.
Because of this, the status Presented
was false
and State
was pending
. No challenge has been written to the DNS.
Where did that log message dns: bad authentication
came from? After looking through cert-manager
's source code, that line appeared:
import (
"fmt"
"strings"
"time"
# !!!
"github.com/miekg/dns"
"github.com/cert-manager/cert-manager/internal/apis/certmanager/validation/util"
logf "github.com/cert-manager/cert-manager/pkg/logs"
)
# ...
c := new(dns.Client)
c.TsigProvider = tsigHMACProvider(r.tsigSecret)
// TSIG authentication / msg signing
if len(r.tsigKeyName) > 0 && len(r.tsigSecret) > 0 {
m.SetTsig(dns.Fqdn(r.tsigKeyName), r.tsigAlgorithm, 300, time.Now().Unix())
c.TsigSecret = map[string]string{dns.Fqdn(r.tsigKeyName): r.tsigSecret}
}
// Send the query
reply, _, err := c.Exchange(m, r.nameserver)
if err != nil {
# !!!
return fmt.Errorf("DNS update failed: %v", err)
}
if reply != nil && reply.Rcode != dns.RcodeSuccess {
return fmt.Errorf("DNS update failed. Server replied: %s", dns.RcodeToString[reply.Rcode])
}
# ...
Jumping over to the miekg/dns, we have that line:
ErrAuth error = &Error{err: "bad authentication"} // ErrAuth indicates an error in the TSIG authentication.
and only one reference in tsig.go:
# ...
if dh, off, err = unpackMsgHdr(msg, off); err != nil {
return nil, nil, err
}
if dh.Arcount == 0 {
return nil, nil, ErrNoSig
}
// Rcode, see msg.go Unpack()
if int(dh.Bits&0xF) == RcodeNotAuth {
# !!!
return nil, nil, ErrAuth
}
# ...
Looking through that source code did not show any direct clue. But we were able to verify that we did not have any other useful debug or tracing options which we could enable in cert-manager
or miekg/dns
.
FreeIPA
The FreeIPA project is not one big binary but consists upon multiple tools and services. In our case it is important that FreeIPA uses BIND9 and the bind-dyndb-ldap
project. All DNS records are stored by BIND9 inside FreeIPA's LDAP backend. The second important part is, that FreeIPA uses the Dogtag Certificate System project to deal with all the certificate and also ACME stuff.
Dogtag Certificate System
The Dogtag Certificate System
responds to cert-manager
's requests and checks the status of the challenges. For the ACME responder, log files are available at /var/log/pki/pki-tomcat/acme
:
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating new order
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Header: {"alg":"RS256","kid":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/acct/xxxx","nonce":"6RRxVLv4Namm7iXZVeMFRA","url":"https://freeipa.nix.$DOMAIN.$TLD/acme/new-order"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Valid nonce: 6RRxVLv4Namm7iXZVeMFRA
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Account ID: xxxx
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Searching acmeAccountId=xxxx,ou=accounts,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Validating JWS
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Payload: {"identifiers":[{"type":"dns","value":"grafana.nix.$DOMAIN.$TLD"}]}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Token: zMKia_X79XF3ZOeYDdDTKg
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Generating authorization for each identifiers
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Identifier dns: grafana.nix.$DOMAIN.$TLD
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: - authorization ID: 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating DNS-01 challenge: XTasV3w4FL
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: - challenge ID: XTasV3w4FL
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Creating HTTP-01 challenge: Pwqb8MoYtP
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: - challenge ID: Pwqb8MoYtP
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeAuthorizationId=5ByG8qiANj,ou=authorizations,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeChallengeId=XTasV3w4FL,ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeChallengeId=Pwqb8MoYtP,ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Order ID: Dfn1ncvr2H
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: LDAPDatabase: Adding acmeOrderId=Dfn1ncvr2H,ou=orders,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-6] INFO: Created nonce: {"id":"plwKogvf8gK6rmT1mD4Srg","expires":"2025-01-01T14:21:30+01:00"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Checking authorization 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Header: {"alg":"RS256","kid":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/acct/xxxx","nonce":"plwKogvf8gK6rmT1mD4Srg","url":"https://freeipa.nix.$DOMAIN.$TLD/acme/rest/authz/5ByG8qiANj"}
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Valid nonce: plwKogvf8gK6rmT1mD4Srg
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Account ID: xxxx
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching acmeAccountId=xxxx,ou=accounts,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Validating JWS
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Payload:
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching acmeAuthorizationId=5ByG8qiANj,ou=authorizations,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: LDAPDatabase: Searching ou=challenges,ou=acme,o=ipaca
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Valid authorization: 5ByG8qiANj
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Authorization status: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Challenges:
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: - dns-01: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: - http-01: pending
2025-01-01 13:51:30 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-7] INFO: Created nonce: {"id":"jdKl78gL1Ud9o-VtMCa5OQ","expires":"2025-01-01T14:21:30+01:00"}
2025-01-01 13:51:31 [ajp-nio-0:0:0:0:0:0:0:1-8009-exec-8] INFO: LDAPDatabase: Searching ou=config,ou=acme,o=ipaca
At least from that log file, everything worked as expected. cert-manager
contacted successfully the Dogtag ACME endpoint via HTTP. Dogtag stored the challenge in the LDAP database. Every few seconds, it checked the DNS TXT record _acme-challenge
for the expected token.
BIND9
We enabled the debug log levels for BIND9's different channels in /etc/named/ipa-logging-ext.conf
by setting the severity
to debug 3
:
channel update {
file "data/update.log" versions 10 size 10M;
# !!! set it from `info` to `debug 3`
severity debug 3;
print-time yes;
print-severity yes;
};
After a systemctl restart named
, the logs in /var/named/data
contained more information. In particular, /var/named/data/update.log
contained the following information:
2024 16:28:29.332 info: client @0x7ff15584a168 https://$FREEIPA_IP/#56589/key acme-update: update failed: $DOMAIN.$TLD: not authoritative for update zone (NOTAUTH)
We also checked the UDP traffic with tcpdump
to make sure that the log entry really contained $DOMAIN.$TLD
and not nix.$DOMAIN.$TLD
.
Instead of using nix.$DOMAIN.$TLD
, $DOMAIN.$TLD
had been tried to be updated by cert-manager
. The log entry had been written, before cert-manager
received the RcodeNotAuth
response.
Putting the pieces together
It was not instantly visible nor understandable why $DOMAIN.$TLD
had been used for ACME requests. After having looked through the source code segments and log files, the understanding of the big picture was much clearer. That allowed us to understand the bigger picture.
cert-manager
was able to successfully contact FreeIPA for the ACME challenge. Also both parties, FreeIPA and cert-manager
, exchanged valid datasets. That data was stored in the LDAP database on the FreeIPA and in the corresponding Kubernetes resources on cert-manager
's side.
cert-manager
did sent the DNS update request to the right DNS server but with the wrong DNS zone.
The communication flow of the different moving parts should look like this:
But the part with contacting the Windows Server DNS was completely missing.
This led us to the conclusion that is somehow a misconfiguration of cert-manager
. In our context, the specific documentation Delegated Domains for DNS01 of cert-manager
' was relevant. We re-created the ACME challenge CNAME in Windows DNS server.
_acme-challenge.$DOMAIN.$TLD IN CNAME _acme-challenge.nix.$DOMAIN.$TLD.
cert-manager
still contacted the DNS server of the sub zone nix.$DOMAIN.$TLD
, even when the Kubernetes nodes and also rke2-coredns
used the DNS of the primary zone. This is still pretty unsatisfying but we do not know, why it used the wrong DNS - even after various reboots of the cert-manager
pods.
We looked for some configuration options to force cert-manager
using a specific DNS and also stumbled upon the section Setting Nameservers for DNS01 Self Check. We added the following
extraArgs:
- --dns01-recursive-nameservers-only
- --dns01-recursive-nameservers=$WINDOWS_SERVER_IP:53
in cert-manager
's Helm values.yaml
.
After redeploying the pod, the whole communication flow looked like the sequence diagram. cert-manager
now updated the _acme-challenge
in the correct DNS zone and received a valid certificate from FreeIPA.