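The Definitive Debugging Guide for the cert-manager Webhook Pod

Last verified: 8 September 2022

The cert-manager webhook is a pod that runs as part of your cert-manager installation. When you apply a manifest with kubectl, the Kubernetes API server calls the cert-manager webhook over TLS to validate your manifests. This guide helps you debug communication issues between the Kubernetes API server and the cert-manager webhook pod.

The error messages listed on this page are encountered while installing or upgrading cert-manager, or shortly after installing or upgrading cert-manager when trying to create a Certificate, an Issuer, or any other cert-manager custom resource.

The diagram below shows the common pattern when debugging an issue with the cert-manager webhook: when creating a cert-manager custom resource, the API server connects over TLS to the cert-manager webhook pod. The red cross indicates that the API server fails to talk to the webhook.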

Diagram that shows a kubectl command that aims to create an issuer resource, and an arrow towards the Kubernetes API server, and an arrow between the API server and the webhook that indicates that the API server tries to connect to the webhook. This last arrow is crossed in red.

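The rest of this document presents the error messages you may encounter.

Error: `connect: connection refused`

This issue was reported in 4 GitHub issues (#2736, #3133, #3445, #4425), in 1 GitHub issue in an external project (aws-load-balancer-controller#1563), on Stack Overflow (serverfault#1076563), and was mentioned in 13 Slack messages that can be listed with the search in:#cert-manager in:#cert-manager-dev ":443: connect: connection refused". This error message can also be found in other projects that build webhooks (kubewarden-controller#110).

Shortly after installing or upgrading cert-manager, you may hit this error when creating a Certificate, an Issuer, or any other cert-manager custom resource. For example, creating an Issuer resource with the following command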

kubectl apply -f- <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: example
spec:
  selfSigned: {}
EOF

shows the following error message

Error from server (InternalError): error when creating "STDIN":
  Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook:
    Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
      dial tcp 10.96.20.99:443: connect: connection refused

When installing or upgrading cert-manager 1.5.0 and above with Helm, a very similar error message may appear when running helm install or helm upgrade

Error: INSTALLATION FAILED: Internal error occurred:
  failed calling webhook "webhook.cert-manager.io": failed to call webhook:
    Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
      dial tcp 10.96.20.99:443: connect: connection refused

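The message "connection refused" appears when the API server tries to establish a TCP connection with the cert-manager-webhook. In TCP terms, the API server sent the SYN packet to start the TCP handshake and received an RST packet in response.

If we were to run tcpdump on the control plane node where the API server is running, we would see a packet returned to the API server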

192.168.1.43 (apiserver)   -> 10.96.20.99 (webhook pod)  TCP   59466 → 443 [SYN]
10.96.20.99  (webhook pod) -> 192.168.1.43 (apiserver)   TCP   443 → 59466 [RST, ACK]

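The RST packet is sent by the Linux kernel when nothing is listening on the requested port. The RST packet can also be returned by one of the TCP hops, such as a firewall, as detailed in the Stack Overflow page "What can be the reasons of connection refused errors?".

Note that firewalls usually don't return an RST packet; they usually drop the SYN packet entirely, and you end up with the error message i/o timeout or context deadline exceeded. If that is the case, continue your investigation with the sections "Error: i/o timeout (connectivity issue)" and "Error: context deadline exceeded".

Let's eliminate the possible causes one by one, from the source of the TCP connection (the API server) to its destination (the cert-manager-webhook pod).

Let's imagine that the name cert-manager-webhook.cert-manager.svc was resolved to 10.43.183.232. This is a cluster IP. The control plane node, where the API server process runs, uses its iptables rules to rewrite the destination IP with the pod IP. That may be the first problem: sometimes, no pod IP is associated with a given cluster IP, because pod IPs are only added to the Endpoint resource once the readiness probe run by the kubelet succeeds.

Let's first check whether this is a problem with the Endpoint resource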

kubectl get endpoints -n cert-manager cert-manager-webhook

A valid output looks like this

NAME                   ENDPOINTS           AGE
cert-manager-webhook   10.244.0.2:10250    27d    ✅

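If you get this valid output and still see connect: connection refused, the issue is deeper in the networking stack. We won't dig into this case, but you may want to use tcpdump and Wireshark to check whether traffic flows properly from the API server to the node's host namespace. The traffic from the host namespace to the pod's namespace already works, since the kubelet was able to reach the readiness endpoint.

Common issues include a firewall dropping traffic from the control plane to the worker nodes; for example, the API server on GKE is only allowed to talk to worker nodes (where the cert-manager webhook runs) over port 10250. On EKS, your security groups might block TCP traffic over port 10250 from the control plane's VPC to the workers' VPC.

If you see <none>, it indicates that the cert-manager webhook is running but its readiness endpoint can't be reached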

NAME                   ENDPOINTS           AGE
cert-manager-webhook   <none>              236d   ❌
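The check above can be scripted. The following is a minimal sketch, not part of kubectl or cert-manager: endpoints_status is a hypothetical helper that reads the table printed by kubectl get endpoints on standard input and prints ok, no-endpoints, or not-found for the given Service name. It assumes the default three-column output shown above.

```shell
# Hypothetical helper (not part of kubectl or cert-manager).
# Reads `kubectl get endpoints` table output on stdin and classifies
# the row whose NAME column equals the Service name given as $1.
endpoints_status() {
  awk -v svc="$1" '
    $1 == svc {
      found = 1
      if ($2 == "<none>")
        print "no-endpoints"
      else
        print "ok"
    }
    END {
      if (!found)
        print "not-found"
    }
  '
}
```

It could be used as kubectl get endpoints -n cert-manager cert-manager-webhook | endpoints_status cert-manager-webhook.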

To fix <none>, you need to check whether the cert-manager-webhook deployment is healthy. The endpoints stay at <none> for as long as the cert-manager-webhook pod is not marked as healthy.

kubectl get pod -n cert-manager -l app.kubernetes.io/name=webhook

You should see that the pod is Running and that the number of ready containers is 0/1

NAME                            READY   STATUS    RESTARTS     AGE
cert-manager-76578c9687-24kmr   0/1     Running   7 (8h ago)   28d  ❌

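We won't detail the case where you get 1/1 and Running, since that would indicate an inconsistent state in Kubernetes.

Continuing with 0/1: this means the readiness endpoint isn't answering, and when that happens, no endpoint is created. The next step is to figure out why the readiness endpoint isn't answering. Let's see which port the kubelet uses when hitting the readiness endpoint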

kubectl -n cert-manager get deploy cert-manager-webhook -oyaml | grep -A5 readiness

In our example, the port that the kubelet will try to hit is 6080

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 6080 # ✨
    scheme: HTTP

Now, let's port-forward to that port and see whether /healthz works. In one shell session, run

kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080

In another shell session, run

curl -sS --dump-header - 127.0.0.1:6080/healthz

The expected output is

HTTP/1.1 200 OK ✅
Date: Tue, 07 Jun 2022 17:16:56 GMT
Content-Length: 0

If the readiness endpoint doesn't work, you will see

curl: (7) Failed to connect to 127.0.0.1 port 6080 after 0 ms: Connection refused ❌

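At this point, verify that the readiness endpoint is served on that same port. Let's look at the logs to check that our webhook is listening on port 6080 for its readiness endpoint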

$ kubectl logs -n cert-manager -l app.kubernetes.io/name=webhook | head -10
I0607 webhook.go:129] "msg"="using dynamic certificate generating using CA stored in Secret resource"
I0607 server.go:133] "msg"="listening for insecure healthz connections"  "address"=":6081" ❌
I0607 server.go:197] "msg"="listening for secure connections"  "address"=":10250"
I0607 dynamic_source.go:267] "msg"="Updated serving TLS certificate"
...

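In the example above, the issue is a misconfigured readiness port. In the webhook deployment, the argument --healthz-port=6081 does not match the port configured in the readiness probe (6080).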
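The comparison between the probe port and the port found in the logs can be sketched as follows. Both function names are hypothetical, and the sed pattern assumes the log line has exactly the format shown in the example above; other cert-manager versions may log differently.

```shell
# Hypothetical helper: prints the port from the first webhook log line
# "listening for insecure healthz connections" read on stdin.
healthz_port_from_logs() {
  sed -n 's/.*listening for insecure healthz connections.*"address"=":\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: compares the readiness probe port ($1) with the
# port found in the logs on stdin and prints "match" or "mismatch: ...".
check_healthz_port() {
  probe_port="$1"
  listen_port="$(healthz_port_from_logs)"
  if [ "$probe_port" = "$listen_port" ]; then
    echo "match"
  else
    echo "mismatch: probe=$probe_port listen=$listen_port"
  fi
}
```

For instance, the logs printed by the kubectl logs command above could be piped into check_healthz_port 6080.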

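Error: `i/o timeout` (connectivity issue)

This error message was reported 26 times on Slack. To list these messages, run the search in:#cert-manager in:#cert-manager-dev "443: i/o timeout". The error message was also reported in 2 GitHub issues (#2811, #4073).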

Error from server (InternalError): error when creating "STDIN": Internal error occurred:
  failed calling webhook "webhook.cert-manager.io": failed to call webhook:
    Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
      dial tcp 10.0.0.69:443: i/o timeout

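When the API server tries to talk to the cert-manager webhook, the SYN packet never gets an answer and the connection times out. If we were to run tcpdump inside the webhook's network namespace, we would see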

192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]

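This issue is caused by the SYN packet being dropped somewhere.

Cause 1: GKE private cluster

The default Helm configuration should work with GKE private clusters, but changing securePort might break it.

For context, unlike public GKE clusters, where the control plane can freely talk to pods over any TCP port, the control plane in private GKE clusters can only talk to pods in worker nodes over TCP ports 10250 and 443. These two open ports refer to the containerPort inside the pod, not to the port called port in the Service resource.

For this to work, the containerPort inside the Deployment must match either 10250 or 443; containerPort is configured by the Helm value webhook.securePort. By default, webhook.securePort is set to 10250.

To see whether something is off with the containerPort, let's start by looking at the Service resource.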

kubectl get svc -n cert-manager cert-manager-webhook -oyaml

Looking at the output, we see that targetPort is set to "https".

apiVersion: v1
kind: Service
metadata:
  name: cert-manager-webhook
spec:
  ports:
  - name: https
    port: 443           # ❌ This port is not the cause.
    protocol: TCP
    targetPort: "https" # 🌟 This port might be the cause.

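The port: 443 above can't be the cause because kube-proxy, which also runs on the control plane node, translates the webhook's cluster IP to a pod IP and translates port: 443 to the value in containerPort.

To see what is behind the target port "https", we look at the Deployment resource.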

kubectl get deploy -n cert-manager cert-manager-webhook -oyaml | grep -A3 ports:

The output shows that containerPort is not set to 10250, which means a new firewall rule has to be added in Google Cloud.

ports:
        - containerPort: 12345 # 🌟 This port matches neither 10250 nor 443.
          name: https
          protocol: TCP

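To recap, if the containerPort above is something other than 443 or 10250 and you prefer not to change containerPort to 10250, you will have to add a new firewall rule. You can read the section "Adding a firewall rule in a GKE private cluster" in the Google documentation.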
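The rule above can be expressed as a small check. This is only a sketch: gke_private_port_allowed is a hypothetical helper name, and the two allowed ports are the ones stated in this section.

```shell
# Hypothetical helper: succeeds when the given containerPort is one of the
# two ports this guide lists as reachable from a private GKE control plane.
gke_private_port_allowed() {
  case "$1" in
    443|10250) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints a short verdict for the given containerPort.
report_gke_port() {
  if gke_private_port_allowed "$1"; then
    echo "containerPort $1: no extra firewall rule needed"
  else
    echo "containerPort $1: add a firewall rule or change webhook.securePort"
  fi
}
```

Calling report_gke_port 12345 with the containerPort from the example above reports that a firewall rule is needed.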

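For context, the reason we did not default securePort to 443 is that binding to 443 requires one additional Linux capability (NET_BIND_SERVICE); 10250, on the other hand, doesn't require any additional capability.

Cause 2: EKS on a custom CNI

If you are on EKS and using a custom CNI such as Weave or Calico, the Kubernetes API server (which runs on its own node) might not be able to reach the webhook pod. This happens because the control plane cannot be configured to run on a custom CNI on EKS, meaning that the CNI cannot enable connectivity between the API server and the pods running in the worker nodes.

Assuming you are using Helm, the workaround is to add the following values to your values.yaml file.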

webhook:
  hostNetwork: true
  securePort: 10260

Or, if you are using Helm from the command line, use the following flags.

--set webhook.hostNetwork=true --set webhook.securePort=10260

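By setting hostNetwork to true, the webhook pod runs in the host's network namespace. Running in the host's network namespace makes the webhook pod reachable over the node's IP, which works around the issue of kube-apiserver being unable to reach any pod IP or cluster IP.

By setting securePort to 10260 instead of relying on the default value (10250), you prevent a conflict between the webhook and the kubelet. The kubelet, an agent that runs on every Kubernetes worker node directly on the host, uses port 10250 to expose its internal API to kube-apiserver.

To understand how hostNetwork and securePort interact, we have to look at how a TCP connection is established. When the kube-apiserver process tries to connect to the webhook pod, kube-proxy (which runs on control plane nodes too, even without a CNI) kicks in and translates the webhook's cluster IP to the webhook's host IP.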

https://cert-manager-webhook.cert-manager.svc:443/validate
            |
            |Step 1: resolve to the cluster IP
            v
   https://10.43.103.211:443/validate
            |
            |Step 2: send TCP packet
            v
   src: 172.28.0.1:43021
   dst: 10.43.103.211:443
            |
            |Step 3: kube-proxy rewrite  (cluster IP to host IP)
            v
   src: 172.28.0.1:43021
   dst: 172.28.0.2:10260
            |
            |                              control-plane node
            |                           (host IP: 172.28.0.1)
------------|--------------------------------------------------
            |                           (host IP: 172.28.0.2)
            v                                     worker node
   +-------------------+
   | webhook pod       |
   | listens on        |
   | 172.28.0.2:10260  |
   +-------------------+

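The reason 10250 is used as the default securePort is that it works around another limitation of GKE private clusters, as detailed in the section "Cause 1: GKE private cluster" above.

Cause 3: Network policies, Calico

Assuming you are using the Helm chart with the default value of webhook.securePort (10250), and that you are using a network policy controller such as Calico, check that a policy exists allowing traffic from the API server to the webhook pod over TCP port 10250.

Cause 4: EKS and security groups

Assuming you are using the Helm chart with the default value of webhook.securePort (10250), you may want to check that your AWS security groups allow TCP traffic over port 10250 from the control plane's VPC to the workers' VPC.

Other causes

If none of the above causes apply, you will need to figure out why the webhook is unreachable.

To debug reachability issues (that is, packets being dropped), we advise using tcpdump along with Wireshark at every TCP hop. You can follow the article "Debugging Kubernetes Networking: my kube-dns is not working!" to learn how to use tcpdump with Wireshark to debug networking issues.

Error: `x509: certificate is valid for xxx.internal, not cert-manager-webhook.cert-manager.svc` (EKS with Fargate pods)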

Internal error occurred: failed calling webhook "webhook.cert-manager.io":
  Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
    x509: certificate is valid for ip-192-168-xxx-xxx.xxx.compute.internal,
    not cert-manager-webhook.cert-manager.svc

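This issue was first reported in #3237.

This is probably because you are running on EKS with Fargate enabled. Fargate creates a microVM per pod, and the VM's kernel is used to run the container in its own namespace. The catch is that each microVM gets its own kubelet. As on any Kubernetes node, the VM's port 10250 is listened to by a kubelet process. And 10250 is also the port that the cert-manager webhook listens on.

That is not a problem in itself: the kubelet process and the cert-manager webhook process run in two separate network namespaces, and the ports don't clash. This is true both on traditional Kubernetes nodes and inside a Fargate microVM.

The problem arises when the API server tries to hit the Fargate pod: the microVM's host network namespace is configured to port-forward every possible port for maximum compatibility with traditional pods, as demonstrated in the Stack Overflow page "EKS Fargate connect to local kubelet". But port 10250 is already used by the microVM's kubelet, so anything hitting this port is not port-forwarded and reaches the kubelet instead.

To sum up, the cert-manager webhook looks healthy and, according to its logs, is able to listen on port 10250, but the microVM's host does not port-forward 10250 to the webhook's network namespace. That is why you see a message about an unexpected domain name during the TLS handshake: although the cert-manager webhook is running properly, it is the kubelet that answers the API server.

This is a limitation of Fargate's microVMs: the IP of the pod and the IP of the node are the same. It gives you the same experience as traditional pods, but it poses networking challenges.

To fix the issue, the trick is to change the port the cert-manager webhook listens on. With Helm, we can use the parameter webhook.securePort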

helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.16.1 \
  --set webhook.securePort=10260

Error: `service "cert-managercert-manager-webhook" not found`

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred:
  failed calling webhook "webhook.cert-manager.io": failed to call webhook:
    Post "https://cert-managercert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
      service "cert-managercert-manager-webhook" not found

This error was reported in 2 GitHub issues (#3195, #4999).

We do not know the cause of this error. If you happen to encounter it, please comment on one of the GitHub issues above.

Error: `no endpoints available for service "cert-manager-webhook"` (OVHcloud)

Error: INSTALLATION FAILED: Internal error occurred:
  failed calling webhook "webhook.cert-manager.io":
    Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
      no endpoints available for service "cert-manager-webhook"

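This issue was first reported in Slack (1).

This error is rare and was only seen in OVHcloud managed Kubernetes clusters, where the etcd resource quota is quite low. etcd is the database where your Kubernetes resources (such as pods and deployments) are stored. OVHcloud limits the disk space used by your resources in etcd. When the limit is reached, the whole cluster starts behaving incorrectly, and one symptom is that Endpoint resources are no longer created.

To verify that it is indeed a quota problem, you should be able to see the following messages in your kube-apiserver logs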

rpc error: code = Unknown desc = ETCD storage quota exceeded
rpc error: code = Unknown desc = quota computation: etcdserver: not capable
rpc error: code = Unknown desc = The OVHcloud storage quota has been reached

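The workaround is to remove some resources, such as CertificateRequest resources, to get back under the quota limit, as explained in OVHcloud's page "ETCD Quotas error, troubleshooting".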
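A minimal sketch for spotting these messages in a saved log file follows. The function name is hypothetical, the patterns are copied from the three messages listed above, and how you obtain the kube-apiserver logs depends on your provider and is not shown.

```shell
# Hypothetical helper: succeeds when the log text on stdin contains one of
# the three etcd quota messages listed in this section.
has_etcd_quota_error() {
  grep -E -q 'ETCD storage quota exceeded|quota computation: etcdserver: not capable|storage quota has been reached'
}
```

For example, has_etcd_quota_error < apiserver.log && echo "etcd quota reached", where apiserver.log is a placeholder file name.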

Error: `x509: certificate has expired or is not yet valid`

This error message was reported once in Slack (1).

It appears when running kubectl apply

Internal error occurred: failed calling webhook "webhook.cert-manager.io":
  Post https://kubernetes.default.svc:443/apis/webhook.cert-manager.io/v1beta1/mutations?timeout=30s:
    x509: certificate has expired or is not yet valid

Please reply to the Slack message above, since we are still unsure what causes this issue; to get access to the Kubernetes Slack, visit https://slack.k8s.io/.

Error: `net/http: request canceled while waiting for connection`

Error from server (InternalError): error when creating "STDIN":
  Internal error occurred: failed calling webhook "webhook.cert-manager.io":
    Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
      net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

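This error message was reported once in Slack (1).

Error: `context deadline exceeded`

This error message was reported in GitHub issues (2319, 2706, 5189, 5004) and once on Stack Overflow (1).

This error appears with cert-manager 0.12 and above when trying to apply an Issuer or any other cert-manager custom resource after having installed or upgraded cert-manager.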

Error from server (InternalError): error when creating "STDIN":
  Internal error occurred: failed calling webhook "webhook.cert-manager.io":
    Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
      context deadline exceeded

ℹ️ In older releases of cert-manager (0.11 and below), the webhook relied on the APIService mechanism, and the message looked slightly different, but the cause is the same.
Error from server (InternalError): error when creating "STDIN":
  Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io":
    Post https://kubernetes.default.svc:443/apis/webhook.certmanager.k8s.io/v1beta1/mutations?timeout=30s:
      context deadline exceeded

ℹ️ The message context deadline exceeded also appears when using cmctl check api. The cause is identical, and you can continue reading this section to debug it.
Not ready: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook:
  Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
    context deadline exceeded

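The trouble with the message context deadline exceeded is that it hides which part of the HTTP interaction timed out. It might be the DNS resolution, the TCP handshake, the TLS handshake, sending the HTTP request, or receiving the HTTP response.

ℹ️ For context, the query parameter ?timeout=30s that you can see in the error messages above is a timeout that the API server decides when calling the webhook. It is often set to 10 or 30 seconds.

The first step in debugging this issue is to make sure the timeoutSeconds field on the cert-manager mutating and validating webhook configurations is set to 30 seconds (the maximum value). By default, it is set to 10 seconds, which means context deadline exceeded may hide the other, more specific timeout messages. To check the value of the timeoutSeconds field, run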

$ kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations cert-manager-webhook \
  -ojsonpath='{.items[*].webhooks[*].timeoutSeconds}'
10 10

This means that both webhooks are configured with a 10-second context timeout. To configure them to 30 seconds, run

kubectl patch mutatingwebhookconfigurations,validatingwebhookconfigurations cert-manager-webhook \
  --type=json -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'
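To confirm that the patch took effect, the values printed by the jsonpath query above can be checked with a small helper. This is a sketch under the assumption that the query prints whitespace-separated integers, as in the "10 10" example; all_timeouts_are is a hypothetical name.

```shell
# Hypothetical helper: succeeds only when stdin holds at least one value
# and every whitespace-separated value equals the wanted timeout ($1).
all_timeouts_are() {
  want="$1"
  seen=0
  for value in $(cat); do
    seen=1
    [ "$value" = "$want" ] || return 1
  done
  [ "$seen" -eq 1 ]
}
```

Piping the output of the kubectl get command above into all_timeouts_are 30 succeeds once both webhook configurations use 30 seconds.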

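The following diagram shows the three errors that may be hidden behind the catch-all context deadline exceeded error message (represented by the outer box), which is thrown after 30 seconds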

context deadline exceeded
                                                                          |
                                 30 seconds                               |
                                  timeout                                 v
+-------------------------------------------------------------------------+
|                                                                         |
|       i/o timeout                                                       |
|            |        net/http: TLS handshake timeout                     |
| 10 seconds |                     |                                      |
|  timeout   v                     |                                      |
|------------+      30 seconds     |           net/http: request canceled |
|TCP         |       timeout       v           while awaiting headers     |
|handshake   +---------------------+                         |            |
|------------|      TLS            |                         |            |
|            |      handshake      +------------+ 10 seconds |            |
|            +---------------------|  sending   |  timeout   v            |
|                                  |  request   +------------+            |
|                                  +------------|receiving   |------+     |
|                                               |resp. header| recv.|     |
|                                               +------------+ resp.|     |
|                                                            | body +-----+
|                                                            +------|other|
|                                                                   |logic|
|                                                                   +-----+
+-------------------------------------------------------------------------+

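In the rest of this section, we will try to trigger one of the three "more specific" errors

i/o timeout is the TCP handshake timeout and comes from the DialTimeout of the Kubernetes API server. Name resolution may be the cause but, usually, this message appears after the API server has sent the SYN packet and waited 10 seconds to receive a SYN-ACK packet from the cert-manager webhook.
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) is the HTTP response timeout and is configured to 30 seconds. The Kubernetes API server has already sent the HTTP request and is waiting for the HTTP response headers (for example, HTTP/1.1 200 OK).
net/http: TLS handshake timeout means that the TCP handshake completed, and that the Kubernetes API server sent the initial TLS handshake packet (ClientHello) and waited 10 seconds for the cert-manager webhook to answer with the ServerHello packet.

We can sort these three messages into two categories: either it is a connectivity issue (the SYN packet is dropped), or it is a webhook-side issue (that is, the TLS certificate is wrong or the webhook is not returning any HTTP response)

Timeout message	Category
`i/o timeout`	connectivity issue
`net/http: TLS handshake timeout`	webhook-side issue
`net/http: request canceled while awaiting headers`	webhook-side issue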
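The table above can be turned into a small lookup. The sketch below is illustrative only: classify_timeout is a hypothetical function, and the category labels are shortened versions of the ones in the table.

```shell
# Hypothetical helper: maps an API server error message ($1) to the
# category given in the table above.
classify_timeout() {
  case "$1" in
    *"i/o timeout"*) echo "connectivity" ;;
    *"TLS handshake timeout"*) echo "webhook-side" ;;
    *"request canceled"*"awaiting headers"*) echo "webhook-side" ;;
    *"context deadline exceeded"*) echo "unknown (raise timeoutSeconds to 30)" ;;
    *) echo "unrecognized" ;;
  esac
}
```

For example, classify_timeout "dial tcp 10.0.0.69:443: i/o timeout" prints connectivity.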

The first step is to rule out a webhook-side issue. In your shell session, run the following

kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250

In another shell session, check that you can reach the webhook

curl -vsS --resolve cert-manager-webhook.cert-manager.svc:10250:127.0.0.1 \
    --service-name cert-manager-webhook-ca \
    --cacert <(kubectl -n cert-manager get secret cert-manager-webhook-ca -ojsonpath='{.data.ca\.crt}' | base64 -d) \
    https://cert-manager-webhook.cert-manager.svc:10250/validate 2>&1 -d@- <<'EOF' | sed '/^* /d; /bytes data]$/d; s/> //; s/< //'
{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"requestKind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"requestResource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"name":"foo","namespace":"default","operation":"CREATE","object":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","spec":{"dnsNames":["foo"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"foo","usages":["digital signature"]}}}}
EOF

The expected output looks like this

POST /validate HTTP/1.1
Host: cert-manager-webhook.cert-manager.svc:10250
User-Agent: curl/7.83.0
Accept: */*
Content-Length: 1299
Content-Type: application/x-www-form-urlencoded

HTTP/1.1 200 OK
Date: Wed, 08 Jun 2022 14:52:21 GMT
Content-Length: 2029
Content-Type: text/plain; charset=utf-8

...
"response": {
  "uid": "",
  "allowed": true
}

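If the response shows 200 OK, we can rule out a webhook-side issue. Since the initial error message was context deadline exceeded, and not an API-server-side issue such as x509: certificate signed by unknown authority or x509: certificate has expired or is not yet valid, we can conclude that the problem is a connectivity issue: the Kubernetes API server is not able to establish a TCP connection to the cert-manager webhook. Please follow the instructions in the section "Error: i/o timeout (connectivity issue)" above to continue debugging.

Error: `net/http: TLS handshake timeout`

This error message was reported in 1 GitHub issue (#2602).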

Error from server (InternalError): error when creating "STDIN":
  Internal error occurred: failed calling webhook "webhook.cert-manager.io":
    Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
      net/http: TLS handshake timeout

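Looking at the diagram above, this error message indicates that the Kubernetes API server successfully established a TCP connection to the pod IP associated with the cert-manager webhook. The TLS handshake timeout means that the cert-manager webhook process is not the one terminating the TCP connection: there might be an HTTP proxy in between, which may be waiting for a plain HTTP request instead of a ClientHello packet.

We do not know the cause of this error. If you encounter it, please comment on the GitHub issue above.

Error: `HTTP probe failed with statuscode: 500`

This error message was reported in 2 GitHub issues (#3185, #4557).

This error message is visible as an event on the cert-manager webhook.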

Warning  Unhealthy  <invalid> (x13 over 15s)  kubelet, node83
  Readiness probe failed: HTTP probe failed with statuscode: 500

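We do not know the cause of this error. If you encounter it, please comment on one of the GitHub issues above.

Error: `Service Unavailable`

This error was reported in 1 GitHub issue (#4281).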

Error from server (InternalError): error when creating "STDIN": Internal error occurred:
  failed calling webhook "webhook.cert-manager.io":
    Post "https://my-cert-manager-webhook.default.svc:443/mutate?timeout=10s":
      Service Unavailable

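The message above appears in Kubernetes clusters using the Weave CNI.

We do not know the cause of this error. If you encounter it, please comment on the GitHub issue above.

Error: `failed calling admission webhook: the server is currently unable to handle the request`

This issue was reported in 4 GitHub issues (1369, 1425, 3542, 4852).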

Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred:
  failed calling admission webhook "issuers.admission.certmanager.k8s.io":
    the server is currently unable to handle the request

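We do not know the cause of this error. If you are able to reproduce it, please comment on any of the GitHub issues above.

Error: `x509: certificate signed by unknown authority`

Reported in a GitHub issue (2602).

This error appears when installing or upgrading cert-manager in a namespace other than cert-manager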

Error: UPGRADE FAILED: release core-l7 failed, and has been rolled back due to atomic being set:
  failed to create resource: conversion webhook for cert-manager.io/v1alpha3, Kind=ClusterIssuer failed:
    Post https://cert-manager-webhook.core-l7.svc:443/convert?timeout=30s:
      x509: certificate signed by unknown authority

A very similar error message may show when creating an Issuer or any other cert-manager custom resource.

Internal error occurred: failed calling webhook "webhook.cert-manager.io":
  Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
    x509: certificate signed by unknown authority

With cmctl install and cmctl check api, you might see the following error message.

2022/06/06 15:36:30 Not ready: the cert-manager webhook CA bundle is not injected yet
  (Internal error occurred: conversion webhook for cert-manager.io/v1alpha2, Kind=Certificate failed:
    Post "https://<company_name>-cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s":
      x509: certificate signed by unknown authority)

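If you are using cert-manager 0.14 or below with Helm and are installing in a namespace other than cert-manager, the issue is that the namespace name cert-manager is hardcoded in the CRD manifests. You can see the hardcoded namespace in the following annotation.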

kubectl get crd issuers.cert-manager.io -oyaml | grep inject

You will see the following.

cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-ca
#                                      ^^^^^^^^^^^^
#                                       hardcoded

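**Note 1:** this bug in the cert-manager Helm chart was fixed in cert-manager 0.15.

**Note 2:** since cert-manager 1.6, this annotation is no longer used on the cert-manager CRDs because conversion is no longer needed.

If you are still using cert-manager 0.14 or below, the solution is to render the manifests using helm template, edit the annotation to use the correct namespace, and then install cert-manager using kubectl apply.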
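The annotation edit can be sketched as a stream filter. This is an assumption-laden example: rewrite_inject_namespace is a hypothetical helper, and the sed pattern assumes the annotation is rendered unquoted, exactly as in the grep output above. Check the rendered manifests before relying on it.

```shell
# Hypothetical helper: rewrites the hardcoded "cert-manager" namespace in
# the inject-ca-from-secret annotation to the namespace given as $1.
# Reads manifests on stdin and writes the edited manifests to stdout.
rewrite_inject_namespace() {
  sed "s#\(cert-manager\.io/inject-ca-from-secret: \)cert-manager/#\1$1/#"
}
```

The rendered output of helm template could be piped through rewrite_inject_namespace core-l7 (core-l7 being the namespace from the error message above) before being passed to kubectl apply.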

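If you are using cert-manager 1.6 or below, the issue might be that the cainjector is stuck while trying to inject the self-signed certificate, which the cert-manager webhook created and stored in the Secret resource cert-manager-webhook-ca, into the spec.caBundle field of the cert-manager CRDs. The first step is to check that the cainjector is running without problems.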

$ kubectl -n cert-manager get pods -l app.kubernetes.io/name=cainjector
NAME                                       READY   STATUS    RESTARTS       AGE
cert-manager-cainjector-5c55bb7cb4-6z4cf   1/1     Running   11 (31h ago)   28d

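Looking at the logs, you will be able to tell whether the leader election worked. It can take up to a minute for the leader election to complete.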

I0608 start.go:126] "starting" version="v1.8.0" revision="e466a521bc5455def8c224599c6edcd37e86410c"
I0608 leaderelection.go:248] attempting to acquire leader lease kube-system/cert-manager-cainjector-leader-election...
I0608 leaderelection.go:258] successfully acquired lease kube-system/cert-manager-cainjector-leader-election
I0608 controller.go:186] cert-manager/secret/customresourcedefinition/controller/controller-for-secret-customresourcedefinition "msg"="Starting Controller"
I0608 controller.go:186] cert-manager/certificate/customresourcedefinition/controller/controller-for-certificate-customresourcedefinition "msg"="Starting Controller"
I0608 controller.go:220] cert-manager/secret/customresourcedefinition/controller/controller-for-secret-customresourcedefinition "msg"="Starting workers"  "worker count"=1
I0608 controller.go:220] cert-manager/certificate/customresourcedefinition/controller/controller-for-certificate-customresourcedefinition "msg"="Starting workers"  "worker count"=1

A successful output contains lines like the following.

I0608 sources.go:184] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
  "msg"="Extracting CA from Secret resource" "resource_name"="issuers.cert-manager.io" "secret"="cert-manager/cert-manager-webhook-ca"
I0608 controller.go:178] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
  "msg"="updated object" "resource_name"="issuers.cert-manager.io"

Now, look for any message indicating that the Secret resource created by the cert-manager webhook can't be loaded. Messages like the following two might show up.

E0608 sources.go:201] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
  "msg"="unable to fetch associated secret" "error"="Secret \"cert-manager-webhook-caq\" not found"

The following message indicates that the given CRD has been skipped because the annotation is missing. You can ignore these messages.

I0608 controller.go:156] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
  "msg"="failed to determine ca data source for injectable" "resource_name"="challenges.acme.cert-manager.io"

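If nothing seems wrong in the cainjector logs, you will want to check that the caBundle field is correct in the validating, mutating, and conversion configurations. The Kubernetes API server uses the contents of that field to trust the cert-manager webhook. The caBundle contains the self-signed CA created by the cert-manager webhook when it started.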

$ kubectl get validatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig'
{
  "caBundle": "LS0tLS1...LS0tLS0K",
  "service": {
    "name": "cert-manager-webhook",
    "namespace": "cert-manager",
    "path": "/validate",
    "port": 443
  }
}

$ kubectl get mutatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig'
{
  "caBundle": "LS0tLS1...RFLS0tLS0K",
  "service": {
    "name": "cert-manager-webhook",
    "namespace": "cert-manager",
    "path": "/mutate",
    "port": 443
  }
}

Let's look at the contents of the caBundle.

$ kubectl get mutatingwebhookconfigurations cert-manager-webhook -ojson \
  | jq '.webhooks[].clientConfig.caBundle' -r | base64 -d \
  | openssl x509 -noout -text -in -

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            ee:8f:4f:c8:55:7b:16:76:d8:6a:a2:e5:94:bc:7c:6b
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: CN = cert-manager-webhook-ca
        Validity
            Not Before: May 10 16:13:37 2022 GMT
            Not After : May 10 16:13:37 2023 GMT
        Subject: CN = cert-manager-webhook-ca

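Let's check that the contents of the caBundle match the CA stored in the Secret resource cert-manager-webhook-ca. The two certificates should be identical.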

$ kubectl -n cert-manager get secret cert-manager-webhook-ca -ojsonpath='{.data.ca\.crt}' \
  | base64 -d | openssl x509 -noout -text -in -

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            ee:8f:4f:c8:55:7b:16:76:d8:6a:a2:e5:94:bc:7c:6b
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: CN = cert-manager-webhook-ca
        Validity
            Not Before: May 10 16:13:37 2022 GMT
            Not After : May 10 16:13:37 2023 GMT
        Subject: CN = cert-manager-webhook-ca
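Rather than comparing the two openssl outputs by eye, the decoded certificates can be compared byte for byte. The sketch below assumes you have saved each decoded PEM (the output of the base64 -d step, without the openssl step) to a file; compare_ca_files and the file names are hypothetical. A caBundle holding extra certificates or differing only in whitespace would be reported as different.

```shell
# Hypothetical helper: prints "identical" when the two PEM files given as
# $1 and $2 are byte-for-byte equal, and "different" otherwise.
compare_ca_files() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
  fi
}
```

For example, compare_ca_files cabundle.pem secret-ca.pem, where both file names are placeholders.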

Our final test is to try connecting to the webhook using this trust bundle. Let's port-forward to the webhook pod.

kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250

In another shell session, send a /validate HTTP request with the following command.

curl -vsS --resolve cert-manager-webhook.cert-manager.svc:10250:127.0.0.1 \
    --service-name cert-manager-webhook-ca \
    --cacert <(kubectl get validatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig.caBundle' -r | base64 -d) \
    https://cert-manager-webhook.cert-manager.svc:10250/validate 2>&1 -d@- <<'EOF' | sed '/^* /d; /bytes data]$/d; s/> //; s/< //'
{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"requestKind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"requestResource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"name":"foo","namespace":"default","operation":"CREATE","object":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","spec":{"dnsNames":["foo"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"foo","usages":["digital signature"]}}}}
EOF

You should see a successful HTTP request and response.

POST /validate HTTP/1.1
Host: cert-manager-webhook.cert-manager.svc:10250
User-Agent: curl/7.83.0
Accept: */*
Content-Length: 1299
Content-Type: application/x-www-form-urlencoded

HTTP/1.1 200 OK
Date: Wed, 08 Jun 2022 16:20:45 GMT
Content-Length: 2029
Content-Type: text/plain; charset=utf-8

...

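Error: `cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied`

This message was reported in GitHub issue 3717.

While installing cert-manager on GKE Autopilot, you will see the following message.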

Error: rendered manifests contain a resource that already exists. Unable to continue with install:
  could not get information about the resource:
    mutatingwebhookconfigurations.admissionregistration.k8s.io "cert-manager-webhook" is forbidden:
      User "XXXX" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope:
        GKEAutopilot authz: cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied

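This error message appears when using GKE Autopilot with Kubernetes 1.20 and below. It is due to a restriction on mutating admission webhooks in GKE Autopilot.

As of October 2021, the "rapid" Autopilot release channel has rolled out version 1.21 for Kubernetes masters. Installation via the Helm chart may end in an error message, but some users report that cert-manager is working. Feedback and PRs are welcome.

Error: `the namespace "kube-system" is managed and the request's verb "create" is denied`

When installing cert-manager on GKE Autopilot with Helm, you will see the following error message.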

Not ready: the cert-manager webhook CA bundle is not injected yet

After this failure, you should still see the three pods running without problems.

$ kubectl get pods -n cert-manager
NAME                                      READY   STATUS    RESTARTS   AGE
cert-manager-76578c9687-24kmr             1/1     Running   0          47m
cert-manager-cainjector-b7d47f746-4799n   1/1     Running   0          47m
cert-manager-webhook-7f788c5b6-mspnt      1/1     Running   0          47m

But looking at any of the logs, you will see the following error message.

E0425 leaderelection.go:334] error initially creating leader election record:
  leases.coordination.k8s.io is forbidden: User "system:serviceaccount:cert-manager:cert-manager-webhook"
    cannot create resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system":
      GKEAutopilot authz: the namespace "kube-system" is managed and the request's verb "create" is denied

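This is due to a limitation of GKE Autopilot: it is not possible to create resources in the kube-system namespace, and cert-manager uses the well-known kube-system namespace to manage leader election. To get around the limitation, you can tell Helm to use a different namespace for leader election.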

helm install cert-manager jetstack/cert-manager --version 1.8.0 \
  --namespace cert-manager --create-namespace \
  --set global.leaderElection.namespace=cert-manager
