Building a Production-Ready K3s Control Plane on External Etcd With Full TLS
This artifact documents the complete process I followed to design, build, secure, and troubleshoot a highly available external etcd cluster and integrate it with a K3s Kubernetes control plane, using:
- Mutual TLS authentication
- Static IP addressing
- Explicit PKI lifecycle management
- Etcd cluster health validation
- K3s server bootstrapping via external datastore
- Disaster recovery–aligned configuration
- Real troubleshooting of degraded raft nodes and certificate failures
This mirrors the real-world operational skills expected in DevOps, SRE, and Platform Engineering roles.
1. Architecture Summary
Nodes
| Node | Role | IP |
|---|---|---|
| node-01 | etcd-1 | 172.16.0.2 |
| node-02 | etcd-2 | 172.16.0.3 |
| node-03 | etcd-3 | 172.16.0.4 |
| node-04 | K3s server (control plane) | 172.16.0.5 |
Topology
```
+------------------+         +------------------+
|      etcd-1      |<------->|      etcd-2      |
| 172.16.0.2:2380  |<------->| 172.16.0.3:2380  |
+--------^---------+         +---------^--------+
         ^                             ^
         |     peer traffic (2380)     |
         v                             v
             +------------------+
             |      etcd-3      |
             | 172.16.0.4:2380  |
             +--------^---------+
                      ^
                      |  client traffic (2379)
                      v
             +------------------+
             |    K3s Server    |
             |    172.16.0.5    |
             |  External etcd   |
             +------------------+
```
The control plane depends on external etcd via:
```
etcd://172.16.0.2:2379,172.16.0.3:2379,172.16.0.4:2379
```
2. Skills Demonstrated
✔ Distributed Systems
- Built a 3-node etcd cluster with full Raft consensus.
- Validated cluster health, election behavior, and endpoint availability.
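For reference, a sketch of the checks behind that validation (cert paths follow the layout described in section 3; flags and `ETCDCTL_*` environment variables are standard etcdctl v3):

```bash
# Reuse one client cert bundle for every check
export ETCDCTL_ENDPOINTS="https://172.16.0.2:2379,https://172.16.0.3:2379,https://172.16.0.4:2379"
export ETCDCTL_CACERT=/etc/etcd/tls/ca.crt
export ETCDCTL_CERT=/etc/etcd/tls/etcd-1.crt
export ETCDCTL_KEY=/etc/etcd/tls/etcd-1.key

etcdctl member list -w table       # cluster membership
etcdctl endpoint status -w table   # leader flag and raft term per endpoint
etcdctl endpoint health -w table   # per-endpoint health
```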
✔ PKI + Mutual TLS Authentication
- Generated:
  - Cluster CA
  - Per-node private keys
  - Per-node signed certificates with correct SAN entries
- Investigated and resolved TLS failures due to incorrect IP SANs.
✔ Linux Systems Engineering
- Managed systemd units for etcd and k3s.
- Corrected file ownership and secure permissions (`600`, `644`).
- Used journald logs for deep debugging.
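Representative commands for that day-to-day work (unit names assume `etcd.service` and `k3s.service`):

```bash
# Service lifecycle
sudo systemctl daemon-reload
sudo systemctl restart etcd
sudo systemctl status etcd --no-pager

# Deep debugging via journald
sudo journalctl -u etcd -f                               # follow live logs
sudo journalctl -u k3s --since "10 min ago" --no-pager   # recent k3s output
```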
✔ K3s Control Plane Bootstrapping
- Integrated external etcd into K3s using `config.yaml`.
- Corrected datastore URI formats (`etcd://`).
- Ensured k3s-server successfully registered itself as a Kubernetes node.
✔ Troubleshooting & Root Cause Analysis
- Diagnosed “tls: bad certificate” errors.
- Compared CSR SAN values against real host IPs.
- Re-issued and redeployed certificates.
- Validated fix by checking Raft health and Kubernetes readiness.
✔ Real DevOps Discipline
- Iterated systematically:
  - Observe failure
  - Interpret logs
  - Form hypothesis
  - Validate
  - Apply fix
  - Re-run tests

Exactly how production outages are handled.
3. What I Built (Step-by-Step Summary)
3.1. Created PKI Infrastructure
Generated a CA, node CSRs, and signed certificates:
```bash
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -key ca.key -subj "/CN=etcd-ca" -days 3650 -out ca.crt
```

Each etcd node received its own unique cert + key with SAN entries matching:

```
DNS:etcd-X
IP:172.16.0.X
```
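The per-node issuance steps aren't reproduced above; a minimal sketch for etcd-2 (file names and the SAN extension file are illustrative, not the exact commands used):

```bash
# Per-node key + CSR (example for etcd-2; adjust name/IP for each node)
openssl genrsa -out etcd-2.key 4096
openssl req -new -key etcd-2.key -subj "/CN=etcd-2" -out etcd-2.csr

# SAN extension file -- a wrong IP here is exactly what triggers "tls: bad certificate"
cat > etcd-2-san.cnf <<EOF
subjectAltName = DNS:etcd-2, IP:172.16.0.3, IP:127.0.0.1
extendedKeyUsage = serverAuth, clientAuth
EOF

# Sign with the cluster CA
openssl x509 -req -in etcd-2.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 3650 -extfile etcd-2-san.cnf -out etcd-2.crt
```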
3.2. Deployed TLS-Secured Etcd Cluster
Each node received:
```
/etc/etcd/tls/ca.crt
/etc/etcd/tls/etcd-N.crt
/etc/etcd/tls/etcd-N.key
```

Correct permissions enforced:
- `600` on private keys
- `644` on public certs
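A sketch of how those permissions were enforced (assuming etcd runs as an `etcd` service user; adjust to your service account):

```bash
sudo chown -R etcd:etcd /etc/etcd/tls
sudo chmod 644 /etc/etcd/tls/ca.crt /etc/etcd/tls/etcd-*.crt
sudo chmod 600 /etc/etcd/tls/etcd-*.key

# Confirm the result
ls -l /etc/etcd/tls
```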
3.3. Brought Up the Cluster & Validated Health
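The member configuration itself isn't reproduced in this artifact; a minimal sketch of what etcd-1 ran with, assuming one cert per node reused for client and peer TLS (data dir, cluster token, and the systemd wrapping are assumptions):

```bash
# etcd-1 startup flags (typically placed in the systemd unit's ExecStart)
etcd --name etcd-1 \
  --data-dir /var/lib/etcd \
  --listen-client-urls https://172.16.0.2:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://172.16.0.2:2379 \
  --listen-peer-urls https://172.16.0.2:2380 \
  --initial-advertise-peer-urls https://172.16.0.2:2380 \
  --initial-cluster "etcd-1=https://172.16.0.2:2380,etcd-2=https://172.16.0.3:2380,etcd-3=https://172.16.0.4:2380" \
  --initial-cluster-state new \
  --initial-cluster-token etcd-cluster-01 \
  --cert-file /etc/etcd/tls/etcd-1.crt \
  --key-file /etc/etcd/tls/etcd-1.key \
  --trusted-ca-file /etc/etcd/tls/ca.crt \
  --client-cert-auth \
  --peer-cert-file /etc/etcd/tls/etcd-1.crt \
  --peer-key-file /etc/etcd/tls/etcd-1.key \
  --peer-trusted-ca-file /etc/etcd/tls/ca.crt \
  --peer-client-cert-auth
```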
```bash
etcdctl --endpoints="https://172.16.0.2:2379,https://172.16.0.3:2379,https://172.16.0.4:2379" \
  --cacert=/etc/etcd/tls/ca.crt \
  --cert=/etc/etcd/tls/etcd-1.crt \
  --key=/etc/etcd/tls/etcd-1.key \
  endpoint health -w table
```

Final expected output:

| Endpoint | Health |
|---|---|
| https://172.16.0.2:2379 | true |
| https://172.16.0.3:2379 | true |
| https://172.16.0.4:2379 | true |
3.4. Integrated K3s with External Etcd
Created `/etc/rancher/k3s/config.yaml`:

```yaml
datastore-endpoint: "etcd://172.16.0.2:2379,172.16.0.3:2379,172.16.0.4:2379"
etcd-cafile: "/etc/rancher/k3s/ca.crt"
etcd-certfile: "/etc/rancher/k3s/etcd-1.crt"
etcd-keyfile: "/etc/rancher/k3s/etcd-1.key"
```
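The client cert bundle referenced above has to exist on node-04 first; a hedged sketch of staging it (source paths assume the files generated on etcd-1; the copy method itself is an assumption):

```bash
# Run on node-04: copy the CA plus one client cert/key pair from etcd-1
sudo mkdir -p /etc/rancher/k3s
scp root@172.16.0.2:/etc/etcd/tls/ca.crt     /tmp/ca.crt
scp root@172.16.0.2:/etc/etcd/tls/etcd-1.crt /tmp/etcd-1.crt
scp root@172.16.0.2:/etc/etcd/tls/etcd-1.key /tmp/etcd-1.key
sudo install -m 644 /tmp/ca.crt /tmp/etcd-1.crt /etc/rancher/k3s/
sudo install -m 600 /tmp/etcd-1.key /etc/rancher/k3s/
```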
Installed K3s:

```bash
curl -sfL https://get.k3s.io | sudo sh -s - server
```

3.5. Validated Kubernetes Control Plane
```bash
sudo k3s kubectl get nodes
```

Result:
```
NAME      STATUS   ROLES                  AGE   VERSION
node-04   Ready    control-plane,master   57s   v1.33.6+k3s1
```
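As an extra confirmation (not part of the original run log) that the control plane is really writing to the external datastore, Kubernetes keys can be listed straight from etcd:

```bash
# Run from any etcd node; Kubernetes state lives under the /registry prefix
etcdctl --endpoints="https://172.16.0.2:2379" \
  --cacert=/etc/etcd/tls/ca.crt \
  --cert=/etc/etcd/tls/etcd-1.crt \
  --key=/etc/etcd/tls/etcd-1.key \
  get /registry/ --prefix --keys-only --limit=10
```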
4. Root Cause Analysis (Major Learning)
During the build, I intentionally kept a record of logs, mistakes, and resolutions. One real production-grade issue occurred:
❌ etcd peers rejected connections: `tls: bad certificate`
Cause
A certificate SAN contained the wrong IP:

```
IP Address:176.16.0.3   # incorrect -- should be 172.16.0.3
```
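A sketch of the comparison that exposes this class of mistake (run on the affected node; file names follow the layout above, and `-ext` requires OpenSSL 1.1.1+):

```bash
# SANs actually present in the deployed cert
openssl x509 -in /etc/etcd/tls/etcd-2.crt -noout -ext subjectAltName

# IPs actually bound on the host -- the SAN must match one of these
ip -4 -brief addr show
```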
Fix
- Regenerated CSR
- Reissued certificate
- Deployed correct cert bundle
- Restarted etcd
- Revalidated health
Lesson
**The overwhelming majority of etcd TLS failures come down to incorrect SAN values. Always verify SANs before signing certificates.**