Couldn’t start a Stolon PostgreSQL Kubernetes StatefulSet

I was troubleshooting a Stolon PostgreSQL deployment on a Kubernetes cluster recently – there’d been an outage on the cluster, and the stolon pods wouldn’t restart. Specifically, the keeper component would not start. Stolon is made up of sentinel nodes, proxy nodes and keeper nodes; the sentinels and proxies were starting, but the keepers were not.

kubectl get pods | grep keeper
postgres-keeper-0    0/1     CrashLoopBackOff     40          9s

The keeper-0 pod was cycling between the Init state and CrashLoopBackOff as the container repeatedly failed. The keeper pods are orchestrated by a StatefulSet, which by default requires each pod to pass its readiness check before the next pod is created.
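That serialized startup comes from the StatefulSet’s pod management policy, which defaults to OrderedReady: pod N+1 isn’t created until pod N is Running and Ready. You can check the policy on the StatefulSet itself (a sketch, assuming it’s named keeper as in the commands below):

```shell
# Show the StatefulSet's pod management policy; OrderedReady (the default)
# blocks creation of the next pod until the previous one passes readiness.
kubectl get statefulset keeper -o jsonpath='{.spec.podManagementPolicy}'
```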

Herein lies the problem. When I connected to one of the sentinel pods to look at the cluster status, it showed that keeper2 was actually the current master. The keeper0 pod could not become ready, as it required keeper2 to be running; Kubernetes, in turn, wouldn’t start keeper2 while keeper0 was failing its readiness check.

stolonctl status --cluster-name cluster-postgres --store-backend kubernetes --kube-resource-kind configmap
=== Active sentinels ===

ID              LEADER
715a29e8        false
a63f09cb        true
fe9bc74f        false

=== Active proxies ===

ID
4e97ff73
f066ea02

=== Keepers ===

UID     HEALTHY PG LISTENADDRESS        PG HEALTHY      PG WANTEDGENERATION     PG CURRENTGENERATION
keeper0 false   10.1.123.89:5432        false            2                       2
keeper1 false   10.1.123.93:5432        false            4                       4
keeper2 false   10.1.123.208:5432       false            5                       5

=== Cluster Info ===

Master: keeper2

===== Keepers/DB tree =====

keeper2 (master)
├─keeper0
└─keeper1

When you look at the StatefulSet configuration (kubectl get statefulset keeper -o yaml), you can see the definition of the readiness check.

        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - exec pg_isready -h localhost -p 5432
          failureThreshold: 5
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
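To confirm it’s the probe itself that’s failing, you can run the same command by hand inside the failing pod – a sketch, assuming the pod name from the kubectl get pods output above:

```shell
# Run the readiness probe's command manually inside the failing keeper pod.
kubectl exec postgres-keeper-0 -- sh -c 'exec pg_isready -h localhost -p 5432'
# While the keeper can't reach the master, pg_isready reports no response
# and exits non-zero, which is exactly what fails the probe.
```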

The readiness probe requires a successful response from pg_isready. Unfortunately, pg_isready, well, wasn’t ready. Since the keeper0 instance wasn’t ready due to its inability to communicate with keeper2, Kubernetes would not instantiate the additional pods.

The logs from keeper0 (kubectl logs postgres-keeper-0) were, as you can imagine, full of failed connection attempts to keeper2.

I ended up exporting the YAML for the StatefulSet (command above), removing the readinessProbe, and recreating the StatefulSet. With the readiness probe removed, Kubernetes started the additional pods, and since the keepers could coordinate amongst themselves, the cluster recovered.
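Mechanically, stripping the probe from the exported manifest can be scripted. Here’s a sketch using jq on a minimal stand-in manifest – the real file would come from kubectl get statefulset keeper -o json, and jq being installed is an assumption:

```shell
# Minimal stand-in for the exported manifest; the real one comes from
# `kubectl get statefulset keeper -o json`.
cat > /tmp/keeper.json <<'EOF'
{"spec":{"template":{"spec":{"containers":[{"name":"keeper","readinessProbe":{"exec":{"command":["sh","-c","exec pg_isready -h localhost -p 5432"]}}}]}}}}
EOF

# Drop every container's readinessProbe; the result can then be re-applied
# (e.g. `kubectl apply -f /tmp/keeper-noprobe.json`) after deleting the
# original StatefulSet.
jq 'del(.spec.template.spec.containers[].readinessProbe)' /tmp/keeper.json > /tmp/keeper-noprobe.json
cat /tmp/keeper-noprobe.json
```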

~ Mike
