Couldn’t start a Stolon PostgreSQL Kubernetes StatefulSet

I was recently troubleshooting a Stolon PostgreSQL deployment on a Kubernetes cluster – there had been an outage on the cluster, and the Stolon pods wouldn’t restart. Specifically, the keeper component would not start. Stolon has three components: sentinel nodes, proxy nodes and keeper nodes (the keepers run the actual PostgreSQL instances). The sentinels and proxies were starting, but the keepers were not.

kubectl get pods | grep keeper
postgres-keeper-0    0/1     CrashLoopBackOff     40          9s

The keeper-0 pod was cycling between the Init state and the CrashLoopBackOff state as the container repeatedly failed. The keeper pods are orchestrated by a StatefulSet, which by default requires each pod to pass its readiness check before the next pod is created.

Herein lies the problem. When I connected to one of the sentinel pods to look at the cluster status, it showed that keeper2 was actually the current master. The keeper0 pod could not become ready, as it required keeper2 to be running – and Kubernetes wouldn’t start keeper2, because keeper0 was failing its readiness check.

stolonctl status --cluster-name cluster-postgres --store-backend kubernetes --kube-resource-kind configmap
=== Active sentinels ===

ID              LEADER
715a29e8        false
a63f09cb        true
fe9bc74f        false

=== Active proxies ===


=== Keepers ===

UID     HEALTHY      PG HEALTHY       PG WANTEDGENERATION     PG CURRENTGENERATION
keeper0 false        false            2                       2
keeper1 false        false            4                       4
keeper2 false        false            5                       5

=== Cluster Info ===

Master: keeper2

===== Keepers/DB tree =====

keeper2 (master)

When you look at the StatefulSet configuration (kubectl get statefulset keeper -o yaml), you can see the definition of the readiness check:

        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - exec pg_isready -h localhost -p 5432
          failureThreshold: 5
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

The readiness probe requires a positive response from pg_isready. Unfortunately, pg_isready, well, wasn’t ready. Since the keeper0 instance wasn’t ready due to its inability to communicate with keeper2, Kubernetes would not instantiate the additional pods.
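As an aside on how long the probe takes to fail a pod: with these settings the kubelet needs failureThreshold consecutive failures, one every periodSeconds, after the initial delay. A quick sketch of that arithmetic (the helper function is my own, not part of Kubernetes):

```python
def seconds_until_unready(period_seconds: int, failure_threshold: int,
                          initial_delay_seconds: int = 0) -> int:
    """Worst-case time before a continuously failing readiness probe
    marks the container as unready."""
    return initial_delay_seconds + period_seconds * failure_threshold

# With the probe settings above: 10s initial delay + 10s x 5 failures
print(seconds_until_unready(10, 5, 10))  # → 60
```

So keeper0 was being declared unready about a minute after each restart, then cycled back through CrashLoopBackOff.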

The logs from keeper0 (kubectl logs postgres-keeper-0) showed, as you can imagine, the failed connections to keeper2.

I ended up exporting the YAML for the StatefulSet (command above), removing the readinessProbe, and recreating the StatefulSet. With the readiness probe removed, Kubernetes started the additional pods, and since the keepers were able to coordinate amongst themselves, this resolved the issue.
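If you want to script the manifest surgery rather than edit by hand, something like the following works – a sketch assuming the StatefulSet is named postgres-keeper and exported as JSON (the function and file names here are my own, not from any tool):

```python
def strip_readiness_probes(manifest: dict) -> dict:
    """Remove the readinessProbe from every container in a
    StatefulSet manifest (parsed from kubectl's JSON output)."""
    for container in manifest["spec"]["template"]["spec"]["containers"]:
        container.pop("readinessProbe", None)
    return manifest

# Rough workflow (run the kubectl steps yourself):
#   kubectl get statefulset postgres-keeper -o json > keeper.json
#   ...load keeper.json, apply strip_readiness_probes, write it back...
#   kubectl delete statefulset postgres-keeper --cascade=orphan
#   kubectl apply -f keeper.json
```

Note that --cascade=orphan (older kubectl: --cascade=false) leaves the existing pods running while the StatefulSet itself is recreated.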

~ Mike


2 thoughts on “Couldn’t start a Stolon PostgreSQL Kubernetes StatefulSet”

  1. Thank you, that was helpful. In my case I made a minimal change: I temporarily changed the StatefulSet podManagementPolicy from OrderedReady to Parallel, killed keeper-1 so that keeper-0 became master again, then put it back.
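The commenter’s workaround amounts to recreating the StatefulSet with a spec change along these lines (a hypothetical excerpt – Parallel lets all pods start at once instead of waiting on each predecessor’s readiness check):

```yaml
# Temporary change; revert to OrderedReady (the default) once healthy
spec:
  podManagementPolicy: Parallel
```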
