On September 6th, 2023, at 11:18 UTC, an incident began when neon-console-api-1, one of our executors, encountered an issue and stopped processing new requests. The disruption significantly impacted our services: more than 150 (out of roughly 200k) projects became stuck with a "context deadline exceeded" error.
We initially responded by restarting neon-console-api-1, which temporarily alleviated the issue. However, it resurfaced shortly afterwards, this time as DNS lookup failures, which made diagnosis considerably harder.
After extensive debugging, we identified a recent change to the Kubernetes CNI configuration that was blocking DNS requests on specific nodes.
While addressing the misconfiguration, we removed the problematic CNI change, which inadvertently triggered a console-wide shutdown and left the console stuck in a restarting state. To restore service, we had to forcibly terminate pods.
A subsequent review of the audit logs revealed that one of our engineers had run an automated script while connected to the service cluster, which installed the CNI configurations.
Contributing factors (not as ‘root causes’, but as a collection of things that had to be true for this incident to take place):
150+ projects stuck and visible in the Console
PagerDuty alert triggered for the stuck projects
Most of the stuck projects were associated with the neon-console-api-1 executor ID
Restarting neon-console-api-1 resolved the immediate issue, but the logs still showed errors related to DNS lookup timeouts
Other services on the service cluster (vmagent, teleport, external-dns) were hit by the same DNS lookup errors (see the per-node check sketched below)
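Because the DNS failures were limited to specific nodes, one quick way to narrow things down is to schedule a throwaway pod on each node and run a lookup from inside it. The sketch below is illustrative only, not our actual debugging tooling; the busybox image, the lookup target, and the exact kubectl invocation are assumptions.

```python
#!/usr/bin/env python3
"""Per-node DNS spot check: schedule a throwaway busybox pod on each node and
run an nslookup from it. A rough sketch, assuming kubectl already points at
the affected cluster and ad-hoc pods are allowed in the current namespace."""

import json
import subprocess

TARGET = "kubernetes.default.svc.cluster.local"  # any in-cluster name works

def node_names():
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return [n["metadata"]["name"] for n in json.loads(out.stdout)["items"]]

def dns_ok_on(node):
    pod_name = "dns-check-" + node.replace(".", "-")[:40]
    # Pin the pod to this node so the lookup exercises that node's CNI/DNS path.
    overrides = json.dumps({"apiVersion": "v1", "spec": {"nodeName": node}})
    result = subprocess.run(
        [
            "kubectl", "run", pod_name,
            "--image=busybox:1.36", "--restart=Never", "--rm", "-i",
            f"--overrides={overrides}",
            "--", "nslookup", TARGET,
        ],
        capture_output=True, text=True,
    )
    # Treat a non-zero exit (lookup timeout, NXDOMAIN, scheduling failure) as a failure.
    return result.returncode == 0

if __name__ == "__main__":
    for node in node_names():
        print(f"{node}: {'DNS OK' if dns_ok_on(node) else 'DNS FAILING'}")
```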
The root cause of the September 6th, 2023 incident was a change to the Kubernetes CNI configuration on our service cluster. This change inadvertently blocked DNS requests on specific nodes, disrupting DNS resolution. The incident escalated when our attempt to fix the misconfiguration by removing the problematic CNI change left the entire console unresponsive and stuck in a restarting state. The triggering event for the CNI configuration change was an automated script run by one of our engineers while connected to the service cluster. The incident underscores the need for stringent control and oversight of configuration changes in critical production environments to prevent similar disruptions.
Preventative Actions for Future Incidents
Implement a more rigorous and automated configuration update process for our production service cluster.
This measure aims to minimize the risk of similar incidents by strengthening oversight and control over configuration changes in our critical infrastructure; a context guard along the lines of the sketch below is one possible building block.
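As one illustration of what a stricter process could include, the hypothetical wrapper below refuses to run cluster-changing kubectl commands against a production context unless the operator explicitly acknowledges it. The context name and the acknowledgement flag are placeholders, not part of our current tooling.

```python
#!/usr/bin/env python3
"""Hypothetical pre-apply guard: refuse to run cluster-changing kubectl
commands while pointed at a production context unless the change is
explicitly acknowledged. Context names and the flag are illustrative."""

import subprocess
import sys

PRODUCTION_CONTEXTS = {"prod-service-cluster"}  # assumed name; adjust to the real contexts

def current_context():
    out = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def main():
    args = sys.argv[1:]
    acknowledged = "--i-know-this-is-prod" in args
    args = [a for a in args if a != "--i-know-this-is-prod"]

    ctx = current_context()
    if ctx in PRODUCTION_CONTEXTS and not acknowledged:
        sys.exit(
            f"Refusing to run against production context '{ctx}'. "
            "Re-run with --i-know-this-is-prod once the change has been reviewed."
        )

    # Forward the remaining arguments to kubectl, e.g. `apply -f cni-config.yaml`.
    sys.exit(subprocess.run(["kubectl", *args]).returncode)

if __name__ == "__main__":
    main()
```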
Initiative: split the control plane
A dedicated per-region control plane will help narrow the impact of a misconfiguration or infrastructure issue to a single region.
Investigate: migrating Neon services to a new EKS cluster
RBAC for production Kubernetes
Add a role for developer users with restricted access to the production Kubernetes cluster (a minimal read-only role is sketched below).
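A minimal sketch of such a restricted role, using the official kubernetes Python client, is shown below. The namespace, role name, resource list, and the "developers" group mentioned in the binding comment are assumptions for illustration rather than the final policy.

```python
#!/usr/bin/env python3
"""Sketch of a restricted, read-only developer Role created via the official
kubernetes Python client. Namespace, role name, and resources are assumed."""

from kubernetes import client, config

NAMESPACE = "default"            # assumed; production namespaces will differ
ROLE_NAME = "developer-read-only"

def main():
    config.load_kube_config()    # uses the operator's local kubeconfig
    rbac = client.RbacAuthorizationV1Api()

    role = client.V1Role(
        metadata=client.V1ObjectMeta(name=ROLE_NAME, namespace=NAMESPACE),
        rules=[
            # Read-only access to common workload resources; no create/update/delete.
            client.V1PolicyRule(
                api_groups=["", "apps"],
                resources=["pods", "pods/log", "services", "deployments", "replicasets"],
                verbs=["get", "list", "watch"],
            ),
        ],
    )
    rbac.create_namespaced_role(namespace=NAMESPACE, body=role)

    # Binding developers to the role is left to something like:
    #   kubectl create rolebinding developer-read-only \
    #       --role=developer-read-only --group=developers -n default

if __name__ == "__main__":
    main()
```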