Postmortem
Summary
On September 6th, 2023, at 11:18:00 UTC, an incident began when neon-console-api-1, one of the executors, encountered an issue and stopped processing new requests. The disruption significantly impacted our services: more than 150 projects (around 200 at peak) became stuck with a "context deadline exceeded" error.
Initially, we responded by promptly restarting neon-console-api-1, which temporarily alleviated the issue. However, it resurfaced shortly afterwards as DNS lookup failures, which made diagnosis considerably harder.
Following extensive debugging efforts, we identified a recent change in the Kubernetes CNI configuration that was obstructing DNS requests on specific nodes.
While addressing the misconfiguration, we removed the problematic CNI change, which inadvertently shut down the console system-wide and left it stuck in a restarting state. To restore service, we had to forcibly terminate the affected pods.
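For reference, below is a minimal sketch of the kind of forced pod deletion used during recovery, written with the Python kubernetes client. The namespace and label selector are placeholders for illustration, not the exact values used during the incident.

```python
# Minimal sketch: force-delete pods stuck in a restarting/terminating state.
# The namespace and label selector below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

NAMESPACE = "neon-console"               # hypothetical namespace
LABEL_SELECTOR = "app=neon-console-api"  # hypothetical label

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    print(f"force-deleting {pod.metadata.name}")
    v1.delete_namespaced_pod(
        name=pod.metadata.name,
        namespace=NAMESPACE,
        grace_period_seconds=0,  # skip graceful shutdown
    )
```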
A subsequent review of the audit logs revealed that one of our engineers had run an automated script while connected to the service cluster, which installed the CNI configuration.
📆 Incident Timeline
| Time | Event |
| --- | --- |
| 2023-09-06 | |
| 11:19:00 UTC | PagerDuty alert triggered |
| 11:21:00 UTC | Arthur Petukhovsky set the status: Investigating |
| 11:25:00 UTC | Inspection found an issue with the neon-console-api-1 pod; 150+ projects were stuck on executor id neon-console-api-1 with the error "context deadline exceeded" |
| 11:30:00 UTC | Status: Investigating → Fixing: restarted neon-console-api-1 and all operations returned to normal - Alexey Kondratov |
| 11:31:00 UTC | Status: Fixing → Monitoring: all operations are normal after the neon-console-api restart and we are monitoring further |
| 11:41:19 UTC | Arthur Petukhovsky updated the summary: "I see that problems started around 2023-09-06T11:05:56.198Z, and projects in different regions were affected. My active projects had no errors in this period, so I can assume that active projects were not affected, and this issue was only with executing new operations. There were around 200 stuck projects at peak, but they all were quickly resolved after neon-console-api-1 was restarted." |
| 11:49:00 UTC | Anton Chaporgin: "Have we been rate limited by DNS?" |
| 12:31:09 UTC | Lassi Pölönen & Andrey Taranik started investigating the AWS DNS quota limit and CoreDNS |
| 12:32:00 UTC | Status: Monitoring → Fixing: the service cluster still has a DNS resolution issue and we are looking into it. (Identifying the root cause of this DNS issue took an extended period.) |
| 13:16:03 UTC | Update: we have identified and fixed an issue caused by internal DNS lookup failures (CoreDNS service) |
| 13:22:51 UTC | Found a manual Kubernetes CNI configuration that blocked DNS requests internally and affected only a few nodes |
| 13:29:38 UTC | Status: Monitoring → Fixing: a Kubernetes CNI misconfiguration on the service cluster caused disruptions to other service objects. We're working to resolve this issue, but it may take some time |
| 13:55:00 UTC | The console was stuck in a restarting state for 10+ minutes while the issue was being fixed |
| 13:57:52 UTC | Update: Status: Fixing → Monitoring: after addressing the Kubernetes network configuration issue and restarting the impacted services, everything has returned to normal operation |
| 14:15:28 UTC | Update: Status: Monitoring → Documenting |
| 14:16:00 UTC | Rahul Patil updated the summary: Neon Console API service affected due to an internal DNS lookup issue |
Investigation & Root Cause Analysis
Contributors
(not as ‘root causes’ but as a collection of things that had to be true for this incident to take place)
- 150+ projects stuck and visible in the Console
- PagerDuty alert triggered for the stuck projects
- Most affected projects were stuck with executor id neon-console-api-1
- Restarting neon-console-api-1 resolved the immediate issue, but the logs still showed DNS lookup timeouts (see the sketch after this list)
- Other services on the service cluster (vmagent, teleport, external-dns) were impacted by the same DNS lookup errors
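To make the DNS symptom above concrete, an in-pod resolution check along these lines can reproduce the lookup failures. This is a minimal sketch; the hostnames are assumptions for illustration.

```python
# Minimal sketch of an in-cluster DNS health check, mirroring the lookup
# failures observed during the incident. How long a failing lookup blocks is
# governed by the pod's resolv.conf options, not by this script.
import socket
import time

NAMES = [
    "kubernetes.default.svc.cluster.local",    # in-cluster API server service
    "kube-dns.kube-system.svc.cluster.local",  # cluster DNS service
]

for name in NAMES:
    start = time.monotonic()
    try:
        socket.getaddrinfo(name, None)
        print(f"OK   {name} resolved in {time.monotonic() - start:.2f}s")
    except socket.gaierror as exc:
        print(f"FAIL {name} after {time.monotonic() - start:.2f}s: {exc}")
```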
Root Cause Analysis
The root cause of the September 6th, 2023 incident was a change to the Kubernetes CNI configuration on our service cluster. The change blocked DNS requests on specific nodes, disrupting DNS resolution for the services running there. The incident escalated when our attempt to fix the misconfiguration by removing the problematic CNI change left the entire console unresponsive and stuck in a restarting state. The configuration change itself was triggered by an automated script that one of our engineers ran while connected to the service cluster. The incident highlights the need for stricter control and oversight of configuration changes in critical production environments.
Mitigators
Preventative Actions for Future Incidents
Implement a more rigorous and automated configuration update process for our production service cluster.
This measure aims to minimize the risk of similar incidents by strengthening oversight and control over configuration changes in our critical infrastructure.
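One possible shape for such a check, sketched under assumptions (the ConfigMap name, key, and baseline path are illustrative, not our actual implementation): periodically compare live cluster configuration against a version-controlled baseline and alert on drift.

```python
# Sketch of a configuration drift check: compare a live ConfigMap against a
# version-controlled baseline and flag any unreviewed change.
import hashlib
import sys

from kubernetes import client, config

BASELINE_PATH = "baselines/coredns-corefile"   # hypothetical file in the config repo
CONFIGMAP = ("kube-system", "coredns", "Corefile")  # namespace, name, key (illustrative)

def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def main() -> int:
    config.load_kube_config()
    namespace, name, key = CONFIGMAP
    live = client.CoreV1Api().read_namespaced_config_map(name, namespace).data[key]
    with open(BASELINE_PATH) as f:
        baseline = f.read()
    if digest(live) != digest(baseline):
        print(f"DRIFT: {namespace}/{name}:{key} differs from the reviewed baseline")
        return 1
    print("OK: live configuration matches the baseline")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run from CI or a cron job, a non-zero exit code can page the owning team before an unreviewed change propagates.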
Follow-ups
- Initiative: split the control plane
  - A dedicated per-region control plane will help narrow the impact of a misconfiguration or infrastructure issue to a single region.
- Investigate: migration of Neon services to a new EKS cluster
- RBAC for production Kubernetes
  - A role for developer users with restricted access to the production Kubernetes cluster (see the sketch below)
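A minimal sketch of what such a restricted developer role could look like, created with the Python kubernetes client; the namespace, role name, resources, and verbs are assumptions to illustrate the idea, not the final policy.

```python
# Sketch of a read-only developer Role, as one possible starting point for
# restricted production access. Namespace, role name, resources, and verbs
# are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="developer-readonly", namespace="neon-console"),
    rules=[
        client.V1PolicyRule(
            api_groups=["", "apps"],
            resources=["pods", "pods/log", "services", "configmaps", "deployments"],
            verbs=["get", "list", "watch"],  # read-only: no create/update/delete
        ),
    ],
)

client.RbacAuthorizationV1Api().create_namespaced_role(
    namespace="neon-console", body=role
)
```

The role would then be bound to the developer group with a RoleBinding, keeping write access to production objects behind the reviewed deployment pipeline.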