Console and API requests

99.94% uptime
Jul 2023: 100% uptime
Aug 2023: 100% uptime
Sep 2023: 99.83% uptime

Database Operations

99.92% uptime
Jul 2023: 100% uptime
Aug 2023: 100% uptime
Sep 2023: 99.76% uptime

Database Connectivity

99.92% uptime
Jul 2023: 100% uptime
Aug 2023: 100% uptime
Sep 2023: 99.76% uptime

Notice history

Sep 2023

Important Announcement: Upcoming Migration of Our Status Page
Console is down
  • Resolved

    Postmortem

    Summary

    On September 6th, 2023, at 11:18:00 UTC, an incident began when neon-console-api-1, one of the executors, encountered an issue that caused it to halt processing new requests. This disruption significantly impacted our services, leaving over 150 (of roughly 200k) projects stuck with a "context deadline exceeded" error.
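
    As background, "context deadline exceeded" is the standard error returned by Go's context package when an operation outlives its deadline. The sketch below is a minimal, hypothetical illustration of how a request bounded by context.WithTimeout surfaces this error; it is not Neon's actual code, and runOperation is an invented stand-in for a console-api call to an executor.

    ```go
    // Minimal sketch (illustrative only, not Neon's code) of how a timed-out
    // operation produces the "context deadline exceeded" error.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // runOperation is a hypothetical stand-in for a console-api call to an
    // executor. It blocks until the simulated work finishes or the context
    // deadline expires.
    func runOperation(ctx context.Context) error {
        select {
        case <-time.After(5 * time.Second): // simulated slow or stuck work
            return nil
        case <-ctx.Done():
            return ctx.Err() // context.DeadlineExceeded
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
        defer cancel()

        if err := runOperation(ctx); err != nil {
            fmt.Println("operation failed:", err) // prints: context deadline exceeded
        }
    }
    ```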

    We initially responded by promptly restarting neon-console-api-1, which temporarily alleviated the issue. However, it resurfaced shortly afterwards as DNS lookup failures, which proved difficult to diagnose.

    Following extensive debugging efforts, we identified a recent change in the Kubernetes CNI configuration that was obstructing DNS requests on specific nodes.
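
    For illustration only (this is an assumed tool, not taken from our runbook), a small probe like the one below can confirm whether in-cluster DNS lookups are failing: it resolves a couple of names with a short timeout and logs any errors. Run from pods on different nodes, for example via a DaemonSet, it would show which nodes have DNS traffic blocked. kubernetes.default.svc.cluster.local is a standard in-cluster name; example.com stands in for an external lookup forwarded through CoreDNS.

    ```go
    // Hedged sketch of a DNS health probe; names and timeouts are
    // illustrative assumptions, not values from the incident.
    package main

    import (
        "context"
        "log"
        "net"
        "time"
    )

    func main() {
        names := []string{
            "kubernetes.default.svc.cluster.local", // in-cluster service name served by CoreDNS
            "example.com",                          // external name forwarded by CoreDNS
        }

        for _, name := range names {
            ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
            addrs, err := net.DefaultResolver.LookupHost(ctx, name)
            cancel()

            if err != nil {
                // On an affected node this reports timeouts or lookup failures.
                log.Printf("DNS lookup FAILED for %s: %v", name, err)
                continue
            }
            log.Printf("DNS lookup ok for %s -> %v", name, addrs)
        }
    }
    ```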

    While addressing this misconfiguration, we removed the problematic CNI change, which inadvertently triggered a system-wide console shutdown and left it stuck in a restarting state. To restore service, we had to forcibly terminate the affected pods.

    A subsequent review of the audit logs revealed that one of our engineers had run an automated script while connected to the service cluster, which installed the CNI configuration.

    📆 Incident Timeline

    All times are on 2023-09-06.

    11:19:00 UTC - PagerDuty alert triggered
    11:21:00 UTC - Arthur Petukhovsky set the status: Investigating
    11:25:00 UTC - Inspection found an issue with the neon-console-api-1 pod; 150+ projects were stuck with executor id neon-console-api-1 and the error "context deadline exceeded"
    11:30:00 UTC - Status: Investigating → Fixing: Restarted neon-console-api-1 and all operations returned to normal - Alexey Kondratov
    11:31:00 UTC - Status: Fixing → Monitoring: All operations are normal after the neon-console-api restart and we are monitoring further.
    11:41:19 UTC - Arthur Petukhovsky updated the summary: "I see that problems started around 2023-09-06T11:05:56.198Z, and projects in different regions were affected. My active projects had no errors in this period, so I can assume that active projects were not affected and this issue was only with executing new operations. There were around 200 stuck projects at peak, but they were all quickly resolved after neon-console-api-1 was restarted."
    11:49:00 UTC - Anton Chaporgin: "Have we been rate limited by DNS?"
    12:31:09 UTC - Lassi Pölönen & Andrey Taranik started investigating the AWS DNS quota limit and CoreDNS
    12:32:00 UTC - Status: Monitoring → Fixing: The service cluster still has a DNS resolution issue and we are looking into it.
    (Identifying the root cause of this DNS issue took an extended period.)
    13:16:03 UTC - Update: We identified and fixed an issue caused by internal DNS lookup failures (the CoreDNS service).
    13:22:51 UTC - Found a manual Kubernetes CNI configuration that blocked DNS requests internally and affected only a few nodes
    13:29:38 UTC - Status: Monitoring → Fixing: A Kubernetes CNI misconfiguration on the service cluster caused disruptions to other service objects. We're working to resolve this issue, but it may take some time.
    13:55:00 UTC - The console was stuck in a restarting state for 10+ minutes while the issue was being fixed
    13:57:52 UTC - Update: Status: Fixing → Monitoring: After addressing the Kubernetes network configuration issue and restarting the impacted services, everything has returned to normal operation.
    14:15:28 UTC - Update: Status: Monitoring → Documenting
    14:16:00 UTC - Rahul Patil updated the summary: Neon Console API service affected by an internal DNS lookup issue

    Investigation, Root Cause & Analysis

    Contributors

    (not as ‘root causes’ but as a collection of things that had to be true for this incident to take place)

    • 150+ projects stuck and visible in the Console
    • PagerDuty alert triggered for the stuck projects
    • Most affected projects were stuck with executor id neon-console-api-1
    • Restarting neon-console-api-1 resolved the immediate issue, but the logs still contained DNS lookup timeout errors
    • Other services on the service cluster (vmagent, teleport, external-dns) were impacted by the same DNS lookup errors

    Root Cause Analysis

    The root cause of this incident on September 6th, 2023, was traced back to a change in the Kubernetes CNI configuration on our service cluster. This change inadvertently blocked DNS requests on specific nodes, disrupting DNS resolution. The incident escalated when an attempt to rectify the misconfiguration by removing the problematic CNI change left the entire console unresponsive and stuck in a restarting state. The CNI configuration change itself was triggered by an automated script run by one of our engineers while connected to the service cluster. This incident highlights the importance of stringent control and oversight over configuration updates in critical production environments.

    Mitigators

    Preventative Actions for Future Incidents

    Implement a more rigorous and automated configuration update process for our production service cluster.

    This measure aims to reduce the risk of similar incidents by improving oversight of and control over configuration changes in our critical infrastructure.

    Follow-ups

    • Initiative: split the control plane
      • A dedicated per-region control plane will help narrow the impact of misconfigurations or infrastructure issues to a single region.
    • Investigate: migration of Neon services to a new EKS cluster
    • RBAC for production Kubernetes
      • A developer role with restricted access to the Kubernetes cluster (see the sketch after this list)
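
    As a concrete illustration of the restricted developer role mentioned above, the sketch below uses the Kubernetes Go API types to define a namespaced, read-only Role and print its manifest. The role name, namespace, and resource list are assumptions for illustration, not the final policy.

    ```go
    // Hedged sketch of a restricted developer Role, expressed with the
    // Kubernetes Go API types; names and rules are illustrative assumptions.
    package main

    import (
        "encoding/json"
        "fmt"

        rbacv1 "k8s.io/api/rbac/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        role := rbacv1.Role{
            TypeMeta:   metav1.TypeMeta{APIVersion: "rbac.authorization.k8s.io/v1", Kind: "Role"},
            ObjectMeta: metav1.ObjectMeta{Name: "developer-restricted", Namespace: "console"},
            Rules: []rbacv1.PolicyRule{
                {
                    // Read-only access to common workload objects; no verbs that
                    // could change networking or CNI-level configuration.
                    APIGroups: []string{""},
                    Resources: []string{"pods", "pods/log", "services", "endpoints", "configmaps"},
                    Verbs:     []string{"get", "list", "watch"},
                },
            },
        }

        manifest, err := json.MarshalIndent(role, "", "  ")
        if err != nil {
            panic(err)
        }
        fmt.Println(string(manifest))
    }
    ```
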
  • Resolved

    This incident has been resolved. We'll be sharing a postmortem in the coming days, once we've gathered all the details.

  • Monitoring

    We implemented a fix and are currently monitoring the result.

  • Investigating

    We are currently investigating this incident.

Aug 2023
