Tech
Why It Took Google Hours to Fix a Glitch Its Engineers Identified in 10 Minutes
What Happened?
On June 12, 2025, a widespread Google Cloud outage disrupted more than 70 services globally, affecting core platforms such as Gmail, Drive, Calendar, and Meet, as well as major third-party services including Cloudflare, OpenAI, and Shopify.
Root Cause of the Outage
According to Google's engineers, the fault stemmed from a null-pointer error in the Service Control system. Within 10 minutes of the incident unfolding, Google's SRE team had diagnosed the root cause.
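Google has not published the exact code path, but a null-pointer failure of this kind typically means a field that was assumed to be populated arrived empty and was dereferenced without a guard. The Go sketch below is a hypothetical illustration of that pattern and of the defensive check that avoids it; the Policy type and its fields are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for a record read by a control-plane
// service. QuotaLimits may legitimately be absent.
type Policy struct {
	Name        string
	QuotaLimits *QuotaLimits
}

type QuotaLimits struct {
	RequestsPerMinute int
}

// checkQuotaUnsafe dereferences QuotaLimits without a nil check.
// A policy written with the field left blank would crash every
// replica that evaluates it.
func checkQuotaUnsafe(p *Policy) int {
	return p.QuotaLimits.RequestsPerMinute // panics if QuotaLimits is nil
}

// checkQuotaSafe guards the dereference and degrades gracefully instead.
func checkQuotaSafe(p *Policy) (int, error) {
	if p == nil || p.QuotaLimits == nil {
		return 0, errors.New("policy has no quota limits; skipping enforcement")
	}
	return p.QuotaLimits.RequestsPerMinute, nil
}

func main() {
	broken := &Policy{Name: "example-policy"} // QuotaLimits left blank

	if limit, err := checkQuotaSafe(broken); err != nil {
		fmt.Println("handled safely:", err)
	} else {
		fmt.Println("limit:", limit)
	}

	// checkQuotaUnsafe(broken) // would panic: nil pointer dereference
}
```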

Why Did the Fix Take Hours?
Despite rapid identification, full resolution wasn’t immediate due to a series of complicating factors:
- Network congestion hampered diagnostic tooling and slowed the rollout of fixes.
- System dependencies needed careful orchestration: changes could not be rolled out piecemeal without risking further instability.
- Managers prioritized a safe, staged recovery, eschewing risky shortcuts, to prevent further cascading failures (a rough sketch of this kind of staged restart appears below).
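The article does not describe Google's actual recovery tooling. As a rough illustration of why staged recovery takes time, the hypothetical Go sketch below re-enables dependent services in small waves with pauses and jitter, rather than all at once, trading speed for stability; the fleet names, wave size, and delays are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// restartInWaves re-enables services a few at a time with a pause between
// waves, so a recovering backend is not hit by every client simultaneously.
// Wave size and delays are illustrative values, not Google's settings.
func restartInWaves(services []string, waveSize int, pause time.Duration) {
	for i := 0; i < len(services); i += waveSize {
		end := i + waveSize
		if end > len(services) {
			end = len(services)
		}
		for _, s := range services[i:end] {
			// Jitter spreads restarts within a wave.
			time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
			fmt.Println("re-enabling", s)
		}
		time.Sleep(pause) // wait and verify health before the next wave
	}
}

func main() {
	fleet := []string{"auth", "quota", "billing", "storage-api", "pubsub", "iam"}
	restartInWaves(fleet, 2, 2*time.Second)
}
```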
Impact on Services & Users
- Over 70 Google Cloud services were affected globally.
- Third-party platforms relying on Google, such as Cloudflare, OpenAI, and Shopify, also experienced disruptions.
- Users faced delays accessing emails, calendars, file storage, video calls, and API services.
Google’s Response & Apology
Google issued a public apology, acknowledging the severity of the outage and the inconvenience caused to millions of users and enterprise customers. An internal postmortem is now underway to prevent a repeat.
Key Pointers from SRE Best Practices
Guided by its Site Reliability Engineering (SRE) practice, Google highlights several lessons that are especially relevant here:
- Appropriate mitigation risk levels: Don't let remediation create larger issues.
- Thorough testing of recovery procedures: Tools that make large-scale changes must be tested before an emergency, not during one.
- Canary releases: Roll out changes gradually to detect issues before they become widespread (see the sketch after this list).
- Communication redundancy: Have independent channels available when the affected services themselves are down.
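To make the canary idea concrete, here is a minimal Go sketch of a progressive rollout: a small share of traffic goes to the new version, its error rate is checked against a threshold, and the rollout is widened or rolled back accordingly. The traffic splits, threshold, and simulated error rate are assumptions for illustration, not Google's actual values or tooling.

```go
package main

import (
	"fmt"
	"math/rand"
)

// canaryStep routes `percent` of simulated traffic to the new version and
// returns the error rate observed on that canary slice, plus whether it is
// healthy enough to continue the rollout.
func canaryStep(percent int, newVersionErrRate float64) (float64, bool) {
	const maxErrRate = 0.01 // roll back above 1% errors on the canary
	canaryReqs, canaryErrs := 0, 0
	for i := 0; i < 10000; i++ {
		if rand.Intn(100) < percent { // this request hits the new version
			canaryReqs++
			if rand.Float64() < newVersionErrRate {
				canaryErrs++
			}
		}
	}
	if canaryReqs == 0 {
		return 0, true
	}
	rate := float64(canaryErrs) / float64(canaryReqs)
	return rate, rate <= maxErrRate
}

func main() {
	// Gradually widen exposure: 1% -> 5% -> 25% -> 100%.
	for _, pct := range []int{1, 5, 25, 100} {
		rate, ok := canaryStep(pct, 0.002) // assume a mostly healthy new version
		if !ok {
			fmt.Printf("rolling back at %d%% (error rate %.3f)\n", pct, rate)
			return
		}
		fmt.Printf("promoted past %d%% (error rate %.3f)\n", pct, rate)
	}
	fmt.Println("rollout complete")
}
```

The point of the gradual split is that a defect like a missing nil check surfaces while it can only harm a small fraction of traffic, leaving a cheap rollback path.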
These principles align with earlier incident investigations, such as the 2019 GCLB outage, reinforcing the core tenets of reliability engineering.