Tech
Why It Took Google Hours to Fix a Glitch Its Engineers Identified in 10 Minutes
What Happened?
On June 12, 2025, a widespread Google Cloud outage disrupted more than 70 services globally, affecting core platforms such as Gmail, Drive, Calendar, and Meet, as well as major third-party services including Cloudflare, OpenAI, and Shopify.
Root Cause of the Outage
According to Google's engineers, the fault stemmed from a null-pointer error in the Service Control system. Within 10 minutes of the incident unfolding, Google's SRE team had diagnosed the root cause.
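Google has not published the exact code path, but a null-pointer failure of this kind typically means a field that was assumed to be populated arrived empty and was dereferenced without a guard. The Go sketch below is a hypothetical illustration of that pattern and of the defensive check that avoids it; the Policy type and its fields are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for a record read by a control-plane
// service. QuotaLimits may legitimately be absent.
type Policy struct {
	Name        string
	QuotaLimits *QuotaLimits
}

type QuotaLimits struct {
	RequestsPerMinute int
}

// checkQuotaUnsafe dereferences QuotaLimits without a nil check.
// A policy written with the field left blank would crash every
// replica that evaluates it.
func checkQuotaUnsafe(p *Policy) int {
	return p.QuotaLimits.RequestsPerMinute // panics if QuotaLimits is nil
}

// checkQuotaSafe guards the dereference and degrades gracefully instead.
func checkQuotaSafe(p *Policy) (int, error) {
	if p == nil || p.QuotaLimits == nil {
		return 0, errors.New("policy has no quota limits; skipping enforcement")
	}
	return p.QuotaLimits.RequestsPerMinute, nil
}

func main() {
	broken := &Policy{Name: "example-policy"} // QuotaLimits left blank

	if limit, err := checkQuotaSafe(broken); err != nil {
		fmt.Println("handled safely:", err)
	} else {
		fmt.Println("limit:", limit)
	}

	// checkQuotaUnsafe(broken) // would panic: nil pointer dereference
}
```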

Why Did the Fix Take Hours?
Despite rapid identification, full resolution wasn’t immediate due to a series of complicating factors:
- Network congestion hampered diagnostic tooling and slowed the rollout of fixes.
- System dependencies needed careful orchestration: changes could not be rolled out piecemeal without risking further instability.
- Managers prioritized a safe, staged recovery, eschewing risky shortcuts, to prevent further cascading failures (a rough sketch of this kind of staged restart appears below).
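The article does not describe Google's actual recovery tooling. As a rough illustration of why staged recovery takes time, the hypothetical Go sketch below re-enables dependent services in small waves with pauses and jitter, rather than all at once, trading speed for stability; the fleet names, wave size, and delays are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// restartInWaves re-enables services a few at a time with a pause between
// waves, so a recovering backend is not hit by every client simultaneously.
// Wave size and delays are illustrative values, not Google's settings.
func restartInWaves(services []string, waveSize int, pause time.Duration) {
	for i := 0; i < len(services); i += waveSize {
		end := i + waveSize
		if end > len(services) {
			end = len(services)
		}
		for _, s := range services[i:end] {
			// Jitter spreads restarts within a wave.
			time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
			fmt.Println("re-enabling", s)
		}
		time.Sleep(pause) // wait and verify health before the next wave
	}
}

func main() {
	fleet := []string{"auth", "quota", "billing", "storage-api", "pubsub", "iam"}
	restartInWaves(fleet, 2, 2*time.Second)
}
```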
Impact on Services & Users
- Over 70 Google Cloud services were affected globally.
- Third-party platforms relying on Google, such as Cloudflare, OpenAI, and Shopify, also experienced disruptions.
- Users faced delays accessing emails, calendars, file storage, video calls, and API services.
Google’s Response & Apology
Google issued a public apology, acknowledging the severity of the outage and the inconvenience caused to millions of users and enterprise customers. An internal postmortem is now underway to prevent a repeat.
Key Pointers from SRE Best Practices
Guided by its Site Reliability Engineering (SRE) practice, Google highlights several lessons that are especially relevant here:
- Appropriate mitigation risk levels: Don't let remediation create larger issues.
- Thorough testing of recovery procedures: Tools that make large-scale changes must be tested before an emergency, not during one.
- Canary releases: Roll out changes gradually to detect issues before they become widespread (see the sketch after this list).
- Communication redundancy: Have independent channels available when the affected services themselves are down.
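To make the canary idea concrete, here is a minimal Go sketch of a progressive rollout: a small share of traffic goes to the new version, its error rate is checked against a threshold, and the rollout is widened or rolled back accordingly. The traffic splits, threshold, and simulated error rate are assumptions for illustration, not Google's actual values or tooling.

```go
package main

import (
	"fmt"
	"math/rand"
)

// canaryStep routes `percent` of simulated traffic to the new version and
// returns the error rate observed on that canary slice, plus whether it is
// healthy enough to continue the rollout.
func canaryStep(percent int, newVersionErrRate float64) (float64, bool) {
	const maxErrRate = 0.01 // roll back above 1% errors on the canary
	canaryReqs, canaryErrs := 0, 0
	for i := 0; i < 10000; i++ {
		if rand.Intn(100) < percent { // this request hits the new version
			canaryReqs++
			if rand.Float64() < newVersionErrRate {
				canaryErrs++
			}
		}
	}
	if canaryReqs == 0 {
		return 0, true
	}
	rate := float64(canaryErrs) / float64(canaryReqs)
	return rate, rate <= maxErrRate
}

func main() {
	// Gradually widen exposure: 1% -> 5% -> 25% -> 100%.
	for _, pct := range []int{1, 5, 25, 100} {
		rate, ok := canaryStep(pct, 0.002) // assume a mostly healthy new version
		if !ok {
			fmt.Printf("rolling back at %d%% (error rate %.3f)\n", pct, rate)
			return
		}
		fmt.Printf("promoted past %d%% (error rate %.3f)\n", pct, rate)
	}
	fmt.Println("rollout complete")
}
```

The point of the gradual split is that a defect like a missing nil check surfaces while it can only harm a small fraction of traffic, leaving a cheap rollback path.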
These principles align with earlier incident investigations, such as the 2019 GCLB outage, reinforcing the core tenets of reliability engineering.