ADOC - Multiple Features Experiencing Downtime

Postmortem

Root Cause Analysis

Catalog Service Migration Failure During 4.2.0 Release

Date of Incident: March 27, 2025 09:29 UTC

Resolved: March 27, 2025 10:27 UTC

Duration: 58 minutes

Severity: P2

Affected Services/Functionality

  • Catalog Service
  • Data Reliability Features:

    • Policy Listing Page
    • Asset Discovery Page

Incident Summary

During the 4.2.0 release deployment, a database migration failure in the Catalog Service caused partial system degradation. The issue stemmed from a schema validation mismatch between the database layer and the application ORM, specifically around tenant_id field length validation.

Impact:

  • Intermittent failures in Data Reliability features
  • Inconsistent UI rendering
  • Mixed-version service communication issues

Root Cause Analysis

Primary Cause: Schema Validation Mismatch

Database vs. ORM Inconsistency:

Layer        Field      Constraint
Database     tenant_id  VARCHAR(64)
Exposed ORM  tenant_id  @Column(length=34)

*The migration failed when processing a tenant_id with 37 characters

Technical Breakdown:

  1. Database Schema: Allows 64-character tenant_ids
  2. ORM Entity:

     @Table(name = "tenants")
     class Tenant : IntIdTable() {
         val tenant_id = varchar("tenant_id", 34) // Constraint
     }

  3. Migration attempted to process existing 35-character ID
  4. ORM validation rejected the value before DB interaction

The migration failed when encountering a tenant_id exceeding 34 characters.
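The mismatch can be reproduced in a few lines. The following is a minimal sketch in plain Kotlin, not the Catalog Service's actual code: `validateForOrm` and the two constants are hypothetical stand-ins for the ORM-side length check and the two limits named above, and the 35-character ID is an illustrative value.

```kotlin
// Hypothetical stand-ins for the two limits described in this postmortem.
const val ORM_MAX_LENGTH = 34 // length declared in the ORM entity
const val DB_MAX_LENGTH = 64  // length allowed by the VARCHAR(64) column

// Mimics client-side validation that runs before any SQL reaches the database.
fun validateForOrm(tenantId: String): Result<String> =
    if (tenantId.length <= ORM_MAX_LENGTH) {
        Result.success(tenantId)
    } else {
        Result.failure(
            IllegalArgumentException(
                "tenant_id length ${tenantId.length} exceeds ORM limit $ORM_MAX_LENGTH"
            )
        )
    }

fun main() {
    val tenantId = "t".repeat(35) // illustrative 35-character ID

    // The database column would accept the value...
    println(tenantId.length <= DB_MAX_LENGTH)   // true
    // ...but the ORM-side check rejects it first, which aborts the migration.
    println(validateForOrm(tenantId).isFailure) // true
}
```

The point of the sketch is that the value is valid for the database yet never reaches it, so the failure surfaces only when real data passes through the ORM.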

Failure Chain

  1. Migration script aborted due to ORM validation
  2. Catalog Service pods failed initialization
  3. Kubernetes maintained previous version pods
  4. Version mismatch caused API inconsistencies

Detailed Incident Timeline

Investigation Phase

Time (UTC)  Duration  Action
09:29       -         Deployment failure detected
09:32       3 mins    Team mobilisation
09:37       5 mins    Root cause identified

Hotfix Development

Time (UTC)  Duration  Action
09:40       -         Hotfix development started
09:45       5 mins    Core fix implemented
09:50       5 mins    Validation logic added
09:52       2 mins    Unit tests completed

Build & Deployment

Time (UTC)  Duration  Action
09:55       -         CI pipeline triggered
09:58       3 mins    Build artifacts ready
10:00       2 mins    Staging deployment
10:10       10 mins   Staging verification
10:15       5 mins    Production rollout
10:25       10 mins   Full deployment
10:27       2 mins    Systems normal

Resolution

Corrective Actions:

  1. Deployed hotfix with:

    1. Updated ORM constraints (64 chars)
    2. Additional migration validation
  2. Verified all tenant_id values in production

  3. Validated cross-service compatibility
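The additional migration validation could take roughly the following shape. This is a sketch under assumptions, not the deployed hotfix: `LengthViolation` and `findLengthViolations` are hypothetical names, and the sample IDs are made up (in practice they would be read from the tenants table).

```kotlin
// Hypothetical record of one tenant_id that would be rejected by a length limit.
data class LengthViolation(val tenantId: String, val length: Int)

// Returns every tenant_id longer than maxLength, so the migration can be
// blocked up front instead of aborting partway through.
fun findLengthViolations(tenantIds: Iterable<String>, maxLength: Int): List<LengthViolation> =
    tenantIds
        .filter { it.length > maxLength }
        .map { LengthViolation(it, it.length) }

fun main() {
    // Made-up sample data for illustration only.
    val tenantIds = listOf("acme", "x".repeat(35), "y".repeat(64))

    // Compare the old ORM limit (34) with the corrected one (64).
    for (limit in listOf(34, 64)) {
        val violations = findLengthViolations(tenantIds, limit)
        if (violations.isEmpty()) {
            println("limit $limit: safe to migrate")
        } else {
            println("limit $limit: ${violations.size} tenant_id value(s) too long, migration blocked")
        }
    }
}
```

Running the check against both limits shows why the hotfix works: the same data that blocks the migration at 34 characters passes at 64.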

Preventive Measures

Action Item                   Owner
Schema validation pre-checks  Data Eng
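One possible form for such a pre-check is a comparison of the column lengths declared on each side, run before deployment. The sketch below is illustrative only: `ColumnSpec` and `findMismatches` are hypothetical names, and the two lists are hard-coded here, whereas a real check would presumably load the database side from schema metadata (for example `information_schema.columns`) and the ORM side from the entity definitions.

```kotlin
// Hypothetical description of one column's length constraint.
data class ColumnSpec(val table: String, val column: String, val maxLength: Int)

// Returns a readable line for each column whose database and ORM lengths differ.
// Columns that exist on only one side are ignored in this sketch.
fun findMismatches(db: List<ColumnSpec>, orm: List<ColumnSpec>): List<String> {
    val ormByKey = orm.associateBy { it.table to it.column }
    return db.mapNotNull { dbCol ->
        val ormCol = ormByKey[dbCol.table to dbCol.column] ?: return@mapNotNull null
        if (ormCol.maxLength != dbCol.maxLength) {
            "${dbCol.table}.${dbCol.column}: database=${dbCol.maxLength}, orm=${ormCol.maxLength}"
        } else {
            null
        }
    }
}

fun main() {
    // The two constraints reported in this postmortem.
    val db = listOf(ColumnSpec("tenants", "tenant_id", 64))
    val orm = listOf(ColumnSpec("tenants", "tenant_id", 34))

    findMismatches(db, orm).forEach(::println)
    // prints: tenants.tenant_id: database=64, orm=34
}
```

A check like this catches the inconsistency from the declarations alone, without depending on whether test data happens to contain a long enough tenant_id.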

Key Learnings

Testing Gap: Migrations need to be tested against real production data.

Posted Mar 29, 2025 - 14:45 UTC

Resolved

This incident has been resolved.
Posted Mar 27, 2025 - 10:27 UTC

Update

We are continuing to monitor for any further issues.
Posted Mar 27, 2025 - 09:47 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 27, 2025 - 09:37 UTC

Identified

The fix is being implemented. ETA: 3:45 PM IST
Posted Mar 27, 2025 - 09:32 UTC

Update

We are continuing to investigate this issue.
Posted Mar 27, 2025 - 09:31 UTC

Investigating

We are currently investigating this issue.
Posted Mar 27, 2025 - 09:29 UTC
This incident affected: Data Reliability and Reporting.