Inbound messages delayed
Incident Report for MXGuardian
Postmortem

This postmortem addresses the delays to inbound email messages on April 16, during which some messages were queued for up to 45 minutes. The root cause was twofold: first, email volume spiked sharply at the top of every hour, faster than our auto-scaling infrastructure could add capacity; second, high memory pressure on our content filtering servers caused them to become unresponsive and drop out of service.

Timeline:

  • Incident Detection: The issue was first detected on April 16, when our monitoring systems reported higher-than-usual latency in message processing.
  • Initial Response: Our operations team initially responded by manually scaling up resources, which provided temporary relief but did not fully mitigate the issue.
  • Root Cause Analysis: Further investigation revealed the specific challenges related to auto-scaling response times and memory pressure on content filtering servers.
  • Resolution Implementation: Changes were implemented progressively over the following days, with the final adjustments completed on April 18.

Contributing Factors:

  1. Email Volume Peaks: Email traffic rose sharply at the start of each hour, most likely due to automated send schedules, and these spikes ramped up faster than our auto-scaling could add capacity.
  2. Memory Pressure: Content filtering servers were already operating close to their memory limits; under peak volume they exhausted the remaining headroom and crashed or became unresponsive.

Resolution and Recovery:

  1. Server Upgrade: All content filtering servers were upgraded from c5a.xlarge to c5ad.xlarge instances, which include high-bandwidth, low-latency instance store volumes. The instance store on each server is now partitioned into additional swap space, preventing crashes when memory runs short, and a working directory that speeds up email processing (see the provisioning sketch after this list).
  2. Auto-Scaling Logic Update: We revised our auto-scaling logic to incorporate time-of-day awareness, so the system scales up resources ahead of known traffic spikes instead of reacting after queues build (see the automation sketch below).
  3. Enhanced Monitoring: Additional monitoring metrics and alerts were established so we are notified promptly when servers near memory capacity, allowing for quicker intervention (also covered in the automation sketch below).
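
For reference, the sketch below shows roughly how the instance store is split into swap and a working directory. It is a simplified provisioning script rather than our exact production tooling: the device name, partition sizes, and mount point are illustrative assumptions.

    # Provisioning sketch (run as root on a c5ad content filtering node).
    # Device name, sizes, and paths are illustrative assumptions.
    import os
    import subprocess

    DEVICE = "/dev/nvme1n1"          # local instance-store SSD (assumed device name)
    SWAP_PART = DEVICE + "p1"        # first partition -> swap
    WORK_PART = DEVICE + "p2"        # second partition -> scratch space
    WORK_DIR = "/var/spool/filter"   # hypothetical working directory for the filter

    def run(cmd):
        """Run a system command, echoing it and failing loudly on error."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Partition the device: 16 GiB of swap, remainder for the working directory.
    run(["parted", "--script", DEVICE,
         "mklabel", "gpt",
         "mkpart", "swap", "linux-swap", "1MiB", "16GiB",
         "mkpart", "work", "ext4", "16GiB", "100%"])

    # Enable swap so memory spikes spill to the local SSD instead of
    # making the filter process unresponsive.
    run(["mkswap", SWAP_PART])
    run(["swapon", SWAP_PART])

    # Format and mount the remainder as a fast scratch area for message processing.
    run(["mkfs.ext4", "-F", WORK_PART])
    os.makedirs(WORK_DIR, exist_ok=True)
    run(["mount", WORK_PART, WORK_DIR])

Because the instance store is ephemeral, both the swap partition and the working directory are recreated whenever an instance is replaced; this is acceptable since neither holds durable data.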

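The automation changes in items 2 and 3 can be summarized in a short boto3 sketch. The group name, schedule, capacity numbers, alarm threshold, and SNS topic below are illustrative assumptions, not our exact production values.

    # Automation sketch using boto3; names, numbers, and ARNs are assumptions.
    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # 2. Time-of-day awareness: raise the baseline shortly before the top of each
    #    hour so capacity is already in place when the hourly spike arrives.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="content-filter-asg",   # assumed group name
        ScheduledActionName="pre-scale-hourly-peak",
        Recurrence="50 * * * *",                     # cron (UTC): ten minutes before each hour
        MinSize=8,
        MaxSize=24,
        DesiredCapacity=16,
    )

    # 3. Enhanced monitoring: alert when the filter group's average memory use
    #    (published by the CloudWatch agent as CWAgent/mem_used_percent)
    #    stays above 90% for three consecutive minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="content-filter-memory-high",
        Namespace="CWAgent",
        MetricName="mem_used_percent",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "content-filter-asg"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,
        Threshold=90.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # assumed topic
    )

The scheduled action raises the baseline a few minutes before the top of each hour while normal reactive scaling still handles unexpected load, and the alarm gives us an early warning before a filter node reaches the point of becoming unresponsive.
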
Conclusion:

We believe the steps taken have resolved the immediate issue and have made our system more robust for handling similar challenges in the future. We sincerely apologize for any inconvenience caused by the delay in email delivery and appreciate your patience as we continue to grow.

Posted Apr 18, 2024 - 19:52 UTC

Resolved
Queue times are back to normal now. We will follow up with a postmortem that provides more details on what happened and how we will prevent this going forward. Thank you for your patience.
Posted Apr 16, 2024 - 15:07 UTC
Investigating
We are experiencing an issue that is causing inbound messages to be delayed by up to 45 minutes in some cases. We are expanding our capacity to handle the additional load.
Posted Apr 16, 2024 - 13:31 UTC
This incident affected: Email Security Gateway (Inbound Queue).