Summary:
Both HTTPTrap brokers which support API metric submission hung while waiting on a zombie child to die, and were unable to restart themselves, or to continue to service traffic. This caused all incoming requests to get returned a 502 error from the load balancer.
One of the HTTPTrap brokers stopped serving traffic at 2019-08-29 15:21:38 UTC. At this time notifications were fired via Circonus and issued to our internal alerting service. At 2019-08-29 19:32:52 UTC the second HTTPTrap broker stopped serving traffic, and also generating alerts and pushing notifications via our internal alerting service.
During this window the on-call rotation was assigned to non-SRE staff. The associated staff did not follow internal notification processes, which resulted in an inability to receive the generated notifications.
At 2019-08-30 13:10:00 UTC the operations staff noticed the generated 502s and notified the on-call rotation of the problem. At 2019-08-30 13:17:35 UTC the first HTTPTrap broker was returned to service, at which point the service stopped producing 502s and started to process current and queued metric submission submissions.
Impact: