API-based metric submission down
Incident Report for Circonus
Postmortem

Summary:

Both HTTPTrap brokers that support API metric submission hung while waiting on a zombie child process to die. They were unable to restart themselves or to continue servicing traffic, so the load balancer returned a 502 error for all incoming requests.
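The hang described above is the kind of failure a bounded wait is meant to avoid. The following is a minimal sketch only, not the broker's actual code: `stop_child` and `grace_seconds` are hypothetical names, and it assumes the parent manages its child through Python's `subprocess` module. It asks the child to exit, waits a limited time, and escalates to a forced kill rather than blocking indefinitely.

```python
import subprocess


def stop_child(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Stop a child process without ever waiting on it indefinitely.

    Hypothetical helper for illustration; returns the child's exit code.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        # Bounded wait: raises TimeoutExpired instead of hanging forever.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # forced stop (SIGKILL on POSIX)
        # Reap the child so it does not linger as a zombie.
        return proc.wait()
```

The grace period is a trade-off: too short and a healthy child is killed mid-shutdown, too long and the parent stays unavailable while it waits.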

One of the HTTPTrap brokers stopped serving traffic at 2019-08-29 15:21:38 UTC. At that time notifications were fired via Circonus and issued to our internal alerting service. At 2019-08-29 19:32:52 UTC the second HTTPTrap broker stopped serving traffic, which also generated alerts and pushed notifications via our internal alerting service.

During this window the on-call rotation was assigned to non-SRE staff. The assigned staff did not follow internal notification processes, and as a result did not receive the generated notifications.

At 2019-08-30 13:10:00 UTC the operations staff noticed the generated 502s and notified the on-call rotation of the problem. At 2019-08-30 13:17:35 UTC the first HTTPTrap broker was returned to service, at which point the service stopped producing 502s and started to process current and queued metric submissions.

Impact:

  • Customers that queued their data were able to recover the queued data by resending it once the broker returned to service.
  • Customers who were running their own client-side HTTPTrap brokers were unaffected.
  • Customers submitting without client-side queuing had their requests dropped for approximately 13 hours, and did not get their associated metrics into the system.
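The difference between the first and last cases above comes down to client-side queuing. As a rough sketch under stated assumptions (this is not a Circonus client; `QueuedSubmitter`, `send`, and `max_queued` are hypothetical names, and `send` stands in for whatever function performs the HTTP submission and returns a status code), a submitter can hold payloads in memory and only discard them after a 2xx response:

```python
from collections import deque
from typing import Callable, Deque


class QueuedSubmitter:
    """Hold metric payloads until the endpoint accepts them.

    `send` is a caller-supplied function (an assumption of this sketch)
    that submits one payload and returns the HTTP status code.
    """

    def __init__(self, send: Callable[[bytes], int], max_queued: int = 10000):
        self._send = send
        # Bounded queue: when full, the oldest payload is dropped.
        self._pending: Deque[bytes] = deque(maxlen=max_queued)

    def submit(self, payload: bytes) -> None:
        """Queue a payload, then try to drain the queue in order."""
        self._pending.append(payload)
        self.flush()

    def flush(self) -> int:
        """Send queued payloads oldest-first; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                status = self._send(self._pending[0])
            except OSError:
                break  # network failure: keep everything queued
            if not 200 <= status < 300:
                break  # e.g. a 502 from the load balancer: retry later
            self._pending.popleft()  # discard only after success
            delivered += 1
        return delivered

    @property
    def queued(self) -> int:
        return len(self._pending)
```

A real client would likely also call `flush` on a timer with backoff, persist the queue to disk, and drop payloads the server rejects outright (4xx) instead of retrying them; those are left out to keep the sketch short.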
Posted Aug 31, 2019 - 22:47 UTC

Resolved
This incident has been resolved.
Posted Aug 30, 2019 - 13:19 UTC
Monitoring
Underlying trap brokers return to service. Customer data ingestion returns. Queued data begins resending.
Posted Aug 30, 2019 - 13:18 UTC
Identified
Underlying service brokers unable to process metric submission.
Posted Aug 30, 2019 - 13:10 UTC
Investigating
We are currently investigating this issue.
Posted Aug 29, 2019 - 19:33 UTC
This incident affected: API (Metrics).