A Simple Bug

A simple bug bit me. Rookie mistake, all because I made some bad assumptions.

One feature I’m currently building for our online platform is the ability to send transactional emails to users.

We wrote a chapter in our book on email service. Knowing that reliable email delivery is hard, we decided to use a popular transactional email service to handle it.

During testing, we noticed that it took upwards of 20 minutes to deliver an email. The amount of delay varied.

Being an “expert” on this subject 😆, I immediately assumed that the problem was with our brand new domain. The IP reputation must have been poor. Our email itself must have been poorly drafted and attracted the wrath of various spam filters.

To debug the issue, I sent test emails to an email testing service for scoring. It was lower than expected, so I fixed it and got a 10/10. Victory! But, random delays remained.

The issue only happened in our staging environment. Calling the same email service from the local dev environment on my laptop and the delivery was instant.

So, there was something with our staging environment. It must be that the ISPs didn’t like our staging domain. I spent a few more hours tweaking and testing various settings. Nothing.

At some point, Alex Xu noticed that every time he requested a new email, one would arrive immediately, but it was old. It was as if something in the email sending chain was holding onto the most recently sent email and would only release it when a new request came in.

What the hell?

Then the 🤦 moment.

Here is the code we called to send an email:

sendEmail(email, subject…)

Here is the signature of the function:

async function sendEmail(email, subject,…)

Do you spot the problem?

I forgot the await keyword. Without it, the backend code calling the sendEmail function returned immediately. Adding await, the backend code waited until sendEmail finished before returning, and the email was delivered instantly on staging.

Why did we only see the delay on staging, but not when testing this locally?

Well, the environments are not exactly the same.

On staging, the backend code lives in a cloud function. Like most cloud function implementations, the code has access to CPU only for the duration of function execution.

In my case, without the await keyword, the cloud function returned too early and it starved sendEmail of CPU cycles to finish. But as soon as we made another request to the cloud function, sendEmail got CPU cycles to finish sending the previous email.

This could also explain why the max delay was about 20 minutes. That was probably how long it took this cloud function provider to shut down an idle cloud function due to inactivity. On shutdown, the idle cloud function flushed all background activities. That gave sendEmail CPU cycles to finish.

Looking back, I can say this is a fun little debugging session with some interesting lessons. What lessons do you pick up from this story?

Published: 2022-04-12

Back to overview