To Err is Human

Gareth Thomas
Jun 25, 2020

I was going to write this as just a status post on LinkedIn, but the more I thought about it, the more it annoyed me and the more detail I wanted to put into this rant, and it is a rant. In fact, it’s a subject that has annoyed me throughout my career, and with the advent of public APIs and cloud service integration it has become even more of a problem.

In a nutshell: why does it seem to be so hard for some software engineers to write better error handling? I suspect that in bigger companies such as Amazon this is partly the fault of product management, but it is also a failure to understand how a customer will use your product and what might make using it especially hard, or simply a matter of not caring.

The perpetrator in this case is Amazon Web Services (AWS), which, I will start off by saying, I am a huge fan of and have spent a lot of time using (going back to when it launched with just S3). The service in question is API Gateway, which, for those not familiar with it, provides a means to offer a public API that can integrate with a host of other AWS services.

In this scenario I was writing tests to check that the endpoints I was building, to verify an architecture concept, would work. At a very high level, the setup for this service is API Gateway > State Machine > Lambda. It also requires some custom permission roles in IAM to manage the interaction between the components. I mention this because debugging the error I am going to show you means digging into all of these areas, as the message does not describe the real problem.

Here is the message:

The eagle-eyed amongst you will have noticed that the issue is that the object property containing the ARN for the state machine is misspelled. At first glance, though, I did what many people would do and assumed something was wrong at one of the levels of the integration: a needed value not being passed, or some incorrect setting. After a bit of digging I knew the problem wasn’t in the state machine, as it wasn’t even being called (no log entries were visible), so it had to be in API Gateway, or so I thought. Anyway, the point is that I wasted probably 30 minutes trying to find this.

What I don’t understand is why API Gateway doesn’t appear to parse the body to see if the properties are even valid and report those errors as a first-level response. What I should see is an error message something like “1 validation error detected: unknown property stateMarchineArn” and bingo, one minute later I would’ve had it working.
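To make the point concrete, here is a minimal sketch in Python of the kind of first-level check I have in mind. It is not how API Gateway works internally; the function names and the set of accepted properties are assumptions made purely for illustration.

```python
import json

# Assumed set of properties the endpoint accepts (illustrative only).
ALLOWED_PROPERTIES = {"stateMachineArn", "input", "name"}


def validate_body(raw_body):
    """Return a list of readable validation errors; empty if the body looks fine."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return ["body is not valid JSON: " + exc.msg]
    if not isinstance(body, dict):
        return ["body must be a JSON object"]
    # Report every property name that is not in the accepted set.
    unknown = sorted(set(body) - ALLOWED_PROPERTIES)
    return ["unknown property " + key for key in unknown]


def format_errors(errors):
    """Build the single-line message returned to the caller."""
    noun = "error" if len(errors) == 1 else "errors"
    return "{} validation {} detected: {}".format(len(errors), noun, "; ".join(errors))


if __name__ == "__main__":
    # Hypothetical request body with the misspelled property name.
    request_body = '{"stateMarchineArn": "example-arn", "input": "{}"}'
    problems = validate_body(request_body)
    if problems:
        print(format_errors(problems))
```

A check like this runs before any downstream service is called, so the caller hears about the misspelled property straight away instead of a vague failure from deeper in the integration.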

If you were to add up the wasted hours software engineers spend trying to integrate APIs with poor error handling, it’s got to be costing companies tens of millions every year. But perhaps the problem is that there is no obvious “hero” when the engineer integrating an API finds it all smooth sailing: it won’t lead to more sales, because either they won’t tell anyone or no one will care. Which is a shame as well as a missed opportunity.
