Debugging a serverless app written in AWS Transcribe and AWS Lambda with IOpipe


Written by:

Hey everyone this is Mike from IO pipe I
am our senior site reliability engineer that keeps the lights on for application
infrastructure in AWS today I’m gonna take you through a lambda
troubleshooting session that happened during a customer facing robot bingo
game this is what we ran during the keynote for the AWS San Francisco summit
some background on the game we decided to have some fun with the keynote and
developed an application over the course of 24 hours or so
that would leverage the AWS lambda and the new AWS transcribed services users
would go to HTTP colon forward slash four slash robot bingo click Play game
and would receive a randomly generated bingo board based on a compilation of
key tech words used in most AWS quino’s events we had a lambda that would send
the audio data to the transcribed service and take the transcription
results and put them in an s3 bucket another lambda which was the main
customer facing portion would then take those results from s3 populate a list of
words that matched our master list and the autofill and the players boards when
a player hit bingo they would then tweet out to our robot bingo Twitter bot with
bingo and it would trigger the lambda to start game to reset the user’s board and
begin a new transcription list so the day before the event while we were doing
the development work we decided to set the lambda to check the transcribe
service once a minute my ops brain from here wanted to set some constraints in
case the lambda did anything other than that so I set up some alerts inside my
I’ll pipe project to alert me if the transform function ran more than 15
times in five minutes or it had any errors this allows for some retries in
case of any timeouts but if the function was having issues it would let me know
so the AWS keynote kicked off around 9:30 a.m. and our system was running
normally as you can see I’m scrolling through the time frame everything looks
fine this is exactly what I wanted to see during its peak we had about 200
players during the first game we had our first winner right about 1019 a.m. in
the morning that person then tweeted out to the robot bingo robot with the words
bingo and from there that we noticed all the board’s transitions successfully
into game 2 during the second game very quickly got an alert that my robot
bingo lambda had run more than 15 times in 15 minutes I logged into my aisle
pipe dashboard and clicked into the project and this is what I saw if I
scroll down here you can see that definitely this bingo dev transform is
in an alarm state to begin the troubleshooting I decided to jump into a
lower level function view what we were looking at right here is a top-level
function view so if you had multiple functions you would see them all listed
here so as I clicked in I took a look at just the pertinent information I see it
ran 36 times in the last 15 minutes the region that it’s running in the alarm
status how long ago the alarm started and if we scroll down we start seeing
some more detailed information on what’s going on with this specific lambda as
you can see here it’s starting to take off in terms of what’s going on with the
amount of functions that are being run if we take a look over on the right-hand
side though you can see that the duration of the function was fairly flat
and then as we transition into that second game we started getting spikes
and you can see that the 99th percentile is starting to get pretty far up there
and as we scroll down we can see all of the individual invocations that did run
so one of the first things I wanted to check for was to see if there was any
legitimate error traces so for I start by error here I didn’t see any in the
specific time frame that we are in so I uncheck that and I go well let’s take a
look at the long running durations because we know that the 99th percentile
the outliers were going pretty far out there so if I sort by duration you can
see this 1019 1021 right around the time that the issue started that we had some
long-running functions so if we click into the functions we can see what’s
going on so now that I click through to the individual invocation view we take a
look at all the various information that’s provided on the left-hand side
you’re gonna get all the various a double us information which would be
things like the AR n what triggered this lamda the region you know various
information like that this middle column right here is very important this is the
trigger itself so automatically we collect all of the info from any trigger
that triggers a lambda so this is our event info plugin and this provides a
very good way to understand where the information at this
is processing or using came from and so in this case its s3 you can see was a
object created put any of the IDs the bucket names these types of things and
on the right hand side at the top you can see we have the underlying container
information so we see the memory usage was this a cold start in this case yes
it was CPU usage system resources and various things like that the next
plug-in that I’m looking at this is where I instrumented my code we take a
look at the i/o pipe tracing plug-in and you can see we have four different
functions inside this lambda they’re running at different timings but you can
see this gap at the beginning that’s that’s evidence of a cold start because
that’s just the system resources in the container spinning up to fire off this
lambda so that’s indication number one if we scroll down now we see this custom
metrics field the custom metrics field is very important to me because I do a
lot of different instrumentation a lot of different types of information
responses from various AWS services and in this case what the game is running
which is the current game ID and what items were retrieved by transcribed to
mark people’s bingo boards with right off the bat this should never be empty
there should always be something returned so that was a huge clue to what
was going on and just for comparison’s sake I pulled up a function that was
running correctly and this is after we actually fixed the issue but I just
wanted to show you a comparison of what would be returned through a custom
metrics field and it came in this case the current game was three and the the
various items that came back they’re gonna be marked off on people’s bingo
boards so back to one of the non-working functions in this case this is an area
of lambda that gets kind of hard to troubleshoot sometimes it’s that case
where the function ran successfully in terms of how AWS sees it and so it’s not
going to return an error but the code logic is wrong but still successful so
those are really the hardest ones to troubleshoot so in this case because
this custom metric field right here was empty I very quickly identified that it
was a code logic problem I engaged with our lead developer Corey gave him this
information and he was very quickly able to identify an issue inside the code
however in this time for right around 10 26 we had our first
player actually tweet out that they’re bored on the second game wasn’t filling
even though the transcribe words had been picked so I needed to reach out let
him know that we had an issue and that we were very much working on it right
now so Cory did get back to me very quickly and said he had found the issue
fixed it and was getting ready to deploy so just a little code snippet I’ll show
you exactly what we fixed in the code to resolve this issue so looking at the
code snippet this is actually a PR that we sent up to fix the issue you can see
on the left that we were not iterating the game to update with the new
transcribe so we’re able to very quickly add the iterator in there with the new
game ID submit this PR get it approved push it through the CI pipeline and push
it out to the players so we sent the PR up into the infrastructure right around
10:31 right after that you can see the game resumed as it should normally we
had all of the players refresh their board via a tweet and everything went
smoothly from there on out so the time from 1026 when we got the first customer
report to the time we deployed the new function was about 5 minutes but we had
already been working on it for a short timeframe before that so the median time
to resolution was about 12 minutes or so in the end it turned out to be a runaway
function that was going to get exponentially worse unless we had the
tools to report the issue and then the team to fix the issue in the end so you
can see from here we were working fine that function starts running away we
deploy the fix everything’s fine and dandy
after that it’s definitely one of the worst issues to troubleshoot without
having tools like i/o pipe just because it’s very hard to understand what’s
going on from a code logic perspective so instrumentation of the code is
extremely important especially to me running the infrastructure I need to
look inside the black box need to understand what the code is doing when
it’s doing it how it’s doing it and so using i/o pipe really helps me in those
situations I definitely loved dogfooding our own product and I run it in our
production environment that runs 200 million invocations a month so I love it
hopefully this helps anybody else out who’s trying to troubleshoot lambdas if
you have any questions or any comments or anything like that please feel free
to reach out to me on Twitter or you can email me directly here at i/o pipe
definitely willing to help if you want to try out i/o pipe just visit the
website we have a free trial where you can get all these functions and all the
advanced plugins so everybody have a great day and thanks for stopping by and
watching my video

Leave a Reply

Your email address will not be published. Required fields are marked *