AWS Lambda Silent Crash – A Platform Failure, Not an Application Bug [pdf]

(lyons-den.com)

115 points nonfamous | 3 comments | 15 Jul 25 01:22 UTC | HN request time: 0s | source

Show context

ceejayoz ◴[15 Jul 25 02:42 UTC] No.44567373[source]▶

I saw this come up on /r/aws a few days back.

This response seemed illuminating:

https://www.reddit.com/r/aws/comments/1lxfblk/comment/n2qww9...

> Looking at (this section)[https://will-o.co/gf4St6hrhY.png], it seems like you're trying to queue up an asyncronous task and then return a response. But when a Lambda handler returns a response, that's the end of execution. You can't return an HTTP response and then do more work after that in the same execution; it's just not a capability of the platform. This is documented behavior: "Your function runs until the handler returns a response, exits, or times out". After you return the object with the message, execution will immediately stop even if other tasks had been queued up.

replies(2): >>44567415 #>>44567675 #

mcflubbins ◴[15 Jul 25 02:49 UTC] No.44567415[source]▶

>>44567373 #

If this is the case (it might very well be) I do at least find it odd that no one on AWS' side was able to explain this to them.

replies(6): >>44567446 #>>44567457 #>>44567464 #>>44567509 #>>44567669 #>>44569284 #

1. eddythompson80 ◴[15 Jul 25 03:47 UTC] No.44567669[source]▶

>>44567415 #

They did (in a typical non-fault admitting way). They didn't escalate to Lambda engineering team, then said that this is a code issue, and that they should move to EC2 or Fargate, which is the polite way of saying "you can't do that on lambda. its your issue. no refunds. try fargate."

OP seems to be fixated on wanting MicroVM logs from AWS to help them correlate their "crash", but there likely no logs support can share with them. The microVM is suspended in a way you can't really replicate or test locally. "Just don't assume you can do background processing". Also to be clear, AWS used to allow the microVM to run for a bit after the response is completed to make sure anything small like that has been done.

It's an nondeterministic part of the platform. You usually don't run into it until some library start misbehaving for one reason or another. To be clear, it does break applications and it's a common support topic. The main support response is to "move to EC2 or fargate" if it doesn't work for you. Trying to debug and diagnose the lambda code with the customer is out of support scope.

replies(1): >>44569159 #

2. jiggawatts ◴[15 Jul 25 08:33 UTC] No.44569159[source]▶

>>44567669 (TP) #

I had a similar issue with Microsoft's own Application Insights dropping logs on the floor when used inside Azure Functions (equiv. of Lambda).

Same technical cause, it uses background tasks to upload logs. If the Function exits, the logs aren't sent.

This has been fixed since, but for a while generated a lot of repeated support tickets and blog articles.

replies(1): >>44570796 #

3. eddythompson80 ◴[15 Jul 25 13:14 UTC] No.44570796[source]▶

>>44569159 #

I know for a fact that’s not true. You must have misunderstood the issue.

This is one of the main technical differences between Azure Functions and Google Cloud Run vs Lambda. Azure and GCP offer serverless as a billing model rather than an execution model precisely to avoid this issue. (Among many other benefits*)

Both in Azure and GCP you listen and handle SIGTERM (on Linux at least, Azure has a Windows offering and you use a different Windows thing there that I’m forgetting) and you can control and handle shutdown. There is no “suspend”. “Suspending nodejs” is not a thing. This is a super AWS Lambda specific behavior that is not replicable outside AWS (not easily at least)

The main thing I do is review cloud issues mid size companies run into. Most of them were startups that transitioned into a mid size company and now need to get off the “Spend Spend Spend” mindset and rain in cloud costs and services. The first thing we almost always have to untangle is their Lambdas/SF and it’s always the worst thing to ever untangle or split apart because you will forever find your code behavior differently outside of AWS. Maybe if you have the most excellent engineers working with the most excellent processes and most excellent code. But in reality all Lambda code takes complete dependency on the fact that lambda will kill your process after a timeout. Lambda will run only 1 request through your process at any given time. Lambda will “suspend” and “resume” your process. 99% of lambdas I helped move out of AWS have had a year+ trail of bugs where those services couldn’t run for any length of time before corrupting their state. They rely on the process restarting every few request to clear up a ton of security-impacting cross contamination.

* I might be biased, but I much much prefer GCR or AZF to Lambda in terms of running a service. Lambda shines if you resell it. Reselling GCR or AZF as a feature in your application is not straight forward. Reselling a lambda in your application is very very easy

↑