How do you improve serverless mean time to resolution? The difference between logging and tracing.

Early in my career there were a lot of discussions about logging. What should we log, how should we format it, and how will it affect the system? When it comes down to it, logging really covers two system domains. The first is analytics: logs are extremely useful for creating datasets you can analyze later, because they tie custom context to events, which is valuable in big data analysis. This is still a useful practice and can really help you understand your customers. The second domain is troubleshooting, which just got a lot harder now that your components are broken out into micro-services that work together to produce results.

What is the difference between logging and tracing?

When first getting into serverless, logging is critical because your server has just become a bit of a black box, and understanding what happened in a serverless system gets easier when there are more logs. The good news is that the execution of your code is managed by AWS, Azure or another provider. In AWS, this means that your code, which is actually hosted in a container, handles only a single call at a time. As a result, the usual resource concerns around memory, CPU and logging overhead are minimal. So if logging is critical for serverless, where does tracing come into play?

To answer this question, we'll look at the impact of tracing using Epsagon, a tracing system that easily integrates into both serverless and container-based systems.

Tracing follows a single chain of processing and data as it moves through your serverless system. Let's say that your team has an integration with Stripe. You have a Lambda function which handles the actual integration call, with data supplied by a transaction that is put into S3. The integration is failing, and you know the data put into S3 is bad. How do you find where the invalid data originated? Tracing pulls together all the logs and the chain of events from your Lambda functions and your containers, so that you can easily investigate a single integration failure across a multitude of integrations and services, cutting down the time it takes to find the root cause of an issue and resolve it.
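
To make the idea concrete, here is a minimal sketch of how a trace id can tie the steps of the Stripe scenario together. It is not how Epsagon or any other tracing product works internally; the in-memory FAKE_BUCKET stands in for S3, and every name in it (upload_transaction, stripe_integration_handler, events_for) is hypothetical.

```python
import json
import uuid

# Hypothetical stand-ins: a dict plays the role of an S3 bucket
# (object key -> (body, metadata)) and a list plays the role of the log store.
FAKE_BUCKET = {}
LOG_LINES = []


def log(trace_id, service, message):
    """Append one structured log line that carries the trace id."""
    LOG_LINES.append(
        json.dumps({"trace_id": trace_id, "service": service, "message": message})
    )


def upload_transaction(key, body):
    """Upstream service: starts the trace and stores its id next to the data."""
    trace_id = uuid.uuid4().hex
    FAKE_BUCKET[key] = (body, {"trace_id": trace_id})
    log(trace_id, "uploader", f"stored {key}")
    return trace_id


def stripe_integration_handler(key):
    """Downstream function: reuses the trace id found on the stored object."""
    body, metadata = FAKE_BUCKET[key]
    trace_id = metadata["trace_id"]
    if "amount" not in body:
        log(trace_id, "stripe-integration", f"invalid data in {key}: missing amount")
        return False
    log(trace_id, "stripe-integration", f"charged {body['amount']}")
    return True


def events_for(trace_id):
    """Return every logged event of one transaction, in the order it was logged."""
    return [e for e in map(json.loads, LOG_LINES) if e["trace_id"] == trace_id]
```

Because the downstream function logs with the id it found on the object, calling events_for with the id of a failed transaction returns the uploader's entry followed by the integration's error, which points straight at where the invalid data came from.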

That example deals with bad data, but tracing has an array of other use cases. The main ones you'll immediately benefit from are debugging failures, system health monitoring, notifications/alerts, and monitoring service/resource usage. These areas help your team stay focused on the capabilities that are important to you, while providing continual monitoring of your system to increase overall confidence that it is performing as expected.

Debugging Failures

Debugging failures gets a lot easier when something connects all of your service logs for a single transaction. This is one of the big differences between log management and trace management. With log management, the goal is to make your logs searchable for specific phrases and keywords, which you then use to find what you're after. Tracing ties the logs for a chain of events together. It focuses on following data and processing chains rather than keywords in your log files. While the primary purpose is to shorten time to resolution for issues, you can still programmatically tag traces with identifying information, such as customer id, origin of data, etc.
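
The contrast between the two approaches can be sketched in a few lines. This is an illustration only, assuming a simplified Trace record with free-form tags; the class and both helper functions are hypothetical and do not mirror any particular vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One transaction: its id, its tags, and its log lines in order."""
    trace_id: str
    tags: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def keyword_search(traces, keyword):
    """Log-management style: every matching line, stripped of its transaction."""
    return [line for t in traces for line in t.events if keyword in line]


def find_by_tag(traces, key, value):
    """Trace-management style: whole transactions whose tag matches."""
    return [t for t in traces if t.tags.get(key) == value]
```

A keyword search for "timeout" returns isolated lines, while find_by_tag(traces, "customer_id", "c-42") returns the complete chain of events for that customer's transactions.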

The second benefit of debugging with traces is that they accurately represent the timeline of events, durations, etc., across services. An example of where this was beneficial was when I was working with a team that was indexing data into AWS Elasticsearch. Their Lambda functions were timing out at random, seemingly without cause. Looking at the timeline, we discovered that it was taking two seconds or more for the function to start and that the errors always happened on cold starts (when AWS has to provision new resources because you've exceeded the ones already warmed up). After looking at the timing and cold start information, we could see that this Lambda function was being called multiple times in quick succession from the same process, which forced AWS to scale up the resources supporting the Lambda function. This ultimately caused delays. The solution was to have the process call the Lambda function with multiple values to process as a batch. This approach used fewer resources and eliminated the timeout issues. Having timing information allowed us to quickly see the anomaly and move to the next stage of the investigation.
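
The batching fix can be sketched roughly as follows. This is not the team's actual code: the invoke callable is a hypothetical stand-in for whatever performs the Lambda invocation, and the batch size of 25 and the "documents" payload key are arbitrary assumptions.

```python
def chunked(values, batch_size):
    """Split values into consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def index_documents(documents, invoke, batch_size=25):
    """Call invoke once per batch instead of once per document.

    invoke is a hypothetical callable that sends one payload to the
    indexing function. Returns the number of invocations made.
    """
    batches = chunked(documents, batch_size)
    for batch in batches:
        invoke({"documents": batch})
    return len(batches)
```

With 60 documents and a batch size of 25, the calling process makes three invocations rather than sixty, so far fewer concurrent execution environments need to be started.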

Finally, sequence diagrams can be very useful when working through the order of interactions across services. Having these perspectives on issues allows teams to reduce the time to resolution and increase system stability while spending less time to do so.

System Health Monitoring

When your system is hundreds of services running independently, how do you know that every part of the system is working as expected? System Health Monitoring is the answer to this question. It covers more than just whether functions are up and running: it also monitors errors, flags invocations approaching their thresholds, and offers advice on how to improve performance and cost based on usage. This helps increase confidence in your system's outlook.

Running multiple systems myself, I was surprised, as I moved each one onto Epsagon, by how many systems I presumed to be working were failing regularly. These failures were under the radar, and I wouldn't have known about them without the system health monitoring reports. Getting the reports moved me from firefighting random issues to diagnosing where the failures were occurring, freeing more of my time to focus on new capabilities.

Notifications/Alerts

One of the big changes with systems that do tracing is that getting your system to tell you when something goes wrong gets a lot easier. Since tracing systems track the successes and failures of APIs and processing, they often integrate with services like Slack, Microsoft Teams, email and even systems like PagerDuty. These integrations allow your team to respond to failures even before a customer reports them, using the processes you already have for handling these types of issues. The result is that your tracing service becomes your 24-hour monitoring team member in support of your system.
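
As a rough illustration of what such an alert rule does, the sketch below turns a list of invocation outcomes into a message body. The 5% threshold and the function names are assumptions made for the example; the JSON object with a single "text" field is the simple message shape that Slack incoming webhooks accept, and actually posting it is left out.

```python
import json


def failure_rate(outcomes):
    """Fraction of failed invocations; outcomes is a list of booleans (True = success)."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def build_alert(function_name, outcomes, threshold=0.05):
    """Return a JSON alert body if the failure rate exceeds threshold, else None."""
    rate = failure_rate(outcomes)
    if rate <= threshold:
        return None
    text = f"{function_name}: {rate:.0%} of the last {len(outcomes)} invocations failed"
    return json.dumps({"text": text})
```

A hosted tracing service evaluates rules like this for you; the point of the sketch is only that the alert is derived from success and failure data the tracer already collects.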

Monitoring Service/Resource Usage

When working on a hybrid architecture for a financial company, we started getting timeouts from a system that was hosted on-prem. The system typically responded within 30 milliseconds, but we ran into a situation where responses were taking over 6 seconds, which was the timeout for the Lambda function making the call. Using trace information, we could see that the service became less responsive over time, until most of our calls went unanswered. We quickly eliminated all other possibilities for the failure and engaged with the company that created the third-party system, with data to back us up. Having information on all the third-party systems your solution interacts with is critical when working through integration issues, especially if those issues happen rarely. By leveraging tracing information, the team gets this information immediately, without having to change their code to add tracking on a case-by-case basis.
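
The kind of evidence described above can be approximated from recorded call durations. The following is a simple heuristic sketch, not the analysis we actually ran: it assumes durations in milliseconds listed in call order, and the window size and function names are hypothetical.

```python
from statistics import mean


def windowed_means(durations_ms, window):
    """Average latency of each consecutive group of `window` calls."""
    return [
        mean(durations_ms[i:i + window])
        for i in range(0, len(durations_ms), window)
    ]


def is_degrading(durations_ms, window=3):
    """True if every window's mean latency is higher than the one before it."""
    means = windowed_means(durations_ms, window)
    return len(means) >= 2 and all(b > a for a, b in zip(means, means[1:]))


def timeout_count(durations_ms, timeout_ms=6000):
    """Number of calls that reached the caller's timeout."""
    return sum(1 for d in durations_ms if d >= timeout_ms)
```

A steadily rising series of window means, together with a growing timeout count, is the sort of data that makes a conversation with a third-party vendor much shorter.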

You also get the benefit of detailed metrics on function duration and memory utilization. When your function times out or runs out of memory, this information is critical for understanding when the behavior first appeared and how dramatic a departure from normal usage it is. This helps your team tune the performance of the Lambda function itself, and can even help the team identify when it would be better to transform the Lambda function into a Step Function due to long response times.
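
One way to put numbers on "when did it start" and "how dramatic is it" is to compare each sample against a baseline. The sketch below is an assumption-laden example: it treats the first few samples as normal usage and uses a threshold of three standard deviations, both arbitrary choices, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def departure(baseline, current):
    """How many standard deviations `current` is from the baseline mean."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


def first_anomaly(samples, baseline_size=5, threshold=3.0):
    """Index of the first sample that departs from the baseline, or None.

    The first `baseline_size` samples are assumed to represent normal usage.
    """
    baseline = samples[:baseline_size]
    for i, value in enumerate(samples[baseline_size:], start=baseline_size):
        if abs(departure(baseline, value)) > threshold:
            return i
    return None
```

Applied to per-invocation memory or duration figures, the returned index tells you which invocation first left the normal range, and departure tells you by how much.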

Conclusion

The goal when moving to serverless is to eliminate focus areas so your team can focus on business objectives. When maintaining a micro-service system, ensuring visibility into availability and health and shortening your time to resolution are imperative if your team is to keep focusing on new capabilities rather than continually being pulled back to previous work. Tracing, therefore, is imperative to your serverless team's success. Epsagon and others offer a wide variety of tools in this area, which will increase your speed in delivering your solution and decrease the effort of configuring and maintaining a home-grown solution.