This post is the first in the series “Twelve Lessons Learned with Performance Testing Server Side”. The series was adapted from the post “12 Lições Aprendidas em Testes de Performance Server Side”, written in Portuguese for my blog “The Bug Bang Theory” and originally published on January 31, 2013. The original post was based mostly on my experience after several months working closely with some great performance engineers when I was a consultant at ThoughtWorks. At that time I was leading the performance testing for the biggest magazine company in South America. The projects were developed mostly in Ruby on Rails, based on micro-services, with high availability and demanding requirements for response times and simultaneous users.
As the main performance testing engineer on the account, it was my responsibility to develop the tests, collect the data, diagnose the main problems and their dependencies, write periodic reports for the client, follow and highlight the improvements, and make sure that the performance tests were part of the continuous delivery pipeline. To do that, we chose Apache JMeter, a tool that I was already used to and that is open source, easy to use, well known and very well tested in the open source world. Another very welcome feature of JMeter was its CLI (command line interface), which made it possible to write scripts and automate the tests with minimum effort.
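As a rough illustration of the kind of automation the CLI allows, the sketch below wraps a non-GUI JMeter run in a small Python script. The file names `home_page.jmx` and `results.jtl` and the helper names are hypothetical, and it assumes the `jmeter` launcher is on the PATH; it is not the script we used in the project.

```python
import subprocess


def build_jmeter_command(test_plan, results_file, jmeter_bin="jmeter"):
    """Build a non-GUI JMeter invocation.

    -n runs JMeter without the GUI, -t points to the test plan (.jmx)
    and -l names the file where the sample results are written.
    """
    return [jmeter_bin, "-n", "-t", test_plan, "-l", results_file]


def run_load_test(test_plan, results_file):
    """Run the test plan and return the exit code of the JMeter process."""
    command = build_jmeter_command(test_plan, results_file)
    completed = subprocess.run(command, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_jmeter_command("home_page.jmx", "results.jtl")))
```

A pipeline step could call `run_load_test` and fail the build when the exit code is not zero, before the results file is analysed.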
Apart from JMeter, another tool that was very useful for evaluating problems and identifying issues faster was New Relic, a real-time monitoring tool that opened our eyes to the internal behaviour and bottlenecks while JMeter was loading and stressing the APIs and collecting information about part of the external behaviour.
Below you will find some observations and lessons learned during those months in the awesome world of performance testing:
Lesson 1 – The Average Response Time is the Fool’s gold of Performance Testing
It is quite common to see performance acceptance criteria based exclusively on average response time. I have seen many professionals in forums and blogs, and even in some materials such as training and certification booklets, referring to the average response time as the metric that defines whether your performance is acceptable or not, but that is not true at all.
The average response time is the sum of all response times divided by the number of samples, in this case the number of requests. For that reason, it is commonly taken as the response time that a visitor will get when he or she visits a page under a predefined load. The average response time must be seen as one indicator among many others, several of which are much more important, so it should never, ever be used as the primary indicator, and especially not as the only one, to evaluate the performance of a website, page or service.
To exemplify how the average response time can cause more harm than good if not taken together with other indicators, let’s pretend that we just ran a performance test and got the following response times, in seconds:
5, 11, 5, 1, 5, 2, 1, 5 and 1.
The data shown previously can be graphically represented as the following line chart:
To make it easy to understand, we are working with a very small number of samples.
Now let’s pretend that our product owner, or whoever makes the business/technical decisions on performance, said that usually, for this kind of system, users give up on loading a page or a service after four and a half seconds without a complete response. In this case, it is clear that we want the system to have a response time under four and a half seconds.
If we take the average response time as the ultimate indicator when evaluating the previous data, we will have something like this:
(5+11+5+5+5+2+1+1+1) / 9 = 4 seconds
In this case, using only the average response time, we could say that, based on our test data and under the given load, the users are happy, because the average response time of 4 seconds is within the acceptance criterion of four and a half seconds. But if we look from another perspective, we will realize that of our nine samples, only four are below the point where users get frustrated and abandon the page. The data is the same; the way you look at it is different (we will see this perspective in detail in the next lesson).
Do not get me wrong, the average response time has a lot to say, but if not taken carefully, it can be very dangerous and guide you towards false positives. It is unquestionable that it can highlight slower services and pages, and it is not a problem to use it to get a quick impression of the response times during a first evaluation, but it is one of the poorest ways to interpret your test data and should not be used without other metrics.
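The two readings of the same nine samples can be reproduced with a few lines of Python. This is only a sketch of the arithmetic from this lesson; the variable names are mine.

```python
# Response times (in seconds) from the example in this lesson.
response_times = [5, 11, 5, 1, 5, 2, 1, 5, 1]

# Frustration point given by the product owner in the example.
threshold = 4.5

# First reading: the average response time.
average = sum(response_times) / len(response_times)

# Second reading: how many requests actually finished under the threshold.
under_threshold = [t for t in response_times if t < threshold]
share_under = len(under_threshold) / len(response_times)

print(f"average response time: {average} s")
print(f"average is under the threshold: {average < threshold}")
print(f"requests under the threshold: {len(under_threshold)} of {len(response_times)} ({share_under:.0%})")
```

The average passes the four and a half second criterion, while only four of the nine requests were fast enough to keep a user on the page.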
Lesson 2 – Consider Response Time Percentiles for More Accurate Acceptance Criteria
One of the most interesting ways to evaluate your performance data is to add a new dimension to it, using percentiles as a criterion. When you use metadata, such as the average response time described in lesson one, you are vulnerable to misinterpretation. If instead you use the entire data set to extract information from other perspectives, you can bring more value and accuracy to your test results.
Unlike the average response time, response time percentiles can be defined to follow business goals when used to measure the percentage of happy users. Let’s keep the previous scenario, but instead of using only the information that the product owner gave us, we are now going to ask what minimum percentage of users must be below the magic number of four and a half seconds. Let’s pretend that the product owner said that at least 95% of the visitors should have a good experience, and that it would be acceptable for the other 5% to have a not so good response time at launch.
This new way of reading the same data can be graphically represented in the following percentile chart, using the X axis as the percentage of requests and the Y axis as the response time:
The previous chart highlights that only 44% of the requests (four out of nine) were under the frustration point of four and a half seconds. This kind of view makes it very easy to see whether a given share of transactions is good or bad for a given requirement, just by reading across the chart. It is very powerful, and it is not metadata: it is based on all samples, and the extra quantitative dimension of percentage of transactions makes this view very detailed and business oriented.
Of course you can have as many thresholds as you feel are necessary. In real life performance testing for services, you could start your thresholds at 99% and have several others using different precision, such as 99.9% and 99.999%.
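A percentile check like the one described in this lesson can be sketched as follows. It uses the nearest-rank definition of a percentile, which is only one of several definitions in use, so a reporting tool may give slightly different numbers on small samples. The function names and the shape of the `thresholds` dictionary are assumptions made for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def check_thresholds(samples, thresholds):
    """thresholds maps a percentile to the maximum acceptable response time.
    Returns, for each percentile, whether the criterion is met."""
    return {pct: percentile(samples, pct) <= limit for pct, limit in thresholds.items()}


# Response times (in seconds) from lesson one.
response_times = [5, 11, 5, 1, 5, 2, 1, 5, 1]

# The product owner's criterion: 95% of requests within 4.5 seconds.
result = check_thresholds(response_times, {95: 4.5})
print(result)
```

With these nine samples the 95% criterion fails, because the 95th percentile is the slowest request of 11 seconds, even though the average from lesson one looked acceptable.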
Lesson 3 – Try to see your system/services as a pipeline
Especially when talking about micro-services based applications, a bottleneck usually hides others that come after it. In complex systems we have several services working together, usually sharing data among them and consuming each other’s APIs. It is very common that when one of the services is compromised from a performance point of view, many others are affected, and that is bad not only because the system will be slow, but because it can hide the real performance of many services.
When running performance tests against a page, for example, a single micro-service that takes too long to respond can reduce the number of requests and transactions reaching all the services that it consumes. That would lead us to a false positive about the performance of those micro-services.
If you think of a pipeline where you have a narrow pipe distributing water to many other pipes, the volume of water that will reach the pipes after the first one is only the volume that the first pipe supports. So if you have a bottleneck in the first pipe, it does not matter how much volume you use to test the whole pipeline, because it will never put real stress on any pipe but the first one. The same goes for micro-services. If you test the performance of the system from only one point (like a page), you will face a big problem: you will be vulnerable to false positives.
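The pipe analogy can be turned into a toy model. The sketch below assumes that an entry service passes on at most its own capacity and that each request it handles triggers one call to every downstream service; the service names and capacities are invented for illustration and are not measurements from any real system.

```python
def downstream_load(offered_load, entry_capacity, downstream_capacities):
    """Toy model of the narrow first pipe.

    The entry service passes on at most `entry_capacity` requests per second,
    and each of those requests triggers one call to every downstream service.
    """
    passed_on = min(offered_load, entry_capacity)
    return {
        name: {"received": passed_on, "saturated": passed_on >= capacity}
        for name, capacity in downstream_capacities.items()
    }


# Hypothetical capacities, in requests per second.
downstream = {"text_api": 80, "image_api": 300, "comments_api": 150}

# The test offers 1000 req/s, but the entry service only sustains 100 req/s.
before_fix = downstream_load(1000, 100, downstream)

# The same test after the entry bottleneck has been removed.
after_fix = downstream_load(1000, 1000, downstream)

for name in downstream:
    print(name, "before:", before_fix[name], "after:", after_fix[name])
```

In the first run only `text_api` looks saturated. The other two services look healthy only because they never received more than 100 req/s, which is exactly the false positive described above; in the second run all three are saturated.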
To exemplify, pretend that this monolithic WordPress blog is a very complex system based entirely on micro-services, as represented by the picture below:
Now, pretend that we are testing the response time for the home page and it is extremely slow. Fortunately we have some kick-ass tool that collects information about each one of our micro-services and makes it very easy to diagnose our performance issues. After collecting the data, we got the following pie chart with the distribution of the slowness:
Given the previous example, it is quite easy to analyze it superficially and say that the only compromised API is the text API, because all the other APIs are consuming only 15% of the response time. With that in mind, performance engineers, developers and managers bury themselves under false performance evidence and make decisions such as focusing exclusively on fixing the text API, because all the others are doing fine. The result is very likely to be frustrating and unexpected: after solving a bottleneck, you can get new results with even worse global response times, since the new volume of requests reaching another API could expose even worse performance issues.
As seen previously, when you do a big bang performance test in a micro-service based application and you get a result like this, the only thing you can say for sure is that you have a problem in the one API that is consuming this absurd share of the response time. You cannot speculate about the other APIs, since all their response times can change drastically after fixing the first bottleneck. And if you keep working that way, you can delay the solution for the whole page even more.
Fortunately there is an obvious solution for this. Just as we had this problem with big bang software testing approaches decades ago and found good solutions by testing small pieces before doing the big functional testing, we can do the same thing with performance testing when approaching micro-service based applications, and that is described in the next lesson.
Lesson 4 – Divide and Conquer
As seen in the previous example, in many cases, and especially in micro-services based applications, it is not good practice to look only at the whole, for some of the reasons we just discussed. Knowing that a single micro-service can make requests to many other micro-services, and that those can also make requests to databases, other micro-services and sometimes even to themselves, it is quite clear that we have to think about a distributed performance testing strategy from the beginning of the project.
So in this case you can write different performance tests for each one of the services that compose the main application. Each micro-service must have its own performance and load acceptance criteria, based especially on its criticality for the business and its demand. If it is a very critical service that comes before most of the other ones, we need to make sure that it responds within a good response time, in order not to slow down the whole application. At the same time, if it is not that important but is frequently requested, you should still consider requiring a good response time, since the sum of the many response times for this request could also cause a lot of damage from a performance perspective. At this point I want to make it clear that we are not talking about solutions here, but about testing and diagnosis, so I will not go deeper into things like elasticity, caching, timeouts and so on.
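One possible way to keep a separate acceptance criterion for each service is sketched below. The service names, limits and samples are hypothetical, and the percentile uses the nearest-rank definition as an assumption; the point is only that each service is judged against its own criterion and its own samples.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    percentile: float   # e.g. 95 means "95% of the requests"
    max_seconds: float  # must complete within this response time


def nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def evaluate(criteria, samples_by_service):
    """Check every service against its own criterion.
    Returns a dictionary mapping the service name to True (met) or False."""
    results = {}
    for service, criterion in criteria.items():
        observed = nearest_rank(samples_by_service[service], criterion.percentile)
        results[service] = observed <= criterion.max_seconds
    return results


# Hypothetical criteria: the critical service gets the stricter one.
criteria = {
    "text_api": Criterion(percentile=99, max_seconds=0.5),
    "comments_api": Criterion(percentile=95, max_seconds=2.0),
}

# Hypothetical response times (in seconds) from two separate test runs.
samples_by_service = {
    "text_api": [0.2, 0.3, 0.4, 0.9],
    "comments_api": [0.5, 1.0, 1.5, 1.8],
}

report = evaluate(criteria, samples_by_service)
print(report)
```

Here `text_api` fails its stricter criterion while `comments_api` passes, even though `comments_api` is slower in absolute terms.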
Of course I know we do not live in a world full of unicorns where everything is perfect and you only have brand new greenfield projects, with all the time in the world to follow each one of the micro-services. In some cases you will be “forced” to work at the end of the project and find the performance issues of a multi-API system just as you would in an old monolith. Luckily not everything is lost in this scenario, and there is a way to bypass the “fix one bottleneck, find a new one” scenario that I described before. There are many tools today that give you an easy way to find out which services you depend on. As mentioned at the beginning of this article, New Relic is one of them.
With the help of that kind of tool you can have an X-ray of your application for a first diagnosis, find the different micro-services and learn how to interact with them. You also see how often each service is being called, the average, highest and lowest response times for each service and other useful information. This information is more than enough to start a strategy to divide and stress each API separately from the others and get real performance numbers.
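To make the kind of per-service summary mentioned above concrete, the sketch below computes the call count and the average, lowest and highest response times from a list of raw samples. It does not use any New Relic API; the `(service, seconds)` sample format and the data are assumptions made for this example.

```python
from collections import defaultdict


def summarize(samples):
    """Group (service, response time in seconds) pairs by service and
    compute the call count, average, lowest and highest response time."""
    grouped = defaultdict(list)
    for service, elapsed in samples:
        grouped[service].append(elapsed)
    return {
        service: {
            "calls": len(times),
            "average": sum(times) / len(times),
            "lowest": min(times),
            "highest": max(times),
        }
        for service, times in grouped.items()
    }


# Hypothetical samples collected while a page was under load.
samples = [
    ("text_api", 2.0),
    ("image_api", 1.0),
    ("text_api", 4.0),
]

summary = summarize(samples)
for service, stats in summary.items():
    print(service, stats)
```

A summary like this is a starting point for deciding which services to isolate and stress first, not a replacement for testing each one separately.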
The report below was taken from the New Relic page and it shows how useful this kind of tool can be when it comes to diagnosis, especially in scenarios where the approach is to start with an entire application instead of each micro-service.
I hope I have helped with a bit of theory regarding performance. This is just the first post, so if you liked it, follow me for the next two posts of this series and for a lot of JMeter content.