In 2015, we decided to extinguish the Quality Assurance (QA) function of our Locaweb product development team. We had 12 QAs, some with a developer profile and others with a SysAdmin profile. In proposing the extinction of the QA role, some of the QAs became devs, while others took on the role of system administrators. The reasons that led us to extinguish Locaweb’s QA function were:
When I joined the Conta Azul, they had also just extinguished the role of QA, and the ex-QAs became business analysts, mainly helping product managers.
I saw other companies also discussing the need for QAs, and in some cases, a heated debate emerges around this topic. However, having or not having QAs should not be the focus of the discussion. Having or not having this function is the solution to a problem, usually referred to as “how can we improve the quality of our product?”, And that problem should be the center of the discussion.
A simple Google search for software quality will yield tons of definitions typically related to meeting functional and non-functional requirements. When the software does not meet a functional or non-functional requirement, it has a defect, a bug. Therefore, to improve the quality of a software product, we need to work on two things:
A good way to track this is to have a weekly measurement of your bug inventory and new bug creation per week and discuss this every week with the team. We did that at Gympass. We define at the beginning of each quarter what the target is for the bug inventory and the average new bugs per week.
The image shows the evolution of our bug inventory during the 2nd quarter of 2019. We started the quarter with 215 bugs in our stock and we aimed for a target of less than 166 at the end of the quarter, a reduction of almost 23%. We ended the quarter with an inventory of 136 bugs, a reduction of 36%. We did this by focusing not only on resolving bugs in our inventory, but also on controlling the number of new bugs per week.
In the first quarter of 2019, we had an average of 26.2 bugs created per week. During the second quarter, we reduced this average to 17.4 new bugs per week, to a total of 226 new bugs during the quarter. This is a 33% reduction in the number of new bugs per week.
That looks like a pretty good improvement, right? But there is a lot of room for improvement there. Let me explain the math of bug management:
If we were able to reduce our bug stock from 215 to 136, it means that we have resolved at least 79 bugs. However, we created 226 new bugs (17.4 new bugs per week x 13 weeks) during the quarter. So we solved 79 + 226 = 305 bugs during the quarter, a lot of bug fixing work. If we had generated 90 new bugs during the quarter, an average of 6.9 new bugs per week, instead of the 226 new bugs, we could have zeroed out on the bug inventory.
An additional aspect of the bug resolution to be measured is the SLA (Service Level Agreement) resolution, that is, how many days the team took to resolve a bug from the day the bug was first identified. For this, we classify the bugs by their severity, which is the impact it has on users and the business. The most serious bugs are those that we need to resolve on the same day; high severity errors, in 7 days and average severity, in 14 days. The following chart shows how we were at Gympass in the fourth quarter of 2019.
This is not the ideal way of viewing this info because it shows only an image of the moment, and not an evolution. To understand the evolution of any metric, you need to see how it did at different points in time.
As soon as I joined Lopes, I started bringing this topic up for discussion with the teams. One of the things we noticed is that 50% of deployed items were bug fixes. I was informed that “these bugs were caught before going into production, which is a good thing”. In fact, it is a good thing that these bugs did not reach the production environment and appeared to our users. However, they reached pre-production and needed to be corrected. Wouldn’t it be better if these errors didn’t even exist, not even in pre-production?
The OKRs we defined to help us with the quality theme were 3 additional KRs in order to Increase the cadence of deploys in production that I mentioned in the previous chapter:
And we add the following OKR:
At Conta Azul, we doubled the product development team in a period of 8 months between November 2017 and July 2018. This growth was aimed at increasing the team’s productive capacity.
In addition, we divided the quantity of deliveries by the total number of people on the team to assess whether we were managing to increase our productivity individually.
However, with the increase of people on the team, it ended up increasing the amount of bugs. So much so that the team that had already had 40% of its deliveries as a bug fix ended up having to increase this proportion to 60%. That is, despite having increased individual and total productivity, this increase in productivity was not being felt by the user, as it ended up being used for bug correction.
To control this problem, we increased our focus on fixing these bugs within the SLAs, which were:
See that the quality has worsened and the customer suffered from it. But, after some time, we managed to return to the defined SLA levels. We looked at this metric weekly and, whenever we discussed this metric, we agreed that the best way to comply with the SLA was not to create bugs!
In addition to bug control, there are several other aspects that impact the quality of the digital product that we deliver to users. Performance, scalability, operability, monitorability are some examples of non-functional requirements.
When I joined Gympass, on my second Monday the system went down for users around 7 pm. I started asking people on the team what was going on and the answer was that Mondays are peak days in terms of gym visits and that the system could not handle the volume at all. As there was no monitoring, we were not alerted that the volume was higher than usual and we were unable to prepare properly. Two months later, when Rodrigo Rodrigues joined Gympass as CTO, he dubbed the event “Black Mondays”. To address the problem, we started to monitor and implement an infrastructure that could handle Monday peaks. And we set OKRs for uptime, successful HTTP requests and backend response time.
Any user prefers to use a good quality product that behaves as expected. This is a sine qua non condition to provide a good user experience.
In addition to the user experience, there is another important aspect to consider when we talk about quality and bugs. Whenever someone needs to work on resolving a bug that was found in a digital product, that person needs to stop working on whatever they are currently working on in order to resolve the bug. This is an interruption in the workflow. If that person were able to deliver the software without that bug, they could continue to work on new things without interruption, which would make them more productive.
I had the opportunity to participate in an MIT course on how to create high-speed organizations. The course was taught by Professor Steven J. Spear, author of the book The High-Velocity Edge: How Market Leaders Leverage Operational Excellence to Beat the Competition. This is one of those very dense courses, full of content, but which can be summarized in one paragraph:
High-speed organizations are able to learn very quickly, especially with their failures, and to absorb that learning as an integral part of the organization’s knowledge.
A high-speed organization works by following the 4 steps ahead:
The classic example is Toyota, with lean manufacturing and the concept of stopping production whenever there are failures, correcting them and using them as a learning opportunity so that they no longer happen. This ability to learn from failure is what gives Toyota the ability to stay ahead of its competitors for so long.
Another good example is Alcoa, which had a work incident rate of 2% per year, which was considered normal. Alcoa has more than 40,000 employees, so 2% of work incidents per year means that 800 employees per year have some type of work incident. This is a very impressive and worrying number.
To combat this problem, they implemented a zero error tolerance policy. Before implementing this policy, mistakes were seen as part of the job. Employees are now encouraged to report operational errors within 24 hours, propose solutions within 48 hours and report the solution found to their colleagues to ensure that knowledge spreads throughout the organization.
This caused the risk of incidents to drop from 2% to 0.07% per year! This reduction in the incident rate meant that fewer than 30 employees per year had any work incident problems after the zero error tolerance policy was implemented and Alcoa achieved an increase in productivity and quality similar to that of Toyota.
An important factor in the Toyota and Alcoa examples is that recognizing and learning from failures must be part of the company’s culture. This is somewhat more common in the culture of technology companies, but not so common in traditional companies. During the MIT course I shared a table with a Brazilian executive from Grupo Globo, the major TV company in Brazil, a Spanish executive from AMC Networks International (Walking Dead, Breaking Bad and Mad Men), a German project manager living in Azerbaijan who works for Swire Pacific Offshore (oil and gas industry) and an MIT postdoctoral student in solar energy from Saudi Arabia.
All of my table mates were from more traditional industries. I was the only one at an internet company. The executives at Globo and AMC were there because they saw Netflix with its streaming video on demand and YouTube with its huge catalog of user-generated videos as major threats, stealing their audience very quickly and they wanted to understand how they could defend themselves.
Although the theme is somewhat obvious for internet companies, especially with the culture of technology startups that value fail fast. That’s what makes Netflix and YouTube a threat to traditional media companies like Grupo Globo and AMC Networks. However, even though this is part of the culture of internet companies, sitting and discussing it with people from more traditional companies was a great opportunity to reflect on the relationship between failure, failure recognition, learning and high speed:
Even though we hear a lot about the culture of internet companies to fail fast, talking about failing fast diverts our focus from what is really important, learning fast. We must put our energy into learning, not into failure itself. It is the learning process that makes people and companies evolve. And it is the ability of an organization to learn fast, especially with its failures, that will allow it to move at really high speeds.
In the next chapter we’ll see a little more about metrics.
Do you work with digital products? Do you want to know more about how to manage a digital product to increase its chances of success, solve its user’s problems and achieve the company objectives? Check out my Digital Product Management bundle with my 3 books where I share what I learned during my almost 30 years of experience in creating and managing digital products: