The joy of fixing bugs

Bugs slow our companies down. How can we use them as a stepping stone for engineering excellence?

The first documented bug - Naval Surface Warfare Center - public domain

I hate bugs. Every time a new one appears, I feel dread. I felt this way as a developer, and I feel it ten times more strongly as a CTO. As a developer, I was annoyed that I had to waste so much of my time fixing bugs instead of writing cool code. As a CTO, I feel the pain of buggy software even more. Customers complain because they can’t do what they want. Managers complain to me because customers complain. My team complains because they have to fix bugs. And other teams complain because we don’t have time to address their tech requests. I spend my time trying to appease everyone. Behind every bug hides one or more frustrated users. Each bug degrades our relationship with them, and this makes our lives miserable.

Over the past couple of years, I have focused on breaking this infernal cycle. I first tried a gentle approach. I tried to show my teams the importance of fixing bugs. I provided them with a problem-solving template I had used in my previous job. This template was supposed to guide them in fixing their bugs. This hands-off approach did not work. I saw other tech leaders take a much more brutal approach by imposing strict bug reduction targets without providing much support. Neither approach works. Just as I did, most developers don’t like to fix bugs. It’s dirty, it’s annoying, and it feels like a waste of our talent. Plus, I came to realize that most developers don’t know how to fix bugs well, let alone avoid them. So I took a Lean viewpoint and changed my approach radically. I decided to turn this hatred into something positive. I decided to try and make bug fixing a joyful activity!

Bug fixing can become a cool team activity where we do what we do best: code, simplify, and structure. With thorough fixing, boring bugs disappear quickly. The new ones that come up are increasingly interesting and teach us something new. Teams learn fresh things every day, and they can spend more time on features and innovations.

Aiming for radical quality

A few years ago I worked for a web and data agency. I became the team leader on an ongoing project to replace a web application for solar panel monitoring. We did everything by the book: we worked as a scrum team, had a continuous delivery pipeline, and collaborated closely with the customer. After a few months, our lives got miserable. Development started to slow down, epics were systematically delivered late, and stakeholders kept asking why. The team felt demoralized and powerless.

I set out to discover what was at the root of this thorny problem. We knew that our data was imperfect. Every week, the Product Owner showed stakeholders that 80% of our data was correct, which was insufficient. After a closer look, I estimated that the reality was closer to 10%. The development team had rotated a few times over the previous months and I was a recent joiner. Even with the previous developers' help, we had deep knowledge gaps about the existing features and data structures. Everyone assumed that we were on track and that quality would be fixed along with the next epics. But in fact, we were slow because of these knowledge gaps. At the next team ceremony, I shared my analysis of the situation and explained how our measurements of quality were all wrong. I insisted that quality was really bad: we had to stop further development and stabilize quality before going any further. The stakeholders stayed silent for what felt like minutes while the team held their breath. Fortunately, they gave us the go-ahead to focus on improving quality over the following months.

This marked the beginning of one of the highlights of my career.

First, we visualized all quality problems. We created a dashboard that showed the total number of known problems, their categories, and their locations in the data pipeline. The top of the dashboard had broader metrics, shown as simple graphs. It went into more detail as one scrolled down. We gradually added more monitoring and logging at different stages of the data flow. We analyzed the root causes of known bugs daily and presented the results to stakeholders to regain their trust.

Below are examples of the KPIs on our dashboard.

The total count of problems at one point in time 
Visualization of where errors occur in the Lambda functions of the pipeline, and which errors are the most common
Visualization of Lambda functions and their error rate
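
To give an idea of the numbers behind such a dashboard, here is a minimal Python sketch. It is not our actual code: the `Problem` record, the stage names, and the `summarize` and `error_rates` helpers are all made up for illustration, and the sample data is invented. It only shows the kind of aggregation involved: a total count, a breakdown by category and by pipeline stage, and an error rate per function.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """One known quality problem (hypothetical record layout)."""
    category: str  # e.g. the exception name
    stage: str     # hypothetical name of the pipeline stage where it occurred


def summarize(problems: list[Problem]) -> dict:
    """Aggregate problems into the top-level numbers a dashboard could show."""
    return {
        "total": len(problems),
        "by_category": Counter(p.category for p in problems),
        "by_stage": Counter(p.stage for p in problems),
    }


def error_rates(invocations: dict[str, int], errors: dict[str, int]) -> dict[str, float]:
    """Error rate per function: errors divided by invocations."""
    return {
        name: errors.get(name, 0) / count
        for name, count in invocations.items()
        if count > 0  # skip functions that were never invoked
    }


# Invented sample data, for illustration only.
problems = [
    Problem("InverterNotFound", "ingest"),
    Problem("InverterNotFound", "ingest"),
    Problem("IniFileNotFound", "deserialize"),
]
summary = summarize(problems)
rates = error_rates({"ingest": 200, "deserialize": 50}, {"ingest": 10})
```

Sorting `summary["by_category"]` with `most_common()` gives the "most common errors" view, while `rates` feeds a per-function graph.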

In addition, I committed to responding within one minute whenever a member of my team had a problem or a doubt about how to go about something. Twice a day, I discussed our findings with the whole team. Sometimes we found quick wins; other times we had to think harder about how to solve trickier issues. Often, we realized that the problems reflected our knowledge gaps about the software and the solar panel equipment.

To develop our collective knowledge, we made the product architecture and data structures visible to the whole team on a physical wall. We highlighted the areas where we encountered problems. This way, we were able to spot the most problematic areas and everyone agreed on what was most critical. People could then propose solutions, creating motivation.

Examples of fun bug solving

One issue was “InverterNotFound”, which had occurred 30 times. An inverter is a piece of equipment that converts the direct current produced by solar panels into alternating current. This exception meant that an inverter had sent us data, but its serial number did not match any known piece of equipment. We dug deeper into specific cases. We realized that some inverters' serial numbers did not correspond to real equipment. Rather, some of them were logical groupings that had to be mapped differently. We fixed the ones we had and managed to find a thorough listing of all these special inverters. We made sure that new equipment would be added quickly to our list. We documented the existence of these logical groupings in our architectural view. We also set up an alert for any new occurrence of the “InverterNotFound” problem.
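
The idea behind this fix can be sketched in a few lines of Python. This is an illustration under assumptions, not our production code: the `InverterRegistry` class, the serial numbers, and the choice to map a logical grouping to a list of equipment ids are all hypothetical. The `unknown_serials` list stands in for whatever feeds the alert.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class InverterNotFound(Exception):
    """Raised when a serial number matches no known equipment or grouping."""


@dataclass
class InverterRegistry:
    # Serial number -> equipment id, for physical inverters.
    physical: dict[str, str] = field(default_factory=dict)
    # Serial number -> equipment ids, for logical groupings (assumed mapping).
    groupings: dict[str, list[str]] = field(default_factory=dict)
    # Serial numbers that could not be resolved; a real system would alert on these.
    unknown_serials: list[str] = field(default_factory=list)

    def resolve(self, serial: str) -> list[str]:
        """Return the equipment ids that a reported serial number stands for."""
        if serial in self.physical:
            return [self.physical[serial]]
        if serial in self.groupings:
            return list(self.groupings[serial])
        self.unknown_serials.append(serial)
        raise InverterNotFound(serial)


# Invented sample data, for illustration only.
registry = InverterRegistry(
    physical={"SN-001": "inverter-a", "SN-002": "inverter-b"},
    groupings={"GRP-01": ["inverter-a", "inverter-b"]},
)
```

Keeping physical inverters and logical groupings in separate maps makes the distinction we had missed explicit in the code, which is the knowledge we wanted the team to retain.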

Some errors led to surprising discoveries, such as “IniFileNotFound”. An INI file in our system described the schema for data sent by an inverter. Different makes or even different versions of the equipment used different schemas. This exception meant that we did not have the INI file needed to deserialize an inverter's data. Again, we looked at specific cases. Some inverters had INI files, but a significant number did not. Code existed to import INI files, but we realized that it could never be executed: previous developers had started to work on it but never finished it. They had decided to take a shortcut and copy the INI files manually into the object storage! So we implemented the import and set up a new alert so we could be notified of new occurrences of “IniFileNotFound”. We also added a dead code detector to our Continuous Integration pipeline.
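
Here is a rough Python sketch of what a schema-driven deserialization like this can look like. Everything specific in it is an assumption made for illustration: the INI layout (a `[fields]` section listing `name = type`), the semicolon-separated payload, and the lookup by make and version. Our real files and payloads were different.

```python
from __future__ import annotations

import configparser


class IniFileNotFound(Exception):
    """Raised when no INI schema is available for an inverter make and version."""


# Supported field types in the hypothetical INI layout.
CASTS = {"int": int, "float": float, "str": str}


def load_schema(ini_text: str) -> list[tuple[str, type]]:
    """Parse a `[fields]` section into an ordered list of (name, cast)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep field names case-sensitive
    parser.read_string(ini_text)
    return [(name, CASTS[kind]) for name, kind in parser["fields"].items()]


def deserialize(payload: str, make: str, version: str,
                schemas: dict[tuple[str, str], str]) -> dict:
    """Decode a semicolon-separated payload using the schema for this equipment."""
    key = (make, version)
    if key not in schemas:
        raise IniFileNotFound(f"no INI schema for {make} {version}")
    fields = load_schema(schemas[key])
    values = payload.split(";")
    return {name: cast(value) for (name, cast), value in zip(fields, values)}


# Invented schema store, standing in for the object storage.
schemas = {("acme", "v2"): "[fields]\npower_w = float\nstatus = str\n"}
reading = deserialize("1520.5;OK", "acme", "v2", schemas)
```

With this shape, a missing schema fails loudly at the lookup instead of producing silently wrong data, which is what made the missing import visible in the first place.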

For both problems, we did not just fix the bugs; we also took the time to change our coding methods. Bug after bug, the team was ecstatic and having fun. Fixing bugs this way created stronger bonds within our team. After three months, we reached close to 100% quality on existing features and the team was quite proud of the achievement. We were now able to resume work on new epics. Last but not least, the whole team was going at the speed of light! We had never been able to work so fast. Everyone on the team was finally more at ease with the software, and it felt great.

Bugs reflect our knowledge gaps

This is a familiar story in the development world. We feel good piling up new features until it becomes a burden. Working since then as a consultant and as a CTO, I've seen this scenario play out several times, and every time it is disheartening for the team. In a survey by Rollbar, 38% of developers said they spend at least 25% of their time fixing bugs reported by users or QA, and 26% said they spend more than half their time. This leaves out bugs that are caught earlier, in development or by the continuous integration pipeline. So how can we use all these bugs as an opportunity to develop excellence in engineering?

I have become convinced of one thing. The quality of a software product reflects the engineering team's knowledge of the product and its environment. Hence, we need to think of bug fixing as a path to improving our own knowledge, not as a chore. I have developed my own framework for this, which I may share in future articles.

What about you, is fixing bugs a joyful activity in your company or a chore?