 Software testing training, consulting, and outsourcing from the experts: Rex Black Consulting Services (RBCS)

Archive for August, 2012

Defect Metrics

I had another interesting question from a reader that I’m getting to in the blog today:

I was reading some of your excellent webinars, especially the ones about defect metrics. So my question is: is there any benchmark data about defect severity, and also about reopened defects?

I would really appreciate it if you could guide me to where I can find this data.

Note: the purpose of this data is to measure and compare our results against the benchmark.

Tomas Gotes

Thanks for the question, Tomas.  In terms of defect severity, unfortunately there are no standard, common definitions for severity in the industry yet.  So, any metrics that showed aggregated data from multiple organizations would probably be rather questionable.  That said, you might check out Capers Jones’ excellent book, The Economics of Software Quality, which, as I recall from reviewing it, had some interesting analysis here.

In terms of defect report re-open rates, our target during assessments is 5% or less.  Some amount of defect report re-open seems inevitable, since test environments and data are generally more representative of production than are the developer’s environments.  However, there is a significant risk of schedule delay, as well as significant inefficiency (due to an additional layer of rework), when the defect re-open rate gets too high.
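As a rough illustration of how a team might check itself against that 5% target, here is a minimal sketch. The function name and the sample counts are hypothetical, and it assumes the re-open rate is simply the number of re-opened reports divided by the total number of reports.

```python
TARGET = 0.05  # the 5% target mentioned above


def reopen_rate(total_reports: int, reopened_reports: int) -> float:
    """Fraction of defect reports that were re-opened at least once."""
    if total_reports <= 0:
        raise ValueError("total_reports must be positive")
    return reopened_reports / total_reports


# Hypothetical counts, for illustration only.
rate = reopen_rate(total_reports=400, reopened_reports=28)
print(f"re-open rate: {rate:.1%}, within target: {rate <= TARGET}")
```

With these made-up counts the rate is 7%, which would be above the target.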

Measuring the Value of Unit Tests

I received an interesting, detailed query from a reader of my new e-book on testing metrics:

Dear Rex,

My name is Marcus Milanez and I’m a software developer who lives in São Paulo, Brasil. I recently bought a copy of your “Testing Metrics” ebook and found it really valuable for improving my understanding on testing processes, as well as hearing more words on the importance that metrics play on my daily tasks. Thanks for sharing all your studies and results, I appreciate it.

However, I still have some questions for which I couldn’t get any reasonable answers – maybe they’ve been presented to me already, but my limited comprehension is failing to absorb them. To pose these questions correctly, I would like to add some general context, just to make sure that I’m using my words properly.

Forgetting a little bit about all the nitty-gritty details that are observed during the creation or maintenance of a software system, in general, in a regular software development process we have:

1) Customers require features that must be properly implemented

2) Developers along with specification teams, stakeholders and other parties, define the scope of that feature and write a couple of functional/acceptance tests that validate the analysis of the customer request

3) In a planned sprint, developers implement the feature and get feedback from the specs team, stakeholders and customers, to make sure that everything is fine.

4) The quality team verifies the quality of any version delivered by developers. Bugs are filed and later fixed by developers in a sprint.

I clearly understand that the list above is incomplete, could be better and is definitely subject to change, but I’m just trying to give an overall context. Now, by reading your book, I could clearly see all the value that metrics and proper testing strategies add to a software project, and I’m glad I was introduced to things like DDE and all those sample charts – that information is really gold. While reading the book though, I tried hard to think and use the same logic that you presented in your DDE formula, for somehow measuring the value of unit tests created by developers during development time. The thing is that I couldn’t get to any conclusion because it looks like the value added by unit tests can’t be measured in the same way that you presented in your book. My reasons for that are:

1) Looks like unit tests are not making verifications in the same sense that exploratory or automated tests are – rather, we regularly use them for getting something built according to a set of premises, and later on, this set of premises can be verified.

2) Developers don’t usually count, or indicate, bugs found during development time. I don’t even think that this would make any sense.

3) They are not testing a use case, they are testing smaller pieces of it. Although these smaller pieces can directly affect a use case, looks like these observations are quite different from those observed by an automated functional/acceptance test.

So, my question is: can we measure the value/cost benefit added by unit tests, using metrics that indicate that they are definitely effective in a software project, or is the value of unit tests not related at all to the number of bugs filed by a validation team, for example? Can the value added (or not) by unit tests be measured, qualified and improved? Is it possible to observe unit tests, alone, from a cost perspective?

I exercised this question on my own, and so far the following seem to be valid for me:

Looks like the main intention of unit tests is not to find bugs. Unit tests seem to be closer to a development tool (like an IDE is) than a tool for finding problems.

Since unit tests are part of the development effort, it is difficult to say how many hours have been spent writing unit tests versus writing production code.

Although a project with high code coverage by unit tests tends to have fewer bugs filed, unit tests are not a guarantee of code free of use case bugs. Still thinking about the number of bugs, I believe that unit tests tend to prevent run-time errors more than use case errors.

Apparently, the real value of unit test comes in terms of maintenance, ability of adding or removing features without fear, and mainly as a means of communication with other developers working on the same code base.

If these first items are valid, it is likely that unit tests will prevent the creation of a given number of bugs during the validation phase. However, in order to assert that, we would need to collect all the metrics related to a given project that doesn’t have unit tests (like DDE, cost per bug fix, time per bug fix and others), properly implement unit tests for this project, and then get the very same metrics again for comparison. Is this rationale correct? If so, can we confidently say that unit tests, alone, were responsible for getting these numbers changed? Is there any published study that analyzes these points?

Thank you so much for your attention and sorry for this long email. The thing is that I found your book extremely useful and I definitely want to improve my understanding of many aspects of my profession.

yours,

Marcus Milanez

Thanks for your kind words about my book.  I’m glad it’s proving interesting and useful to you, and that it provoked such a good and reflective set of questions.

Before I get into a detailed response, let me set up the background for what I’m about to write:

  • First, I agree with your brief, general summary of the software process as it applies in Agile development, and I’ll refine it further in a moment.  I would add, at this point, that in many organizations there is an additional test level that occurs, sometimes called “system integration test,” where the software produced in the Agile sprints is tested with other software which interoperates or cohabitates with it.  This level can occur once after the last sprint, but it’s better if it happens iteratively as each sprint concludes.
  • Second, since you mention sprints and since your description of the process seems very Agile-oriented, I assume that we are talking about unit tests in the context of Agile development.
  • Third, I assume that, when you talk about unit tests, since you say that developing the unit tests is “part of development,” you are talking about unit tests that are used for test-driven development (whether formally or informally).  That is, the unit tests are developed at the same time the code is being developed, and coding, developing tests, and executing the tests proceed in a tight loop until the unit is complete.
  • Fourth, I assume, since you talk about unit tests allowing us to make changes without fear of regression, that you are talking about automated unit tests, ideally incorporated in a continuous integration framework such as Hudson.

With those assumptions in mind, here are my thoughts on the many good comments and points that you raise.  Let me start by reviewing the development process:

  1. Test-driven development is a great idea.  Indeed, I did that myself years ago when I was a programmer.  However, it’s a misnomer to call that process “test driven.”  The “tests” used in this process are actually more like executable low-level design specifications.  I know you are already aware of this, because of your comment about these tests “not testing use cases.”  What can be somewhat confusing about this is that the work product produced by test-driven development includes a set of automated unit tests that subsequently can be used for something properly referred to as testing, specifically regression testing. Since test-driven development, when it’s happening, is not really testing, the objective of the developer as he or she creates the code and the unit tests is not really to find defects.  (Finding defects is one of the typical objectives of most test levels, including unit testing.)  The main objective is to build confidence, as the coding proceeds, that the code being built works properly. So, if we want to measure the level of confidence that we should have, that is something better measured using coverage metrics.  In this case, the most relevant coverage metrics would be code coverage and, in some cases, data flow coverage. 
  2. Once the unit is completed, typically there is a missing step in most organizations.  Ideally the developer would apply structured test design techniques to augment the unit tests in order to make them true unit tests, and to find any defects that remain in the completed unit.  If developers were to do this systematically, the quality of code delivered for testing would dramatically improve.  Capers Jones’ studies show that, typically, unit testing has a defect removal effectiveness of only 25%, but best practices in unit testing boost that to 50%.
  3. In fact, there is another missing step in many organizations at this point.  The best practice would be for the programmer to take the code, the unit tests, and the unit test results to a code review.  A study by Motorola showed that, to maximize effectiveness, this review should include at least two people other than the author, so pair programming or simply having a lead developer review the code are not sufficient.  The unit tests might be augmented based on this review, and, if so, re-run.
  4. At this point in the process, the unit tests created in step 1 (and ideally steps 2 and 3) of the process are incorporated into a continuous integration framework, as mentioned earlier, and serve as ongoing guards against regression of the unit.  If the unit is refactored, the tests may have to be updated, as you mentioned.
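To make the 25% versus 50% figures from step 2 concrete, here is a small worked sketch. It assumes that defect removal effectiveness means the fraction of defects present in the unit that unit testing finds and removes; the starting count of 200 defects and the function name are hypothetical.

```python
from typing import Tuple


def defects_removed(defects_present: int, removal_effectiveness: float) -> Tuple[int, int]:
    """Return (removed, escaped) for a given defect removal effectiveness."""
    removed = round(defects_present * removal_effectiveness)
    return removed, defects_present - removed


DEFECTS_PRESENT = 200  # hypothetical number of defects present before unit testing

for label, effectiveness in [("typical", 0.25), ("best practice", 0.50)]:
    removed, escaped = defects_removed(DEFECTS_PRESENT, effectiveness)
    print(f"{label}: {removed} removed, {escaped} escape to later test levels")
```

Under these assumptions, moving from typical to best-practice unit testing means 50 fewer defects are passed on to later test levels.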

So, let’s get back to metrics and measuring the value of unit testing, starting with not logging defects found by developers.  This is a mistake.  I agree that we don’t need to log failures that occur during step 1 above, because, as I said, we’re not really testing, we are developing the software.  However, once we get into steps 2, 3, and 4, I believe those defects should be logged.  Often when I say this, people have a horrified reaction and say, “Oh, no no no, we can’t have programmers interrupting themselves and breaking their flow to log defects in some cumbersome bug tracking tool.”

To which I reply:  “I didn’t say they had to use a bug tracking tool.  I said they should log the defects.”  Here’s the distinction. I agree that there’s no need for a tracking tool here, because the bug is going to be immediately removed.  Its lifecycle will start and end on the same day, and so all the state-based stuff that the tracking tool does is unnecessary.  But we do need to log information about the defect, especially classification information, phase of introduction, etc., and we should use the same classifications as are used during formal testing.  This information can be captured in a simple spreadsheet. That way, we can do analysis of the defects found during unit testing (step 2), code reviews (step 3), and unit regression testing (step 4).

What can this analysis do for us?   A number of things.  First, it can help developers and the broader organization learn how to write better code.  Patterns in types of defects found during steps 2, 3, and 4 can show that training and mentoring is needed to reduce bad coding practices or habits that some developers might have. 
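A lightweight log of this kind could be as simple as a CSV file exported from a spreadsheet. The sketch below is only an illustration: the column names, classifications, and rows are all hypothetical, and a real log should reuse the classifications from formal testing.

```python
import csv
import io
from collections import Counter

# Hypothetical defect log; in practice this would be a spreadsheet export.
LOG = """step,classification,phase_introduced
unit testing,boundary error,coding
code review,missing null check,coding
unit testing,boundary error,coding
unit regression,interface mismatch,design
code review,boundary error,coding
"""


def tally(log_text: str, column: str) -> Counter:
    """Count logged defects by the values found in one column."""
    reader = csv.DictReader(io.StringIO(log_text))
    return Counter(row[column] for row in reader)


by_class = tally(LOG, "classification")
by_step = tally(LOG, "step")

# The most frequent classifications hint at where training or mentoring may help.
for name, count in by_class.most_common():
    print(f"{name}: {count}")
```

Tallying the same log by step or by phase of introduction uses the same function with a different column name.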

Second, to address your main question, it can allow us to directly measure the value of unit tests.  To do so, we need to do one more thing, which is to estimate the effort associated with unit testing and unit test defect removal as well as estimating the effort associated with other levels of testing (e.g., acceptance testing, system integration testing, etc.), removal of defects found in those levels of testing, and the effort associated with failures in production.  This is actually easier than you might think, and sufficiently accurate numbers can be obtained by surveying the people involved. 

The differences are often quite dramatic.  One client found that a defect found and removed in unit testing took three person-hours of effort, while a defect found and removed in system testing took 18 hours, a defect found and removed in system integration testing took 37 hours, and a defect found and removed in production took 64 hours.  So, for this client, each defect found in unit testing saved at least 15 hours of effort, and possibly as much as 61 hours of effort!
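The arithmetic behind those savings can be sketched as follows. The effort figures are the ones quoted above for that client; the function name and the count of 40 unit-test defects are hypothetical, added only to show how a project-level estimate might be built.

```python
# Person-hours to find and remove one defect at each level (client figures above).
EFFORT_HOURS = {
    "unit testing": 3,
    "system testing": 18,
    "system integration testing": 37,
    "production": 64,
}


def hours_saved_per_defect(found_in: str = "unit testing") -> dict:
    """Hours saved per defect by finding it at `found_in` instead of a later level."""
    base = EFFORT_HOURS[found_in]
    return {level: hours - base for level, hours in EFFORT_HOURS.items() if hours > base}


savings = hours_saved_per_defect()
for level, hours in savings.items():
    print(f"versus {level}: {hours} hours saved per defect")

# Hypothetical: 40 defects found in unit testing during a release.
UNIT_TEST_DEFECTS = 40
low = UNIT_TEST_DEFECTS * min(savings.values())
high = UNIT_TEST_DEFECTS * max(savings.values())
print(f"estimated effort saved: between {low} and {high} person-hours")
```

The low and high totals bound the estimate, since we cannot know at which later level each defect would otherwise have been found.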

Notice that this approach allows us to avoid the “double blind study” approach that you mentioned earlier.  We don’t have to find two almost-identical projects, where the only difference is whether unit testing happened or not.  I expect that would prove impossible, so it’s good that we don’t need to do it.

Now, as you can see, we can measure part of the value delivered by unit testing, in terms of avoided downstream costs of failure.  (You might want to read more about cost of quality, which is the technique used above; check out my article here.)  However, I also agree with your earlier comments about unit testing having other objectives, such as easing the maintenance of code, and reducing the risk of breaking something when we do, as well as helping developers communicate about the code.  (That last benefit is especially applicable if you follow step 3 mentioned above.)  It is possible to develop metrics that allow you to measure these benefits as well, but, if the main objective of measuring the value of unit testing is to convince managers to continue to invest in it, then the “effort saved” metric mentioned above should be sufficient.

Some Questions, and Answers, on Advanced Software Testing Topics

I received some questions from a reader and webinar attendee, which I’ll answer below:

Hi Rex,

I am Jason, and a few days ago I attended the webinar on Reviews. As promised, I have collated a list of questions based on the Study Preparation Guide for CTAL-TM that I purchased from RBCS recently. Please do not hesitate to contact me should you require further information from me. Below are the questions and my comments:

 71. An organization follows a requirements-based test strategy for most of its projects. Which of the following is the best example of modifying the test approach for a project based on an understanding of risks?

A. Past performance issues lead to an increased effort on performance testing

B. Test estimation is based on the number of pages in the requirement specification.

C. Test execution is outsourced to a testing company based on a low-cost bid.

D. Unit test effort is limited to ensure early commencement of system test execution.

I did not really understand this question very well. How are performance issues in the past related to risk?

Jason, A is the right answer because we are using past defect information (in this case, performance defects) to assess the likelihood of particular types of problems, and likelihood is one of the factors that determines the level of risk.

84. You are managing a test effort that uses entirely reactive techniques, including a list of past bugs found in the field, a checklist of typical bugs for products using this technology, and exploratory testing based on tester experience. No written tests are developed prior to test execution, other than the list of bugs. Consider the following statements:

i Immune to the pesticide paradox

ii Repeatable for regression and confirmation testing

iii Useful in preventing bugs during system design

iv Cheap to maintain

v Makes no assumptions about tester skills

Which of the following is true about this particular testing strategy?

A. I and III are benefits of this strategy.

B. II and V are benefits of this strategy.

C. I and V are benefits of this strategy.

D. I and IV are benefits of this strategy.

Argument: I should not be part of the answer. “Exploratory testing based on tester experience” allows different combinations of test cases to be tested. If that is the case, why is the strategy immune to the pesticide paradox? To my understanding, “immune to the pesticide paradox” means that the strategy has no effect on the pesticide paradox.

Since the strategy is based on a list of past bugs found in the field and a checklist, I would consider II a benefit, as testers can go through regression testing based on these checklists.

The Foundation and Advanced syllabi are clear on the fact that detailed test cases–which do not exist here–are useful if repeating tests precisely for regression and confirmation testing purposes is needed.  So, II cannot be a benefit. 

The Foundation syllabus, in the section about general testing principles, says that the only way to overcome the pesticide paradox is to run different tests rather than repeating the exact same tests. Because all tests would be subtly (or even dramatically) different if repeated, especially if testers with different experience are used for subsequent executions, this strategy is immune to the pesticide paradox, so I is a benefit.

125. You manage a test team for a bank. Your test team uses two primary test strategies, checklist-based and dynamic. As its checklist, your team has a list of main areas in which the test team, in-house users, or bank customers have reported defects on past releases. For the dynamic testing, it employs members of the test team with experience in the bank branches and back-office to do exploratory testing.

Based on this information alone, which of the following is an improvement that you would expect from a STEP assessment?

A. Get involved earlier in the lifecycle

B. Analyse requirements specifications

C. Run only scripted tests.

D. Improve the office environment.

The answer given in the guide is B, Analyse requirements specifications. Could you please elaborate on the reason for this answer?

Yes.  The reason is that, as mentioned in the Advanced syllabus, STEP is based on an assumption that requirements analysis and requirements-based testing will occur.

Looking forward to hearing from you soon. Thanks.

I hope this is helpful.



 