Rex Black Consulting Services | software testing experts providing consulting, outsourcing, and software training

Archive for the ‘software test metrics’ Category

Effective, Efficient, and Elegant Software Testing

Reader Patricia Osorio asked another interesting question:

Hi Rex

In chapter 8 [of the Advanced syllabus], I was reading about Standards and Test process improvements. There you talk about Efficiency and Effectiveness. I have one definition about these terms, but what do I have in mind about these terms in order to be prepared for the advanced exam?

Thank you

Regards

Patricia Osorio Aristizabal

To answer, I’ll quote my explanation of the difference from Chapter 2 of Beautiful Testing:

Each stakeholder has a set of objectives and expectations related to testing.  They want these carried out effectively, efficiently, and elegantly.  What does that mean?

Effectiveness means satisfying these objectives and expectations.  Unfortunately, the objectives and expectations are not always clearly defined or articulated. So, to achieve effectiveness, testers must work with the stakeholder groups to determine their objectives and expectations.  We often see a wide range of objectives and expectations held by stakeholders for testers.  Sometimes stakeholders have unrealistic objectives and expectations.  You must know what people expect from you, and resolve unrealistic expectations, to achieve beautiful testing.

Efficiency means satisfying objectives and expectations in a way that maximizes the value received for the resources invested.  Different stakeholders have different views on invested resources, which might not include money.  For example, a business executive will often consider a corporate jet an efficient way to travel, because it maximizes her productive time and convenience.  A vacationing family will often choose out-of-the-way airports and circuitous routings, because it maximizes the money available to spend on the vacation itself.  You must find a way to maximize value—as defined by your stakeholders—within your resource constraints to achieve beautiful testing.

Elegance means achieving effectiveness and efficiency in a graceful, well-executed fashion.  You and your work should impress the stakeholders as fitting well with the overall project.  You should never appear surprised—or worse yet dumbfounded—by circumstances that stakeholders consider foreseeable.  Elegant testers exhibit what Ernest Hemingway called “grace under pressure,” and there’s certainly plenty of pressure involved in testing. You and your work should resonate as professional, experienced, and competent.  To achieve beautiful testing, you cannot simply create a superficial appearance of elegance—that is a con man’s job—but rather you prove yourself elegant over time in results, behavior, and demeanor.

In an upcoming newsletter, I’ll include this chapter as the featured article.

Measuring Software Test Processes (and Software Testers)

On the heels of the webinar last week, listeners have had lots of comments (all good) and some questions.  Here’s an interesting set of questions from listener Stephen Ho. I’ve interspersed my answers in his e-mail, with “RB:” in front to make it easier to follow:

Rex,

Thanks for such Webinar. This webinar did not talk about how to organize and build up a good testing Metrics.

RB:  There was a general discussion early on about how to go from an objective to a specific metric and specific targets for that metric.  Perhaps you were missing the part about implementing the metrics with specific tools? 

However, it provided some interesting points to measure the successful of a testing project, such as: BFE. Just for more realistic, how can we know that a testing metrics is good?

RB: One attribute of a good metric is that it is traceable back to some specific objective. That objective should relate to a process (e.g., finding defects is an objective for the test process), to a project (e.g., reaching 100% completion of all scheduled tests is an objective for many projects), or to a product (e.g., reaching 100% coverage of requirements with passing tests is an objective for some products).  Another important attribute is that the metric supports smart decision-making and, if necessary, guides corrective action.  Yet another important attribute is that the metric has a realistic target.

At here, I have another topic for your interest.

“What is effective & efficiency testing?”

RB:  We have to be more specific than this.  What are the objectives for testing, as you mean it here?  Once you have defined those objectives, you can then discuss effectively and efficiently meeting them.  For example, if you define finding defects as an objective, then you can use the DDP metric (discussed in the presentation) as a metric of effectiveness.  Cost of quality (which is discussed in various articles in the RBCS web site, such as this one) can serve as a metric of efficiency.

-how to narrow it down to know that our existing testing job is in effective & efficiency ways.

RB:  You might want to read the chapter I wrote (Chapter 2) on this topic in the book Beautiful Testing.  That’s a book worth reading, anyway, because there are a number of other good chapters in it.

-What are the right way to measure the performance of a QA?

RB:  I assume you’re talking about an individual tester here. If we can define specific objectives for the tester, then we can use the same method to define metrics.  Keep in mind the rule about objectives needing to be SMART.

-How can we know that a QA is in competence level?

RB: Check out my book, Managing the Testing Process, 3e, for a discussion about how to use skills inventories to manage the skills of your test team.

-How to increase the productivity of a QA?

RB: This question is too general, I’m afraid.  Productive at what?  I suggest that you define specific objectives for the test team, and then measure the current efficiency with which those objectives are achieved.  At that point, you can make realistic (and measurable) goals for improvement of productivity.

You may have existing webinar or article regarding this topic. If yes, I am thirsty to study your material. Would you direct me how to access to this information. I would definitely provide my feedback to you.

RB:  Follow this link for another article on metrics that you might find useful.

Thanks,

Stephen

RB: You’re welcome.

Quantifying Testing Effectiveness with the Defect Detection Percentage

After the test metrics webinar held yesterday (a link to the recorded webinar is coming soon in the Digital Library), an attendee asked a good question by e-mail (info@rbcs-us.com).  Linda Li wrote to ask,

Hello Rex,

 I just attended your free webinar about test metrics, you mentioned :

DDP=Bugs Detected/Bugs Present.

 So I want to know how can I get ‘Bugs Present’?  what’s included in Bugs Present?  Thank you very much.

You delivered a great presentation, that is really help me much.

Thanks for the kind words about the presentation, Linda. I do hope it provides useful ideas. 

This metric, which is variously called defect detection percentage (DDP) or defect detection effectiveness (DDE), is mathematically defined as Linda mentions:

DDP = bugs found/bugs present

When we’re talking about testing at the end of the software development or maintenance process, we can say that:

bugs present = bugs found by testing + bugs subsequently found in production

So, to calculate DDP for testing, use this formula:

DDP = bugs found by testing/(bugs found by testing + bugs subsequently found in production)

In this equation, test bugs are the unique, true bugs found by the test team. This number excludes duplicates, non-problems, test and tester errors, and other spurious bug reports, but includes any bugs found but not fixed due to deferral or other management prioritization decisions. Production bugs are the unique, true bugs found by users or customers after release that were reported to technical support and for which a fix was released; in other words, bugs that represented real quality problems. Again, this number excludes duplicates, non-problems, customer, user, and configuration errors, and other spurious field problem reports, and excludes any bugs found but not fixed due to deferral or other management prioritization decisions. In this case, excluding duplicates means that production bugs do not include bugs found by users or customers that were previously detected by testers, developers, or other prerelease activities but were deferred or otherwise deprioritized, because that would be double-counting the bug. In other words, there is only one bug, no matter how many times it is found and reported.
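The counting rules above can be illustrated with a minimal Python sketch. The function name `ddp` and the idea of matching bugs across the two data sources by a shared ID are assumptions made for this example, not features of any particular bug tracking tool. The sketch also assumes that spurious reports, and production bugs for which no fix was released, have already been filtered out.

```python
def ddp(test_bug_ids, production_bug_ids):
    """Defect detection percentage for the final stage of testing.

    Both arguments are iterables of IDs for true bugs only. The IDs are
    hypothetical keys that identify the same underlying bug in the test
    bug tracker and in the production (support) data.
    """
    # Sets collapse duplicate reports: one bug, however often it is reported.
    test_bugs = set(test_bug_ids)
    # A bug already found in testing (even if deferred) is not counted
    # again when a customer reports it from production.
    production_bugs = set(production_bug_ids) - test_bugs
    present = len(test_bugs) + len(production_bugs)
    if present == 0:
        raise ValueError("no bugs recorded, so DDP is undefined")
    return len(test_bugs) / present


# Illustrative data only: B2 is reported twice in testing, and B3 was
# found in testing, deferred, and later reported again from production.
example = ddp(["B1", "B2", "B2", "B3", "B4", "B5"], ["B3", "P1", "P2"])
print(f"DDP = {example:.0%}")
```

Here five unique bugs come from testing and two new ones from production, so the sketch evaluates 5/(5 + 2).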

To calculate this metric, you need to have a bug tracking system for all the bugs found in testing (and for those using Agile methods, yes, I do mean bugs found during the sprints even if fixed during the sprints).  You also need a way to track bugs found in production. Most help-desk or technical-support organizations have such data, so it’s usually just a matter of figuring out how to sort and collate the information from the two (often) distinct databases. You also have to decide on a time window. That depends on how long it takes for your customers or users to find 80 percent or so of the bugs they will find over the entire post-release life cycle of the system. For consumer electronics, for example, the rate of customer encounters with new bugs (unrelated to new releases, patches, and so forth) in a release tends to fall off very close to zero after the first three to six months. Therefore, if you perform the calculation at three months and adjust upward by some historical factor (say, 10 to 20 percent), you should have a fairly accurate estimate of production bugs, and furthermore one for which you can, based on your historical metrics, predict the statistical accuracy if need be.
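The upward adjustment described above might be sketched like this. The function names, the default uplift of 15 percent, and the sample counts are all assumptions chosen for illustration; the real factor should come from your own historical data.

```python
def estimated_production_bugs(found_in_window, uplift=0.15):
    """Scale the production bugs seen within the time window by a
    historical uplift factor (0.15 means 'add 15 percent')."""
    if found_in_window < 0 or uplift < 0:
        raise ValueError("counts and uplift must be non-negative")
    return found_in_window * (1 + uplift)


def estimated_ddp(test_bugs, production_found_in_window, uplift=0.15):
    """DDP computed with the adjusted estimate of production bugs."""
    production = estimated_production_bugs(production_found_in_window, uplift)
    present = test_bugs + production
    if present == 0:
        raise ValueError("no bugs recorded, so DDP is undefined")
    return test_bugs / present


# Illustrative numbers only: 180 test bugs, 20 production bugs reported
# in the first three months after release.
print(f"Estimated DDP = {estimated_ddp(180, 20):.1%}")
```

Running the same calculation with the low and high ends of your historical range (for example, `uplift=0.10` and `uplift=0.20`) gives a quick sense of how sensitive the result is to that factor.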

Note that this is a measure of the test process’s effectiveness as a bug-finding filter. Finding bugs and giving the project team an opportunity to fix them before release is typically one of the major objectives for a test team, as I mentioned in my presentation.

Advanced Software Testing: Bug Isolation

Reader Gianni Pucciani has a good question about a question in the Advanced Software Testing: Volume 2 book:

I have another doubt for a question in Advanced Software Testing Vol. 2. It is about the first question in Chapter 7, Incident Management. The book says that the correct answer is C “Insufficient Isolation”. What does it mean? I had chosen B “Inadequate classification information”, because all the rest was not making sense to me. For B, I could justify it saying that more information could be added to the incident report, e.g the error message displayed by the application.

Here is the question from the book:

Assume you are a test manager working on a project to create a programmable thermostat for home use to control central heating, ventilation, and air conditioning (HVAC) systems. In addition to the normal HVAC control functions, the thermostat also has the ability to download data to a browser-based application that runs on PCs for further analysis.

During quality risk analysis, you identify compatibility problems between the browser-based application and the different PC configurations that can host that application as a quality risk item with a high level of likelihood.

Your test team is currently executing compatibility tests. Consider the following excerpt from the failure description of a compatibility bug report:

1. Connect the thermostat to a Windows Vista PC.

2. Start the thermostat analysis application on the PC. Application starts normally and recognizes connected thermostat.

3. Attempt to download the data from the thermostat.

4. Data does not download.

5. Attempt to download the data three times. Data will not download.

Based on this information alone, which of the following is a problem that exists with this bug report?

A. Lack of structured testing

B. Inadequate classification information

C. Insufficient isolation

D. Poorly documented steps to reproduce

The answer is “C” because we don’t see any evidence of the tester trying some different scenarios to see if the data downloads properly.  The testing is clearly well-structured and carefully thought out, and the steps to reproduce are well-described.  The classifications are not given, so we have no way of saying, based on this information alone, whether those classifications are correct.

Calculating Defect Detection Effectiveness

Reader Gianni Pucciani has another good question in his review of the Advanced Software Testing: Volume 2 book.  He asks:

My doubt is on question 4 of chapter 10 (People Skills and Team Composition).

The correct answer is B, and I had chosen B by excluding all the others which were for sure wrong.

However, my question is: how do you know that your team found 90% of defects by the time you need to give bonuses?

You know for sure the number of defects found prior to release, but how do you know the total number of defects if not after an agreed period (1 year?) of production use?

How would you implement this approach in a real life situation?

Here’s the question from the book:

You are a test manager in charge of system testing on a project to update a cruise-control module for a new model of a car. The goal of the cruise-control software update is to make the car more fuel efficient. Assume that management has granted you the time, people, and resources required for your test effort, based on your estimate. Which of the following is an example of a motivational technique for testers that will work properly and is based on the concept of adequate rewards as discussed in the Advanced syllabus?

A. Bonuses for the test team based on improving fuel efficiency by 20% or more

B. Bonuses for the test team based on detecting 90% of defects prior to release

C. Bonuses for individual testers based on finding the largest number of defects

D. Criticism of individual testers at team meetings when someone makes a mistake 

Gianni is of course right, the answer is B.  He is also right that there is some lag time after release required to calculate the defect detection effectiveness.  Defect detection effectiveness is calculated as

DDE = (defects detected)/(defects present).

In the case of the final stage of testing, you can calculate this as

DDE = (defects detected in testing)/(defects detected in testing + defects detected in production).

The bottom side of that equation (the denominator) is a reasonably good approximation for “defects present” if you wait long enough.

So, how long is “long enough”?  Most of our clients find that they can determine the typical period of time in which 90% of the defects will be reported on a given release, usually through analysis of the field failure information.  In some organizations, this is as short as 30 days, though 90 days seems a more typical number.
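One way to do that analysis of field failure information is sketched below in Python. It assumes you can export, for a past release, the number of days after release on which each unique production defect was first reported; the function name and the sample data are hypothetical.

```python
def days_to_reach_percent(report_days, percent=90):
    """Smallest number of days after release by which `percent` of the
    field-reported defects for a past release had been reported.

    report_days: days after release on which each unique production
    defect was first reported (one entry per defect).
    """
    if not report_days:
        raise ValueError("need at least one reported defect")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(report_days)
    # Integer ceiling of percent/100 * count, avoiding float rounding.
    needed = (percent * len(ordered) + 99) // 100
    return ordered[needed - 1]


# Illustrative data for one past release (ten defects).
history = [2, 5, 9, 14, 21, 30, 41, 55, 80, 200]
print(days_to_reach_percent(history, percent=90))
```

Repeating this for several past releases shows whether the waiting period is stable enough to use as a standard window.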

Software Testing Evidence

I recently received an interesting e-mail from a colleague:

To Whom It May Concern-

Do you have any articles on the value of collecting/capturing detailed test evidence (e.g., screenshots attached to test cases)?

In my opinion, for mature systems with experienced, veteran testers, the need for an abundance of test evidence in the form of screenshots attached to test runs in QC is overkill and unnecessary that adds more time to release cycles. The justification for this is always “For Audit” as opposed to “Improves Quality”. I looked in several articles on this fantastic site, and couldn’t find anything pertaining to test evidence. Do you have any articles that provide evidence that an abundance of test evidence improves quality (even if it’s just a correlation and not necessarily causation)?

Thanks

Erik Tuininga

We have clients that do need to retain such detailed software testing evidence; e.g., clients working in safety critical systems (such as medical systems) who must satisfy outside regulators that all necessary tests have been run and have passed.  For them, retaining such evidence is a best practice, as not doing so can result in otherwise-worthy systems being barred from the market due to the lack of adequate paperwork. 

As someone who relies on such systems to work–indeed, as we all do–I appreciate these regulations and would not want to see software held to a lesser standard. However, Erik makes a very valid point in terms of the trade-off.  As time is spent on these audit-trail activities, that is time not spent doing other tasks that would perhaps result in a higher level of quality.  Of course, these audit-trail activities are designed to ensure that all critical quality risks are addressed.  So, the key question is how should organizations balance the risk of failing to test certain critical quality attributes against the reduction in breadth of quality risk coverage?

I’d be interested in hearing from other readers of this blog on their thoughts.  Erik, if you have further comments on this matter, I’m sure the readers of this blog would benefit from those ideas, as this is clearly an important area to consider.  I certainly agree it’s an interesting topic for an article, and this blog discussion may well inspire me to collaborate with you and other respondents to write one.

Metrics and Bonuses

One of the topics I find very interesting and useful for our clients is the proper use of metrics.  We do a lot of metrics-related engagements, and in fact just this morning I’ll be talking with a client about some US$ 100 million in defect-related waste that we’ve found in their software development process.  I’ve written a lot on the topic, including in my books and in various articles.

Regular blog reader Gianni Pucciani asks an interesting metrics question in an e-mail:

The question is: how can you give a bonus to your test team, to motivate it, based on 90% of bugs found before the release to production date? How can you know that you found 90% of the bugs at the time you release the software?

Gianni is referring to a metric called defect detection effectiveness or defect detection percentage.  This is a metric I’ve discussed quite a bit in my books, especially Managing the Testing Process.

Defect detection effectiveness is a very useful metric for measuring the effectiveness of a test process at defect detection.  Most testing processes have defect detection as a primary objective, and we certainly should have effectiveness and efficiency metrics for objectives.

That said, it is a retrospective metric that can only be calculated some time after a release, if you intend to calculate it on a release-by-release basis.  (Some of our clients calculate it on an annual basis, aggregating all their projects together, which also works.)  It’s typical to wait 90 days after a release to calculate defect detection effectiveness, though you really should verify what time period is required to have say 80% or more of the field-reported defects.
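For the annual variant, one plausible reading of "aggregating all their projects together" is to pool the defect counts across releases before dividing. The Python sketch below takes that interpretation as an assumption; the function name and the sample figures are illustrative only.

```python
def aggregate_dde(releases):
    """Pooled defect detection effectiveness across several releases.

    releases: iterable of (defects_found_in_testing,
    defects_found_in_production) pairs, one pair per release.
    """
    releases = list(releases)  # allow generators, since we iterate twice
    found_in_testing = sum(test for test, _ in releases)
    found_in_production = sum(prod for _, prod in releases)
    present = found_in_testing + found_in_production
    if present == 0:
        raise ValueError("no defects recorded, so DDE is undefined")
    return found_in_testing / present


# Illustrative figures for three releases in one year.
year = [(120, 10), (45, 15), (200, 10)]
print(f"Annual DDE = {aggregate_dde(year):.2%}")
```

Pooling weights each release by its defect count, so the result can differ from a simple average of the per-release percentages.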

I could go on for days about this metric, but, since it’s a blog and since Gianni asked a specific question, I’ll address the other point he brought up, which is the use of this metric for bonuses.  Defect detection effectiveness is a process metric, which is not the same as a metric of individual or collective performance.  Many things are required to enable good defect detection effectiveness, including good testers, and many things can reduce defect detection effectiveness, some of which are beyond the control of testers.  I’d encourage a web search on the string “Deming red bead black bead experiment” for a discussion on the risks of rewarding or punishing based on metrics that might not be entirely in the individuals’ control.

In addition, while defect detection is typically a primary objective of testing, it’s not the only objective, and defect detection effectiveness is only an effectiveness metric.  It doesn’t measure the efficiency or the elegance with which the test team detects defects.  A test process should have a fully articulated set of objectives, with effectiveness, efficiency, and (ideally) elegance metrics for each objective, rather than a single unidimensional metric by which it is measured.

For further information on defect detection effectiveness, I’d refer people to my book Managing the Testing Process, 3e.  My colleague Capers Jones also contributed an article to our web site on a couple of related defect metrics that readers might find interesting.

Cost of Poor Software Quality: $242,000,000

The Financial Times today featured an article on how a software bug–abysmally handled–in a financial application cost the company US$ 242,000,000:

http://www.ft.com/cms/s/0/5e1ba340-2feb-11e0-a7c6-00144feabdc0.html#axzz1D2IDiwLs

Because I don’t know how long that link will live, here’s the summary.

Axa Rosenberg Group had some quantitative analysis software that it used to service its clients’ accounts.  Axa Rosenberg Group manages money for other people, and the software is an internal application, albeit one they touted as a key differentiator, apparently–and indeed it did turn out to be, though not in a happy way.

The software had a bug that disabled a key risk-management component of the software, which was released to production in 2007.  Apparently management found out about the bug in November 2009.  However, rather than fix the problem, they tried to cover up the reasons for the poor performance of their funds.

Over one third of their customers were affected by the bug.

A wee bit of analysis from yours truly:  I have clients in the financial world, and I know how hard it can be to test these kinds of applications.  When a calculation is wrong, it can be wrong in a way that is beyond the ability of a human tester to detect.  However, Axa Rosenberg Group’s handling of the bug after they found out about it is truly a textbook illustration of how not to handle a software quality problem.

Risk Based Software Testing: Better Ways to Report Results

 As readers of this blog will know, I’ve spent a lot of time in this blog (and in articles, books, and consulting work) on the topic of risk based testing.  I recently spent some time in Japan, working with various teams in Sony to implement better risk based testing.  Part of that time was spent working with Atsushi Nagata, who is helping teams in Sony put risk based testing  into action.  Together, he and I have written an article describing some ground-breaking work that they’ve done on risk based test results reporting.  That article was just published in ST&QA magazine.  You can read it (and comment on it) by clicking here.

Why Is Software Quality So Unmeasured?

As regular readers of this blog know, from time to time I like to throw a topic out for discussion and see what comes of it.  It’s that time.

I have said (more than once) that most companies manage their holiday parties more rigorously and quantitatively than they manage the quality of their software.  That’s not just a throwaway snarky line: It’s a fact.  You can go to the treasurer, CFO, head accountant, whatever the moneybags person is called in any organization worth calling an organization and ask that person, “How much money did you spend on holiday parties last year?”  You’ll get an answer.  You can ask people at that same organization, “What benefits did you receive from those parties?”  You’ll get an answer (albeit probably not as quantitative).

Now, try this experiment: Ask Mr. or Ms. Moneybags, “How much money did your organization spend on software testing and software quality last year?”  While they might be able to answer the first half of that question (about testing), most organizations couldn’t answer the second half, even though a technique for getting a good approximation of the costs of software quality has been around a long time.  (For example, check out the free tool here and also the article here.)  You can ask them about the benefits received from testing and quality and get the same lack of solid answers most of the time. 

Given that people have been accounting for stuff for thousands of years (e.g., I saw a 4,000 year old receipt for donkeys and other sundry items, written in Sumerian cuneiform, in a museum in Japan last month), how to explain this lack of fiscal measurement, given the widely-acknowledged importance of software in the modern economy? 

Moving beyond cost, while some of our clients do have reasonably good metrics for product quality (what some call “quality in use metrics”), many companies do not.  Some companies that do have such metrics don’t tie those metrics back to what is getting tested.  We’ve seen situations where companies knew they had interoperability problems in their data centers and yet, when we asked people who was responsible for interoperability testing, the accountability trail went around in circles and ended up nowhere.  Same story for performance and reliability. 

So, why does this happen?  The cynic in me wants to say that this problem comes down to a lack of legal liability for quality and quality problems.  In other words, until organizations are held to the same legal standards for software quality as they would be for other products (e.g., food, cars, etc.), we will see this immature approach to managing and measuring quality.  But is it really that simple?  What do you think?  Comments, please.



 