Some of you who follow me on Twitter (@RBCS, by the way) may have seen the ongoing debate between myself on one side and what seems to be the entirety of the “context driven school of testing” on the other side. I’ve been saying that testing metrics, while imperfect, are useful when used properly. The other side seems to be saying…well, I’m not clear exactly what they are saying, and I’ll let them say it themselves on their own blogs. Suffice it to say they don’t like test metrics.
If you have missed the Twittersphere brouhaha and you want to get more details on what I think about metrics, you can listen to my recorded webinar on the topic. You can ask the other folks about what their objections are.
Once you do, I’d be interested in knowing your thoughts. Are test metrics useful to you? What problems have you had? What benefits have you received? What different situations have you used metrics in (e.g., Agile vs. waterfall, browser-based apps, level of risk, etc.), and how did that context affect the metrics? Let me know…
I received an interesting tweet from Shoshanah Gil, which then led to the following e-mail:
I wanted to respond to your Twitter reply regarding my first attempt at using PRAM (Twitter post Feb 18, 2013 6:58am) in more detail.
The real success of this testing effort will of course be measured once the software is out in the field…but this is the “small” success: I was assigned a project to coordinate testing involving 25 integrating applications. Testing needed to be coordinated between unit and production testers, who would be testing at the same point in the development pipeline. (The company is exploring new testing methodologies, so utilizing testing resources in different ways is part of this.) I needed to find a common ground from which to develop a test plan for this project. I had just seen your presentation on PRAM and brainstorming, so I thought focusing on the risks would bring testers together. From the software specs, I compiled a list of quality risk categories. I put these on the board, and as I passed out Post-it notes, I asked the newly-formed team of unit/production testers from each application to identify the test types they would use to mitigate the risks. One of the interesting outcomes was that one area of risk did not have as many Post-it notes as the others. We then needed to decide whether this was a risk area that did not need to be a priority, or whether it needed to be tested but not by all apps. As a result of this exercise, testers collaborated on the test plans for their specific applications, while simultaneously evaluating the effect this project had on the product as a whole.
How does this experience differ from how you might have coordinated the test planning for this type of project? Any suggestions for the future?
Thank you for your interest,
First, glad to hear that you’re having success with the PRAM technique.
One thing that struck me about your e-mail is that you only mentioned testers as participants in your risk analysis process. Testers do have an excellent perspective on product quality risk, but their perspective, like that of every stakeholder group, is incomplete.
As you continue to refine your use of the technique, be sure to include other business and technical stakeholders. You can find more details on the hows and whys of stakeholder involvement on the RBCS Digital Library.
I had an interesting question from a reader, Stas Milev:
I hope you are well. I wanted to ask you a question about test estimation. I am sure you have been asked many of these before but the one I have is not really about the estimation techniques themselves (such as usage of historical data, dev effort, etc)
There is one area of test estimates which is always arguable, hard to estimate and finally explain to sponsors no matter how well you are prepared. This is a test analysis and design task which is vague by definition. If we quickly decompose it into smaller pieces we would end up with the following simplified list of activities:
1. Analyse the test basis (if it exists).
2. Get ambiguities, inconsistencies and gaps in the test basis resolved.
3. Apply the test techniques to create the test cases.
While you can more or less quantify 1 and 3 in terms of effort (let’s assume we at least have something to work with in terms of test basis), the issue is obviously with 2, where we are dependent on many people (business analysts, system analysts, dev team, end users, etc.).
There are two obvious options we can choose from:
Option 1: Assume there will be no gaps or issues, or that they will be resolved immediately and all our questions will be answered without delay, and thus simply estimate 1 and 3. Of course, we will document these assumptions to highlight the risk should issues arise. The problem with this option is that we know straightaway that we will go over budget and will have to come back to the business sponsors and ask for more money. Nobody likes doing this, especially if we end up asking for an additional amount every single time our inadequate estimates deviate from reality. Moreover, on some projects the budget is fixed as soon as it has been confirmed.
Option 2: Estimate this slippage time, or the time spent on requirements clarification, based on previous experience. The issue here is that this is ‘unexplained’ effort, to the extent that it can’t be justified by a statement such as “but we know there will be issues or something will not be ready.” A pretty plausible scenario in this case would be: “Hey, we have just two requirements here. Why the hell does it take two weeks and not two days to create the tests for these two?”
To me, getting the right questions raised, asked and answered is a part of test analysis, and this activity is extremely important as it prevents defects. To a certain extent this is a very informal static testing or QA activity which needs to be built into the process, but nobody is willing to pay for it explicitly. On the other hand, ethics do not allow you to simply ignore problems and test that buggy software is buggy. In the latter case, I normally still try to squeeze in the static test and get decision makers to accept the risk that problems with the requirements may arise very late.
I wanted to hear your recommendations on test analysis and design effort estimates and on test effort negotiation with business sponsors and project managers. It would also be great to hear your comments on both options, or perhaps on an option 3 if it exists.
ISTQB Certified Advanced Test Manager (CTAL)
A good question. What I would suggest is that the estimation for activities 1, 2, and 3 should be based on historical data. So, if you know that you have some average number of test cases per identified quality risk, per specified requirement, per supported configuration, etc., you should be able to estimate activities 1 and 3 based on the average number of hours of effort per test case. For activity 2, once again, if you have historical data on the average number of defects typically found per page of test basis document, you should be able to estimate the number of defects you’ll find. If you know the average time from discovery to resolution of such defects, and the average amount of effort for each such defect, you can then estimate the delay and effort.
The metrics gathered about test basis defects could be used not only for estimation, but also for process improvement.
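To make the arithmetic concrete, here is a minimal sketch in Python. The class, field, and function names are made up for illustration, and treating the delay as "defects resolved one after another" is my own simplifying assumption, so read that figure as a pessimistic bound rather than a forecast.

```python
from dataclasses import dataclass


@dataclass
class History:
    """Historical averages from past projects (you supply your own data)."""
    tests_per_requirement: float  # average test cases per specified requirement
    hours_per_test: float         # average analysis and design hours per test case
    defects_per_page: float       # average test basis defects found per page
    hours_per_defect: float       # average effort to raise and follow up one defect
    days_to_resolve: float        # average days from discovery to resolution


def estimate(history: History, requirements: int, pages: int) -> dict:
    """Estimate activities 1 and 3 (design) and activity 2 (clarification)."""
    test_cases = requirements * history.tests_per_requirement
    design_hours = test_cases * history.hours_per_test
    expected_defects = pages * history.defects_per_page
    clarification_hours = expected_defects * history.hours_per_defect
    return {
        "test_cases": test_cases,
        "design_hours": design_hours,
        "expected_defects": expected_defects,
        "clarification_hours": clarification_hours,
        "total_hours": design_hours + clarification_hours,
        # Assumption: defects are resolved serially, so this is a worst case.
        "worst_case_delay_days": expected_defects * history.days_to_resolve,
    }
```

With invented averages of 5 tests per requirement, 2 hours per test, 0.5 defects per page, 3 hours per defect, and 2 days to resolve, calling estimate(History(5, 2, 0.5, 3, 2), requirements=2, pages=10) gives 20 design hours plus 15 clarification hours. That is the kind of breakdown that answers the "why two weeks for two requirements?" question with data rather than with "we know there will be issues."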
I had another interesting question from a reader that I’m getting to in the blog today:
I was reading some of your excellent webinars, especially about the defect metrics. So my question is: Is there any benchmark data about defect severity, and also about reopened defects?
I would really appreciate it if you could guide me to where I can find this data.
Note: the purpose of this data is to measure and compare our results against the benchmark.
Thanks for the question, Tomas. In terms of defect severity, unfortunately there are no standard, common definitions for severity in the industry yet. So, any metrics that showed aggregated data from multiple organizations would probably be rather questionable. That said, you might check out Capers Jones’ excellent book, The Economics of Software Quality, which, as I recall from reviewing it, had some interesting analysis here.
In terms of defect report re-open rates, our target during assessments is 5% or less. Some amount of defect report re-open seems inevitable, since test environments and data are generally more representative of production than are the developers’ environments. However, there is a significant risk of schedule delay, as well as significant inefficiency (due to an additional layer of rework), when the defect re-open rate gets too high.
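If you want to track this yourself, a small sketch in Python follows. It assumes the rate is defined as re-opened reports divided by closed reports, which is one common definition but not the only one, and the function names are made up for illustration.

```python
TARGET = 0.05  # the 5% assessment target mentioned above


def reopen_rate(reports_closed: int, reports_reopened: int) -> float:
    """Fraction of closed defect reports that were later re-opened (assumed definition)."""
    if reports_closed <= 0:
        raise ValueError("need at least one closed defect report")
    return reports_reopened / reports_closed


def within_target(reports_closed: int, reports_reopened: int,
                  target: float = TARGET) -> bool:
    """True when the re-open rate is at or below the target."""
    return reopen_rate(reports_closed, reports_reopened) <= target
```

For example, 8 re-opens out of 200 closed reports is a 4% rate and meets the target, while 30 out of 200 is 15% and does not.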
I received an interesting, detailed query from a reader of my new e-book on testing metrics:
My name is Marcus Milanez and I’m a software developer who lives in São Paulo, Brasil. I recently bought a copy of your “Testing Metrics” ebook and found it really valuable for improving my understanding on testing processes, as well as hearing more words on the importance that metrics play on my daily tasks. Thanks for sharing all your studies and results, I appreciate it.
However, I still have some questions that I still couldn’t get any reasonable answers – maybe they’ve been presented to me already, but my limited comprehension is failing to absorb them. In order to exercise these questions correctly, I would like to add some general contexts, just to make sure that I’m using my words properly.
Forgetting a little bit about all the nitty-gritty details that are observed during the creation or maintenance of a software system, in general, in a regular software development process we have:
1) Customers require features that must be properly implemented
2) Developers along with specification teams, stakeholders and other parties, define the scope of that feature and write a couple of functional/acceptance tests that validate the analysis of the customer request
3) In a planned sprint, developers implement the feature and get feedback from the specs team, stakeholders and customers, to make sure that everything is fine.
4) Quality team verifies the quality of any version delivered by developers. Bugs are filed and later fixed by developers in a sprint.
I clearly understand that the list above is incomplete, could be better and is definitely subject to change, but I’m just trying to give an overall context. Now, by reading your book, I could clearly see all the value that metrics and proper testing strategies add to a software project, and I’m glad I was introduced to things like DDE and all those sample charts; that information is really gold. While reading the book, though, I tried hard to use the same logic that you presented in your DDE formula to somehow measure the value of unit tests created by developers during development time. The thing is that I couldn’t reach any conclusion, because it looks like the value added by unit tests can’t be measured in the same way that you presented in your book. My reasons for that are:
1) It looks like unit tests are not making verifications in the same sense that exploratory or automated tests are; rather, we regularly use them to get something built according to a set of premises, and later on, this set of premises can be verified.
2) Developers don’t usually count, or indicate, bugs found during development time. I don’t even think that this would make any sense.
3) They are not testing a use case, they are testing smaller pieces of it. Although these smaller pieces can directly affect a use case, looks like these observations are quite different from those observed by an automated functional/acceptance test.
So, my question is: can we measure the value/cost benefit added by unit tests, using metrics that indicate that they are definitely effective in a software project, or is the value of unit tests not related at all to the number of bugs filed by a validation team, for example? Can the value added (or not) by unit tests be measured, qualified and improved? Is it possible to observe unit tests, alone, from a cost perspective?
I exercised this question on my own, and so far the following seem to be valid for me:
It looks like the main intention of unit tests is not to find bugs. Unit tests seem to be closer to a development tool (like an IDE is) than a tool for finding problems.
Since unit tests are part of the development effort, it is difficult to say how many hours have been spent writing unit tests and how many writing production code.
Although a project with high code coverage by unit tests tends to have fewer bugs filed, unit tests are not a guarantee of code free of use case bugs. Still thinking about the number of bugs, I believe that unit tests tend to prevent run time errors more than use case errors.
Apparently, the real value of unit test comes in terms of maintenance, ability of adding or removing features without fear, and mainly as a means of communication with other developers working on the same code base.
If these first items are valid, it is likely that unit tests will prevent the creation of a given number of bugs during the validation phase. However, in order to assert that, we would need to collect all the metrics related to a given project that doesn’t have unit tests (like DDE, cost per bug fix, time per bug fix and others), properly implement unit tests for this project, and then get the very same metrics again for comparison. Is this rationale correct? If so, can we confidently say that unit tests, alone, were responsible for getting these numbers changed? Are there any published studies that analyse these points?
Thank you so much for your attention and sorry for this long email. The thing is that I found your book extremely useful and I definitely want to improve my understanding of many aspects of my profession.
Thanks for your kind words about my book. I’m glad it’s proving interesting and useful to you, and that it provoked such a good and reflective set of questions.
Before I get into a detailed response, let me set up the background for what I’m about to write:
- First, I agree with your brief, general summary of the software process as it applies in Agile development, and I’ll refine it further in a moment. I would add, at this point, that in many organizations there is an additional test level that occurs, sometimes called “system integration test,” where the software produced in the Agile sprints is tested with other software which interoperates or cohabitates with it. This level can occur once after the last sprint, but it’s better if it happens iteratively as each sprint concludes.
- Second, since you mention sprints and since your description of the process seems very Agile-oriented, I assume that we are talking about unit tests in the context of Agile development.
- Third, I assume that, when you talk about unit tests, since you say that developing the unit tests is “part of development,” you are talking about unit tests that are used for test-driven development (whether formally or informally). That is, the unit tests are developed at the same time the code is being developed, and that coding, developing tests, and executing the tests proceeds in a tight loop until the unit is complete.
- Fourth, I assume, since you talk about unit tests allowing us to make changes without fear of regression, that you are talking about automated unit tests, ideally incorporated in a continuous integration framework such as Hudson.
With those assumptions in mind, here are my thoughts on the many good comments and points that you raise. Let me start by reviewing the development process:
- Test-driven development is a great idea. Indeed, I did that myself years ago when I was a programmer. However, it’s a misnomer to call that process “test driven.” The “tests” used in this process are actually more like executable low-level design specifications. I know you are already aware of this, because of your comment about these tests “not testing use cases.” What can be somewhat confusing about this is that the work product produced by test-driven development includes a set of automated unit tests that subsequently can be used for something properly referred to as testing, specifically regression testing. Since test-driven development, when it’s happening, is not really testing, the objective of the developer as he or she creates the code and the unit tests is not really to find defects. (Finding defects is one of the typical objectives of most test levels, including unit testing.) The main objective is to build confidence, as the coding proceeds, that the code being built works properly. So, if we want to measure the level of confidence that we should have, that is something better measured using coverage metrics. In this case, the most relevant coverage metrics would be code coverage and, in some cases, data flow coverage.
- Once the unit is completed, typically there is a missing step in most organizations. Ideally the developer would apply structured test design techniques to augment the unit tests in order to make them true unit tests, and to find any defects that remain in the completed unit. If developers were to do this systematically, the quality of code delivered for testing would dramatically improve. Capers Jones’ studies show that, typically, unit testing has a defect removal effectiveness of only 25%, but best practices in unit testing boost that to 50%.
- In fact, there is another missing step in many organizations at this point. The best practice would be for the programmer to take the code, the unit tests, and the unit test results to a code review. A study by Motorola showed that, to maximize effectiveness, this review should include at least two people other than the author, so neither pair programming nor simply having a lead developer review the code is sufficient. The unit tests might be augmented based on this review, and, if so, re-run.
- At this point in the process, the unit tests created in step 1 (and ideally steps 2 and 3) of the process are incorporated into a continuous integration framework, as mentioned earlier, and serve as ongoing guards against regression of the unit. If the unit is refactored, the tests may have to be updated, as you mentioned.
So, let’s get back to metrics and measuring the value of unit testing, starting with not logging defects found by developers. This is a mistake. I agree that we don’t need to log failures that occur during step 1 above, because, as I said, we’re not really testing, we are developing the software. However, once we get into steps 2, 3, and 4, I believe those defects should be logged. Often when I say this, people have a horrified reaction and say, “Oh, no no no, we can’t have programmers interrupting themselves and breaking their flow to log defects in some cumbersome bug tracking tool.”
To which I reply: “I didn’t say they had to use a bug tracking tool. I said they should log the defects.” Here’s the distinction. I agree that there’s no need for a tracking tool here, because the bug is going to be immediately removed. Its lifecycle will start and end on the same day, and so all the state-based stuff that the tracking tool does is unnecessary. But we do need to log information about the defect, especially classification information, phase of introduction, etc., and we should use the same classifications as are used during formal testing. This information can be captured in a simple spreadsheet. That way, we can do analysis of the defects found during unit testing (step 2), code reviews (step 3), and unit regression testing (step 4).
What can this analysis do for us? A number of things. First, it can help developers and the broader organization learn how to write better code. Patterns in types of defects found during steps 2, 3, and 4 can show that training and mentoring is needed to reduce bad coding practices or habits that some developers might have.
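To show how lightweight such a log can be, here is a rough sketch in Python that appends rows to a CSV file (which any spreadsheet can open) and tallies them. The column names and the example classifications are invented for illustration; in practice you would reuse whatever classifications your formal testing already uses.

```python
import csv
import io
from collections import Counter

# Hypothetical columns; reuse your formal-testing classifications in practice.
FIELDS = ["unit", "found_in", "classification", "phase_introduced"]


def log_defect(stream, unit, found_in, classification, phase_introduced):
    """Append one defect as a CSV row. No states, no workflow, just the facts."""
    csv.writer(stream).writerow([unit, found_in, classification, phase_introduced])


def count_by(stream, field):
    """Tally logged defects by one column, e.g. to spot recurring defect types."""
    stream.seek(0)
    reader = csv.DictReader(stream, fieldnames=FIELDS)
    return Counter(row[field] for row in reader)


# An in-memory stream stands in for the spreadsheet file in this sketch.
log = io.StringIO()
log_defect(log, "parser", "unit test", "boundary", "coding")
log_defect(log, "parser", "code review", "boundary", "coding")
log_defect(log, "billing", "unit regression test", "null handling", "design")
print(count_by(log, "classification").most_common())
```

A tally like this, run every few sprints, is enough to show whether one type of defect keeps turning up.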
Second, to address your main question, it can allow us to directly measure the value of unit tests. To do so, we need to do one more thing, which is to estimate the effort associated with unit testing and unit test defect removal as well as estimating the effort associated with other levels of testing (e.g., acceptance testing, system integration testing, etc.), removal of defects found in those levels of testing, and the effort associated with failures in production. This is actually easier than you might think, and sufficiently accurate numbers can be obtained by surveying the people involved.
The differences are often quite dramatic. One client found that a defect found and removed in unit testing took three person-hours of effort, while a defect found and removed in system testing took 18 hours, a defect found and removed in system integration testing took 37 hours, and a defect found and removed in production took 64 hours. So, for this client, each defect found in unit testing saved at least 15 hours of effort, and possibly as much as 61 hours of effort!
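The arithmetic behind those savings can be sketched in a few lines of Python. The hours are the figures quoted above for that one client; the function names are made up for illustration, and your own survey numbers would replace the table.

```python
# Average person-hours to find and remove one defect, by where it was found
# (the figures quoted above for one client).
HOURS_PER_DEFECT = {
    "unit testing": 3,
    "system testing": 18,
    "system integration testing": 37,
    "production": 64,
}


def hours_saved_per_defect(found_in="unit testing", costs=HOURS_PER_DEFECT):
    """Hours saved by removing a defect at `found_in` instead of each costlier level."""
    base = costs[found_in]
    return {level: hours - base for level, hours in costs.items() if hours > base}


def savings_range(defects_found, found_in="unit testing", costs=HOURS_PER_DEFECT):
    """(minimum, maximum) person-hours saved for a given number of defects."""
    per_defect = hours_saved_per_defect(found_in, costs)
    if not per_defect:  # nothing is costlier than the level we found them in
        return (0, 0)
    return (defects_found * min(per_defect.values()),
            defects_found * max(per_defect.values()))
```

For a hypothetical 100 defects caught in unit testing, savings_range(100) returns (1500, 6100), that is, between 1,500 and 6,100 person-hours of avoided downstream effort.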
Notice that this approach allows us to avoid the “double blind study” approach that you mentioned earlier. We don’t have to find two almost-identical projects, where the only difference is whether unit testing happened or not. I expect that would prove impossible, so it’s good that we don’t need to do it.
Now, as you can see, we can measure part of the value delivered by unit testing, in terms of avoided downstream costs of failure. (You might want to read more about cost of quality, which is the technique used above; check out my article here.) However, I also agree with your earlier comments about unit testing having other objectives, such as easing the maintenance of code, and reducing the risk of breaking something when we do, as well as helping developers communicate about the code. (That last benefit is especially applicable if you follow step 3 mentioned above.) It is possible to develop metrics that allow you to measure these benefits as well, but, if the main objective of measuring the value of unit testing is to convince managers to continue to invest in it, then the “effort saved” metric mentioned above should be sufficient.
I received some questions from a reader and webinar attendee, which I’ll answer below:
I am Jason, and a few days ago I attended the webinar on reviews. As promised, I have collated a list of questions based on the Study Preparation Guide for CTAL-TM that I purchased from RBCS recently. Please do not hesitate to contact me should you require further information from me. Below are the questions and my comments:
71. An organization follows a requirements-based test strategy for most of its projects. Which of the following is the best example of modifying the test approach for a project based on an understanding of risks?
A. Past performance issues lead to an increased effort on performance testing
B. Test estimation is based on the number of pages in the requirement specification.
C. Test execution is outsourced to a testing company based on a low-cost bid.
D. Unit test effort is limited to ensure early commencement of system test execution.
I did not really understand this question very well. How are performance issues in the past related to risk?
Jason, A is the right answer because we are using past defect information (in this case, performance defects) to assess the likelihood of particular types of problems, and likelihood is one of the factors that determine the level of risk.
84. You are managing a test effort that uses entirely reactive techniques, including a list of past bugs found in the field, a checklist of typical bugs for products using this technology, and exploratory testing based on tester experience. No written tests are developed prior to test execution, other than the list of bugs. Consider the following statements:
i Immune to the pesticide paradox
ii Repeatable for regression and confirmation testing
iii Useful in preventing bugs during system design
iv Cheap to maintain
v Makes no assumptions about tester skills
Which of the following is true about this particular testing strategy?
A. I and III are benefits of this strategy.
B. II and V are benefits of this strategy.
C. I and V are benefits of this strategy.
D. I and IV are benefits of this strategy.
Argument: I should not be part of the answer. “Exploratory testing based on tester experience” allows different combinations of test cases to be tested. If that is the case, why is it immune to the pesticide paradox? To my understanding, “immune to the pesticide paradox” means that it is not affected by the pesticide paradox.
Since it is based on a list of past bugs found in the field and a checklist, I would consider II a benefit, as the tester can run regression tests based on these checklists.
The Foundation and Advanced syllabi are clear on the fact that detailed test cases–which do not exist here–are useful if repeating tests precisely for regression and confirmation testing purposes is needed. So, II cannot be a benefit.
The Foundation syllabus, in the section about general testing principles, says that the only way to overcome the pesticide paradox is to run different tests rather than repeating the exact same tests. With this strategy, all tests would be subtly (or even dramatically) different if repeated, especially if testers with different experience are used for subsequent executions, so the strategy is immune to the pesticide paradox and I is a benefit.
125. You manage a test team for a bank. Your test team uses two primary test strategies, checklist-based and dynamic. As its checklist, your team has a list of main areas in which the test team, in-house users, or bank customers have reported defects on past releases. For the dynamic testing, it employs members of the test team with experience in the bank branches and back-office to do exploratory testing.
Based on this information alone, which of the following is an improvement that you would expect from a STEP assessment?
A. Get involved earlier in the lifecycle
B. Analyse requirements specifications
C. Run only scripted tests.
D. Improve the office environment.
The answer given in the guide is B, “Analyse requirements specifications.” Could you please elaborate on the reason for this answer?
Yes. The reason is that, as mentioned in the Advanced syllabus, STEP is based on an assumption that requirements analysis and requirements-based testing will occur.
Looking forward to hearing from you soon. Thanks.
I hope this is helpful.
Reader Steven Moore wrote to ask the following:
I have been asked to document the benefits of early engagement of QA in the SDLC and am looking for any qualitative or quantitative information, articles, papers, or industry experts I can reference. Any advice or guidance you could provide would be greatly appreciated. Thanks.
Steven, I’d recommend that you pick up a copy of Capers Jones’ recent book, The Economics of Software Quality. This is an excellent resource for what you’re trying to accomplish. Anecdotally, I can mention that we did a study for a client where we found that the average cost of removing a defect in reviews was $37, while the cost to remove a defect in system testing was $3,700. Due to those relative costs, and the number of defects involved that escaped to system testing, this client was losing between $100,000,000 and $250,000,000, on a $1,000,000,000 annual IT budget.
My friend and colleague from Colombia, Patricia Osorio, asks the following question:
Could you please help me with the following question? Is it right to say that the order of reviews according to their formality, from most formal to least formal, is: inspection, walkthrough, technical review and informal review? When I read the Foundation Level syllabus to sit the exam (2005), it was very clear. Now, when I reread the Foundation Level syllabus in its 2010 version, it seems to me that it has changed to inspection, technical review, walkthrough and informal review.
Patricia Osorio Aristizabal
Patricia, the latest version of the Foundation syllabus is 2011, but I think the text is much the same as the 2010 version. I have checked some of the previous versions of the syllabus, and can’t find any explicit mention of the spectrum of formality.
I believe this idea of the spectrum of formality (from least to most formal: informal review, technical review, walkthrough, inspection) comes from IEEE 1028. In that standard, metrics and review-based process improvement are not specified for the technical review, but they are for walkthroughs and inspections. So, this makes the technical review less formal. The inspection is more formal than a walkthrough due to the separation of moderator and author.
This leads to an interesting question: Does this spectrum of formality actually matter in the real world? In my experience, it really doesn’t. Here’s why I say that. Most companies don’t do reviews, or at least don’t do them anywhere near as often and as thoroughly as they should. So, when I’m talking to a client about doing reviews, I don’t get into the issue of what level of formality. Instead, my focus is on motivating them to start doing more reviews and doing them better. I can’t remember working with any clients where the main problem they had with reviews was that they weren’t using the right level of formality.
The other reason this doesn’t matter much is because of the naming issue. People use the terms “review,” “technical review,” “inspection,” “JAD session,” “walkthrough,” and more, and whenever I hear those terms I ask people to tell me, specifically, what such an event is, who is involved, what the process is, and–if people mention more than one type of review–what the differences are. I very rarely get clear answers to those questions, which tells me that any particular session where people sit down to discuss a work product could have any one of a dozen or so different names attached to it.
Personally I don’t see this as a problem to worry about. I’m more worried about whether my clients are doing reviews, regularly and with good benefit, than with what they call it.
A reader, Stas Milev, sent me a question which may be interesting to readers of this blog:
First, let me thank you for the huge effort you put in writing the books and educating people on various testing topics – the materials are just excellent. I have been a test manager for 7 years already and still find a lot of useful things.
Thanks very much, I’m glad they are useful to you.
I hope you don’t mind if I ask you a question related to the weighted failure metric you mentioned in your Advanced Test Manager book. You did not focus too much on this in your book and just mentioned that it measures technical risk and the likelihood of finding problems. As such, can you please expand a bit more on how to analyse this metric, and on its value and meaning?
The weighted failure can be calculated both on a per-test basis and a per-test suite basis. (A test suite is a logical collection of test cases, such as a functional test suite, a performance test suite, etc.) The weighted failure counts the number of bugs found (either by test or across all the tests in the test suite), but each bug report is weighted based on the priority and severity of the bug. In other words, a test suite that finds a moderate number of high priority, high severity bugs will probably score higher than a test suite that finds a large number of low priority, low severity bugs.
Probably the best way to learn more about this metric is to download and experiment with an Excel test tracking spreadsheet with weighted failure included. Feel free to work with this one a bit, and I think the concept will make more sense.
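If you prefer code to spreadsheets, here is a minimal sketch of the idea in Python. The weighting is an assumption for illustration only: it takes severity and priority on a 1 (worst) to 5 (least) scale and weights each bug report by 1 / (severity × priority). The spreadsheet's own formula may differ, and the class and function names are made up.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    test_suite: str
    severity: int  # assumed scale: 1 = most severe ... 5 = least severe
    priority: int  # assumed scale: 1 = most urgent ... 5 = least urgent


def failure_weight(bug: BugReport) -> float:
    """Assumed weighting: the worse the bug, the closer the weight is to 1."""
    return 1 / (bug.severity * bug.priority)


def weighted_failure(bugs, test_suite: str) -> float:
    """Sum of the failure weights of all bug reports found by one test suite."""
    return sum(failure_weight(b) for b in bugs if b.test_suite == test_suite)


# Three severe, urgent bugs versus ten minor, low-priority ones (invented data).
bugs = ([BugReport("functional", 1, 1)] * 3
        + [BugReport("performance", 4, 4)] * 10)
print(weighted_failure(bugs, "functional"), weighted_failure(bugs, "performance"))
```

Under this assumed weighting, the functional suite scores 3.0 from three bugs while the performance suite scores 0.625 from ten, which is the effect described above: fewer but more serious bugs outweigh many trivial ones.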
ISTQB Certified Advanced Test Manager
You’re welcome. I hope this is useful.
I received a question from a reader of the blog:
I have a query with respect to the levels and types of testing and how they are carried out. Can the development team execute all the system integration test cases as part of their unit test cases? Later, the testing team will be re-executing the same test cases and will be left with no defects to report.
In general, Krishna, this would be neither possible nor desirable. It’s not possible because properly designed system integration testing will focus on interoperability and other emergent behaviors (e.g., performance, security, reliability) that do not manifest themselves or are indeed not even testable at a unit level. It’s not desirable because different levels of testing should focus on covering different things.
Basically, the test levels should function as a sequence of filters. If you wanted to filter water, you wouldn’t use five identical filters to do that, but rather would use a sequence of different filters, each designed to catch particular types of impurities. In the case of testing, each test level is designed to cover certain aspects of the system, to mitigate certain types of quality risks, and to catch certain types of bugs.