Archive for the ‘test coverage’ Category
I received an interesting, detailed query from a reader of my new e-book on testing metrics:
My name is Marcus Milanez and I’m a software developer living in São Paulo, Brazil. I recently bought a copy of your “Testing Metrics” ebook and found it really valuable for improving my understanding of testing processes, as well as for learning more about the importance that metrics play in my daily tasks. Thanks for sharing all your studies and results; I appreciate it.
However, I still have some questions to which I haven’t found any reasonable answers – maybe they’ve been answered already, and my limited comprehension is failing to absorb them. In order to pose these questions correctly, I would like to add some general context, just to make sure that I’m using my words properly.
Setting aside the nitty-gritty details observed during the creation or maintenance of a software system, in a typical software development process we have:
1) Customers require features that must be properly implemented
2) Developers along with specification teams, stakeholders and other parties, define the scope of that feature and write a couple of functional/acceptance tests that validate the analysis of the customer request
3) In a planned sprint, developers implement the feature and get feedback from the specification team, stakeholders, and customers, to make sure that everything is fine.
4) The quality team verifies the quality of any version delivered by developers. Bugs are filed and later fixed by developers in a sprint.
I clearly understand that the list above is incomplete, could be better, and is definitely subject to change, but I’m just trying to give overall context. Now, by reading your book, I could clearly see all the value that metrics and proper testing strategies add to a software project, and I’m glad I was introduced to things like DDE and all those sample charts – that information is really gold. While reading the book, though, I tried hard to apply the same logic that you presented in your DDE formula to somehow measure the value of unit tests created by developers during development time. The thing is that I couldn’t reach any conclusion, because it looks like the value added by unit tests can’t be measured in the way that you presented in your book. My reasons for that are:
1) Unit tests don’t seem to perform verifications in the same sense that exploratory or automated tests do – rather, we regularly use them to get something built according to a set of premises, and later on, that set of premises can be verified.
2) Developers don’t usually count, or indicate, bugs found during development time. I don’t even think that this would make any sense.
3) They are not testing a use case; they are testing smaller pieces of it. Although these smaller pieces can directly affect a use case, these observations seem quite different from those made by an automated functional/acceptance test.
So, my question is: can we measure the value/cost benefit added by unit tests, using metrics that indicate that they are definitely effective in a software project, or is the value of unit tests unrelated to, for example, the number of bugs filed by a validation team? Can the value added (or not) by unit tests be measured, qualified, and improved? Is it possible to observe unit tests, alone, from a cost perspective?
I explored this question on my own, and so far the following points seem valid to me:
The main intention of unit tests does not seem to be finding bugs. Unit tests seem to be closer to a development tool (like an IDE) than to a tool for finding problems.
Since unit tests are part of the development effort, it is difficult to say how many hours have been spent writing unit tests versus writing production code.
Although projects with high unit test code coverage tend to have fewer bugs filed, such coverage is no guarantee of use-case-level bug-free code. Still thinking about the number of bugs, I believe that unit tests tend to prevent run-time errors more than use case errors.
Apparently, the real value of unit tests comes in terms of maintenance, the ability to add or remove features without fear, and mainly as a means of communication with other developers working on the same code base.
If the points above are valid, it is likely that unit tests prevent the creation of a given number of bugs during the validation phase. However, in order to assert that, we would need to collect all the metrics for a given project that has no unit tests (like DDE, cost per bug fix, time per bug fix, and others), properly implement unit tests for that project, and then collect the very same metrics again for comparison. Is this rationale correct? If so, can we confidently say that unit tests, alone, were responsible for the change in those numbers? Are there any published studies that analyze these points?
Thank you so much for your attention and sorry for this long email. The thing is that I found your book extremely useful and I definitely want to improve my understanding of many aspects of my profession.
Thanks for your kind words about my book. I’m glad it’s proving interesting and useful to you, and that it provoked such a good and reflective set of questions.
Before I get into a detailed response, let me set up the background for what I’m about to write:
- First, I agree with your brief, general summary of the software process as it applies in Agile development, and I’ll refine it further in a moment. I would add, at this point, that in many organizations there is an additional test level that occurs, sometimes called “system integration test,” where the software produced in the Agile sprints is tested with other software which interoperates or cohabitates with it. This level can occur once after the last sprint, but it’s better if it happens iteratively as each sprint concludes.
- Second, since you mention sprints and since your description of the process seems very Agile-oriented, I assume that we are talking about unit tests in the context of Agile development.
- Third, I assume that, when you talk about unit tests, since you say that developing the unit tests is “part of development,” you are talking about unit tests that are used for test-driven development (whether formally or informally). That is, the unit tests are developed at the same time the code is being developed, and coding, developing tests, and executing the tests proceed in a tight loop until the unit is complete.
- Fourth, I assume, since you talk about unit tests allowing us to make changes without fear of regression, that you are talking about automated unit tests, ideally incorporated in a continuous integration framework such as Hudson.
With those assumptions in mind, here are my thoughts on the many good comments and points that you raise. Let me start by reviewing the development process:
1. Test-driven development is a great idea. Indeed, I did that myself years ago when I was a programmer. However, it’s a misnomer to call that process “test driven.” The “tests” used in this process are actually more like executable low-level design specifications. I know you are already aware of this, because of your comment about these tests “not testing use cases.” What can be somewhat confusing about this is that the work product produced by test-driven development includes a set of automated unit tests that subsequently can be used for something properly referred to as testing, specifically regression testing. Since test-driven development, when it’s happening, is not really testing, the objective of the developer as he or she creates the code and the unit tests is not really to find defects. (Finding defects is one of the typical objectives of most test levels, including unit testing.) The main objective is to build confidence, as the coding proceeds, that the code being built works properly. So, if we want to measure the level of confidence that we should have, that is something better measured using coverage metrics. In this case, the most relevant coverage metrics would be code coverage and, in some cases, data flow coverage. (See the sketch just after this list.)
2. Once the unit is completed, typically there is a missing step in most organizations. Ideally the developer would apply structured test design techniques to augment the unit tests in order to make them true unit tests, and to find any defects that remain in the completed unit. If developers were to do this systematically, the quality of code delivered for testing would dramatically improve. Capers Jones’ studies show that, typically, unit testing has a defect removal effectiveness of only 25%, but best practices in unit testing boost that to 50%.
3. In fact, there is another missing step in many organizations at this point. The best practice would be for the programmer to take the code, the unit tests, and the unit test results to a code review. A study by Motorola showed that, to maximize effectiveness, this review should include at least two people other than the author, so pair programming or simply having a lead developer review the code is not sufficient. The unit tests might be augmented based on this review, and, if so, re-run.
4. At this point in the process, the unit tests created in step 1 (and ideally augmented in steps 2 and 3) are incorporated into a continuous integration framework, as mentioned earlier, and serve as ongoing guards against regression of the unit. If the unit is refactored, the tests may have to be updated, as you mentioned.
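To make the coverage point in step 1 concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption on my part: the function, the tests, and the use of pytest with coverage.py as the measurement tools.

```python
# A minimal sketch of measuring confidence via coverage rather than
# defect counts. The function and tests are hypothetical examples;
# pytest and coverage.py are assumed to be installed.

def shipping_fee(weight_kg: float) -> float:
    """Flat fee under 5 kg, per-kg fee at or above it (made-up rule)."""
    if weight_kg < 5:
        return 4.99
    return weight_kg * 1.25

def test_light_parcel():
    assert shipping_fee(2) == 4.99       # exercises the `if` outcome

def test_heavy_parcel():
    assert shipping_fee(8) == 10.0       # exercises the fall-through path

# Run, e.g.:
#   coverage run --branch -m pytest this_file.py
#   coverage report -m
# 100% statement and branch coverage tells us that every decision outcome
# was exercised -- exactly the kind of evidence coverage metrics provide.
```

Note that data flow coverage would require a tool that tracks definition-use pairs, which coverage.py does not do; that analysis remains a manual (or specialized-tool) exercise.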
So, let’s get back to metrics and measuring the value of unit testing, starting with the practice of not logging defects found by developers. This is a mistake. I agree that we don’t need to log failures that occur during step 1 above, because–as I said–we’re not really testing, we are developing the software. However, once we get into steps 2, 3, and 4, I believe those defects should be logged. Often when I say this, people have a horrified reaction and say, “Oh, no no no, we can’t have programmers interrupting themselves and breaking their flow to log defects in some cumbersome bug tracking tool.”
To which I reply: “I didn’t say they had to use a bug tracking tool. I said they should log the defects.” Here’s the distinction. I agree that there’s no need for a tracking tool here, because the bug is going to be immediately removed. Its lifecycle will start and end on the same day, so all the state-based bookkeeping that the tracking tool does is unnecessary. But we do need to log information about the defect, especially classification information, phase of introduction, etc., and we should use the same classifications as are used during formal testing. This information can be captured in a simple spreadsheet. That way, we can do analysis of the defects found during unit testing (step 2), code reviews (step 3), and unit regression testing (step 4).
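To illustrate what such a lightweight log could look like, here’s a minimal sketch in Python that appends records to a CSV “spreadsheet.” The field names and classification values are hypothetical placeholders; in practice you’d reuse the taxonomy your formal testing already uses, as mentioned above.

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("unit_defect_log.csv")
FIELDS = ["date", "developer", "unit", "defect_type",
          "phase_introduced", "phase_detected", "found_in_step"]

def log_defect(record: dict) -> None:
    """Append one defect record to the shared CSV 'spreadsheet'."""
    write_header = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)

# Example entry: a logic defect caught while unit testing (step 2).
log_defect({
    "date": date.today().isoformat(),
    "developer": "jdoe",                 # hypothetical
    "unit": "invoice_rounding",          # hypothetical
    "defect_type": "logic",              # reuse your formal test taxonomy
    "phase_introduced": "coding",
    "phase_detected": "unit testing",
    "found_in_step": 2,                  # 2 = unit tests, 3 = review, 4 = regression
})
```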
What can this analysis do for us? A number of things. First, it can help developers and the broader organization learn how to write better code. Patterns in types of defects found during steps 2, 3, and 4 can show that training and mentoring is needed to reduce bad coding practices or habits that some developers might have.
Second, to address your main question, it can allow us to directly measure the value of unit tests. To do so, we need to do one more thing, which is to estimate the effort associated with unit testing and unit test defect removal as well as estimating the effort associated with other levels of testing (e.g., acceptance testing, system integration testing, etc.), removal of defects found in those levels of testing, and the effort associated with failures in production. This is actually easier than you might think, and sufficiently accurate numbers can be obtained by surveying the people involved.
The differences are often quite dramatic. One client found that a defect found and removed in unit testing took three person-hours of effort, while a defect found and removed in system testing took 18 hours, a defect found and removed in system integration testing took 37 hours, and a defect found and removed in production took 64 hours. So, for this client, each defect found in unit testing saved at least 15 hours of effort, and possibly as much as 61 hours of effort!
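To make the arithmetic explicit, here’s a small sketch using that client’s per-level costs; the defect counts are invented purely to show the calculation, not real data.

```python
# Per-defect find-and-fix effort (person-hours), from the client example.
cost_per_defect = {
    "unit testing": 3,
    "system testing": 18,
    "system integration testing": 37,
    "production": 64,
}

# Hypothetical counts: defects caught in unit testing that would
# otherwise have escaped to each later level (illustrative only).
would_have_escaped_to = {
    "system testing": 40,
    "system integration testing": 10,
    "production": 5,
}

saved = sum(
    count * (cost_per_defect[level] - cost_per_defect["unit testing"])
    for level, count in would_have_escaped_to.items()
)
print(f"Estimated effort saved by unit testing: {saved} person-hours")
# 40*(18-3) + 10*(37-3) + 5*(64-3) = 600 + 340 + 305 = 1,245 person-hours
```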
Notice that this approach allows us to avoid the “double blind study” approach that you mentioned earlier. We don’t have to find two almost-identical projects, where the only difference is whether unit testing happened or not. I expect that would prove impossible, so it’s good that we don’t need to do it.
Now, as you can see, we can measure part of the value delivered by unit testing, in terms of avoided downstream costs of failure. (You might want to read more about cost of quality, which is the technique used above; check out my article here.) However, I also agree with your earlier comments about unit testing having other objectives, such as easing the maintenance of code, reducing the risk of breaking something when we change it, and helping developers communicate about the code. (That last benefit is especially applicable if you follow step 3 mentioned above.) It is possible to develop metrics that allow you to measure these benefits as well, but, if the main objective of measuring the value of unit testing is to convince managers to continue to invest in it, then the “effort saved” metric mentioned above should be sufficient.
I received a question from a reader of the blog:
I have a query with respect to the levels and types of testing and how they are carried out. Can the development team execute all the system integration test cases as a part of their unit test cases? Later the testing team will be re-executing the same test cases once again and will be left with no defects to report.
In general, Krishna, this would be neither possible nor desirable. It’s not possible because properly designed system integration testing will focus on interoperability and other emergent behaviors (e.g., performance, security, reliability) that do not manifest themselves or are indeed not even testable at a unit level. It’s not desirable because different levels of testing should focus on covering different things.
Basically, the test levels should function as a sequence of filters. If you wanted to filter water, you wouldn’t use five identical filters to do that, but rather would use a sequence of different filters, each designed to catch particular types of impurities. In the case of testing, each test level is designed to cover certain aspects of the system, to mitigate certain types of quality risks, and to catch certain types of bugs.
This week, I’m pleased to feature a bonus bug of the week, courtesy of long-time reader John Singleton. He wrote to tell us:
I was posting some comments in response to your blog post, entitled “Bad Agile Implementation Causes Software Test Challenges”. For whatever reason, the Submit button was entirely non-responsive in Firefox. Thankfully, switching over to IE resolved it.
Not griping, just thought you’d want to know. Good, thoughtful response you gave.
That’s weird. It’s usually the other way around with WordPress (the software we use for our blog): Firefox works when IE doesn’t. I guess we have a bonus bug of the week.
Glad you liked the post. I’m looking forward to your comments.
So, while this isn’t a very serious bug, it is part of a very serious pattern: Failure of major players in the web world to test browser compatibility. See my previous posts http://bit.ly/mdBIRY and http://bit.ly/mPkD9z. Is this a lack of understanding of basic test design techniques such as equivalence partitioning and (for particularly risky situations) pairwise testing? Or just a lack of caring?
I have to say that this pattern is not universal. Working with a major e-commerce client this week, I was pleased to see that they include testing of most of the major browsers, covering the equivalence partitions.
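For those who want to see the technique in miniature, here’s a sketch of equivalence partitioning applied to browser configurations. The engine groupings and platform list are my illustrative assumptions (this was the IE 9 era), and a pairwise tool would trim the combinations further.

```python
# A sketch: treat browsers that share a rendering engine as one
# equivalence partition and test one representative per partition.
from itertools import product

engine_partitions = {
    "Trident": ["IE 9", "IE 8"],
    "Gecko":   ["Firefox"],
    "WebKit":  ["Chrome", "Safari"],
    "Presto":  ["Opera"],
}
platforms = ["Windows 7", "Windows XP", "OS X"]

# One representative browser per engine partition.
representatives = [versions[0] for versions in engine_partitions.values()]

# Cross with platform partitions; prune infeasible pairs by hand
# (e.g., IE 9 never shipped for Windows XP or OS X).
infeasible = {("IE 9", "Windows XP"), ("IE 9", "OS X")}
for browser, platform in product(representatives, platforms):
    if (browser, platform) not in infeasible:
        print(f"test configuration: {browser} on {platform}")
```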
What do you think about this pattern? Are you in the web world, and, if so, do you test browser compatibility? If not, why?
I recently had an experience with a couple of important lessons for software quality and software testing. I ordered a pair of identical products from an e-commerce retailer. I won’t distract from the discussion by mentioning the specific products, or the problems with them, as that would entail a long, technical digression that detracts from rather than reinforces my point.
Anyway, the order fulfillment process went reasonably well. The retailer did an acceptable job of keeping me posted on the status of the order, which apparently involved working directly with the vendor making the products.
The two products finally arrived, and I immediately tested them out. I was not impressed with the quality of the products. Both exhibited an annoying failure each time they were used, and about 2% of the time they exhibited a failure that made the items not fit for purpose under certain circumstances. However, the products hadn’t cost much and I could imagine using them in non-critical situations, so I decided not to bother with returning them.
A week later, an e-mail popped into my inbox from the retailer, asking me to review the product. Okay, it’s worth five minutes to warn other potential customers about these problems. I write a review, click to submit it, and immediately see an error message. I go back, copy the review, and try it with another browser. Exact same problem.
So, now I’m a bit irritated. I send an e-mail to the retailer, advising them of the problem with their site and with the product. I get back a terse response from them, via their online support web page, saying, “It sounds to me like you got a defective [product]. What I would suggest is contacting the manufacturer of the item and most likely they will replace it for you.”
I then asked the retailer whether they were able to post my review on my behalf, as their system didn’t work. Here’s another terse response: “I don’t have a way to do that on my end, you can attempt that again.” I did try again–and this is now three days after the initial failure of their site–and it still doesn’t work.
Okay, I concluded, enough nonsense. I’ve now flipped the bozo bit on the retailer. A textbook case of how not to handle customer complaints about your website and the products you sell.
Okay, let’s compare this to best-of-breed processes for handling customer quality complaints. When I recently had a problem with a Blue Snowball microphone, the vendor worked with me to repair the microphone. When their repairs failed, Blue then sent me a new one for free, and apologized for my inconvenience.
As another example–not to blow our own horn too loudly–we (RBCS) have a policy that, whenever a customer has a problem using our website, we give them a discount code, even when the problem is not our fault (e.g., the recent failures due to a stealth software update by Pair, our site host). In fact, when people attending our free webinars have had problems with attending the webinar, we have given them discount codes.
So, what lessons can we learn as IT people, software professionals, and software testers specifically?
1. If you provide a means for customers to provide online feedback on your product or service, make sure it works! Testing should cover that feedback mechanism, and compatibility testing should be included in that, due to browser diversity.
2. If a customer reports a problem via an online feedback mechanism, make sure that the workflow includes tracking that problem to resolution, and test that workflow. Just telling the customer, “Tough crackers, take it up with the vendor,” is actually worse than not having an online feedback mechanism at all.
All in all, for me this experience has gone from the “that’s too bad, but no big deal” category to “I’ll never do business with that retailer again.” I’m sure that’s not the customer experience the retailer had in mind when they were designing their website, and, had they bothered to test it sufficiently, it probably wouldn’t have been the experience I had.
As long-time listeners–or even brand new listeners, for that matter–of the RBCS webinars know, we use Citrix’s GoToWebinar service for our free monthly webinars. Now, I’ve been fairly satisfied with GoToWebinar. I’ve used one or two of the competing services, and been less happy with those. Of course, webinar listeners (and readers of this blog) might remember I chided Citrix back in May for the ungraceful way the system handles audio drop-outs by the presenter.
So, during the June webinar, webinar attendee Keith Stobie reported an inability to see the presentation using Internet Explorer 9. He said that Chrome (not sure which version) worked just fine. I reported the problem to Citrix on Wednesday of last week. Five days later, I received the following reply, quoted in its entirety (minus the links provided at the end):
Thank you for contacting Citrix Online Global Customer Support,
Dear Rex Black,
IE 9 has not been tested with any of our products as of yet. we will try to help fix any issues the best we can, but cannot guarantee anything. Hopefully we should get this done as soon as possible.
If you have any additional questions or need further clarification regarding this matter, please feel free to reply directly to this email. For any other product inquiries or technical assistance, please visit us at our Support Centers listed at the bottom of this email. Our Support Centers include Self Help files and our Global Customer Support Contact Information.
Richard Carrel | Global Customer Support
So, I appreciate the reply, though I have to say that five days isn’t quick turnaround for a customer complaint about a browser-based service that’s incompatible with a major vendor’s browser.
More surprising to me is the admission that Citrix hadn’t tested IE9. I don’t keep up with the browser wars, so I’m not sure what share of the browsing action IE9 has, but I’m pretty sure that Microsoft’s IE family of browsers remains at least one of the 800 pound gorillas in the room.
Putting myself in the position of the Director of Quality or VP of Testing or whatever the head-testing-honcho’s title is at Citrix, I understand that there are constraints on compatibility testing. I wouldn’t bother to test four-year-old versions of Opera, for example. But come on, not testing IE9? If I were in charge of testing for any SaaS provider, compatibility would be one of my top quality risks, and testing browser/OS/malware configuration combinations would receive a fair amount of time, money, and attention. Of course, functionality, reliability, performance, and security would also be high on the list of risk categories, too.
Here’s some free consulting advice to my fellow test professionals who work at Citrix: Spend a little time getting ramped up on how to do quality risk analysis and risk based testing. You can find lots of free resources on our web site, especially in the articles and the Digital Library. You’ll notice that compatibility is one of the quality risk categories included in our free quality risk checklist. If you need more help, let me know, as we can provide a one-week risk based testing bootstrapping service that will get you headed in the right direction.
Moral of the story: If you are in charge of testing at any SaaS vendor, and you’re not testing for compatibility, it’s only a matter of time before someone writes a blog post like this one about your product and the degree to which you aren’t testing it.
Here’s a question about equivalence partitioning.
My name is Nhat Khai Le and I’m taking the ISTQB Advanced Technical Test Analyst online course (provided by RBCS).
I faced this sample question:
You are testing an accounting software package. You have a field which asks for the month of a transaction.
You know that months are treated differently in this software depending on the month of the quarter (first month of the quarter is treated differently than the second which is treated differently than the third.)
How many test cases, minimum, would you need to test this field to get equivalence coverage?
I chose answer B, i.e., 3, but the answer is C, i.e., 5.
The reason the answer is C is because there are three months in each quarter, each of which is different, plus the two months immediately preceding and following the quarter. It’s important to remember that equivalence partitioning includes partitions outside the “normal” range when using it for testing.
To help visualize the answer, see the illustration below:
Equivalence Partitioning for the Months in the Quarter
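If you prefer code to pictures, here’s a minimal sketch of the same five partitions; the field semantics (a month checked against a quarter under test) are assumed for illustration.

```python
# Five equivalence partitions for a transaction month, relative to a
# quarter under test: the month before the quarter, each of the three
# in-quarter months (each treated differently), and the month after.

def partition(month: int, quarter_start: int) -> str:
    """Classify `month` against the quarter beginning at `quarter_start`
    (e.g., quarter_start=4 for Q2). Illustrative only."""
    offset = month - quarter_start
    if offset < 0:
        return "before quarter"
    if offset == 0:
        return "first month of quarter"
    if offset == 1:
        return "second month of quarter"
    if offset == 2:
        return "third month of quarter"
    return "after quarter"

# One representative value per partition = five tests minimum.
for m in [3, 4, 5, 6, 7]:          # March..July, against Q2 (April start)
    print(m, "->", partition(m, quarter_start=4))
```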
A quick follow-up related to my earlier post on evidence. As some readers may know, avionics software that controls flight on airplanes (e.g., cockpit software) is subject to a test coverage standard, FAA DO-178B. That standard applies lower standards of test coverage to software that is not safety critical.
So far, so good.
Here’s an example of why such standards are useful. During my flight from the US to China today, I managed to crash the entertainment software running at my seat not once but three times. I did this by pausing, rewinding, and resuming play while the flight attendants were taking my dinner order (i.e., not by unusual actions). I was ultimately able to get it working again, thanks to a series of hard reboots by a flight attendant. One of my fellow passengers wasn’t so lucky, as his system never recovered.
Okay, that’s just entertainment, and anyone who travels regularly knows they should bring a book or plan to winnow down their sleep deprivation balance on long flights.
However, what if the flight control software were as easy to crash? Who would want to hear a cockpit announcement along the lines of the following: “Our entire flight control system just crashed. This enormous airliner is now essentially an unpowered and uncontrolled glider. We’ll reboot the system until we get it working again, or until we have an uncontrolled encounter with terrain”?
Personally, I want people testing the more safety-critical aspects of avionics software to adhere to higher standards of coverage, and to be able to provide evidence of the same.
Like most people, I don’t always read those pesky agreements that come with software these days, but I made an exception for the Tune-Up package I’m installing to try to revive my tired old Windows XP system. I came across this curious contradiction in the warranty section of the agreement:
The Software and your documentation are free of defects if they can be used in accordance with the description of the Software and its functionalities that was provided by TuneUp at the point in time that you received the Software and documentation. Further qualities of the Software are not agreed.
Since no Software is free of defects, we urgently recommend you to back up your data regularly.
Okay, guys, what is it? Is the software free of defects or not? If it is free of defects, perhaps you could enlighten us all on how you did that?
As I mentioned earlier in this blog, we are adopting a unique feature here. Readers can submit questions about my books to me to answer in this blog. I will answer at most one a week–as I have a lot of other work going on, which I hope everyone can understand–but I will get to the questions eventually. Here’s the first question, from Gianni Pucciani of CERN.
I finished reading the book Advanced Software Testing Vol.2 for the preparation of the ISTQB AL-TM. First of all thanks a lot, I found the book excellent, with lots of good tips that one could not know without adequate experience, and very well explained. Now I am reviewing all the chapters and their Q/A. I am planning to send you an email at the end of each chapter in case I have doubts, in order to clarify some of the questions.
For Chapter 1 I have only one doubt, on question #2 [which I've inserted here].
Assume you are a test manager working on a project to create a programmable thermostat for home use to control central heating, ventilation, and air conditioning (HVAC) systems. This project is following a sequential lifecycle model, specifically the V-model. Currently, the system architects have released a first draft design specification, based on the approved requirements specification released previously. Which of the following are appropriate test tasks to execute at this time?
A. Design tests from the requirements specification.
B. Analyze design-related risks.
C. Execute unit test cases.
D. Write the test summary report.
E. Design tests from the design specification.
The solution is A, B, E, but I don’t agree on A. The question asks us to identify the test tasks that are appropriate to execute at this time (first draft design released, requirements specification released previously). A (design tests from the requirements specification) is wrong in my opinion, because this should already have been done as soon as the requirements specification was available. So, I don’t think A is appropriate; it can be done “now,” but it should have been done before. I would agree with including A if the question were “identify the tests that can be done at this time.” The chapter stresses the importance of aligning testing activities with the development process. Executing A at that time is, for me, an example of sub-optimal alignment. What do you think?
CERN IT Dept.
Gianni, you are correct that the design of tests based on the requirements should have started earlier, which is indeed a key theme of the chapter. However, that set of test tasks might not have been completed yet. In addition, the design of tests from design specifications often involves referring to the requirements specification as well (e.g., as a test oracle). Therefore, it is appropriate that the test tasks described in option A take place at this time.
I hope that helps?
Recently, one of our licensed instructors asked me about a question in our Advanced Test Analyst course, related to two very useful test design techniques, the decision table and the related cause-effect graph. The question is as follows:
An on-line shoe-selling e-commerce Web site stocks the following options for men’s loafers:
Tassel: Tassel (T) or non-tassel (~T)
Color: Black (B), cordovan (C), or white (W)
Size: all full and half sizes from 8 to 14 (S=n)
The store is overstocked with tasseled loafers of all sizes and colors, along with white loafers in all sizes, and cordovan loafers in sizes 13, 13 ½, and 14. As a result, they are offering a 10% discount (10%) and free shipping (FS) on these items. Design a full decision table that shows all combinations of conditions, then collapse that table by using don’t care (“-“) notation where one or two conditions cannot influence the action. Which of the following statements is true about these two tables?
A. The full table has 8 rules; the collapsed table has 5.
B. The full table has 12 rules; the collapsed table has 7.
C. The full table has 12 rules; the collapsed table has 5.
D. Both tables have 12 rules, as no combinations can collapse.
The instructor wrote, “The answer is C – however, I was wondering if you could explain the logic as to why?”
Okay, so here’s the trick. The full table has twelve rules (columns) because you have one condition with three possible values (color) and two conditions with two possible values (size ≥ 13 and tassel), so 3×2×2 = 12. Because all six columns with tassel == true lead to the same actions, they collapse into a single column, leaving seven. Of the six remaining non-tassel columns, the two with black (never on sale, regardless of size) collapse into one, and the two with white (always on sale, regardless of size) collapse into one; only the two cordovan columns cannot collapse, because there size determines the action. That leaves 1 + 1 + 1 + 2 = 5 columns in the collapsed table.
So, you can completely test the combinations of conditions for the business logic behind the discount with just twelve tests, and, if you are pressed for time, just five tests will give you pretty good risk mitigation.
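For anyone who wants to verify the counts mechanically, here’s a small sketch that enumerates the full table and lists the collapsed rules; the condition encoding is my own illustration of the question’s logic.

```python
# Enumerate the full decision table: 2 tassel values x 3 colors x
# 2 size bands = 12 rules, then show the 5 collapsed rules, where
# None marks a don't-care condition.
from itertools import product

full_table = []
for tassel, color, size_ge_13 in product(
        [True, False], ["B", "C", "W"], [True, False]):
    on_sale = tassel or color == "W" or (color == "C" and size_ge_13)
    full_table.append((tassel, color, size_ge_13, on_sale))

print(len(full_table), "rules in the full table")            # 12

collapsed_table = [
    (True,  None, None,  True),    # any tasseled loafer: 10% + FS
    (False, "W",  None,  True),    # white, no tassel: 10% + FS
    (False, "B",  None,  False),   # black, no tassel: no discount
    (False, "C",  True,  True),    # cordovan 13-14, no tassel: 10% + FS
    (False, "C",  False, False),   # cordovan under 13, no tassel: none
]
print(len(collapsed_table), "rules in the collapsed table")  # 5
```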