Software reliability

It seems the process of bringing Java under an open source license has raised the question of Java platform quality even higher. In particular, reliability has been discussed more and more widely among my peers over the last few months, so I decided to share a couple of thoughts on the topic. Hopefully, you'll like what you're about to see.

What reliability means for us.

According to the IEEE definition, reliability is "The ability of a system or component to perform its required functions under stated conditions for a specified period of time." [1]

First of all, I'd like to emphasize the words "required functions", "stated conditions", and "specified period". I also want to add "repetitively". I believe it will become clear later why I've focused on those.

The majority of software reliability studies pay a good deal of attention to the amount of time a system or component can run without a failure. Most of them deal with various fault/time distributions, estimations of failure intensity, failure probabilities, failure intervals, and the like. I beg your pardon for a rather long citation: "…There is a lot of lore about system testing, but it all boils down to guesswork. That is, it is guesswork unless you can structure the problem and perform the testing so that you can apply mathematical statistics. If you can do this, you can say something like "No, we cannot be absolutely certain that the software will never fail, but relative to a theoretically sound and experimentally validated statistical model, we have done sufficient testing to say with 95-percent confidence that the probability of 1,000 CPU hours of failure-free operation in a probabilistically defined environment is at least 0.995." When you do this, you are applying software-reliability measurement." [2]
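
Just to make that statement a bit more tangible, here is my own back-of-the-envelope reading of it under the simplest possible (exponential) failure model; the real substance of the quoted claim is the statistical confidence, which this little calculation leaves out:

```latex
% Assuming R(t) = e^{-\lambda t} (constant failure intensity), the claim
% R(1000 CPU h) >= 0.995 bounds the failure intensity and the MTTF:
\lambda \le -\frac{\ln 0.995}{1000} \approx 5 \times 10^{-6}\ \text{failures per CPU hour},
\qquad
\text{MTTF} = \frac{1}{\lambda} \gtrsim 2 \times 10^{5}\ \text{CPU hours}.
```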

With no intention to underestimate or undermine such studies ([3]), and being in full agreement about the absolute necessity of statistical modeling and verification in software testing, I want to talk about a wider approach. It isn't a brand new one, perhaps, but it might be slightly different from what you've seen so far.

Different takes on quality.

I love to talk about quality, mostly because it is a very vast topic and one can sell some nonsense 🙂

As I see it, there are two main quality approaches; I'll call them the hardware and software types. The main differences between them come from the production cycle specifics of devices and applications. Namely:

  • hardware production has much higher costs because of complex factory processes, the complicated and costly equipment involved, et cetera. Thus, you'd better be careful with how a device's components are designed, produced, assembled, and tested. It might cost a fortune to make changes to a silicon chip, a motherboard design, or a car once it's out.
  • As a result, hardware development is approached with more "respect" and more precise planning because of the high up-front investment.

  • software, on the other hand, usually has a more flexible life span: the targets are sometimes easily moved during the development process, requirements change, design documents might be somewhat informal, spec changes might not be well tracked down to real application defects, the quality process lags behind, and on, and on… At the end of the day a software application reaches its customers and they start finding bugs in it. Then an escalation is raised, and the product's sustaining team has to spend time mirroring the customer's setup, repeating all the steps to reproduce a defect, etc. And consider yourself lucky if all of this can be done in just one interaction. However, if the defect report wasn't detailed enough or the setup was way too sophisticated, you might spend months nailing down a particular problem. We've all seen this many times, right?

Our take on the problem is a mix of the two approaches above. I want to take the best parts of hardware reliability practice and bring them over to the software side wherever possible. Here's what I see as the necessary steps:

  1. Design and architectural reviews (many teams are doing this already)
    1. Tracking correlations between architecture decisions, changes and discovered defects
  2. Mean-Time-To-Failure (MTTF) testing. A quality department can run some (preferably standardized) applications for a prolonged period of time to demonstrate the stability of the software platform. However, a better approach is to run scenario-based MTTF tests (see 4.1 below and the sketch right after this list).
  3. Employing statistical analysis of quality trends
  4. Enforcing static analysis evaluations on a periodic basis
    1. Scenario-based MTTF testing. Normally, one can gather a few (maybe a hundred or so) typical usage scenarios for a software application. The number is likely to be much higher for a software platform like Java. These scenarios might be simulated or replicated with a test harness of choice and a specific set of existing or newly developed tests. Of course, you might not be able to simulate any of these real-life scenarios with 100% accuracy, but that's not always necessary. These scenarios should then be executed repeatedly and their pass/fail rate tracked over time.
    2. Scenario completeness. Using a list of the features exercised during a scenario's execution, together with static analysis results, one can tell which parts of a software application will be touched during a particular scenario's run. Using code coverage methods, you can find out which parts of the scenario's functionality are covered and which are not (a tiny sketch of this bookkeeping also follows the list). With something similar to BSP (http://weblogs.java.net/blog/cos/archive/2005/12/java_quality_me_6.html) you can focus the improvement efforts, but that is another story and it has been covered already.
  5. Quality trends monitoring. The results of the scenario-based MTTF runs (4.1) should be included here.
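
To make step 4.1 a bit more concrete, here is a minimal sketch of what a scenario-based MTTF harness could look like. It is only an illustration of the bookkeeping: the harness class, the Runnable-based registration, and the scenario names are made up for this post, not an existing framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of a scenario-based MTTF harness (illustrative only).
 * Each scenario is a Runnable that throws on failure; the whole set is
 * replayed for a fixed duration, pass/fail counts are tracked, and MTTF
 * is estimated simply as elapsed run time divided by the number of failures.
 */
public class ScenarioMttfHarness {

    private final Map<String, Runnable> scenarios = new LinkedHashMap<>();
    private long passes, failures;

    public void register(String name, Runnable scenario) {
        scenarios.put(name, scenario);
    }

    /** Repeats every registered scenario until the given wall-clock time runs out. */
    public void run(long durationMillis) {
        long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < durationMillis) {
            for (Map.Entry<String, Runnable> e : scenarios.entrySet()) {
                try {
                    e.getValue().run();
                    passes++;
                } catch (RuntimeException failure) {
                    failures++;
                    System.err.printf("FAIL %s: %s%n", e.getKey(), failure);
                }
            }
        }
        long elapsedHours = Math.max(1, (System.currentTimeMillis() - start) / 3_600_000);
        double passRate = passes * 100.0 / Math.max(1, passes + failures);
        System.out.printf("pass rate: %.2f%%, estimated MTTF: %s%n",
                passRate,
                failures == 0 ? "> " + elapsedHours + " h (no failures observed)"
                              : (elapsedHours / (double) failures) + " h");
    }

    public static void main(String[] args) {
        ScenarioMttfHarness harness = new ScenarioMttfHarness();
        // Hypothetical usage scenarios; real ones would drive the platform under test.
        harness.register("parse-and-compile", () -> { /* e.g. drive a compiler run */ });
        harness.register("swing-editor-session", () -> { /* e.g. replay recorded UI events */ });
        harness.run(60_000); // a one-minute smoke run; a real MTTF run would be much longer
    }
}
```

The whole point is the repetition and the counters: the same scenarios are replayed over and over, and the pass/fail rate and MTTF estimate fall straight out of the bookkeeping, which is exactly what you want to track over time.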

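For step 4.2, the core of the bookkeeping can be equally simple: compare the set of features a scenario exercises against the set of features your tests actually cover. The feature names below are invented for illustration; in real life the two sets would come from static analysis and a code coverage tool.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/**
 * Toy sketch of the "scenario completeness" check: report the features a
 * scenario touches that no existing test covers. Feature names are made up.
 */
public class ScenarioCompleteness {

    static Set<String> uncoveredFeatures(Set<String> usedByScenario, Set<String> coveredByTests) {
        Set<String> gap = new TreeSet<>(usedByScenario);
        gap.removeAll(coveredByTests);   // what the scenario needs but the tests miss
        return gap;
    }

    public static void main(String[] args) {
        Set<String> scenario = new HashSet<>(Arrays.asList("nio", "2d-rendering", "jit-osr"));
        Set<String> covered  = new HashSet<>(Arrays.asList("nio", "2d-rendering"));
        System.out.println("Not covered for this scenario: "
                + uncoveredFeatures(scenario, covered));   // -> [jit-osr]
    }
}
```
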
When I communicate these steps to my peers and colleagues, I hear a number of concerns. Typically, these are:

  • how #2 is connected to reliability
  • #4 seems to be a stretch
  • how can you be sure that #4.1 (scenario-based MTTF testing) is equivalent to running heavyweight applications to verify your platform's stability/reliability

Hopefully, I'll be able to answer these, as well as any other questions you might send me in your comments.

  1. Why design and architectural reviews? Long story short, you can keep bad solutions out of your system. Proven practices usually mean fewer last-minute changes at the development stage. Thus, the testing burden will be lower, as will the number of regressions, customer escalations, etc. What about 1.1 above? I don't know – it just sounds cool, I guess 🙂
  2. Everybody seems to be doing this, so why don't we..? Seriously, this is one of the aspects of reliability you want to measure, because it is backed by well-developed theory and years of practice, and it is a meaningful quantitative metric.
  3. Not sure why? Just read some of those books, will ya? 😉
  4. Static analysis is capable of finding the kinds of defects that aren't likely to be discovered at runtime. That happens because, for complex systems, you can't guarantee coverage of the Cartesian product of the input and output state sets. However, some of the nastiest bugs tend to hide right in those dusty corners, which you, or one of your customers, might hit only once in a while. Thus, if a designated set of static analyzers runs cleanly on every build of yours, you can at least demonstrate that it doesn't leak memory or run out of file handles (a small illustration of that defect class follows this list). Consider that reliable also means trustworthy.
  5. One might say that you can track memory leaks with runtime monitoring. True. But how are you going to find and fix them now?

    1. Scenario-based MTTF testing (4.1) gives you the determinism and repeatability that are likely to be missing from the BigApps approach, discussed later.
    2. Scenario completeness (4.2) is complementary to it.
  6. You want to know if your development/quality processes are convergent, right?
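
Since I keep mentioning leaked file handles as the canonical example for #4, here is the kind of defect I have in mind. The class and method names are made up, but the pattern is the classic one static analyzers flag and runtime testing rarely hits:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/**
 * Illustration of a defect class that static analysis catches easily but
 * runtime testing rarely does: a file handle leaked on an exception path.
 */
public class ConfigReader {

    // Leaky version: if readLine() throws, the reader is never closed.
    // A rarely-hit error path like this can slowly exhaust file handles.
    static String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();   // may throw before close() is reached
        reader.close();
        return line;
    }

    // Fixed version: the handle is released on every path.
    static String firstLineSafe(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }
}
```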

And finally, I'd like to mention a couple of common reliability approaches and explain why I see them as misconceptions.

  1. Stress testing

    This one is most often confused with the concept of reliability. The reason is perhaps clear: one might expect a reliable system to work in a wide variety of conditions and perform its functions well.

    You can hear word on the street that "…Microsoft Windows is unreliable." Hell, yes. It sure isn't reliable if you try to debug a huge C++ project, process some statistical data, receive a bunch of spam emails, and install 20+ security fixes from the update center all at the same time. It will likely crash and destroy some of your files, or it might hang nicely. Or you'll suffer some critical performance degradation. I can't tell for sure, 'cause I'm not one of those lucky Windows users. And I'm not trying to make fun of Windows – people do that with their computers on a daily basis far better than I could ever dream of 🙂 My point is that the scenario above is a bit extreme and well beyond an average Windows user's capabilities or, perhaps, desire.

    However, normally your Visual C++ debug session will go smoothly in probably 95% of cases (although I once worked on a C# project that crashed my development machine to a BSoD on every load, while an after-crash attempt to load it again always succeeded. Weird…). Did you ever count how many times your email client worked well when you were sending your emails? Perhaps not, but I'm sure almost everyone has a story to tell about how badly the address book got corrupted the last time Outlook crashed, right?

    Correct processing of such data series, together with gathering feature usage information, can relatively easily demonstrate that Outlook is a reliable application: it has, say, 93.5% failure-free behavior over every 10 hours of execution (see the short worked example at the end of this section). But it is hard to guarantee that the application will survive under some monstrous load conditions.

  2. BigApps testing

    The concept of BigApps testing consists of running some bulky commercial applications to derive an MTTF for, usually, a software platform. Well, I see a threefold problem here (I'm sure there are more of these, but I'll let you deduce them on your own 🙂):

    1. Any BigApp run is only as good as that application's typical utilization (or usage scenarios) of your platform's features.
    2. The correctness of the exercised application itself might be questionable.
    3. The results you'll see at the completion of a run should be attributed to that particular application. If you ran a PeopleSoft system for a week and demonstrated an MTTF of 140 hours, that is great… for the PeopleSoft marketing and PR team, but not that useful for your development organization: it gives them little handy information. Although, if a crash does occur, the engineering team can discover some really bad problem in the code and fix it, which is a rare case of a non-zero-sum game!

It might be a cool marketing or sales tool to use on customers, but it is not as great for engineers.
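
And since I threw out that "93.5% over 10 hours" figure for Outlook earlier, here is the promised worked example. It is purely illustrative and, once again, assumes the simplest exponential failure model:

```latex
% 93.5% failure-free behavior over every 10 hours of execution:
R(10\,\text{h}) = e^{-10\lambda} = 0.935
\;\Rightarrow\;
\lambda = -\frac{\ln 0.935}{10} \approx 6.7 \times 10^{-3}\ \text{failures per hour}
\;\Rightarrow\;
\text{MTTF} = \frac{1}{\lambda} \approx 149\ \text{hours}.
```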

I hope this article has helped to scratch the surface of this problem. Please let me know what you think, point out any gaps in the logic, or just yell at me if you think I'm wrong. Let's talk about this. Maybe we'll work something out that we can all use later in the applications and products we develop for a living or for enjoyment.

Cheers,
Cos


[1] IEEE Std 982.2-1988 (withdrawn in 2002)
[2] John D. Musa and A. Frank Ackerman, "Quantifying Software Validation: When to Stop Testing?"
[3] http://portal.acm.org/citation.cfm?id=22980&dl=#


The original has been posted at http://weblogs.java.net: Software reliability