Deductive proofs of program correctness are prohibitively expensive in the majority of cases. To what extent can test results be used as Bayesian evidence in confirmation of the hypothesis “this program is correct”?
This report is about test results: how they can be used to measure the correctness of a program, or to what extent they can support the hypothesis “this program is correct”. Here, a program is a piece of computer software, and “correct” means that the program meets its initial requirements.
Computer programs are often tested and tagged “correct”. Testing is one of the most important and crucial parts of program design: it is a way of measuring whether a program is complete and in compliance with its specified requirements. Nowadays programs have become very complicated, and so have testing methodologies; the field has become so vast that measuring the correctness of a system is very difficult. Extensive testing of such complex programs has become so expensive that it is not affordable. Since absolute correctness is virtually impossible to achieve, programs are tested to a certain extent and then considered “fit” or “correct”.
The issue now is to what extent test results can be said to confirm the correctness of a program. There are different approaches used to determine the correctness of a program; these will be elaborated fully in the main content of this report. Also, how expensive are these “tests”, and why are they not affordable?
To understand the scenario, we need to evaluate some of the basic terms used: system test results, deductive proofs, program correctness, and prohibitively expensive.
What is meant by Deductive Proof?
Deductive proofs – A deductive proof starts from one or more indisputable premises and derives a conclusion that must be true whenever the premises are true. For example, suppose that anyone who owns at least one million British pounds is a millionaire. If John owns ten million pounds, then by deduction John is a millionaire: if it is true that John has more than a million pounds, then it must be true that John is a millionaire.
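The millionaire example can be written as a trivial executable check (the function name and threshold here are illustrative, not from any real system):

```python
def is_millionaire(pounds: int) -> bool:
    # Premise: owning at least 1,000,000 pounds makes a person a millionaire.
    return pounds >= 1_000_000

# John owns ten million pounds; the conclusion follows necessarily from the premise.
assert is_millionaire(10_000_000)
```

Unlike testing, the deduction holds for every input satisfying the premise, not just the cases we happened to run.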
What is Correctness?
Program correctness – According to John Daintith, a program correctness proof is a mathematical demonstration that the semantics of a program are consistent with some specification of that program. He also states two prerequisites for providing such a proof: there must be a formal specification for the program, and there must be a formal definition of the semantics of the program (Daintith, 2004).
Software Testing Approach
Testing is a fundamental part of software development and is deployed in every phase of the software development cycle; usually more than half of the development time is spent on testing. These are some of the major motivations for testing a program:
- To improve quality – computers and software are used in critical applications such as medical equipment, where a bug can cause severe consequences. Quality here means conformance to the initial design requirements or specification.
- For verification and validation – testing serves as a metric to ensure the correctness of the program under certain conditions.
- For reliability estimation – testing is done to estimate the reliability of a program. A program may conform to the initial requirements but still fail in the field under load or other external factors.
- In the software development life cycle, system testing is the first level at which the system is tested as a whole.
- The system is tested to verify that it meets the functional and technical requirements.
- The application is tested in an environment that closely resembles the production environment in which it will finally be deployed.
- System testing enables us to test, verify, and validate both the business requirements and the application architecture.
The testing methodology describes how, and which components of, a program are tested to ensure conformance with its initial requirements. Below are some of the major testing methodologies in brief:
- Correctness Testing – this approach tests the behaviour of the program to ensure that it produces the right output for a given input.
- Black-box Testing – an approach in which test data are derived from the specified functional requirements, without regard to the internal program structure.
- White-box Testing – an approach in which testing is based on the program's structure and control flow rather than on its initial requirements.
- Performance Testing – most programs have no explicit specification for performance; such requirements are implicit. Performance bugs can cause a program to degrade over time, so resource use is the main concern in this approach.
- Reliability Testing – this approach exercises a program so that failures are discovered and removed before it is deployed.
- Security Testing – a program may run according to requirements but still be vulnerable to intruders and unauthorised access. This is extremely dangerous and poses a threat to the integrity of the program.
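As a sketch of the black-box approach, the test below checks a hypothetical `sort_ascending` function purely against its specification (“the output is the input rearranged in ascending order”), with no knowledge of the implementation:

```python
from collections import Counter

def sort_ascending(xs):
    # Stand-in implementation under test; a black-box tester never sees this body.
    return sorted(xs)

def test_sort_ascending():
    # Test cases derived from the functional specification, not the code.
    for case in ([], [7], [3, 1, 2], [5, 5, 1]):
        out = sort_ascending(case)
        assert all(a <= b for a, b in zip(out, out[1:]))  # ascending order
        assert Counter(out) == Counter(case)              # same elements, same counts

test_sort_ascending()
```

A white-box tester, by contrast, would choose inputs to exercise specific branches and statements inside the implementation.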
Testing is very costly, and automation can be a good way to cut testing costs. However, software testing tools and techniques sometimes lack applicability and scalability, for a basic reason: to automate the process, we must generate test cases and run the target software against the program specification to decide correctness. Human intervention is still needed in the process, as no system has yet achieved 100% automation.
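A minimal sketch of what such automation looks like, assuming we already have a specification oracle to decide correctness (all names here are illustrative); the part no tool fully automates is writing that oracle:

```python
import random

def generate_cases(n, seed=0):
    # Randomly generate input lists; generation is automated,
    # but a human still has to supply the oracle below.
    rng = random.Random(seed)
    return [[rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
            for _ in range(n)]

def run_tests(program, oracle, cases):
    # Return every input on which the program disagrees with the specification.
    return [c for c in cases if program(c) != oracle(c)]

# A buggy "sort" that silently drops duplicates is caught automatically,
# using Python's built-in sorted as the specification oracle:
buggy_sort = lambda xs: sorted(set(xs))
failures = run_tests(buggy_sort, sorted, generate_cases(200))
```

Generating inputs is cheap; deciding what the correct output should be is the expensive, human part, which is why full automation remains out of reach.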
Financial Implication of testing
Prohibitively expensive – “prohibitively expensive” in this context means so expensive that it is not affordable to conduct the extensive testing needed for a deductive proof that a program is correct.
Testing is important for software quality, but it can also be prohibitively expensive, consuming up to 50% of development costs. It can locate defects but never demonstrate correctness. We have seen that testing is endless, and it has financial implications as well: the better the testing, the lower the risk incurred by the program. Billions are lost annually because of improper testing, partly because a line has to be drawn to stop testing at some point; if the stopping point is misjudged, more risk is incurred. A NIST report estimated the billions of dollars that could be saved each year if improvements were made to the testing process.
Realities of system testing
- Not all problems will be found no matter how thorough or systematic the testing.
- Testing resources (staff, time, tools, and labs) are limited.
- Specifications are frequently unclear/ambiguous and changing (and not necessarily complete and up-to-date).
- Systems are almost always too large to permit test cases to be selected based on code characteristics.
- Exhaustive testing is not possible.
- Testing is creative and difficult.
- A major objective of testing is failure prevention.
- Testing must be planned.
- Testing should be done by people who are independent of the developers.
Prioritizing Test Cases
Once a test suite has been selected, it is often desirable to prioritize its test cases by some criterion. Since the time available for testing is limited and not all tests can be run, at least the “most important” ones will be.
Bases for Test Prioritization
- Most frequently executed inputs.
- Most critical functions.
- Most critical individual inputs.
- (Additional) statement or branch coverage.
- (Additional) Function coverage.
- Fault-exposing potential.
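The “additional statement coverage” criterion, for instance, can be sketched as a greedy loop: repeatedly pick the test that covers the most statements not yet covered. The coverage data below is invented for illustration:

```python
def prioritize(coverage):
    # coverage: test name -> set of statement ids that the test executes.
    ordered, covered = [], set()
    remaining = dict(coverage)
    while remaining:
        # Greedily pick the test adding the most not-yet-covered statements.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        covered |= remaining.pop(best)
        ordered.append(best)
    return ordered

suite = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {4, 5, 6, 7}}
print(prioritize(suite))  # t3 first (4 new statements), then t1, then t2
```

The other bases (execution frequency, criticality, fault-exposing potential) fit the same greedy shape with a different scoring function.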
When to Stop Testing?
Testing is potentially endless: we cannot stop until all the defects are unearthed and removed, which is simply impossible. At some point testing has to stop and the software is released. For example, Microsoft released Windows 7 at some point; vulnerabilities were later discovered, and users are advised to apply the necessary patches. The question is when to stop testing.
Realistically, testing is a trade-off between budget, time, and quality. It is driven by profit models based on return on investment and risk. The pessimistic, and unfortunately most often used, approach is to stop testing whenever any of the allocated resources – time, budget, or test cases – is exhausted. The optimistic stopping rule is to stop testing when either reliability meets the requirement or the benefit from continued testing cannot justify the testing cost (Yang, 1995). This usually requires reliability models to evaluate and predict the reliability of the software under test; each evaluation requires repeated runs of the cycle of failure-data gathering, modelling, and prediction. This method does not fit well for ultra-dependable systems, however, because the real field failure data would take too long to accumulate (Pan, 1999).
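As an illustration of the optimistic stopping rule, the sketch below uses the Goel–Okumoto reliability growth model; the parameter values are invented for illustration, and in practice both would be fitted to the observed failure data:

```python
import math

def expected_failures(t, a, b):
    # Goel-Okumoto model: expected cumulative failures by test time t,
    # where a = total expected faults and b = fault-detection rate.
    return a * (1.0 - math.exp(-b * t))

def reliability(t, x, a, b):
    # Probability of observing no failure in the interval (t, t + x].
    return math.exp(-(expected_failures(t + x, a, b) - expected_failures(t, a, b)))

def should_stop(t, x, a, b, target):
    # Optimistic rule: stop testing once predicted reliability meets the target.
    return reliability(t, x, a, b) >= target

print(should_stop(t=100.0, x=10.0, a=50.0, b=0.05, target=0.95))  # not yet
print(should_stop(t=200.0, x=10.0, a=50.0, b=0.05, target=0.95))  # stop
```

Because failures grow rarer as testing proceeds, predicted reliability rises over time, and the rule eventually says stop even though some faults may remain.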
Test results are said to be fit to validate a program when they show that the program conforms to its initial requirements. Risk also has to be considered: risks must be prioritised and tested accordingly. For example, a banking program should have its risks ranked and tested in order of priority. (Source: Elaine Weyuker, Software Testing Basics – AT&T Labs – Research)
What is Inference?
Inference is the process of updating probabilities of outcomes based upon the relationships in the model and the evidence known about the situation at hand.
Bayesian evidence refers to using Bayes' theorem to calculate the probability that a hypothesis is true. When a Bayesian model is actually used, the end user applies evidence about recent events or observations. This information is applied to the model by “instantiating” or “clamping” a variable to a state that is consistent with the observation. The mathematical mechanics are then performed to update the probabilities of all the other variables that are connected to the variable representing the new evidence.
After inference, the updated probabilities reflect the new levels of belief in (or probabilities of) all possible outcomes coded in the model. These beliefs are mediated by the original assessment of belief performed by the author of the model.
The beliefs originally encoded in the model are known as prior probabilities, because they are entered before any evidence is known about the situation. The beliefs computed after evidence is entered are known as posterior probabilities, because they reflect the levels of belief computed in light of the new evidence. (Microsoft Research, 2010)
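Applied to the question in this report, a passing test is evidence for the hypothesis “this program is correct”. The sketch below runs the prior-to-posterior update for a sequence of passing tests; all of the probabilities are illustrative assumptions, not measured values:

```python
def posterior_correct(prior, p_pass_if_correct, p_pass_if_buggy, n_passes):
    # Bayes' theorem applied once per passing test:
    # P(correct | pass) = P(pass | correct) * P(correct) / P(pass).
    p = prior
    for _ in range(n_passes):
        numerator = p_pass_if_correct * p
        p = numerator / (numerator + p_pass_if_buggy * (1.0 - p))
    return p

# Assume a correct program always passes, while a buggy one still passes
# any given test 90% of the time. Each passing test raises our belief,
# but it approaches certainty only in the limit.
for n in (1, 10, 50):
    print(n, posterior_correct(prior=0.5, p_pass_if_correct=1.0,
                               p_pass_if_buggy=0.9, n_passes=n))
```

This mirrors the report's thesis: test results can confirm correctness only to a degree, with the posterior never reaching 1 unless a test could only be passed by a correct program.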
There are numerous real-world situations about which an analyst might wish to hypothesize and investigate, but it would be impractical to encode all of them explicitly in a support system for analysts. Instead, our approach is to represent fragments of situations and provide a mechanism for combining them into a wide variety of more complete ones (Michael et al, 2007). The combination occurs dynamically as evidence about a situation becomes available or as an analyst revises or enters new hypotheses. A fragment is represented as a Bayesian network with nodes for hypotheses, events, and evidence, and links for relating them. Our ability to combine the fragments into more complete situation models is dependent on having a consistent terminology in which the fragments are described. The focus of our work has been on
- Defining and representing the terminology, including terms of a domain and terms for evidence in that domain,
- Capturing new fragments from a variety of sources, and
- Incorporating the terminology and BN fragments into an integrated end-to-end tool.
(Michael et al, 2007)
So, having looked at the processes involved in obtaining test results, how inferences are drawn about the validity of a program, and to what extent programs should be tested, I can draw some conclusions.
Testing is quite expensive, and automated testing is a good way to cut down cost and time. The risk factor should also be weighed carefully when drawing a conclusion about the correctness of a program. Exhaustive testing is impossible; program complexity is the root of the problem, and at some point software testing has to stop and the product be shipped. This is decided largely by the time and budget of the project. Software is termed “correct” if the reliability estimate of the program conforms to the initial requirements.
In my opinion, the point where a line should be drawn to say “this program is correct” is where the reliability estimate of the software product meets the requirement. This should be the minimum amount of testing conducted. There are different kinds of programs and different levels of testing: critical programs, such as medical or aircraft software, should receive a higher priority of testing than word-processing software.
References
- Randall Hudson. (2009). Inductive/Deductive Proofs. Available: http://teachers.net/lessons/posts/2724.html. Last accessed 1st Dec
- John Daintith. (2004). Program Correctness Proof. Available: http://www.encyclopedia.com/doc/1O11-programcorrectnessproof.html. Last accessed 1st Dec 2010.
- Microsoft. (2010). Basics of Bayesian Inference and Belief Networks. Available: http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/msbnx/basics_of_bayesian_inference.htm. Last accessed 1st Dec 2010.
- Elaine Weyuker . (2010). Software Testing Basics-AT&T Labs – Research . Available: www.cs.rice.edu/FORTE02/weyuker1.ppt. Last accessed 3rd Dec 2010.
- Yang, M.C.K. and Chao, A. (1995). Reliability-estimation and stopping-rules for software testing, based on repeated appearances of bugs. IEEE Transactions on Reliability, 44(2), pp. 315-321.
- Michael N. Huhns and Marco G. Valtorta. (2007). Ontological Support for Bayesian Evidence Management. Available: http://www.cse.sc.edu/~mgv/papers/oic-07.pdf. Last accessed 3rd