% Source: http://github.com/JacquesCarette/GenCheck, file Thesis/formal_pbt.tex

%\section{Formally define property-based testing systems}

% for convenience
\newcommand{\hf}{\ensuremath{\mathtt{f}}\xspace}
\newcommand{\Bool}{\ensuremath{\mathtt{Bool}}}

Testing a Haskell function \hf using properties consists of four largely
independent steps:
\begin{enumerate}
\item construct a property $p$ that \hf should have,
\item generate test cases (valid inputs for \hf),
\item evaluate the property over (a selection of) test cases,
\item report the results.
\end{enumerate}
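As a rough illustration, the four steps can be sketched in Haskell as ordinary functions. This is a minimal sketch under our own assumptions: the names \texttt{prop\_revRev}, \texttt{genCases}, \texttt{evalCases} and \texttt{report} are hypothetical, test cases come from a fixed list rather than a real generator, and evaluation is plain function application.

```haskell
-- Step 1: a property that the function under test ('reverse') should have.
prop_revRev :: [Int] -> Bool
prop_revRev xs = reverse (reverse xs) == xs

-- Step 2: "generate" test cases (here, a small fixed list of inputs).
genCases :: [[Int]]
genCases = [[], [1], [1, 2], [3, 2, 1]]

-- Step 3: evaluate the property over a selection of the test cases.
evalCases :: (a -> Bool) -> [a] -> [(a, Bool)]
evalCases p = map (\x -> (x, p x))

-- Step 4: report the results, listing any counterexamples.
report :: Show a => [(a, Bool)] -> String
report rs =
  show (length rs - length failures) ++ " passed, "
    ++ show (length failures) ++ " failed"
    ++ concatMap (\x -> "\n  counterexample: " ++ show x) failures
  where
    failures = [x | (x, False) <- rs]
```

For instance, \texttt{putStrLn (report (evalCases prop\_revRev (take 3 genCases)))} prints \texttt{3 passed, 0 failed}; here \texttt{take 3} plays the role of selecting test cases.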

A \emph{test system} is a framework for performing these steps.  Since each
step admits a number of useful variations, such a framework is useful only
if it supports a variety of capabilities.  To better establish the
requirements for a \pbt, we first refine the definition and scope of each
step.

\subsection{Properties}

Recall that in section~\ref{pbt}, we defined an \emph{implementation} $M$ as a
computable \emph{interpretation} of a specification $S$, where the specification
defines a set of symbols and their semantics as a set of axioms.
The symbols are simply the names of the functions exported from a module.
The purpose of a \pbt is to supply evidence supporting or refuting
the correctness of this implementation.

A \emph{property} $p : \alpha \ra \Bool$ is the implementation (using $M$) of
a predicate $P : A \ra \boolean$ in the specification.  Note
that, as it is an implementation, it is necessarily computable.  Without
loss of generality, we can also restrict properties to be univariate, since a
property of several arguments can always be uncurried into a property of a
single tuple.
There is also a concretization step here, not only from predicates to
properties, but also from abstract data values ($\dom{P} \subseteq A$) to
concrete ones (values of type $\alpha$).  As is usual in computer science,
we assume that $A$ and $\alpha$ are isomorphic, and leave this isomorphism
implicit.  We can thus also talk about $\dom{p}$, which is a ``subset'' of the
values of type $\alpha$.
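To see why restricting to univariate properties loses no generality, consider the following Haskell sketch (the property names are hypothetical): a curried two-argument property becomes a property over pairs via the Prelude's \texttt{uncurry}, and \texttt{curry} recovers the original.

```haskell
-- A bivariate (curried) property: addition on Int is commutative.
prop_plusCommCurried :: Int -> Int -> Bool
prop_plusCommCurried x y = x + y == y + x

-- The same property made univariate over alpha = (Int, Int) by uncurrying.
prop_plusComm :: (Int, Int) -> Bool
prop_plusComm = uncurry prop_plusCommCurried

-- Going back is just as easy, so nothing is lost by the restriction.
prop_plusCommAgain :: Int -> Int -> Bool
prop_plusCommAgain = curry prop_plusComm
```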

From now on, we use the terms predicate and property interchangeably.
Furthermore, we will make some use of set notation over types; in the context
of PBT this is harmless, since all the data we represent is ultimately
concrete, and thus has a set-theoretic model.

We say that a property $p$ holds if it evaluates to $\True$ at every value in
its domain $\dom{p}$.  On $\alpha \setminus \dom{p}$, $p$ is not required to
hold, nor even to be defined.  If $p$ is partial, we call
it a \emph{conditional} property.  For these, a \pbt must take care not to
evaluate the property outside its domain.

If we have a (total) function $f : \alpha \ra \Bool$ such that
$f\,a = \True \Leftrightarrow a \in \dom{p}$, we call $f$ a
\emph{characteristic function} for $p$.  Such functions are very
useful for dealing with conditional properties, as they let a \pbt decide,
before evaluating $p$, whether a value lies in its domain.
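A small Haskell sketch of a conditional property together with a characteristic function for it. All names here are hypothetical; the property uses \texttt{insert} from \texttt{Data.List}, and \texttt{checkConditional} returns \texttt{Nothing} instead of applying the property outside its domain.

```haskell
import Data.List (insert)

-- Conditional property: inserting into a *sorted* list keeps it sorted.
-- Its domain is the pairs whose list component is already sorted.
prop_insertSorted :: (Int, [Int]) -> Bool
prop_insertSorted (x, xs) = isSorted (insert x xs)

-- Total characteristic function for the domain of prop_insertSorted.
inDomain :: (Int, [Int]) -> Bool
inDomain (_, xs) = isSorted xs

isSorted :: Ord a => [a] -> Bool
isSorted ys = and (zipWith (<=) ys (drop 1 ys))

-- Evaluate the property only inside its domain; Nothing means "skipped".
checkConditional :: (a -> Bool) -> (a -> Bool) -> a -> Maybe Bool
checkConditional inDom p x
  | inDom x   = Just (p x)
  | otherwise = Nothing
```

For example, \texttt{prop\_insertSorted (0, [2, 1])} evaluates to \texttt{False}, but this does not refute the property, since \texttt{[2, 1]} lies outside its domain; \texttt{checkConditional inDomain prop\_insertSorted (0, [2, 1])} returns \texttt{Nothing}.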

The collection of properties implementing the predicates of the specification
will be called the testable specification,
or just the specification where it is clear through context that
the reference is to the properties and not the abstract predicates.

\jacques{I deleted all the ``formal'' definitions, as they were completely
equivalent to what is already above.  The above is sufficiently formal.}

\subsection{Test Cases and Test Suites}

For this subsection, fix a predicate $p : \alpha \ra \Bool$.

Recall that a \emph{test case} contains a single value (the \emph{datum}) of
type $\alpha$, and that a \emph{test suite} is a collection of test
cases.

Each test case may include additional information (its \emph{meta-datum})
that can be used to improve the evaluation of test cases or the reporting of results.
The meta-datum can be arbitrary, but its type must be uniform over a test suite.
A datum will not necessarily be in the property's domain,
in which case the test case is characterized as \emph{invalid}.

Given a test case $\tau : \beta$, we assume that we have a projection function $\datum$
to extract its datum.

\begin{df}[Datum Function]
$$\datum : \beta \ra \alpha$$
\end{df}

\begin{df}[Valid Test Case]
A test case $\tau$ is \emph{valid} for a 
property $p$ if the datum is guaranteed to be in its domain:

$$ \datum(\tau) \in \dom{p} $$
\end{df}
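One possible Haskell representation, assuming a test case is simply a record of a meta-datum and a datum; the type \texttt{TestCase} and the function \texttt{isValid} are our own, hypothetical names. When a characteristic function for $p$ is available, validity becomes decidable, although a generator may instead guarantee validity by construction.

```haskell
-- beta = TestCase m a: a datum of type a plus a meta-datum of type m
-- (the meta-datum type is uniform across a suite). The record selector
-- 'datum' is the projection function.
data TestCase m a = TestCase { meta :: m, datum :: a }
  deriving Show

-- A test case is valid for p when its datum is in dom p; given a
-- characteristic function 'inDom' for p, this can be checked directly.
isValid :: (a -> Bool) -> TestCase m a -> Bool
isValid inDom = inDom . datum
```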

\subsection{Evaluation, Selection}

We first define what happens when a single test case is evaluated,
then introduce some notions relating to test suites,
and finally deal with actually running a test suite.

\subsubsection{Verdicts and Results}

The evaluation of a property at a datum does not merely produce a boolean
value.  Other data, such as the execution time or system resources used,
can also be tracked as part of the \emph{result} of running a test.
If the property evaluates to $\True$, we say that the
property holds (a successful test evaluation); if it evaluates to $\False$,
the test fails.  But this is not the only way in which a test can ``fail'':
it is also possible that the test case was not valid, or that the
property evaluation took too long.  Lastly, the framework may decide
not to evaluate every test case in a test suite, and this can be reported as well.

In other words, rather than just pass/fail, we define a finer-grained notion
of a \emph{verdict}, which encapsulates the outcome of a test case evaluation.

\begin{df}[Verdict]
A \emph{verdict} describes the result of a test evaluation.  We use the label
set $\verdictset = \{ \success, \fail, \nonterm, \invalid, \noteval \}$
as verdicts, which should be interpreted as follows:

\begin{description}
\item[$\success$] the property holds over a valid test case,
\item[$\fail$] the property does not hold over a valid test case,
\item[$\nonterm$] the property evaluation did not terminate in the allotted time,
\item[$\invalid$] the test case was not in the property's domain, so was not evaluated,
\item[$\noteval$] the test case was not evaluated and its validity is unknown.
\end{description}
\end{df}

Note that the verdict of $\nonterm$ can also be used to indicate that an
exception was raised during the evaluation of the test case,
assuming the evaluation function is capable of trapping it.

In general, different evaluation functions may be used; they mainly differ
in which verdicts they can return.  \emph{Unbounded} evaluation
assumes that $\nonterm$ is impossible; \emph{unconditional} evaluation
can be used when $\invalid$ is impossible, i.e., when every test case is known to be valid.
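The following Haskell sketch illustrates these variations under our own assumptions: \texttt{Verdict} is a hypothetical encoding of $\verdictset$, \texttt{evalPlain} is unbounded and unconditional, \texttt{evalCond} is unbounded but conditional, and \texttt{evalBounded} uses \texttt{timeout} from \texttt{System.Timeout} together with \texttt{try} and \texttt{evaluate} from \texttt{Control.Exception}, reporting both an expired time limit and a trapped exception as $\nonterm$.

```haskell
import Control.Exception (SomeException, evaluate, try)
import System.Timeout (timeout)

-- Hypothetical encoding of the verdict set.
data Verdict = Success | Fail | NonTerm | Invalid | NotEval
  deriving (Show, Eq)

-- Unbounded, unconditional: only Success or Fail are possible.
evalPlain :: (a -> Bool) -> a -> Verdict
evalPlain p x = if p x then Success else Fail

-- Unbounded, conditional: consult the characteristic function first,
-- so the property is never applied outside its domain.
evalCond :: (a -> Bool) -> (a -> Bool) -> a -> Verdict
evalCond inDom p x
  | inDom x   = evalPlain p x
  | otherwise = Invalid

-- Bounded: a time limit in microseconds. 'try' sits outside 'timeout'
-- so that it does not intercept the timeout's own exception.
evalBounded :: Int -> (a -> Bool) -> a -> IO Verdict
evalBounded micros p x = do
  r <- try (timeout micros (evaluate (p x)))
  pure (fromOutcome r)
  where
    fromOutcome :: Either SomeException (Maybe Bool) -> Verdict
    fromOutcome (Left _)             = NonTerm  -- exception trapped
    fromOutcome (Right Nothing)      = NonTerm  -- time limit exceeded
    fromOutcome (Right (Just True))  = Success
    fromOutcome (Right (Just False)) = Fail
```

Since \texttt{timeout} works by throwing an asynchronous exception, it may fail to interrupt a tight loop that never allocates; this sketch does not address that.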

A \emph{result} consists of a test case, the verdict of the evaluation of the
property for that case, and any additional information about the evaluation of
the property at that value.  It must be possible to extract the verdict from
a result.

\begin{df}[Verdict Function]
$$\verdict : \gamma \ra \verdictset$$
\noindent where $\gamma$ is a result type.
\end{df}
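In Haskell, a record selector can serve as the verdict function. A minimal sketch, in which \texttt{Result} and its fields are hypothetical and the additional evaluation information is left as a type parameter:

```haskell
data Verdict = Success | Fail | NonTerm | Invalid | NotEval
  deriving (Show, Eq)

-- gamma = Result tc info: a test case, its verdict, and any extra
-- evaluation information (timing, resource usage, ...).
data Result tc info = Result
  { testCase :: tc
  , verdict  :: Verdict   -- the selector doubles as the verdict function
  , evalInfo :: info
  } deriving Show
```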

\subsubsection{Test suites}

A test suite $T$ is a collection of test cases.  $T$ can be
considered to be a set of test cases, although we will rarely
implement it that way.  For concreteness, we will use $t$ to
denote a \emph{container}, $\beta$ a type of test cases,
and thus $t\,\beta$ will be the type of a test suite.  $t$
will generally be assumed to be \texttt{Traversable}, and hence also a
\texttt{Functor} and \texttt{Foldable}.

We can lift definitions previously made on test cases to test suites.
For example, a test suite is \emph{valid} if all the test cases it contains
are valid.
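Lifting is direct in Haskell: for any \texttt{Foldable} container, validity of a suite is \texttt{all} over its cases, and per-case projections such as $\datum$ lift with \texttt{fmap}. The names below are hypothetical.

```haskell
-- A suite is valid iff every test case in it is valid
-- (vacuously true for an empty suite).
validSuite :: Foldable t => (b -> Bool) -> t b -> Bool
validSuite validCase = all validCase

-- Other per-case functions lift with fmap, e.g. extracting every datum.
suiteData :: Functor t => (b -> a) -> t b -> t a
suiteData datumOf = fmap datumOf
```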

A test suite may be partitioned (and the parts labelled) to
assist in prioritizing the evaluation of test cases and in reporting the results.
The labels could be based on the test data, meta-data or
other evaluation information.

\begin{df}[Labelled Partition]
A labelled partition of a test suite $T$, with label set $\labels$, is an ordered
sequence of labelled test suites whose (disjoint) union is $T$:
$$ \Pi : \nat \ra (\labels, t\,\beta)$$
\end{df}
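As a sketch, assume the suite is a list and each test case is labelled by a function into an ordered label type. Then \texttt{groupAllWith} from \texttt{Data.List.NonEmpty} yields the blocks ordered by label, and position $i$ in the resulting list plays the role of $\Pi(i)$. The function name \texttt{labelledPartition} is hypothetical.

```haskell
import qualified Data.List.NonEmpty as NE

-- Group the cases by label (stable within each block), ordered by label.
-- Every case lands in exactly one block, so this is a partition.
labelledPartition :: Ord l => (b -> l) -> [b] -> [(l, [b])]
labelledPartition label cases =
  [ (label (NE.head grp), NE.toList grp) | grp <- NE.groupAllWith label cases ]
```

For example, \texttt{labelledPartition length [[1], [], [2, 3], [4]]} labels lists by their length, giving \texttt{[(0, [[]]), (1, [[1], [4]]), (2, [[2, 3]])]}.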
\noindent
 
The verdict of a test suite is determined by interpreting the collective
verdicts of its results; partial verdicts can also be determined for any part
of the result set, such as each block of a labelled partition.

\subsubsection{Execution of a suite}

While a test suite may be any collection of test cases, this does not mean that
every case will actually be run.  Different strategies can be used to decide
when to stop testing, for example:
\begin{itemize}
\item terminate testing when a fixed number of errors has been found (\QC, \SC),
\item terminate after a fixed period of time, or
\item complete all tests.
\end{itemize}
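The first strategy can be sketched as follows, assuming the suite is a list, evaluation is pure, and only $\fail$ verdicts count towards the limit; these are our own simplifications, and \texttt{runUntilFailures} is a hypothetical name. Cases after the limit receive $\noteval$ and, because the code simply maps them to \texttt{NotEval}, the evaluation function is never applied to them.

```haskell
data Verdict = Success | Fail | NonTerm | Invalid | NotEval
  deriving (Show, Eq)

-- Evaluate cases in order; once maxFails failures have been seen,
-- every remaining case is marked NotEval without being evaluated.
runUntilFailures :: Int -> (a -> Verdict) -> [a] -> [Verdict]
runUntilFailures maxFails eval = go 0
  where
    go _ [] = []
    go fails cases@(x : rest)
      | fails >= maxFails = map (const NotEval) cases
      | otherwise =
          let v = eval x
              fails' = if v == Fail then fails + 1 else fails
          in v : go fails' rest
```

With a limit of $1$ this mimics stopping at the first failure, and with \texttt{maxBound} it effectively completes all tests. A time-based limit would need \texttt{IO}, for instance via \texttt{System.Timeout}.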

We say that an execution of a test suite
is \emph{complete} if every test case was considered (even if some resulted
in $\invalid$ verdicts); \emph{partial} if some verdicts were $\noteval$;
and \emph{time limited} if some were $\nonterm$.  Being time limited can be
combined with either of the other two.
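These properties can be read directly off the verdicts. A hypothetical Haskell sketch, in which \texttt{complete} being \texttt{False} means the execution was partial:

```haskell
data Verdict = Success | Fail | NonTerm | Invalid | NotEval
  deriving (Show, Eq)

-- The two independent properties of an execution.
data Execution = Execution { complete :: Bool, timeLimited :: Bool }
  deriving (Show, Eq)

classifyExecution :: Foldable t => t Verdict -> Execution
classifyExecution vs = Execution
  { complete    = NotEval `notElem` vs   -- otherwise the execution is partial
  , timeLimited = NonTerm `elem` vs
  }
```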

\subsubsection{Summary Verdicts}

The verdict of a test suite is based on the combined verdicts of its individual results.
We do this by
\begin{itemize}
\item putting a total order on verdicts:
$$ \fail > \nonterm > \success > \invalid > \noteval,$$
\item taking the induced join-semilattice structure, whose join ($\vee$) is
the maximum with respect to this order, and
\item using the \texttt{Foldable} structure of the container to fold
the join ($\vee$) over the verdicts.
\end{itemize}
Since $\noteval$ is the least verdict, it is the identity of $\vee$, so the fold
is well defined even for an empty test suite.
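In Haskell, listing the constructors from least to greatest lets the derived \texttt{Ord} instance realize this order, and the \texttt{Max} monoid from \texttt{Data.Semigroup} (whose identity is \texttt{minBound}) performs the fold. A minimal sketch with a hypothetical \texttt{summary} function:

```haskell
import Data.Semigroup (Max (..))

-- Constructors in increasing order, so the derived Ord gives
-- NotEval < Invalid < Success < NonTerm < Fail.
data Verdict = NotEval | Invalid | Success | NonTerm | Fail
  deriving (Show, Eq, Ord, Bounded)

-- Fold the join (maximum) over any Foldable suite of verdicts;
-- an empty suite yields minBound, i.e. NotEval.
summary :: Foldable t => t Verdict -> Verdict
summary = getMax . foldMap Max
```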

\subsection{Reporting}

\jacques{you were doing the $4$ steps up to now, so logically you should 
be talking about reporting at this point.}

A report presents the results of a test suite in a way useful to the
(human) tester.
Reports provide the verdict of a test suite and some level of detail about the results,
organizing them in different ways depending on the goal of the report.
Test reports should highlight any failed (or non-terminating) test cases.

Report components should use the generic interface to
the results to get the verdicts and test case values.
Optional evaluation information should be provided by 
\emph{specializing} the results data structure, the report component,
and either the test case generator (for test metadata) or
the execution component (for test evaluation information).
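A minimal report component might see the results only through projections for the test case and the verdict, in the spirit of the generic interface described above. In the following hypothetical sketch, \texttt{report} counts each verdict and lists the $\fail$ and $\nonterm$ cases; passing the projections as arguments is our own design choice.

```haskell
data Verdict = Success | Fail | NonTerm | Invalid | NotEval
  deriving (Show, Eq, Enum, Bounded)

-- Counts per verdict (omitting zero counts), then the highlighted cases.
report :: Show a => (r -> a) -> (r -> Verdict) -> [r] -> String
report caseOf verdictOf results = unlines (counts ++ highlights)
  where
    counts =
      [ show v ++ ": " ++ show n
      | v <- [minBound .. maxBound]
      , let n = length (filter ((== v) . verdictOf) results)
      , n > 0 ]
    highlights =
      [ "  " ++ show (verdictOf r) ++ " on " ++ show (caseOf r)
      | r <- results, verdictOf r `elem` [Fail, NonTerm] ]
```

For example, \texttt{report fst snd [(1, Success), (2, Fail), (3, Success)]} produces the lines \texttt{Success: 2}, \texttt{Fail: 1} and an indented \texttt{Fail on 2}.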

\section{Test Systems}

\jacques{Made this into a section, as it is not about the $4$ steps anymore}

A test system provides functionality beyond the $4$ steps described above.
In particular, it can

\begin{enumerate}
\item obtain actual test cases from a generator,
\item schedule the evaluation of test cases,
\item perform evaluation of test cases (with potential early termination),
\item organize test results,
\item report verdict and optionally a summary and/or details of the results.
\end{enumerate}

\noindent
These components should adhere to a common interface
but can be specialized to exchange additional information.

Scheduling determines the order in which the test cases of a test suite are evaluated.
Scheduling may be implicit,
such as when applying |map| with an evaluation function over the test suite container.
For example,
\QC, \SC and the other packages reviewed in section \ref{pbt_soft}
generate and evaluate test cases sequentially,
terminating on the first error (by default).
On distributed systems, scheduling would include
allocating evaluations to different nodes
and collecting the results into a common result set.
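The difference between implicit and explicit sequential scheduling can be sketched in Haskell: \texttt{fmap} leaves the order of evaluation to lazy evaluation, while \texttt{traverse} in \texttt{IO} runs the evaluations left to right. The \texttt{IORef} log below exists only to make that order observable; both function names are hypothetical.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Implicit scheduling: results are computed whenever they are demanded.
scheduleImplicit :: Functor t => (b -> v) -> t b -> t v
scheduleImplicit = fmap

-- Explicit sequential scheduling: 'traverse' runs each IO evaluation in
-- left-to-right order, logging each case (most recent first) as it starts.
scheduleSequential :: Traversable t => IORef [b] -> (b -> IO v) -> t b -> IO (t v)
scheduleSequential logRef eval = traverse step
  where
    step x = do
      modifyIORef' logRef (x :)
      eval x
```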

Test \emph{results} should be considered distinct from \emph{reports}, as discussed above.