   1%template for producing IEEE-format articles using LaTeX.
   2%written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
   3%use at your own risk.  Complaints to /dev/null.
   4%make two column with no page numbering, default is 10 point
   5%\documentstyle{article}
   6\documentstyle[twocolumn,times]{article}
   7\pagestyle{empty}
   8
   9%set dimensions of columns, gap between columns, and space between paragraphs
  10%\setlength{\textheight}{8.75in}
  11\setlength{\textheight}{9.0in}
  12\setlength{\columnsep}{0.25in}
  13\setlength{\textwidth}{6.45in}
  14\setlength{\footheight}{0.0in}
  15\setlength{\topmargin}{0.0in}
  16\setlength{\headheight}{0.0in}
  17\setlength{\headsep}{0.0in}
  18\setlength{\oddsidemargin}{0in}
  19%\setlength{\oddsidemargin}{-.065in}
  20%\setlength{\oddsidemargin}{-.17in}
  21%\setlength{\parindent}{0pc}
  22
  23%I copied stuff out of art10.sty and modified them to conform to IEEE format
  24
  25\makeatletter
  26%as Latex considers descenders in its calculation of interline spacing,
  27%to get 12 point spacing for normalsize text, must set it to 10 points
  28\def\@normalsize{\@setsize\normalsize{12pt}\xpt\@xpt
  29\abovedisplayskip 10pt plus2pt minus5pt\belowdisplayskip \abovedisplayskip
  30\abovedisplayshortskip \z@ plus3pt\belowdisplayshortskip 6pt plus3pt
  31minus3pt\let\@listi\@listI} 
  32
  33%need an 11 pt font size for subsection and abstract headings
  34\def\subsize{\@setsize\subsize{12pt}\xipt\@xipt}
  35
  36%make section titles bold and 12 point, 2 blank lines before, 1 after
  37\def\section{\@startsection {section}{1}{\z@}{24pt plus 2pt minus 2pt}
  38{12pt plus 2pt minus 2pt}{\large\bf}}
  39
  40%make subsection titles bold and 11 point, 1 blank line before, 1 after
  41\def\subsection{\@startsection {subsection}{2}{\z@}{12pt plus 2pt minus 2pt}
  42{12pt plus 2pt minus 2pt}{\subsize\bf}}
  43\makeatother
  44
  45\newcommand{\ignore}[1]{}
  46%\renewcommand{\thesubsection}{\arabic{subsection}.}
  47
  48\begin{document}
  49
  50%don't want date printed
  51\date{}
  52
  53%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
  54\title{\Large \bf   An Embedded Error Recovery and Debugging Mechanism for Scripting Language Extensions}
  55
  56%for single author (just remove % characters)
  57\author{{David M.\ Beazley} \\
  58{\em Department of Computer Science} \\
  59{\em University of Chicago }\\
  60{\em Chicago, Illinois 60637 }\\
  61{\em beazley@cs.uchicago.edu }}
  62
  63%  My Department \\
  64%  My Institute \\
  65%  My City, ST, zip}
  66 
  67%for two authors (this is what is printed)
  68%\author{\begin{tabular}[t]{c@{\extracolsep{8em}}c}
  69%  Roscoe Giles	                        & Pablo Tamayo \\
  70% \\
  71%  Department of Electrical, Computer,   & Thinking Machines Corp. \\
  72%  and Systems Engineering               & Cambridge, MA~~02142.  \\
  73%  and                                   & \\
  74%  Center for Computational Science      & \\
  75%  Boston University, Boston, MA~~02215. & 
  76%\end{tabular}}
  77
  78\maketitle
  79
  80%I don't know why I have to reset thispagesyle, but otherwise get page numbers
  81\thispagestyle{empty}
  82
  83
  84\subsection*{Abstract}
  85{\em
  86In recent years, scripting languages such as Perl, Python, and Tcl
  87have become popular development tools for the creation of
  88sophisticated application software.  One of the most useful features
  89of these languages is their ability to easily interact with compiled
  90languages such as C and C++.  Although this mixed language approach
  91has many benefits, one of the greatest drawbacks is the complexity of
  92debugging that results from using interpreted and compiled code in the
  93same application.  In part, this is due to the fact that scripting
  94language interpreters are unable to recover from catastrophic errors
  95in compiled extension code. Moreover, traditional C/C++ debuggers
  96do not provide a satisfactory degree of integration with interpreted
  97languages.  This paper describes an experimental system in which fatal
  98extension errors such as segmentation faults, bus errors, and failed
  99assertions are handled as scripting language exceptions.  This system,
 100which has been implemented as a general purpose shared library,
 101requires no modifications to the target scripting language, introduces
 102no performance penalty, and simplifies the debugging of mixed
 103interpreted-compiled application software.
 104}
 105
 106\section{Introduction}
 107
 108Slightly more than ten years have passed since John Ousterhout
 109introduced the Tcl scripting language at the 1990 USENIX technical
 110conference \cite{ousterhout}.  Since then, scripting languages have
  111been gaining in popularity as evidenced by the widespread use of
 112systems such as Tcl, Perl, Python, Guile, PHP, and Ruby
 113\cite{ousterhout,perl,python,guile,php,ruby}.
 114
 115In part, the success of modern scripting languages is due to their
 116ability to be easily integrated with software written in compiled
 117languages such as C, C++, and Fortran.  In addition, a wide variety of wrapper
 118generation tools can be used
 119to automatically produce bindings between existing code and a
 120variety of scripting language environments
 121\cite{swig,sip,pyfort,f2py,advperl,heidrich,vtk,gwrap,wrappy}.  As a result, a large number of
 122programmers are now using scripting languages to control
 123complex C/C++ programs or as a tool for re-engineering legacy
 124software.  This approach is attractive because it allows programmers
 125to benefit from the flexibility and rapid development of
 126scripting while retaining the best features of compiled code such as high
 127performance \cite{ouster1}.
 128
 129A critical aspect of scripting-compiled code integration is the way in
 130which it departs from traditional C/C++ development and shell
 131scripting.  Rather than building stand-alone applications that run as
 132separate processes, extension programming encourages a style of
 133programming in which components are tightly integrated within 
 134an interpreter that is responsible for high-level control.
 135Because of this, scripted software tends to rely heavily
 136upon shared libraries, dynamic loading, scripts, and
 137third-party extensions. In this sense, one might argue that the
 138benefits of scripting are achieved at the expense of creating a
 139more complicated development environment.
 140
 141A consequence of this complexity is an increased degree of difficulty
 142associated with debugging programs that utilize multiple languages,
 143dynamically loadable modules, and a sophisticated runtime environment.
 144To address this problem, this paper describes an experimental system
 145known as WAD (Wrapped Application Debugger) in which an embedded error
 146reporting and debugging mechanism is added to common scripting
 147languages.  This system converts catastrophic signals such as
 148segmentation faults and failed assertions to exceptions that can be
 149handled by the scripting language interpreter.  In doing so, it
 150provides more seamless integration between error handling in
 151scripting language interpreters and compiled extensions. 
 152
 153\section{The Debugging Problem}
 154
 155Normally, a programming error in a scripted application 
 156results in an exception that describes the problem and the context in
 157which it occurred.  For example, an error in a Python script might
 158produce a traceback similar to the following:
 159
 160\begin{verbatim}
 161% python foo.py
 162Traceback (innermost last):
 163  File "foo.py", line 11, in ?
 164    foo()
 165  File "foo.py", line 8, in foo
 166    bar()
 167  File "foo.py", line 5, in bar
 168    spam()
 169  File "foo.py", line 2, in spam
 170    doh()
 171NameError: doh
 172\end{verbatim}
 173
 174In this case, a programmer might be able to apply a fix simply based
 175on information in the traceback.  Alternatively, if the problem is
 176more complicated, a script-level debugger can be used to provide more
 177information.  In contrast, a failure in compiled extension code might
 178produce the following result:
 179
 180\begin{verbatim}
 181% python foo.py
 182Segmentation Fault (core dumped)
 183\end{verbatim}
 184
  185In this case, the user has no idea of what has happened other than that it
 186appears to be ``very bad.''  Furthermore, script-level debuggers are
 187unable to identify the problem since they also crash when the error
 188occurs (they run in the same process as the interpreter).  This means
 189that the only way for a user to narrow the source of the problem
 190within a script is through trial-and-error techniques such as
 191inserting print statements, commenting out sections of scripts, or
 192having a deep intuition of the underlying implementation. Obviously,
 193none of these techniques are particularly elegant.
 194
 195An alternative approach is to run the application under the control of
 196a traditional debugger such as gdb \cite{gdb}.  Although this provides
 197some information about the error, the debugger mostly provides
 198detailed information about the internal implementation of the
 199scripting language interpreter instead of the script-level code that
 200was running at the time of the error.  Needless to say, this information 
 201isn't very useful to most programmers.
 202A related problem is that
 203the structure of a scripted application tends to be much more complex
 204than a traditional stand-alone program.  As a result, a user may not
 205have a good sense of how to actually attach an external debugger to their
 206script.  In addition, execution may occur within a
 207complex run-time environment involving events, threads, and network
 208connections.  Because of this, it can be difficult for the user to reproduce
 209and identify certain types of catastrophic errors if they depend on
 210timing or unusual event sequences. Finally, this approach
 211requires a programmer to have a C development environment installed on
 212their machine.  Unfortunately, this may not hold in practice.
 213This is because scripting languages are often used to provide programmability to
 214applications where end-users write scripts, but do not write low-level C code.
 215
 216Even if a traditional debugger such as gdb were modified to provide
 217better integration with scripting languages, it is not clear that this
 218would be the most natural solution to the problem.  For one, 
 219having to run a separate debugging process to debug
 220extension code is unnatural when no such requirement exists for
 221scripts.  Moreover, even if such a debugger existed, an
 222inexperienced user may not have the expertise or inclination to use
 223it.  Finally, obscure fatal errors may occur long after an application
 224has been deployed.  Unless the debugger is distributed along with the
  225application in some manner, it will be extraordinarily difficult to
 226obtain useful diagnostics when such errors occur.
 227
 228\begin{figure*}[t]
 229{\small
 230\begin{verbatim}
 231% python foo.py
 232Traceback (most recent call last):
 233  File "<stdin>", line 1, in ?
 234  File "foo.py", line 16, in ?
 235    foo()
 236  File "foo.py", line 13, in foo
 237    bar()
 238  File "foo.py", line 10, in bar
 239    spam()
 240  File "foo.py", line 7, in spam
 241    doh.doh(a,b,c)
 242
 243SegFault: [ C stack trace ]
 244
 245#2 0x00027774 in call_builtin(func=0x1c74f0,arg=0x1a1ccc,kw=0x0) in 'ceval.c',line 2650
 246#1 0xff083544 in _wrap_doh(self=0x0,args=0x1a1ccc) in 'foo_wrap.c',line 745
 247#0 0xfe7e0568 in doh(a=3,b=4,c=0x0) in 'foo.c',line 28
 248
 249/u0/beazley/Projects/WAD/Python/foo.c, line 28
 250
 251    int doh(int a, int b, int *c) {
 252 =>   *c = a + b;
 253      return *c;
 254    }
 255\end{verbatim}
 256}
 257\caption{Cross language traceback generated by WAD for a segmentation fault in a Python extension}
 258\end{figure*}
 259
 260The current state of the art in extension debugging is to simply add
 261as much error checking as possible to extension modules. This is never
 262a bad thing to do, but in practice it's usually not enough to
 263eliminate every possible problem.  For one, scripting languages are
 264sometimes used to control hundreds of thousands to millions of lines
 265of compiled code.  In this case, it is improbable that a programmer will
 266foresee every conceivable error.  In addition, scripting languages are
 267often used to put new user interfaces on legacy software. In this
 268case, scripting may introduce new modes of execution that cause a
 269formerly ``bug-free'' application to fail in an unexpected manner.
 270Finally, certain types of errors such as floating-point exceptions can
 271be particularly difficult to eliminate because they might be generated
 272algorithmically (e.g., as the result of instability in a numerical
 273method). Therefore, even if a programmer has worked hard to eliminate
 274crashes, there is usually a small probability that an application may
 275fail under unusual circumstances.
 276
 277\section{Embedded Error Reporting}
 278
 279Rather than modifying an existing debugger to support scripting
 280languages, an alternative approach is to add a more powerful error
 281handling and reporting mechanism to the scripting language
 282interpreter.  We have implemented this approach in the form of an
  283experimental system known as WAD.  WAD is packaged as a dynamically
 284loadable shared library that can either be loaded as a scripting
 285language extension module or linked to existing extension modules as a
 286library.  The core of the system is generic and requires no
 287modifications to the scripting interpreter or existing extension
 288modules.  Furthermore, the system does not introduce a performance
 289penalty as it does not rely upon program instrumentation or tracing.
 290
 291WAD works by converting fatal signals such as SIGSEGV,
 292SIGBUS, SIGFPE, and SIGABRT into scripting language exceptions that contain
 293debugging information collected from the call-stack of compiled
 294extension code.  By handling errors in this manner, the scripting
 295language interpreter is able to produce a cross-language stack trace that
 296contains information from both the script code and extension code as
 297shown for Python and Tcl/Tk in Figures 1 and 2.  In this case, the user
 298is given a very clear idea of what has happened without having
 299to launch a separate debugger. 
 300
  301The advantage of this approach is that it provides more seamless
 302integration between error handling in scripts and error handling in
 303extensions.  In addition, it eliminates the most common debugging step
 304that a developer is likely to perform in the event of a fatal
  305error: running a separate debugger on a core file and typing {\tt where}
 306to get a stack trace.  Finally, this allows end-users to provide
 307extension writers with useful debugging information since they can
 308supply a stack trace as opposed to a vague complaint that the program
 309``crashed.''
 310
 311\begin{figure*}[t]
 312\begin{picture}(400,250)(0,0)
 313\put(50,-110){\special{psfile = tcl.ps hscale = 60 vscale = 60}}
 314\end{picture}
 315\caption{Dialog box with WAD generated traceback information for a failed assertion in a Tcl/Tk extension}
 316\end{figure*}
 317
 318\section{Scripting Language Internals}
 319
 320In order to provide embedded error recovery, it is critical to understand how
 321scripting language interpreters interface with extension code.  Despite the wide variety
 322of scripting languages, essentially every implementation uses a similar
 323technique for accessing foreign code.  
 324
 325Virtually all scripting languages provide an extension mechanism in the form of a foreign function
 326interface in which compiled procedures can be called from the scripting language
 327interpreter. This is accomplished by writing a collection of wrapper functions that conform
  328to a specified calling convention. The primary purpose of the wrappers is to
 329marshal arguments and return values between the two languages and to handle errors.
 330For example, in Tcl, every wrapper
 331function must conform to the following prototype:
 332
 333\begin{verbatim}
 334int 
 335wrap_foo(ClientData clientData,
 336         Tcl_Interp *interp,
 337         int objc,
 338         Tcl_Obj *CONST objv[])
 339{
 340    /* Convert arguments */
 341    ...
 342    /* Call a function */
 343
 344    result = foo(args);
 345    /* Set result */
 346    ...
 347    if (success) {
 348        return TCL_OK;
 349    } else {
 350        return TCL_ERROR;
 351    }
 352}
 353\end{verbatim}
 354
 355Another common extension mechanism is an object/type interface that allows programmers to create new
 356kinds of fundamental types or attach special properties to objects in
 357the interpreter.  For example, both Tcl and Python provide an API for creating new 
 358``built-in'' objects that behave like numbers, strings, lists, etc.  
 359In most cases, this involves setting up tables of function
 360pointers that define various properties of an object.  For example, if
 361you wanted to add complex numbers to an interpreter, you might fill in a special
 362data structure with pointers to methods that implement various numerical operations like this:
 363
 364\begin{verbatim}
  365NumberMethods ComplexMethods = {
 366    complex_add,
 367    complex_sub,
 368    complex_mul,
 369    complex_div,
 370    ...
 371};\end{verbatim}
 372
 373\noindent
 374Once registered with the interpreter, the methods in this structure
 375would be invoked by various interpreter operators such as $+$,
 376$-$, $*$, and $/$.
 377
 378Most interpreters handle errors as a two-step process in which
 379detailed error information is first registered with the interpreter
 380and then a special error code is returned. For example, in Tcl, errors
 381are handled by setting error information in the interpreter and
 382returning a value of TCL\_ERROR.  Similarly in Python, errors are
 383handled by calling a special function to raise an exception and returning NULL.  In both cases,
 384this triggers the interpreter's error handler---possibly resulting in
 385a stack trace of the running script.  In some cases, an interpreter
 386might handle errors using a form of the C {\tt longjmp} function. 
 387For example, Perl provides a special function {\tt die} that jumps back
 388to the interpreter with a fatal error \cite{advperl}.
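
The two-step convention described above can be sketched in plain C
without reference to any particular interpreter.  The {\tt sketch\_}
names below are hypothetical and only mimic the Tcl style of recording
a message and then returning an error code; they are not part of Tcl,
Python, or WAD.

{\small
\begin{verbatim}
#include <stdio.h>

enum { SKETCH_OK = 0, SKETCH_ERROR = 1 };

/* Hypothetical interpreter state. */
struct sketch_interp {
    char   errmsg[64];
    double result;
};

/* Step 1: register the error details. */
static void
sketch_set_error(struct sketch_interp *ip,
                 const char *msg)
{
    snprintf(ip->errmsg, sizeof ip->errmsg,
             "%s", msg);
}

/* Wrapper for a division routine. */
int
sketch_wrap_div(struct sketch_interp *ip,
                double a, double b)
{
    if (b == 0.0) {
        sketch_set_error(ip,
                         "division by zero");
        /* Step 2: return the error code. */
        return SKETCH_ERROR;
    }
    ip->errmsg[0] = '\0';
    ip->result = a / b;
    return SKETCH_OK;
}
\end{verbatim}
}

\noindent
A caller that receives {\tt SKETCH\_ERROR} would then consult
{\tt errmsg} to report the problem, much as an interpreter consults
its own error state after a wrapper fails.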
 389
 390The precise implementation details of these mechanisms aren't so
 391important for our discussion.  The critical point is that scripting
  392languages always access extension code through a well-defined interface
 393that precisely defines how arguments are to be passed, values are to be
 394returned, and errors are to be handled.
 395
 396\section{Scripting Languages and Signals}
 397
 398Under normal circumstances, errors in extension code are handled
 399through the error-handling API provided by the scripting language
 400interpreter.  For example, if an invalid function parameter is passed,
 401a program can simply set an error message and return to the
 402interpreter.  Similarly, automatic wrapper generators such as SWIG can produce
 403code to convert C++ exceptions and other C-related error handling
 404schemes to scripting language errors \cite{swigexcept}. On the other
 405hand, segmentation faults, failed assertions, and similar problems
 406produce signals that cause the interpreter to abort execution.
 407
 408Most scripting languages provide limited support for Unix signal
 409handling \cite{stevens}.  However, this support is not sufficiently advanced to
 410recover from fatal signals produced by extension code.
 411Unlike signals generated for asynchronous events such as I/O,
 412execution can {\em not} be resumed at the point of a fatal signal.
 413Therefore, even if such a signal could be caught and handled by a script,
 414there isn't much that it can do except to print a diagnostic
 415message and abort before the signal handler returns.  In addition,
 416some interpreters block signal delivery while executing
 417extension code--opting to handle signals at a time when it is more convenient.
 418In this case, a signal such as SIGSEGV would simply cause the whole application
 419to freeze since there is no way for execution to continue to a point where
 420the signal could be delivered.  Thus, scripting languages tend to 
 421either ignore the problem or label it as a ``limitation.''
 422
 423\section{Overview of WAD}
 424
 425WAD installs a signal handler for SIGSEGV, SIGBUS, SIGABRT, SIGILL,
 426and SIGFPE using the {\tt sigaction} function
 427\cite{stevens}. Furthermore, it uses a special option (SA\_SIGINFO) of
 428signal handling that passes process context information to the signal
 429handler when a signal occurs. Since none of these signals are normally used in the
 430implementation of the scripting interpreter or by user scripts,
 431this does not usually override any previous signal handling.
 432Afterwards, when one of these signals occurs, a two-phase recovery
 433process executes. First, information is collected about the execution
 434context including a full stack-trace, symbol table entries, and
 435debugging information.  Then, the current stream of execution is
 436aborted and an error is returned to the interpreter.  This process is
 437illustrated in Figure~3.
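
As a minimal sketch of the installation step, the following code
registers one three-argument handler for the five signals listed above
using {\tt sigaction} and {\tt SA\_SIGINFO}.  The {\tt sketch\_} names
are hypothetical, and the handler only records the signal number; it
performs none of the stack inspection or recovery that WAD carries out.

{\small
\begin{verbatim}
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Last signal seen by the handler. */
static volatile sig_atomic_t sketch_last_sig;

/* With SA_SIGINFO the handler receives
   three arguments.  info->si_addr holds
   the faulting address for SIGSEGV and
   SIGBUS; ctx points to a ucontext_t
   with the saved CPU context. */
static void
sketch_handler(int sig, siginfo_t *info,
               void *ctx)
{
    (void)info;
    (void)ctx;
    sketch_last_sig = sig;
}

/* Returns 0 on success, -1 on failure. */
int
sketch_install(void)
{
    static const int sigs[] = {
        SIGSEGV, SIGBUS, SIGABRT,
        SIGILL, SIGFPE
    };
    size_t n = sizeof sigs / sizeof sigs[0];
    size_t i;
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sketch_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < n; i++) {
        if (sigaction(sigs[i], &sa, NULL))
            return -1;
    }
    return 0;
}
\end{verbatim}
}

\noindent
Calling {\tt sketch\_install()} and then {\tt raise(SIGABRT)} runs the
handler once.  A handler that merely returns is not sufficient after a
real hardware fault, since the faulting instruction would simply be
executed again; this is why a separate recovery phase is needed.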
 438
 439The collection of context and debugging information involves the
 440following steps:
 441
 442\begin{itemize}
 443\item The program counter and stack pointer are obtained from 
 444context information passed to the signal handler.
 445
 446\item The virtual memory map of the process is obtained from /proc
 447and used to associate virtual memory addresses with executable files,
 448shared libraries, and dynamically loaded extension modules \cite{proc}.
 449
 450\item The call stack is unwound to collect traceback information.
 451At each step of the stack traceback, symbol table and debugging
 452information is gathered and stored in a generic data structure for later use
 453in the recovery process.  This data is obtained by memory-mapping
 454the object files associated with the process and extracting
 455symbol table and debugging information. 
 456\end{itemize}
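
The second step above can be sketched as follows, assuming a Linux
system where the memory map is available as text in
{\tt /proc/self/maps}; other systems expose the map in a different
form.  The function name is hypothetical and the code is an
illustration of the lookup, not WAD's implementation.

{\small
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Find the mapping that contains addr and
   copy its backing path (empty for
   anonymous memory) into path.
   Returns 1 if found, 0 if not, and -1 if
   the map cannot be opened. */
int
sketch_find_mapping(uintptr_t addr,
                    char *path, size_t len)
{
    FILE *f;
    char line[512];
    int found = 0;

    f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f)) {
        unsigned long lo, hi;
        char name[256] = "";
        /* start-end perms off dev ino path */
        int n = sscanf(line,
            "%lx-%lx %*s %*s %*s %*s"
            " %255[^\n]",
            &lo, &hi, name);
        if (n >= 2 && addr >= lo
                   && addr < hi) {
            snprintf(path, len, "%s", name);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
\end{verbatim}
}

\noindent
Given the program counter of a stack frame, the returned path names
the executable or shared library whose symbol table and debugging
information would be examined next.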
 457
 458Once debugging information has been collected, the signal handler
 459enters an error-recovery phase that
 460attempts to raise a scripting exception and return to a suitable location in the 
 461interpreter.  To do this, the following steps are performed:
 462
 463\begin{itemize}
 464
 465\item The stack trace is examined to see if there are any locations in the interpreter
 466to which control can be returned. 
 467
 468\item If a suitable return location is found, the CPU context is modified in
 469a manner that makes the signal handler return to the interpreter
 470with an error.  This return process is assisted by a small
 471trampoline function (partially written in assembly language) that arranges a proper
 472return to the interpreter after the signal handler returns.
 473\end{itemize}
 474
 475\noindent
  476Of the two phases, the first is the more straightforward to implement
 477because it involves standard Unix API functions and common file formats such
 478as ELF and stabs \cite{elf,stabs}.   On the other hand, the recovery phase in
 479which control is returned to the interpreter is of greater interest.  Therefore, 
 480it is now described in greater detail.
 481
 482\begin{figure*}[t]
 483\begin{picture}(480,340)(5,60)
 484
 485\put(50,330){\framebox(200,70){}}
 486\put(60,388){\small \tt >>> {\bf foo()}}
 487\put(60,376){\small \tt Traceback (most recent call last):}
 488\put(70,364){\small \tt   File "<stdin>", line 1, in ?}
 489\put(60,352){\small \tt SegFault: [ C stack trace ]}
 490\put(60,340){\small \tt ...}
 491
 492\put(55,392){\line(-1,0){25}}
 493\put(30,392){\line(0,-1){80}}
 494\put(30,312){\line(1,0){95}}
 495\put(125,312){\vector(0,-1){10}}
 496\put(175,302){\line(0,1){10}}
 497\put(175,312){\line(1,0){95}}
 498\put(270,312){\line(0,1){65}}
 499\put(270,377){\vector(-1,0){30}}
 500
 501\put(50,285){\framebox(200,15)[c]{[Python internals]}}
 502\put(125,285){\vector(0,-1){10}}
 503\put(175,275){\vector(0,1){10}}
 504\put(50,260){\framebox(200,15)[c]{call\_builtin()}}
 505\put(125,260){\vector(0,-1){10}}
 506%\put(175,250){\vector(0,1){10}}
 507\put(50,235){\framebox(200,15)[c]{wrap\_foo()}}
 508\put(125,235){\vector(0,-1){10}}
 509\put(50,210){\framebox(200,15)[c]{foo()}}
 510\put(125,210){\vector(0,-1){10}}
 511\put(50,185){\framebox(200,15)[c]{doh()}}
 512\put(125,185){\vector(0,-1){20}}
 513\put(110,148){SIGSEGV}
 514\put(160,152){\vector(1,0){100}}
 515\put(260,70){\framebox(200,100){}}
 516\put(310,155){WAD signal handler}
 517\put(265,140){1. Unwind C stack}
 518\put(265,125){2. Gather symbols and debugging info}
 519\put(265,110){3. Find safe return location}
 520\put(265,95){4. Raise Python exception}
 521\put(265,80){5. Modify CPU context and return}
 522
 523\put(260,185){\framebox(200,15)[c]{return assist}}
 524\put(365,174){Return from signal}
 525\put(360,170){\vector(0,1){15}}
 526\put(360,200){\line(0,1){65}}
 527
 528%\put(360,70){\line(0,-1){10}}
 529%\put(360,60){\line(1,0){110}}
 530%\put(470,60){\line(0,1){130}}
 531%\put(470,190){\vector(-1,0){10}}
 532
 533\put(360,265){\vector(-1,0){105}}
 534\put(255,250){NULL}
 535\put(255,270){Return to interpreter}
 536
 537\end{picture}
 538
 539\caption{Control Flow of the Error Recovery Mechanism for Python}
 540\label{wad}
 541\end{figure*}
 542
 543\section{Returning to the Interpreter}
 544
 545To return to the interpreter, WAD maintains a table of symbolic names
 546that correspond to locations within the interpreter
 547responsible for invoking wrapper functions and object/type methods.
 548For example, Table 1 shows a partial list of return locations used in
 549the Python implementation.  When an error occurs, the call stack is
 550scanned for the first occurrence of any symbol in this table.  If a
 551match is found, control is returned to that location by emulating the
 552return of a wrapper function with the error code from the table. If no
 553match is found, the error handler simply prints a stack trace to
 554standard output and aborts.
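
A minimal sketch of this lookup is shown below.  The structure and
function names are hypothetical, and the call stack is assumed to be
available as an array of symbol names with the innermost function
first; the table entries are copied from Table 1.

{\small
\begin{verbatim}
#include <stddef.h>
#include <string.h>

/* Interpreter symbol and the value that
   signals an error when returned from it
   (0 stands for NULL). */
struct sketch_retpoint {
    const char *symbol;
    long        error_value;
};

/* A few entries copied from Table 1. */
static const struct sketch_retpoint
sketch_table[] = {
    { "call_builtin",            0 },
    { "PyObject_Print",         -1 },
    { "PyObject_CallFunction",   0 },
    { "PyObject_GetAttrString",  0 }
};

/* frames[0] is the innermost function.
   Returns the index of the first frame
   found in the table (storing the entry
   in *hit), or -1 if there is none. */
int
sketch_find_return(
    const char *const *frames, size_t n,
    const struct sketch_retpoint **hit)
{
    size_t i, j;
    size_t m = sizeof sketch_table
             / sizeof sketch_table[0];

    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            if (strcmp(frames[i],
                sketch_table[j].symbol) == 0) {
                *hit = &sketch_table[j];
                return (int)i;
            }
        }
    }
    return -1;
}
\end{verbatim}
}

\noindent
For the stack in Figure 1 ({\tt doh}, {\tt \_wrap\_doh},
{\tt call\_builtin}), the search stops at {\tt call\_builtin} and
reports NULL as the value to return.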
 555
 556When a symbolic match is found, WAD invokes a special user-defined
 557handler function that is written for a specific scripting language.
 558The primary role of this handler is to take debugging information
 559gathered from the call stack and generate an appropriate scripting
 560language error.  One peculiar problem of this step is that the
 561generation of an error may require the use of parameters passed to a
 562wrapper function.  For example, in the Tcl wrapper shown earlier, one
 563of the arguments was an object of type ``{\tt Tcl\_Interp *}''.  This
 564object contains information specific to the state of the interpreter
 565(and multiple interpreter objects may exist in a single application).
 566Unfortunately, no reference to the interpreter object is available in the
  567signal handler, nor is a reference to the interpreter guaranteed to exist in
 568the context of a function that generated the error.
 569
 570To work around this problem, WAD implements a feature
 571known as argument stealing.  When examining the call-stack, the signal
 572handler has full access to all function arguments and local variables of each function
 573on the stack.
 574Therefore, if the handler knows that an error was generated while
 575calling a wrapper function (as determined by looking at the symbol names),
 576it can grab the interpreter object from the stack frame of the wrapper and
 577use it to set an appropriate error code before returning to the interpreter.
 578Currently, this is managed by allowing the signal handler to steal
 579arguments from the caller using positional information.
 580For example, to grab the {\tt Tcl\_Interp *} object from a Tcl wrapper function,
 581code similar to the following is written:
 582
 583\begin{verbatim}
 584Tcl_Interp *interp;
 585int         err;
 586
 587interp = (Tcl_Interp *)
 588  wad_steal_outarg(
 589           stack,                
 590           "TclExecuteByteCode",
 591           1,
 592           &err
 593  );
 594  ...
 595if (!err) {
 596  Tcl_SetResult(interp,errtype,...);
 597  Tcl_AddErrorInfo(interp,errdetails);
 598}
 599\end{verbatim}
 600
 601In this case, the Tcl interpreter argument passed to a wrapper function 
 602is stolen and used to generate an error.  Also, the name {\tt TclExecuteByteCode}
 603refers to the calling function, not the wrapper function itself.
 604At this time, argument stealing is only applicable to simple types
 605such as integers and pointers.  However, this appears to be adequate for generating
 606scripting language errors.
 607
 608
 609\begin{table}[t]
 610\begin{center}
 611\begin{tabular}{ll}
 612Python symbol                 &   Error return value \\ \hline
 613call\_builtin                 &   NULL \\
 614PyObject\_Print               & -1 \\
 615PyObject\_CallFunction        & NULL \\
 616PyObject\_CallMethod          & NULL \\
 617PyObject\_CallObject          & NULL \\
 618PyObject\_Cmp                 & -1 \\
 619PyObject\_DelAttrString       & -1 \\
 620PyObject\_DelItem             & -1 \\
 621PyObject\_GetAttrString       & NULL \\
 622\end{tabular}
 623\end{center}
 624
  625\caption{A partial list of symbolic return locations in the Python interpreter}
  626\label{returnpoints}
 627\end{table}
 628
 629\section{Register Management}
 630
 631A final issue concerning the return mechanism has to do with the
 632behavior of the non-local return to the interpreter.  Roughly
 633speaking, this emulates the C {\tt longjmp}
 634library call.  However, this is done without the use of a matching
 635{\tt setjmp} in the interpreter.  
 636
 637The primary problem with aborting execution and returning to the
 638interpreter in this manner is that most compilers use a register
 639management technique known as callee-save \cite{prag}.  In this case,
 640it is the responsibility of the called function to save the state of
 641the registers and to restore them before returning to the caller. By
 642making a non-local jump, registers may be left in an inconsistent
 643state due to the fact that they are not restored to their original
 644values.  The {\tt longjmp} function in the C library avoids this
 645problem by relying upon {\tt setjmp} to save the registers.  Unfortunately,
 646WAD does not have this luxury.   As a result, a return from the signal
 647handler may produce a corrupted set of registers at the point of return
 648in the interpreter.
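
For comparison, the conventional pattern that WAD cannot rely on is
sketched below.  It uses the POSIX functions {\tt sigsetjmp} and
{\tt siglongjmp}, which requires the caller to save its registers
{\em before} invoking the code that may fail.  This is a substitute
technique shown only for contrast, not the mechanism used by WAD, and
all {\tt sketch\_} names are hypothetical.

{\small
\begin{verbatim}
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf sketch_env;

static void
sketch_on_fault(int sig)
{
    /* Non-local return to the saved
       context; sig becomes the return
       value of sigsetjmp. */
    siglongjmp(sketch_env, sig);
}

/* Run fn() and return its result, or -1
   if SIGSEGV arrives while it runs. */
int
sketch_guarded_call(int (*fn)(void))
{
    struct sigaction sa, old;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sketch_on_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old))
        return -1;

    /* Saves the calling environment and
       the signal mask. */
    if (sigsetjmp(sketch_env, 1) == 0)
        rc = fn();
    else
        rc = -1;  /* via siglongjmp */

    sigaction(SIGSEGV, &old, NULL);
    return rc;
}

/* Demo callees.  raise() stands in for
   a faulting instruction. */
int sketch_ok(void) { return 7; }

int
sketch_crash(void)
{
    raise(SIGSEGV);
    return 0;
}
\end{verbatim}
}

\noindent
Here {\tt sketch\_guarded\_call(sketch\_crash)} yields $-1$ while
{\tt sketch\_guarded\_call(sketch\_ok)} yields 7.  Applying this
pattern to a scripting language would mean adding a {\tt sigsetjmp}
call to the interpreter around every wrapper invocation, which is
precisely the kind of modification that WAD avoids.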
 649
 650The severity of this problem depends greatly on the architecture and
 651compiler.  For example, on the SPARC, register windows effectively
 652solve the callee-save problem \cite{sparc}.  In this case, each stack
 653frame has its own register window and the windows are flushed to the
 654stack whenever a signal occurs.  Therefore, the recovery mechanism can
 655simply examine the stack and arrange to restore the registers to their
 656proper values when control is returned.  Furthermore, certain
 657conventions of the SPARC ABI resolve several related issues. For
 658example, floating point registers are caller-saved and the contents of
 659the SPARC global registers are not guaranteed to be preserved across
 660procedure calls (in fact, they are not even saved by {\tt setjmp}).
 661
 662On other platforms, the problem of register management becomes 
 663more interesting.  In this case, a heuristic approach that examines
 664the machine code for each function on the call stack can be used to
 665determine where the registers might have been saved.  This approach is
 666used by gdb and other debuggers when they allow users to inspect
 667register values within arbitrary stack frames \cite{gdb}.  Even though
 668this sounds complicated to implement, the algorithm is greatly
 669simplified by the fact that compilers typically generate code to store
 670the callee-save registers immediately upon the entry to each function.
 671In addition, this code is highly regular and easy to examine.  For
 672instance, on i386-Linux, the callee-save registers can be restored by
 673simply examining the first few bytes of the machine code for each
 674function on the call stack to figure out where values have been saved.
 675The following code shows a typical sequence of machine instructions
 676used to store callee-save registers on i386-Linux:
 677
\begin{verbatim}
foo:
55                pushl %ebp
89 e5             movl  %esp,%ebp
81 ec a0 00 00 00 subl  $0xa0,%esp
56                pushl %esi
57                pushl %edi
...
\end{verbatim}
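As a rough illustration of the heuristic described above, the following C sketch scans the first bytes of a function and records where {\tt \%ebx}, {\tt \%esi}, and {\tt \%edi} were pushed relative to {\tt \%ebp}.  It is not WAD's actual code: the names {\tt scan\_prologue} and {\tt SavedReg} are hypothetical, and the sketch assumes the prologue has exactly the regular shape shown in the listing (frame setup, an optional {\tt subl}, then a run of pushes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is an illustration, not WAD's code. */
enum { REG_EBX, REG_ESI, REG_EDI, NUM_SAVED_REGS };

typedef struct {
    int saved;   /* nonzero if the prologue pushed this register      */
    int offset;  /* location of the saved value, relative to %ebp     */
} SavedReg;

/* Scan an i386 prologue of the assumed form
 *     pushl %ebp; movl %esp,%ebp; [subl $N,%esp]; pushl reg; ...
 * Returns the number of callee-save pushes found, or -1 if the code
 * does not begin with the standard frame setup. */
int scan_prologue(const uint8_t *code, size_t len,
                  SavedReg regs[NUM_SAVED_REGS])
{
    int sp = 0;     /* %esp relative to %ebp after frame setup */
    int found = 0;
    int r;
    size_t i;

    assert(code != NULL && regs != NULL);
    for (r = 0; r < NUM_SAVED_REGS; r++) {
        regs[r].saved = 0;
        regs[r].offset = 0;
    }

    /* 55 = pushl %ebp, 89 e5 = movl %esp,%ebp */
    if (len < 3 || code[0] != 0x55 || code[1] != 0x89 || code[2] != 0xe5)
        return -1;

    i = 3;
    while (i < len) {
        uint8_t op = code[i];
        if (op == 0x83 && i + 2 < len && code[i + 1] == 0xec) {
            /* 83 ec ib = subl $imm8,%esp (sign-extended) */
            sp -= (int8_t)code[i + 2];
            i += 3;
        } else if (op == 0x81 && i + 5 < len && code[i + 1] == 0xec) {
            /* 81 ec id = subl $imm32,%esp (little-endian immediate) */
            uint32_t imm = (uint32_t)code[i + 2]
                         | (uint32_t)code[i + 3] << 8
                         | (uint32_t)code[i + 4] << 16
                         | (uint32_t)code[i + 5] << 24;
            sp -= (int32_t)imm;
            i += 6;
        } else if (op == 0x53 || op == 0x56 || op == 0x57) {
            /* 53/56/57 = pushl %ebx/%esi/%edi */
            r = (op == 0x53) ? REG_EBX : (op == 0x56) ? REG_ESI : REG_EDI;
            sp -= 4;
            regs[r].saved = 1;
            regs[r].offset = sp;
            found++;
            i++;
        } else {
            break;  /* first unrecognized instruction ends the prologue */
        }
    }
    return found;
}
```

For the listing above, the scan would report {\tt \%esi} at offset $-164$ and {\tt \%edi} at offset $-168$ from {\tt \%ebp}.  Optimized code that omits the frame pointer or interleaves other instructions would defeat this simple scan, which is why the approach is only a heuristic.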
 687
 688%
 689% Include an example
 690%
 691
 692% more interesting.  One approach is to simply ignore the problem
 693% altogether and return to the interpreter with the registers in an
 694% essentially random state.  Surprisingly, this approach actually seems to work (although a considerable degree of
 695% caution might be in order).
 696% This is because the return of an error code tends to trigger
 697% a cascade of procedure returns within the implementation of the interpreter.
 698% As a result, the values of the registers are simply discarded and
 699% overwritten with restored values as the interpreter unwinds itself and prepares to handle an
 700% exception.  A better solution to this problem is to modify the recovery mechanism to discover and
 701% restore saved registers from the stack.  Unfortunately, there is
 702% no standardized way to know exactly where the registers might have been saved.
 703% Therefore, a heuristic scheme that examines the machine code for each procedure would
 704% have to be used to try and identify stack locations. This approach is used by gdb
 705% and other debuggers when they allow users to inspect register values
 706% within arbitrary stack frames \cite{gdb}.  However, this technique has 
 707% not yet been implemented in WAD due to its obvious implementation difficulty and the
 708% fact that the WAD prototype has primarily been developed for the SPARC.
 709
As a fall-back, WAD could be configured to return control to a location
previously specified with {\tt setjmp}.  Unfortunately, this
requires modifications to either the interpreter or its extension modules.
Although this kind of instrumentation could be facilitated by automatic
wrapper code generators, it is not a preferred solution and is not discussed further.
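To make the cost of this fall-back concrete, the following C sketch shows the kind of instrumentation it would require around each call into extension code.  It is an illustration only and not part of WAD: {\tt call\_protected}, {\tt install\_fault\_handler}, and the two stand-in extension functions are hypothetical names, and the fault is simulated with {\tt raise} rather than a real invalid memory access.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Hypothetical names; a sketch of setjmp-based recovery, not WAD's code. */
static sigjmp_buf recovery_point;
static volatile sig_atomic_t recovery_armed = 0;
static volatile sig_atomic_t last_signal = 0;

static void fault_handler(int signo)
{
    if (recovery_armed) {
        recovery_armed = 0;
        last_signal = signo;
        siglongjmp(recovery_point, 1);  /* back to call_protected() */
    }
    /* No recovery point is active: fall back to the default action. */
    signal(signo, SIG_DFL);
    raise(signo);
}

int install_fault_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;
    if (sigaction(SIGBUS, &sa, NULL) != 0) return -1;
    return 0;
}

/* Run fn with a recovery point in place.  Returns 0 if fn completed,
 * or the signal number if a fatal signal interrupted it. */
int call_protected(void (*fn)(void))
{
    assert(fn != NULL);
    /* Second argument 1: save and later restore the signal mask. */
    if (sigsetjmp(recovery_point, 1) != 0)
        return (int)last_signal;
    recovery_armed = 1;
    fn();
    recovery_armed = 0;
    return 0;
}

/* Stand-ins for extension functions; the fault is simulated. */
void buggy_extension(void)         { raise(SIGSEGV); }
void well_behaved_extension(void)  { }
```

Every entry point into extension code would need a wrapper of this kind, and a single static recovery point does not handle nested calls between the interpreter and extensions.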
 715
 716\section{Initialization}
 717
 718To simplify the debugging of extension modules, it
 719is desirable to make the use of WAD as transparent as possible.
 720Currently, there are two ways in which the system is used.  First, WAD
 721may be explicitly loaded as a scripting language extension module.
 722For instance, in Python, a user can include the statement {\tt import
 723libwadpy} in a script to load the debugger.  Alternatively, WAD can be
 724enabled by linking it to an extension module as a shared
 725library.  For instance:
 726
\begin{verbatim}
% ld -shared $(OBJS) -lwadpy
\end{verbatim}
 730
In this latter case, WAD initializes itself whenever the extension module is
loaded.  The same shared library is used for both situations by making
sure two types of initialization techniques are used.  First, an empty
initialization function is written to make WAD appear like a proper
scripting language extension module (although it adds no functions to
the interpreter).  Second, the real initialization of the system is
placed into the initialization section of the WAD shared library
object file (the ``init'' section of ELF files).  This code always executes
when a library is loaded by the dynamic loader and is commonly used to
properly initialize C++ objects.  Therefore, a fairly portable way
to force code into the initialization section is to encapsulate the
initialization in a C++ statically constructed object like this:
 743
\begin{verbatim}
class InitWad {
   public:
      InitWad() { wad_init(); }
};
/* This forces InitWad() to execute
   on loading. */
static InitWad init;
\end{verbatim}
 753
The nice part about this technique is that it allows WAD to be enabled
simply by linking or loading; no special initialization code needs to
be added to an extension module to make it work.  In addition, due to
the way in which the loader resolves and initializes libraries, the
initialization of WAD is guaranteed to execute before any of the code
in the extension module to which it has been linked.  The primary
downside to this approach is that the WAD shared object file cannot be
linked directly to an interpreter.  This is because WAD sometimes needs to call the
interpreter to properly initialize its exception handling mechanism (for instance, in Python,
four new types of exceptions are added to the interpreter).  Clearly, this type of initialization
is impossible if WAD is linked directly to an interpreter, as
its initialization process would execute before the main program of the
interpreter started.  However,
if WAD needs to be permanently added to an interpreter, the problem is easily
corrected by first removing the C++ initializer from WAD and then replacing it with an explicit
initialization call somewhere within the interpreter's startup function.
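For comparison, compilers that support the GCC {\tt constructor} function attribute allow a similar load-time hook to be written in plain C.  The sketch below is an assumption about how this could look, not the mechanism WAD uses: the attribute is a compiler extension rather than standard C, and {\tt wad\_init} is replaced here by a stand-in that only sets a flag.

```c
#include <assert.h>

/* Flag used only to make the effect of the stand-in observable. */
static int wad_initialized = 0;

/* Stand-in for WAD's real initialization routine. */
static void wad_init(void)
{
    wad_initialized = 1;
}

/* GCC/Clang extension (not standard C): a function marked with this
 * attribute runs when the containing program or shared library is
 * loaded, before main() or before dlopen() returns. */
__attribute__((constructor))
static void init_wad(void)
{
    wad_init();
}

/* Accessor so that callers can check whether the hook has run. */
int wad_is_initialized(void)
{
    return wad_initialized;
}
```

As with the C++ static object, the hook runs during loading, so it carries the same restriction when linked directly into an interpreter.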
 770
 771\section{Exception Objects}
 772
 773Before WAD returns control to the interpreter, it collects all of the
 774stack-trace and debugging information it was able to obtain into a
 775special exception object. This object represents the state of the call
 776stack and includes things like symbolic names for each stack frame,
 777the names, types, and values of function parameters and stack
 778variables, as well as a complete copy of data on the stack. This
 779information is represented in a generic manner that hides
 780platform specific details related to the CPU, object file formats,
 781debugging tables, and so forth.
 782
 783Minimally, the exception data is used to print a stack trace as shown
 784in Figure 1.  However, if the interpreter is successfully able to
 785regain control, the contents of the exception object can be
 786freely examined after an error has occurred.  For example, a Python
 787script could catch a segmentation fault and print debugging information
 788like this:
 789
\begin{verbatim}
try:
   # Some buggy code
   ...
except SegFault,e:
   print 'Whoa!'
   # Get WAD exception object
   t = e.args[0]
   # Print location info
   print t.__FILE__
   print t.__LINE__
   print t.__NAME__
   print t.__SOURCE__
   ...
\end{verbatim}
 805
Inspection of the exception object also makes it possible to write post-mortem
script debuggers that merge the call stacks of the two languages and
provide cross-language diagnostics.  Figure 4 shows an
example of a simple mixed-language debugging session using the WAD
post-mortem debugger (wpm) after an extension error has occurred in a
Python program.  In the figure, the user is first presented with a
multi-language stack trace.  The information in this trace is obtained
both from the WAD exception object and from the Python traceback
generated when the exception was raised. Next, we see the user walking
up the call stack using the `u' command of the debugger.  As this
proceeds, there is a seamless transition from C to Python where the
trace crosses between the two languages.  An optional feature of the
debugger (not shown) allows the user to walk up the entire C
call-stack (in this case, the trace shows information about the
implementation of the Python interpreter).  More advanced features of
the debugger allow the user to query values of function
parameters, local variables, and stack frames (although some of this
information may not be obtainable due to compiler optimizations and the
difficulties of accurately recovering register values).
 825
\begin{figure*}[t]
{\small
\begin{verbatim}
[ Error occurred ]
>>> from wpm import *
*** WAD Debugger ***
#5   [ Python ] in self.widget._report_exception() in ...
#4   [ Python ] in Button(self,text="Die", command=lambda x=self: ...
#3   [ Python ] in death_by_segmentation() in death.py, line 22
#2   [ Python ] in debug.seg_crash() in death.py, line 5
#1   0xfeee2780 in _wrap_seg_crash(self=0x0,args=0x18f114) in 'pydebug.c', line 512
#0   0xfeee1320 in seg_crash() in 'debug.c', line 20

      int *a = 0;
 =>   *a = 3;
      return 1;

>>> u
#1   0xfeee2780 in _wrap_seg_crash(self=0x0,args=0x18f114) in 'pydebug.c', line 512

        if(!PyArg_ParseTuple(args,":seg_crash")) return NULL;
 =>     result = (int )seg_crash();
        resultobj = PyInt_FromLong((long)result);

>>> u
#2   [ Python ] in debug.seg_crash() in death.py, line 5

    def death_by_segmentation():
 =>     debug.seg_crash()

>>> u
#3   [ Python ] in death_by_segmentation() in death.py, line 22

        if ty == 1:
 =>         death_by_segmentation()
        elif ty == 2:
>>> \end{verbatim}
}
\caption{Cross-language debugging session in Python where a user is walking a mixed language call stack.}
\end{figure*}
 866
 867\section{Implementation Details}
 868
Currently, WAD is implemented in ANSI C and a small amount of assembly
code to assist in the return to the interpreter.  The current
implementation supports Python and Tcl extensions on SPARC Solaris and
i386-Linux.  Each scripting language is currently supported by a
separate shared library such as {\tt libwadpy.so} and {\tt
libwadtcl.so}.  In addition, a language neutral library {\tt
libwad.so} can be linked against non-scripted applications (in which case
a stack trace is simply printed to standard error when a problem occurs).
The entire implementation contains approximately 2000
semicolons. Most of this code pertains to the gathering of debugging
information from object files.  Only a small part of the code is
specific to a particular scripting language (170 semicolons for Python
and 50 semicolons for Tcl).
 882
 883Although there are libraries such as the GNU Binary File Descriptor
 884(BFD) library that can assist with the manipulation of object files,
 885these are not used in the implementation \cite{bfd}.  These
 886libraries tend to be quite large and are oriented more towards
 887stand-alone tools such as debuggers, linkers, and loaders.  In addition,
 888the behavior of these libraries with respect to memory management
 889would need to be carefully studied before they could be safely used in
 890an embedded environment. Finally, given the small size of the prototype 
 891implementation, it didn't seem necessary to rely upon such a 
 892heavyweight solution.
 893
A surprising feature of the implementation is that a significant
amount of the code is language independent.  This is achieved by
placing all of the process introspection, data collection, and
platform specific code within a centralized core.  To provide a
specific scripting language interface, a developer only needs to
supply two things: a table containing symbolic function names where
control can be returned (Table 1), and a handler function in the form
of a callback.  As input, this handler receives an exception object as
described in an earlier section.  From this, the handler can
raise a scripting language exception in whatever manner is most
appropriate.
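The two pieces a language binding supplies can be sketched in C as follows.  All type and function names here ({\tt ReturnPoint}, {\tt WadHandler}, {\tt find\_return\_frame}, and so on) are hypothetical and chosen for illustration; only the {\tt PyObject\_GetAttrString} entry with a NULL return comes from Table 1, and the {\tt PyObject\_SetAttrString} entry is added as an example of a function that signals errors with $-1$.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; a sketch of the binding interface, not WAD's API. */
typedef struct {
    const char *symbol;  /* interpreter function to return to        */
    long        retval;  /* error value that function should return  */
} ReturnPoint;

typedef struct WadFrame WadFrame;  /* opaque exception object */

/* Callback that raises an exception in the scripting language. */
typedef void (*WadHandler)(int signo, const WadFrame *trace);

typedef struct {
    const ReturnPoint *returns;  /* NULL-terminated table */
    WadHandler         handler;
} LanguageBinding;

/* Illustrative table for Python; 0 stands for a NULL return. */
static const ReturnPoint python_returns[] = {
    { "PyObject_GetAttrString", 0 },
    { "PyObject_SetAttrString", -1 },
    { NULL, 0 }
};

/* Look up one symbol in a return table; NULL if it is not listed. */
const ReturnPoint *find_return_point(const ReturnPoint *table,
                                     const char *symbol)
{
    assert(table != NULL && symbol != NULL);
    for (; table->symbol != NULL; table++) {
        if (strcmp(table->symbol, symbol) == 0)
            return table;
    }
    return NULL;
}

/* Walk a call stack (innermost frame first) and return the index of
 * the first frame with a known return point, or -1 if none exists. */
int find_return_frame(const ReturnPoint *table,
                      const char *const *stack, int depth,
                      const ReturnPoint **found)
{
    int i;
    for (i = 0; i < depth; i++) {
        const ReturnPoint *rp = find_return_point(table, stack[i]);
        if (rp != NULL) {
            if (found != NULL) *found = rp;
            return i;
        }
    }
    return -1;
}
```

In this sketch the language-independent core would call {\tt find\_return\_frame} on the symbolic stack trace, invoke the binding's handler, and then arrange for the selected frame to return the listed error value.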
 905
Significant portions of the core are also relatively straightforward
to port between different Unix systems.  For instance, code to read
ELF object files and stabs debugging data is essentially identical for
Linux and Solaris.  In addition, the high-level control logic is
unchanged between platforms.  Platform specific differences primarily
arise in the obvious places such as the examination of CPU
registers, manipulation of the process context in the signal handler,
reading virtual memory maps from /proc, and so forth.  Additional
changes would also need to be made on systems with different object
file and debugging formats such as COFF and DWARF2.  To the extent that it is possible,
these differences could be hidden by abstraction mechanisms (although
the initial implementation of WAD is weak in this regard and would
benefit from techniques used in more advanced debuggers such as gdb).
Despite these porting issues, the primary requirement for WAD is a fully
functional implementation of SVR4 signal handling that allows for
modifications of the process context.
 922
 923Due to the heavy dependence on Unix signal handling, process
 924introspection, and object file formats, it is unlikely that WAD could
 925be easily ported to non-Unix systems such as Windows.  However, it may
 926be possible to provide a similar capability using advanced features of
 927Windows structured exception handling \cite{seh}.  For instance, structured
 928exception handlers can be used to catch hardware faults, they can
 929receive process context information, and they can arrange to take
 930corrective action much like the signal implementation described here.  
 931
 932\section{Modification of Interpreters?}
 933
A logical question to ask about the implementation of WAD is whether
or not it would make sense to modify existing interpreters to assist
in the recovery process. For instance, instrumenting Python or Tcl with {\tt setjmp}
calls might simplify the implementation since it would eliminate
issues related to register restoration and finding a suitable return
location.
 940
 941Although it may be possible to make these changes, there are 
 942several drawbacks to this approach.  First, the number of required modifications may be
 943quite large.  For instance, there are well over 50 entry points to
 944extension code within the implementation of Python.  Second, an
 945extension module may perform callbacks and evaluation of script code.
 946This means that the call stack would cross back and forth
 947between languages and that these modifications would have to be made
 948in a way that allows arbitrary nesting of extension calls.  Finally,
 949instrumenting the code in this manner may introduce a performance
 950impact--a clearly undesirable side effect considering the infrequent
 951occurrence of fatal extension errors.
 952
 953\section{Discussion}
 954
The primary goal of embedded error recovery is to provide an
alternative approach for debugging scripting language extensions.
Although this approach has many benefits, there are a number of
drawbacks and issues that must be discussed.
 959
First, like the C {\tt longjmp} function, the error recovery mechanism
does not cleanly unwind the call stack.  For C++, this means that
objects allocated on the stack will not be finalized (destructors will not
be invoked) and that memory allocated on the heap may be
leaked. Similarly, files, sockets, and other
system resources may be left open. In a multi-threaded environment,
deadlock may occur if a procedure holds a lock when an error occurs.
 967
 968In certain cases, the use of signals in WAD may interact adversely with scripting
 969language signal handling. Since scripting languages ordinarily do not catch signals such as
 970SIGSEGV, SIGBUS, and SIGABRT, the use of WAD is unlikely to conflict
 971with any existing signal handling. However, most scripting languages would not 
 972prevent a user from disabling the WAD error recovery mechanism by 
 973simply specifying a new handler for one or more of these signals.  In addition, the use of 
 974certain extensions such as the Perl sigtrap module would completely 
 975disable WAD \cite{perl}.
 976
 977A more difficult signal handling problem arises when thread libraries
 978are used. These libraries tend to override default signal handling
 979behavior in a way that defines how signals are delivered to each
 980thread \cite{thread}.  In general, asynchronous signals can be
 981delivered to any thread within a process.  However, this does not
 982appear to be a problem for WAD since hardware exceptions are delivered
 983to a signal handler that runs within the same thread in which the
 984error occurred.  Unfortunately, even in this case, personal experience has
 985shown that certain implementations of user thread libraries (particularly on older versions
 986of Linux) do not reliably pass
 987signal context information nor do they universally support advanced
 988signal operations such as {\tt sigaltstack}.  Because of this, WAD may
 989be incompatible with a crippled implementation of user threads on
 990these platforms.  
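The signal features in question can be exercised with a short C test such as the following sketch.  It installs a handler with {\tt SA\_SIGINFO} and {\tt SA\_ONSTACK} after registering an alternate stack with {\tt sigaltstack}, and then records whether the handler really ran on that stack and received a context pointer.  The function and variable names are hypothetical, and the signal is delivered with {\tt raise} rather than by a real fault.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; a probe for the signal features WAD relies on. */
#define ALT_STACK_SIZE (64 * 1024)

static char alt_stack[ALT_STACK_SIZE];
static volatile sig_atomic_t handler_on_alt_stack = 0;
static volatile sig_atomic_t handler_got_context = 0;

static void context_handler(int signo, siginfo_t *info, void *context)
{
    char marker;  /* its address shows which stack the handler is on */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t base = (uintptr_t)alt_stack;

    (void)signo;
    (void)info;
    handler_on_alt_stack = (here >= base && here < base + ALT_STACK_SIZE);
    /* context points to a ucontext_t holding the saved registers. */
    handler_got_context = (context != NULL);
}

/* Register the alternate stack and install the handler for signo.
 * Returns 0 on success and -1 on failure. */
int install_context_handler(int signo)
{
    stack_t ss;
    struct sigaction sa;

    assert(signo > 0);
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = context_handler;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

On a thread library that does not pass signal context or honor {\tt sigaltstack}, one or both flags would remain zero after the signal is delivered.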
 991
An even more subtle problem with threads is that the recovery process
itself is not thread-safe (i.e., it is not possible to concurrently
handle fatal errors occurring in different threads).  For most
scripting language extensions, this limitation does not apply due to
strict run-time restrictions that interpreters currently place on
thread support.  For instance, even though Python supports threaded
programs, it places a global mutex-lock around the interpreter that
makes it impossible for more than one thread to execute
within the interpreter at once. A consequence of this restriction is
that extension functions are not interruptible by thread-switching
unless they explicitly release the interpreter lock.  Currently, the
behavior of WAD is undefined if extension code releases the lock and
proceeds to generate a fault.  In this case, the recovery process may
either cause an exception to be raised in an entirely different
thread or cause execution to violate the interpreter's mutual exclusion
constraint.
1008
In certain cases, errors may result in an unrecoverable crash.  For
example, if an application overwrites the heap, it may destroy
critical data structures within the interpreter.  Similarly,
destruction of the call stack (via buffer overflow) makes it
impossible for the recovery mechanism to create a stack-trace and
return to the interpreter.  More subtle memory management problems
such as double-freeing of heap allocated memory can also cause a system
to fail in a manner that bears little resemblance to the actual source
of the problem.  Given that WAD lives in the same process as the
faulting application and that such errors may occur, a common
question is to what extent WAD complicates debugging when it
does not work.
1021
1022To handle potential problems in the implementation of WAD itself,
1023great care is taken to avoid the use of library functions and
1024functions that rely on heap allocation (malloc, free, etc.).  For
1025instance, to provide dynamic memory allocation, WAD implements its own
1026memory allocator using mmap.  In addition, signals are disabled
1027immediately upon entry to the WAD signal handler.  Should a fatal
1028error occur inside WAD, the application will dump core and exit.  Since
1029the resulting core file contains the stack trace of both WAD and the
1030faulting application, a traditional C debugger can be used to identify
1031the problem as before.  The only difference is that a few additional
1032stack frames will appear on the traceback.
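A minimal version of such an allocator might look like the following C sketch, which hands out memory from a single region obtained with {\tt mmap} and never touches the process heap.  The names {\tt arena\_alloc} and {\tt arena\_release}, the fixed arena size, and the 16-byte alignment are assumptions made for illustration; the actual WAD allocator may differ.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical bump allocator; sizes and names are illustrative. */
#define ARENA_SIZE (64 * 1024)

static char  *arena_base = NULL;
static size_t arena_used = 0;

/* Allocate nbytes from the arena without using malloc().
 * Returns NULL if the arena cannot be mapped or is exhausted. */
void *arena_alloc(size_t nbytes)
{
    size_t aligned = (nbytes + 15) & ~(size_t)15;  /* round up to 16 */
    void *result;

    if (aligned < nbytes)
        return NULL;  /* size overflowed while rounding */

    if (arena_base == NULL) {
        void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        arena_base = p;
        arena_used = 0;
    }

    assert(arena_used <= ARENA_SIZE);
    if (aligned > ARENA_SIZE - arena_used)
        return NULL;

    result = arena_base + arena_used;
    arena_used += aligned;
    return result;
}

/* Return the whole arena to the system in one step. */
void arena_release(void)
{
    if (arena_base != NULL) {
        munmap(arena_base, ARENA_SIZE);
        arena_base = NULL;
        arena_used = 0;
    }
}
```

Because individual blocks are never freed, the allocator keeps no free lists inside the application's heap, so it keeps working even when that heap has been corrupted.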
1033
1034An application may also fail after the WAD signal handler has completed
1035execution if memory or stack frames within the interpreter have been
1036corrupted in a way that prevents proper exception handling. In this case, the
1037application may fail in a manner that does not represent the original
1038programming error. It might also cause the WAD signal handler to be
1039immediately reinvoked with a different process state--causing it to
1040report information about a different type of failure.  To address
1041these kinds of problems, WAD creates a tracefile {\tt
1042wadtrace} in the current working directory that contains information
1043about each error that it has handled.  If no recovery was possible, a
1044programmer can look at this file to obtain all of the stack traces
1045that were generated.
1046
1047If an application is experiencing a very serious problem, WAD
1048does not prevent a standard debugger from being attached to the
1049process.  This is because the debugger overrides the current signal
1050handling so that it can catch fatal errors.  As a result, even if WAD
1051is loaded, fatal signals are simply redirected to the attached
1052debugger.  Such an approach also allows for more complex debugging
1053tasks such as single-step execution, breakpoints, and
1054watchpoints--none of which are easily added to WAD itself.
1055
1056%
1057% Add comments about what WAD does in this case?
1058%
1059
1060Finally, there are a number of issues that pertain
1061to the interaction of the recovery mechanism with the interpreter.
1062For instance, the recovery scheme is unable to return to procedures
1063that might invoke wrapper functions with conflicting return codes.
1064This problem manifests itself when the interpreter's virtual
1065machine is built around a large {\tt switch} statement from which different
1066types of wrapper functions are called.  For example, in Python, certain
1067internal procedures call a mix of functions where both NULL and -1 are
1068returned to indicate errors (depending on the function).  In this case, there
1069is no way to specify a proper erro…
