My Schedule

Getting this post out to make my schedule look like a plan (“Achieve Ultimate Blog Success In One Easy Step”) rather than an accident of laziness. I don’t have a list of topics yet, and even if I did then I wouldn’t be sharing it, but I will be doing posts on a regular basis.

My aim is to do a detailed Python Tools blog every two weeks, and a meta or personal blog every other week (such as this week).

This may not be as often as many other bloggers, though I think that I’m going in-depth enough with the Python Tools to get away with it. Every two weeks gives me a good chance to research, write and review before publishing, and my (presumed) lack of readership means probably nobody is expecting frequent updates anyway.

Now, with that out of the way, it’s time to go and choose a bug to write about for next week…

Smart Indentation for Python

The text editor in Visual Studio provides a number of options related to indentation. Apart from the standard tabs/spaces and “how many” options, you can choose between three behaviours: “None”, “Block” and “Smart.”

The 'Tabs' options dialog in Visual Studio 2012

To my knowledge, the “None” mode is very rarely used. When enabled, pressing Enter will move the caret (the proper name for the text cursor) to the first column of the following line. While this is the normal behaviour in a word processor, it does not suit programming quite as well.

“Block” mode is more useful. In this mode, pressing Enter will move the caret to the following line and insert as many spaces/tabs as appeared at the start of the previous line. This ensures that consecutive lines of text all start in the same column by default, which is commonly used as a hint to the programmer that they are part of the same block.

However, the most common (and default) mode is “Smart.” Unlike the other two modes, which are built into the editor, smart indentation is provided by a language service (which are also responsible for providing syntax highlighting and completions). Because they are targeted to a specific language, they can help the programmer by automatically adding and removing extra indentation in ways that make sense.

For example, the smart indentation for C++ will automatically add an indent after an open brace, which begins a new block, and remove an indent for a close brace, which ends the block. Similarly, an indent is added after “if” or “while” statements, as well as the others that support implicit blocks, and removed after the one statement that may follow. In most cases, a programmer can simply continue typing without ever having to manually indent or unindent their code.

This feature has existed since very early in Python Tools for Visual Studio‘s life, but the implementation has changed significantly over time. In this post we will look at the algorithms used in changesets 41aa3fe86341 and 4db951c455d5, as well as the general approach to providing smart indentation from a language service.

ISmartIndentProvider

Since Visual Studio 2010, language services have provided smart indenting by exporting an MEF component implementing ISmartIndentProvider. This interface includes one method, CreateSmartIndent, which returns an object implementing ISmartIndent for a provided text view. ISmartIndent also only includes one method (ignoring Dispose), GetDesiredIndentation, which returns the number of spaces to indent by. VS will convert this to tabs (or a mix of tabs and spaces) depending on the user’s settings.

The entire implementation of these interfaces in PTVS looks like this:

[Export(typeof(ISmartIndentProvider))]
[ContentType(PythonCoreConstants.ContentType)]
public sealed class SmartIndentProvider : ISmartIndentProvider {
    private sealed class Indent : ISmartIndent {
        private readonly ITextView _textView;
 
        public Indent(ITextView view) {
            _textView = view;
        }
 
        public int? GetDesiredIndentation(ITextSnapshotLine line) {
            if (PythonToolsPackage.Instance.LangPrefs.IndentMode == vsIndentStyle.vsIndentStyleSmart) {
                return AutoIndent.GetLineIndentation(line, _textView);
            } else {
                return null;
            }
        }
 
        public void Dispose() { }
    }
 
    public ISmartIndent CreateSmartIndent(ITextView textView) {
        return new Indent(textView);
    }
}

The AutoIndent class referenced in GetDesiredIndentation contains the algorithm for calculating how many spaces are required. Two algorithms for this are described in the following sections, the first using reverse detection and the second using forward detection.

Reverse Indent Detection

This algorithm was used in PTVS up to changeset 41aa3fe86341, which was shortly before version 1.0 was released. It was designed to be efficient by minimising the amount of code scanned, but ultimately got so many corner cases wrong that it was easier to replace it with the newer algorithm. The full source file is AutoIndent.cs.

At its simplest, indent detection in Python is based entirely on the preceding line. The normal case is to copy the indentation from that line. However, if it ends with a colon then we should add one level, since that is how Python starts blocks. Also, we can safely remove one level if the previous line is a return, raise, break or continue statement, since no more of that block will be executed. (As a bonus, we also do this after a pass statement, since its main use is to indicate an empty block.) The complications come when the preceding textual line is not the preceding line of code.

Take the following example:

1
2
if a == 1 and (b == 2 or
               c == 3):

How many spaces should we add for line 3? According to the above algorithm, we’d add 15 plus the user’s indent size setting (for the colon on line 2). This is clearly not correct, since the if statement has 0 leading spaces, but it is the result when applying the simpler algorithm.

Finding the start of an expression is actually such a common task in a language service that PTVS has a ReverseExpressionParser class. It tokenises the source code as normal, but rather than parsing from the start it parses backwards from an arbitrary point. Since the parser state (things like the number of open brackets) is unknown, the class is most useful for identifying the span of a single expression (hence the name).

For smart indentation, the parser is used twice: once to find the start of the expression assuming we’re not inside a grouping (the zero on line 102) and once assuming we are inside a grouping (the one on line 103).The span provided on line 94 is the location of the last token before the caret, which for smart indenting should be an end of line character (that the parser automatically skips).

90
91
92
93
94
95
96
97
98
99
100
101
102
103
// use the expression parser to figure out if we're in a grouping...
var revParser = new ReverseExpressionParser(
        line.Snapshot,
        line.Snapshot.TextBuffer,
        line.Snapshot.CreateTrackingSpan(
            tokens[tokenIndex].Span.Span,
            SpanTrackingMode.EdgePositive
        )
    );
 
int paramIndex;
SnapshotPoint? sigStart;
var exprRangeNoImplicitOpen = revParser.GetExpressionRange(0, out paramIndex, out sigStart, false);
var exprRangeImplicitOpen = revParser.GetExpressionRange(1, out paramIndex, out sigStart, false);

The values of exprRangeNoImplicitOpen and exprRangeImplicitOpen are best described by example. Consider the following code:

1
2
3
4
5
6
7
def f(a, b, c):
    if a == b and b != c:
        return (a +
                b * c
                + 123)
    while a < b:
        a = a + c

When parsing starts at the end of line 4, exprRangeNoImplicitOpen will reference the span containing b * c, since that is a complete expression (remembering that the parser does not know it is inside parentheses). exprRangeImplicitOpen is initialised with one open grouping, so it will reference (a + b * c. However, if we start parsing at the end of line 7, exprRangeNoImplicitOpen will reference a + c while exprRangeImplicitOpen will be null, since an assignment within a group would be an error.

Using the two expressions, we can create a new set of indentation rules:

  • If exprRangeImplicitOpen was found, exprRangeNoImplicitOpen was not (or is different to exprRangeImplicitOpen), and the expression starts with an open grouping ((, [ or {), we are inside a set of brackets.
    • In this case, we match the indentation of the brackets + 1, as on line 2 of the earlier example.
  • Otherwise, if exprRangeNoImplicitOpen was found and it is preceded by a return or raise, break or continue statement, OR if the last token is one of those keywords, the previous line must be one of those statements.
    • In this case, we copy the indentation and reduce it by one level for the following line.
  • Otherwise, if both ranges were found, we have a valid expression on one line and one that spans multiple lines.
    • This occurred in the example shown above.
    • In this case, we find the lowest indentation on any line of the multi-line expression and use that.
  • Otherwise, if the last non-newline character is a colon, we are at the first line of a new block.
    • In this case, we copy the indentation and increase it by one level.

These rules are implemented on lines 105 through 143 of AutoIndent.cs. However, with this approach there are many cases that need special handling. Most of the above ‘rules’ are the result of these being discovered. For example, issue 157 goes through a lot of these edge cases, and while most of them were resolved, it remained a less-than-robust algorithm. The alternative approach, described below, was added to handle most of these issues directly rather than as workarounds.

Forward Indent Detection

This algorithm replaced the reverse algorithm for PTVS 1.0 and has been used since with very minor modifications. It potentially sacrifices some performance in order to obtain more consistent results, as well as being able to support a slightly wider range of interesting rules. The discussion here is based on the implementation as of changeset 4db951c455d5; full source at AutoIndent.cs.

For this algorithm, the reverse expression parser remains but is used slightly differently. Its definition was changed slightly to allow external code to enumerate tokens from it (by adding an IEnumerable<ClassificationSpan> implementation) and its IsStmtKeyword() method was made public. This allows AutoIndent.GetLineIndentation() to perform its own parsing:

109
110
111
112
113
114
115
116
117
118
119
120
121
122
var tokenStack = new System.Collections.Generic.Stack<ClassificationSpan>();
tokenStack.Push(null);  // end with an implicit newline
bool endAtNextNull = false;
 
foreach (var token in revParser) {
    tokenStack.Push(token);
    if (token == null && endAtNextNull) {
        break;
    } else if (token != null &&
       token.ClassificationType == revParser.Classifier.Provider.Keyword &&
       ReverseExpressionParser.IsStmtKeyword(token.Span.GetText())) {
        endAtNextNull = true;
    }
}

The result of this code is a list of tokens leading up to the current location and guaranteed to start from outside any grouping. Depending on the structure of the preceding code, this may result in quite a large list of tokens; only a statement that can never appear in an expression will stop the reverse parse. This specifically excludes if, else, for and yield, which can all appear within expressions, and so all tokens up to the start of the method or class may be included. This is unfortunate, but also required to make guarantees about the parser state without parsing the entire file from the beginning (which is the only other known state).

Since the parser state is known at the first token, we can parse forward and track the indentation level. The algorithm now looks like this as we visit each token in order (by popping them off of tokenStack):

  • At a new line, set the indentation level to however many spaces appear at the start of the line.
  • At an open bracket, set the indentation level to the column of the bracket plus one and remember the previous level.
    • This ensures that if we reach the end of the list while inside the group, our current level lines up with the group and not the start of the line.
  • At a close bracket, restore the previous indentation level.
    • This ensures that whatever indentation occurs within a group, we will use the original indentation for the line following.
  • At a line continuation character (a backslash at the end of a line), skip ahead until the end of the entire line of code.
  • If the token is a statement to unindent after (return and friends), set a flag to unindent.
    • This flag is preserved, restored and reset with the indentation level.
  • If the token is a colon character and we are not currently inside a group, set a flag to add an indent.
    • And, if the following token is not an end-of-line token, also set the unindent flag.

After all tokens have been scanned, we will have the required indentation level and two flags indicating whether to add or remove an indent level. These flags are separate because they may both be set (for example, after a single-line if statement such as if a == b: return False). If they don’t cancel each other out, then an indent should be added or removed to the calculated level to find where the next line should appear:

167
168
169
indentation = current.Indentation +
    (current.ShouldIndentAfter ? tabSize : 0) -
    (current.ShouldDedentAfter ? tabSize : 0);

The implementation of this algorithm uses a LineInfo structure and a stack to manage preserving and restoring state:

44
45
46
47
48
49
50
private struct LineInfo {
    public static readonly LineInfo Empty = new LineInfo();
    public bool NeedsUpdate;
    public int Indentation;
    public bool ShouldIndentAfter;
    public bool ShouldDedentAfter;
}

And the structure of the parsing loop looks like this (edited for length):

while (tokenStack.Count > 0) {
    var token = tokenStack.Pop();
    if (token == null) {
        current.NeedsUpdate = true;
    } else if (token.IsOpenGrouping()) {
        indentStack.Push(current);
        ...
    } else if (token.IsCloseGrouping()) {
        current = indentStack.Pop();
        ...
    } else if (ReverseExpressionParser.IsExplicitLineJoin(token)) {
        ...
    } else if (current.NeedsUpdate == true) {
        current.Indentation = GetIndentation(line.GetText(), tabSize)
        ...
    }
 
    if (ShouldDedentAfterKeyword(token)) {
        current.ShouldDedentAfter = true;
    }
 
    if (token != null && token.Span.GetText() == ":" && indentStack.Count == 0) {
        current.ShouldIndentAfter = true;
        current.ShouldDedentAfter = (tokenStack.Count != 0 && tokenStack.Peek() != null);
    }
}

A significant advantage of this algorithm over the reverse indent detection is its obviousness. It is much easier to follow the code for the parsing loop than to attempt to interpret the behaviour and interactions inherent in the reverse algorithm. Further, modifications can be more easily added because of the clear structure. For example, the current behaviour indents the contents of a grouping to the level of the opening token, but some developers prefer to only add one indent level and no more. With the reverse algorithm, finding the section of code requiring a change is difficult, but the forward algorithm has an obvious code path at the start of groups.

Summary

Smart indentation allows Python Tools for Visual Studio to assist the developer by automatically indenting code to the level it is usually required at. Since Python uses white-space to define scopes, much like C-style languages use braces, this makes writing Python code simpler and can reduce “inconsistent-whitespace” errors. Language services provide this feature by exporting (through MEF) an implementation of ISmartIndentProvider. This post looked at two algorithms for determining the required indentation based on the preceding code, the latter of which shipped with PTVS 1.0.

My Keyboard

Once I started work, one of the first things I asked for was a new keyboard. (Something about typing a million-word thesis makes you fussy about your keys.) Slightly surprisingly, despite the keyboard I asked for not being on the list of approved hardware, I got it.

Possibly this is because it is, in the manufacturer’s words, “bad ass.”

Yes, I have a Das Keyboard. And not just any old Das, but the Ultimate – the one with no labels.

Despite getting the “silent” model, it’s still quite a noisy keyboard, though the noise is plastic-on-plastic rather than the distinctive “click” of a mechanical keyboard. There is no difference in feel between the silent and the clicky, but there’s a huge difference from non-mechanical keyboards. Typing this on my laptop keyboard feels hard and stilted. Every key stops short, and after a while it begins to rattle up into my finger joints. Not pleasant.

But that part is not much of a surprise, since I previously owned a Das Keyboard and knew how good the mechanism was. However, my old one had labels, and so going without was a new experience. So far there have been two benefits.

The first is that “bad ass” is not just the manufacturer’s words, but the reaction of basically everyone who sees it. The box had been opened before I got hold of it, because the person it was delivered to was showing it off to people (which may also explain the late delivery…). I’ve also had a number of people who come into my office simply stop and stare as I type. The “wow” factor really exists.

On the downside, not having labels requires really getting to know the keyboard layout. In my case I already knew the layout, because I’ve been using the two-handed Dvorak layout for the last year. Without turning this into a rave about Dvorak, I think it’s great. However, one thing I’ve noticed myself doing is thinking in Qwerty and typing in Dvorak.

For example, Ctrl+Shift+B is used in Visual Studio to build the current solution. On a Qwerty keyboard, you would press the key labelled “N” to produce a “B”, and hence the keys labelled “Ctrl”, “Shift” and “N” to build. I found myself thinking of it as “control, shift, N” rather than “B”, which became problematic when my fingers started going to the actual key for “N” (labelled “L” on a Qwerty keyboard) instead of the key for “B”.

In short, I was still referring to the labels on my keyboards, despite them being incorrect. The keys for “A” and “M” are in the same place, and using them even after a year of Dvorak still felt more comfortable than other keys. I guess it’s just harder to trust yourself when there’s a sign telling you that you’re wrong.

Switching to a keyboard without labels has revealed just how often I would look at the keys. Cutting that out completely has significantly increased my confidence and reduced my typing errors. That alone, even without the “bad ass” effect, is worth it (blah-blah-YMMV-blah). Considering the amount of typing I do for work, it’s been a good investment, and if they hadn’t bought one then I would have bought my own and not regretted it at all.

Now, Das, about that sponsorship deal…

User-unhandled Exceptions

This feature is one that I added to Python Tools for Visual Studio as an intern in 2011. Those familiar with debugging in Visual Studio will know that if your code throws an exception that is not handled, the debugger breaks at the statement that caused the error:

This feature is known as a user-uncaught exception and is incredibly useful in debugging, since it provides an opportunity to inspect local state before the stack unwinds. It is optional but defaults to on for most exceptions:

The “thrown” column breaks before the system traverses the stack looking for a handler (the “first chance”), while the “user-unhandled” column breaks after traversal if no code chooses to handle it, but before the stack actually begins to unwind (the “second chance”).

In Python, an unhandled exception will unwind the stack immediately, executing finally blocks and otherwise destroying state as it goes, printing a basic trace at the end. In other words, the first traversal step does not exist in Python and there is no second chance. Before adding this feature, PTVS only supported breaking on the first chance:

The aim of this feature was to mimic other languages by simulating the first traversal of exception handlers without modifying the program state. Step one of this task was to enable the UI, step two was to find a way to detect active exception handlers, and step three was to extend the debugger to support the feature.

To follow along at home, the changeset is baff92317760, which I will quote from (and link to) where relevant. (The code has changed again since this commit, but these are the relevant ones for this discussion.) I also recorded a short video around the time this feature was added to demonstrate how it works.

Enabling the UI

Enabling the check boxes for user-unhandled exceptions is simply a case of changing the registration. Each debugging engine can list the exceptions it supports under AD7Metrics\Exception\{engine_guid}\{exception_name}. This allows the user to select whether or not to break on these exceptions, while the debugging engine is still fully responsible for handling the break itself.

Within these keys is a State value that combines values from the
EXCEPTION_STATE enumeration. The values we used are EXCEPTION_STOP_USER_UNCAUGHT and EXCEPTION_JUST_MY_CODE_SUPPORTED (0x4020):

The state value for Python exception ArithmeticError is set to 0x4020

In PTVS, the exceptions are registered using the ProvideDebugException attribute on PythonToolsPackage.cs, which includes a state parameter for setting this value. By modifying ProvideDebugExceptionAttribute.cs we can simply change the default for all of our exceptions:

    _state = (int)(enum_EXCEPTION_STATE.EXCEPTION_JUST_MY_CODE_SUPPORTED | 
                   enum_EXCEPTION_STATE.EXCEPTION_STOP_USER_UNCAUGHT);

After adding this line to the attribute constructor and rebuilding, the dialog has the checkboxes enabled:

However, simply making the option available does not provide any functionality, since it is entirely managed by Visual Studio. The options selected by the user are exposed to the debugger through the IDebugEngine2.SetException method, implemented in AD7Engine.cs. The following changes were made to support both forms of exception handling:

         int IDebugEngine2.SetException(EXCEPTION_INFO[] pException) {
+            bool sendUpdate = false;
             for (int i = 0; i < pException.Length; i++) {
                 if (pException[i].guidType == DebugEngineGuid) {
+                    sendUpdate = true;
                     if (pException[i].bstrExceptionName == "Python Exceptions") {
-                        _defaultBreakOnException = true;
+                        _defaultBreakOnExceptionMode =
+                            (int)(pException[i].dwState & (enum_EXCEPTION_STATE.EXCEPTION_STOP_FIRST_CHANCE | enum_EXCEPTION_STATE.EXCEPTION_STOP_USER_UNCAUGHT));
                     } else {
-                        _breakOnException.Add(pException[i].bstrExceptionName);
+                        _breakOnException[pException[i].bstrExceptionName] = 
+                            (int)(pException[i].dwState & (enum_EXCEPTION_STATE.EXCEPTION_STOP_FIRST_CHANCE | enum_EXCEPTION_STATE.EXCEPTION_STOP_USER_UNCAUGHT));
                     }
                 }
             }
 
-            _process.SetExceptionInfo(_defaultBreakOnException, _breakOnException);
+            if (sendUpdate) {
+                _process.SetExceptionInfo(_defaultBreakOnExceptionMode, _breakOnException);
+            }
             return VSConstants.S_OK;
         }

One trick here is that this method is only called where a setting is different from its default, which means that the debug engine has to be aware of all the default values. For example, we made AttributeError not break by default, since it is often handled in a way we cannot detect, but now SetExceptionInfo only receives a notification for when handling is enabled, and so the debug engine has to be aware that unless it is told otherwise, it should not break on AttributeError. The changes required to make it break when the exception is thrown are discussed later.

Detecting Exception Handlers

As described above (briefly), Python uses a different approach to thrown exceptions than other languages. The main difference is that exception handlers are checked as the stack unwinds, so by the time Python code can be sure that an exception is not handled, the stack is gone, all finally blocks have been run and only the traceback remains. The traceback contains source files and line numbers, but no information about local variables or values. Ideally, the debugger would detect that an exception in unhandled without unwinding and break in before the call stack is destroyed.

Since we are restricted to static analysis, a fallback position is necessary. The implementation assumes that all exceptions are unhandled unless proven otherwise, which makes us more likely to break when an exception occurs. The alternative would result in more stack traces and less breaking. Since the developer is already debugging their code, it is fair to assume that a false positive (breaking when a handler exists) is preferable to a false negative (not breaking when there is no handler).

In terms of implementation, the GetHandledExceptionRanges method in PythonProcess.cs does the main detection work:

309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
internal IList<Tuple<int, int, IList<string>>> GetHandledExceptionRanges(string filename) {
    PythonAst ast;
    TryHandlerWalker walker = new TryHandlerWalker();    var statements = new List<Tuple<int, int, IList<string>>>();
 
    try {
        using (var source = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read)) {
            ast = Parser.CreateParser(source, LanguageVersion).ParseFile();
            ast.Walk(walker);
        }
    } catch (Exception ex) {
        Debug.WriteLine("Exception in GetHandledExceptionRanges:");
        Debug.WriteLine(string.Format("Filename: {0}", filename));
        Debug.WriteLine(ex);
        return statements;
    }
 
    foreach (var statement in walker.Statements) {
        int start = statement.GetStart(ast).Line;        int end = statement.Body.GetEnd(ast).Line + 1;        var expressions = new List<string>();
 
        if (statement.Handlers == null) {
            expressions.Add("*");        } else {
            foreach (var handler in statement.Handlers) {
                Expression expr = handler.Test;
                TupleExpression tuple;
                if (expr == null) {
                    expressions.Clear();
                    expressions.Add("*");                    break;
                } else if ((tuple = handler.Test as TupleExpression) != null) {
                    foreach (var e in tuple.Items) {
                        var text = ToDottedNameString(e, ast);
                        if (text != null) {
                            expressions.Add(text);                        }
                    }
                } else {
                    var text = ToDottedNameString(expr, ast);
                    if (text != null) {
                        expressions.Add(text);                    }
                }
            }
        }
 
        if (expressions.Count > 0) {
            statements.Add(new Tuple<int, int, IList<string>>(start, end, expressions));        }
    }
 
 
    return statements;
}

Given a filename, we parse the file and produce a syntax tree. With TryHandlerWalker we simply collect all the try nodes, each of which also contains the list of caught expressions and, importantly, the line numbers of the try statement and the first except statement. This information is sent to the debugger in response to a REQH command (discussed below).

Later, the debugger will parse expressions from the except statements to compare against the thrown exception. Line numbers are sufficient to determine whether code is in a try block (since try and except are only valid at the start of a line) but anything can be specified as the except expression. For example, the following is working (perhaps not valid, and certainly not easy to read) code:

try:
    do_something()
except get_exception_types()[1]:    print("handled")

When we are simulating a stack unwind, we cannot call the function to find out what it returns, firstly because we are not in the correct call frame and secondly because of potential side effects. (The closest thing to a sensible use of a call as an except expression is logging, which would mean that if we called it we would produce spurious messages.)

To handle as many safe cases as possible, the debugger will only look up names and modules. As soon as anything more complicated appears in an expression list, we assume that it does not handle the current exception and keep looking for another handler. Because we have access to the call frames, we are able to look up names in the correct scope. For example, in the following code we will determine that the exception has a handler, even though its name (zde) is defined in a different call frame than where the exception is thrown:

def test_1(x):
    return 100 / x 
def test_2(x):
    zde = ZeroDivisionError    try:
        return test_1(x)
    except zde:        return 0

An issubclass call is used to determine whether the exception will be caught by each possible handler.1 When it returns true (or a plain except: is encountered), execution is continued and the debugger is not notified of the exception. If we reach a call frame that does not have code available (meaning we can’t get the handlers) or that belongs to the debugger itself, we assume the exception is unhandled and break.

1Of course, this means that arbitrary code could still be executed in the form of a __subclasscheck__ method. This method should not have side effects anyway, so all we’d really achieve by calling it is to expose a bug earlier.

Extending the Debugger

In order for the debugger to support the updated exception handling mechanism, we have to add the new REQH (REQuest Handlers) and SEHI (Set Exception Handler Info) commands. Commands are used for stateless communicate between the Python process being debugged and the Visual Studio instance doing the debugging. REQH will be sent from the debuggee to request the list of exception handlers in a particular source file and SEHI is sent with the response. PythonProcess.cs contains the handling for REQH events:

266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
switch (CommandtoString(cmd_buffer)) {
    case "EXCP": HandleException(socket); break;
    case "BRKH": HandleBreakPointHit(socket); break;
    case "NEWT": HandleThreadCreate(socket); break;
    case "EXTT": HandleThreadExit(socket); break;
    case "MODL": HandleModuleLoad(socket); break;
    case "STPD": HandleStepDone(socket); break;
    case "EXIT": HandleProcessExit(socket); return;
    case "BRKS": HandleBreakPointSet(socket); break;
    case "BRKF": HandleBreakPointFailed(socket); break;
    case "LOAD": HandleProcessLoad(socket); break;
    case "THRF": HandleThreadFrameList(socket); break;
    case "EXCR": HandleExecutionResult(socket); break;
    case "EXCE": HandleExecutionException(socket); break;
    case "ASBR": HandleAsyncBreak(socket); break;
    case "SETL": HandleSetLineResult(socket); break;
    case "CHLD": HandleEnumChildren(socket); break;
    case "OUTP": HandleDebuggerOutput(socket); break;
    case "REQH": HandleRequestHandlers(socket); break;    case "DETC": _process_Exited(this, EventArgs.Empty); break;
}
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
private void HandleRequestHandlers(Socket socket) {
    string filename = socket.ReadString(); 
    Debug.WriteLine("Exception handlers requested for: " + filename);
    var statements = GetHandledExceptionRanges(filename); 
    _socket.Send(SetExceptionHandlerInfoCommandBytes);    SendString(_socket, filename);
 
    _socket.Send(BitConverter.GetBytes(statements.Count));
 
    foreach (var t in statements) {
        _socket.Send(BitConverter.GetBytes(t.Item1));
        _socket.Send(BitConverter.GetBytes(t.Item2));
 
        foreach (var expr in t.Item3) {
            SendString(_socket, expr);
        }
        SendString(_socket, "-");
    }
}

The GetHandledExceptionRanges method was shown above; HandleRequestHandlers calls this method and transmits the line numbers and exception types to the debuggee. The implementation on the debuggee side is in visualstudio_py_debugger.py. The handling for the SEHI command looks like this:

667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
self.command_table = {
    cmd('exit') : self.command_exit,
    cmd('stpi') : self.command_step_into,
    cmd('stpo') : self.command_step_out,
    cmd('stpv') : self.command_step_over,
    cmd('brkp') : self.command_set_breakpoint,
    cmd('brkc') : self.command_set_breakpoint_condition,
    cmd('brkr') : self.command_remove_breakpoint,
    cmd('brka') : self.command_break_all,
    cmd('resa') : self.command_resume_all,
    cmd('rest') : self.command_resume_thread,
    cmd('exec') : self.command_execute_code,
    cmd('chld') : self.command_enum_children,
    cmd('setl') : self.command_set_lineno,
    cmd('detc') : self.command_detach,
    cmd('clst') : self.command_clear_stepping,
    cmd('sexi') : self.command_set_exception_info,
    cmd('sehi') : self.command_set_exception_handler_info,}
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
def command_set_exception_handler_info(self):
    try:
        filename = read_string(self.conn) 
        statement_count = read_int(self.conn)
        handlers = []
        for _ in xrange(statement_count):
            line_start, line_end = read_int(self.conn), read_int(self.conn) 
            expressions = set()
            text = read_string(self.conn).strip()
            while text != '-':
                expressions.add(text)
                text = read_string(self.conn)
 
            if not expressions:
                expressions = set('*')
 
            handlers.append((line_start, line_end, expressions)) 
        BREAK_ON.handler_cache[filename] = handlers    finally:
        BREAK_ON.handler_lock.release()

Notice that the filename is sent both ways. Because the protocol is stateless, the debuggee cannot automatically associate the response with its request. In practice, it will block until it receives the response it expects, but it would require no protocol modifications to support preemptively sending SEHI commands. To avoid parsing source files repeatedly, the debuggee caches the handler lists, which are stored as tuples containing the same information found by GetHandledExceptionRanges.

Breaking on an exception is relatively straightforward. The handler for when an exception is thrown already had a ShouldBreak test, which was changed to check the mode for the exception as, if necessary, whether any handlers exist.

358
359
360
def handle_exception(self, frame, arg):
    if frame.f_code.co_filename != __file__ and BREAK_ON.ShouldBreak(self, *arg):        self.block(lambda: report_exception(frame, arg, self.id))
-    def ShouldBreak(self, name):
-        return self.break_always or name in self.break_on
 
+    def ShouldBreak(self, thread, ex_type, ex_value, trace):
+        name = ex_type.__module__ + '.' + ex_type.__name__
+        mode = self.break_on.get(name, self.default_mode)
+        return (bool(mode & BREAK_MODE_ALWAYS) or+                (bool(mode & BREAK_MODE_UNHANDLED) and not self.IsHandled(thread, ex_type, ex_value, trace)))

The IsHandled method performs the call-stack traversal described earlier, returning False at the first frame that has no code available or True at the first frame that has a handler matching the exception type:

    # Edited for length
    def IsHandled(self, thread, ex_type, ex_value, trace):
        ... 
        cur_frame = trace.tb_frame
 
        while should_send_frame(cur_frame) and cur_frame.f_code.co_filename is not None:
            handlers = self.handler_cache.get(cur_frame.f_code.co_filename)
 
            if handlers is None:
                # get handlers, or assume unhandled and return False
                ...
 
            line = cur_frame.f_lineno
            for line_start, line_end, expressions in handlers:
                if line_start <= line < line_end:
                    if '*' in expressions:
                        return True
 
                    for text in expressions:
                        res = lookup_local(cur_frame, text)
                        if res is not None and issubclass(ex_type, res):
                            return True
 
            cur_frame = cur_frame.f_back
 
        return False

The search starts from the first frame in the traceback, which is the frame where the exception was caused, and goes backwards through f_back (that is, towards the caller) searching each file’s handler list. If no handlers are cached, they are requested from the debugger, and the linear search and use of line ranges ensures the correct behaviour for nested try blocks.

Summary

This feature added the ability for Python Tools for Visual Studio to detect and break on unhandled exceptions without unwinding the stack. This allows developers to inspect the state of their program and not just the stack trace. Handler detection is based on simple source code analysis, performed when an exception has been thrown, and assumes that if it can’t guarantee that the exception has a handler then it should break. It was first released in PTVS 1.0, August 2011.

Poor start

The poor start is only with respect to blogging. Everything else has gone well. I’ve settled into life in the US and have been at work for two months, in which time I’ve done a lot of things I can’t talk about (yet) and some that I can.

I’m on the Python Tools team at Microsoft and our main project is Python Tools for Visual Studio (PTVS). We’re also the team responsible for supporting Python on Windows Azure and the PyKinect library for the Kinect SDK. Admittedly, between ramping up and other (unannounced) work, I’ve only contributed a few features and fixes.

I’ll probably write some blogs on those features. As I’m a developer and not a Program Manager, I promise that any such blogs will be filled with code and not marketing. While I could probably get a Microsoft-hosted blog (and will quite likely be asked to contribute to one), I’ve already committed to keeping my current site alive, so I’ll mainly post here and cross-post anything I do elsewhere.

Finishing on a more personal note, moving into a new apartment is hard when you don’t have any furniture, nor enough money to buy new furniture. Wait, let me rephrase. Moving is incredibly easy without furniture; living for the next month or two is not so comfortable. The lesson: purchase furniture early and put it into storage. Or find a place that has a really comfortable floor.

Post-WCCI 2012

So WCCI 2012 has come and gone, so now it’s up to the proceedings to make an actual change in the world. Hopefully some of the attendees found some new perspective or idea to influence their future work, though the whole experience is pretty full-on, so I thought I’d use this post as an excuse to write down my two biggest “lessons learnt.” (Whether I apply or use these is the future is a different matter…)

Gene Expression

EAs typically use one of a few common representations for the individuals they are evolving, such as integer or real-valued vectors. (There are others that aren’t relevant here.) In general, these vectors are of a fixed length and can change either (semi-)randomly or through exchanges with other individuals. Gene Expression is a concept where a vector has more elements than required, but a secondary Boolean vector indicates whether each “gene” (element) is “expressed” (active or ignored).

For example, the vector “12345” with expression vector “10110” implies the solution should be (based on) “134”. However, because the expression vector can change independently, the un-expressed values are retained. Dan Ashlock‘s presentations showed that even allowing a small amount of slack space (adding 4 elements to a 24 element vector) had a significant impact.

What’s potentially more interesting is the application of this concept on GPUs. Because of their architecture, variable-length vectors are particularly inefficient to use. Slack space up to the maximum possible length is a possibility, but then you need a way of indicating which elements should be ignored. My own intent was to include a length attribute, but an alternative would be to include an expression vector.

Real-world Problems

All the attendees at Zbigniew Michalewicz‘s lecture will know what I’m referring to here, since it had quite an impact. One of the main messages was that a lot of research is in “silos,” trying to find the fastest/best solution to a very narrow problem, and does not consider that (a) fast-enough is good-enough and (b) there are real-world applications that are completely ignored. Other presenters made similar points, perhaps better identifying the underlying reasons (lack of industry-academia interaction, time/effort/money required to create better than proof-of-concept software).

Probably the best outcome from this presentation was the defensive reactions it provoked: the number of people who said “hang on, look at what I’ve been working on” seemed to be higher than the number of written publications and certainly reached a wider audience. Normally I’m not a fan of purely “awareness raising” presentations, but in this case the awareness of actual work was raised, even if it was not work done by the presenter.

It will be interesting with my new job being in the “industry” that seems so mythical to academia. I’m quite happy about the move, mainly because I’d prefer my main output to be production-quality software rather than settling for prototypes and publications. Presumably I’ll get to go to tech conferences at some point, rather than academic ones, but for me WCCI 2012 was made great by having a pretty significant social group there. Even though I’m leaving academia, at least I’ve done it and it wasn’t all bad.

WCCI 2012

In what will be one of my last appearances as a Ph.D. student, I will be presenting a paper at WCCI next week in Brisbane, Australia. The paper is “Automatic Implementation of Evolutionary Algorithms on GPUs using ESDL” and, contrary to what the WCCI program says, I will be presenting on the Monday at 16:10 as part of CIGPU.

The work in this paper is an application of my thesis work, rather than being central to it. ESDL provides a structural decomposition of an algorithm that allows a software implementation to reconstruct the algorithm through operator composition. So, by implementing a set of operators for a GPU, rather than a CPU, a wide range of EAs can be easily constructed by any developer, whether they have taken the time to learn how to program for a GPU or not.

My presentation will be to an audience with a specific interest in implementing for GPU (at least, that’s what I’m assuming), which means I’ll be able to concentrate more on how ESDL’s robustness and portability can help make CIGPU accessible to more users without compromising on flexibility.

With regards to the performance results I’ll talk about, it’s hardly a spoiler to point out that there is no silver bullet here. Composing GPU-based operators produces faster code than operators written in a scripting language, comparable to CPU-based native operators and not as fast as a monolithic GPU-based algorithm. However, once a suitable set of operators are available, compiling ESDL with GPU-based operators produces code that is almost as fast as writing it from scratch, but without the (very significant) effort required.

If you are attending WCCI this year, feel free to come and have a chat to me either before or after I present (try and avoid doing it during my presentation…). I’m happy to talk about EAs, Python, C#, C++ or C++ AMP, all of which I’ve been using a lot of recently. See you there.

Starting fresh…

As I write this, I’m fully aware that this blog is hidden at a super secret URL (/blog) that nobody will find. I would have put off writing anything for a bit longer, but WordPress displays a really obnoxious (and poorly formatted) message when you don’t have any posts. I’ll probably write something else before this becomes the main entry page of the site.

So far, I’ve turned my previous site, which was entirely focused on my research work, into static pages. They need updating anyway, but I’ll get to that eventually. I kept the same theme, mostly because I love the colourful background.

But “starting fresh” isn’t entirely about the new site – I’m also about to generally start fresh in life. In a bit over a month’s time, I will have submitted my Ph.D., moved to a new country (USA), started a new job as a software engineer at Microsoft and restarted blogging (check!).

Every time I have this conversation, the job is what most interests people. I’ll talk about it in more detail later, but not this time.

As for the move, I’ve lived in Australia my entire life with the only exception being three months in the US last year for an internship. Moving for a more significant amount of time is a little intimidating, but having been over already makes it easier. Also, the support I’m getting from Microsoft is very helpful.

I’ll go into more detail on my plans for this blog in a later post, but as with previous efforts (under pseudonyms) it will be a mix of code, general development and possibly education. I’ve never been one for talking about my personal life (ref: my mostly blank Facebook page) but I may throw the occasional non-code post in (like this one).

That’s all for now. I don’t have a blogging schedule yet, but I will and I’ll be sticking to it, which probably means it will only be weekly. First, I need to figure out what to write about…