Monthly Archives: December 2012

Significant indentation comments/strings

I recently changed my comment syntax to allow multi-line comments through indentation, which you can see an example of in the below code.  If a line comment is followed by indented lines, then the comment continues until the indentation level returns to the initial level.


This approach supports nesting, doesn’t require an explicit terminal symbol, and is also useful for string literal parsing.

There were two hard parts to implement it:

  • I previously translated leading tabs into indent/outdent tokens after the lexer turned the input string into tokens.  Since comments and string literals are parsed by the lexer, I had to move that translation to happen before lexing.  However, I cannot strip white-space until after lexing, so these new indentation-aware components of the lexer needed to explicitly handle superfluous white-space.  The indentation-aware components of the parser don’t need to do so, since it occurs after white-space is stripped.
  • It was tricky to make syntax highlighting in Sublime Text handle the new syntax.  The problem is that it just uses a set of regexes to classify subsequences of the buffer.  However, it does support nested regexes, so I ended up writing code to generate the syntax highlighting definition file, and use that to nest the indentation-dependent classifications within an explicitly enumerated regex for each indentation level.

Objects and Implicit This

Something my language does differently from most object-oriented languages is that there’s no implicit this value for member functions.  This may seem strange at first, but it solves a few problems.

A common problem with an implicit this value is that the function must be called explicitly as a member function to receive the this value.  For example, in JavaScript, calling object.function() passes object into function as this.  One case where this goes wrong is if you want to use that function as a delegate: control.onClick = object.function.  Now, when control.onClick() is called, it will pass control to function as this instead of object.  This is sometimes useful, but it’s unintuitive behavior for a function originally declared as part of object.  JavaScript provides a way to override the this pointer passed to the function, but the fact that this may not be the object the function was declared in is a caveat most JavaScript programmers will have to deal with.

The next level of this problem is that when you have nested object declarations, this only allows you to access the innermost object’s members.  That sounds a little contrived, but it’s a real problem.  For example, if you have an object for your module’s public members, and within it you have another object literal, that nested object’s member definitions cannot access the module’s public members.

This is actually a little better in JavaScript than in languages like C++, since each object has its own map from member names to values.  Since a function value in JavaScript includes an implicit lexical closure, the member function for an object can have implicit information about its lexical environment.  That normally doesn’t include this, but it allows CoffeeScript to generate JavaScript that binds a function value to a specific value of this, so it will be the same regardless of how the function is called.  However, that solution still requires programmers to understand how JavaScript defines this, so they can decide which functions need the special this-binding function syntax, and it doesn’t solve the problems with nesting objects.  For more info, see CoffeeScript’s function binding.

The way I solved this in my language is to replace the caller-provided this with capturing the object’s members as part of a lexical closure.  Object members may be accessed within the scope of the object definition, and it will resolve to the inner-most object that defines the name.  Internally, there isn’t a single this, and member names need to be resolved against a stack of objects that were in the lexical scope.

To make things a little bit clearer for everybody (humans and machines), member names must start with a . symbol, both when declared and when accessed.  Objects may also include definitions that don’t have a . prefix, but they are the equivalent of private members: they cannot be accessed outside the object definition.  When an object is used as a prototype, its public and private members are copied into the derived object, and are bound to the new object where they previously accessed the prototype’s public members.


Member name resolution in my language

Member name resolution in my language

There are still some remaining problems:

  • There aren’t aliases for the objects that are being used to resolve member names.  This is rarely needed, but it is occasionally useful.  The simplest solution would be to define this(or .) to be the innermost object context, but that doesn’t allow e.g. memberFunction to access the module object it is being used from; it can’t simply refer to myModule, as myModule may be used as a prototype for another object, and if that is how memberFunction was accessed, then that prototyped object is what you want.
  • I’m torn on using a symbolic <- for object prototyping.  Its meaning is similar to class inheritance, where the left operand is a prototype object for the right operand.  An operator is more flexible than making the prototype part of the object literal syntax, and <- makes the non-commutative nature of the operator clear, but the meaning isn’t as obvious to people who don’t know the language as using a keyword like extends would be.  But what’s an appropriate keyword that keeps the prototype as the left operand?
  • The way I implement this when running in the JavaScript VM, objects include an implicit function that is called when it is used as a prototype.  The function recreates the prototype’s member definitions with the new object bound in place of the prototype object.  This function implicitly includes a closure for the lexical scope the object was created in, and if the prototype object is itself created from a prototype, a call to that prototype’s prototyping function.  The problem is that a VM may keep the whole lexical scope around, instead of just the variables accessed by the functions; the JavaScript VM I am using (V8) is implemented that way.  As a result, any definitions accessible from an object’s scope are kept around as long as the object is, since they may be accessed when the object is used as a prototype.  The compiler has the information it needs to avoid this when it is unnecessary, but I don’t like to rely on that kind of optimization for intuitive performance/memory characteristics.

Parser Combinators and Error Reporting

For the last month, I’ve been working on a programming language. It’s the first language I’ve made, so I’ve been learning and having fun. It still has a long way to go, but a week ago it became self-hosted. I deprecated the bootstrap compiler (written in CoffeeScript), and haven’t looked back!

The language is similar to CoffeeScript in spirit, but is growing in a different direction. It’s dynamically typed, translated to JavaScript, has string subexpressions (what CoffeeScript calls string interpolation), and has significant white-space.

However, it also has no mutable variables, doesn’t use parentheses or commas for function calls, and has a different object model. I’m planning to incrementally add static typing: first by adding type constraints to function parameters that are verified at run-time, and then by verifying the constraints at compile time. The language already generates code with extra run-time checks to catch errors earlier than they would occur in CoffeeScript/JavaScript.

Snippet of my language

The where term seen in this snippet is a simple feature, but provides a useful tool for describing your program in a top-down way.  The for is similar to CoffeeScript; in this case it means return an array where each element is the result of applying literalTokenParser … to the symbol strings in the provided array.  [: is a way to declare a literal array with an indented list of elements separated by newlines, with a decrease in indentation terminating the array rather than an explicit ].  There are several terms like that in the language, inspired by my experience with CoffeeScript.  In CoffeeScript, you often end up with redundant closing brackets and braces where the indentation provides the same information.

But I’ll talk more about my language later, this post is mostly about parser combinator error handling!

Parser combinators

The code snippet shows a very simple example of parser combinators: litSeq is a primitive parser that matches a string in the input, literalTokenParser turns a successful match of that parser into a literal token value, and alt matches any individual symbol parsers.  describeParser just provides a friendly description of the combined parser for debugging and error messages.

litSeq, literalTokenParser, alt and describeParser are all higher-order functions: they return a function that is parameterized by the input stream to be parsed.  The type of symbolsParser is a function that maps an input stream cursor to either some output value and a new stream cursor, or failure.

Implicit in the alt combinator is the ability to backtrack.  One of the alternatives may partially match the input stream, but ultimately fail and fall back to trying other alternatives at the earlier location in the stream.  This is a powerful property that allows general composition of parsers: alt doesn’t need to know which characters each alternative will match to work.  It’s also a dangerous feature, as a grammar with too much backtracking can be slow to parse and hard to read.

But the compositional nature of alt makes it possible to declare parsers like symbolsParser that may later be combined into more complicated parsers without any knowledge of what characters it matches.  This kind of modularity allows reducing a program to its pieces, and reading and reasoning about each piece individually.  Without it, complicated software projects eventually collapse under the weight of the knowledge needed to understand any given part of the project.

This modularity is why I prefer parser combinators to the usual shift-reduce parsers and the accompanying domain-specific languages that are often needed to practically develop them.

Some aspects of traditional parsing are still relevant, though.  To accommodate parsing without backtracking, shift-reduce parsers are usually split into two passes: the first pass, lexing, turns a sequence of characters into a sequence of tokens.  The second pass turns the sequence of tokens into a syntax tree.  The first pass can turn character sequences like “for” into a single token, so that the second parsing pass only needs to look at the next token to tell that it’s a for statement, instead of a variable name starting with “f” or any number of other alternatives.

With backtracking parser combinators, you can express grammars that need to read many characters to determine which alternative is being parsed, but it is still often simpler to do it in two passes.  The lexer can eliminate comments and insignificant white-space from the resulting token stream, which reduces complexity in the parser more than it adds complexity in the lexer.

In my language, the significant white-space is also translated in a separate pass from parsing and lexing.  Tab characters at the start of lines are converted into indent and outdent tokens, which allows the parsing to be context-free.  When the parser is trying to match a reduction in indent level, it doesn’t need to keep track of the current indent level; it just looks for an outdent token.

The lexer is also complicated by support for string subexpressions: the subexpressions within a string are recursively lexed, and so the raw output of the lexer is a tree of tokens that is flattened before being passed to the parser.

This is why string subexpressions complicate lexing

Error reporting

In my first implementation of the language in CoffeeScript, I used a special parser combinator to turn parse failures into fatal errors.  For example, you know that in your grammar that the for keyword must be followed by an identifier, so you can say that the identifier parser is required.  If that sequence parser matches the for keyword, but the following required identifier parser fails, then it is a fatal error.

However, that approach breaks the composability of the parsers.  To add that required qualifier to the identifier parser, you need to know that no other parser in the whole grammar can match the for keyword followed by a non-identifier.  It is true in this case, but it subtly effects the meaning of any parser using the for subparser.  If you wanted to add an alternative use of for to the grammar, you would have to change that required identifier parser.

When I rewrote the parser in my language, I implemented the parser combinators differently.  Whenever a parser fails, it also returns information about what it expected to parse.  The parser combinators propagate that failure information outward, but may be absorbed if an alternative parser succeeds.  Only when the failure is returned from the outermost parser does it become an error.  At that point, the failure has information about all the alternatives at the point of failure, and can be turned into a useful error message.  The rules for propagating the failures are fairly obvious, but took some trial and error for me to discover them:

  • I discard all alternative failures that matches less input than another failure, under the assumption that the parser that matched the most input is what the user intended.
  • The describeParser combinator is used to provide a friendly description of what an inner parser will match.  When the inner parser fails, if it didn’t match any input, it overrides the inner parser’s description of its expectations.  However, if the inner parser fails after matching some input, then it forms an expectation tree where the description is used as context for the inner parser’s failed expectations.
  • Because some parsers may have alternatives that succeed with a partial match of the input, they need to also return information about failed alternatives that consumed more input than the successful match.  For example, parsing an addition expression from “a+” should successfully match “a”, and return that along with a failed alternative that matched “a+” but then failed to match a term.
  • When a sequence parser fails, it needs to look at the failed alternatives for successfully matched subparsers at the beginning of the sequence, in addition to the subparser failure that caused the sequence to fail.  For example, if the expression parser in the previous point was wrapped in a sequence that expected an expression followed by the end of the input, and was applied to the same input string “a+”, then it would match “a” before failing to match “+” with the end of the input.  In that case, the failed alternative from the expression parser made it farther than the failed end of input parser, and so that is the failed expectation to return from the sequence.

Once the failure becomes fatal, it will have information about all the possible expectations at the furthest unmatched point in the input, organized in a tree with the nodes providing context for the failed expectations.  To describe this to the user, I use the description for each leaf, and the innermost node that contains all the leaves as context.

Automatically generated parse error for a statement in my language’s REPL