NYC transit map

I couldn’t find a map that showed where all of NYC passenger trains went, so I threw one together. It includes the subway (including the under construction 7 line extension and 2nd ave line), the LIRR, Metro North, PATH, and NJ Transit, though only in a very localized area around New York City.

[Image: NYC transit map]

For comparison, here’s the map with only the existing subway routes:

[Image: NYC subway map]

BrickGame

[Image: BrickGame world map]

I’m way behind on blogging about this, but in April 2014 I spent a month making a plugin for Unreal Engine 4 for rendering and interacting with Minecraft-style voxel worlds, along with a simple game called BrickGame to demonstrate it.

I released the source code for the plugin and game on GitHub. You can of course see how it all works by looking at the code, but there are some tradeoffs I made that deserve some explanation.

Regions and components

In BrickGame, the world is divided into regions, and regions into smaller render and collision components, which correspond to UE4 PrimitiveComponents. Regions are generated, and render/collision components created, within some radius around the player. The benefit of using different granularities for procedural brick generation and for render components is that a region can span the whole Z axis of the world, which allows the procedural generation to sample 2D noise functions only once for each XY column, while still dividing the world into multiple render components along the Z axis for finer-grained visibility culling and dynamic geometry updates.

Collision components are smaller still than the render components, and are only created within a much smaller radius around the player than the visible range. The small size avoids hitches when creating new collision components, and the small radius avoids limits on the total number of PhysX box elements. Since I wrote this code, UE4 has added support for PhysX triangle mesh collision created at runtime, which would avoid the small-radius problem, but would probably hurt performance and memory usage.

Geometry

The render components create a vertex buffer and an index buffer on the CPU. I don’t do anything to merge adjacent coplanar faces, but I do use two tricks to minimize the amount of vertex data:

  • UVs are derived from the world-space position using a single, planar projection that is 45 degrees off the axes that the face normals are on.
  • I render faces with different normals in separate draw calls, and use a separate vertex stream with a stride of 0 to provide a single tangent basis for all those faces.

Each vertex consumes 4 bytes: an 8-bit-per-component XYZ position in the render component’s coordinate system, plus an 8-bit ambient occlusion factor. A separate draw call is made for each potentially visible face direction in a render component, with the vertex tangent basis bound to a zero-stride vertex stream.
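To make the packing concrete, here’s a minimal sketch in TypeScript (invented names; the real plugin does this in C++ against UE4’s vertex buffers):

```typescript
// Pack one brick-face vertex into 4 bytes: X, Y, Z in the render component's
// local coordinate space (8 bits per axis is plenty for 32^3 components),
// plus the CPU-computed 8-bit ambient occlusion factor.
function packVertex(x: number, y: number, z: number, ambientOcclusion: number): number {
  return (
    ((x & 0xff) | ((y & 0xff) << 8) | ((z & 0xff) << 16) | ((ambientOcclusion & 0xff) << 24)) >>> 0
  );
}

// Example: a vertex at local position (3, 17, 40) with 75% ambient visibility.
const packed = packVertex(3, 17, 40, 0.75 * 255);
```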

This is tuned to minimize the cost of recreating the vertex and index buffers when a render component is modified or first becomes visible. As the player moves around and modifies bricks, render components must be created and updated; to maintain a smooth framerate, the amount of time the game spends doing so each frame is limited. Reducing the time needed to create the vertex and index data allows either more components to be updated each frame, or the use of larger components that improve rendering efficiency.
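The budgeting itself amounts to something like this hypothetical sketch (names invented, not the plugin’s actual code):

```typescript
// Rebuild dirty components until the per-frame time budget is exhausted;
// anything still in the queue is deferred to the next frame.
function updateDirtyComponents(dirtyQueue: Array<() => void>, budgetMs: number): void {
  const start = performance.now();
  while (dirtyQueue.length > 0 && performance.now() - start < budgetMs) {
    const rebuildBuffers = dirtyQueue.shift()!;
    rebuildBuffers(); // recreates this component's vertex and index buffers
  }
}
```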

The component sizes are configurable, but the settings I used for BrickGame use large components to minimize per-component CPU and GPU overhead, and enable large view distances. BrickGame uses 32x32x128 regions, 32x32x32 render components, and 16x16x32 collision components.

Global Illumination and Emissive Bricks

[Image: Emissive bricks]
BrickGame uses UE4’s Light Propagation Volumes for both global illumination and emissive bricks. It’s more expensive than it needs to be for a voxel world, but it was easy and mostly just works. However, I ran into problems with the reflective shadow map rendering time spiking when the sun reached the horizon, and disabling the sun’s directional light disables LPV, which causes the emissive bricks to stop working. My workaround was to add a moon that is always above the horizon, and to switch the directional light to it once the sun approaches the horizon. This keeps the directional light at a high enough angle that performance is good, and keeps the emissive bricks working.

Adding a moon required modifying UE4’s starter sky blueprint and material. I added the moon to the material, and added a directional light component to the blueprint that it uses for either the sun or the moon, depending on how high the sun is above the horizon. The moon simply circles high enough above the horizon to avoid the performance problems with the reflective shadow maps.
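The switching logic boils down to something like this sketch (the threshold and elevations are assumptions; the real logic lives in the sky blueprint):

```typescript
// Pick which body drives the single directional light. Near and below the
// horizon the light tracks the moon, which stays at a high elevation, so the
// reflective shadow maps stay cheap and LPV (and emissive bricks) stay on.
const sunElevationThresholdDegrees = 10; // assumed cutoff near the horizon
const moonElevationDegrees = 60;         // the moon never nears the horizon

function directionalLightElevation(sunElevationDegrees: number): number {
  return sunElevationDegrees > sunElevationThresholdDegrees
    ? sunElevationDegrees
    : moonElevationDegrees;
}
```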

Clouds

While I was modifying the sky material, I took the opportunity to add volumetric clouds. The clouds are ray-marched in discrete steps through a parametric noise function, using only four samples to determine opacity and four to determine lighting. This is helped greatly by jittering the samples in cooperation with UE4’s temporal AA: without the jittering, discrete cloud layers are very obvious; with it, the result looks smooth. I think the result is good, but by the time I was done I regretted the distraction from the brick-rendering part of the project.
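A toy version of the jittered opacity march (my reconstruction, with cloudDensity standing in for the parametric noise function):

```typescript
// March four samples along the view ray through the cloud layer. The sample
// positions are offset by a per-pixel, per-frame jitter so that temporal AA
// averages away the discrete layers over a few frames.
function cloudOpacity(
  rayStart: number, // distance along the ray where it enters the cloud layer
  rayEnd: number,   // distance where it exits
  jitter: number,   // in [0, 1), varies per pixel and per frame
  cloudDensity: (t: number) => number, // stand-in for the noise function
): number {
  const steps = 4;
  const stepSize = (rayEnd - rayStart) / steps;
  let transmittance = 1;
  for (let i = 0; i < steps; i++) {
    const t = rayStart + (i + jitter) * stepSize; // jittered sample position
    transmittance *= Math.exp(-cloudDensity(t) * stepSize);
  }
  return 1 - transmittance;
}
```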

Engine changes

The BrickGame code will work with completely stock UE4 code, but there are some engine changes necessary for it to work well:

Ambient Occlusion

[Image: With and without ambient occlusion]

When I initially wrote BrickGame, I decided to compute the ambient occlusion on the CPU, pass it to the material through a vertex attribute, and output it from the material where “baked” ambient occlusion would be output. The goal was to avoid the need to modify the engine (which is, so far, required if you want to change any shaders). However, I eventually realized that I wanted this baked ambient occlusion to only affect ambient lighting, not bounced lighting from LPVs, which I implemented with a one-line shader change in the engine. Ironically, later versions of UE4 removed support for material-driven ambient occlusion, so making my approach work at all now requires (small) changes to the engine code. Once it’s possible to add shaders without modifying the engine, it would be nice to come up with a GPU-based ambient occlusion solution.

Given that it eventually became necessary to modify the engine to make the CPU-based ambient occlusion work, and the amount of effort required to make the LPV-based emissive bricks work, it would have been better to just do GPU-based lighting.

Backface culling

The second engine change came later in the project, when I was focused on performance. I discovered that doing some large-scale culling of backfaces on the CPU reduces the GPU cost substantially. Doing that within the UE4 renderer’s fast path for static meshes required some changes to the engine. Epic has since added similar functionality to the engine, but it’s not exactly what I need. I’m hoping to eventually submit a pull request to Epic with a change that merges the functionality of our similar static-mesh culling changes.

Merging

After spending April intensely focused on BrickGame, I have been focused on my programming language and my non-programming projects. I’ve updated BrickGame and my engine changes to new versions of UE4 a few times, but each time it has been incredibly painful to integrate the engine changes. This is mostly an issue with my tools (and likely my lack of understanding of them): I could have done these merges in 15 minutes using P4+Araxis, but merging with SourceTree/Git simply doesn’t work, in ways I haven’t yet been able to explain. To summarize: I have a fork of Epic’s UE4 GitHub repository, and a branch within that fork that contains my changes on top of version N of UE4. When I try to merge UE4 version N+1 into that branch:

  • Git reports merge conflicts in thousands of files I’ve never touched
  • Using SourceTree to manually “accept theirs” on the conflicts doesn’t work if any of the conflicts are new files from Epic’s repo (yes, that’s apparently a conflict), so I have to manually accept thousands of files in small chunks.
  • Hunks being merged from Epic’s repo that are simple deletions aren’t merged (e.g. deleting a file, or deleting some lines of code without any nearby added lines).

This should be a very simple merge, so there has GOT to be something I’m doing wrong. If anybody has any ideas, please let me know.

SYMBL compilation pipeline

[Image: The compiler pipeline]

Despite my recent fascination with space manufacturing, I’m still focusing on my programming language, which is tentatively called SYMBL. In November and December, I wrote a bootstrap compiler in CoffeeScript, and replaced the bootstrap compiler with a self-hosted compiler. Since then, I’ve been working toward adding static bounds checking.

A lot of my effort has gone into transforming the high-level language syntax into a minimal core language. After parsing, a distiller stage reduces a syntax tree with 20 different types of non-literal nodes to a tree with only 10. The distiller turns most of the complex language features into the simplified core language (a schematic sketch follows the list):

  • Operators are translated into applications of intrinsic functions.
  • Strings with subexpressions are translated to applications of the intrinsic string concatenation function.
  • Comprehensions are translated to applications of a map function.
  • Objects are translated to records (simple objects that just define a mapping from name to value).
  • Loops are translated to tail-recursive functions.
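Here is what that kind of lowering looks like in a TypeScript sketch (invented node types; SYMBL’s real trees are richer, and this glosses over variable binding in comprehensions):

```typescript
// Surface syntax nodes (a tiny subset) and the core nodes they lower to.
type Surface =
  | { kind: "binop"; op: string; left: Surface; right: Surface }
  | { kind: "comprehension"; element: Surface; source: Surface }
  | { kind: "literal"; value: unknown };

type Core =
  | { kind: "apply"; fn: Core; args: Core[] }
  | { kind: "intrinsic"; name: string }
  | { kind: "literal"; value: unknown };

function distill(node: Surface): Core {
  switch (node.kind) {
    case "binop": // a + b  =>  apply(intrinsic "+", a, b)
      return {
        kind: "apply",
        fn: { kind: "intrinsic", name: node.op },
        args: [distill(node.left), distill(node.right)],
      };
    case "comprehension": // [f x for x in xs]  =>  apply(map, f, xs)
      return {
        kind: "apply",
        fn: { kind: "intrinsic", name: "map" },
        args: [distill(node.element), distill(node.source)],
      };
    case "literal":
      return node;
  }
}
```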

The knitter follows the distiller, and turns the core syntax tree into a graph by replacing references to local definitions with direct references to the subtrees denoting the definition’s value, producing cycles for recursive definitions. The resulting graph has only 9 different types of non-literal nodes.

The result is a term graph which has very simple language-independent semantics. Most of the language’s semantics are defined by how the distiller and knitter translate its syntax into the semantics of the term graph, combined with the simple evaluation rules for the term graph.

The distiller and knitter were initially a single stage, but I found that splitting them into two stages produced simpler code: two 250-line modules can be much simpler than a 500-line module! And while the change didn’t reduce the amount of code in the stages themselves, the tests were simplified dramatically.

[Image: The term graph for the simplest module in the compiler]

That distiller/knitter split was inspired by the success of splitting code generation into two stages. The compiler previously generated JavaScript directly from the term graph. The translation into JavaScript is simple, but it needs to turn tail-recursive functions into imperative loops (JavaScript runtimes don’t do tail-call optimization), and to accommodate the distinction between expressions and statements that is typical of imperative languages like JavaScript (e.g. if is not an expression in JavaScript).

Instead of doing that in the same stage as the JavaScript generation, I defined an Abstract Sequential Machine (ASM) that maps to a subset of JavaScript, and created a new Instructor stage that translates the term graph into the ASM. Because the ASM has the same structure as the JavaScript syntax tree, generating JavaScript from the ASM is very simple.
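For example, here is the kind of rewrite this stage has to express, shown directly in TypeScript (my illustration, not the compiler’s actual output):

```typescript
// Tail-recursive form: each recursive call is in tail position, but JS
// runtimes won't reuse the stack frame, so deep recursion overflows.
function sumTo(n: number, acc: number): number {
  if (n === 0) return acc;
  return sumTo(n - 1, acc + n); // tail call
}

// The equivalent imperative loop a compiler can emit instead: the tail call
// becomes an update of the parameters followed by another loop iteration.
function sumToLoop(n: number, acc: number): number {
  while (true) {
    if (n === 0) return acc;
    const nextN = n - 1;
    const nextAcc = acc + n;
    n = nextN;
    acc = nextAcc;
  }
}
```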

The distiller/knitter split also matches this pattern of splitting a stage into a homomorphic (structure-preserving) part and a non-homomorphic part. The distiller is a homomorphic map from the syntax tree to the core syntax tree. I’m using homomorphic loosely here: the distiller is not truly homomorphic, because the repeat and array-size shortcuts have a core-syntax translation that depends on their context. But most of the syntax nodes are translated by the distiller independent of context! This context-independence leads to simple code, and much simpler tests.

There’s a point where simpler stages aren’t worth the additional complexity of more stages, but I think these two additional stages were a good trade-off.

Space Colonization

The High Frontier

Lately I’ve been reading about the prospects for space colonization. My interest was sparked by playing Kerbal Space Program[1], and stoked by reading Gerard K. O’Neill’s book The High Frontier[2]. The High Frontier was written in the 1970s, and is showing its age, but describes a compelling vision for human colonization of space.

The High Frontier got one big thing wrong: the space shuttle program ended up being 10-50x more expensive per launch than expected. O’Neill’s proposed roadmap involved launching from Earth’s surface the components for a large mass driver[3] to be placed on the moon, which is impractically expensive without lower launch costs. From a modern perspective, there are more appealing routes to bootstrapping manufacturing in space.

Origin of elements

[Image: Stellar nucleosynthesis]

Taking a step back: all of the heavy elements present on Earth came from stars. Smaller stars (like ours, Sol) fuse hydrogen into helium, producing small amounts of carbon, nitrogen, and oxygen along the way. A larger star starts with the same reactions, but has enough gravity to ignite further fusion reactions through higher pressure.[4] The heaviest element this produces is iron, the tipping point where fusion consumes energy rather than releasing it.[5] Because fusing heavier elements doesn’t release energy to resist gravitational compression, a star massive enough to build up an iron core will eventually collapse in a supernova.[6]

Earth, and our solar system, formed from the debris of dead stars. But virtually all of the heavy elements Earth formed from have ended up in the planet’s core; the heavy elements we find in the crust primarily come from later asteroid impacts.

Asteroids

Earth is constantly bombarded by small asteroids, too small to see from the surface until they are heated by hitting the atmosphere. Regularly, though on an astronomical timescale, a medium-sized asteroid hits Earth. The most famous evidence of this is the Chicxulub crater in Mexico.[7] That crater was made by a 10km asteroid that released energy billions of times that of the atomic bomb dropped on Nagasaki, and likely caused the drastic climate change that led to the extinction of the dinosaurs.

Even a much smaller asteroid could devastate human society. But if we discover an asteroid on a collision course with Earth early enough, we can use a small amount of energy, while it is still far from Earth, to deflect it and avoid the impact. We should not wait until we discover such an asteroid to develop that capability.

[Image: Near-Earth asteroids by time]

Though the bulk of the asteroids in the solar system orbit between Mars and Jupiter, some have an eccentric orbit that brings them close to Earth’s orbit once per revolution. These are called near-Earth asteroids; nearly all of them have been discovered since 1980, with more than 90% discovered since 2000.[8]

There are many different types of asteroids, but they come from the same dead stars as Earth, and, as a whole, they contain the same elements as Earth. Most of them are small, but there are exceptions: there are more than 800 near-Earth asteroids with a diameter greater than 1km, the largest having a diameter of 34km.[9]

Some of those 1km asteroids take less energy to reach than the moon, and have huge quantities of all the elements we need to bootstrap manufacturing in space: iron, silicon, carbon, nitrogen, hydrogen, oxygen. And those elements aren’t buried deep below a planetary crust at the bottom of a deep gravity well: a human on the surface of a 1km asteroid could reach escape velocity by jumping.

Manufacturing

[Image: Manufacturing a solar power array from asteroid materials]

But instead of sending a human to tip-toe around an asteroid, we should send mining robots. Launching 1kg to low-Earth orbit costs ~$10K[10], which is a double-edged sword: it means launching even a simple mining robot will be expensive, but it also means that each kg that can be extracted from an asteroid and transported to low-Earth orbit is worth $10K.

Today, spacecraft must be launched from the surface with all of their fuel; if they could be refueled in orbit, heavier, faster, or longer missions could be launched with our existing rockets. This would eliminate one huge barrier to manned missions to Mars and other planets, where an energy-minimizing route would expose a human to a dangerous amount of cosmic radiation. Fuel can be extracted from asteroids by using solar power to heat rocks containing water ice, and electrolysis to split the water into hydrogen and oxygen.

Mining fuel is a good first step to exploiting space resources, but asteroids can provide the raw material to manufacture anything we can make on Earth’s surface. Mining elements for transport to Earth’s surface wouldn’t initially compete with today’s mining industry, but the supply of raw minerals in asteroids dwarfs what is available on Earth. According to The High Frontier[2]:

Even if we were to excavate the entire land area of Earth to a depth of a half-mile, and to honeycomb the terrain to remove a tenth of all its total volume, we would obtain only 1 percent of the materials contained in just the three largest asteroids.
(Kindle location 984)

We’ve reached a point where providing an American standard of living to everybody on Earth would despoil the Earth, and yet many Americans are still materially poor. It’s necessary to limit consumption of Earth’s resources to a sustainable level, but that should not be seen as the limit for human society. In 50 years, I hope to see a decreasing number of humans on Earth, and a rapidly increasing number in space. We only have one planet that supports life, and we shouldn’t spoil it by living here.

Habitats

One of O’Neill’s engineering contributions was to show that it is possible, without exotic materials, to build large amounts of livable area in space. He proposed a 5 mile diameter steel cylinder, spinning at a rate that would produce centrifugal force equivalent to Earth gravity.[11] It would take a huge amount of energy to launch the material for such a structure from the surface of Earth, but there are enough raw materials in the asteroids to build many times the surface area of Earth in O’Neill cylinders.
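As a rough sanity check of the required spin rate (my arithmetic, not a figure from the book): the centripetal acceleration at the rim must equal Earth gravity, so for a radius of 2.5 miles (roughly 4000m),

```latex
\omega = \sqrt{\frac{g}{r}} \approx \sqrt{\frac{9.81\,\mathrm{m/s^2}}{4000\,\mathrm{m}}} \approx 0.05\,\mathrm{rad/s},
\qquad T = \frac{2\pi}{\omega} \approx 127\,\mathrm{s}.
```

That’s about one rotation every two minutes, under half a revolution per minute, well below the ~2rpm often cited as the comfort limit for artificial gravity.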

[Image: Interior of an O’Neill cylinder]

Even if you can build such a structure, there are a lot of engineering challenges to make it livable. You have access to as much solar power as you want, but you need to radiate the resulting heat. You would need to manage the carbon cycle, transferring CO2 from the breathable air to carefully controlled food growth capsules. All waste would be broken down into its constituent elements and recycled. But by solving those problems, you can make a habitat that is perfectly tuned for human life: perfect weather, long days, and fewer natural disasters.

Another advantage to life in artificial gravity is easy access to zero-g and space. Transportation between habitats could be at high speeds without losing energy to air resistance, lift, or rolling friction. Manufacturing would benefit from using space’s vacuum to produce ultra-pure materials, and the lower energy needed to move heavy objects in zero-g.

Mars and Luna

These benefits give orbital habitats a clear advantage over colonizing the surface of Mars, or of other large bodies in the solar system. Space-based manufacturing will make it easier to get to Mars, but the energy required to escape a planetary surface makes it unappealing to settle at the bottom of Mars’s gravity well after going to great effort to climb out of Earth’s. The surface of Mars may look familiar to us, but that is deceptive: the engineering and energy costs of colonizing the surface of Mars would be at least as high as those of building orbital habitats.

[Image: Lunar base with mass driver]

In contrast to Mars, the moon has some advantages over even the asteroids as a location for extracting resources: primarily, it is always close to Earth, and it has easily accessible water in the permanently shadowed craters near its poles.[12] However, it takes more energy to escape the surface of the moon, and lunar soil doesn’t contain the same diversity of elements that asteroids do. The moon is only attractive as a stepping stone to the asteroids, yet any manufacturing techniques developed for use in lunar gravity would likely be useless in zero-g and on asteroids. O’Neill’s plan for economical removal of material from the moon relied on building a huge mass driver[3], something that would take more than a decade of manufacturing build-up to create from lunar materials, or launching an impractical amount of material from Earth. In comparison, returning material from an asteroid may require waiting several years for a good transfer opportunity, but then takes less overall energy than returning anything from the surface of the moon.

The seed

[Image: Self-assembling machine]

There are a lot of details to work out, but the work done so far leaves little doubt that the problems can be solved. The first step is to build an automated manufacturing system that can use asteroid materials to self-replicate from a “seed” launched from Earth[13].

The cost to do this is primarily in engineering. Such a seed would likely need several generations of building progressively more specialized manufacturing systems before it could achieve “closure” by developing the capability to self-replicate. But it is not necessary to design all generations before deploying the seed; only to be confident that the seed is capable of closure in a reasonable amount of time. As the seed builds each new generation of capability, engineers on Earth can continue to work on the design of the following generations.

Once the seed is capable of self-replication, exponential growth could make more resources available in space than on Earth within several decades, and sustain that growth for a thousand years. With vastly more resources in space, would humans still prefer to live on hot, crowded Earth?

References

[1] Kerbal Space Program
[2] The High Frontier on Amazon
[3] Mass driver on Wikipedia
[4] Stellar nucleosynthesis on Wikipedia
[5] Iron peak on Wikipedia
[6] Supernova nucleosynthesis on Wikipedia
[7] Chicxulub crater on Wikipedia
[8] Near-Earth object on Wikipedia
[9] 1036 Ganymed on Wikipedia
[10] Comparison of orbital launch systems on Wikipedia
[11] O’Neill cylinder on Wikipedia
[12] Lunar water on Wikipedia
[13] Advanced Automation for Space Missions NASA study

Links

Software

  • Instapaper – I’ve been using Instapaper to save longer articles to read later; it’s great when you come across something you want to read but don’t have time to read immediately. Beyond simply saving the link for later, it also has an iOS app that downloads the article for reading offline (e.g. while flying). The archive and “liked” lists are also handy for digging up past articles you’ve read; most of these links come from my Instapaper likes.
  • Kerbal Space Program – A sandbox game where you build rockets and fly them around the solar system. It’s simple enough to play without being a rocket scientist, but challenging enough to make successful flights rewarding.
  • SourceTree – A great, free Git/Mercurial GUI. This has been around for a while as a Mac app, but they only recently released a Windows version.

Technical

  • Combinatory Logic – In contrast to lambda calculus, combinatory logic is an equivalent program representation that doesn’t have explicit parameters or substitution. Instead, combinators are used to implicitly form functions. See also Iota and Jot, which use combinators to interpret any string of binary digits as a program, a simplification of an important component of Gödel’s incompleteness theorems.
  • Lambda Diagrams – A combinatory logic inspired visual representation for lambda calculus expressions.
  • Are scalars “just” degenerate matrices? – Some thoughts on the relationship between dimensionless scalars, units of measure, and linear algebra.
  • Architecture of Open Source Applications – The Glasgow Haskell Compiler – The AOSA book is good in general, but I found the GHC chapter particularly interesting.
  • Digital Physics – Almost all irrational numbers aren’t computable, but modern physics relies on them. If you assume that anything physical is computable, that implies that either our theory of computability is too narrow, or our theory of physics is too “irrational”. Digital physics is a general direction of research that tries to find computable models for physics.


Galapagos

At the beginning of March, I went on a cruise in the Galapagos. It was an interesting place to visit, and I took advantage of the unique wildlife to take a few photos. Well, more than a few: I took 5700 photos. Here are my favorite 30:

Galapagos Gallery

Significant indentation comments/strings

I recently changed my comment syntax to allow multi-line comments through indentation, which you can see an example of in the code below. If a line comment is followed by indented lines, the comment continues until the indentation returns to its initial level.

[Image: Comment syntax example]

This approach supports nesting, doesn’t require an explicit terminal symbol, and is also useful for string literal parsing.

There were two hard parts to implementing it:

  • I previously translated leading tabs into indent/outdent tokens after the lexer turned the input string into tokens.  Since comments and string literals are parsed by the lexer, I had to move that translation to before lexing (see the sketch after this list).  However, I cannot strip white-space until after lexing, so these new indentation-aware components of the lexer needed to explicitly handle superfluous white-space.  The indentation-aware components of the parser don’t need to, since parsing occurs after white-space is stripped.
  • It was tricky to make syntax highlighting in Sublime Text handle the new syntax.  The problem is that it just uses a set of regexes to classify subsequences of the buffer.  However, it does support nested regexes, so I ended up writing code to generate the syntax-highlighting definition file, using that to nest the indentation-dependent classifications within an explicitly enumerated regex for each indentation level.
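Here is a rough sketch of that pre-lexing indentation pass (TypeScript, with invented details; the real version also has to cope with the superfluous white-space mentioned above):

```typescript
// Convert leading tabs into INDENT/OUTDENT tokens by comparing each line's
// indentation against a stack of open indent levels. Downstream stages then
// match indentation structurally without tracking levels themselves.
type Tok = { type: "INDENT" | "OUTDENT" | "LINE"; text?: string };

function indentTokens(lines: string[]): Tok[] {
  const tokens: Tok[] = [];
  const levels: number[] = [0];
  for (const line of lines) {
    const indent = line.match(/^\t*/)![0].length;
    const body = line.slice(indent);
    if (indent > levels[levels.length - 1]) {
      levels.push(indent);
      tokens.push({ type: "INDENT" });
    }
    while (indent < levels[levels.length - 1]) {
      levels.pop();
      tokens.push({ type: "OUTDENT" });
    }
    tokens.push({ type: "LINE", text: body });
  }
  while (levels.length > 1) {
    levels.pop();
    tokens.push({ type: "OUTDENT" }); // close any still-open levels at EOF
  }
  return tokens;
}
```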

Objects and Implicit This

Something my language does differently from most object-oriented languages is that there’s no implicit this value for member functions.  This may seem strange at first, but it solves a few problems.

A common problem with an implicit this value is that the function must be called explicitly as a member function to receive the this value.  For example, in JavaScript, calling object.function() passes object into function as this.  One case where this goes wrong is if you want to use that function as a delegate: control.onClick = object.function.  Now, when control.onClick() is called, it will pass control to function as this instead of object.  This is sometimes useful, but it’s unintuitive behavior for a function originally declared as part of object.  JavaScript provides a way to override the this pointer passed to the function, but the fact that this may not be the object the function was declared in is a caveat most JavaScript programmers will have to deal with.
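Here’s that failure mode in runnable form (TypeScript syntax; the names are just for illustration):

```typescript
const object = {
  name: "object",
  hello(this: { name: string }) {
    return `hello from ${this.name}`;
  },
};

const control = { name: "control", onClick: object.hello };

object.hello();    // "hello from object" (called as a member of `object`)
control.onClick(); // "hello from control" (`this` is now `control`!)

// The standard workaround bakes `this` in with Function.prototype.bind:
control.onClick = object.hello.bind(object);
control.onClick(); // "hello from object"
```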

The next level of this problem is that when you have nested object declarations, this only allows you to access the innermost object’s members.  That sounds a little contrived, but it’s a real problem.  For example, if you have an object for your module’s public members, and within it you have another object literal, that nested object’s member definitions cannot access the module’s public members.

This is actually a little better in JavaScript than in languages like C++, since each object has its own map from member names to values.  Since a function value in JavaScript includes an implicit lexical closure, the member function for an object can have implicit information about its lexical environment.  That normally doesn’t include this, but it allows CoffeeScript to generate JavaScript that binds a function value to a specific value of this, so it will be the same regardless of how the function is called.  However, that solution still requires programmers to understand how JavaScript defines this, so they can decide which functions need the special this-binding function syntax, and it doesn’t solve the problems with nesting objects.  For more info, see CoffeeScript’s function binding.

The way I solved this in my language is to replace the caller-provided this with capturing the object’s members as part of a lexical closure.  Object members may be accessed within the scope of the object definition, and a member name resolves to the innermost object that defines it.  Internally, there isn’t a single this; member names are resolved against a stack of the objects in the lexical scope.

To make things a little bit clearer for everybody (humans and machines), member names must start with a . symbol, both when declared and when accessed.  Objects may also include definitions that don’t have a . prefix, but they are the equivalent of private members: they cannot be accessed outside the object definition.  When an object is used as a prototype, its public and private members are copied into the derived object, and are bound to the new object where they previously accessed the prototype’s public members.
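A rough model of the idea in TypeScript (my reconstruction, not the compiler’s actual output) shows why lexical capture sidesteps both the delegate and the nesting problems: member functions close over the defining objects directly, so how they are called is irrelevant:

```typescript
type Module = { greeting: string; inner: { greet: () => string } };

function makeModule(): Module {
  const moduleExports: Module = {
    greeting: "hello",
    inner: {
      // Reaches the enclosing object's members lexically; with an implicit
      // `this`, this function could only see the innermost object.
      greet: () => `${moduleExports.greeting}, world`,
    },
  };
  return moduleExports;
}

const m = makeModule();
const f = m.inner.greet; // extract it as a delegate
f(); // "hello, world": the binding survives however the function is called
```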

 

[Image: Member name resolution in my language]

There are still some remaining problems:

  • There aren’t aliases for the objects that are being used to resolve member names.  This is rarely needed, but it is occasionally useful.  The simplest solution would be to define this (or .) to be the innermost object context, but that doesn’t allow e.g. memberFunction to access the module object it is being used from; it can’t simply refer to myModule, as myModule may be used as a prototype for another object, and if that is how memberFunction was accessed, then that prototyped object is what you want.
  • I’m torn on using a symbolic <- for object prototyping.  Its meaning is similar to class inheritance, where the left operand is a prototype object for the right operand.  An operator is more flexible than making the prototype part of the object literal syntax, and <- makes the non-commutative nature of the operator clear, but the meaning isn’t as obvious to people who don’t know the language as using a keyword like extends would be.  But what’s an appropriate keyword that keeps the prototype as the left operand?
  • In my implementation on the JavaScript VM, each object includes an implicit function that is called when the object is used as a prototype.  The function recreates the prototype’s member definitions with the new object bound in place of the prototype object.  This function implicitly includes a closure over the lexical scope the object was created in, and, if the prototype object is itself created from a prototype, a call to that prototype’s prototyping function.  The problem is that a VM may keep the whole lexical scope alive, instead of just the variables accessed by the functions; the JavaScript VM I am using (V8) is implemented that way.  As a result, any definitions accessible from an object’s scope are kept alive as long as the object is, since they may be accessed when the object is used as a prototype.  The compiler has the information it needs to avoid this when it is unnecessary, but I don’t like to rely on that kind of optimization for intuitive performance/memory characteristics.

Parser Combinators and Error Reporting

For the last month, I’ve been working on a programming language. It’s the first language I’ve made, so I’ve been learning and having fun. It still has a long way to go, but a week ago it became self-hosted. I deprecated the bootstrap compiler (written in CoffeeScript), and haven’t looked back!

The language is similar to CoffeeScript in spirit, but is growing in a different direction. It’s dynamically typed, translated to JavaScript, has string subexpressions (what CoffeeScript calls string interpolation), and has significant white-space.

However, it also has no mutable variables, doesn’t use parentheses or commas for function calls, and has a different object model. I’m planning to incrementally add static typing: first by adding type constraints to function parameters that are verified at run-time, and then by verifying the constraints at compile time. The language already generates code with extra run-time checks to catch errors earlier than they would occur in CoffeeScript/JavaScript.

[Image: Snippet of my language]

The where term seen in this snippet is a simple feature, but it provides a useful tool for describing your program in a top-down way.  The for is similar to CoffeeScript’s; in this case it means: return an array where each element is the result of applying literalTokenParser … to the symbol strings in the provided array.  [: is a way to declare a literal array with an indented list of elements separated by newlines, with a decrease in indentation terminating the array rather than an explicit ].  There are several terms like that in the language, inspired by my experience with CoffeeScript, where you often end up with redundant closing brackets and braces when the indentation provides the same information.

But I’ll talk more about my language later; this post is mostly about parser combinator error handling!

Parser combinators

The code snippet shows a very simple example of parser combinators: litSeq is a primitive parser that matches a string in the input, literalTokenParser turns a successful match of that parser into a literal token value, and alt matches any one of the individual symbol parsers.  describeParser just provides a friendly description of the combined parser for debugging and error messages.

litSeq, literalTokenParser, alt, and describeParser are all higher-order functions: they return a function that is parameterized by the input stream to be parsed.  The type of symbolsParser is a function that maps an input stream cursor either to an output value and a new stream cursor, or to a failure.
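An illustrative TypeScript rendering of those types and two of the combinators (the real ones are written in my language; these signatures are approximations):

```typescript
type Cursor = { input: string; pos: number };
type Failure = { ok: false; expected: string; at: number };
type ParseResult<T> = { ok: true; value: T; next: Cursor } | Failure;
type Parser<T> = (cursor: Cursor) => ParseResult<T>;

// litSeq: a primitive parser matching an exact character sequence.
const litSeq = (text: string): Parser<string> => (c) =>
  c.input.startsWith(text, c.pos)
    ? { ok: true, value: text, next: { input: c.input, pos: c.pos + text.length } }
    : { ok: false, expected: JSON.stringify(text), at: c.pos };

// alt: try each alternative at the same cursor. Backtracking is implicit in
// simply handing the original cursor to the next alternative after a failure.
const alt = <T>(...parsers: Parser<T>[]): Parser<T> => (c) => {
  let furthest: Failure = { ok: false, expected: "one of no alternatives", at: c.pos };
  for (const p of parsers) {
    const r = p(c);
    if (r.ok) return r;
    if (r.at >= furthest.at) furthest = r; // remember the failure that got furthest
  }
  return furthest;
};
```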

Implicit in the alt combinator is the ability to backtrack.  One of the alternatives may partially match the input stream, but ultimately fail and fall back to trying other alternatives at the earlier location in the stream.  This is a powerful property that allows general composition of parsers: alt doesn’t need to know which characters each alternative will match in order to work.  It’s also a dangerous feature, as a grammar with too much backtracking can be slow to parse and hard to read.

But the compositional nature of alt makes it possible to declare parsers like symbolsParser that may later be combined into more complicated parsers without any knowledge of what characters it matches.  This kind of modularity allows reducing a program to its pieces, and reading and reasoning about each piece individually.  Without it, complicated software projects eventually collapse under the weight of the knowledge needed to understand any given part of the project.

This modularity is why I prefer parser combinators to the usual shift-reduce parsers and the accompanying domain-specific languages that are often needed to practically develop them.

Some aspects of traditional parsing are still relevant, though.  To accommodate parsing without backtracking, shift-reduce parsers are usually split into two passes: the first pass, lexing, turns a sequence of characters into a sequence of tokens.  The second pass turns the sequence of tokens into a syntax tree.  The first pass can turn character sequences like “for” into a single token, so that the second parsing pass only needs to look at the next token to tell that it’s a for statement, instead of a variable name starting with “f” or any number of other alternatives.

With backtracking parser combinators, you can express grammars that need to read many characters to determine which alternative is being parsed, but it is still often simpler to do it in two passes.  The lexer can eliminate comments and insignificant white-space from the resulting token stream, which reduces complexity in the parser more than it adds complexity in the lexer.

In my language, the significant white-space is also translated in a separate pass from parsing and lexing.  Tab characters at the start of lines are converted into indent and outdent tokens, which allows the parsing to be context-free.  When the parser is trying to match a reduction in indent level, it doesn’t need to keep track of the current indent level; it just looks for an outdent token.

The lexer is also complicated by support for string subexpressions: the subexpressions within a string are recursively lexed, and so the raw output of the lexer is a tree of tokens that is flattened before being passed to the parser.

[Image: Why string subexpressions complicate lexing]

Error reporting

In my first implementation of the language in CoffeeScript, I used a special parser combinator to turn parse failures into fatal errors.  For example, you know that in your grammar the for keyword must be followed by an identifier, so you can say that the identifier parser is required.  If the sequence parser matches the for keyword, but the following required identifier parser fails, then it is a fatal error.

However, that approach breaks the composability of the parsers.  To add that required qualifier to the identifier parser, you need to know that no other parser in the whole grammar can match the for keyword followed by a non-identifier.  That is true in this case, but it subtly affects the meaning of any parser using the for subparser.  If you wanted to add an alternative use of for to the grammar, you would have to change that required identifier parser.

When I rewrote the parser in my language, I implemented the parser combinators differently.  Whenever a parser fails, it also returns information about what it expected to parse.  The parser combinators propagate that failure information outward, though it may be absorbed if an alternative parser succeeds.  Only when the failure is returned from the outermost parser does it become an error.  At that point, the failure has information about all the alternatives at the point of failure, and can be turned into a useful error message.  The rules for propagating the failures seem fairly obvious in hindsight, but it took some trial and error to discover them:

  • I discard all alternative failures that matched less input than another failure, on the assumption that the parser that matched the most input is what the user intended.
  • The describeParser combinator is used to provide a friendly description of what an inner parser will match.  When the inner parser fails, if it didn’t match any input, it overrides the inner parser’s description of its expectations.  However, if the inner parser fails after matching some input, then it forms an expectation tree where the description is used as context for the inner parser’s failed expectations.
  • Because some parsers may have alternatives that succeed with a partial match of the input, they need to also return information about failed alternatives that consumed more input than the successful match.  For example, parsing an addition expression from “a+” should successfully match “a”, and return that along with a failed alternative that matched “a+” but then failed to match a term.
  • When a sequence parser fails, it needs to look at the failed alternatives for successfully matched subparsers at the beginning of the sequence, in addition to the subparser failure that caused the sequence to fail.  For example, if the expression parser in the previous point was wrapped in a sequence that expected an expression followed by the end of the input, and was applied to the same input string “a+”, then it would match “a” before failing to match “+” with the end of the input.  In that case, the failed alternative from the expression parser made it farther than the failed end of input parser, and so that is the failed expectation to return from the sequence.

Once the failure becomes fatal, it will have information about all the possible expectations at the furthest unmatched point in the input, organized in a tree with the nodes providing context for the failed expectations.  To describe this to the user, I use the description for each leaf, and the innermost node that contains all the leaves as context.
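A condensed sketch of that first discarding rule (my gloss; the real propagation also builds the expectation tree described above):

```typescript
type ParseFailure = { expected: string; at: number };

// Among alternative failures, keep only those that matched the most input.
function mergeFailures(failures: ParseFailure[]): ParseFailure[] {
  const furthest = Math.max(...failures.map((f) => f.at));
  return failures.filter((f) => f.at === furthest);
}

// Parsing "a+" as `expression, end-of-input`: the expression parser matches
// "a" but reports a failed alternative at position 2 ("expected a term"),
// while the end-of-input parser fails at position 1. mergeFailures keeps
// only the position-2 failure, so the user sees "expected a term".
```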

[Image: Automatically generated parse error for a statement in my language’s REPL]