Ensō Grammars

I have not had much time to write about Ensō recently, because I’ve been too busy building Ensō. Bruno Oliveira visited last week, and we worked out some interesting new ideas for transforming circular structures. There are cases where the input is circular, cases where the output is circular, and cases where the operation itself is circular (requiring a fixpoint). Bruno is working in a pure functional setting, but we are doing analogous things in the Ruby implementation of Ensō. Tijs van der Storm is still here until the beginning of July. Alex Loh, a PhD student in my group, has also joined the team. Things are going well, although my implementation of the diagram editor is going a little more slowly than I had hoped. But no problems yet.

Last time I talked about schemas and said I would talk about parsing this time. It turns out that parsing is still an interesting topic, even though it has been worked on extensively. Here is the simple schema for points that I mentioned in the last post:

class Point
  x: int
  y: int

A grammar to recognize points in standard notation is easy to define:

start P
P ::= "(" int "," int ")"

As is typical, an Ensō grammar is a list of rules with a start symbol. The right-hand side of a rule contains literal strings "(", "," and ")" and generic tokens, e.g. int. The Ensō tokenizer has predefined syntax for strings, identifiers, and numbers, but allows the grammar to define arbitrary operator/punctuation tokens. This grammar can recognize strings of the form (-12, 42). To be useful, however, a parser must create objects, not just recognize syntax. The following grammar creates points:

start P
P ::= [Point] "(" x:int "," y:int ")"

The expression [Point] means that the rule creates a point. The notation x:int means that an integer is recognized, then assigned to the x field of the current object being created. The end result is that the rule creates a point and assigns its x and y fields.

This example illustrates that an Ensō grammar can create objects directly, rather than creating an abstract syntax tree (AST) as in most grammar formalisms. In other words, the result of parsing is a point, not syntax that later creates points. Ensō grammars only create syntax trees when they are parsing a schema that represents the syntax of a language. Points aren’t syntax, so there is no need to create syntax trees when parsing them.

The second important fact about Ensō grammars is that they are bidirectional. The grammar can also be used to render a point into a text string. Under this interpretation of the grammar, the expression [Point] is a condition or predicate that matches objects of type Point. The expression x:int means to access the x field of the current object and render it as an integer. The result is a string of the form (5, -3).

The bidirectional nature of grammars is a result of the weaving of two different languages in the grammar notation: the syntactic structure, represented by literals "(", ",", ")" and int tokens, and the instantiation structure, represented by class names [Point] and fields x: and y:. A parser matches the syntactic structure and creates the object instantiation structure. A renderer matches the object instantiation structure and generates the syntactic structure.
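
To make the two interpretations concrete, here is a minimal Ruby sketch (Ruby is the language of the current Ensō implementation) of the point rule run in both directions. Everything in it, the Point struct, the parse_point and render_point methods, and the regular expression standing in for the tokenizer, is invented for illustration; the real parser and renderer are generic interpreters driven by the grammar as data, not hand-written methods like these.

# A hand-specialized sketch of the two readings of
#   P ::= [Point] "(" x:int "," y:int ")"
Point = Struct.new(:x, :y)

# Parsing direction: match the syntactic structure, build the object.
def parse_point(text)
  m = /\A\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)\z/.match(text)
  raise "parse error: #{text.inspect}" unless m
  Point.new(m[1].to_i, m[2].to_i)   # [Point] creates; x:int, y:int assign
end

# Rendering direction: match the object structure, generate the syntax.
def render_point(obj)
  raise "not a Point" unless obj.is_a?(Point)  # [Point] acts as a predicate
  "(#{obj.x}, #{obj.y})"                       # x:int, y:int now read fields
end

p parse_point("(-12, 42)")           #=> #<struct Point x=-12, y=42>
puts render_point(Point.new(5, -3))  #=> (5, -3)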

In my previous post I presented several schemas and introduced a notation for describing schemas. We can now formally specify this notation. For convenience, here is a version of the final schema that we introduced, the Schema Schema.

class Schema
  types: Type*
class Type
  name: string
class Primitive < Type
class Class < Type
  super: Class?
  fields: Field*
class Field
  name: string
  type: Type
  many: bool
  optional: bool
primitive string
primitive bool

The Schema Grammar defines the language in which this schema is written. It looks like this:

start Schema
Schema ::= [Schema] types:Type*
Type ::= Primitive | Class
Primitive ::= [Primitive] "primitive" name:str
Class ::= [Class] "class" name:str Super? fields:Field*
Super ::= "<" super:^Class
Field ::= [Field] name:str ":" type:^Type (optional:"?" | many:"*")

The rules correspond fairly closely to the structure of the schema itself, so it is convenient to use the same names where appropriate. It can also be confusing. Just keep in mind that the names on the left of ::= are the names of nonterminals, and they could be renamed to be any names at all. There is a special case for boolean fields: the notation optional:"?" sets the optional field to true if the symbol "?" is present. There are two places where schema class names show up in the grammar. One is an object instance expression, e.g. [Field]. The other is an object reference expression, e.g. ^Type, described below. Here is an equivalent grammar, where the nonterminals have all been renamed.

start S
S ::= [Schema] types:T*
T ::= P | C
P ::= [Primitive] "primitive" name:str
C ::= [Class] "class" name:str U? fields:F*
U ::= "<" super:^Class
F ::= [Field] name:str ":" type:^Type (optional:"?" | many:"*")

Again, parsing a text file with this grammar does not create abstract syntax; it creates an actual schema.

The reference syntax ^Class allows the parser to automatically create circular references in the resulting structure. The expression ^Class means that a name for a class is parsed; then, during object construction, the parser replaces the name with a pointer to the Class with that name. During rendering, the opposite happens: the Class is rendered by printing its name. This brings up the question of how the grammar knows what field to use as the name of a class. It doesn’t simply assume a field called name. Instead, the real Schema Schema has an extra attribute on fields that specifies which field acts as the key for identifying objects. In the class Class, the field named name is marked as the key field. (I’m sorry for that last sentence. It is typical Ensō-speak, brought on by using the words “name” and “class” as categories and instances in the same sentence).
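
To give a feel for how such references can be resolved, here is a hedged Ruby sketch of the usual two-phase approach: the parser records the raw name, and a later fixup pass replaces it with a pointer, using the key field to look objects up. The Klass struct and the resolve_refs helper are hypothetical names invented for this illustration, not the actual Ensō code.

# Sketch of resolving a reference like super:^Class in two phases.
# Phase 1 (parsing) stores the raw name in super_name; phase 2 patches
# it into a pointer, using the key field (here simply name) for lookup.
# (Klass, because the constant Class is already taken in Ruby.)
Klass = Struct.new(:name, :super_name, :super_class)

def resolve_refs(classes)
  by_key = classes.to_h { |c| [c.name, c] }   # key field identifies objects
  classes.each do |c|
    next unless c.super_name
    c.super_class = by_key.fetch(c.super_name) do
      raise "unresolved reference: ^#{c.super_name}"
    end
  end
end

classes = [Klass.new("Type"), Klass.new("Class", "Type")]
resolve_refs(classes)
p classes[1].super_class.name   #=> "Type"
# Rendering goes the other way: print super_class.name, not the object.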

Notice that not every rule has a constructor. For example, the Super rule (aka U) just reads in part of a class object, so it does not create any objects on its own. Of course, the Super rule could also be inlined into the Class rule:

Class ::= [Class] "class" name:str ("<" super:^Class)? fields:Field*

In fact, everything could be inlined in this grammar, because it is not recursive. But then it would be a little more difficult to read:

[Schema] types:
    ( [Primitive] "primitive" name:str
    | [Class] "class" name:str ("<" super:^Class)? fields:
        ( [Field] name:str ":" type:^Type (optional:"?" | many:"*") )*
    )*

This grammar is just a regular expression. Ensō grammars include regular expressions as a special case.

The parser is a grammar interpreter, not a grammar compiler. The underlying technology is called Generalized LL (GLL) parsing. It can parse arbitrary context-free grammars, including ones with left recursion. The overall parse must be unambiguous, but there is no need for the syntax to be locally unambiguous under fixed lookahead. I hope that Tijs will write a blog post at some point explaining how the parser actually works. The basic algorithm is not very interesting, but the way it works with the rest of Ensō has some interesting points.
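
To illustrate just the “interpreter, not compiler” point, here is a toy Ruby sketch in which the grammar is an ordinary data structure and parsing is a walk over that data. This naive matcher is emphatically not GLL; it handles only sequences of literals and int tokens, with no alternatives, no left recursion, and no ambiguity, and the names GRAMMAR and interpret are invented for the example.

# The grammar is plain data; the "parser" is an interpreter over it.
GRAMMAR = {
  "P" => [[:lit, "("], [:int], [:lit, ","], [:int], [:lit, ")"]]
}

def interpret(rule, tokens)
  GRAMMAR.fetch(rule).all? do |kind, arg|
    tok = tokens.shift
    case kind
    when :lit then tok == arg                     # match a literal string
    when :int then tok.to_s.match?(/\A-?\d+\z/)   # match an int token
    end
  end
end

toks = "( -12 , 42 )".split            # a stand-in for the real tokenizer
p interpret("P", toks) && toks.empty?  #=> true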

After my last post, some people got the impression that Ensō is a data description language. With this post, I’ve extended that impression slightly to data description and formatting. We are creating lots of different modeling languages in Ensō, so one way to understand Ensō is as an environment for creating and integrating multiple domain-specific languages. Ensō is a full programming model, so there has to be some code somewhere. Unlike most approaches to model-driven development, we don’t generate any code in Ensō. Instead we directly interpret the models to create useful behaviors. This is more direct, because it’s the behavior we want, not the intermediate code.

For the foundational models, like schemas and grammars, we have hard-coded Ruby interpreters. The Ruby code looks pretty good, but it’s not exactly ordinary OO Ruby code, because the data it manipulates is Ensō data. As we move into interpreters for higher-level models, we are finding a need for higher-level programming patterns that manipulate circular structures. This code may be functional in nature, or more imperative. Writing in a functional style is nice, but the circularity of Ensō data means that ordinary functional programs, based on induction, don’t work correctly. We have developed some nice techniques for transforming cyclic structures. For imperative code, we are exploring ideas from Jonathan Edwards about decoupling imperative operations from the sequencing of those operations. Some of these ideas are still in development, but they are coming along nicely.
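
The details deserve their own post, but the flavor of the cyclic-structure techniques is easy to show. One standard trick is to memoize a transformation on object identity, creating the result before recursing so that a back-edge finds the partially built result and the cycle is tied off. The following Ruby sketch of copying a cyclic graph is my own illustration of that trick, not code from Ensō.

# Naive structural recursion loops forever on a cycle; memoizing by
# object identity makes it terminate. The memo entry is created BEFORE
# the recursive call, so a back-edge reuses the partially built copy.
Node = Struct.new(:name, :succ)

def deep_copy(node, memo = {})
  return memo[node.object_id] if memo.key?(node.object_id)
  copy = Node.new(node.name)        # allocate first...
  memo[node.object_id] = copy       # ...register it...
  copy.succ = deep_copy(node.succ, memo) if node.succ  # ...then recurse
  copy
end

a = Node.new("a"); b = Node.new("b")
a.succ = b; b.succ = a              # a cycle: a -> b -> a
c = deep_copy(a)
p c.succ.succ.equal?(c)             #=> true: the cycle is preserved

Next time I’ll write about the Grammar Grammar, which defines the syntax of grammars.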
