Ensō Data

The foundation of the Ensō programming model is the representation and manipulation of data. We are borrowing ideas from database and information modeling, and turning that into a data structuring mechanism that makes sense in a programming language. When you think of data structures as mini databases some interesting things can happen. Most importantly, it encourages thinking of data holistically rather than as individual objects or individual data types. It also enables us to rethink the idea of “types” in programming languages.

Programming languages can be divided into two camps: those with static type checking and those with dynamic type checking. The static ones allow programmers to describe the structure of their data and then create and manipulate values that have the specified structure. Since these type systems are checked statically, they traditionally only describe the structure of values, not semantic constraints. It is also possible to define abstract data types, which wrap a primitive data structure with operations that enforce semantic constraints. More recently there has been extensive work on adding more semantic constraints into static types, although there is significant tension due to the need for static verification of the types. Existing dynamically typed languages have a built-in set of predefined data structures, which the programmer uses to create larger structures. In many cases these higher-level structures are never described explicitly. Dynamically-typed object-oriented languages allow programmers to wrap dynamic data with methods to enforce semantic constraints. Objects implement the constraints locally and operationally. There typically is no declarative global view of the semantic constraints that the programmer intends on the data.

Ensō adopts the descriptive and declarative view typical of static type systems, but uses dynamic checking so that arbitrary behavioral constraints can be included in the descriptions. Because we are exploring new territory, we avoid using the word “type” in talking about Ensō data. We use the term “schema” instead, borrowed from the database literature.

This leads to the first principle of Ensō, which is that all data is described by schemas. A schema specifies the properties and constraints on a kind of data. For example, if we want to describe 2D points, we might say “A point has an x and a y component, which are integers.” This corresponds to the following schema:

class Point
  x: int
  y: int

This defines a class named Point with two fields, named x and y. We use the generic term field to describe a named component of a structure. In the next post I’ll talk about how to write a grammar for points, but for now we can assume that points can be written in the normal way, e.g. (3, 6). Also, given a point P, we can access the fields in the normal way, as P.x and P.y. We will also talk about constraints later. For now I want to talk about some more complex structures and then circularity of descriptions. Continuing with larger structures, we can define rectangles and polygons:

class Rectangle
  top_left: Point
  bottom_right: Point
class Polygon
  points: Point*

The * means that a polygon has a list of (zero or more) points. This example also shows that the type of a field can be either a primitive type (int in the case of x and y) or a class (Point in the case of top_left and points). Note that some of the typical data structuring ideas, like List&lt;Point&gt;, which are usually explicit in a programming language, are implicit in Ensō. We will also talk about how to create and manipulate rectangles and polygons later. There are many things to talk about, and I’m sorry I can’t do it all at once. For now I have a specific agenda, as you will see in a minute.
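To make the idea of dynamically checked, schema-described data a little more concrete, here is a minimal sketch in Ruby. This is not the real Ensō API; the `Record` class and its conventions (a hash from field names to types, with `[T]` standing in for the `*` multiplicity) are assumptions for illustration. The point is that the check happens when data is created, not at compile time.

```ruby
# Hypothetical sketch, NOT the real Ensō API: a record whose field values
# are validated against a schema description at creation time.
class Record
  def initialize(schema, values)
    @values = {}
    schema.each do |field, type|
      v = values.fetch(field) { raise ArgumentError, "missing field #{field}" }
      # An Array in the schema (e.g. [Record]) stands in for the * multiplicity.
      ok = if type.is_a?(Array)
             v.is_a?(Array) && v.all? { |e| e.is_a?(type.first) }
           else
             v.is_a?(type)
           end
      raise TypeError, "bad value for field #{field}" unless ok
      @values[field] = v
    end
  end

  # Field access in the normal way: pt[:x], pt[:y]
  def [](field)
    @values.fetch(field)
  end
end

point_schema   = { x: Integer, y: Integer }
polygon_schema = { points: [Record] }   # a list of zero or more points

p1   = Record.new(point_schema, x: 3, y: 6)
p2   = Record.new(point_schema, x: 0, y: 0)
poly = Record.new(polygon_schema, points: [p1, p2])
p1[:x]  # => 3
```

Passing a string for `x` would raise a `TypeError` at construction time, which is the dynamic analogue of a static type error.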

The schema above is called the Point Schema or Geometry Schema. This is because it is a schema and it is about points, or geometry. We can create many kinds of schemas in Ensō, to represent many kinds of data. For example, a Genealogy Schema could describe the structure of genealogical data, a Mail Schema could describe the emails in an email program, a Diagram Schema could describe the structure of visual diagrams, and a File System Schema could describe the structure of files and directories in a file system. I will talk about all these examples, and more, some other time. But it’s not rocket science. If you know Entity Relationship (ER) modeling, or Class Diagrams in UML, then this process of creating schemas will be familiar. What might be a little strange is writing these schemas down and calling them a “programming language”. I mean, I haven’t even told you how to write “Hello World!” yet. But trust me, these schemas are programs, in the sense that they can be used to create/manipulate data. (Oh, and I never will tell you how to write Hello World… it’s not interesting). For now I want to focus on something that might be more interesting, which is circular descriptions.

These descriptions are also data. From this and principle #1, it immediately follows that schemas are described by a schema. This means we need to write down a schema that describes the structures we’ve been writing above. In programming terms, what we need is a type of types. This meta-schema needs to say something like “a class has a name and a list of fields, which specify their name and type”.

class Class
  name: string
  fields: Field*
class Field
  name: string
  type: ???

This is a good start at a schema that describes schemas. We had a little problem writing it down, because the type of a field can be either a class or a primitive type (Point or int in the examples above). The structure we need is familiar: a generic category called Type that has two specific realizations, Primitive and Class. This is a class hierarchy. One thing that Primitive and Class have in common is that they both have names, so we can put that field into the base class. Another problem with the schema is that it has two classes, which work together to describe classes. Thus it would be good to be able to organize collections of classes. Let’s introduce this into our schema:

class Schema
  types: Type*
class Type
  name: string
class Primitive < Type
class Class < Type
  fields: Field*
class Field
  name: string
  type: Type

The notation < Type means that Primitive and Class are subclasses of Type. This is now a very nice description, and it is capable of describing the Geometry Schema given above. Using YAML notation, the explanation of the Point Schema in terms of the meta-schema would look something like this:

class Point
 x: int
 y: int
Schema:
- Class:
    name: "Point"
    fields:
    - Field:
        name: "x"
        type: *int
    - Field:
        name: "y"
        type: *int
- Primitive &int
   name: "int"

Above, the first block is the point schema, and the second is an encoding of the Point Schema where its schema structure is made explicit. In our use of YAML, the top level is the class name, and then the fields are listed below. The field values can either be primitive data, or more complex fields. The label &int identifies the primitive type named “int”, which is referred to via *int for the type of the x and y fields. (Technically the label should come first, but this is pseudo-YAML, not real YAML). The schema itself is quite nice, but the explicit encoding is not a pretty thing to look at. Still, it is useful to understand the mapping between them. As another simple example, a point would look like this:

(3, 7)
Point
  x: 3
  y: 7

The first example shows that our proposed Schema Schema is able to describe the Point Schema; the second shows an individual point described by the Point Schema. Note how the complexity goes up rapidly as we move from points to descriptions of points to descriptions of descriptions of points.
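The mapping between the two levels can also be sketched in Ruby, again with assumed names rather than the real Ensō representation: the meta-schema classes become plain structs, and the Point Schema becomes an ordinary object graph built from them. (`Class` is already taken in Ruby, so the sketch calls it `SchemaClass`.)

```ruby
# Hypothetical encoding of the meta-schema as plain Ruby structs.
Primitive   = Struct.new(:name)
SchemaClass = Struct.new(:name, :fields)   # "Class" itself is taken in Ruby
Field       = Struct.new(:name, :type)
Schema      = Struct.new(:types)

# The Point Schema, built as ordinary data described by the meta-schema.
int   = Primitive.new("int")
point = SchemaClass.new("Point",
                        [Field.new("x", int), Field.new("y", int)])
geometry = Schema.new([point, int])

# Both x and y share the single Primitive object, like *int in the YAML.
point.fields.map { |f| f.type.name }  # => ["int", "int"]
```

Sharing the one `int` object between the two fields plays the same role as the `&int`/`*int` label and reference in the pseudo-YAML.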

If you think about the corollary above, that schemas are described by a schema, it should be clear that there is potential for infinite regress. If everything is described by a schema (principle #1), and schemas are described by schemas (corollary #1), then the only way to avoid regress is if some schemas describe themselves. This is what the Schema Schema must do. It is a schema that describes the structure of schemas, including itself. As it stands, there are two features used in the proposed Schema Schema that it is not able to explain: subtyping and collection fields. These are represented by the &lt; and * symbols above. It’s easy to explain collection fields, by simply adding a boolean flag to each field that specifies whether the field is single-valued or multi-valued. A super class can also be listed on a class definition:

  class Schema
    types: Type*
  class Type
    name: string
  class Primitive < Type
  class Class < Type
    super: Class
    fields: Field*
  class Field
    name: string
    type: Type
    single: bool
  primitive string
  primitive bool

Oops! We have a problem: not every class has a superclass. Thus we need to allow the superclass field to be optional. If we do this, then we also need an explanation for fields whose values are optional. We will use the symbol ? to indicate optional values, and then add an optional attribute to the description of fields:

  class Schema
    types: Type*
  class Type
    name: string
  class Primitive < Type
  class Class < Type
    super: Class?
    fields: Field*
  class Field
    name: string
    type: Type
    single: bool
    optional: bool
  primitive string
  primitive bool

For completeness, we also added descriptions for the primitive types used in the schema.
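How might the single and optional flags actually drive a dynamic check? Here is one guess at an interpretation, sketched in Ruby; the function name and the hash-based field descriptions are hypothetical, not the real Ensō checker.

```ruby
# Hypothetical interpretation of a field description's flags:
# - an absent (nil) value is accepted only if the field is optional
# - a single-valued field holds one value; a multi-valued field holds an Array
def field_accepts?(field, value)
  return field[:optional] if value.nil?
  field[:single] ? !value.is_a?(Array) : value.is_a?(Array)
end

super_field  = { name: "super",  single: true,  optional: true  }   # Class?
fields_field = { name: "fields", single: false, optional: false }   # Field*

field_accepts?(super_field, nil)      # => true: optional field may be absent
field_accepts?(fields_field, [1, 2])  # => true: multi-valued field holds a list
field_accepts?(fields_field, nil)     # => false: required field must be present
```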

Now this is a schema that describes itself. It is a slightly simplified version of the Schema Schema in Ensō. The explanation of how the Schema Schema describes itself is ugly, but it is interesting to see how the structural description contains data that is analogous to the structure:

Schema:
  types:
  - Class:
      name: "Schema"
      fields:
        - Field:
             name: "types"
             type: *Type
             single: false
             optional: false
  - Class &Type
      name: "Type"
      fields:
        - Field:
            name: "name"
            type: *string
            single: true
            optional: false
  - Class
      name: "Primitive"
      super: *Type
  - Class &Class
      name: "Class"
      super: *Type
      fields:
        - Field:
            name: "super"
            type: *Class
            single: true
            optional: true
        - Field:
            name: "fields"
            type: *Field
            single: false
            optional: false
  - Class &Field
      name: "Field"
      fields:
        - Field:
            name: "name"
            type: *string
            single: true
            optional: false
        - Field:
            name: "type"
            type: *Type
            single: true
            optional: false
        - Field:
            name: "single"
            type: *bool
            single: true
            optional: false
        - Field:
            name: "optional"
            type: *bool
            single: true
            optional: false
  - Primitive &string
      name: "string"
  - Primitive &bool
      name: "bool"

Explanations are often more complex than the simple facts. The original schema is highly circular, and this explanation makes the circularity explicit. In particular, the definition of the class Class is labeled &Class, and this label is used in the type of the super field that is inside the class Class. Remember that Ensō means circle :-)
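The circularity can be made concrete in a few lines of Ruby. This is only a sketch with assumed names (`Klass` standing in for the class Class, since `Class` is taken in Ruby); the point is that the object describing class Class contains a field whose type is that very object.

```ruby
# Hypothetical sketch of the circular part of the Schema Schema.
Primitive = Struct.new(:name)
Klass     = Struct.new(:name, :super, :fields)   # "Class" is taken in Ruby
Field     = Struct.new(:name, :type, :single, :optional)

string  = Primitive.new("string")
type_c  = Klass.new("Type", nil,
                    [Field.new("name", string, true, false)])
class_c = Klass.new("Class", type_c, [])

# The super field's type points back at its own containing class,
# just as *Class refers to &Class in the pseudo-YAML.
class_c.fields << Field.new("super", class_c, true, true)

class_c.fields.first.type.equal?(class_c)  # => true: the circle closes
```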

This kind of meta-circular description is well known in the literature. It is used in Smalltalk and in the Meta Object Facility (MOF) of UML. It has also been studied in programming languages. Some languages have explored the idea of allowing types to have types, and the possibility of a type of types. There might be theoretical issues with having a type of types, but we are having too much fun programming with Ensō to worry about it. Also, in Ensō we merely say that the schema describes itself. We don’t think of schemas as “sets of values”, so there is no issue of a set that contains itself.

The Schema Schema given above is not exactly the one we use in Ensō right now, but it is close. I’ll talk about some other features over the next few weeks. We are implementing Ensō in Ruby, but last week I played with encoding the Schema Schema in Haskell. I’m not sure what we could do with it yet, but it is fun to look at.

Comments (3)

  1. Existing dynamically typed languages have a built-in set of predefined data structures, which the programmer uses to create larger structures. In many cases these higher-level structures are never described explicitly. Dynamically-typed object-oriented languages allow programmers to wrap dynamic data with methods to enforce semantic constraints. Objects implement the constraints locally and operationally. There typically is no declarative global view of the semantic constraints that the programmer intends on the data.

    I don’t like your ensuing Point and Rectangle examples. They don’t really demonstrate this section of the blog entry well. It is a shame, since your description here is very clear and an accurate portrayal of how I see JavaScript developers use prototypes. For example, in Ext.js, the Data Grid control does not statically know the type or even a “class” for a data source it uses to materialize its view. It doesn’t know if it is JSON or XML or CSV or one of those crazy JavaScript in-memory database abstractions, whatever. Instead, the programmer plugs in an iterator that tells the Data Grid how it can access the data source. This yields surprisingly flexible and interesting higher-order behavior. For example, there is now a very clear forced relationship between the columns displayed and the columns retrieved by the iterator, and they can actually re-use the same schema definition modulo a “dual number” that “carries forward” extra semantic content needed in each context. Further, the Data Grid can now model distributed ad-hoc queries on multiple data sources, to do pivoting and unpivoting of the data on demand. This ends up being a lot cleaner model than most “Data-bound Data Grid Controls” you see, like the ones Microsoft develops for both the desktop and browser. Since the definition for configuring the Data Grid is purely schematic, saving configurations and even diffing them, such as on a per user or per environment basis, is cheap. The only flaw in this approach is the semantic constraints you mention. Building hyperlinked structures, as has been done using Prolog to describe semantic navigation in databases, isn’t really done today. As a result, frameworks like Ext.js actually have multiple Data Grid objects all to handle different corner cases on grouping, roll-ups, etc., rather than dispatching these details dynamically based on those semantic constraints.

    Monday, May 23, 2011 at 11:10 am #
  2. In many ways this reminds me of the Microsoft Research work on semantic subtyping, which also takes a very database-centric view of type systems, and which uses refinement typing (on top of a first-order logic language for describing refinements) to enable a very expressive type system. They also use an SMT solver as part of their typechecker. The best paper on it is probably Semantic Subtyping with an SMT Solver. I would be very interested to hear your take on how Ensō compares.

    Monday, May 23, 2011 at 7:44 pm #
  3. Damien Cassou wrote::

    Dear William,

    I guess the following code snippet from your post has a typo:


    class Field
      name: string
      type: Type
      single: bool*

    single is a boolean, not a list of booleans.

    Monday, May 30, 2011 at 11:57 pm #