Decades ago in my college database class we learned about relational databases, network databases, and hierarchical databases. Back then, relational was cool. And hierarchical was definitely passé. Today, hierarchical is making a comeback with the Google App Engine (GAE). Here is a brief overview.
In GAE, persistent data for a given application consists of a set of entities. An entity has a kind: a string that designates a set of similar entities. However, entities of a given kind need not be homogeneous.
The properties of an entity are represented by a name-value map. The names are strings. The values are basic types including dates and blobs. Properties may also be multi-valued. The fact that homogeneity within kinds is not a requirement means that different entities of a given kind may have different sets of property names, and that they may have differently-typed properties for identically named properties.
Entities form distinct strict hierarchies within a data store. An entity either has a unique parent, or no parent (a root entity). A root entity and all of its descendants form a cluster of entities known as an entity group. The parent relationship between entities and the corresponding entity groups arising from this relationship play a pivotal role in the GAE data store.
One way in which the hierarchical nature of the model manifests itself is in the construction of entity keys. Within a given kind, a root entity is identified either by a unique name, a string, or by a unique ID, a long integer. So a root entity is globally uniquely identified by the combination of kind and name/ID. Let’s call this combination of kind and name/ID a simple key [my terminology]. Non-root entities are identified within their parents by unique simple keys. So in general, a key in the data model can be thought of as a path composed of simple keys. A key of length 1 uniquely identifies a root entity. A key of length 2 uniquely identifies an entity at level 2 of the hierarchy. And so on.
Another way in which the hierarchical nature of the model manifests itself is in the togetherness semantics of entity groups, which, as you will recall, consist of a root entity and all of its descendants. The underlying storage structure used to store entities is Google’s BigTable. BigTables are stored in a distributed fashion by assigning ranges of records (based on key) to different BigTable servers. The ranges are called tablets. What’s special about entity groups with respect to this type of sharding is that members of an entity group are not divided to different servers: they stay close together at all times, and are managed by a single server at any given time. A transaction involving members of the same entity group can therefore be managed by a single server and implemented simply and efficiently.
Currently GAE transactions are limited to single entity groups. GAE could, but currently does not, support transparent transaction management across multiple entity groups, by implementing distributed transactions under the covers.
Why can’t we have a dummy root entity and have the entire data store hang off of that in a single entity group? Because GAE’s design limits throughput by entity group. Specifically:
- GAE uses optimistic concurrency control for transactions, and its concurrency control algorithm operates at the root level of an entity group. The optimistic concurrency control timestamp of the root reflects the last update time of any entity in the group. So large entity groups increase the likelihood of timestamp conflicts and resulting rollbacks.
- Currently the data store cannot support more than about 10 writes per entity group per second all told. So large entity groups reduce parallelism between transactions since a transaction affecting any entity in an entity group requires a write to the group’s root entity.
One other major restriction on transactions is that queries are not supported within transactions. You can get an entity by its key within a transaction. But you cannot run a search based on property values within a transaction.
Relationships other than the special parent relationships may be represented explicitly in an entity as a property whose value is the key of the related entity. [Parent relationships, of course, are managed implicitly by GAE.] But GAE does not provide special support for relationships other than the parent relationship. No special support for traversal or joins, for example.
By default all indexable properties are automatically indexed (text and blob values are not indexed). It is also possible to explicitly create composite indexes on more than one property. Surprisingly for a search engine company, full text indexing of text fields is not supported at this time.
This completes my brief overview of the data model.
Clearly, there are a number of choices in the basic design of the GAE data model that are different from those of relational databases. Some of these, like native support for hierarchies and multi-valued properties can help you model your application more easily. Clearly too, other choices, such as the transaction restrictions, and the non-existence of native support for joins, can make your applications more complex, or less able to satisfy the real requirements of your users.
Now let’s shift attention to GAE’s high level APIs. GAE provides JDO and JPA interfaces to its persistent data. The main abstraction embodied by these high-level interfaces is the homogeneity of kinds. Since in JDO and JPA entities are persistent versions of Java objects, an entity kind in these interfaces represents a Java class that requires persistence. All entities of a given kind then necessarily have the same set of properties and corresponding property types.
Unfortunately, at the moment, the high-level interfaces do not hide the fact that transactions are limited to individual entity groups. A transaction that spans a second entity group triggers an exception, independently of which interface is used.
Nor do the high-level interfaces transparently provide joins or certain other familiar SQL functions at this time. In other words, the JDO and JPA query language variants supported by GAE are somewhat limited. A join query appearing in a JDO query, for example, will trigger an exception.
The designers of the data store API have so far avoided exposing functionality whose performance may be iffy, or which may complicate the management of the data store. So for now, we have to roll our own frameworks, or look to third party offerings to fill in the gaps: gaps between our expectations of functionality from databases, conditioned as they are by relational databases, and what the GAE can reasonably deliver.
As an example, the impression one gets is that the App Engine folks are not at this time eager to embrace transparent distributed transactions across entity groups in the base GAE product. Clearly life can get more complicated with say two-phase commit and the possibility of a distributed transaction coordinator crashing after the prepare phase of a transaction. Other developers, however, are working on distributed transactions (see, for example, this talk at Google IO 2009).
The extent to which such third party additions find success, and are worked into the standard GAE development ecosystem is something to keep an eye on over next months and years.