Understanding Unfamiliar Databases

A user recently asked:

What kind of approaches and techniques can you employ to become familiar with an existing database if you are tasked with supporting and/or modifying it? How can you easily and effectively ramp up your knowledge of a database you have never seen before?

Here was my reply:

  • The first thing I do is create an Entity-Relationship Diagram (ERD). Sometimes you can simply describe the metadata with command-line tools, but to save time there are tools that can generate a diagram automatically.

  • Second, I examine each table and column and make sure I learn the meaning of what it stores.

  • Third, examine each relationship and make sure I understand how the tables relate to one another.

  • Fourth, read any views or triggers to understand custom data integrity enforcement or cascading operations.

  • Fifth, read any stored procedures. Also read the SQL access privileges, if any are defined.

  • Sixth, read through parts of the application code that use the database. That’s where some additional business rules and data integrity rules are enforced.
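
As a rough sketch of the first three steps above (the `authors`/`books` schema is hypothetical), SQLite's introspection pragmas can enumerate tables, columns, and relationships; most other DBMSs expose the same information through `information_schema`:

```python
import sqlite3

# Hypothetical schema to inspect; a real database works the same way.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(author_id)
    );
""")

# Steps one and two: list every table and its columns.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = {t: [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
           for t in tables}

# Step three: list the foreign-key relationships between tables.
relationships = []
for t in tables:
    for fk in conn.execute(f"PRAGMA foreign_key_list({t})"):
        # fk[2] = referenced table, fk[3] = local column, fk[4] = referenced column
        relationships.append(f"{t}.{fk[3]} -> {fk[2]}.{fk[4]}")

print(columns)
print(relationships)
```

Feeding output like this into a diagramming tool is essentially what the automatic ERD generators do.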

I’m posting to my blog the questions I’ve answered on StackOverflow that earned the “Good Answer” badge. This was my answer to “What are the Best Ways to Understand an Unfamiliar Database?”

Why Should You Use an ORM?

A user recently asked for good arguments in favor of using Object/Relational Mapping technology:

If you were to motivate [sic] the “pro’s” of why you would use an ORM to management/client, what would the reasons be?

Try and keep one reason per answer so that we can see what gets voted up as the best reasons.

I offered four answers. The first three got the most votes, but my last answer got little interest.

  1. Speeding development. For example, eliminating repetitive code like mapping query result fields to object members and vice-versa.
  2. Making data access more abstract and portable. ORM implementation classes know how to write vendor-specific SQL, so you don’t have to.
  3. Supporting OO encapsulation of business rules in your data access layer. You can write (and debug) business rules in your application language of preference, instead of clunky trigger and stored procedure languages.
  4. Generating boilerplate code for basic CRUD operations (Create, Read, Update, Delete). Some ORM frameworks can inspect database metadata directly, read metadata mapping files, or use declarative class properties.
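
To illustrate point 1, here is a sketch of the repetitive result-to-object mapping code that an ORM eliminates. The `User` class and `users` table are hypothetical:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    """Hypothetical domain object mirroring a `users` table."""
    user_id: int
    name: str
    email: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'alice@example.com')")

# Without an ORM, every query needs hand-written field-by-field mapping:
row = conn.execute(
    "SELECT user_id, name, email FROM users WHERE user_id = ?", (1,)).fetchone()
user = User(user_id=row[0], name=row[1], email=row[2])

# ...and the reverse mapping for every write:
conn.execute("UPDATE users SET name = ?, email = ? WHERE user_id = ?",
             (user.name, user.email, user.user_id))
```

Multiply this by every table and every query in an application, and the time savings of generating it automatically become clear.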

There are lots of other reasons for and against using ORM frameworks. Generally, I’m not a fan of ORMs, because their benefits don’t seem to make up for their complexity and tendency to perform slowly. Their chief value is in reducing the time taken in repetitive development tasks.

Hibernate, for example, is about 800,000 lines of code (Java and XML), but it’s complex enough that I doubt it’s easier to learn or to use than SQL. Besides, there seem to be fundamental tasks, such as a simple JOIN, that are impossible to do through the entity interface. Please correct me if I’m wrong, but I’ve been searching tutorials and examples and I haven’t found a way to fetch a joined result set from two entities without writing a custom query in HQL (Hibernate’s abstract version of SQL).

I was also led to a blog by Glenn Block, titled “Ten advantages of an ORM (Object Relational Mapper).” I disagree with Block on several points. He cites some traits of ORMs as advantages where I see them as defects. He also cites features that are not specific to ORMs; they could be achieved with any type of data access library.

update: Upon request, here are some specific comments on Glenn Block’s list of advantages of an ORM:

1. Facilitates implementing the Domain Model pattern

Not necessarily. I can design Domain Model classes containing plain SQL as easily as I can design classes that operate on the database via an ORM layer. Keep in mind that ActiveRecord is not a Domain Model.

2. Huge reduction in code.

Depends. When executing simple CRUD operations against a single table, yes. When executing complex queries, most ORM implementations fail spectacularly compared to the simplicity of using SQL queries.

3. Changes to the object model are made in one place.

This is not a benefit of an ORM. Many people use ORM interfaces inexpertly, so when the database structure changes, they still have to update many places in their application to reflect the change. But instead of redesigning SQL queries, they have to redesign usage of the ORM. There is no net win. They could structure their application using plain SQL queries and still be as likely to achieve the benefit of DRY.

4. Rich query capability.

Absolutely wrong.

5. You can navigate object relationships transparently.

This is definitely a negative rather than a positive. When you want a result set to include rows from dependent tables, do a JOIN. Doing the “lazy-load” approach, executing additional SQL queries internally when you reference columns of related tables, is usually less efficient. Leaving it up to the ORM internals deprives you of the opportunity to decide which solution is better.
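
The efficiency difference is easy to demonstrate with a sketch (the `orders`/`customers` schema is hypothetical): lazy loading issues one extra query per row, the classic N+1 pattern, while a JOIN fetches the same result in a single round trip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id));
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# "Lazy load" style: 1 query for the orders + 1 query per order for the name.
queries = 0
orders = conn.execute("SELECT order_id, customer_id FROM orders").fetchall()
queries += 1
lazy = []
for order_id, customer_id in orders:
    name = conn.execute("SELECT name FROM customers WHERE customer_id = ?",
                        (customer_id,)).fetchone()[0]
    queries += 1
    lazy.append((order_id, name))

# JOIN style: the same rows in one query, one round trip.
joined = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()

print(queries)   # 4 queries lazily, versus 1 with the JOIN
```

With three orders the difference is trivial; with thousands of rows over a network connection, it isn’t.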

6. Data loads are completely configurable …

This is not a benefit of an ORM. It is actually easier to achieve this using plain SQL.

7. Concurrency support.

Again, not a benefit of an ORM.

8. Cache management.

This has nothing to do with using an ORM. I can cache data using SQL.

9. Transaction management and Isolation.

Also has nothing to do with using an ORM versus a more direct DAL.

10. Key Management.

Ditto.

I’m posting to my blog the questions I’ve answered on StackOverflow that earned the “Nice Answer” or “Good Answer” badges. This was my answer to “Why Should You Use An ORM?”

The Next-Gen Databases

A user recently asked:

I’m learning traditional Relational Databases (with PostgreSQL) and doing some research I’ve come across some new types of databases. CouchDB, Drizzle, and Scalaris to name a few, what is going to be the next database technologies to deal with?

SQL is a language for querying and manipulating relational databases, and it is governed by an international standard. Though the standard is revised periodically, it always seems to work within the relational database paradigm.

Here are a few new data storage technologies that are getting attention currently:

  • CouchDB is a non-relational database. They call it a document-oriented database.
  • Amazon SimpleDB is also a non-relational database accessed in a distributed manner through a web service. Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services.
  • Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
  • BigTable is a proprietary data storage solution used by Google, and implemented using their Google File System technology. Google’s MapReduce framework uses BigTable.
  • Hadoop is an open-source technology inspired by Google’s MapReduce and serving a similar need: distributing the work of very-large-scale data stores.
  • Scalaris is a distributed transactional key/value store. Also not relational, and does not use SQL. It’s a research project from the Zuse Institute in Berlin, Germany.
  • RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language SPARQL, which resembles SQL superficially, but is actually totally different.
  • Vertica is a highly scalable column-oriented analytic database designed for distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon’s Elastic Compute Cloud.
  • Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
  • XML isn’t a DBMS at all, it’s an interchange format. But some DBMS products work with data in XML format.
  • ODBMS, or Object Databases, are for managing complex data. There don’t seem to be any dominant ODBMS products in the mainstream, perhaps because of lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
  • Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable “cloud computing” system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.

Relational databases have weaknesses, to be sure. People have been arguing that the relational model doesn’t handle all data modeling requirements since the day it was first introduced.

Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don’t fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers, instead of central database servers.

Even though these advanced technologies do great things to solve the specialized problem they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn’t going away.

I’m posting to my blog the questions I’ve answered on StackOverflow, which earned the “Good Answer” badge. This was my answer to “The Next-Gen Databases.”

ActiveRecord does not suck

I’ve been reading a few blog postings such as Kore Nordmann’s ActiveRecord sucks and Mike Seth’s ActiveRecord sucks, but Kore Nordmann is wrong.

ActiveRecord is fine.  It is a tool that does just what it’s designed to do.  What sucks is when developers try to make it do other things than what it’s intended to do.

I worked for Zend, managing the Zend Framework project through its 1.0 release.  I also completed the implementation and documentation of Zend_Db and its related components. To set the record straight, Zend_Db does not implement the ActiveRecord pattern. It implements the Table Data Gateway and Row Data Gateway patterns, which taken together offer similar value to the ActiveRecord pattern.

I totally agree with Mike Seth that MVC should not be taken as “ActiveRecord-View-Controller.” I tried to make this point in documentation, screencasts, conference presentations, and in many mailing list messages, but I met with little success.

Unfortunately, the Ruby on Rails drumbeat that Models are simply database table wrappers has established momentum.  The term “Model” has (incorrectly) become synonymous in many developers’ minds with “ActiveRecord.”  Since Models by this definition are tied to database access and largely implement various query techniques, Ruby on Rails development forces you to write a large amount of code in controller classes that should properly be written in Model classes.

This has a few consequences. Unit-testing controller classes becomes very complex, since that’s where the majority of your application code resides.  To test a controller class you need to mock HTTP requests and sift through HTML output.  Some of that work is unavoidable for controllers, but it multiplies when the controller holds most of the application logic.  If the application code were separated into a true Model, then unit-testing the controller would simply be testing whether the HTTP request had been communicated to the Model correctly.  Testing the behavior of the Model would be much more straightforward unit-testing of a class API in isolation, requiring no mock HTTP requests or scraping of HTML output.

Also, unit-testing Rails-style Model classes is difficult, since the Model is coupled with the database.  We start to see unfortunate things like database fixtures becoming necessary before you can execute the simplest tests against your Model class.  This makes testing Models time-consuming, error-prone, and slow.

If developers were to separate Models and Controllers properly, and separate data access components from Models, unit-testing all of these classes could be done more simply, and with greater isolation from other classes.  This makes it easier to diagnose defects, when they occur.  Isn’t this the point of unit tests?

A Model is a class that provides a logical component of your application domain.  Models are products of OO design, which is a development activity I see get very little attention in the developer blogosphere or the developer tools market.  Developers seem more enchanted by evangelizing their favorite code editor or debugger, or by squeezing more performance out of their database, than by mental analysis to make sure their OO architecture is modeling its application requirements well.

A single Model class may be backed by a database table, or multiple database tables, or perhaps even no database tables.  Data persistence should be an internal implementation detail within a Model; the external API of the Model class should reflect its logical OO requirements, not the physical database structure.

(update) What I often tell people is that the relationship between a Model and an ORM class should be “HAS-A” rather than “IS-A.”  The latter is the assumption of Rails and other frameworks who are enticed by ActiveRecord.  If the Model uses ORM objects instead of inheriting from ORM classes, then you can design the Model to contain all data and code for the domain it’s supposed to model — even if it takes multiple database tables to represent it.
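
A minimal sketch of the HAS-A design (all class and method names here are hypothetical): the Model composes gateway objects rather than inheriting from one, so its public API reflects the domain, not the tables, and unit tests need no database at all:

```python
class UsersGateway:
    """Table Data Gateway for a hypothetical `users` table (stubbed here)."""
    def __init__(self):
        self.calls = []
    def deactivate(self, user_id):
        # A real gateway would run: UPDATE users SET active = 0 WHERE user_id = ?
        self.calls.append(user_id)

class BillingGateway:
    """Gateway for a hypothetical `subscriptions` table (stubbed here)."""
    def __init__(self):
        self.calls = []
    def cancel_subscriptions(self, user_id):
        self.calls.append(user_id)

class Account:
    """Domain Model: one logical Account spans several tables."""
    def __init__(self, users, billing):
        # HAS-A: gateways are injected, so tests can substitute fakes.
        self._users = users
        self._billing = billing

    def close(self, user_id):
        # One domain operation, two tables, hidden behind one API.
        self._users.deactivate(user_id)
        self._billing.cancel_subscriptions(user_id)

users = UsersGateway()
billing = BillingGateway()
account = Account(users, billing)
account.close(42)
```

Because `Account` merely uses the gateways, swapping the stubs for real Zend_Db-style gateways (or ORM objects) changes nothing in the Model’s public interface.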

Many developers have complained that Zend Framework provides no base Model class. Of course it doesn’t provide a base Model class! That’s your job. This complaint is like saying that Microsoft Word sucks because it doesn’t provide finished documents for you.

So I wouldn’t say ActiveRecord sucks.  I would say that people are expecting ActiveRecord to magically solve their OO design for them, in a false quest to avoid doing the design themselves.

In Support of the Relational Model

Every few years, a wave of people claim that the relational model falls short of modeling data structures of modern applications. This includes some really smart people recently such as:

  • Jim Starkey
  • Brian Aker
  • MIT researchers

This trend of criticizing the relational model started almost immediately after E. F. Codd published his seminal paper, “A Relational Model of Data for Large Shared Data Banks” in 1970.

Some data modeling tasks are just complex by nature, and it is necessarily a difficult, time-consuming analysis to figure out how to structure the storage for a given application in a manner that is both logically complete and also efficient.

The complexity does not originate from the technology; the complexity originates from nontrivial real-world scenarios, which are full of complexity, inconsistency, and special-case exceptions. It isn’t that the database modeling technology is insufficient, it’s that the job is genuinely hard.

This is an example of what I call a Waterbed Problem. When you’re lying on a waterbed, and you push down with your hand to make part of the bed lower, the water inside is displaced and rises in some other area. Even if you get a lot of friends to help you with the waterbed (okay this is starting to get racy but don’t worry), the water is still mostly non-compressible, so naturally no matter how hard you push, you only displace the water, you can’t make it shrink.

The waterbed analogy applies when we try to simplify a computer organization problem. We “push down” on the complexity of the task by inventing new methods for describing the problem. Sometimes this leads to nifty things like domain-specific languages. But generally there is a trade-off. We make one part of the problem easier to solve, at the cost of making another part of the problem less flexible, or harder to solve. The analogy is that the complexity rises in some other area, in response to our attempt to push the complexity down in the area we’re working on currently.

We might even create a simple solution for the problem at hand, but the new solution is simply unable to address some related problems. This might be analogous to a leak in the waterbed, but probably I’ve exhausted this analogy by now.

Flexibility of the data modeling solution is not the only criterion. It must also support rigidity where needed. If you can’t define the structure of your data model enough to disallow invalid data, then you’re asking for trouble. If the data management is just a pool of data elements and relationships, and any enforcement of structure is dependent on application logic, then your data integrity is at risk of bugs in that application, or susceptible to invalid changes made via a different client application.
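
Here is a small sketch of that rigidity in practice (the `accounts`/`transfers` schema is hypothetical, and SQLite syntax is used): declarative constraints reject invalid data no matter which client application attempts the write, so integrity doesn’t depend on every application getting its logic right:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    NUMERIC NOT NULL CHECK (balance >= 0)  -- structural rule
    );
    CREATE TABLE transfers (
        transfer_id INTEGER PRIMARY KEY,
        account_id  INTEGER NOT NULL REFERENCES accounts(account_id)
    );
    INSERT INTO accounts VALUES (1, 100);
""")

# Valid data goes in; invalid data is rejected regardless of the client:
rejected = []
for sql in [
    "INSERT INTO accounts VALUES (2, -5)",    # violates the CHECK constraint
    "INSERT INTO transfers VALUES (1, 999)",  # violates the FOREIGN KEY
]:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError as e:
        rejected.append(str(e))

print(rejected)
```

In a schemaless pool of data elements, both of those bad rows would have gone in silently, and every client would have to re-implement the same checks.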

I believe the relational model strikes a good balance between flexibility and rigidity, and that’s why it has been a good choice for general-purpose data modeling for so long. Other techniques have advantages in certain areas, but they always have a balance of disadvantages in other areas.

There is no magic bullet that can make the data modeling process easier. Not object databases, not hierarchical databases, not semantic databases, not network databases. Different data modeling paradigms can describe the data relationships differently, but they can’t make the problem simpler.

To respond to the smart folks who say that the relational model is inadequate, I concede that the relational model is the worst form of DBMS except for all those other forms that have been tried from time to time (with apologies to Winston Churchill).