Author: Bill Karwin

  • SQL Antipatterns slides

    Monday I gave a presentation at the MySQL User Conference in Santa Clara. I uploaded my presentation materials as a PDF to my website. It’s available under the Creative Commons 2.0 license for non-commercial, no-derivatives use.

  • Speaking at the MySQL Conference

    It’s time for the 2008 MySQL Conference. This year should be especially interesting.

    Sun announced they would acquire MySQL on Jan. 16 – fewer than eight weeks ago (I bet it doesn’t seem that way to the folks involved). The deal closed only on Feb. 26, and I’m sure there are still months of work to go to fully integrate the two companies. Exciting times!

    As always, database server technology and connectivity are the focus of the MySQL Conference. The sessions are organized into a dizzying 17 tracks, only two of which aren’t specific to some technology aspect of using MySQL. I’m looking forward to seeing many of the sessions.

    One in particular that caught my eye is Beat Vontobel’s talk, “The Lost Art of the Self Join,” in which he says he will solve a Sudoku puzzle in SQL. I gave a talk at OSCON 2006, “SQL Outer Joins for Fun and Profit,” in which I also used SQL to solve Sudoku puzzles. It’ll be interesting to see Beat’s solution.

    This year I’m giving a tutorial, “SQL Antipatterns.” This is my way of describing many best practices of database and query design, by counter-example. After answering SQL questions on newsgroups for many years, I have seen a lot of counter-examples.

    I’m busy over the next couple of weeks preparing my presentation slides. Here’s my complete outline of topics for the MySQL Conference Tutorial:
    Logical Database Antipatterns
    • Jaywalking – using a comma-separated list of values in a string.
    • Entity-Attribute-Value.
    • Multi-column Attributes.
    • Metadata Tribbles.
    Physical Database Antipatterns
    • ID Required.
    • Phantom Files.
    • ENUM Antipattern.
    • Readable Passwords.
    Query Antipatterns
    • Ambiguous GROUP BY (also discussed in recent blog posts by Carsten and Roland Bouman).
    • Using HAVING instead of WHERE.
    • Poor Man’s Search Engine (LIKE and REGEXP).
    • Implicit Columns in SELECT and INSERT.
    Application Development Antipatterns
    • SQL Injection.
    • Parameter Façade.
    • Pseudokey Neat Freak.
    • Session Coupling.
    • Phantom Side Effects.
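    To give a flavor of the first antipattern on the list, here is a minimal sketch of Jaywalking and its usual fix, using SQLite; the table and column names are hypothetical, not taken from the tutorial:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Jaywalking: a comma-separated list of account ids stuffed into one string.
cur.execute("CREATE TABLE products (product_id INTEGER, account_ids TEXT)")
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(1, "12,34"), (2, "34")])

# Finding products for account 12 requires a fragile string match:
cur.execute("""SELECT product_id FROM products
               WHERE ',' || account_ids || ',' LIKE '%,12,%'""")
jay_rows = cur.fetchall()
print(jay_rows)  # [(1,)]

# The fix: an intersection table with one row per product/account pair.
cur.execute("CREATE TABLE contacts (product_id INTEGER, account_id INTEGER)")
cur.executemany("INSERT INTO contacts VALUES (?, ?)",
                [(1, 12), (1, 34), (2, 34)])
cur.execute("SELECT product_id FROM contacts WHERE account_id = 12")
join_rows = cur.fetchall()
print(join_rows)  # [(1,)]
```

    The intersection table can also be indexed and enforced with foreign keys, neither of which is possible with the comma-separated string.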
    It’s not too late to register for the MySQL Conference!
  • TDD lesson from Sudoku
    I am a Sudoku addict. I like to analyze the logic strategies for solving these puzzles. I even gave a presentation at OSCON 2006 about using SQL to solve Sudoku puzzles.

    The image I’m including is a screenshot from jigsawdoku.com, copyright 2008 by Rachel Lee and Gideon Greenspan. It’s currently my favorite Sudoku web interface.

    It’s most satisfying to solve the puzzle with no “crutch.” That is, every number is placed in its square without having to guess, and you never have to take a number out of a square after finding that it’s incorrect. Computer-based Sudoku interfaces that give you “hints” are also unsatisfying. It’s like doing a crossword puzzle in pencil!

    I was getting pretty good at solving the Hard puzzles, but I had hit a wall solving the puzzles in about 5 or 6 minutes and I couldn’t improve my time any more. One day I was in a hurry, and I wanted to finish the puzzle and go do something else. I started guessing as I placed the numbers. I used the “hint” button after each guess, to tell me if I had gotten it right.

    I only guessed when I had narrowed down the choice to two possible squares. In those cases I had a 50% chance of being right, in which case the hint told me I had made no mistakes. If the hint told me that I had made a mistake, I knew it must be caused by the number I had just placed.

    What I found was that I immediately cut my time in half. I could solve Hard puzzles regularly in under 3 minutes, sometimes under 2.5 minutes. This was surprising and a bit discouraging. It meant that solving the puzzle with rigorous logic and without guesswork cost twice as much time as solving it in a sloppy fashion. Where’s the satisfaction in being sloppy?

    But I’ve been thinking about this. It’s an analogy to running tests frequently during incremental software development. Let me explain how.

    Computers were once massive machines operated behind locked doors using punch cards. As you designed your program, you had to imagine it running in your head, and anticipate the bugs and design flaws as a “thought experiment.” When you thought the program was ready, you’d put a rubber band around your stack of punch cards and put it in the queue to be run by the operator. The next morning you’d get your results and see if your program had run correctly.

    Today, in most cases, the computers can run your program thousands of times per day if you need them to. It’s very inexpensive to run a partially-finished program, so now you can use the computer instead of your imagination to find out if the code works correctly. You can even use testing tools that make it easy to run tests repeatedly and identically with the touch of a button.

    The efficiency of running repeatable tests enables Test-Driven Development, or at least a hybrid approach in which you write code incrementally and employ tests frequently to validate your work.
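    A minimal sketch of such a repeatable test, in Python with a hypothetical function under development:

```python
def apply_discount(price, percent):
    """Hypothetical function being developed incrementally."""
    return round(price * (100 - percent) / 100, 2)

def test_apply_discount():
    # An automated, repeatable test: identical every time it runs,
    # and cheap enough to re-run after every small code change.
    assert apply_discount(50.00, 10) == 45.00
    assert apply_discount(50.00, 0) == 50.00

test_apply_discount()
print("all tests passed")
```

    The point is not the function itself but the workflow: change one small thing, press the button, and know immediately whether you broke anything.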

    Here’s where it comes back to Sudoku. As I was placing numbers in the Sudoku grid and using the “hint” button to tell me if I had made the correct choice or not, I was practicing Continuous Integration. That is, I made the smallest change I could to the system (placing a single number in the grid) and then I re-validated the result with a test that was automated and repeatable.

    I observed that I could solve the puzzle much more quickly and accurately using this approach. This illustrates the benefit of using software testing during development. You’ll get a more robust product in the end, and it will take less time than if you had to write all the code up front.

    Some people think that since writing software tests takes time of its own, it will make the project schedule longer. But I would point out that during the Sudoku game, I had to move my mouse to the “hint” button every time I placed a number on the grid. And yet I still solved the puzzle in half the time it took me the traditional way. The “overhead” of doing the testing, which one might assume is wasteful, in fact resulted in a net gain in productivity.

    Now I don’t feel like I’m cheating by using the “hint” button in Sudoku. I’m just working in a more efficient manner.

  • PDO v2 CLA issues

    Wez Furlong posted a request for discussion regarding the future of PDO, proposing to adopt a CLA (Contributor License Agreement) to manage contributions to the project. Some questions have come up indicating misunderstandings about how CLAs work.

    As you may know, I worked for about a year as the project manager of the Zend Framework. I was involved in administering the CLA process for that project and managing community contributions. I’ll describe my experiences on that project and give my understanding of the CLA issues. I must say, however, that I’m not a lawyer, nor have I talked to a lawyer about these issues.

    I should also say that I have been participating in the PDO v2 discussions. I was an employee of Zend when they started organizing meetings to discuss PDO v2. I have since left Zend, but I’m continuing to participate, now as a community developer. I’d really like to see the DBMS vendors get involved, because it would add a lot of much-needed developer resources to the PDO project.

    Discussions and Patches

    Questions have come up about how to apply the terms of a CLA to code patches submitted via an issue tracker, or a mailing list, or in chatrooms. Do these require that the author sign the CLA? In the Zend Framework project, the answer is yes. Otherwise that code cannot be incorporated into the project.

    If the author of a patch was not a CLA signer, we could do one of two things:

    1. Ask the author to sign the CLA and then grant that specific patch to the project retroactively. We did this on several occasions, and in general it was not a problem.

    2. Someone else writes a different, original solution for the problem for which the patch was made, and submits this new contribution under the terms of the CLA.

    The latter was often necessary anyway, because the patch author might not understand some of the code architecture, or their solution might cover only some cases. For example, when I maintained Zend_Db, sometimes a patch was submitted to fix an issue in one DBMS-specific adapter class, but the issue really affected all DBMS brands. So I re-coded a new solution in the abstract superclass. Thus I did not use the contributor’s patch verbatim, but wrote an alternative solution to address the same goal.
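    The shape of that kind of fix can be sketched in a few lines; the class and method names here are hypothetical illustrations, not actual Zend_Db code:

```python
class AbstractAdapter:
    """Base class shared by all DBMS-specific adapters (hypothetical sketch)."""
    QUOTE_CHAR = '"'

    def quote_identifier(self, name):
        # A fix placed here benefits every adapter at once, instead of
        # patching one subclass for a problem all DBMS brands share.
        q = self.QUOTE_CHAR
        return q + name.replace(q, q * 2) + q

class MysqlAdapter(AbstractAdapter):
    QUOTE_CHAR = "`"

class PostgresAdapter(AbstractAdapter):
    pass  # inherits the standard double-quote behavior

print(MysqlAdapter().quote_identifier("my`table"))   # `my``table`
print(PostgresAdapter().quote_identifier("col"))     # "col"
```

    Lifting the logic into the superclass keeps the DBMS-specific subclasses small and ensures they can't drift apart.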

    Discussions, bug reports, and feature requests are not considered to be IP and thus are not subject to the terms of the CLA. This applies to talk on the issue tracker, the mailing lists, chatrooms, or IM. It also applies to face-to-face meetings, users groups, conferences, etc. Anyone can offer comments, criticisms, wishlists, etc. to the project without conflicting with the terms of the CLA.

    When describing a feature request, often the desired usage is illustrated with code. This code shows usage, not implementation, so it’s not likely to become part of the project. The CLA does not apply to this type of contribution. However, if that code showing usage becomes part of the project in the form of documentation or demo scripts, then in that case yes, the CLA would apply.

    Basically, a reasonable guideline is that if some contribution is checked into the project’s source control (CVS), then it’s subject to the terms of the CLA. This applies to code, test code, test data, docs, build scripts, README files, etc. — anything that can be copyrighted and that gets included with the project.

    In the Zend Framework project, we also require written proposals and specifications to be submitted under the terms of the CLA. These documents were not checked into source control (though one could argue that they should be), but they still required the assurance that the contributor was not violating someone else’s IP rights to that material. This seemed like a good policy for proposals and specs, since these documents often contain prototype code.

    Documentation and Tests are Subject to the CLA

    Anything that can be copyrighted is considered intellectual property (IP). If that IP becomes part of the project it must be treated similarly as code contributions. That is, the contributor must assure that this IP is something he has the right to contribute, and he’s not illicitly copying someone else’s work. The CLA often uses the term “contribution” instead of simply “code” because the terms of the CLA apply to more types of contributions than solely code.

    For example, in the Zend Framework project, everyone who is granted commit privileges to the subversion repository or to post proposals on the wiki must first sign a CLA. Even volunteers who translate the English documentation into other languages must sign a CLA.

    Later in this blog I may say “code” for simplicity, but what I say applies to all IP contributions in the project.

    The CLA Does Not Prohibit Code Reuse

    The CLA does not require that every contribution be original work. It does require that the contributor agree to contribute only IP that they have rights to. If the contributor is also the copyright holder of that work, this is relatively straightforward, but they could also be contributing non-original work if they have the permission of its owner.

    The point has been made that a CLA-governed project cannot build on other OSS code. Yes and no. In practice it’s rare to incorporate code of any significant size from a non-CLA-governed source, because most OSS projects cannot assure with certainty that all their contributions have been made in a manner compatible with a CLA. But this isn’t the fault of a CLA, it’s just a result of the organic way most OSS projects grow.

    CLA Does Not Exclude Community Involvement

    One assertion is that one needs to sign the CLA to view the PDO v2 specification. This is incorrect. The current PDO v1 spec is online now, and my understanding is that the PDO v2 spec will be open too. Similarly, speculation that one needs to sign a CLA to view the source code is false. Perhaps people are confusing CLA with the concept of an NDA.

    Some people believe that the use of a CLA blocks the community from being involved with the project, or that the work occurs in mysterious smoky rooms behind closed doors. This is also not true. Non-CLA signers can give feedback and discussion — but the actual code and other IP must be written by people who have signed a CLA.

    Here’s a hypothetical scenario: any community member can read the specification or the code and say, “I don’t like the way it handles feature X. It fails to account for case Y.” The contributor of the spec says, “Okay, I’ve edited the spec with case Y in mind. Does that satisfy your issue?” The community member responds, “Yes, that’s good.”

    See? The community stays involved, and their feedback is heeded. The Zend Framework is a good example. Hundreds of developers who never signed the CLA have filed bugs or feature requests in the issue tracker, or asked questions on the mailing lists.

    CLA Does Not Restrict OSS Freedom

    One assertion that has been made is that requiring a CLA opposes the spirit of free software, since it places conditions on contributions. I would point out that not everyone has commit privileges to the PHP project. Contributions are carefully vetted, discussed, and reviewed. Many are flat-out rejected, sometimes for subjective or inconsistent reasons. No one is complaining about this process, and I’m not either. But it should be noted that PHP is not a free-for-all. There are good reasons to filter contributions for the sake of quality.

    The tradition of PHP includes an unwritten assumption that contributors grant their work freely, and do not expect compensation or to dictate special terms of use for their contribution. There’s also an assumption that contributors are not plagiarizing code or other IP. In fact, there have been some recent cases where code had to be removed or reimplemented to avoid IP conflicts. So PHP does have a commitment to respect other people’s copyrights and licenses, and to preserve clean IP in the project.

    Given that, is it such a bad thing for contributors to make their agreement with those traditions explicit? To promise that you contribute only IP that you have a right to contribute, and that you do so freely and do not expect compensation, seems very consistent with the spirit of OSS.

    Another tenet of OSS is that anyone can create derived works. It does not mean that a given project must accept contributions. The proposed PDO License permits creating derived works, and it even explicitly states that the derived work may use a different license, which in my mind supports the spirit of free software (I’m not a GPL zealot).

    CLA Does Not Make Contributors Legally Responsible

    Another assertion that has been made about the CLA is that by signing something, the contributor becomes “legally responsible” for their code. Of course you are responsible for not plagiarizing your code, but that’s true regardless of whether the project uses a CLA or not.

    The other interpretation of “legally responsible” is in regards to liability for damages if the code is defective. In fact, the CLA has a clause by which the contributor disclaims responsibility for the code. This “AS-IS” clause (the part in all-caps) is common in software licenses.

    But without a CLA, there is no such agreement between the original contributor and the project. The project itself could sue the contributor for a bug that resulted in damages (actually my understanding is that since PHP and PDO incorporate no legal entity, the project cannot initiate a lawsuit). Having a CLA between the contributor and the project makes it clear that the contributor offers no warranty for his code. Thus the contributor has more protection by using a CLA than by not using a CLA.

    CLA Does Not Protect Contributors From Being Sued

    Wouldn’t that be clever, to sign a form like that! “I hereby certify that no one can sue me.” It’s nonsense to expect a piece of paper to give this guarantee.

    Yet this is the argument some people use against CLAs: that signing one doesn’t protect them from being sued if they contribute code that conflicts with someone else’s patent or copyright. People who make this objection have it backwards. A CLA doesn’t protect you as a contributor from being sued; it is your agreement that you won’t sue other people who use your code.

    This is beneficial to you as a contributor because it goes both ways. There are other contributors writing code for the project. If these other contributors have also signed the CLA, they have agreed to grant their work to the project (subject to the terms of the license). They’ve stated that they won’t come back later and demand other terms for using their contribution. Any protection you get is not because you have signed the CLA, it’s because all the other contributors have signed the CLA.

    The purpose of the CLA is not to protect you if you write code that conflicts with someone else’s IP. That’s your responsibility. Keep in mind that this is no different if you contribute to a project that has no CLA process.

    CLA is Not the Commercial Vendors’ Plot to Control PHP

    This is incredibly cynical, and it doesn’t even make any business sense. The commercial DBMS vendors have demonstrated their commitment to OSS by contributing to many projects. Their interest is in making it attractive for developers using PHP to adopt their latest DBMS technology, by ensuring that their cutting-edge features are supported by PDO.

    But to do this, they need some assurance that by participating in the PDO project, they won’t become exposed to other contributions that contain “tainted” IP. It’s the job of the legal services professionals in each of these companies to protect them from such risk. I’m sure it’s fresh in their minds in the wake of the SCO-Linux controversies.

  • MacBook Air – Almost as Thin as a Sinclair ZX80

    I just watched the photos and text from today’s keynote at MacWorld. Steve Jobs unveiled the MacBook Air, described as the world’s thinnest notebook computer. I suddenly recalled being at a computer graphics user group at UCSC, circa 1982.

    One of the hobbyists in that group had a Sinclair ZX80, a kit computer from the UK that was notable as the first personal computer available for under £100. What makes me think of that early PC in relation to Apple’s new notebook is that the ZX80 was even smaller and thinner than a MacBook Air. We usually think of early PCs as chunky monstrosities like the Osborne 1 or the Commodore PET, which nonetheless cost over $1500.

    It’s interesting to recall that more than 25 years ago, the $100 portable computer was a reality, though with far less computing power than a modern alarm clock.

  • “Pure” mysqlnd interface feedback

    After I posted my idea about a pure PHP mysqlnd driver, I received an email from Ulf Wendel of MySQL AB, who works on the mysqlnd driver. He told me I could post his comments, since he doesn’t have a Blogger account. I’ll include his comments below in blockquotes, and then comment after them.

    Ulf Wendel writes:

    What is “native”?

    PHP and the Borg[1] seem to be relatives. Whenever a real PHP hacker finds a useful C library in the universe, he speaks the famous words “resistance is futile.” A little later, the C library has become a PHP extension. This is one of the secrets of the early success of PHP: PHP is extensible. Extensions allow all the PHP drones to use the collective power of C libraries. As of today the PHP Function Reference shows 189 extensions [2]. Guess how many of them are written in C and how many of them are based on a C library…

    To make it short: with PHP you can choose between C and PHP. The same goes for Lua: a scripting language (Lua) and C (the language Lua is implemented in) can be mixed whenever appropriate. That’s the nature and the secret of both PHP and Lua. Whenever your coding bees hit a limitation of the simple-to-use scripting language, you use the C level. Implementing the MySQL Client/Server Protocol[3] is such an example. It would be slow.

    Compare that to Java. Is it common to extend the Java programming language? No, not really; very few people start hacking a Java virtual machine. And once you have hacked one virtual machine, what about portability and all the other virtual machines out there? Alternatives? Well, if you enjoy cross-compiling, maybe… That’s why you would never want to write a JDBC driver in any language but Java. And Java is more of a compiled language than PHP is, therefore it is fast enough.

    Native for PHP can mean both: C and PHP. If it’s C – like with mysqlnd – you have to ask what external dependencies exist. The MySQL native driver for PHP (mysqlnd) gets all the infrastructure it needs from PHP itself. The driver is part of PHP, therefore mysqlnd runs on all platforms that run PHP.

    Why use C?

    You list some disadvantages of choosing C:

    a) Platform-dependence not given

    Mysqlnd is part of PHP and runs on all platforms that run PHP. In particular, it uses:

    – PHP memory management (limits really work now!)
    – PHP data structures (fewer copy operations and memory savings)
    – PHP Streams

    b) Communication protocol inspection not possible

    The MySQL native driver for PHP uses PHP Streams. PHP Streams feature hooks, aka Stream Filters [4]. With a little hacking you could expose the internal stream to userland (PHP scripts). However, we favour MySQL Proxy[5], and therefore we have not implemented it.

    To sum up: technically it’s possible, but we have not enabled it. Tell us why you need it inside PHP and you have a fair chance of seeing it implemented.

    Last but not least: have you ever worked on a raw binary network stream in PHP? Do you really want to know the details of the protocol, which Andrey started to love while he was implementing it? Or do you want to use something that is already there and might be the future standard: MySQL Proxy[5]?

    c) Deployment problems with no access to php.ini

    First, with mysqlnd the foundations have been laid to enable MySQL support by default: no license issues, no version issues, no external library dependencies. It’s up to php.net and its community to decide if mysqlnd should be added to the default PHP configuration.

    Second, if you choose a hosting service that does not configure/compile PHP as you need it, you are doing something wrong – honestly.

    Anyway: all the above is minor stuff. The main reason is maximum integration into PHP for the best performance and easy deployment with no license issues.

    [1] http://en.wikipedia.org/wiki/Borg_%28Star_Trek%29
    [2] http://www.php.net/manual/en/funcref.php
    [3] http://forge.mysql.com/wiki/MySQL_Internals_ClientServer_Protocol
    [4] http://www.php.net/manual/en/ref.stream.php
    [5] http://forge.mysql.com/wiki/MySQL_Proxy

    Now, my comments on Ulf’s comments:

    First, thank you very much Ulf for your reply to my blog posting. I appreciate getting information “from the source” and it’s important to get more information about libmysqlnd out to the community.

    I want to reiterate that I think libmysqlnd is the right solution for the PHP community, given the requirements of providing a high-performance, high-quality extension with a PHP-compatible license. I look forward to libmysqlnd being part of the standard PHP distribution, if the PHP community approves it.

    I’m not lobbying to change libmysqlnd! I’m just supposing that a MySQL connector written in PHP code might also be interesting, even if it were not the preferred connector for MySQL server. It would be useful in a few circumstances, and also could be a debugging tool.

    The performance advantage of C over PHP is important. It stands to reason that an implementation of the MySQL protocol in a scripting language would be quite a bit slower.

    However, it would be interesting to try it and measure the actual difference in performance, if for no other reason than to understand exactly how much performance advantage is gained by using C. I understand from a user’s comment that libraries implementing the MySQL protocol exist in other scripting languages: Net::MySQL for Perl, Ruby/MySQL, and an unreleased Python library.
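    As a rough sketch of the kind of measurement I mean (a toy task, not the MySQL protocol itself), here is a micro-benchmark in Python comparing an interpreted loop against the same work delegated to a C-implemented routine:

```python
import timeit

data = "O'Reilly " * 100  # sample input with quotes to escape

def escape_pure(s):
    # Character-by-character loop running entirely in interpreted code.
    out = []
    for ch in s:
        out.append("''" if ch == "'" else ch)
    return "".join(out)

def escape_c(s):
    # The same job delegated to the C-implemented str.replace.
    return s.replace("'", "''")

assert escape_pure(data) == escape_c(data)  # identical results
for fn in (escape_pure, escape_c):
    print(fn.__name__, timeit.timeit(lambda: fn(data), number=2_000))
```

    The C-backed version typically wins by a large factor, which is the same effect one would expect when comparing a protocol driver in script code against mysqlnd.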

  • Idea for “pure” mysqlnd interface

    When I first heard about the mysqlnd project at the MySQL Camp in 2006, somehow I thought it was going to be 100% PHP code. I can’t remember now if I was making an incorrect assumption or if they planned to write it in PHP but changed their mind.

    What we have today is a DLL that is a drop-in replacement for libmysql. This has a few nice advantages. It means the replacement library will be API-compatible with all three PHP extensions for MySQL (mysql, mysqli, and pdo_mysql), so no one needs to change their PHP application code. libmysqlnd is licensed under the PHP license, so it can be distributed with the PHP core without any license conflict. All in all, a good solution that will help both PHP and MySQL.

    But what about my (mistaken) impression of a MySQL client written in PHP, instead of implemented as a DLL? Essentially this would implement in PHP code the MySQL client/server protocol.

    It’s certainly possible that implementing this protocol in a language like PHP would be hopelessly inefficient and not suitable for any real world development. But it would have a few nice advantages:

    • Providing a platform-independent MySQL client.
    • Deploying easily in hosting environments with no access to configure php.ini.
    • Allowing PHP debugging tools to inspect at the protocol level.
    • Serving as a model for similar drivers for Perl, Ruby, Python, etc.
    • Licensing with the same license as the respective language, e.g. PHP License, Artistic License, or GPL.

    Writing a driver in the host language has precedent. A JDBC type 4 driver is supposed to implement the vendor’s client/server protocol in 100% Java code, so as to appear to the RDBMS server as indistinguishable from the vendor’s own client library.
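    To give a taste of what the lowest layer of such a driver involves, here is a sketch in Python of the MySQL packet framing, in which each packet begins with a 3-byte little-endian payload length and a 1-byte sequence number; the function name is hypothetical, and a real driver would of course also handle the handshake, authentication, and result sets:

```python
import io

def read_packet(stream):
    """Read one MySQL wire-protocol packet from a byte stream:
    a 3-byte little-endian payload length, a 1-byte sequence
    number, then the payload itself."""
    header = stream.read(4)
    length = header[0] | (header[1] << 8) | (header[2] << 16)
    seq = header[3]
    return seq, stream.read(length)

# Simulate a 5-byte payload with sequence number 0:
buf = io.BytesIO(b"\x05\x00\x00\x00hello")
print(read_packet(buf))  # (0, b'hello')
```

    Everything else in the protocol is layered on top of this framing, which is why implementing it in a scripting language is at least conceivable.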

    So for now, this is just an idle idea. Please don’t anyone ask me where to download this imaginary software, it doesn’t exist as far as I know!

  • How to Save $100 Million

    Last night I listened to an interesting interview on public radio, relating a story from the New Yorker magazine about Dr. Peter Pronovost, whose work in Michigan hospitals saved millions of dollars and hundreds of patients’ lives.

    How did he do it? He taught doctors and nurses to use checklists to avoid mistakes in the intensive care units of hospitals. Mistakes that could put patients’ health or lives at risk.

    What’s interesting about this story is that it’s an extremely low-technology solution to a type of problem that exists in virtually every field. In this case, it applies to medical care. But it easily applies to manufacturing. In Japan, they call it poka-yoke, or mistake-proofing. Check out this book too: “Mistake Proofing: Designing Errors Out” by Richard B. Chase and Douglas M. Stewart.

    Do checklists and similar techniques mean hamstringing the creative process in these fields? Absolutely not! On the contrary, effective use of checklists can free our attention from repetitive details, so that we can devote more of our energy to innovation and creativity. We don’t need to keep the details of well-understood procedures in our short-term memory, if we write down the steps so that we can do them without burden, or delegate the work to a teammate.

    Why not go the extra step and create technology to automate those procedures? Because a checklist doesn’t necessarily remove the requirement for human attention, to exercise good judgment and critical thinking. Some steps may require analysis, or may be performed conditionally based on the result of a previous step. It’s usually very expensive to make a machine that does that kind of analysis.

    Procedures are inexpensive to modify and prototype when humans perform them. It might turn out that the whole procedure is revealed to be incorrect, and needs to be re-thought. If automation technology had been developed for that procedure, the cost of developing that technology would be wasted. If the procedure were merely a checklist, then we just need to re-train the operation staff and voilà!

    Notice that books like David Allen’s “Getting Things Done” focus on non-technological methods for organizing and avoiding letting things fall through the cracks.

    If checklists and other easy organizational techniques are such a good idea, why don’t we employ them more? In the article about Dr. Peter Pronovost, he remarked that it’s surprising that it has taken so long to adopt his methods, and that if there were a drug that achieved the same positive results, it would be mandatory in every hospital. A clue to the explanation is in “if there were a drug.” Follow the money! The solution that is marketed most aggressively is not the one that is most cost-effective; it’s often the one that is least cost-effective, because its vendor stands to make the most money from it.

  • In Support of the Relational Model

    Every few years, a wave of people claim that the relational model falls short of modeling the data structures of modern applications. This recently includes some really smart people, such as:

    • Jim Starkey
    • Brian Aker
    • MIT researchers

    This trend of criticizing the relational model started almost immediately after E. F. Codd published his seminal paper, “A Relational Model of Data for Large Shared Data Banks” in 1970.

    Some data modeling tasks are just complex by nature, and it is necessarily a difficult, time-consuming analysis to figure out how to structure the storage for a given application in a manner that is both logically complete and also efficient.

    The complexity does not originate from the technology; the complexity originates from nontrivial real-world scenarios, which are full of complexity, inconsistency, and special-case exceptions. It isn’t that the database modeling technology is insufficient, it’s that the job is genuinely hard.

    This is an example of what I call a Waterbed Problem. When you’re lying on a waterbed and you push down with your hand to make part of the bed lower, the water inside is displaced and rises in some other area. Even if you get a lot of friends to help you with the waterbed (okay, this is starting to get racy, but don’t worry), the water is still incompressible, so no matter how hard you push, you only displace the water; you can’t make it shrink.

    The waterbed analogy applies when we try to simplify a computer organization problem. We “push down” on the complexity of the task by inventing new methods for describing the problem. Sometimes this leads to nifty things like domain-specific languages. But generally there is a trade-off. We make one part of the problem easier to solve, at the cost of making another part of the problem less flexible, or harder to solve. The analogy is that the complexity rises in some other area, in response to our attempt to push the complexity down in the area we’re working on currently.

    We might even create a simple solution for the problem at hand, but the new solution is simply unable to address some related problems. This might be analogous to a leak in the waterbed, but probably I’ve exhausted this analogy by now.

    Flexibility is not the only criterion for a data modeling solution. It must also support rigidity where needed. If you can’t define the structure of your data model tightly enough to disallow invalid data, then you’re asking for trouble. If the data store is just a pool of data elements and relationships, and any enforcement of structure depends on application logic, then your data integrity is at the mercy of bugs in that application, and susceptible to invalid changes made via a different client application.

    I believe the relational model strikes a good balance between flexibility and rigidity, and that’s why it has been a good choice for general-purpose data modeling for so long. Other techniques have advantages in certain areas, but they always have a balance of disadvantages in other areas.

    There is no magic bullet that can make the data modeling process easier. Not object databases, not hierarchical databases, not semantic databases, not network databases. Different data modeling paradigms can describe the data relationships differently, but they can’t make the problem simpler.

    To respond to the smart folks who say that the relational model is inadequate, I concede that the relational model is the worst form of DBMS except for all those other forms that have been tried from time to time (with apologies to Winston Churchill).

  • Less code vs. good code

    Alex Netkachov, Vidyut Luther, and Richard Heyes discuss the pros and cons of writing code that is short. Here are some thoughts from me.

    In general, I don’t think turning three lines of code into one line makes an application better, more readable, prettier, or whatever your goal is. The logical extension of this is to switch to writing Perl code:

    #!/usr/bin/perl
    # Valentine.pl, copyright 2001-2007 Bill Karwin
    map( ( $r=$_, map( ( $y=$r-$_/3, $l[24-$r].=(' ','@')[$y**2-20*$y+($_**2)/3<0] ), (0..30) ), ), (0..24) );
    print join("\n", map(reverse($_).$_, @l)), "\n";

    The problem with such super-compressed code is that it becomes harder to write, harder to debug, and harder to maintain. Imagine being a newly hired developer assigned to a project in which all the code is written like the above.

    Of course, we’re not talking about code as obfuscated as the above. But taking a step in the direction of compressed code for the sake of compressed code, instead of for some functional improvement, is a step toward making the code less maintainable.

    For example, think about reading diffs in your revision control system when a bug fix touches such a compressed line of code. It’s much harder to see the nature of the fix than it would be in a more expanded form of code. For this reason, I prefer to write expanded code, even when writing Perl!

    The ternary operator is another good example. Vidyut argues that the ternary operator should not exist. It’s true that it can be abused, like any programming construct, but it is definitely useful to have in the language. It is a simple if/else construct that yields a value, which the conventional if/else statement doesn’t. Alex gives an example that is a perfect use of the ternary operator:

    $message = ($age < 16) ? 'Welcome!' : 'You are too old!';

    But notice I said simple. The ternary operator is inappropriate to use for any if/else blocks that contain anything more complex than a single expression. What if the application requires that we log the event when someone who is too old attempts to access the system?

    One can try to do acrobatics in code to accomplish two steps in one expression, but this is not a good practice: the two steps become coupled, so a bug in either one, or a change in requirements, forces you to untangle and restructure both.

    // Wrong
    $message = ($age < 16) ? 'Welcome!' : log_and_return_message('You are too old!');

    Once either the positive or negative branch becomes anything more than a single expression, you need to switch to an if/else construct. So if you are tempted to use a ternary operator because it’s shorter or prettier, consider whether there is any likely scenario in which you would have to restructure it anyway. If so, that’s probably a good reason to use if/else instead of a ternary expression.
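
    For instance, the logging requirement above restructures naturally into if/else. This sketch assumes a hypothetical log_event() function standing in for whatever logging mechanism the application uses:

    // Restructured: each branch can grow independently without
    // contorting a single expression.
    if ($age < 16) {
        $message = 'Welcome!';
    } else {
        $message = 'You are too old!';
        log_event('Overage access attempt');  // hypothetical logging helper
    }

    Now a change to the logging behavior touches only the else branch, and a diff of that change shows exactly what was fixed.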