Blog

Best. Perl Script. Ever.
A user recently asked:

What has been your best programming experience so far?

The most successful program I’ve ever written was this Perl script:
```
map(($r=$_,map(($y=$r-$_/3,$l[24-$r]
.=(' ','@')[$y**2-20*$y+($_**2)/3<0]),(0..30)),),(0..24));
print join("\n", map(reverse($_).$_, @l)), "\n";
```
I wrote this for a woman I was dating in 2001. Writing a Perl script for my girlfriend is not as geeky as it sounds, at least in this case. She’s also a software developer, and she was taking a Perl class at the time.

I consider this script a great success because she married me in 2007!

I’ll leave it as an exercise for the reader to run the script in a console window and see its output (I promise it’s not a Trojan Horse or any other kind of evil trick).

I’m posting to my blog the questions I’ve answered on StackOverflow, which earned the “Good Answer” badge. This was based on my answer to “What is your best programming experience?“
January 14, 2009
The Next-Gen Databases
A user recently asked:

I’m learning traditional Relational Databases (with PostgreSQL) and doing some research I’ve come across some new types of databases. CouchDB, Drizzle, and Scalaris to name a few, what is going to be the next database technologies to deal with?

SQL is a language for querying and manipulating relational databases. SQL is dictated by an international standard. While the standard is revised, it seems to always work within the relational database paradigm.

Here are a few new data storage technologies that are getting attention currently:
- CouchDB is a non-relational database. They call it a document-oriented database.
- Amazon SimpleDB is also a non-relational database accessed in a distributed manner through a web service. Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services.
- Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
- BigTable is a proprietary data storage solution used by Google, and implemented using their Google File System technology. Google’s MapReduce framework uses BigTable.
- Hadoop is an open-source technology inspired by Google’s MapReduce, and serving a similar need, to distribute the work of very large scale data stores.
- Scalaris is a distributed transactional key/value store. Also not relational, and does not use SQL. It’s a research project from the Zuse Institute in Berlin, Germany.
- RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language SPARQL, which resembles SQL superficially, but is actually totally different.
- Vertica is a highly scalable column-oriented analytic database designed for distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon’s Elastic Compute Cloud.
- Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
- XML isn’t a DBMS at all, it’s an interchange format. But some DBMS products work with data in XML format.
- ODBMS, or Object Databases, are for managing complex data. There don’t seem to be any dominant ODBMS products in the mainstream, perhaps because of lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
- Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable “cloud computing” system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.
Relational databases have weaknesses, to be sure. People have been arguing that they don’t handle all data modeling requirements since the day it was first introduced.

Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don’t fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers, instead of central database servers.

Even though these advanced technologies do great things to solve the specialized problem they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn’t going away.

I’m posting to my blog the questions I’ve answered on StackOverflow, which earned the “Good Answer” badge. This was my answer to “The Next-Gen Databases.”
January 13, 2009
Verifying a Company Uses Best Practices
A user recently asked how to use the Joel Test in an interview, to confirm that a software company practices what they preach with regard to professional software development habits:

I’ve got an interview with a company that claims to score a 12 on the Joel Test. […] What are some ways of determining if they really implement all 12 points? Are there any particular questions I can ask?

It’s reasonable to say, “show me.” Ask them for examples and concrete details of their support for the Joel Test subjects. Since they claim they score all 12 points, they are obviously proud of it. People tend to like to show off, so they’ll probably be eager to share more details.
If you ask more specific questions, it’ll become apparent from their descriptions whether they really have those good practices.

We can think of many specific follow-up questions to the basic questions. The Joel Test questions are in bold below, and my follow-ups, er, follow:
1. Do you use source control? What source control system do you use? Why did you pick that one? What is your branch/release policy? What are your tag naming conventions? Do you organize your tree by code vs. tests at the top with all modules under each directory, or do you organize by module at the top with code and tests under each module directory?
2. Can you make a build in one step? What tools do you use to make builds? How long does it take to go from a clean checkout to an installation image? What would it take to modify the build? Is it integrated into your testing harness? What would it take to duplicate a build environment? Are the build scripts and tools also under source control?
3. Do you make daily builds? What software testing tools do you use for daily builds? Do you use a Continuous Integration tool? If so, which one? How do you identify who “broke the build?” What is your test coverage?
4. Do you have a bug database? What bug tracker software do you use? Why did you pick that one? What customizations did you apply to it? Can you show me trends of rate of bugs logged or bugs fixed per month? How does a change in source control get associated with the relevant bug?
5. Do you fix bugs before writing new code? What is your bug triage process? Who is involved in prioritizing bugs? How many bugs did you fix in the last release of your product? Do you do bug hunts with bounties for finding critical bugs?
6. Do you have an up-to-date schedule? Can I see it? How far are you ahead of/behind schedule right now? How do you do estimating? How accurate a method has that turned out to be?
7. Do you have a spec? Can I read one? Do you have a spec template? Can I see that? Who writes the specs? Who reviews and approves the specs?
8. Do programmers have quiet working conditions? Can I see the cubicle or work area for the position I’m interviewing for? (or an equivalent work area)
9. Do you use the best tools money can buy? What tools do you use? Are you up to date on versions? What tools do you want you don’t have yet? Why not?
10. Do you have testers? How many? Can I meet one? Do testers do black-box or white-box testing?
11. Do new candidates write code during their interview? What code would you like me to write? What are you looking for by seeing my code?
12. Do you do hallway usability testing? How frequently? Can I see a report documenting one of your usability testing sessions? Can you give me an example of something you changed in the product as a result of usability testing?
Beware if their answers to the specific follow-up questions are evasive like, “um yeah, we are committed to doing more best practices and we’ll be looking to you to help us effect changes toward that goal.” If they’re so committed to it, why don’t they have anything to show for it yet? Probably because like many companies, when the schedule is in jeopardy, following “best practices” goes out the window.

I’m posting to my blog the questions I’ve answered on StackOverflow, which earned the “Good Answer” badge. This was my answer to “Administering the Joel Test.”
January 11, 2009
Do I really need version control?
A user recently asked:

I read all over the internet (various sites and blogs) about version control. How great it is and how all developer NEED to use it because is a god bless.

Here is the question: do I really need this? … I usually work alone (freelancer) and I had no client that asked me to use svn (but never is too late for this, right?). So, should I start and struggle to learn to use svn (or something similar?) Or it’s just a waste of time?

Here’s a scenario that may illustrate the usefulness of source control even if you work alone.
Your client asks you to implement an ambitious modification to the website. It’ll take you a couple of weeks, and involve edits to many pages. You get to work.
You’re 50% done with this task when the client calls and tells you to drop what you’re doing to make an urgent but more minor change to the site. You’re not done with the larger task, so it’s not ready to go live, and the client can’t wait for the smaller change. But he also wants the minor change to be merged into your work for the larger change.
Maybe you are working on the large task in a separate folder containing a copy of the website. Now you have to figure out how to do the minor change in a way that can be deployed quickly. You work furiously and get it done. The client calls back with further refinement requests. You do this too and deploy it. All is well.
Now you have to merge it into the work in progress for the major change. What did you change for the urgent work? You were working too fast to keep notes. And you can’t just diff the two directories easily now that both have changes relative to the baseline you started from.
The above scenario shows that source control can be a great tool, even if you work solo. Source control can solve many problems for you, such as the following:
- You can use branches to work on longer-term tasks and then merge the branch back into the main line when it’s done.
- You can compare whole sets of files to other branches or to past revisions to see what’s different.
- You can track work over time (which is great for reporting and invoicing by the way).
- You can recover any revision of any file based on date or on a milestone that you defined.
For solo work, Subversion is recommended. CVS is all but antiquated, and GIT is more useful for distributed teams. A good book is Pragmatic Version Control Using Git by Travis Swicegood.
I’m posting to my blog the questions I’ve answered on StackOverflow, which earned the “Good Answer” badge. This was my answer to “Do I really need version control?“
January 9, 2009
SQL Antipatterns Tutorial at the MySQL Conf & Expo 2009
My tutorial proposal was accepted, so I’ll be speaking April 20 at the MySQL Conference & Expo 2009 in Santa Clara.

My tutorial is “SQL Antipatterns Strike Back.” SQL Antipatterns are frequent blunders committed by software developers, both novice and expert.

I gave a similar tutorial last year, and I think it was well-received. I’m keeping and improving the most interesting topics from last year’s tutorial, but over half of this year’s tutorial will be all-new!

The new topics in my tutorial include:
- Polymorphic Associations
- Storing Tree-Structured Data
- Index Shotgun
- Using NULL
- Goldberg Machines
- Using JOIN (…or not)
- See No Evil
- Magic Beans
- Diplomatic Immunity
The improved topics from last year include:
- Entity-Attribute-Value
- Metadata Tribbles
- FLOAT
- ENUM
- Ambiguous GROUP BY
- Using HAVING instead of WHERE
- Parameter Facade
- Phantom Side Effects
Some of these topic names are meant to be mnemonics, just like names of design patterns and anti-patterns. You’ll have to attend the tutorial to see the full meaning of these SQL Antipatterns!
December 16, 2008
Enough is enough, PlanetMySQL

Lately planetmysql.com has shown an increasing amount of commercial promotion for Sun products. Many recent articles included in this supposedly MySQL-related feed aggregator have nothing to do with MySQL.

Now I see an article included in the feed from ComputerWorld, about database researcher David Dewitt getting a job at Microsoft. Not only is this non-MySQL-related content, but more than half of the text in this article’s RSS summary is an advertisement for “Expedition Week” on the National Geographic Channel! Here’s what I saw in my RSS reader:

Microsoft hires noted database researcher to help scale SQL Server
from Planet MySQL by ComputerWorld

David DeWitt retired from the Computer Science Department at the University of Wisconsin last year, but he’s already returned, this time as head of a new database research center located on the Madison campus and funded primarily by Microsoft.

Presented By:
Expedition Week Continues Tonight

Seven nights of one great discovery after another continues tonight at 9P e/p only on National Geographic Channel. From the ancient pyramids to the ocean depths, from lost cities to outer space, travel with the latest generation of intrepid explorers as they make one great discovery after another. Expedition Week, only on National Geographic Channel.

Ads by Pheedo

I’m removing planetmysql from my RSS reader. So long — and no thanks for all the Glassfish.

November 20, 2008

Building MySQL from Bazaar

I wanted to try using MySQL source from its new VCS, Bazaar, and I saw Daniel Fischer’s extremely helpful post “Getting started with Bazaar for MySQL code“. Thanks Daniel!

Downloading using Bazaar is indeed slow, as Daniel warns. I am constrained to 1MBps ADSL, and the intial download took quite a bit of time. I think over an hour. But subsequent merges and other operations are quicker.

Building the source seems conventional, but I have run into Bug #37075 running tests. There seems to be a regression that is exhibited by th LIMIT tests with prepared statements. Here are the steps to reproduce:

$ configure --with-innodb --with-ssl
$ make
$ cd mysql-test
$ perl mysql-test-run limit

Here’s the output:

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.limit                     [ fail ]

--- /Users/bill/workspace/bzr/mysql-server/mysql-test/r/limit.result 2008-06-26
01:04:04.000000000 +0300
+++ /Users/bill/workspace/bzr/mysql-server/mysql-test/r/limit.reject 2008-06-26
23:47:59.000000000 +0300
@@ -107,6 +107,7 @@
set @a=14632475938453979136;
execute s using @a, @a;
1
+1
set @a=-14632475938453979136;
execute s using @a, @a;
ERROR HY000: Incorrect arguments to EXECUTE

mysqltest: Result length mismatch

Warnings from just before the error:
Error 1292 Truncated incorrect DECIMAL value: ''

Aborting: main.limit failed in default mode.
To continue, re-run with '--force'.

This affects MySQL 5.1.27 (revno 2662) and MySQL 6.0.6 (revno 2671). But the tests pass in MySQL 5.0.66.

June 27, 2008

ActiveRecord does not suck

I’ve been reading a few blog postings such as Kore Nordmann’s ActiveRecord sucks and Mike Seth’s ActiveRecord sucks, but Kore Nordmann is wrong.

ActiveRecord is fine. It is a tool that does just what it’s designed to do. What sucks is when developers try to make it do other things than what it’s intended to do.

I worked for Zend, managing the Zend Framework project through its 1.0 release. I also completed the implementation and documentation of Zend_Db and its related components. To set the record straight, Zend_Db does not implement the ActiveRecord pattern. It implements the Table Data Gateway and Row Data Gateway patterns, which taken together offer similar value as the ActiveRecord pattern.

I totally agree with Mike Seth that MVC should not be taken as “ActiveRecord-View-Controller.” I tried to make this point in documentation, screencasts, conference presentations, and in many mailing list messages, but I met with little success.

Unfortunately, the Ruby on Rails drumbeat that Models are simply database table wrappers has established momentum. The term “Model” has (incorrectly) become synonymous in many developers’ minds with “ActiveRecord.” Since Models by this definition are tied to database access and largely implementing various query techniques, Ruby on Rails development forces you to write a large amount of code in the controller classes that should properly be written in Model classes.

This has a few consequences. Unit-testing controller classes becomes very complex, since that’s where the majority of your application code resides. To test a controller class you need to mock HTTP requests, and sift through HTML output. This is fine, but it results in more work since testing the controller class is so important and complex. If the application code were separated into a true Model, then unit-testing the controller would simply be testing whether the HTTP request had been communicated to the Model correctly. Testing the behavior of the Model would be much more straightforward unit-testing of a class API in isolation, requiring no mock HTTP requests or scraping HTML output.

Also, unit-testing Rails-style Model classes is difficult, since the Model is coupled with the database. We start to see unfortunate things like database fixtures as being necessary before you can execute the simplest tests against your Model class. This makes testing Models time-consuming, error-prone, and run slowly.

If developers were to separate Models and Controllers properly, and separate data access components from Models, unit-testing all of these classes could be done more simply, and with greater isolation from other classes. This makes it easier to diagnose defects, when they occur. Isn’t this the point of unit tests?

A Model is a class that provides a logical component of your application domain. Models are products of OO design, which is a development activity I see get very little attention in the developer blogosphere or the developer tools market. Developers seem more enchanted by evangelizing their favorite code editor or debugger, or by squeezing more performance out of their database, than by mental analysis to make sure their OO architecture is modeling its application requirements well.

A single Model class may be backed by a database table, or multiple database tables, or perhaps even no database tables. Data persistence should be an internal implementation detail within a Model; the external API of the Model class should reflect its logical OO requirements, not the physical database structure.

(update) What I often tell people is that the relationship between a Model and an ORM class should be “HAS-A” rather than “IS-A.” The latter is the assumption of Rails and other frameworks who are enticed by ActiveRecord. If the Model uses ORM objects instead of inheriting from ORM classes, then you can design the Model to contain all data and code for the domain it’s supposed to model — even if it takes multiple database tables to represent it.

Many developers have complained that Zend Framework provides no base Model class. Of course it doesn’t provide a base Model class! That’s your job. This complaint is like saying that Microsoft Word sucks because it doesn’t provide finished documents for you.

So I wouldn’t say ActiveRecord sucks. I would say that people are expecting ActiveRecord to magically solve their OO design for them, in a false quest to avoid doing the design themselves.

May 23, 2008
SQL Antipatterns slides

Monday I gave a presentation at the MySQL User Conference in Santa Clara. I uploaded my presentation materials as a PDF to my website. It’s available under the Creative Commons 2.0 license for non-commercial, no derivative use.

Download the PDF at http://www.karwin.com/downloads/SQL_Antipatterns_MySQLUC2008.pdf

April 16, 2008
Speaking at the MySQL Conference
It’s time for the 2008 MySQL Conference. This year should be especially interesting.
Sun announced they would acquire MySQL on Jan. 16 – fewer than eight weeks ago (I bet it doesn’t seem that way to the folks involved). The deal was closed only Feb. 26, and I’m sure there are still months of work to go, to fully integrate the two companies. Exciting times!

As always, database server technology and connectivity is the focus of the MySQL Conference. The sessions are organized into a dizzying 17 tracks, only two of which aren’t specific to some technology aspect of using MySQL. I’m looking forward to seeing many of the sessions.

One in particular that caught my eye is Beat Vontobel’s talk, “The Lost Art of the Self Join” in which he says he will solve a Sudoku puzzle in SQL. I gave a talk at OSCON 2006, “SQL Outer Joins for Fun and Profit,” in which I also used SQL to solve Sudoku puzzles. It’ll be interesting to see Beat’s solution.

This year I’m giving a tutorial, “SQL Antipatterns.” This is my way of describing many best practices of database and query design, by counter-example. After answering SQL questions on newsgroups for many years, I have seen a lot of counter-examples.

I’m busy over the next couple of weeks preparing my presentation slides. Here’s my complete outline of topics for the MySQL Conference Tutorial:

Logical Database Antipatterns
Jaywalking – using a comma-separated list of values in a string.

Entity-Attribute-Value.

Multi-column Attributes.

Metadata tribbles.
Physical Database Antipatterns
ID Required.

Phantom Files.

ENUM Antipattern.

Readable Passwords.
Query Antipatterns
Ambiguous GROUP BY ((also discussed in recent blog posts by Carsten and Roland Bouman)

Using HAVING instead of WHERE.

Poor Man’s Search Engine (LIKE and REGEXP).

Implicit Columns in SELECT and INSERT.
Application Development Antipatterns
SQL Injection.

Parameter Façade.

Pseudokey Neat Freak.

Session Coupling.

Phantom Side Effects.
It’s not too late to register for the MySQL Conference!
March 10, 2008