The Case Against The Case Against Auto Increment in MySQL

In the Pythian blog today, John Schulz writes The Case Against Auto Increment In MySQL, but his blog contains some misunderstandings about MySQL, and makes some bad conclusions.

The Concerns are Based on Bad Assumptions

In his blog, Schulz describes several concerns about using auto-increment primary keys.

Primary Key Access

“…when access is made by a secondary index, first the secondary index B-Tree must be traversed and then the primary key index must be traversed.”

This is true. If your query looks up rows by a secondary key, InnoDB does that lookup, finds the associated primary key value, then does another lookup of the primary key value in the clustered index (i.e. the table). But if your query looks up rows by the table’s primary key, the first step is skipped.

But this has nothing to do with using an auto-inc mechanism to generate primary key values. The way secondary key lookups work is still true if we use another method to create primary key values, such as a UUID or a natural key. This is not an argument against auto-inc.

A mitigating feature of InnoDB is that it caches frequently-requested values from secondary keys in the Adaptive Hash Index, which skips the double-lookup overhead. Depending on how likely your application requests the same values repeatedly from a secondary index, this can help.

This concern is also irrelevant for queries that do lookup data by primary key. Whether you generate the primary key as auto-inc or not, it’s common for applications to search for data by primary key.

Scalability of Auto-Inc Locks

The blog claims:

“When an insert is performed on a table with an auto increment key table level lock is placed on the table for inserts only until the transaction in which the insert is performed commits. When auto-commit is on, this lock lasts for a very short time. If the application is using manual commits, the lock will be maintained for a longer period.”

This is simply incorrect. The auto-inc lock are not held by the transaction until it commits. That’s the behavior of row locks. An auto-inc lock is released immediately after a value is generated.

See AUTO_INCREMENT Handling in InnoDB for a full explanation of this.

You can demo this for yourself:

  1. Open two MySQL clients in separate windows (e.g. Terminal windows running the mysql CLI). In each window, begin a transaction, which disables autocommit.
  2. Insert into a table with an auto-inc primary key in the first window, but do not commit yet.
  3. Insert into the same table in the second window. Observe that the second insert succeeds immediately, with its own new auto-inc value, without waiting for the first session to commit.

This demonstrates that an auto-inc lock is not held for the duration of a transaction.

If your database has such a high rate of concurrent inserts that the auto-inc lock is a significant bottleneck, you need to split writes over multiple MySQL instances, or else consider using a different technology for data ingestion. For example, a message queue like RabbitMQ or ActiveMQ, or a data stream like Logstash or Kafka. Not every type of workload is best solved with an RDBMS.

Key Conflicts After a Replication Failure

The scenario is that an OS or hardware failure causes a MySQL replication master to crash before its binary logs have been copied fully to its replica, and then the applications begin writing new data to the replica.

“In a situation like this failing over to the slave will result in new rows going into auto increment tables using the same increment values used by the previous master.”

Yes, there’s a small chance that when using asynchronous replication, you might be unlucky enough to experience a catastrophic server failure in the split-second between a binary log write and the replica downloading that portion of the binary log.

This is a legitimate concern, but it has nothing to do with auto-inc primary keys. You could have the same risk of creating duplicate values in any other column with a unique constraint. You could have a risk of orphaned rows due to referential integrity violations.

This risk can be mitigated by using Semi-Synchronous Replication. With this option, no transaction can be committed on the master until at least one semi-sync replica has received the binary log for that transaction. Even if the master instance suffers a catastrophic power loss and goes down, you have assurance that every transaction committed was also received by at least one semi-sync replica instance.

The above risk only occurs during OS or hardware crashes. See Crash-safe MySQL Replication: A Visual Guide for good advice about ensuring against data loss if the MySQL Server process aborts for some other reason.

Key Duplication Among Shards

This concern supposes that if you use a sharded architecture, splitting your data over multiple MySQL instances…

“…you will quickly find that the unique keys you get from auto-increment aren’t unique anymore.”

This supposes that a table on a given shard generates a series of monotonically increasing auto-inc values, not knowing that the same series of values are also being generated on its sister shards.

The solution to this concern is to configure the shards to generate values offset from each other (this was quickly pointed out by Rick James in a comment on the blog).

Set the MySQL option auto_increment_increment to the number of shards, and set auto_increment_offset to the respective shard number on each instance. With this arrangement, each shard won’t generate values generated by the other shards.

The Proposed Solutions Have Their Own Problems

Schulz recommends alternatives to using auto-incremented primary keys.

Natural Key

A natural key is one that is part of the business-related data you’re storing.

“Examples of Natural keys are National Identity numbers, State or Province identity number, timestamp, postal address, phone number, email address etc.”

There are problems with using a natural key as the primary key:

  • It might be hard to find a column or set of columns that is guaranteed to be unique and non-null, and is a candidate key for the table. For example, a national identity number isn’t a good choice, because a person who isn’t a citizen won’t have one.
  • Business requirements change regularly, so the columns that once were unique and non-null might not remain so.

Natural keys are most useful in tables that seldom change, for example a lookup table.

Natural Modified Key

The suggestion is to add a timestamp or a counter column to a natural primary key for cases when natural primary key column can’t be assumed to be unique. By definition, this means the supposed natural key is not a candidate key for the table.

It’s not unusual for a table to have no good choice of columns that can be assured to be unique. In these cases, a pseudokey based on an auto-inc mechanism is a standard solution.

UUID

The suggestion is to use a globally unique UUID as a primary key.

“To save space and minimize the impact on index block consumption UUIDs should be stored as binary(16) values instead of the Char(36) form they are usually seen.”

Even when stored as binary, a UUID requires more space than an INT (4 bytes) or BIGINT (8 bytes). Keep in mind that primary key values are internally appended to every secondary index, so it’s not merely double the space, it scales up with the number of indexes your tables have. This doesn’t sound like you’re saving space.

“…they do not require table locks…”

It’s worse than that. MySQL’s UUID() function is implemented with an internal global lock, instead of a table lock. It’s very brief of course, but in theory you can get contention. This contention might even be worse than the auto-inc table lock, because the same global lock is needed by all tables on the MySQL instance for which you generate a UUID.

The fact that UUID doesn’t insert in key order is actually a big deal for insert performance under high load. This can be mitigated by reformatting the UUID as described by Karthik Appigatla in 2014, but this is not default behavior and it’s not widely used.

Random insert order also leads to fragmentation in the clustered index and less locality of pages in the buffer pool. Michael Coburn showed this in an excellent blog post in 2015: Illustrating Primary Key models in InnoDB and their impact on disk usage.

MySQL tables have no way to generate a UUID automatically. You would have to write a trigger to do this, or more application code. You would have to write a separate trigger for every table that uses a UUID. This is a lot more work than simply declaring your primary key column with the AUTO_INCREMENT option.

UUIDs have their uses. They are most useful for distributed applications that need to generate globally unique values, without central coordination. Aside from that use, UUIDs are more trouble than they’re worth.

Custom Key Generator

The suggestion is to use some other software as a central generator for creating primary key values atomically and without duplicates. This can work, but it’s operational complexity and overhead to run another software service just for generating id values. It’s a single point of failure. How will you explain to your CIO that the database is running fine, but the applications still can’t load any data because the Snowflake server went down?

Besides, other customer key generators are unlikely to have the same ACID reliability as InnoDB. If your key-generator service ever restarts, how do ensure it has not lost its place in the sequence of values it generates?

Conclusion

Like all software, using MySQL’s auto-increment feature requires some expertise and understanding to be used in the best way. Every feature has appropriate uses, as well as some edge cases where we should use a different mechanism.

But it’s bad advice to conclude from this that we need to avoid using the feature altogether. For the majority of cases, it’s a simple, effective, and efficient solution.

Webinar on PHP and MySQL Replication

Using MySQL replication gives you an opportunity to scale out read queries. However, MySQL replication is asynchronous; the slave may fall behind.
This Wednesday, January 23 2013, I’ll be presenting a free webinar about using MySQL replication on busy PHP web sites.  Register here:  http://www.percona.com/webinars/readwrite-splitting-mysql-and-php

Applications have variable tolerance for data being out of sync on slaves, so we need methods for the application to query slaves only when their data are within tolerance. I describe the levels of tolerance, and give examples and methods for choosing the right tolerance level in your application. 

This talk shows the correct ways to check when the slave is safe to query, and how to architect your PHP application to adapt dynamically when the slave is out of sync.
I’ll also demonstrate an extension to the popular PHP Doctrine database access library, to help application developers using MySQL to make use of read slaves as effectively as possible.

Please join me in this free webinar this Wednesday!

Sql Injection Slides Posted

I gave a presentation today at the MySQL Conference & Expo 2010, titled SQL Injection Myths and Fallacies. Thanks to everyone who came to my talk! I appreciate your interest in learning to develop more secure applications. SQL Injection is a serious threat to web applications, and it’s only going to get worse. It’s incumbent on you as software developers to learn how to write secure code!

My slides are now online in two places: on the MySQL Conference website, and at SlideShare.net/billkarwin.

I also handed out cards for a 20% discount on my upcoming book, SQL Antipatterns. One chapter in my book is devoted to SQL Injection risks and methods for defending against them. You can pre-order the hardcopy book and receive it as soon as it ships. You can also get the downloadable beta e-book right away, and receive an update when the editing is done.

I left a stack of the leftover discount cards on the collateral table in the hallway. If you didn’t get one, you’ll have another chance when I talk at the PHP TEK-X conference in Chicago in May!

Rendering Trees with Closure Tables

I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I’ve given this presentation at the MySQL Conference & Expo in the past.

I’d also like to mention that I’ve developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

Here’s the reader’s question:

I would like to ask if there’s a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

rootree
– 1stbranch
– midbranch
– corebranch
– leafnodes
– lastbranch
– lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader’s example, one could define and populate two tables like this:

drop table if exists closure;
drop table if exists nodes;

create table nodes (
node int auto_increment primary key,
label varchar(20) not null
);

insert into nodes (node, label) values
(1, ‘rootree’),
(2, ‘1stbranch’),
(3, ‘midbranch’),
(4, ‘corebranch’),
(5, ‘leafnodes’),
(6, ‘lastbranch’),
(7, ‘lastleaf’);

create table closure (
ancestor int not null,
descendant int not null,
primary key (ancestor, descendant),
foreign key (ancestor) references nodes(node),
foreign key (descendant) references nodes(node)
);

insert into closure (ancestor, descendant) values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),
(3,3), (3,4), (3,5),
(4,4), (4,5),
(5,5),
(6,6), (6,7),
(7,7);

What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL’s useful GROUP_CONCAT() function to build this list for us.

select group_concat(n.label order by n.node separator ‘ -> ‘) as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

Here’s the output in the MySQL client. It looks like what the reader asked for:

+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+

I do assume for the purposes of ordering that all of a node’s ancestors have a lower node number. You could alternatively use a pathlength column to the closure table and sort by that.

The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity. By using indexes, the EXPLAIN report shows that MySQL query optimizer does a pretty good job on it (I’ve omitted a few columns for brevity):

+-------+--------+-------------------+--------------------------+
| table | type | ref | Extra |
+-------+--------+-------------------+--------------------------+
| d | range | NULL | Using where; Using index |
| a | ref | test.d.descendant | |
| n | eq_ref | test.a.ancestor | |
+-------+--------+-------------------+--------------------------+

Speaking on SQL Injection at MySQL Conference

O'Reilly MySQL Conference & Expo 2010

I’m speaking this year at the MySQL Conference & Expo 2010 in Santa Clara. Be sure to get your early registration discount by Feb 22! If you miss that deadline, get 25% off with this discount code: mys10fsp

I’m presenting a talk on SQL Injection Myths and Fallacies. This may seem like a topic that’s been done to death, but it’s important for all developers to understand it. This reminds me of a story:

My mother volunteers with the League of Women Voters. One of their activities is helping college students register to vote. So every year they set up a table on campus and help young people fill out the forms.

One day one of the women expressed frustration: “We’ve been doing this for ten years! When are these students going to learn how to register to vote for themselves?!”

The rest of the group looked at her blankly. Finally someone said calmly, “You realize that every year a new class of students becomes eligible to vote, right?

The woman who complained felt suitably embarrassed.

I’m going to cover the basics about SQL Injection, but I’ll also show how much of the advice about SQL Injection (even advice from noted security experts) misses the whole picture. I’ll also give some new techniques for remedies, that I seldom see in books or blogs. Come on by!

What is QEP?

In the context of database programming, QEP is an acronym for Query Execution Plan.

The database server analyzes every SQL query and plans how to use indexes and order tables to produce the result in the most efficient way.

You can get a report of the QEP for a SELECT query using the EXPLAIN command in MySQL. This is an important tool to analyze your SQL queries and detect inefficiencies.

Consultant Ronald Bradford uses the acronym QEP freely, but this term is not ubiquitous and doesn’t appear in the MySQL documentation. Googling for QEP yields some confusing results:

Remember: if you are using an uncommon acronym in your writing, define it when you use it. Even if you have used the acronym in past blog posts, define it again in a new blog post. Or at least link to the original post where you defined it.

Free Software vs. Gratis Software

A lot of folks are unclear on the subtleties of free software and open source. Mike Hogan writes a blog article”Is Hybrid Licensing of OSS Hypocrisy?” to try to shed some light on this. With respect, I think he has missed part of it.

We’re talking about two orthogonal things here. One is open-source versus closed-source, and the other is whether we charge money for software licenses or not. As Mike points out, the former is about development methodology. The latter is about business model.

There ought to be nothing wrong with charging money for open-source software. In fact, there isn’t — even the GPL permits this. This is related to the origin of open-source in the Free Software movement. “Free software” to them is about what the user is licensed to do with that software, not whether they paid anything to license it. In fact, some companies package free software and charge for it (presumably with some added value).

The marketing interpretation of “free software” is the “free chips & salsa” concept Mike mentions. It’s a way of encouraging the market to adopt this product. It’s a loss leader, usually followed by upselling the customer with other products or services.

There’s also a case for no-charge versions of closed-source software. Typically these have either limited features (“crippleware”) or they expire after a limited time (e.g. “demo”).

Few companies have found a way to use open source methods to develop full products they then charge money for. But this could simply be because it’s hard to drive adoption of any software product, regardless of whether it’s open-source or closed-source.

On the issue of hybrid licensing, I see this as no hypocrisy; I see it as more freedom. If I develop some code, offer it under the GPL license, and you use my code as part of your project, then you are obligated to license your project with a GPL-compatible license. This is termed the “viral” nature of the GPL license, and it’s clearly intended to promote the free software movement.

What if you don’t want to adopt a GPL-compatible license for your project? Well, no one is forcing you to use my code. But my code is really amazingly good, and you want it. You want it so much that you’re willing to give me money if I grant you a license to use it under different terms. If I’m willing to do that, now you have more freedom — you can choose to contribute your own code to the body of free software in the world, or you can choose not to. But the latter choice may have a different price tag associated with it.

This should still promote the principles of the free software movement. It would be wrong to charge someone for their freedom. But in the hybrid license model, you can avoid paying for a license simply by joining the movement, by spreading the freedom. If you want to stick to your closed-source model, you can pay for the privilege of using my code in that way.

I don’t see any hypocrisy in a software maker using a hybrid licensing model, as long as they are consistent and honest about it.

I’m Speaking on SQL at OSCON

OSCON 2009

Early Registration has been extended to June 23. Save up to $250!

Enter my friends-of-speaker discount code “os09fos” when you register, and save an additional 20%! Just because you read my blog.

Practical Object-Oriented Models in SQL

Wednesday July 22, 5:20pm.

SQL is from Mars, Objects are from Venus.

This talk is for software developers who know SQL but are stuck trying to implement common object-oriented structures in an SQL database. Mimicking polymorphism, extensibility, and hierarchical data in the relational database paradigm can be confusing and awkward, but they don’t have to be.

  • Polymorphism: Suppose your blog supports comments, but then your comments need to reference multiple types of content, for example news, blog articles, and videos. What then?
  • Extensibility: We’ve all designed customizable software, allowing customers to extend a data model with new data attributes. See how to design flexible systems, while using efficient SQL queries.
  • Hierarchies: Tree-structured data relationships are common, but working with trees in SQL usually implies recursive queries. There are a few solutions to solve this more cleanly.
  • ActiveRecord Dos and Don’ts: Web development frameworks have popularized the use of design patterns, but when it comes to multi-table queries, complex views, and assignment of OO responsibilities, ActiveRecord falls short as a one-size-fits-all Domain Model.

BoF: Meet Authors from Pragmatic Bookshelf

Wednesday July 22, 7:00pm

Gather with published and upcoming authors of programming books from the industry favorite publisher, Pragmatic Bookshelf. Join this informal chat about programming, writing books, job hunting, and career development.

Agenda:

  • Author introductions, books, OSCON presentations.
  • Experiences working with a publisher.
  • How does authoring a book aid a tech career?
  • What tech books would you like to see?

Pragmatic Bookshelf authors attending OSCON include:

  • Ian Dees is presenting “Testing iPhone Apps with Ruby and Cucumber” at OSCON (Wednesday 10:45am). Ian authored the book “Scripted GUI Testing with Ruby.”
  • Bill Karwin is presenting “Practical Object-Oriented Models in SQL” at OSCON (Wednesday 5:20pm). Bill is currently writing a book “SQL Antipatterns.”
  • Other Prag authors are attending OSCON, and plan to be at this BoF.

EAV FAIL

The photo above illustrates (by counter-example) an important characteristic of a normalized database: each logical “type” of attribute belongs in a separate column.

Just because three values happen to be numeric doesn’t mean it makes sense to SUM() them together. But if dissimilar attributes are stored in the same column, it’s tempting to treat them as compatible in this way.

This also shows a fallacy of the Entity-Attribute-Value antipattern. In this design, all attribute values are stored in a single column.

CREATE TABLE EntityAttributeValue (
entity VARCHAR(20) NOT NULL,
attribute VARCHAR(20) NOT NULL,
value VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

INSERT INTO EntityAttributeValue (entity, attribute, value)
VALUES
('New Cuyama', 'Population', '562'),
('New Cuyama', 'Ft. above sea level', '2150'),
('New Cuyama', 'Established', '1951'),

SELECT SUM(value) FROM EntityAttributeValue
WHERE entity = 'New Cuyama';

The Entity-Attribute-Value design does not support or conform to rules of database normalization.

To be clear, the proper way to design a database is to put different attributes in different columns. Use column names, not strings, to identify the attributes.

CREATE TABLE Cities (
city_id SERIAL PRIMARY KEY,
city_name VARCHAR(100) NOT NULL,
population INT UNSIGNED NOT NULL,
feet_altitude SMALLINT UNSIGNED NOT NULL,
year_established SMALLINT UNSIGNED NOT NULL
);

SQL Antipatterns Strike Back! Slides

I presented my tutorial at the MySQL Conference & Expo today. I have fun preparing it and presenting it, and I got many good questions and comments from the audience. Thanks to everyone for coming and participating!

I have uploaded my slides with a Creative Common 3.0 license to my SlideShare account: http://www.slideshare.net/billkarwin

For those who did not get to see my tutorial, I’m presenting some selections from it during a 45-minute session at the MySQL Camp on Wednesday at 2:00pm, under the title “Practical Object-Oriented Models in SQL.”

See you next year!