Author: Bill Karwin

  • The Case Against The Case Against Auto Increment in MySQL

    In the Pythian blog today, John Schulz writes The Case Against Auto Increment In MySQL, but his post contains some misunderstandings about MySQL, and draws some bad conclusions.

    The Concerns are Based on Bad Assumptions

    In his blog, Schulz describes several concerns about using auto-increment primary keys.

    Primary Key Access

    “…when access is made by a secondary index, first the secondary index B-Tree must be traversed and then the primary key index must be traversed.”

    This is true. If your query looks up rows by a secondary key, InnoDB does that lookup, finds the associated primary key value, then does another lookup of the primary key value in the clustered index (i.e. the table). But if your query looks up rows by the table’s primary key, the first step is skipped.

    But this has nothing to do with using an auto-inc mechanism to generate primary key values. The way secondary key lookups work is still true if we use another method to create primary key values, such as a UUID or a natural key. This is not an argument against auto-inc.

    A mitigating feature of InnoDB is that it caches frequently-requested values from secondary keys in the Adaptive Hash Index, which skips the double-lookup overhead. Depending on how often your application requests the same values repeatedly from a secondary index, this can help.

    This concern is also irrelevant for queries that do look up data by primary key. Whether you generate the primary key with auto-inc or not, it’s common for applications to search for data by primary key.

    Scalability of Auto-Inc Locks

    The blog claims:

    “When an insert is performed on a table with an auto increment key table level lock is placed on the table for inserts only until the transaction in which the insert is performed commits. When auto-commit is on, this lock lasts for a very short time. If the application is using manual commits, the lock will be maintained for a longer period.”

    This is simply incorrect. An auto-inc lock is not held by the transaction until it commits. That’s the behavior of row locks. An auto-inc lock is released immediately after a value is generated.

    See AUTO_INCREMENT Handling in InnoDB for a full explanation of this.

    You can demo this for yourself:

    1. Open two MySQL clients in separate windows (e.g. Terminal windows running the mysql CLI). In each window, begin a transaction, which disables autocommit.
    2. Insert into a table with an auto-inc primary key in the first window, but do not commit yet.
    3. Insert into the same table in the second window. Observe that the second insert succeeds immediately, with its own new auto-inc value, without waiting for the first session to commit.
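
    Here’s what that looks like as a transcript sketch (the table definition is hypothetical):

    -- assumed table:
    -- CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, val VARCHAR(20));

    -- session 1:
    BEGIN;
    INSERT INTO t (val) VALUES ('first');   -- suppose this generates id 1; still uncommitted

    -- session 2, before session 1 commits:
    BEGIN;
    INSERT INTO t (val) VALUES ('second');  -- succeeds immediately, generating id 2
    COMMIT;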

    This demonstrates that an auto-inc lock is not held for the duration of a transaction.

    If your database has such a high rate of concurrent inserts that the auto-inc lock is a significant bottleneck, you need to split writes over multiple MySQL instances, or else consider using a different technology for data ingestion. For example, a message queue like RabbitMQ or ActiveMQ, or a data stream like Logstash or Kafka. Not every type of workload is best solved with an RDBMS.

    Key Conflicts After a Replication Failure

    The scenario is that an OS or hardware failure causes a MySQL replication master to crash before its binary logs have been copied fully to its replica, and then the applications begin writing new data to the replica.

    “In a situation like this failing over to the slave will result in new rows going into auto increment tables using the same increment values used by the previous master.”

    Yes, there’s a small chance that when using asynchronous replication, you might be unlucky enough to experience a catastrophic server failure in the split-second between a binary log write and the replica downloading that portion of the binary log.

    This is a legitimate concern, but it has nothing to do with auto-inc primary keys. You could have the same risk of creating duplicate values in any other column with a unique constraint. You could have a risk of orphaned rows due to referential integrity violations.

    This risk can be mitigated by using Semi-Synchronous Replication. With this option, no transaction can be committed on the master until at least one semi-sync replica has received the binary log for that transaction. Even if the master instance suffers a catastrophic power loss and goes down, you have assurance that every transaction committed was also received by at least one semi-sync replica instance.
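
    As a sketch, enabling it looks like the following (these are the stock MySQL semi-sync plugins; the .so file names vary by platform, so treat this as an outline rather than a recipe):

    -- on the master:
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;

    -- on each replica:
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;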

    The above risk only occurs during OS or hardware crashes. See Crash-safe MySQL Replication: A Visual Guide for good advice about ensuring against data loss if the MySQL Server process aborts for some other reason.

    Key Duplication Among Shards

    This concern supposes that if you use a sharded architecture, splitting your data over multiple MySQL instances…

    “…you will quickly find that the unique keys you get from auto-increment aren’t unique anymore.”

    This supposes that a table on a given shard generates a series of monotonically increasing auto-inc values, not knowing that the same series of values is also being generated on its sister shards.

    The solution to this concern is to configure the shards to generate values offset from each other (this was quickly pointed out by Rick James in a comment on the blog).

    Set the MySQL option auto_increment_increment to the number of shards, and set auto_increment_offset to the respective shard number on each instance. With this arrangement, no shard will ever generate a value that collides with those of the other shards.
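
    For example, here’s a sketch for a hypothetical deployment of four shards, configured on shard number 3 (put the same settings in my.cnf so they survive a restart):

    SET GLOBAL auto_increment_increment = 4;  -- total number of shards
    SET GLOBAL auto_increment_offset = 3;     -- this shard's number (1 through 4)

    -- inserts on this shard now generate 3, 7, 11, 15, ...
    -- shard 1 generates 1, 5, 9, 13, ... and so on; the series never overlap.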

    The Proposed Solutions Have Their Own Problems

    Schulz recommends alternatives to using auto-incremented primary keys.

    Natural Key

    A natural key is one that is part of the business-related data you’re storing.

    “Examples of Natural keys are National Identity numbers, State or Province identity number, timestamp, postal address, phone number, email address etc.”

    There are problems with using a natural key as the primary key:

    • It might be hard to find a column or set of columns that is guaranteed to be unique and non-null, and is a candidate key for the table. For example, a national identity number isn’t a good choice, because a person who isn’t a citizen won’t have one.
    • Business requirements change regularly, so the columns that once were unique and non-null might not remain so.

    Natural keys are most useful in tables that seldom change, for example a lookup table.

    Natural Modified Key

    The suggestion is to add a timestamp or a counter column to a natural primary key for cases when the natural key columns can’t be assumed to be unique. By definition, this means the supposed natural key is not a candidate key for the table.

    It’s not unusual for a table to have no good choice of columns that can be assured to be unique. In these cases, a pseudokey based on an auto-inc mechanism is a standard solution.

    UUID

    The suggestion is to use a globally unique UUID as a primary key.

    “To save space and minimize the impact on index block consumption UUIDs should be stored as binary(16) values instead of the Char(36) form they are usually seen.”

    Even when stored as binary, a UUID requires more space than an INT (4 bytes) or BIGINT (8 bytes). Keep in mind that primary key values are internally appended to every secondary index, so it’s not merely double the space; the cost scales with the number of indexes your tables have. This doesn’t sound like saving space.
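
    For reference, here’s a minimal sketch of the binary(16) storage he describes (table and column names are hypothetical):

    CREATE TABLE users (
      id   BINARY(16) NOT NULL PRIMARY KEY,  -- 16 bytes, vs. 4 for INT or 8 for BIGINT
      name VARCHAR(50)
    );

    INSERT INTO users (id, name)
    VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'alice');

    SELECT LOWER(HEX(id)) AS id_hex, name FROM users;  -- convert back for display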

    “…they do not require table locks…”

    It’s worse than that. MySQL’s UUID() function is implemented with an internal global lock, instead of a table lock. It’s very brief of course, but in theory you can get contention. This contention might even be worse than the auto-inc table lock, because the same global lock is needed by all tables on the MySQL instance for which you generate a UUID.

    The fact that UUID doesn’t insert in key order is actually a big deal for insert performance under high load. This can be mitigated by reformatting the UUID as described by Karthik Appigatla in 2014, but this is not default behavior and it’s not widely used.
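
    The trick, roughly: a type-1 UUID stores its high-order timestamp digits in the third group, so moving the time fields to the front makes successive UUIDs sort in roughly chronological order. Here’s a sketch using string positions in the standard 36-character format:

    SET @u = UUID();          -- e.g. 'llllllll-mmmm-hhhh-ssss-nnnnnnnnnnnn'
    SET @ordered = UNHEX(CONCAT(
      SUBSTRING(@u, 15, 4),   -- time high
      SUBSTRING(@u, 10, 4),   -- time mid
      SUBSTRING(@u, 1, 8),    -- time low
      SUBSTRING(@u, 20, 4),   -- clock sequence
      SUBSTRING(@u, 25)       -- node
    ));                       -- 16 bytes that insert in roughly ascending order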

    Random insert order also leads to fragmentation in the clustered index and less locality of pages in the buffer pool. Michael Coburn showed this in an excellent blog post in 2015: Illustrating Primary Key models in InnoDB and their impact on disk usage.

    MySQL tables have no way to generate a UUID automatically. You would have to write a trigger to do this, or more application code. You would have to write a separate trigger for every table that uses a UUID. This is a lot more work than simply declaring your primary key column with the AUTO_INCREMENT option.
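
    Here’s a sketch of that per-table boilerplate, reusing the hypothetical users table from above:

    CREATE TRIGGER users_uuid BEFORE INSERT ON users
    FOR EACH ROW SET NEW.id = UNHEX(REPLACE(UUID(), '-', ''));

    -- ...and again for every other table: orders_uuid, invoices_uuid, and so on.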

    UUIDs have their uses. They are most useful for distributed applications that need to generate globally unique values, without central coordination. Aside from that use, UUIDs are more trouble than they’re worth.

    Custom Key Generator

    The suggestion is to use some other software as a central generator for creating primary key values atomically and without duplicates. This can work, but it adds operational complexity and overhead to run another software service just for generating id values. It’s also a single point of failure. How will you explain to your CIO that the database is running fine, but the applications still can’t load any data because the Snowflake server went down?

    Besides, other custom key generators are unlikely to have the same ACID reliability as InnoDB. If your key-generator service ever restarts, how do you ensure it has not lost its place in the sequence of values it generates?

    Conclusion

    Like any software feature, MySQL’s auto-increment requires some expertise and understanding to be used in the best way. Every feature has appropriate uses, as well as some edge cases where we should use a different mechanism.

    But it’s bad advice to conclude from this that we need to avoid using the feature altogether. For the majority of cases, it’s a simple, effective, and efficient solution.

  • The Private Option

    There’s a famous case of a fumbled rollout of a website: HealthCare.gov, the federal health insurance exchange used by independent insurance customers in about two-thirds of states in the USA.

    These days, an updated version of HealthCare.gov functions fine, so you may wonder what the hubbub was about when it launched.

    Poor Debut

    Proponents said that a slow rollout is not unexpected. People who managed the health insurance exchange in Massachusetts that served as the model for the Affordable Care Act say that the same initial bugs and slow adoption affected their program too.

    The site had performance and scalability problems, an overly complex user experience, and sometimes calculated wrong answers. The result was that of the 100,000 people who signed up for independent health insurance after October 1, 2013, fewer than 27,000 used the federal exchange.

    Why Did it Fail?

    HealthCare.gov had a major obstacle: it had to handle several times the originally anticipated demand. The original plan was for each US state to implement its own health insurance exchange, to serve people in that state, and HealthCare.gov would handle those who couldn’t. It was assumed that only a small minority of states would rely on HealthCare.gov, and that these would be the states with smaller populations. As it turned out, a majority of states refused to implement their own exchange web sites. In December 2012, when the states were required to have blueprints describing their solutions, reportedly 25 states didn’t meet that deadline.

    By the time of the rollout of the ACA, only 14 states were signing people up using their own state-run exchanges, whereas the rest of the states, more than two-thirds, were relying on the federal exchange. These included some of the highest-population states, like Texas and Florida, and 20 states that had taken federal money to plan their state exchanges but ultimately also relied on the federal exchange.

    The Private Option

    A few young programmers created an alternative web site called HealthSherpa.com in their spare time, after the ACA debut on October 1, 2013. Their web site is a prototype effort to make a more streamlined portal for people to find the health insurance plans they’re eligible for. It seems to work, and it’s very fast. It uses raw data that is publicly accessible from the federal government.

    It’s a valid question then: why didn’t the federal government—or any of the states—employ a small team of web experts to throw together such a site for a fraction of the cost?

    HealthSherpa.com doesn’t have all the functions that HealthCare.gov is supposed to have. It doesn’t do credit checks; it doesn’t even actually sign anyone up for health care. It just allows consumers to find the data that pertains to them, and then it links to the websites of the respective insurance carriers. And HealthSherpa.com doesn’t create the data; it may well be that part of the effort behind HealthCare.gov produced the raw data that HealthSherpa.com uses.

    Also, HealthSherpa.com isn’t (yet) serving tens of millions of users, as HealthCare.gov is supposed to do. I work for Percona, a company that offers consulting and support for database operations, which is just one aspect of web site scalability. Scalability for a web site is complex, much more difficult than most people appreciate.

    But it’s worth noting that even with these limitations, there’s a pretty big difference between three guys throwing together a working website in a few days, and major federal IT contractor CGI Federal spending $174 million in the 22 months between winning the contract in December 2011 and the go-live deadline of October 1, 2013, yet still failing to implement a site that could handle the demand.

    Conclusion

    So here’s some hindsight views on the HealthCare.gov project:

    • They should have anticipated the demand from all 50 states. This may have seemed like over-engineering, since the intention was to serve only a minority of states. But they had no control over which states would agree to create their own exchanges, and every reason to think there would be political resistance to doing so.
    • They should have had a beta test period. No large-scale web site can handle the load of millions of users on its first day, not even sites implemented by major web experts like Google and Amazon. They restrict enrollment to a limited subset of their users, sometimes by invitation only. They leave enough time to work out the problems before going fully public.
    • They should have provided raw data only, not the whole web site. Let other entrepreneurs innovate the best way to search the data. Maybe someone would even create a Facebook game for selecting your insurance.
    • They should have set the deadline after scoping the project.
  • Thoughts on Wonder Woman

    The first Wonder Woman film was released this month, and it was worth the wait. It has generated a lot of commentary. You don’t see this kind of attention paid to most superhero films. There’s a lot to recommend the film.

    Here is a summary of the plot (WARNING: SPOILERS):

    • In youth, the protagonist is continually told not to expect to be a hero or warrior, despite a desire to do so.
    • Two of the protagonist’s mentors, one of whom is a military leader, disagree about whether the protagonist is ready to go to war.
    • The mettle of the protagonist is proven during a combat exercise.
    • The protagonist meets a competent and loyal spy, who works for an ally nation.
    • The mentor who first had faith in the protagonist is killed.
    • The protagonist is driven to subterfuge in a desire to join the war effort.
    • The war is against German nationalists.
    • Enemy agents attack the protagonist and the spy in their home city. The protagonist apprehends an attacker, but before questioning, the enemy commits suicide with a cyanide capsule.
    • The protagonist carries a bulletproof shield.
    • The protagonist and the spy recruit a rag-tag group of fighters to help them get behind enemy lines and sabotage a German weapon facility.
    • The protagonist is ordered not to charge into battle, but disobeys the order, to save the lives of a small number of people.
    • The spy love interest teaches the protagonist to dance.
    • There are two principal German villain characters.
    • One of the German villains is disfigured and wears a mask. 
    • One of the German villains is a creepy little scientist.
    • The German villain characters turn on their superiors, and have their own agenda of world conquest.
    • The German villains’ plan is to use weapons of mass destruction against Allied cities. The weapons are loaded onto a comically oversized German bomber plane.
    • The blond-haired man climbs aboard the plane as it is taking off, fights the crew and pilot, takes over the plane, and sacrifices his life by ditching the plane away from populated areas.

    Oh wait—this is the plot of Captain America: The First Avenger (2011).

  • Running PHP at a Windows 10 Command Line

    A technical writer friend of mine asked me to help her this week. She needs to run PHP scripts at the command line on Windows 10. She installed WAMP Server, which includes PHP. I think she just needs to change the PATH so that when she runs “php” in a command window, it will find the PHP interpreter.

    I hardly use Windows these days. But I do have a Windows PC around, so I tried installing WAMP and then figured out what it takes to change one’s PATH on Windows these days. Here are the steps, with screen shots.

    1. Open Windows Settings and click the System icon:

    2. Click the “About” link

    3. Click the “System info” link

    4. Click the “Advanced system settings” link

    5. Click the “Environment Variables…” button

    6. Select the “Path” variable and click the “Edit…” button

    7. Click the “Browse…” button

    8. Browse to the directory “C:\wamp64\bin\php\php5.6.19” and click the “Ok” button

    9. Continue clicking the “Ok” buttons for all the windows that you opened during this exercise

    10. Open a command shell window and run “php -v” to confirm you can now use PHP via your PATH.

    Now you should be able to run PHP in the command window from any directory.
  • Webinar on PHP and MySQL Replication

    Using MySQL replication gives you an opportunity to scale out read queries. However, MySQL replication is asynchronous; the slave may fall behind.

    This Wednesday, January 23, 2013, I’ll be presenting a free webinar about using MySQL replication on busy PHP web sites. Register here: http://www.percona.com/webinars/readwrite-splitting-mysql-and-php

    Applications have variable tolerance for data being out of sync on slaves, so we need methods for the application to query slaves only when their data are within tolerance. I describe the levels of tolerance, and give examples and methods for choosing the right tolerance level in your application. 

    This talk shows the correct ways to check when the slave is safe to query, and how to architect your PHP application to adapt dynamically when the slave is out of sync.

    I’ll also demonstrate an extension to the popular PHP Doctrine database access library, to help application developers using MySQL to make use of read slaves as effectively as possible.

    Please join me in this free webinar this Wednesday!
  • C Pointers Explained, Really

    While I was in college, a friend of mine complained that he was confused while programming in C, struggling to learn the syntax for pointers.

    He gave the example of something like: *x=**p++ being ugly and unreadable, with too many operations layered on each other, making it hard to tell what was happening.  He said he had done a bit of programming with assembly language, but he wasn’t accustomed to the nuances of C.

    I wrote the following explanation on our student message board, and I got a lot of good feedback.  Some people said that they had been programming in C for years, but not until they read my post did they finally understand pointers.  So here it is, unearthed from my backups and slightly edited.  I hope it helps someone again…


    Message 1956 (8 left): Thu Jan 25 1990 2:44am
    From: Bill! (easterb@ucscb)
    Subject: Okay

    Well, if you know assembly, you have a head start on many of the CS freshpersons here. You at least know about memory maps: RAM is a long long array of bytes. It helped me to learn about pointers if I kept this in mind. For some reason, books and instructors talking about pointers want to overlook this.

    When I have some code:

    main()
    {
      int n;
      int *p;

    There is a place in my memory that looks like this:

                :
    Address:    :
             |-----|
    0x5100   |     | n is an integer, one machine word big
             |-----|
    0x5104   |     | p is a pointer, also one word big
             |-----|
    0x5108   |     | other unused memory
             |-----|
                :
                :

    Let’s give these variables some values. I set n to be the number 151.

    n = 151;

    I set the pointer p to point to the integer n.

    p = &n;

    That says, “the value of the variable p is assigned the address of the variable n”.

               :
    Address:   :    Value at that address:
             |----|
    0x5100   | 151| n
             |----|
    0x5104   |5100| p
             |----|
    0x5108   | ?  |
             |----|
               :
               :

    Now I want to print out the value of n, by two ways.

    printf("n is %d.n", n);
    printf("n is %d.n", *p);

    The * operator says, “give me the object at the following address.” The object’s type is the type that the pointer was declared as. So, since we declared “int *p”, the object pointed at will be _assumed_ by C to be an int. In this case, we were careful to make this coincide with what we were pointing at.

    Now I want to print out the memory address of n.

    printf("n is located at $%x.n", &n);
    printf("n is located at $%x.n", p);

    The & operator says, “tell me the address where the following object starts.” In this case, it is hex 5100 (I put a ‘$’ before it, to conform to the Assembly notation I am used to). Notice the _value_ of p is an address.

    Hm. Does p have an address? Sure. It is a variable, and all variables have their own address. The address of p is hex 5104.

    printf("p is located at $%x.n", &p);

    Here we are taking the address of a pointer variable, using the & operator.

    main()
    {
    char name[] = "Bill";
    char *p;
    int *q;

    Now we have an array to play with. Here’s how memory looks now:

           |---|
    0x5100 |'B'| "name" is an address constant that has value hex 5100
           |---|
    0x5101 |'i'| char: 1 byte
           |---|
    0x5102 |'l'| char: 1 byte
           |---|
    0x5103 |'l'| char: 1 byte
           |---|
    0x5104 |   | char: 1 byte
           |---|
    0x5105 |   | p is a pointer: 1 word
           |---|
    0x5109 |   | q is a pointer: 1 word
           |---|
    p = name;

    We set p to the value of name. Now p has value hex 5100 too. We can use the * dereferencing operator on p, and get the character ‘B’ as a result.

    Now what happens if I do this:

    ++p;

    The pointer p is incremented. What value does it have now? Hex 5101. Pretty simple.

    Now let’s try something irresponsible:

    q = name;

    But q is a pointer to int! If we dereference q, it will take the word (typically 4 bytes) beginning at address “name” (which is hex 5100) and try to convert it to an int. ‘B’, ‘i’, ‘l’, ‘l’ converted to an int will be some large number, dependent on the bit-ordering algorithm of your machine’s architecture. On ucscb, it becomes 1114205292. (To see how, line up the binary representations of the ascii values of those 4 characters, run the 32 bits together, and interpret the resultant binary number as an integer.)

    What we have just seen here is a key issue of pointers that I mentioned earlier: C assumes that what they are pointing at is an object of the type that the pointer was designed to point at. It is up to the programmer to make sure this happens correctly.

    ++q;

    The int pointer is incremented. What value does it have now? Hex 5104. Huh?!? The answer is simple if you accept the above paragraph. It gets incremented by the size of the object it _thinks_ it is pointing at. It’s an int pointer, so incrementing it makes it advance a number of bytes equal to the size of an int.

    Now print the dereferenced value of q (i.e. the value of the object q is pointing to). Well, it’s pointing at a null byte, and then the first 3 bytes of the char *p. Now we’re all messed up. Nice going. Try to convert _that_ to an integer representation. Well actually, C will do it happily. But it’ll be another weird number.

    main()
    {
      int n;
    
      n = 151;
      f(n);
    }
    
    f(x)
    int x;
    {
      printf("%d.n", x);
    }

    Here is a simple program that passes an int “by value”. That is, it copies the value of n into the new variable x!

           |---|
    0x5100 |151| n is an integer
           |---|
    0x5104 |151| x is another integer
           |---|

    When we mention x, we are using the value at location 5104, and we can change it, read it, whatever, and it won’t affect n, the int at location 5100.

    But what if we want to have f() modify the value and then have that new value be available in main()? C does this by passing the variable “by reference”.

    main()
    {
    int n;
    
      n = 151;
      f(&n);
    }
    
    f(x)
    int *x;
    {
      printf("%d.n", *x);
      *x = 451;
    }

    Pass the _address_ of n, and declare x as a _pointer_ to int. Actually, this is still passing by value, but the value being passed is the address, not the number.

           |----|
    0x5100 | 151| n is an integer
           |----|
    0x5104 |5100| x is a pointer to int
           |----|

    Now in f(), when we make use of *x, we are referring to the value at location 5100. This is the location of n. After the assignment “*x = 451;”, this is what we have:

           |----|
    0x5100 | 451| n is an integer
           |----|
    0x5104 |5100| x is a pointer to int
           |----|

    x still points to location 5100, but we have changed the value of the object at that location.

    Well, those are the basics. You mentioned things like “*x=**p++” being ugly and unreadable. Well, yeah, but here is a diagram that may help:

           |----| here is a word in memory with initial value 0. 
    0x5100 | 0  | no variable name
           |----|
    0x5104 | 12 | here is a value, a word in memory. no variable name.
           |----|
    0x5108 |5104| Here is an int pointer, pointing at the previous word.
           |----|
    0x511c |5108| here is p, a pointer to int pointer.
           |----|
    0x5120 |5100| here is x, a pointer. guess where it's pointing.
           |----|

    First let’s see what p and x were declared as:

    int *x;  /* pointer to int */
    int **p; /* pointer to pointer; the subordinate pointer is a pointer to int */

    You should know now what “*x” means. It means, “the value of location 5100.”

    And you know what “*p” means, “the value of location 5108”. Now that value is another address! Okay, let’s dereference that address: “**p” and we find (by the declaration) an int.

    Now “*x = **p” looks like, “this int at 5100 gets the value of that int at 5104.”

    And what does “**p++” mean? Well, ++ binds tighter than *, so this is equivalent to: *( *( p++ ) )

    Or, “pointer to pointer to int, and by the way, after we’re done, p has been incremented. But we looked where it was pointing before it got incremented, so we don’t care. Let the next statement worry about it.”


    This content is copyright 2012 by Bill Karwin.  I’ll share it under the terms of the Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0 Unported.

  • Don’t Put the Cart Before the Horse

    On April 2nd I made this undiplomatic statement (funny how Twitter practically encourages being provocative):

    #ZF 2.0 is a great example of second-system syndrome.

    Matthew Weier O’Phinney and I have a good working relationship. I think his work on the Zend Framework project has been amazing, both from a technology perspective and a marketing perspective. 
    (Photo: Matthew and Bill)
    So when Matthew asked me to clarify my Tweet, I was happy to reply, in the spirit of constructive criticism. These thoughts apply to many projects–not just ZF–so I thought they would be of general interest. Here’s the content of my reply:

    When I’ve reviewed project proposals or business plans, one thing I often advise people is that you can’t describe the value of a project in terms of how you implemented it. Users don’t want to hear about how you used XML, or dependency injection, or unit tests, or agile methodology, or whatever. They want to hear what they can do with this product.

    After reading the roadmap for ZF 2.0, I observed that a great majority of the planned changes are refactoring and internal architectural changes. These are worthwhile things to do, but the roadmap says very little about the feature set, or the value to users.

    What I’m saying is that implementation does not drive requirements. That’s putting the cart before the horse.

    I admit that for a developer framework, this line is more blurry than in other products. Your users do care about the architecture more than they would for a traditional application. But that still doesn’t account for the emphasis on implementation changes in the roadmap, and the lack of specific feature objectives.

    For instance, some goals for the controller are described in a list of four bullet items: lightweight, flexible, easy to extend, and easy to create and use custom implementations (which sounds close to easy to extend). Then it jumps right into implementation plans.

    So how flexible does it need to be, and in what usage scenarios? What does lightweight mean? How will you know when it’s lightweight? Are there benchmark goals you’re hoping to meet?

    Another example is namespacing. Yes, using namespaces allows you to use shorter class names. Is that the bottleneck for users of ZF 1.x? Do you need to create a namespace for every single level of the ZF tree to solve this? Would that be the best solution to the difficulties of using ZF 1.x?

    The point is that the way to decide on a given implementation is to evaluate it against a set of requirements. You haven’t defined the requirements, or else you’ve defined the requirements in terms of a desired implementation.

    My view is that requirements and implementation are decoupled; a specific implementation should never be treated as one of the requirements, only a means of satisfying the requirements.

    Regards,
    Bill Karwin

  • SQL Injection Slides Posted

    I gave a presentation today at the MySQL Conference & Expo 2010, titled SQL Injection Myths and Fallacies. Thanks to everyone who came to my talk! I appreciate your interest in learning to develop more secure applications. SQL Injection is a serious threat to web applications, and it’s only going to get worse. It’s incumbent on you as software developers to learn how to write secure code!

    My slides are now online in two places: on the MySQL Conference website, and at SlideShare.net/billkarwin.

    I also handed out cards for a 20% discount on my upcoming book, SQL Antipatterns. One chapter in my book is devoted to SQL Injection risks and methods for defending against them. You can pre-order the hardcopy book and receive it as soon as it ships. You can also get the downloadable beta e-book right away, and receive an update when the editing is done.

    I left a stack of the leftover discount cards on the collateral table in the hallway. If you didn’t get one, you’ll have another chance when I talk at the PHP TEK-X conference in Chicago in May!

  • Announcing Awk on Rails

    Awk on Rails is a new kind of web application development framework, with a distinction that no other framework has: Awk on Rails is fully POSIX compliant.

    Awk on Rails brings the best practices of modern web application development to the ALAS stack (Apache, Linux, Awk, Shell). This stack is entirely new to the field of web development, yet already brings decades of maturity.

    • Installation is a breeze — in fact, it’s unnecessary, because Awk on Rails uses commands and tools already provided by your operating system.
    • Develop web applications that leverage the power of high-speed interprocess I/O pipelining, utilizing POSIX regular expressions to optimize request routing through common gateway interfaces.
    • Generate your Awk on Rails application code–using awk! A sophisticated script-based front-end called wreak takes care of it for you.
    • You get unlimited flexibility to customize the base application scripts, using your choice of development environment: vi or emacs.
    • SQL? We got NoSQL! We don’t need no stinking SQL! Tired of being confused by relational databases? Manage your data in an “X-treme” non-relational data store exclusive to Awk on Rails. It’s called Hammock, and it’s based on the POSIX key-value system NDBM. To initialize your data store, it’s as simple as running the command: wreak hammock.
    • Design and render application views using the simple and popular M4 language. We all know we need to keep application design separate and free from logic. Awk on Rails can make sure this happens!
    • Embedded source code documentation is easy using a custom macro package. Create ready-to-typeset manuals with one simple command: nroff -Mawkdoc.
    • Awk on Rails comes with example applications to get you started, including a blogging & content management platform AwkWord, and a syndication provider AWRY.
    • Does it scale? Of course! Thanks to the power of Moore’s Law, you’ll stay ahead of the curve over the long haul.
    • Development, deployment, and distribution are all powered by a convenient set of three distinct software licenses. No other framework supports this many licenses! Contributing back to the Awk on Rails project? You get to sign and submit a fourth license — at no charge!

    You will soon be able to download source for Awk on Rails and join its development community, at the social source repository SCCSHub.net. As soon as we figure out whether the licenses allow us to distribute our own source code, you may be able to use it in your projects too!

    Look for future Awk on Rails developments and announcements in 2010.* Also look for an innovative cloud computing extension to Awk on Rails, called VaporWare.

    Awk on Rails: Not Really Rapid, Not Exactly Agile, More Like Dodgy.

    * Awk on Rails comes with no guarantee of release dates or timeliness of announcements. Check your calendars.
  • Rendering Trees with Closure Tables

    I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I’ve given this presentation at the MySQL Conference & Expo in the past.

    I’d also like to mention that I’ve developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

    Here’s the reader’s question:

    I would like to ask if there’s a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

    rootree
    – 1stbranch
    – midbranch
      – corebranch
        – leafnodes
    – lastbranch
      – lastleaf

    and I want to display it like:

    rootree -> 1stbranch
    rootree -> midbranch
    rootree -> midbranch -> corebranch
    rootree -> midbranch -> corebranch -> leafnodes
    rootree -> lastbranch
    rootree -> lastbranch -> lastleaf

    The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader’s example, one could define and populate two tables like this:

    drop table if exists closure;
    drop table if exists nodes;

    create table nodes (
    node int auto_increment primary key,
    label varchar(20) not null
    );

    insert into nodes (node, label) values
    (1, 'rootree'),
    (2, '1stbranch'),
    (3, 'midbranch'),
    (4, 'corebranch'),
    (5, 'leafnodes'),
    (6, 'lastbranch'),
    (7, 'lastleaf');

    create table closure (
    ancestor int not null,
    descendant int not null,
    primary key (ancestor, descendant),
    foreign key (ancestor) references nodes(node),
    foreign key (descendant) references nodes(node)
    );

    insert into closure (ancestor, descendant) values
    (1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
    (2,2),
    (3,3), (3,4), (3,5),
    (4,4), (4,5),
    (5,5),
    (6,6), (6,7),
    (7,7);

    What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL’s useful GROUP_CONCAT() function to build this list for us.

    select group_concat(n.label order by n.node separator ' -> ') as path
    from closure d
    join closure a on (a.descendant = d.descendant)
    join nodes n on (n.node = a.ancestor)
    where d.ancestor = 1 and d.descendant != d.ancestor
    group by d.descendant;

    Here’s the output in the MySQL client. It looks like what the reader asked for:

    +-------------------------------------------------+
    | path                                            |
    +-------------------------------------------------+
    | rootree -> 1stbranch                            |
    | rootree -> midbranch                            |
    | rootree -> midbranch -> corebranch              |
    | rootree -> midbranch -> corebranch -> leafnodes |
    | rootree -> lastbranch                           |
    | rootree -> lastbranch -> lastleaf               |
    +-------------------------------------------------+

    For the purposes of ordering, I do assume that all of a node’s ancestors have a lower node number. You could alternatively add a pathlength column to the closure table and sort by that.
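
    If you’d rather not rely on node numbering, here’s a sketch of that pathlength variant (you would also populate pathlength with the number of edges between the two nodes as you insert paths; 0 for the self-referencing rows):

    alter table closure add column pathlength int not null default 0;

    select group_concat(n.label order by a.pathlength desc separator ' -> ') as path
    from closure d
    join closure a on (a.descendant = d.descendant)
    join nodes n on (n.node = a.ancestor)
    where d.ancestor = 1 and d.descendant != d.ancestor
    group by d.descendant;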

    The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity. With the right indexes, the EXPLAIN report shows that the MySQL query optimizer does a pretty good job on the query (I’ve omitted a few columns for brevity):

    +-------+--------+-------------------+--------------------------+
    | table | type   | ref               | Extra                    |
    +-------+--------+-------------------+--------------------------+
    | d     | range  | NULL              | Using where; Using index |
    | a     | ref    | test.d.descendant |                          |
    | n     | eq_ref | test.a.ancestor   |                          |
    +-------+--------+-------------------+--------------------------+