Practical Database Design, Part 1 Primary keys and related matters

Practical database design, Part 1

By Philipp K. Janert - 2004-02-25 Page: 1 2 3 4

Primary keys and related matters

A relational database (DB) stores two kinds of information -- data and plumbing. Data comprises the customer names, inventory numbers, item descriptions, and so on, that the application uses. Plumbing refers to the primary and foreign keys that the DB needs to find database records and relate them to one another.

Basic plumbing

For the purpose of data modeling, the plumbing should be largely transparent. In fact, purist DB lore makes no distinction between data and plumbing. However, you will see that it is more efficient for administration and maintenance, as well as in terms of runtime performance, to have some additional fields to serve as DB keys.

Every table must have a primary key: an attribute or combination of attributes that are guaranteed to be unique and not-null. It is generally helpful to introduce a surrogate key -- a table attribute which has no business meaning, but simply serves as unique identifier for each record in the table. This is the plumbing that I have been referring to.

The requirements for a primary key are very strict. It must:

Exist
Be unique
Not change over time

Surrogate keys help to mitigate the fact that real business data never reliably fulfills these requirements. Not every person has a Social Security Number (think of those outside the U.S.), people change their names, and other important information.

Business data might also simply be bad -- glitches in the Social Security Administration's system may lead to different persons having the same Social Security Number. A surrogate key helps to isolate the system from such problems.

The second reason that surrogate keys are favorable has to do with efficiency and ease of maintenance, since you can choose the most efficient datatype for the surrogate key. Furthermore, the surrogate key typically is a single field (not a compound key), which simplifies the schema (particularly when the key is used in other tables as a foreign key).

Every table should have a dedicated column to serve as this table's primary key. This column may be called id or pk (or possibly <table_name>_id or <table_name>_pk). Most databases are tuned for queries on integers, so it makes sense to use this datatype as primary key. Many databases, including Postgres and Oracle, also provide a special serial or sequence integer type, which generates a sequence of unique integers. Declaring a column to be of this type guarantees that a unique key is generated for each inserted row.

Foreign keys are table attributes, the values of which are the primary keys of another table. It often makes sense to label foreign key columns explicitly, for instance, by adopting a naming convention such as <other_table_name>_fk. A referential integrity constraint (references) should be declared as part of the CREATE statement when creating the table.

It bears repeating that the surrogate keys discussed earlier are part of the plumbing only -- their existence does not obviate the modeling requirement to be able to form a primary key from the business data alone. Such a business data candidate key is a subset of all attributes, the values of which are never null, and each combination of values is unique. As a check on correct data modeling, such a candidate key must exist and should be documented for every table.

Strictly speaking, you may not always find a candidate key among the business data. Imagine a table recording the first and last name for each user, but having no further attributes. Now assume that there are two different persons, both of whom have the first name "Joe" and last name "Blow." In such a case, there exists no combination of table attributes that can form a suitable candidate key.

The underlying problem here is whether you are talking about the uniqueness of datasets or about the uniqueness of the underlying entities -- users, in this example. It is generally more intuitive, in particular to developers used to object-oriented analysis, to model the uniqueness of the underlying entities. Surrogate keys as discussed earlier can help to achieve this.

Alternate keys and visible identifiers
As part of the plumbing, the surrogate key has no need to ever be visible outside the DB. In particular, it should never be revealed to the user. This allows the DB administrator to change the representation of the keys at will if necessary. If a business need arises for providing the user with a unique identifier to a particular dataset, this identifier should be considered real business data and kept separate from the plumbing. For instance, an additional column called VisibleAccountNumber or the like can be introduced. Of course, this attribute should be non-null and unique so that it forms an alternative candidate key (an alternate key). Having a separate column for visible identifiers also makes it possible to generate and format the values for this attribute in a user-friendly way so that it is easy to read over the phone to a customer support person, for instance.

A borderline case is when the identifier is not directly visible, but may still be accessible to the user. Examples include hidden fields in Web pages in which an identifier is shuttled to the client to be used as a parameter in the following request. Although there is no need for the user to handle the identifier, a malicious user may read and attempt to spoof it. Using the numerical values of a primary key directly, in principle, allows any attacker to walk the entire table!

Defences against this problem include either encrypting and decrypting the value of the primary key, or protecting the key by appending a Message Authentication Code (MAC). An alternative is to use a hard-to-spoof visible identifier attribute for the table, such as the hash of the record's primary key or creation timestamp. (Of course, the uniqueness of this attribute must be assured.)

Whether the key is visible to the application (as opposed to the end user) depends on the specifics of the project. Using a numeric type directly carries the key's database representation straight into the application code and should be avoided to prevent coupling. In small-scale developments, a String representation of the key's value may be acceptable (all datatypes that can be stored in a DB must be able to be serialized).

But a better solution is a simple wrapper object that adds very little complexity, but provides strong decoupling of the database keys' representation from their interfaces. A danger exists in making the wrapper object too smart. The intention with surrogate keys is to make them simple and efficient for the database to handle. Settings from a database value and possibly from a String, comparing with another key object, and possibly serializing are all the methods that are required. Smarts, such as the ability to verify the contents based on a checksum calculation, suggest that this object probably belongs to the business data domain (like the visible record identifiers, introduced earlier).

The problem of the Universally Unique Identifier
A final consideration concerns the possible need for a Universally Unique Identifier (UUID). The short answer is that relational databases do not require UUIDs at all. In fact, the entire UUID concept is somewhat unrelated to relational database management. Relational database keys -- the plumbing -- need only be unique per table, which can be achieved by using an auto-incrementing datatype such as the serial type mentioned earlier.

UUIDs can have some technical difficulties. To ensure uniqueness, all UUIDs must be generated by a centralized service -- which leads to scalability problems and can become a single point of failure. (The scalability issue can be mitigated by a stratified approach in which a central master is used to give out seeds to several slaves, which in turn generate the final identifiers in batches, and so on.) To represent a UUID in a database, use either a string attribute or a compound key comprising several integer columns. Both approaches are significantly slower than operations based on keys made up of long integers. Compound keys also increase the complexity of the DB schema when used as foreign keys.

In the end, whether or not a record in a database needs to have a truly globally unique ID is dictated by the business rules, not the database architecture. Some records may already contain some form of UUID (merchandise items, for instance, typically possess a Universal Product Code as barcode). Some other records, possibly corresponding to principal business entities, may otherwise already contain a unique identifier as part of their business data (such as the combination of timestamp and account name for a ledger entry). If this is not the case, a UUID can be generated and stored alongside the business data for those records that require it. In any case, UUIDs should be considered part of the business data -- not of the plumbing.

Even if (and this is a big if) the object-relational mapping approach chosen requires every business object to have a persistent, unique ID, there is no need to base the internal workings of the underlying relational database engine on this fact.

In summary, I argue to keep business data apart from the database's internal plumbing. Building a relational database around UUIDs breaks this principle by using attributes, which, in the end, are really part of the business data, as internal infrastructure. (For a totally different point of view on this issue and a careful discussion of the problems involved in generating UUIDs in a scalable fashion, see Scott Ambler's paper, "Mapping objects to relational databases," in Resources.)

View Practical database design, Part 1 Discussion

Page: 1 2 3 4 Next Page: Datatypes

First published by IBM developerWorks