Practical Database Design, Part 1 Datatypes

Practical database design, Part 1

By Philipp K. Janert - 2004-02-25 Page: 1 2 3 4

Datatypes

The SQL standard defines a number of standard datatypes and most DB vendors support some additional ones which are specific to their own product. In the absence of truly compelling reasons to the contrary, avoid such extensions for the sake of portability.

Strings and numbers

In general, numerical types pose few problems -- just select one that is large enough to support the necessary range of values.

The attempt to find the optimal width of a string column is usually not worth the effort. You can avoid a lot of confusion later on by making all text messages of type varchar(n) and limiting yourself to a few standard string lengths and introducing aliases for them, such as: 32 bytes ("Label"), 256 bytes ("Note"), and 4k ("Text").

Even if other business requirements restrict the maximum length of certain fields to specific values, the DB schema is arguably not the best place to enforce these rules. By the time the data reaches the DB, it is too late to do anything about it (except reject it). Individual restrictions, stemming from business rules and requirements, should be enforced by the business logic layer, which handles user interaction and input validation. On the other hand, maintenance of the DB schema is considerably simplified if it is restricted to a handful of different string attributes.

Limit the use of fixed-width strings to codes of all sorts (as opposed to variable-length strings for real text). Keep in mind however, that many seemingly fixed-length codes do actually become wider over time. The prudent DB designer tries to avoid anything similar to the Y2K problem for new development work.

Time and money

A type to record timestamps (date/time combinations) is always necessary and is, fortunately, covered by the SQL standard. No fully satisfying way exists to record a monetary value, however.

Saving monetary values and treating them in program code as a floating-point values always leads to round-off errors. Recording the value as an exact integer of the smallest currency subdivision (such as "cent" for US dollars, as well as for Euros and other appropriate currencies) may not be sufficient either. Many values carry more digits behind the decimal point than the two for which actual coins exist (just visit your local gas station). A choice of decimal with 5 to 9 digits should work, though.

It goes without saying that no monetary value should ever be recorded without also recording the currency -- even if you think your application will never handle anything but US dollars. Consider setting up a currency table and relating it to the monetary values using foreign keys rather than embedding currency information directly. This helps with internationalization (different currency names and symbols), as well as with formatting issues.

Booleans and the System Types Table

The use of the type bool anywhere in the design is a hint to rethink this particular module again. Very few attributes are truly limited to only two values -- even a gender column has a malicious tendency to morph towards (at least) three states -- male, female, and unknown. Allowing nulls only masks the real problem. A more flexible approach to type codes is needed.

In many places in the database, attributes determine a record's type in some way. The GenderType mentioned above is such an instance. Other examples may include ItemType (such as Merchandise, Freight, Packaging, Insurance), PaymentType (Cash, Check, MoneyOrder, CreditCard, Barter), and things like StoreType, MembershipType, DeviceType, and many more. This also includes instances in which you want to store the type of an object in some applicable object model.

For each type, you need some form of documentation telling you not only the name of the type, but also the characteristics associated with it; for instance, you may want to know what permissions each UserType entails. What better place to keep this information than in the database itself?

Any record that has some form of type information associated with it should contain a type code column, which itself is a foreign key referencing a type code table. The foreign key constraint ensures that no records with invalid types exist. The type code table might have the following attributes:

typeCode_pk
label (unique mnemonic, such as varchar(32))
description (varchar(256) should be sufficient)
uri (pointing to additional resources, where necessary)
codeGroup_fk

Additional attributes are, of course, conceivable -- such as a three-letter code or a visible numeric code.

The codeGroup_fk attribute serves to organize related type codes. For instance, all subscriber types might form a group. The codeGroup_fk attribute is a foreign key into a separate code group table. However, realizing that a code group is nothing but a type code itself, the relationship can be made recursive so that codeGroup_fk references typeCode_pk. Not only does this make the additional type code table unnecessary, but it also makes it possible to order groups in an arbitrarily deep hierarchical structure. It is best to keep the type of code system relatively simple and straightforward.

Complex datatypes

Finally, there are some common but complex datatypes -- such as phone numbers, postal addresses, contact information, and credit cards -- that occur in almost every database schema. Typically, such records need to be accessed from many tables in the database. In a typical eCommerce system, for instance, it might be necessary to store contact information for users, suppliers, warehouses, and admins.

Rather than including those attributes in the respective user, supplier, or other record. (and thereby repeating those columns throughout the database), it makes sense to set up a single table for the contact information that is referenced through foreign keys by all other tables. This has two immediate benefits:

It is easier to later change the cardinality of the relationships.
It localizes any future changes to the complex datatype.

Anticipating the attributes that will likely be required for each of those complex types is something of an art. My suggestion is to try to strive for completeness from the outset rather than being forced to change the schema each time an additional field becomes necessary.

A sampling of possible attributes for postal addresses includes:

Department
Company
Mail Stop
Address Line 1
Address Line 2
Address Line 3
City
State
Postal Code ("Zip")
Country

Full contact information might include the following attributes:

Title
First Name
Middle Name (or Initial)
Last Name
Suffix (such as jr. or sr.)
Home address
Work address
Home Phone
Work Phone
Cell Phone
Fax
Pager
Email

Finally, phone numbers should never be considered flat numbers. In fact, they break down into the following fields:

CountryCode
AreaCode (Number Plan Area)
ExchangeCode (Prefix)
LineNumber (Suffix)
Extension

In a phone number such as 987-1234, the prefix is the 987 and the suffix is the 1234. The extension is the only part of the phone number that is optional. It is probably sufficient to use char(4) for all columns, but one might consider char(6) to be on the safe side. Note that area codes in the US are limited to three digits, but this is not true for other countries.

Sensitive data should be kept in encrypted form. Even if the database system itself is compromised, the data is still protected from misuse. The most famous example of this kind of data management is the Unix password system which only stores hashes of the users' passwords rather than the passwords themselves. Some data, such as credit card numbers, needs to be encrypted in a recoverable fashion; however, a one-way encryption (as for the Unix password file) will not do. This leads to the problem of encryption key management -- clearly, it should not be stored in the DB, along with the secrets, but supplied at boot-time, for instance.

View Practical database design, Part 1 Discussion

Page: 1 2 3 4 Next Page: Summary and Resources

First published by IBM developerWorks