Secure programmer: Validating input

By David A. Wheeler - 2004-01-23

More specific data types

Of course, there are many more specific types of data. Here are a few guidelines for some of them.

Filenames

If the data is a filename (or will be used to create one), be very restrictive. Ideally, don't let users choose filenames; if that won't work, limit legal names to a small pattern such as ^[A-Za-z0-9][A-Za-z0-9._\-]*$. Consider excluding from the legal pattern characters like "/", control characters (especially newline), and a leading "." (which marks a hidden file in UNIX/Linux). A leading "-" is also a bad idea, since poorly written scripts may misinterpret such names as options: if there's a file named "-rf", then in UNIX/Linux the command rm * will effectively become rm -rf *. Excluding "../" from the pattern is a good idea, to keep attackers from "escaping" the directory. When possible, don't allow globbing (selecting groups of files using the characters *, ?, [], and {}); an attacker can make some systems grind to a halt by creating ridiculously convoluted globbing patterns.
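As a minimal sketch of the filename check described above, here is one way to apply the pattern in Python. The function name is_safe_filename and the 255-character length cap are assumptions made for illustration, not part of the original guideline. Note that the sketch uses fullmatch rather than a trailing "$", because in Python "$" also matches just before a final newline.

```python
import re

# Pattern from the text: an alphanumeric first character, followed by
# letters, digits, ".", "_" or "-". No "/" and no leading "." or "-".
FILENAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._\-]*")


def is_safe_filename(name: str, max_len: int = 255) -> bool:
    """Return True only if the whole name matches the restrictive pattern.

    max_len is an illustrative assumption, not a value from the article.
    """
    # fullmatch anchors both ends and does not tolerate a trailing newline.
    return len(name) <= max_len and FILENAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["report.txt", "-rf", ".hidden", "../etc/passwd", "notes\n"]:
        print(repr(candidate), is_safe_filename(candidate))
```

Only the first candidate in the usage loop is accepted; the others fail on a leading "-", a leading ".", a "/" character, and a trailing newline respectively.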

Windows has an additional problem: some filenames (ignoring the extension and upper/lower case) are always considered physical devices. For example, a program that tries to open "COM1" or even "com1.txt" in any directory will get stuck trying to talk to a serial connector. Since I'm concentrating on UNIX-like systems, I won't go into the details of how to deal with this problem, but it's worth noting, because this is an example where simply checking for legal characters isn't enough.

Locale

Given today's global economy, many programs must let each user select the language to be displayed and other culture-specific information (such as number formatting and character encoding). Programs get this information as a "locale" value provided by the user. For example, the locale "en_US.UTF-8" specifies the English language, United States conventions, and the UTF-8 character encoding. Local UNIX-like programs get this information from an environment variable (usually LC_ALL, but it might be set by the more specific LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, and LC_TIME; other values to check are NLSPATH, LANGUAGE, LANG, and LINGUAS). Web applications can get this through the Accept-Language request header, though other ways are often used, too.

Since the user may be an attacker, we need to validate the locale value. I recommend that you make sure that locales match this pattern:

^[A-Za-z][A-Za-z0-9_,+@\-\.=]*$

How I created this validation pattern may be even more instructive than the pattern itself. I first searched for the relevant standards and library documentation to determine what a correct locale should look like. In this case, there are competing standards, so I had to make sure the final pattern would accept all of them. It didn't take long to realize that only the characters listed above are needed, and limiting the character set (especially the first character) would eliminate many problems. I then thought about common dangerous values (such as "/" for a directory separator, the lone ".." for the parent directory, leading dashes, or the empty locale), and confirmed that they wouldn't get through the filter.
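The same confirmation step can be repeated mechanically. The following sketch applies the locale pattern in Python and tries the dangerous values mentioned above; the function name is_valid_locale and the choice of sample values are assumptions made for illustration.

```python
import re

# Locale pattern from the text: a leading letter, then a limited character set.
LOCALE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_,+@\-\.=]*")


def is_valid_locale(value: str) -> bool:
    """Return True only if the entire value matches the locale pattern."""
    return LOCALE_RE.fullmatch(value) is not None


if __name__ == "__main__":
    samples = ["en_US.UTF-8", "de_DE@euro", "", "..", "-C", "../../etc/passwd"]
    for value in samples:
        print(repr(value), is_valid_locale(value))
```

Requiring a letter in the first position is what rejects the empty string, the lone "..", and a leading dash; the restricted character set is what rejects anything containing "/".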

UTF-8

Internationalization has had another impact on programs: character encodings. Handling text requires that there be some convention for converting characters into the numbers computers actually handle; these conventions are called character encodings. A particularly common way of encoding text today is UTF-8, a wonderful character encoding that is able to represent any character in essentially any language. UTF-8 is especially nice because its design makes ordinary ASCII text a simple subset of UTF-8. As a result, programs originally designed to only handle ASCII can often be easily upgraded to process UTF-8; in some cases they don't have to be modified at all.

But, like all good things, the UTF-8 design has a downside. Some UTF-8 characters are normally represented in one byte, others in two, others in three, and so on, and programs are supposed to always generate the shortest possible representation. However, many UTF-8 readers will accept "excessively long" sequences; for example, certain three-byte sequences may be interpreted as a character that's supposed to be represented by two. Attackers can use this fact to "slip through" data validators to attack programs. Your filter might not allow the hexadecimal values 2F 2E 2E 2F ("/../"), but if it allows the UTF-8 hexadecimal values 2F C0 AE 2E 2F, the program might also interpret that as "/../". So, if you accept UTF-8 text, you need to make sure that every character uses the shortest possible UTF-8 encoding (and reject any text not in its shortest form). Many languages have utilities to do this, and it's not hard to write your own. Note that the sequence "C0 80" is an overlong sequence that can represent NUL (character 00); some languages (such as Java) enshrine this particular sequence as acceptable.
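As one example of such a utility, Python 3's built-in strict UTF-8 decoder treats overlong forms as malformed input, so a validity check can be a thin wrapper around it. This is a sketch under that assumption; the function name is_valid_utf8 is made up for illustration, and other languages or older decoders may behave differently.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The two byte sequences from the text.
plain = bytes([0x2F, 0x2E, 0x2E, 0x2F])           # "/../" in shortest form
overlong = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])  # "." hidden as C0 AE

if __name__ == "__main__":
    print(is_valid_utf8(plain))           # shortest-form input is accepted
    print(is_valid_utf8(overlong))        # overlong "." is rejected
    print(is_valid_utf8(b"\xc0\x80"))     # overlong NUL is rejected too
```

The important design point is to reject the input outright rather than decode it leniently and filter afterwards, since a lenient decoder is exactly what lets the disguised "/../" through.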

E-mail Addresses

Many programs must accept e-mail addresses, but correctly handling all possible legal e-mail addresses (as specified in RFC 2822 and 822) is surprisingly difficult. Jeffrey Friedl's "short" regular expression to check them is 4,724 characters long, and even that doesn't cover some cases. However, most programs can be quite strict, accepting only a very limited subset of e-mail addresses, and still work well. In most cases, it's okay to reject technically valid addresses like "John Doe <john.doe@somewhere.com>" as long as the program can accept normal Internet addresses in the "name@domain" format (like "john.doe@somewhere.com"). Viega and Messier's 2003 book has a subroutine that can do this check.
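To illustrate the "strict subset" approach, here is a small sketch that accepts only plain "name@domain" addresses. The pattern and the function name is_plain_address are assumptions made for this example; this is not the subroutine from Viega and Messier's book, and it deliberately rejects many addresses that the RFCs allow.

```python
import re

# Hypothetical strict subset: an alphanumeric first character in the local
# part (so no leading "-"), and a domain made of at least two dot-separated
# labels that each start with a letter or digit.
ADDRESS_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._%+\-]*"
    r"@"
    r"[A-Za-z0-9][A-Za-z0-9\-]*(\.[A-Za-z0-9][A-Za-z0-9\-]*)+"
)


def is_plain_address(address: str) -> bool:
    """Return True only for addresses in the restricted name@domain form."""
    return ADDRESS_RE.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in ["john.doe@somewhere.com",
                      "John Doe <john.doe@somewhere.com>",
                      "john.doe@localhost"]:
        print(repr(candidate), is_plain_address(candidate))
```

Whether single-label domains such as "localhost" should be accepted depends on the application; this sketch rejects them.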

Cookies

Web applications often use cookie values for important data. As I'll discuss later, it's important to remember that users can reset cookie values and form data to anything they want. However, there's one important validation trick that's worth mentioning now. If you accept a cookie value, check that its domain value is what you expect (i.e., one of your sites). Otherwise, a (possibly cracked) related site might be able to insert spoofed cookies. Details on how this attack works are explained in IETF RFC 2965, if you're curious (see Resources for a link).

HTML

Sometimes your program will take data from an untrusted user and give it to another user. If the second user's program might be harmed by that data, then it's your job to protect the second user! Attacks that exploit an apparently trustworthy intermediary to pass on malicious data are called "cross-site malicious content" attacks.

These attacks are especially a problem for Web applications, such as those that implement community "bulletin boards" to allow users to add running commentary. In this case, attackers can try to add commentary in HTML format with malicious scripts, image tags, and so on; their goal is to cause all other users' browsers to run the malicious code when they view the text. Since attackers are usually trying to add malicious scripts, this particular variation is called a "cross-site scripting attack" (XSS attack).

It's often best to prevent this attack by validating any HTML you accept to make sure that it doesn't have this kind of malicious content. Again, what you do is enumerate what you know is safe, and then forbid anything else.

Generally, in HTML you can at least accept these, as well as all their ending tags:

<p> (paragraph)
<b> (bold)
<i> (italics)
<em> (emphasis)
<strong> (strong emphasis)
<pre> (preformatted text)
<br> (forced line break -- note that it doesn't require a closing tag)

Remember that HTML tags are not case sensitive. Don't accept any attributes unless you've checked the attribute type and its value; many attributes support things such as JavaScript that can cause trouble for your users.

You can certainly expand the set, but be careful. Be especially wary of any tag that causes the user to immediately load another file, such as the image tag -- those tags are perfect for XSS attacks.

One additional problem is that you'll need to make sure that an attacker can't mess up the formatting of the rest of the document. In particular, you want to make sure that any commentary or fragment doesn't look like it's "official" content. One way to do this is to make sure that any XML or HTML commands are properly balanced (anything opened is closed). In XML, this is termed "well-formed" data. If you're accepting standard HTML, you should probably not require this for paragraph markers (<p>), because they're often not balanced.
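The whitelist and the balance check can be combined in one pass. The sketch below uses Python's standard html.parser module; the class name WhitelistChecker, the function name is_safe_html, and the decision to reject every attribute and every stray "<" are assumptions made for illustration, and a production filter would need far more testing than this.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "pre", "br"}
VOID_TAGS = {"br"}       # takes no closing tag
OPTIONAL_CLOSE = {"p"}   # often left unclosed, so not balance-checked


class WhitelistChecker(HTMLParser):
    """Flags any tag outside the whitelist, any attribute, or bad nesting."""

    def __init__(self):
        # Keep character references unconverted so "&lt;" is not seen as "<".
        super().__init__(convert_charrefs=False)
        self.ok = True
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so the check is case-insensitive.
        if tag not in ALLOWED_TAGS or attrs:
            self.ok = False
        elif tag not in VOID_TAGS and tag not in OPTIONAL_CLOSE:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Treat "<br/>" like "<br>"; "<b/>" then fails the balance check.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in OPTIONAL_CLOSE:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.ok = False

    def handle_data(self, data):
        # A stray "<" in text could become a tag once embedded in a page.
        if "<" in data:
            self.ok = False

    def _reject(self, *_args):
        self.ok = False

    # Comments, declarations and processing instructions are not on the list.
    handle_comment = handle_decl = handle_pi = unknown_decl = _reject


def is_safe_html(fragment: str) -> bool:
    """Return True if the fragment uses only whitelisted, balanced tags."""
    checker = WhitelistChecker()
    checker.feed(fragment)
    checker.close()
    return checker.ok and not checker.open_tags
```

For example, is_safe_html("<p>Hello <b>world</b><br>") is accepted, while fragments containing a script tag, an onclick attribute, or an unclosed <b> are rejected. The checker only validates; it does not try to repair or strip bad markup, which keeps the logic simple and avoids guessing at what the attacker meant.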

In many cases you'll want to accept <a> (hyperlink), and for that you'll probably want to require the attribute "href". If you must, you must, but you'll need to validate the URI/URL that you're linking to -- which is our next topic.

URI/URLs

Technically, a hypertext link can be any "uniform resource identifier" (URI), though today most people see only a particular kind of URI called a "Uniform Resource Locator" (URL). Many users will blindly click on a hypertext link to a URI, under the presumption that it won't hurt to display it. Your job, as a developer, is to make sure that this user expectation holds.

Although URIs provide a lot of flexibility, if you're accepting a URI from a potential attacker, you need to check it before passing it on to anyone else. Attackers can slip lots of odd things into URIs that can fool users. For example, attackers can include queries that may cause the user to do undesirable things, and they can fool the user into thinking they're viewing a different site than what they're really viewing.

Unfortunately, it's difficult to give a single pattern that protects users in all situations. However, a mostly safe pattern that prevents most attacks, and yet lets most useful links get through (say on a public Web site), is:

^(http|ftp|https)://[-A-Za-z0-9._/]+$

A pattern that allows some more complex URIs is:

^(http|ftp|https)://[-A-Za-z0-9._]+(\/([A-Za-z0-9\-\_\.\!\~\*\'\%\?]+))*/?$

If your needs are more complex, you'll need more complex patterns for checking the data; see my book (listed in Resources) for some alternatives.
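As a sketch of how the two patterns above might be applied, the following Python code compiles both and checks a candidate link against either. The function names are assumptions made for illustration, and the second pattern is written without the redundant backslashes, which should leave the accepted character set unchanged.

```python
import re

# First pattern from the text: scheme, then a limited set of characters.
SIMPLE_URL_RE = re.compile(r"(http|ftp|https)://[-A-Za-z0-9._/]+")

# Second pattern from the text: a host, then "/"-separated path segments
# drawn from a slightly larger character set, with an optional final "/".
RICHER_URL_RE = re.compile(
    r"(http|ftp|https)://[-A-Za-z0-9._]+"
    r"(/([A-Za-z0-9\-_.!~*'%?]+))*/?"
)


def is_simple_url(url: str) -> bool:
    """Return True if the whole string matches the stricter pattern."""
    return SIMPLE_URL_RE.fullmatch(url) is not None


def is_richer_url(url: str) -> bool:
    """Return True if the whole string matches the more permissive pattern."""
    return RICHER_URL_RE.fullmatch(url) is not None


if __name__ == "__main__":
    for link in ["http://example.com/docs/index.html",
                 "https://example.com/page?x",
                 "javascript:alert(1)"]:
        print(repr(link), is_simple_url(link), is_richer_url(link))
```

Both patterns begin with an explicit list of schemes, so a link such as "javascript:alert(1)" is rejected before its contents are even considered.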

Data Files

Complex data files and data structures are generally made up of a lot of smaller components. Simply break the file or structure down and check each piece. If you depend on certain relationships between the components, check those too. Initially, writing this code can be a little dreary, but it also has a real advantage in reliability: many mysterious problems instantly disappear if you immediately reject bad data.
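To make the "break it down and check each piece" idea concrete, here is a sketch that validates a small, made-up data file. The JSON layout, the field names (name, start, end), and the numeric limits are all hypothetical; the point is only the shape of the code: check each component, then check the relationships between them, and reject the whole file on the first bad record.

```python
import json
import re

NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._\-]*")
MAX_SECONDS = 86_400  # illustrative upper bound for start/end values


def validate_record(rec) -> list[str]:
    """Return a list of problems; an empty list means the record is fine."""
    if not isinstance(rec, dict) or set(rec) != {"name", "start", "end"}:
        return ["record must be an object with exactly name, start, end"]
    problems = []
    # Check each component on its own first.
    if not isinstance(rec["name"], str) or not NAME_RE.fullmatch(rec["name"]):
        problems.append("name has illegal characters")
    for key in ("start", "end"):
        value = rec[key]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= MAX_SECONDS:
            problems.append(f"{key} must be an integer in 0..{MAX_SECONDS}")
    # Then check the relationship the program depends on.
    if not problems and rec["start"] > rec["end"]:
        problems.append("start must not be after end")
    return problems


def load_records(text: str) -> list[dict]:
    """Parse and validate the whole file, raising ValueError on bad data."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("not valid JSON") from exc
    if not isinstance(data, list):
        raise ValueError("top level must be a list")
    for index, rec in enumerate(data):
        problems = validate_record(rec)
        if problems:
            raise ValueError(f"record {index}: " + "; ".join(problems))
    return data
```

Calling load_records on a file whose records all pass returns the parsed list; any single bad field, or a record whose start is after its end, causes the entire file to be rejected with a message naming the record.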

First published by IBM developerWorks