Again, with strings you need to identify what is legal and reject
any other string. Often the easiest tool for specifying legal strings
is a regular expression: just write a pattern that describes which
string values are legal, and throw away data that doesn't match the
pattern. For example, ^[A-Za-z0-9]+$
specifies that the string must be at least one character long and that
it can only include upper-case letters, lower-case letters, and the
digits 0 through 9 (in any order). You can use regular expressions to
limit which characters are allowed and to be more specific (for
example, you can often limit even further what the first character can
be). Just about all languages have libraries that implement regular
expressions; Perl builds regular expressions into the language, and for C,
the regcomp(3) and regexec(3) functions are part of the POSIX.2 standard and are widely available.
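As a minimal sketch of this approach in Python, the check for "one or more letters or digits" might look like the following. The function names and the stricter second pattern (first character must be a letter) are illustrative assumptions, not part of any library. The sketch uses re.fullmatch(), which only succeeds when the pattern covers the entire string.

```python
import re

# Allow-list patterns: anything that does not match is rejected.
# Both names below are hypothetical, chosen for this illustration.
LEGAL_ID = re.compile(r"[A-Za-z0-9]+")            # one or more letters or digits
LEGAL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # stricter: must start with a letter


def is_legal_id(value: str) -> bool:
    """Return True only if the whole string is letters and digits."""
    # fullmatch() requires the match to span the entire string,
    # so stray characters before or after legal text cause rejection.
    return LEGAL_ID.fullmatch(value) is not None


def is_legal_name(value: str) -> bool:
    """Return True only if the string starts with a letter, then letters or digits."""
    return LEGAL_NAME.fullmatch(value) is not None
```

With these definitions, is_legal_id("abc123") is accepted, while the empty string, "abc 123", and "abc\n" are all rejected.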
If you use regular expressions, be sure to indicate that you want to match the beginning (usually symbolized by ^) and end (usually symbolized by $) of the data in your match. If you forget to include ^ or $, an attacker could include legal text inside their attack to bypass your check. If you're using Perl and you use its multi-line option (m), watch out: you must use \A for the beginning and \Z for the end instead, because the multi-line option changes the meaning of ^ and $.
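The effect of missing or weakened anchors can be seen in a short Python sketch. Python is used here only for illustration (the text above describes Perl), and the attack string is a made-up example. In Python's re module, the MULTILINE flag plays the role of the multi-line option, and \A and \Z mark the start and end of the whole string.

```python
import re

# Hypothetical hostile input: a legal-looking first line, then something else.
attack = "ok\n; rm -rf /"

# No anchors: search() is satisfied by any legal fragment inside the input.
unanchored = re.search(r"[A-Za-z0-9]+", attack)

# ^ and $ in multi-line mode match at every line boundary,
# so the first line alone is enough to pass the check.
multiline = re.search(r"^[A-Za-z0-9]+$", attack, re.MULTILINE)

# \A and \Z refer to the start and end of the whole string,
# regardless of the multi-line flag, so the attack is rejected.
strict = re.search(r"\A[A-Za-z0-9]+\Z", attack, re.MULTILINE)

print(unanchored is not None)  # the unanchored check is bypassed
print(multiline is not None)   # the multi-line check is bypassed
print(strict is not None)      # the strict check rejects the input
```

The first two checks accept the hostile input because they only need some legal text somewhere; the last one rejects it because the pattern must account for the whole string.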
The biggest problem is figuring out exactly what should be legal in the string. In general, you should be as restrictive as possible. A large number of characters can cause special problems; where possible, you don't want to allow characters that have a special meaning to the program internals or the eventual output. That turns out to be really difficult, because so many characters can cause problems in some cases.
Here is a partial list of the kinds of characters that often cause trouble:
- Normal control characters (characters with values less than 32): This especially includes character 0, traditionally called NUL; I call it NIL to distinguish it from C's NULL pointer. NIL marks the end of strings in C; even if you don't use C directly, many libraries call C routines indirectly and can get confused if given NIL. Another problem is line-ending characters, which can be interpreted as command endings. Unfortunately, there are several line-ending encodings: UNIX-based systems use the linefeed character (0x0a), but DOS-based systems (including Windows) use the CP/M marking carriage-return linefeed (0x0d 0x0a), the classic Apple Mac OS (before Mac OS X) uses carriage return (0x0d), many IBM mainframes (like OS/390) use next line (0x85), and some programs even (incorrectly) use the reverse CP/M marking (0x0a 0x0d).
- Characters with values higher than 127: These are used for international characters, but the problem is that they can have many possible meanings, and you need to make sure that they're properly interpreted. Often these are bytes of UTF-8 encoded characters, and UTF-8 has its own complications; see the UTF-8 discussion later in this article.
- Metacharacters: Metacharacters are characters that have special meanings to programs or libraries you depend on, such as the command shell or SQL.
- Characters that have a special meaning in your program: For instance, characters used as delimiters. Many programs store data in text files and separate the data fields with commas, tabs, or colons; you'll need to reject or encode user data containing those characters. Today, a common problem is the less-than sign (<), because XML and HTML use it to begin tags.
This isn't an exhaustive list, and you often must accept some of these characters. Later articles will discuss how to deal with these characters, if you must accept them. The point of this list is to convince you to try to accept as few characters as possible, and to think carefully before accepting another. The fewer characters you accept, the more difficult you make it for an attacker.
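To make the categories above concrete, here is a rough Python sketch that reports which kinds of troublesome characters a string contains. The function name, the category labels, and the default delimiter set are all assumptions made for this illustration; a real program would choose its delimiters and metacharacters based on the formats and tools it actually uses.

```python
def troublesome_characters(value: str, delimiters: str = ",:<") -> dict:
    """Group the characters of `value` into a few of the problem categories.

    `delimiters` is a hypothetical, application-specific set of characters
    that have special meaning to the program or its output.
    """
    found = {"control": [], "non_ascii": [], "delimiter": []}
    for ch in value:
        code = ord(ch)
        if code < 32:
            # Control characters, including NUL (0) and line endings.
            found["control"].append(ch)
        elif code > 127:
            # Characters outside 7-bit ASCII.
            found["non_ascii"].append(ch)
        if ch in delimiters:
            # Characters the application itself treats as special.
            found["delimiter"].append(ch)
    return found


report = troublesome_characters("a,b\x00c\u00e9")
print(report)
```

For the sample input, the report lists the NUL character under "control", the accented letter under "non_ascii", and the comma under "delimiter". A report like this is useful for diagnostics; the validation itself should still be an allow-list that rejects anything not explicitly legal.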