No, this post isn't about converting you to Christianity (but you can read all about that in my book "Coding with Jesus"). This post is about string escaping, and what it means.
Character escaping is what you do when you move a string from one environment to the next. For example, when you're writing a C program, you must escape certain characters such as line breaks and quotes. To output the text "hello world" (with quotes) you end up writing
printf("\"hello world\"");
Your compiler sees the \" and interprets it as just a regular ". You need to escape the " because if you don't, the compiler will think it's something else... in this case it will assume the string has ended.
The same goes for PHP -> HTML, PHP -> JS (JSON), JS -> HTML (innerHTML), PHP -> XML, plain text -> RTF, binary -> MySQL, etc, etc. In the example above we were going from plain text to C. I'm focusing on PHP in this post, I just opened with a C example because I didn't want to look like a PHP developer.
php, html, and urls
Explaining proper escaping is difficult, especially when it comes to web development because in most cases you're going to be dealing with four or five different environments... PHP, HTML, Javascript, MySQL, and sometimes XML. Your data starts in HTML, it will make its way to PHP, then to MySQL, then back to PHP, then to HTML again (or Javascript). To add to the confusion, URL components require their own escaping which makes them a sixth environment.
Your escaping needs to be perfect or you're going to run into problems. Escape too often and something like "<3" will turn into "&lt;3", escape too little and you've opened your site up to a huge security hole, escape with the wrong encoding and all your spaces suddenly become %20's. To get your escaping right you need to think of it like a stack. Every time you output a string to another environment, that environment will pop its escaping off the stack. If its escaping isn't on the stack it's either going to fail, or display some ugly data. Going by that rule, before you give your data to another environment you need to push the right escaping onto it.
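Here's a rough sketch of the "escape too often" case. The escape_html helper is a made-up name, and html_entity_decode is only standing in for the browser's pop, so treat this as an illustration rather than a recipe.

```php
<?php
// Hypothetical helper: push HTML escaping onto a raw string.
function escape_html(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

$raw = 'I <3 PHP';

$once  = escape_html($raw);   // one push: a single pop restores $raw
$twice = escape_html($once);  // two pushes: a single pop leaves "&lt;3" on screen

echo $once, "\n";
echo $twice, "\n";

// html_entity_decode plays the part of the browser popping HTML escaping once.
var_dump(html_entity_decode($once, ENT_QUOTES, 'UTF-8') === $raw);
var_dump(html_entity_decode($twice, ENT_QUOTES, 'UTF-8') === $once);
```

One push, one pop, and the user sees what you meant. Two pushes, one pop, and they see the leftovers.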
In PHP, before outputting data to HTML, the proper way to escape it is with htmlspecialchars. So instead of doing
echo $foo;
You'd end up doing
echo htmlspecialchars($foo);
Now... the problem gets trickier when you try to do other things with that data. Say you want a URL in the form "index.php?param=$foo". Since an <a> tag is HTML we can just escape it like regular HTML, right? Nope! In this case you need to use urlencode to escape it as a URL component. If not, and $foo were equal to "bar&bar=foo", our URL would turn out to be "index.php?param=bar&bar=foo", which smuggles in a second parameter. With urlencode, this URL would be "index.php?param=bar%26bar%3Dfoo". This URL would get passed to the browser, the user would click it and send a request to your web server. At this point the web server will pop its escaping off the escape stack and bar%26bar%3Dfoo will magically turn back into bar&bar=foo for your PHP to use.
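Going back to the stack, the URL should also have HTML escaping popped when the browser reads it out of your HTML output... and it does. Luckily urlencode only ever outputs letters, digits, and the characters - _ . % +, none of which mean anything to HTML, so a urlencoded parameter comes through the browser's HTML unescaping unchanged. That makes escaping in this case fairly easy. (The & that separates multiple parameters isn't so lucky: it still wants htmlspecialchars.)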
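To make the two pushes concrete, here's a minimal sketch of a link builder. link_href is a hypothetical helper, the page parameter is invented for the demo, and urldecode is only standing in for the web server's pop.

```php
<?php
// Hypothetical helper: build an href for index.php from raw parameter values.
function link_href(array $params): string
{
    $pairs = [];
    foreach ($params as $name => $value) {
        // Push 1: URL-component escaping on each name and value.
        $pairs[] = urlencode((string) $name) . '=' . urlencode((string) $value);
    }
    // Push 2: HTML escaping, because the URL is headed for an attribute.
    return htmlspecialchars('index.php?' . implode('&', $pairs), ENT_QUOTES, 'UTF-8');
}

$foo = 'bar&bar=foo';

echo '<a href="' . link_href(['param' => $foo]) . '">one parameter</a>', "\n";
echo '<a href="' . link_href(['param' => $foo, 'page' => '2']) . '">two parameters</a>', "\n";

// urldecode plays the part of the web server popping URL escaping.
var_dump(urldecode('bar%26bar%3Dfoo') === $foo);
```

With a single parameter the second push changes nothing, which is the lucky case. Add a second parameter and the separator comes out as &amp;, which the browser pops back to & before it sends the request.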
mysql
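When the browser POSTs data to your server, by the time it gets to PHP your escaping stack should be empty. So before you send your data off in a MySQL query you need to use mysql_real_escape_string on it (addslashes gets used for this too, but it knows nothing about the connection's character set, so it isn't a safe substitute). On PHP 7 and later, where the old mysql extension no longer exists, mysqli_real_escape_string or a prepared statement does the same job. When it gets to MySQL, it'll pop your escaping off and store the exact data you gave it. That is... unless you have magic_quotes enabled.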
magic_quotes is a nightmarish PHP "feature" that is enabled almost by default on older versions of PHP (5.4 finally removed it). Essentially what it does is escape all data it gives you with addslashes. This assumes that all data passed to your script is going into a database, which is often not the case. If you were to go to foo.php?data='hello' and...
echo htmlspecialchars($_GET['data']);
The user would see \'hello\'. This is because PHP took the liberty of pushing MySQL escaping onto your stack, and by the time it got to the user they were left with the responsibility of interpreting addslashes.
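Since modern PHP no longer has magic_quotes, the sketch below fakes it with addslashes to show the extra layer and how stripslashes pops it back off. The variable names are made up for the demo.

```php
<?php
// addslashes stands in for what magic_quotes used to do to request data.
$sent     = "'hello'";           // what the user actually put in the URL
$received = addslashes($sent);   // what a magic_quotes script would be handed

// Escaping for HTML without popping the extra layer first:
echo htmlspecialchars($received, ENT_QUOTES, 'UTF-8'), "\n";  // browser renders \'hello\'

// Pop the unwanted layer and the stack is empty again.
$clean = stripslashes($received);
echo htmlspecialchars($clean, ENT_QUOTES, 'UTF-8'), "\n";     // browser renders 'hello'

var_dump($clean === $sent);
```

The point is that the slashes were pushed for MySQL, so nothing on the way to the browser was ever going to pop them.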
outputting javascript
Outputting Javascript falls into the same boat as URLs in that it'll first be interpreted as HTML and then as JS (unless you're creating a JSON request, in which case it's only interpreted as JS). If I have a string $foo='"hello"</script>'; and I want to alert that exact string on the user's side, I will be entering Escaping Hell. Say I simply write
echo '<script>alert("' . $foo . '")</script>';
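My code won't execute as expected. That's because of the </script>. The HTML parser gets first crack at our output, and the moment it encounters a </script> it ends JS interpretation, even in the middle of a JS string. So we then change our code to
echo '<script>alert("' . htmlspecialchars($foo) . '")</script>';
The </script> is gone, but now we have a different problem: in an ordinary HTML page the browser doesn't pop HTML escaping inside a <script> block, so the user gets an alert full of &quot; and &lt;. HTML escaping only gets popped when the JS sits in an attribute like onclick (or when the page is parsed as real XHTML). In that case the browser pops HTML escaping, then goes on to pop JS escaping, which we haven't pushed. The alert call ends up as alert(""hello"</script>") which doesn't make any sense. What we have to do is throw in an addslashes before the htmlspecialchars (and thus the idea of a stack).
echo '<a href="#" onclick="alert(&quot;' . htmlspecialchars(addslashes($foo)) . '&quot;)">say hello</a>';
So our code basically pushes JS escaping, pushes HTML escaping and then the user's browser pops HTML escaping, then JS escaping. Back in the <script> block there's only one layer to pop, so there's only one to push: JS escaping that also keeps </script> out of the output, which json_encode with the JSON_HEX_TAG flag takes care of. Simple, no?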
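One caveat worth sketching: in a page served as plain HTML, browsers don't decode entities inside a <script> block, so the push-JS-then-push-HTML recipe really applies to JS sitting in an attribute such as onclick. The two helpers below are hypothetical names, and json_encode is swapped in for addslashes because it also copes with newlines and other control characters. This assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: a JS string literal for use inside an HTML <script> block.
function js_for_script_block(string $raw): string
{
    // One layer only. The JSON_HEX_* flags write < > & ' " as \uXXXX escapes,
    // so a literal "</script>" can never show up in the output.
    return json_encode($raw, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
}

// Hypothetical helper: a JS string literal for use inside an attribute like onclick.
function js_for_attribute(string $raw): string
{
    // Two layers: push JS escaping, then push HTML escaping on top of it.
    return htmlspecialchars(json_encode($raw), ENT_QUOTES, 'UTF-8');
}

$foo = '"hello"</script>';

// json_encode supplies the surrounding quotes itself.
echo '<script>alert(' . js_for_script_block($foo) . ')</script>', "\n";
echo '<a href="#" onclick="alert(' . js_for_attribute($foo) . ')">say hello</a>', "\n";
```

Either way the rule is the same: count the layers that will be popped on the way to the JS engine, and push exactly that many, in reverse order.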
how you can avoid nasty escaping errors (if you only read one section, make this it)
So what's the best way to avoid making a character escaping error? You should keep your "stack" empty at all times. Escape data the moment it changes environments and not a line sooner. The longer you handle escaped data internally, the more chances you'll have to accidentally escape it again. It will also become unclear exactly where and when data was escaped. You could end up with a string that wasn't escaped at all because it was supposed to be escaped in another logic branch. If you htmlspecialchars a string as soon as you get it from your database (or worse, htmlspecialchars it before it goes into the database) you're going to run into maintainability issues with your code. What's to say that later down the road you won't want to use that data in a URL, or in a Javascript call, or in an RSS feed?
By passing around escaped strings your data has lost meaning. No longer are you handling data, you're handling another language's interpretation of that data. It then becomes more and more likely that what gets to your user is not the data you had intended, but something else horribly different.
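As a rough sketch of what "escape at the boundary" can look like, the code below keeps one raw string and escapes it separately for each place it's output. The for_html, for_url and for_js helpers and the search.php URL are invented for illustration.

```php
<?php
// Hypothetical output helpers: each pushes exactly the escaping its target will pop.
function for_html(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

function for_url(string $raw): string
{
    return urlencode($raw);
}

function for_js(string $raw): string
{
    return json_encode($raw, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
}

// Raw data, exactly as it would sit in the database.
$name = 'Tom & "Jerry" <3';

// The same raw value, escaped at the last moment for three different targets.
echo '<p>' . for_html($name) . '</p>', "\n";
echo '<a href="search.php?q=' . for_url($name) . '">search</a>', "\n";
echo '<script>var name = ' . for_js($name) . ';</script>', "\n";
```

Because $name is never stored in escaped form, adding a fourth output later (an RSS feed, say) only means writing one more helper.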