Expectually Useful Apache Rewrite Techniques Explained

September 5, 2010 · 16 comments

by Lars

Apache HTTP Server (usually just Apache) has been the most popular web server on the Internet for more than a decade, with millions of installations serving a huge portion of the World Wide Web. In 2009 it became the first web server to surpass the 100 million web site milestone. As of July 2009, Apache served over 47% of all websites and over 66% of the million busiest. With Apache this popular, it is worth spending some time learning the basics. If you’re an IT professional working with web development, it is almost certain that you will need them, if not now then later.

Apache is developed and maintained by an open community of developers under the auspices of the Apache Software Foundation. The application is available for a wide variety of operating systems, including Unix, GNU, FreeBSD, Linux, Solaris, Novell NetWare, Mac OS X, Microsoft Windows, etc.

Apache supports a variety of features, many implemented as compiled modules (mod_xxx) which extend the core functionality. Through modules Apache supports:

  • Common programming languages like Perl, Python, Tcl, and PHP.
  • Security through popular authentication modules like mod_access, mod_auth, mod_digest, and mod_auth_digest
  • SSL and TLS support (mod_ssl)
  • Proxy using the mod_proxy module (It implements proxying capability for FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (as of Apache 1.3.23) HTTP/1.1.)
  • Custom log files (mod_log_config), and filtering support (mod_include and mod_ext_filter).
  • URL rewriter (also known as a rewrite engine, implemented under mod_rewrite). This is the module we will take a closer look at here.

This article will show you some really useful Apache rewrite techniques and give you an easy-to-digest introduction to Apache basics. You may want to get yourself a cheat sheet before going on: Apache Mod_Rewrite Cheat Sheet


General notes on Apache configuration

Apache configuration for a virtual host can be done by specifying directives in the .htaccess (hypertext access) file. The .htaccess file is placed inside the web tree, and is able to override a subset of the server’s global configuration. Don’t forget that, while directive names are case-insensitive, their arguments (such as file names and patterns) are often case-sensitive. The first examples will be heavy on theory to help you understand what is going on.

Caution!

Although it is a simple configuration file, .htaccess is extremely powerful. Even the slightest syntax error can result in severe server malfunction. It is crucial to make backup copies of everything related to your site (including any original .htaccess files) before following any of the techniques in this article. If you can set up a test site, doing so is highly recommended to avoid production downtime. It is also important to test your entire website thoroughly after making any changes to your .htaccess file, as you may not see a hidden error just by loading the website front page. Use your backup if the situation gets messy!

When to use .htaccess

.htaccess files should only be used when the main server configuration file is inaccessible (relevant for most low cost shared hosting services). .htaccess directives provide directory-level configuration without requiring access to Apache’s main server configuration file (httpd.conf). However, due to performance and security concerns, the main configuration file should always be used for server directives whenever possible. This is most relevant for really high load web sites where every millisecond counts.

If you have access to httpd.conf, it is recommended to define rules that are shared by multiple virtual hosts once and only once in your httpd.conf file. Then simply instruct your target .htaccess file(s) to inherit the httpd.conf rules by including this directive:

RewriteOptions Inherit
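
As a rough sketch of that setup (the directory path, the rule and the file names are placeholders, not taken from any real configuration): in per-directory context mod_rewrite inherits rules from the parent directory’s configuration, so this example assumes the shared rule lives in a <Directory> section of httpd.conf and that the .htaccess file opts in. Treat it as an illustration and test it on your own server before relying on it.

```apache
# httpd.conf: shared rule, defined once (path and rule are hypothetical)
<Directory "/var/www">
    RewriteEngine On
    RewriteRule ^maintenance$ /maintenance.html [L]
</Directory>

# /var/www/site1/.htaccess: opt in to the rules defined above
RewriteEngine On
RewriteOptions Inherit
```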

Prevent Access to .htaccess

Add the following code block to your htaccess file to add an extra layer of security. Any attempts to access the htaccess file will result in a 403 error message. Of course, your first layer of defense to protect htaccess files involves limiting direct file access by setting htaccess file permissions via CHMOD to 644:

# secure htaccess file
<Files .htaccess>
order allow,deny
deny from all
</Files>

You can use this code to prevent access to other files as well by replacing .htaccess in the <Files> line with the name of the file you want to protect.
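
To protect several files in one go, a <FilesMatch> block with a regular expression is one option. The sketch below is only an example: the file names are assumptions, and the two <IfModule> branches are there because Apache 2.4 replaced the Order/Deny syntax used above with Require.

```apache
# protect several sensitive files at once (file names are examples)
<FilesMatch "^(\.htaccess|\.htpasswd|wp-config\.php)$">
    <IfModule mod_authz_core.c>
        # Apache 2.4 and later
        Require all denied
    </IfModule>
    <IfModule !mod_authz_core.c>
        # Apache 2.2 and earlier
        Order allow,deny
        Deny from all
    </IfModule>
</FilesMatch>
```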

Commenting .htaccess Code

Comments are essential to maintaining control over any involved portion of code. Comments in .htaccess code are fashioned on a per-line basis, with each line of comments beginning with a pound sign #.
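
A small illustration of the per-line rule. Note that Apache does not support a comment on the same line as a directive, so each comment gets a line of its own:

```apache
# Comments occupy their own line and start with a pound sign.
RewriteEngine On

# Avoid trailing comments: a "#" after a directive on the same line
# is not treated as a comment and can cause a configuration error.
# RewriteEngine On   # not like this
```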

Mod_Rewrite

A rewrite engine is software that modifies a web URL’s appearance (URL rewriting).

Some of the benefits of a rewrite engine are:

  • Making website URLs more descriptive to improve user-friendliness and search engine optimization (ex. adding tags and category to urls)
  • Preventing undesired “inline linking” (also known as hotlinking, direct linking, offsite image grabs and bandwidth theft). If you’re a blogger and you realise that some of your posts are copied 1:1 on other sites, this may be what you need. Preventing images from loading, or replacing them with an image stating “unauthorized image grab”, will make your posts less attractive to copy.
  • Not exposing the inner workings of a web site’s address to visitors
  • The URLs of pages on the web site can be kept even if the underlying technology used to serve them is changed (ex. changing to a new blogging platform or a new permalink structure)

Turn Mod_Rewrite On

Mod_rewrite is enabled and configured using directives in the .htaccess file for your virtual host. Place the following code at the beginning of your .htaccess file to turn mod_rewrite on:

RewriteEngine on

The RewriteRule Directive

The RewriteRule directive is the one doing the work, and it can occur more than once. Each directive then defines one single rewriting rule. The definition order of these rules is important, because this order is used when applying the rules at run-time.

The basic format for a mod_rewrite command is:

RewriteRule Pattern Substitution [Flag(s)]

Pattern is a regular expression which is matched against the current URL path. In a .htaccess file, that path is relative to the directory in which the .htaccess file is placed, so it has no leading slash. A relative Substitution is resolved against the same directory.

Substitution is the string which is substituted for (or replaces) the original URL for which Pattern matched.

More details on patterns, substitutions and available flags here. A few Regex hints included below:
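
The handful of regular expression tokens used throughout this article can be summarised as comments, followed by one illustrative rule. The file name article.php and the URL layout are made up for the example.

```apache
# ^        start of the path          $      end of the path
# .        any single character       \.     a literal dot
# *        zero or more of previous   +      one or more of previous
# ?        zero or one of previous    [0-9]  any one digit
# (a|b)    a or b; parentheses also capture the match as $1, $2, ...

# Example: /article/123 is served internally by /article.php?id=123
RewriteRule ^article/([0-9]+)$ /article.php?id=$1 [L]
```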

A Basic Redirect example

If you just want to create a simple 301 redirect from one URL to another, then use the following code:

RewriteRule ^therequestedfile\.html$ /theredirecttofile.html [R=301,L]

This is a very basic rule that means any request for therequestedfile.html will be redirected to theredirecttofile.html. The [R=301] flag is what makes it a permanent redirect; without it the rule only rewrites the request internally and the visitor’s address bar does not change. If one or both of the files are located in subdirectories, the directory is simply added to the pattern and substitution strings.
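
For files in subdirectories, a minimal sketch could look like this. It assumes the .htaccess file sits in the document root, and both paths are placeholders:

```apache
RewriteEngine On
# old/page.html and new/section/page.html are placeholder paths
RewriteRule ^old/page\.html$ /new/section/page.html [R=301,L]
```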

no “www” redirect (class B)

Class B means that all of the traffic to http://www.yourdomain.com is politely and silently redirected to http://yourdomain.com. From an SEO perspective it is generally a good idea not to allow both www and no-www, as incoming traffic and links will be split between the two. You may just as well decide to redirect all http://yourdomain.com traffic to http://www.yourdomain.com, which is currently the most common approach as far as I know. The rule below can easily be adjusted to do that.

<IfModule mod_rewrite.c>
RewriteEngine On
#no-www class B redirect rule
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [R=301,L]
</IfModule>
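
Adjusted for the opposite direction (no-www to www), the same rule might look like this sketch, with yourdomain.com again standing in for your own domain:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# redirect yourdomain.com to www.yourdomain.com
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
</IfModule>
```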

[NC] is a flag for no case checking. [L] or Last means stop if matched. Interesting reading on [L] here. I have listed available flags at the bottom of the article. If you are using WordPress you should use the following code.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
#no-www class B redirect rule
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [R=301,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

We are using RewriteCond and will just give a brief introduction below. A back-reference to a server variable is used to get access to the domain name requested (RewriteCond %{HTTP_HOST}).

The RewriteCond directive defines a rule condition. Precede a RewriteRule directive with one or more RewriteCond directives.

RewriteCond TestString CondPattern [Flag(s)]

TestString is a string which can contain the following expanded constructs in addition to plain text:

  • RewriteRule backreferences: These are back-references of the form $N (0 <= N <= 9) which provide access to the grouped parts (parentheses!) of the pattern from the corresponding RewriteRule directive (the one following the current bunch of RewriteCond directives).
  • RewriteCond backreferences: These are back-references of the form %N (1 <= N <= 9) which provide access to the grouped parts (parentheses!) of the pattern from the last matched RewriteCond directive in the current bunch of conditions.
  • RewriteMap expansions: These are expansions of the form ${mapname:key|default}
  • Server-Variables: These are variables of the form %{ NAME_OF_VARIABLE }

CondPattern is a standard Extended Regular Expression with some additions. You can prefix the pattern string with a '!' character (exclamation mark) to specify a non-matching pattern, meaning the condition is true when the pattern does not match.

More details on RewriteCond here.
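
To see the two kinds of back-references side by side, here is a small sketch of a generic “remove www” rule that works without hard-coding the domain. %1 comes from the group in the RewriteCond pattern, while $1 comes from the group in the RewriteRule pattern:

```apache
RewriteEngine On
# %1 = host name without the leading "www." (captured by the condition)
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# $1 = requested path (captured by the rule pattern)
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
```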

Redirect Multiple Domains to a Single Domain

If you have multiple domains pointing to your site, it’s possible you could take a hit in the search engines for having duplicate content. Use the following code to redirect visitors from the extra host names to just one (yourotherdomain.com stands in for your second domain):

RewriteCond %{HTTP_HOST} ^www\.yourotherdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^yourotherdomain\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [R=301,L]

Block a Specific IP Address

If you want to block someone coming from a specific IP address from accessing your website, you can use the following code:

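<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^(A\.B\.C\.D)$
# leave the notice page itself alone, otherwise the redirect loops
RewriteCond %{REQUEST_URI} !^/ip-blocked\.html$
RewriteRule .* http://www.yourdomain.com/ip-blocked.html [R,L]
</IfModule>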

Replace the A\.B\.C\.D with the IP address you want to block (don’t forget to leave the “\” before each dot, which escapes the character).
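
If you would rather return a plain 403 than show a page, or need to block more than one address, a sketch along these lines may help. The addresses are from the ranges reserved for documentation and are only placeholders:

```apache
RewriteEngine On
# one single address ...
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$ [OR]
# ... or every address starting with 198.51.100.
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.
# "-" means no substitution; [F] answers with 403 Forbidden
RewriteRule .* - [F]
```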

User Agent Redirect

To rewrite the Homepage of a site according to the User-Agent: header of the request, you can use the following:

<IfModule mod_rewrite.c>
RewriteEngine On
# in .htaccess the leading slash is stripped, so the homepage is matched by ^$
RewriteCond  %{HTTP_USER_AGENT}  ^Mozilla.*
RewriteRule  ^$                 /homepage.max.html  [L]
RewriteCond  %{HTTP_USER_AGENT}  ^Lynx.*
RewriteRule  ^$                 /homepage.min.html  [L]
RewriteRule  ^$                 /homepage.std.html  [L]
</IfModule>

Block Specific User Agents

You can block specific user agents by testing the %{HTTP_USER_AGENT} server variable with this code. Just replace “UserAgent” with a pattern matching the user agent you want to block. Blocking something as broad as ^Mozilla.* is probably not a good idea, but the technique can be used to keep the door closed for spambots, unwanted crawlers etc.

RewriteCond %{HTTP_USER_AGENT} UserAgent
RewriteRule .* - [F,L]

The F in [F,L] forbids the request and returns a 403 server response. You can also block more than one user agent at a time by using the [OR] flag. I have listed available flags at the bottom of the article.

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} UserAgent1 [OR]
RewriteCond %{HTTP_USER_AGENT} UserAgent2
RewriteRule .* - [F,L]
</IfModule>
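
When the list grows, the conditions can also be folded into a single case-insensitive pattern. This is just a sketch, and BadBot and EvilScraper are invented names:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# (a|b) matches either name; [NC] ignores upper/lower case
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>
```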

Set up a Default Image

In case your site has broken images, a default image rewrite can make your site look more professional. Use the following code to serve a default image for any image whose file cannot be found.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^images/.*\.jpg$ /images/imagereplaced.jpg [L]

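The -f test checks whether the requested path is a regular file, and the leading ! negates it, so the rule only applies when the file does not exist. You need to change the “/images/imagereplaced.jpg” bit to the image file you’re planning to use.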
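
A slightly extended sketch of the same idea covers more image types and guards against the case where the replacement image itself is missing. The extension list and the extra condition are my own additions:

```apache
RewriteEngine On
# only act when the requested file does not exist
RewriteCond %{REQUEST_FILENAME} !-f
# never rewrite the replacement image itself (avoids a loop if it is missing)
RewriteCond %{REQUEST_URI} !^/images/imagereplaced\.jpg$
RewriteRule ^images/.*\.(gif|jpe?g|png)$ /images/imagereplaced.jpg [NC,L]
```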

Prevent Hotlinking, Serve alternate content

The last thing most website owners want is other sites stealing their content or worse—hotlinking to their images and stealing their bandwidth. This code will help you protect all files of the types included in the last line against hotlinking (add more types as needed):

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
RewriteRule .*\.(gif|jpg|png)$ http://www.yourdomain.com/warning.jpg [R,NC,L]
</IfModule>

# alternative to the rule above: send hotlinked media to /feed/ instead
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg|swf|flv|png)$ /feed/ [R=302,L]

Make sure you change “yourdomain.com” to your own domain name and warning.jpg to the image you will show as alternate content. You may also decide to show an error page with this code.

# serve a standard 403 forbidden error page
RewriteRule .*\.(gif|jpg|png)$ - [F,L]

To grant linking permission to a site other than yours, insert this code block after the line containing the “yourdomain.com” string.

# allow linking from the following site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?externaldomain\.com/.*$ [NC]
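
Put together, the complete block with the extra permission might look like the sketch below. The condition that excludes warning.jpg is an assumption of mine, added so that the warning image is not itself caught by the rule:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# allow requests that send no referrer at all
RewriteCond %{HTTP_REFERER} !^$
# allow your own site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/.*$ [NC]
# allow linking from the following site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?externaldomain\.com/.*$ [NC]
# do not redirect the warning image itself
RewriteCond %{REQUEST_URI} !/warning\.jpg$
RewriteRule \.(gif|jpg|png)$ http://www.yourdomain.com/warning.jpg [R,NC,L]
</IfModule>
```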

Strip Query Strings

Suppose the pages on your site, other than the home page, are addressed with query strings instead of page names, like this:

http://www.yourdomain.com/somepage.html?querystringparam=somedata

Those aren’t very pretty, and on top of that, search engines will show a bunch of duplicated “home” pages. If you want to get rid of the query string in your page URLs, use the following code:

RewriteCond %{QUERY_STRING} querystringparam=
RewriteRule (.*) http://www.yourdomain.com/$1? [R=301,L]
RewriteCond %{QUERY_STRING} example=
RewriteRule (.*) http://www.yourdomain.com/$1? [R=301,L]

This way you will get rid of the query string and the preceding question mark.
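
On Apache 2.4 and later, the QSD (query string discard) flag does the same job as the trailing question mark. A minimal sketch, assuming the .htaccess file sits in the document root and you want to drop every query string:

```apache
RewriteEngine On
# only act when a query string is present, otherwise the redirect would loop
RewriteCond %{QUERY_STRING} .
# QSD drops the query string from the redirect target (Apache 2.4+)
RewriteRule ^(.*)$ /$1 [QSD,R=301,L]
```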

Rewrites and https

If you are using https, then you may need to make some changes to your .htaccess code. If you are doing rewrites that use the full URL (for example http://yourdomain.com/page.htm) and you will be switching between http and https, then you may do this by adding a RewriteCond to your code. You will need to test for https and code the full URL accordingly (using http or https). The following examples show how to test for https or the lack of https.

RewriteCond %{HTTPS} on
RewriteCond %{HTTPS} !=on
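
As an illustration of coding the full URL according to the scheme, the sketch below redirects one page and keeps the visitor on whichever scheme they arrived with. The page names old-page.html and new-page.html are made up:

```apache
RewriteEngine On
# visitor arrived over https: keep https in the full URL
RewriteCond %{HTTPS} on
RewriteRule ^old-page\.html$ https://%{HTTP_HOST}/new-page.html [R=301,L]
# visitor arrived over plain http: keep http
RewriteCond %{HTTPS} !=on
RewriteRule ^old-page\.html$ http://%{HTTP_HOST}/new-page.html [R=301,L]
```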

For example, if you want to force all requests to your site to use https, you could use the following:

RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R,L]
# once on https, collapse the www host onto the bare domain
RewriteCond %{HTTP_HOST} ^www\.domain\.net$ [NC]
RewriteRule ^(.*)$ https://domain.net/$1 [R=301,L]

Flags:

[F] Forbidden: instructs the server to return a 403 Forbidden to the client.
[L] Last rule: instructs the server to stop processing further rules once this rule has been applied.
[N] Next: instructs Apache to restart the rewriting process from the first rule, using the result of the rules so far as input.
[G] Gone: instructs the server to return a 410 Gone (no longer exists) status.
[P] Proxy: instructs the server to hand the request to mod_proxy.
[C] Chain: chains the current rule with the next rule; if the current rule does not match, the chained rules that follow are skipped.
[R] Redirect: instructs Apache to issue a redirect, causing the browser to request the rewritten/modified URL. The default status is 302; use [R=301] for a permanent redirect.
[NC] No Case: makes the pattern match case-insensitive.
[PT] Pass Through: instructs mod_rewrite to pass the rewritten URL back to Apache for further processing.
[OR] Or: combines a condition with the next one by a logical “or”, so that either one proving true will cause the associated rule to be applied.
[NE] No Escape: prevents mod_rewrite from escaping special characters (such as % or #) in the rewritten URL.
[NS] No Subrequest: instructs the server to skip the rule if the request is an internal sub-request.
[QSA] Query String Append: appends the original query string to the query string of the substitution instead of replacing it.
[S=x] Skip: instructs the server to skip the next “x” rules if the current rule matches.
[E=variable:value] Environment Variable: instructs the server to set the environment variable “variable” to “value”.
[T=MIME-type] Mime Type: declares the mime type of the target resource.
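
A couple of the flags in action, as a sketch only. The URLs and product.php are invented for the example:

```apache
RewriteEngine On
# [G]: answer requests for a retired page with 410 Gone
RewriteRule ^retired-page\.html$ - [G]
# [NC,QSA,L]: match case-insensitively, keep the original query string, stop
# e.g. /Products/42?color=red is served by /product.php?id=42&color=red
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [NC,QSA,L]
```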

{ 5 comments… read them below or add one }

louis vuitton June 23, 2010 at 7:11 am

It is run once on startup of Apache receives the requested URLs on STDIN and has to put the resulting (usually rewritten) URL on STDOUT (same order!). Remove Http Referrer

haldun October 19, 2009 at 4:28 pm

Hi, thanks for the great information.
…, but I need help to rewrite a Folder name in the URL.
if you access URL http://mysite.com/myfolder/
(Server path: /httpdocs/myfolder/…)

to view as http://mysite.com/newfolder/…..
(Server path(same): /httpdocs/myfolder/…)

Can somebody help to solve this issues

Thanks

anon September 15, 2009 at 10:50 am

FYI – “expectually” is not a word.

dj münchen August 27, 2009 at 9:26 am

Learned something new again. Great!
