hostorials.com

Web Server Log Management: Format, Rotation, and Analysis


Logs generated by your Web server can provide valuable information about who is looking at your Web site and what they are looking at. If properly managed, these logs can better inform a Webmaster about what is and is not working on a Web site. It is the phrase “properly managed” that is key, however. If logs are not handled properly, you can throw away important data and, worse still, possibly interfere with the operation of your server.

The issues and principles that follow apply to any combination of Web server and operating system, but to keep the discussion manageable, I will talk specifically about tools that work with the Apache Web server under the Linux operating system.

Before we get too deep into log management, let’s talk about what a Web server log file is. Every time a Web browser connects to the Web server, the server, if properly configured, will attempt to log the event. The level of detail and the content of the log entry depend on how the server is configured and on whether the file that the browser requested exists. Apache comes pre-configured with a few standard logging styles, and it also includes a configuration option that lets you control exactly what information is logged. I have included a link at the end of this article to more information on all the configuration options for log formats.

There are two types of logs: error logs and access logs. Error logs record any errors that occur when the Web server tries to serve content to a browser. Examples include 404 (file not found) errors, clients requesting files that they do not have permission to access, and errors from CGIs that were unable to run properly. The error log is important for detecting mistakes, problems with HTML links, or possible security attacks on the server. Access logs record the requests that the Web server handles, along with the status code returned for each one.

There are two standard formats for access logs: common and combined. Most Web servers use the common log format as the standard log format. A common log entry looks like this:

127.0.0.1 - dirk [02/Nov/2001:11:00:36 -0700] "GET /index.html HTTP/1.0" 200 1500

This line tells you that “dirk,” an authenticated user connecting from 127.0.0.1, requested the index.html file at 11:00 on Nov 2nd, 2001. The last two numbers are the HTTP status code (200, meaning success) and the size of the response in bytes (1500). This may not seem like much information, but with some of the tools I will discuss later, simple pieces of information like this will help you learn a lot about your users.

The combined log format is an extension of the common format. A combined format log entry looks like this:

127.0.0.1 - dirk [02/Nov/2001:11:00:36 -0700] "GET /index.html HTTP/1.0" 200 1500 "http://www.rackspace.com/sales/index.html" "Mozilla/4.08 [en] (Win98; I ; Nav)"

As you can see, the first part of the log entry is the same as the common format. Two additional pieces of information are logged: the referrer and the user-agent, or browser. The referrer is just the previous page that the user was on, the one that sent them to this page. The referrer can provide information about whether the user had the page bookmarked or if they came to the page from a banner ad. The browser section helps you determine what browsers your visitors use. This can be helpful when you are trying to decide whether or not it is wise to use flashier Web techniques, such as Flash or Windows-only Web plug-ins that may not be supported by certain browsers.
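
To make the two formats concrete, here is a minimal Python sketch that splits either kind of entry into named fields. The regular expression and the parse_line helper are my own illustration, not part of Apache, and they assume well-formed lines with no escaped quotes inside the quoted fields.

```python
import re

# Common log format fields, optionally followed by the two extra
# combined-format fields (referrer and user-agent).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_line(line):
    """Return a dict of fields for one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


common = '127.0.0.1 - dirk [02/Nov/2001:11:00:36 -0700] "GET /index.html HTTP/1.0" 200 1500'
combined = (common
            + ' "http://www.rackspace.com/sales/index.html"'
            + ' "Mozilla/4.08 [en] (Win98; I ; Nav)"')

entry = parse_line(combined)
print(entry["user"], entry["request"], entry["status"], entry["referrer"])
```

For a common-format line the referrer and agent fields simply come back as None, so the same helper can be pointed at either kind of log.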

Give a Log an Inch and It Will Take Your Hard Drive

Now that you know Apache’s logging capabilities, you are probably ready to run off and turn on as much logging as possible. More data is always better than less, right? Before we get into how to turn this data into something more useful than a text file full of log entries, you need to learn about log rotation.

I have already given examples of log entries, and they seem pretty harmless, don’t they? The log entry above in the common format takes up about 83 bytes on disk, and the combined entry takes up about 164 bytes. These are not terribly big numbers on their own, but the problem is that each and every Web hit generates a log entry. So if you have a Web page with seven images, there will be eight log entries generated: one for the page plus seven for the images. Those tiny numbers start adding up fast. If you get as few as 100,000 hits, the log file will take up about 8 MB for the common format and 15.6 MB for the combined format. The longer you let the server log to the same file, the larger that file will grow. If left on its own forever, the log file would eventually grow to fill all the disk space it has available. This can cause all sorts of problems.
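
The arithmetic behind those figures is easy to check. The short Python sketch below uses the approximate entry sizes quoted above; the function names are mine, and a megabyte is assumed to be 1024 x 1024 bytes.

```python
COMMON_ENTRY_BYTES = 83      # approximate size of the common-format example
COMBINED_ENTRY_BYTES = 164   # approximate size of the combined-format example


def entries_per_page(image_count):
    """One log entry for the page itself plus one per embedded image."""
    return 1 + image_count


def log_size_mb(hits, bytes_per_entry):
    """Estimated log size in megabytes (1 MB = 1024 * 1024 bytes)."""
    return hits * bytes_per_entry / (1024 * 1024)


print(entries_per_page(7))                                   # 8
print(round(log_size_mb(100_000, COMMON_ENTRY_BYTES), 1))    # 7.9
print(round(log_size_mb(100_000, COMBINED_ENTRY_BYTES), 1))  # 15.6
```

Real entries vary in length with the URL, referrer, and browser string, so treat the result as a rough estimate rather than a prediction.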

This is where log rotation comes in. Log rotation is just what it sounds like. The current log file is moved to a new name and a new file is created with the original name of the log file. So the current log file is rotated out. Once the files have changed places, the Web server must either be restarted or sent a signal that it should open a connection to the new file. If this signal is not sent to notify the Web server that the log files have changed, the Web server will continue to write to the old log file (even with the new name), and the log file will continue to grow.
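
The Python sketch below illustrates why that signal matters. It is only a demonstration under assumptions of my own: the rotate function and the ".1" suffix are hypothetical, an ordinary open file handle stands in for the Web server, and everything happens in a temporary directory on a Linux-style filesystem.

```python
import os
import signal
import tempfile


def rotate(log_path, pid=None):
    """Move log_path to log_path.1, recreate it empty, and optionally signal the server."""
    rotated = log_path + ".1"
    os.replace(log_path, rotated)      # an already-open handle follows the renamed file
    open(log_path, "a").close()        # new, empty file under the original name
    if pid is not None:
        os.kill(pid, signal.SIGHUP)    # ask the server to reopen its log files
    return rotated


# Demonstration in a throwaway directory; no real server is signalled.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "access_log")
with open(log_path, "w") as f:
    f.write("old entry\n")

server_handle = open(log_path, "a")    # stands in for the Web server's open log file
rotated = rotate(log_path)
server_handle.write("written after rotation\n")
server_handle.close()

with open(rotated) as f:
    print(f.read(), end="")            # both lines ended up in the rotated file
print(os.path.getsize(log_path))       # the new file is still empty: 0
```

Because the stand-in "server" never reopened its log, its later write landed in the renamed file, which is exactly the behavior described above.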

Log rotation is a boring and repetitive task. This is the sort of thing that computers handle a lot better than people. Erik Troan wrote a nifty little program called logrotate specifically for this problem. I won’t go into great detail on all the features and options of logrotate, but I will give you a simple example of how it works. Logrotate looks in a special directory (usually /etc/logrotate.d) to find configuration files that have information on which logs need to be rotated.

These files are actually pretty straightforward, so it is easy to build them by hand. Here is an example of a stock configuration file that comes standard on Red Hat Linux:

/var/log/httpd/access_log {
    missingok
    postrotate
        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
    endscript
}

/var/log/httpd/error_log {
    missingok
    postrotate
        /bin/kill -HUP `cat /var/run/httpd.pid 2>/dev/null` 2> /dev/null || true
    endscript
}

The first line of each block tells logrotate the name of the log file. The second line, missingok, tells logrotate not to report an error if the log is not there. Then there is a block wrapped in postrotate/endscript, which logrotate runs after the log has been rotated. The command in the middle reads Apache’s process ID from /var/run/httpd.pid and sends that process a HUP signal, which tells the Web server to reopen its log files.

Where Are the Pretty Pictures?

Now that you understand how logs work and how to keep them from taking over your hard drive, it is time to make them work for you. The raw logs are not very useful on their own, but once that pile of data is transformed into charts and statistics, it can help you draw conclusions about your Web traffic. The tool for this job is a Web log analysis program or, as it is more commonly known, a log shredder.

A log shredder looks at the log file line by line and stores this information in its own database. Once it has read through the entire log, it will produce graphs and charts to teach you about your Web traffic. There are a number of log shredder packages available for Linux, ranging in price from zero to hundreds of dollars per domain analyzed.
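
As a rough idea of what that line-by-line pass involves, here is a tiny Python sketch that tallies requests per file and prints a crude text chart. It is not how any particular log shredder is implemented; the sample lines are invented and the requests_per_file helper is hypothetical.

```python
from collections import Counter

# Invented sample entries in the common log format.
SAMPLE_LOG = [
    '10.0.0.1 - - [02/Nov/2001:11:00:36 -0700] "GET /index.html HTTP/1.0" 200 1500',
    '10.0.0.1 - - [02/Nov/2001:11:00:37 -0700] "GET /logo.gif HTTP/1.0" 200 4200',
    '10.0.0.2 - - [02/Nov/2001:12:15:02 -0700] "GET /index.html HTTP/1.0" 200 1500',
    '10.0.0.2 - - [02/Nov/2001:12:15:03 -0700] "GET /logo.gif HTTP/1.0" 200 4200',
    '10.0.0.3 - - [02/Nov/2001:13:40:10 -0700] "GET /index.html HTTP/1.0" 200 1500',
    '10.0.0.3 - - [02/Nov/2001:13:41:55 -0700] "GET /about.html HTTP/1.0" 200 900',
]


def requests_per_file(lines):
    """Count how many times each path was requested."""
    counts = Counter()
    for line in lines:
        pieces = line.split('"')
        if len(pieces) < 3:
            continue                    # skip lines without a quoted request
        parts = pieces[1].split()       # e.g. ['GET', '/index.html', 'HTTP/1.0']
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts


for path, count in requests_per_file(SAMPLE_LOG).most_common():
    print(f"{path:<14} {'#' * count} {count}")
```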

Analog is a program that can be very useful for basic log shredding needs (see link below). This log shredder is focused on being open, fast, and scalable, and is available for free on the Internet. It provides a lot of basic information about what files are being viewed and by whom. Analog outputs its reports with very few graphics, making them easy to view even over the slowest link. This program is light on system resources and can be useful if you do not need a lot of detail about the viewers of your Web site.

Depending on your needs, Analog may not be for you. The original measurement of the popularity of a Web site was called a hit. A hit is simply a connection between a browser and the Web server. This could be a status check or a browser actually viewing a page or image. Modern Web pages are typically composed of a number of files, making the number of hits difficult to correlate to the number of viewers. The way most sites now measure popularity is a statistic called a “unique”. Each unique represents one unique person visiting a Web site during a time period, usually a day. This number is usually much, much smaller than the number of hits a Web site receives. The larger the number of unique visitors, the larger the amount of advertising revenue your site can typically command. Most Webmasters are therefore more concerned with unique statistics than any other single statistic.
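
The difference between hits and uniques can be sketched in a few lines of Python. This example assumes that one client address equals one visitor, which is only an approximation (proxies and shared addresses blur it), and the sample lines and the hits_and_uniques helper are invented for illustration.

```python
from collections import defaultdict

# Invented sample entries in the common log format.
SAMPLE_LOG = [
    '10.0.0.1 - - [02/Nov/2001:11:00:36 -0700] "GET /index.html HTTP/1.0" 200 1500',
    '10.0.0.1 - - [02/Nov/2001:11:00:37 -0700] "GET /logo.gif HTTP/1.0" 200 4200',
    '10.0.0.2 - - [02/Nov/2001:12:15:02 -0700] "GET /index.html HTTP/1.0" 200 1500',
    '10.0.0.1 - - [03/Nov/2001:09:30:00 -0700] "GET /index.html HTTP/1.0" 200 1500',
]


def hits_and_uniques(lines):
    """Return {day: (hits, unique_hosts)} keyed by the date part of each timestamp."""
    hits = defaultdict(int)
    hosts = defaultdict(set)
    for line in lines:
        host = line.split(" ", 1)[0]
        day = line.split("[", 1)[1].split(":", 1)[0]   # e.g. "02/Nov/2001"
        hits[day] += 1
        hosts[day].add(host)
    return {day: (hits[day], len(hosts[day])) for day in hits}


for day, (hit_count, unique_count) in hits_and_uniques(SAMPLE_LOG).items():
    print(day, hit_count, unique_count)
```

On the sample data, 02/Nov/2001 shows three hits but only two uniques, because one visitor fetched both a page and an image.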

Unfortunately, I was not able to find any way to make Analog provide this information. I was, however, able to find another open source solution to this problem. Webalizer to the rescue! Webalizer is another open source project and provides more detailed reports than Analog. It is also built to be more graphics-oriented. It conveniently provides reports on hits and files as well as the number of visitors to the Web site, and arranges all the information it provides in both colorful graphs and extensive charts. All this information will help any Webmaster get a better idea about how people are getting to the Web site (through its analysis of referrers) and what browsers they are using.

There is, of course, a next level of log shredding. One feature that I was not able to find in any of the open source log shredding tools that I reviewed was click trail analysis. A click trail is the path a visitor takes through the Web site. This information can be very useful if used correctly. Click trails help you figure out how “sticky” your Web site is. A “sticky” Web site means that once a person gets to the site, they wander around it for a while.

The worst-case scenario for a Webmaster involves spending a lot of money on banner ads to get people to come to the site and then never getting any of them to go past the page the banner takes them to. Click trails can help detect problems with navigation as well as point out where visitors are leaving your site.

I found a tool that can provide this sort of information called Urchin. This program has an extensive tracking system built in and can help find the most popular entrance and exit pages. It also provides statistics for depth of visit (how many pages the visitor saw before they left) and how long they were on the Web site. All this information may seem like overload, but really it is key to tailoring the Web site to the visitors it attracts. Urchin is expensive when compared to the previous solutions. It retails for $199 per domain for Urchin Pro, or $295 for Urchin Dedicated, the 25-site license. As a Webmaster in the post-bubble economy, you have to decide how much the extra reporting is worth to you.
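
To show what click trail statistics such as entrance pages, exit pages, and depth of visit mean in practice, here is a small Python sketch. It is not Urchin's method: it simply treats each client address as one visit with no session timeout, and the sample data and the click_trails helper are invented.

```python
from collections import Counter, defaultdict

# (host, page) pairs in the order they appear in the log; invented sample data.
REQUESTS = [
    ("10.0.0.1", "/landing.html"),
    ("10.0.0.2", "/landing.html"),
    ("10.0.0.1", "/products.html"),
    ("10.0.0.1", "/order.html"),
    ("10.0.0.3", "/index.html"),
    ("10.0.0.3", "/products.html"),
]


def click_trails(requests):
    """Group pages by visitor, preserving the order in which they were requested."""
    trails = defaultdict(list)
    for host, page in requests:
        trails[host].append(page)
    return dict(trails)


trails = click_trails(REQUESTS)
entrances = Counter(trail[0] for trail in trails.values())    # first page of each visit
exits = Counter(trail[-1] for trail in trails.values())       # last page of each visit
depths = {host: len(trail) for host, trail in trails.items()}  # pages seen per visit

print(entrances.most_common(1))   # [('/landing.html', 2)]
print(depths)
```

In this sample, the visitor from 10.0.0.2 arrived on the landing page and left without going any further, which is the banner ad problem described earlier.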

Now that you are armed with information about the care and feeding of your log files, your system will run more smoothly, and you can rest assured that your drives will not be taken over by your Web log files. More important, you can kick your Webmaster skills up a notch by using all this information to make your Web site more appealing to your visitors. These statistics show what is and is not working, so you can experiment with new ways of doing things and make changes accordingly.

Useful Links

Download site for the source to logrotate

Information on how to create your own log format

Analog Homepage

Webalizer Homepage

Urchin Homepage

  As Seen On: Tophosts.com


