
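This article shows how to extend the reports produced by the log analysis tool AWStats and broaden what it can analyze. Adapt the examples to your own needs; some of these features may well appear in a future release. Enjoy.
Original: http://www.antezeta.com/awstats.html
Translation: 刘辉 (Liu Hui) http://www.anywolfs.com/liuhui
Copyright: antezeta.com
You can find out in detail which search engines, and which keywords and key phrases, bring visitors to your site. Depending on how your site is designed and what it covers, different search engines weight and search it differently. To capture this information, add the following to the ExtraSection part of your AWStats configuration file.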
ExtraSectionName5="Google Searches - Top 50"
ExtraSectionCodeFilter5="200 304"
ExtraSectionCondition5="REFERER,(.*www\.google.*)"
ExtraSectionFirstColumnTitle5="Search"
ExtraSectionFirstColumnValues5="REFERER,p=([^&]+)||REFERER,q=([^&]+)||REFERER,as_p=([^&]+)||REFERER,as_q=([^&]+)"
ExtraSectionFirstColumnFormat5="<a href='http://www.google.com/search?q=%s' title='Click to execute search'>%s</a>"
ExtraSectionStatTypes5=PHBL
ExtraSectionAddAverageRow5=0
ExtraSectionAddSumRow5=1
MaxNbOfExtra5=50
MinHitExtra5=1
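Search keywords generally follow a specific query parameter, such as p= in REFERER,p=([^&]+), where p is the query parameter. Most search engines use a single query parameter, but Google is more complicated; common parameters are p=, q=, key= and query=. Consult the AWStats search_engines.pm file to see which parameter you should use.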
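To see what the four alternatives in ExtraSectionFirstColumnValues5 capture, the following minimal Python sketch applies the same patterns, in the same order, to a referrer URL. It is an illustration only, not AWStats code: the function name and the sample URLs are hypothetical, and the URL decoding step is added here for readability.

```python
import re
from urllib.parse import unquote_plus

# Patterns copied from ExtraSectionFirstColumnValues5, tried in the same order.
PATTERNS = [r"p=([^&]+)", r"q=([^&]+)", r"as_p=([^&]+)", r"as_q=([^&]+)"]


def extract_search_terms(referer):
    """Return the decoded search phrase from a referrer URL, or None."""
    for pattern in PATTERNS:
        match = re.search(pattern, referer)
        if match:
            # Decode '+' and %XX escapes so the phrase is readable.
            return unquote_plus(match.group(1))
    return None


if __name__ == "__main__":
    # Hypothetical sample referrer.
    print(extract_search_terms(
        "http://www.google.com/search?hl=en&q=awstats+extra+sections"))
```

Because the patterns are not anchored, p= also matches the tail of as_p=, and q= the tail of as_q=, so the first two alternatives already cover the last two in this sketch.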
Download: an AWStats ExtraSection configuration template covering Google, Yahoo, Ask and MSN.
See also the configuration by 车东 (Che Dong):
Adding detailed statistics for Googlebot/Baiduspider/Yahoo!Slurp/MSNBot to AWStats
http://www.chedong.com/blog/archives/001200.html
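Ideally, ExtraSectionCondition5 would use AND (&&) logic to include every Google site of the form *.google.* while excluding Gmail, whose host names start with mail.google.*. To exclude Gmail, you could try setting ExtraSectionCondition5 as shown below. If you have a better suggestion, let antezeta know: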
ExtraSectionCondition5="REFERER,(.*google.*)&&REFERER,^http:\/\/([^mail\.google\.])"
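One thing to check in the second pattern is that [^mail\.google\.] is a negated character class, so it tests only the first character of the host name rather than the literal prefix mail.google. The sketch below compares it with a negative lookahead, which is a hypothetical alternative and not part of the original configuration. It examines the regular expressions only, using made-up referrers, and says nothing about how AWStats combines conditions.

```python
import re

# Pattern as written in the condition above: a negated character class that
# rejects any host starting with one of the characters m, a, i, l, ., g, o, e.
CHAR_CLASS = re.compile(r"^http:\/\/([^mail\.google\.])")

# Hypothetical alternative: a negative lookahead rejects only the literal
# prefix "mail.google." and accepts every other host.
LOOKAHEAD = re.compile(r"^http:\/\/(?!mail\.google\.)")


def matches(pattern, referer):
    """Return True when the pattern matches the referrer."""
    return pattern.search(referer) is not None


if __name__ == "__main__":
    for url in ("http://www.google.com/search?q=x",
                "http://mail.google.com/mail/",
                "http://images.google.com/"):
        print(url, matches(CHAR_CLASS, url), matches(LOOKAHEAD, url))
```

With the character class, a referrer such as images.google.com is rejected along with Gmail, because its host name starts with a character listed in the class.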
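Search engines need to know both the text on your site and all of its pages, so monitoring crawler activity is especially important. The following section does that for Google's crawler: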
ExtraSectionName1="Google crawls - Top 50"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="UA,(.*Googlebot.*)"
ExtraSectionFirstColumnValues1="URL,(.*)"
ExtraSectionFirstColumnFormat1="<a href='http://www.mysite.com%s' title='Item Crawled'>%s</a>"
ExtraSectionStatTypes1=PHBL
ExtraSectionAddAverageRow1=0
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=50
MinHitExtra1=1
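Sometimes a visitor or robot browsing your site sets its user agent to pose as Googlebot, in which case the report above does not tell you who that visitor really is. To track Googlebot more precisely, change ExtraSectionCondition1="UA,(.*Googlebot.*)" to ExtraSectionCondition1="HOST,(\.googlebot\.com$)". This shows more accurately what Google itself downloaded from your site, because the real Googlebot visits from hosts named *.googlebot.com. Likewise, for Yahoo! use ExtraSectionCondition1="HOST,(\.inktomisearch\.com$)"; for Microsoft use ExtraSectionCondition1="HOST,(msnbot\.msn\.com)"; for Ask use ExtraSectionCondition1="HOST,(egspd.*\.ask\.com)". You might want to check the user agent as well, but AWStats does not currently support more than one condition.
To make the HOST parameter usable in an Extra Section condition, we modified AWStats. In awstats.pl, change this line: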
if ($HostResolved =~ /$conditiontypeval/) { $conditionok=1; last; }
to
if ($field[$pos_host] =~ /$conditiontypeval/) { $conditionok=1; last; }
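The HOST patterns quoted above can be tried against sample host names before they go into the configuration. This is a minimal sketch: the dictionary, the function name and the host names are illustrative, and it assumes the host field of the log contains resolved host names rather than IP addresses.

```python
import re

# HOST patterns taken from the text above, keyed by search engine.
CRAWLER_HOST_PATTERNS = {
    "Google": r"\.googlebot\.com$",
    "Yahoo!": r"\.inktomisearch\.com$",
    "Microsoft": r"msnbot\.msn\.com",
    "Ask": r"egspd.*\.ask\.com",
}


def identify_crawler(host):
    """Return the engine whose HOST pattern matches, or None."""
    for engine, pattern in CRAWLER_HOST_PATTERNS.items():
        if re.search(pattern, host):
            return engine
    return None


if __name__ == "__main__":
    # Sample host names, for illustration only.
    print(identify_crawler("crawl-66-249-66-1.googlebot.com"))
    print(identify_crawler("googlebot.com.example.net"))
```

The trailing $ in the Google and Yahoo! patterns is what stops a host such as googlebot.com.example.net from being counted as a crawler.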
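The settings above target Google's crawler. Each search engine has its own crawler; consult robots.pm to find the one for the search engine you want to track. Besides Google's Googlebot, the main ones are Yahoo!'s Yahoo! Slurp, Ask's Ask Jeeves/Teoma and MSN's msnbot. If you are unsure which to choose, see Search Engine Crawlers: Who's visiting my site and why?.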
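However, if you want detailed information for all search engines in one report, AWStats cannot help, because the ExtraSection syntax in the current version does not support it.
Google recently introduced the idea of describing a site map in an XML file. This is especially useful for sites with relatively complex navigation, and it helps search engines identify which parts of your site have been updated. Read "The Google Webmaster Dashboard, a.k.a. Google Sitemaps" to learn about this search engine optimization tool.
If your site has a site map, this extra section shows who downloads it: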
ExtraSectionName13="sitemap.xml.gz downloads by Useragent"
ExtraSectionCodeFilter13="200 304"
ExtraSectionCondition13="URL,(^\/sitemap\.xml\.gz)"
ExtraSectionFirstColumnTitle13="UA"
ExtraSectionFirstColumnValues13="UA,(.*)"
ExtraSectionStatTypes13=HBL
ExtraSectionAddAverageRow13=0
ExtraSectionAddSumRow13=1
MaxNbOfExtra13=10
MinHitExtra13=1
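Similar reports can be created for Yahoo's URL List (urllist.txt, urllist.gz) and for A9 / Alexa Site Info (siteinfo.xml).
AWStats provides a nice referrer report that lists the referring pages bringing in visitors. Sometimes you want the referral traffic grouped by domain instead, and the following section does that. Replace www\.mysite\.com with your own domain, and put a \ before every period.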
ExtraSectionName1="Referring Sites by domain - Top 25"
ExtraSectionCodeFilter1="200 304"
# Filter on ANY REFERER except "mysite". Change mysite to your domain name.
ExtraSectionCondition1="REFERER,^(?!http:\/\/www\.mysite\.com)"
ExtraSectionFirstColumnTitle1="Site"
ExtraSectionFirstColumnValues1="REFERER,^[hH][tT][tT][pP]:\/\/([^\/]+)\/"
ExtraSectionFirstColumnFormat1="<a href='http://%s/' rel='nofollow' title='http://%s/ [new window]'>%s</a>"
ExtraSectionStatTypes1=PHL
ExtraSectionAddAverageRow1=1
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=25
MinHitExtra1=1
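The condition and first-column patterns of this section can be exercised together as follows. This is a sketch only: the function name and the sample referrers are hypothetical, and it mirrors the two regular expressions rather than AWStats' own processing.

```python
import re

# ExtraSectionCondition1: any referrer that does not start with the own site.
CONDITION = re.compile(r"^(?!http:\/\/www\.mysite\.com)")
# ExtraSectionFirstColumnValues1: capture the host name of an http referrer.
FIRST_COLUMN = re.compile(r"^[hH][tT][tT][pP]:\/\/([^\/]+)\/")


def referring_domain(referer):
    """Return the referring host name, or None if the hit would be skipped."""
    if not CONDITION.search(referer):
        return None
    match = FIRST_COLUMN.search(referer)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(referring_domain("http://www.example.org/links.html"))
    print(referring_domain("http://www.mysite.com/about.html"))
```

As written, the first-column pattern only captures referrers that use http:// and contain a slash after the host name; in this sketch an https:// referrer yields no domain.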
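If your site offers RSS feeds, you can track which browsers and spiders download them. The following section reports the top feed readers and spiders requesting files ending in .xml, .rdf or .rss. It uses the URL parameter to select the relevant file suffixes, creating a "content group" for tracking subscriptions.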
ExtraSectionName2="Top 30 RSS Readers/Spiders"
ExtraSectionCodeFilter2="200 304"
ExtraSectionCondition2="URL,\.xml|\.rdf|\.rss"
ExtraSectionFirstColumnTitle2="RSS Reader/Spider"
ExtraSectionFirstColumnValues2="UA,(.*)"
ExtraSectionStatTypes2=HBL
ExtraSectionAddAverageRow2=1
ExtraSectionAddSumRow2=1
MaxNbOfExtra2=30
MinHitExtra2=1
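The URL pattern in ExtraSectionCondition2 is not anchored, so it matches these suffixes anywhere in the URL. The sketch below contrasts it with an anchored variant; the anchored pattern is a hypothetical alternative, not part of the original configuration, and the URLs are made-up samples.

```python
import re

# Pattern as written in ExtraSectionCondition2.
AS_WRITTEN = re.compile(r"\.xml|\.rdf|\.rss")
# Hypothetical stricter variant: the suffix must end the URL.
ANCHORED = re.compile(r"\.(xml|rdf|rss)$")


def is_feed_url(url, pattern=AS_WRITTEN):
    """Return True when the URL would be counted by the given pattern."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("/feed/index.rss", "/sitemap.xml.gz"):
        print(url, is_feed_url(url), is_feed_url(url, ANCHORED))
```

With the pattern as written, a request for /sitemap.xml.gz is counted alongside the feeds; whether that matters depends on which files your site serves.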
Sometimes you need to keep an eye on particular pages. This example reports on every URL that contains javascript in its name.
ExtraSectionName24="Pages with javascript in name"
ExtraSectionCodeFilter24="200 304"
# Filter on specific URL, including possible jsessionid
ExtraSectionCondition24="URL,(^\/.*javascript.*\.html)"
ExtraSectionFirstColumnTitle24="URL"
ExtraSectionFirstColumnValues24="URL,(.*)"
ExtraSectionStatTypes24=PBL
ExtraSectionAddAverageRow24=0
ExtraSectionAddSumRow24=0
MaxNbOfExtra24=1
MinHitExtra24=1
The following example shows the top file downloads from your site; adjust the list of extensions to suit your own site:
# To do: Ideally parameterize from not page list.
ExtraSectionName15="Downloads (diff,doc,pdf,rtf,sh,tgz,zip) - Top 10"
ExtraSectionCodeFilter15="200 304"
ExtraSectionCondition15="URL,(.*((\.diff)|(\.doc)|(\.pdf)|(\.rtf)|(\.sh)|(\.tgz)|(\.zip)))"
ExtraSectionFirstColumnTitle15="Download"
ExtraSectionFirstColumnValues15="URL,(.*)"
ExtraSectionFirstColumnFormat15="%s"
ExtraSectionStatTypes15=HBL
ExtraSectionAddAverageRow15=0
ExtraSectionAddSumRow15=1
MaxNbOfExtra15=10
MinHitExtra15=1
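Below are some patches contributed by AWStats enthusiasts:
By default, AWStats reports detailed statistics for each month of the current year to date. Viewing the most recent months across a year boundary is awkward unless you open the previous year's report in a separate page.
The patch by rkodey solves part of this problem: it shows a detailed report for the last 12 months, regardless of the year boundary.
If you are interested, see the patch's original tracker entry, AWStats patch ID 1103597. The following versions are available; pick the one you need: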
| AWStats Version | Patch Version | Patched awstats.pl |
|---|---|---|
| 6.2 | awstats.pl-1.783_last_12_months.patch | |
| 6.4 | awstats.pl-1.814_last_12_months.patch | |
| 6.5 | awstats.pl-1.857_last_12_months.patch | awstats.pl.gz |
| 6.6 (1.887) | awstats.pl-1.857_last_12_months.patch | |
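If you are unsure how to apply a patch, see the instructions for patching a file.
We currently use the patch for AWStats 6.5. Decompress the downloaded file and replace the original awstats.pl with it. Back up the relevant AWStats files before replacing anything.
Thanks to a patch from Josep Ruano, we can see clearly which countries visitors come from. (Requires AWStats 6.5 or later.)
To use it, edit your awstats.mysite.conf file and enable ShowDomainsStats: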
# Show domains/country chart
# Context: Web, Streaming, Mail, Ftp
# Default: PHB, Possible column codes: PHB
ShowDomainsStats=UVPHB
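The change takes effect as soon as the file is saved, but it is best to reprocess the logs.
The HTTP requests AWStats counts include not only the usual GET and POST requests but also HEAD requests, which retrieve only the HTTP headers. Because HEAD requests are often exploited by attackers, filtering them out is worthwhile.
车东 (Che Dong) suggested filtering out HEAD requests. For AWStats 6.5, apply the code in awstats.pl.head.diff to your awstats.pl file.
If you use AWStats 6.6, add the following at line 6322 of awstats.pl: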
|| $field[$pos_method] eq 'HEAD'
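The effect of skipping HEAD requests can be approximated outside AWStats by filtering the log before it is analyzed. The sketch below is not the patch itself: it assumes log lines in the common or combined format, where the request appears in double quotes, and the function name and sample lines are hypothetical.

```python
import re

# Matches the quoted request field, e.g. "GET /index.html HTTP/1.1".
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*"')


def drop_head_requests(lines):
    """Return the log lines whose request method is not HEAD."""
    kept = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") == "HEAD":
            continue  # skip HEAD requests, keep everything else
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [15/Aug/2005:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1234',
        '192.0.2.2 - - [15/Aug/2005:10:00:01 +0200] "HEAD /index.html HTTP/1.1" 200 0',
    ]
    for line in drop_head_requests(sample):
        print(line)
```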
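Our AWStats search_engines.pm and search engine database allow search engines to be detected accurately; see our list of updates and download the data base.
AWStats aggregates search referrals, folding a search engine's country-specific domains into a single engine. For example, google.ca and google.co.uk are both reported as Google.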
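The aggregation idea can be sketched as a lookup from referrer host name to engine name. The patterns below are illustrative assumptions written for this example; AWStats keeps its own definitions in search_engines.pm, which may differ.

```python
import re

# Illustrative patterns only, not the table shipped with AWStats.
ENGINE_PATTERNS = [
    (re.compile(r"(^|\.)google\.[a-z.]+$"), "Google"),
    (re.compile(r"(^|\.)yahoo\.[a-z.]+$"), "Yahoo"),
]


def engine_for_host(host):
    """Map a referrer host name to a single engine name, or None."""
    host = host.lower()
    for pattern, name in ENGINE_PATTERNS:
        if pattern.search(host):
            return name
    return None


if __name__ == "__main__":
    for host in ("www.google.ca", "www.google.co.uk", "www.example.com"):
        print(host, engine_for_host(host))
```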
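AWStats uses browsers.pm to identify visitors' browsers. Download our browsers.pm.tgz browser database, which contains the browser definitions; for AWStats 6.5 or earlier, use the older version.
As Linux distributions have spread, site owners have come to care about distinguishing Linux systems as much as Windows or Mac systems. This capability was added in AWStats 6.5 and later versions through the operating_systems.pm file, which recognizes a number of known Linux distributions.
In the report, each operating system is shown with its logo and linked to its home page. This includes, but is not limited to, versions of Windows, Macintosh and Linux.
AWStats 6.5 also added detection of BSD operating systems, but that is not included in this patch.
Detection of Mandrake/Mandriva is disabled by default and must be enabled manually.
Installation: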
awstats/wwwroot/cgi-bin/lib/operating_systems.pm to awstats/wwwroot/cgi-bin/lib/operating_systems.pm.bck
This step is optional; the software runs fine without the images.
linux.png to lin.png
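This step is optional and carries some risk. If you skip it, the report lists each Linux distribution separately; if you do it, they are grouped, giving a tidier and more useful report.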
awstats/wwwroot/cgi-bin/awstats.pl to awstats/wwwroot/cgi-bin/awstats.pl.bck
The .cat top-level domain was introduced in Europe in 2005. Francesc Roca Tugas contributed an entry for the .cat domain to AWStats. To add it, change this line:
'bz','Belize','ca','Canada','cc','Cocos (Keeling) Islands',
to
'bz','Belize','ca','Canada','cat','Catalan Linguistic and Cultural Community','cc','Cocos (Keeling) Islands',
and save the file.
AWStats 6.5 was recently released. Antezeta takes a look at new functionality in 6.5.
Some unscrupulous sites attempt to increase their Search Engine rankings and general visibility by automatically creating links to their sites on other sites. The primary target is blog sites which publish the latest referring URLs. A secondary target is sites which publish their web analytics statistics.
Consider a fictitious example: A site called www.dreamingdamsels.xxx has an automated program which requests the home page from www.mysite.com. If www.mysite.com publishes the most recent referrers on their site, then www.dreamingdamsels.xxx has just created a link to www.dreamingdamsels.xxx from www.mysite.com. Similarly, the automated program will try to make enough requests from www.dreamingdamsels.xxx to become one of the top referrers, landing in the AWStats Web Analytics Top Referrals report. If a site publishes its web analytics reports, www.dreamingdamsels.xxx will appear on the site. In either case, the end game is to procure an automatic, free link from your site to theirs, in a parasitic approach to Search Engine Optimization.
Thanks to a contribution from Rod Begbie, AWStats version 6.5 has a referral spam filtering feature.
To enable the filtering, add a SkipReferrersBlackList entry to your awstats.mydomain.conf configuration file:
# Use SkipReferrersBlackList if you want to exclude records coming from a SPAM
# referrer. Parameter must receive a local file name containing rules applied
# on referrer field. If parameter is empty, no filter is applied.
# An example of such a file is available in lib/blacklist.txt
# The site that formerly offered an updated version is no longer available.
# Change : Effective for new updates only
# Example: "/mylibpath/blacklist.txt"
# Default: ""
#
SkipReferrersBlackList="/usr/share/awstats/wwwroot/cgi-bin/lib/blacklist.txt"
Change the path of blacklist.txt to match that on your system. The site that formerly offered current files is no longer available.
Notes:
There are several tactics for managing spam referrals:
Traditionally AWStats has followed a monthly reporting model – except for the monthly history, report sections present an overview of activity for a given month – either the current month to date or a previous month. While this works for many sites, a common need is to see what is happening on a more granular level. With version 6.5, sites will be able to run hourly, daily, monthly and/or yearly reports.
Expanding on a previous unsupported work-around documented in the AWStats FAQ (ID FAQ-COM600), version 6.5 introduces a new configuration option, DatabaseBreak.
DatabaseBreak automates the process of creating the correct AWStats intermediary statistics files necessary for hourly, daily, monthly and yearly report generation. While the functionality is still rough around the edges, we note what is possible with the current implementation.
Currently, support to generate the intermediary files works well.
For sites using the command line interface, there is one change to make: the new option DatabaseBreak should be specified for each reporting granularity desired. DatabaseBreak can take values of year, month, day and hour.
awstats.pl -config=antezeta_com -configdir=/etc/awstats -update -debug=0 -LogFile=access_log -DatabaseBreak=month
awstats.pl -config=antezeta_com -configdir=/etc/awstats -update -debug=0 -LogFile=access_log -databasebreak=day
awstats.pl -config=antezeta_com -configdir=/etc/awstats -update -debug=0 -LogFile=access_log -DatabaseBreak=year
awstats.pl -config=antezeta_com -configdir=/etc/awstats -update -debug=0 -LogFile=access_log -DatabaseBreak=hour
This will create statistics database files for the configuration file awstats.antezeta_com.conf.
| File | Description |
|---|---|
| awstats2005.antezeta_com.txt | 2005 Yearly file |
| awstats082005.antezeta_com.txt | August 2005 Monthly |
| awstats08200515.antezeta_com.txt | 15 August 2005 Daily |
| awstats0820051500.antezeta_com.txt | 15 August 2005 Hourly from midnight 00 to 01 am |
| awstats0820051501.antezeta_com.txt | 15 August 2005 Hourly from 01am to 02am |
| ... | ...additional hourly files ... |
| awstats0820051522.antezeta_com.txt | 15 August 2005 Hourly from 22 to 23 (10 to 11pm) |
| awstats0820051523.antezeta_com.txt | 15 August 2005 Hourly from 23 to 00 (11pm to midnight) |
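The naming scheme in the table can be summarised as awstats, followed by month, year, day and hour digits as far as the chosen granularity requires, then the configuration name. The helper below is a sketch inferred from the table rows only, not taken from the AWStats source; the function name is hypothetical.

```python
from datetime import datetime


def stats_filename(config, when, database_break="month"):
    """Build the statistics file name for a given DatabaseBreak value."""
    stamps = {
        "year": when.strftime("%Y"),
        "month": when.strftime("%m%Y"),
        "day": when.strftime("%m%Y%d"),
        "hour": when.strftime("%m%Y%d%H"),
    }
    key = database_break.lower()  # the option is described as case insensitive
    if key not in stamps:
        raise ValueError(f"unsupported DatabaseBreak value: {database_break}")
    return f"awstats{stamps[key]}.{config}.txt"


if __name__ == "__main__":
    moment = datetime(2005, 8, 15, 22)
    for value in ("year", "month", "day", "hour"):
        print(stats_filename("antezeta_com", moment, value))
```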
DatabaseBreak is case insensitive; month is the default value.
Do not put DatabaseBreak in your awstats.mysite.conf file. Specifying DatabaseBreak on the command line does not currently seem to override a configuration file value; no statistics file will be generated if the command line value does not agree with the configuration file value.
We have placed all of the statistics files in the same directory. While this does not seem to confuse AWStats, you could assign DirData="__VarDirData__" in your AWStats configuration file, assigning an appropriate value to VarDirData each time you generate statistics with different DatabaseBreak values, i.e. export VarDirData=/awstatsdata/month, to keep each set of statistics files in separate directories.
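A wrapper script can set VarDirData and the matching DatabaseBreak value together, so the two never disagree. The following is a minimal sketch under stated assumptions: DirData="__VarDirData__" is already in the configuration file, the /awstatsdata root and the function name are hypothetical, and only options shown in the commands above are used.

```python
import os


def build_update_command(database_break, data_root="/awstatsdata"):
    """Return (argv, env) for one AWStats update run at the given granularity."""
    env = dict(os.environ)
    # One data directory per granularity, e.g. /awstatsdata/month.
    env["VarDirData"] = f"{data_root}/{database_break}"
    argv = [
        "awstats.pl",
        "-config=antezeta_com",
        "-configdir=/etc/awstats",
        "-update",
        "-LogFile=access_log",
        f"-DatabaseBreak={database_break}",
    ]
    return argv, env


if __name__ == "__main__":
    for value in ("month", "day", "hour"):
        argv, env = build_update_command(value)
        print(env["VarDirData"], " ".join(argv))
```

The returned pair could then be passed to subprocess.run(argv, env=env, check=True) on a machine where awstats.pl is on the path.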
We have not yet verified statistics generation using the CGI on demand update. It should work with the databasebreak=hour&hour=18&day=22&month=08&year=2005 syntax described below.
The on-demand CGI report drop down interface has not yet been updated to take advantage of the new statistics files, but there is a URL work-around. To generate
To start AWStats reports in databasebreak mode without entering a long URL, use an intermediary html file with JavaScript to create the URL and, with a redirect, start AWStats for you. We have created an example AWStats html start file you can save and modify to fit your needs.
Known issues: The script will attempt to start AWStats using yesterday as the starting date, as long as today is not the first of the month. More elegant logic would set the month to the previous month and the day to the last day of that month (28, 29, 30, 31). In the case of January, the year would be reduced by one. There may be a need to prefix single digit days and months with a 0, i.e. 08 for August. This has not been done. If someone wants to add these enhancements, write us and we will post them here.
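The date handling described under the known issues (rolling back across month and year boundaries, and zero-padding single digits) can be sketched with standard date arithmetic. The example is in Python rather than the JavaScript of the start file, the function name is hypothetical, and it assumes a daily report takes the same query parameters as the hourly syntax without the hour.

```python
from datetime import date, timedelta


def yesterday_query(today):
    """Build the query string for yesterday's daily report."""
    day = today - timedelta(days=1)  # handles month and year rollover
    # %d and %m are zero-padded, e.g. 08 for August.
    return f"databasebreak=day&day={day:%d}&month={day:%m}&year={day:%Y}"


if __name__ == "__main__":
    print(yesterday_query(date(2005, 9, 1)))
    print(yesterday_query(date(2006, 1, 1)))
```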
To run hourly or daily reports on historical data, you will need to generate hourly and daily statistics files for the historical data. You will not need to regenerate existing monthly statistics files.
Once the statistics files have been generated, two changes are necessary when generating static reports:
The year option does not yet seem to be supported in reporting.
If you are generating PDFs on Fedora Core, you may want to see our HTMLDOC RPM instructions.
If you are using the MaxMind GeoIP plugin, you will need to specify the full path for GeoIP.dat:
LoadPlugin="geoip GEOIP_STANDARD /usr/local/share/GeoIP/GeoIP.dat"
While you're updating your GeoIP file path, see our Windows and Linux AWStats GeoIP installation instructions for information on MaxMind's newly available GeoLiteCity database.
There are a few other new options in AWStats which we'll be looking at shortly.
What is your experience with AWStats 6.5? Write us with your feedback.
The GeoIP Lite plugins provide country and city information about users ("hosts") connecting to your website. Organization information, either a large company or an ISP, is available using the AS Numbers database. Review our Windows and Linux AWStats GeoIP installation instructions.
AWStats is an excellent tool for small to medium sized websites. Larger sites may want to consider more advanced commercial tools at least to have better visitor recognition and user path (click stream) navigation analysis.
A few open source clickstream (user path navigation) analysis tools are available, albeit none currently integrate with AWStats. We have written rudimentary StatViz and Pathalizer installation and configuration instructions.
AWStats can show where visitors went after a certain page, or how they arrived at a certain page, using ExtraSections. Last updated: 2006-04-06.
# Assumes default page is "/" and is always referenced as /, not index.html etc.
# Assumes default page extension is html. This will thus exclude directory pages which appear as \
# Change html to your page suffix if different, i.e. htm.
ExtraSectionName25="Navigation from Home Page - Top 25"
ExtraSectionCodeFilter25="200 304"
ExtraSectionCondition25="REFERER,http:\/\/www\.mysite\.com\/"
ExtraSectionFirstColumnTitle25="URL"
ExtraSectionFirstColumnValues25="URL,(.*html$)"
ExtraSectionFirstColumnFormat25="%s"
ExtraSectionStatTypes25=PHBL
ExtraSectionAddAverageRow25=0
ExtraSectionAddSumRow25=1
MaxNbOfExtra25=25
MinHitExtra25=1
# Assumes default page is always linked to as "/". Some sites need to add index.html or default.asp as the case may be.
ExtraSectionName26="Navigation to Home Page from within site - Top 25"
ExtraSectionCodeFilter26="200 304"
ExtraSectionCondition26="URL,(^\/$)"
ExtraSectionFirstColumnTitle26="REFERER"
ExtraSectionFirstColumnValues26="REFERER,^http:\/\/www\.mysite\.com\/(.*)"
ExtraSectionFirstColumnFormat26="%s"
ExtraSectionStatTypes26=PHBL
ExtraSectionAddAverageRow26=0
ExtraSectionAddSumRow26=1
MaxNbOfExtra26=25
MinHitExtra26=1
AWStats uncovers a wealth of data about your website. Yet, to the untrained eye, report interpretation can be overwhelming. The following books on the field of Web Analytics can help you better interpret existing AWStats reports and provide inspiration for new reports through the extra section extensibility support.