Tips, tricks, and pokes, just WebTrends Analytics

An epitaph for server logs? AVG Linkscanner

Applies to:  server log analytics, SDC analytics

The epitaph should be:  “Here lies The Server Log.  A pioneer and solid citizen for a very long time.  It finally met its match.”

If you keep up with web analytics news, you may or may not know about the recent fuss over the behavior of a free, widely-used virus-protection program from the company AVG.  In the last few weeks it has been noticed dumping tons of false hits (all with empty referrer fields) into server logs whenever your site comes up on search results pages.  Its method of virus protection consists of following the links on search results pages, to check all of them for malware — before the human searcher clicks on anything.   As a result, every link appearing on a viewed search results page gets a non-human visit recorded in its server logs.  (But not in its SDC or other tag-based logs, phew!  The bot does not request images, so the SDC tag doesn’t get triggered.  At this time.)

At this time, this bot leaves a very subtle calling card that can be (at this time, did we say that already?) the basis for an exclude hit filter.  You’ll have to re-analyze your older data to get it out of your stats.  We started seeing it in large quantities about May 1 although we could find it in April in lower numbers.  The extent of false visits and hits depends on how much search activity your site gets.  On one of our sites, on May 21, it accounted for:

  • 1% of all hits
  • 3% of all visits
  • 9% of all “direct traffic” visits
  • 12% of all visits identifiable as Pay-per-click by markers in the landing page

The hit filter is based on Browser and this string:

;1813

That’s semicolon one eight one three.  That’s the only way to recognize it.  And this method may not last long.  Remember, the malware people want to recognize it too, and it won’t be hard for them to adjust their code to look for this marker in any file request and return a benign file, while still serving malicious stuff to browsers not so marked.  So, it’s in the best interests of the anti-virus program to be absolutely indistinguishable from a human visitor.  When they (AVG) completely anonymize the browser string, our ability to filter them out will be gone.
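
For anyone who would rather pre-clean raw logs than rely on a hit filter, here is a minimal sketch of the same idea in Python.  It is not the WebTrends filter itself.  It assumes W3C extended format logs with a "#Fields:" directive and a cs(User-Agent) column, and the function name split_linkscanner_hits is made up for illustration.  Only the ";1813" marker comes from this post.

```python
MARKER = ";1813"  # the substring from the post; everything else is assumed


def split_linkscanner_hits(lines, marker=MARKER):
    """Split W3C extended log lines into (kept, excluded) lists.

    The user-agent column is located from the '#Fields:' directive, so that
    directive must appear before the data lines it describes.
    """
    ua_index = None
    kept, excluded = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
                ua_index = fields.index("cs(User-Agent)")
            kept.append(line)  # directives pass through unchanged
            continue
        if not line.strip():
            continue  # skip blank lines
        if ua_index is None:
            raise ValueError("no #Fields: directive seen before data lines")
        columns = line.split()
        user_agent = columns[ua_index] if ua_index < len(columns) else ""
        if marker in user_agent:
            excluded.append(line)
        else:
            kept.append(line)
    return kept, excluded


if __name__ == "__main__":
    # Illustrative lines only, not real log data.
    sample = [
        "#Fields: date time cs-uri-stem cs(User-Agent) cs(Referer)",
        "2008-05-21 10:00:00 /index.html Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;1813) -",
        "2008-05-21 10:00:05 /index.html Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1) http://example.com/",
    ]
    kept, excluded = split_linkscanner_hits(sample)
    print(len(excluded), "hit(s) excluded")
```

Counting the excluded list against the total number of data lines gives a rough share of hits carrying the marker.  Like the hit filter, this only works for as long as the marker stays in the browser string.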

[June 30 update - Well, they've anonymized their UA string.  They're now using one that is common for IE 6.  If you filter this one out, you've filtered out many of your IE6 human visitors.  But for the record, the string is:  Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)  ]

[July 6 update - as of July 9, AVG Linkscanner will no longer scan all results on a search page, instead scanning just those clicked on.  Existing copies of  the program should update themselves over the following few days]

If there ever was a good reason to change from server logs to SDC tags, this could be it.  Until, of course, programs like AVG start executing javascript (instead of just reading it) and requesting gifs.  At that point, we may as well start ignoring single-page-visits entirely, although in the server logs we looked at the average individual had 3 of these hits.  Of course, the AVG program may further morph to follow links off the landing page.  At that point we don’t have a lot of options.  Yes, this has been called a doomsday scenario for analytics.  Some of us however are confident this is a blip.  What do you think?

[June 30 update:  for a short time, the program requested the no-javascript SDC gif.]  

If you want to know more, WebMasterWorld (http://www.webmasterworld.com) and AVG Watch (http://www.avg-watch.org) have good continuing discussions by savvy people as well as entertainment from a lot of newbie panickers.  They’re good places for the latest, best, and most amusing info.

While we wait for it all to unfold, or perhaps the word is unravel, let’s go into the differences between server logs and SDC logs.  You server log people should be aware of what you get if you switch and what you lose.  You SDC people, probably pretty smug right now, should think about what you’ve already lost.

This part of this topic covers two questions about server logs and SDC logs -

  • what are the good and bad differences?
  • are there workarounds for the bad differences?

First, what are the differences?  Here is everything we can think of, listed as things each one doesn’t do that the other does:

A.       Server logs don’t:

  1. Show page views resulting from using the back button (i.e. pages served from the cache in the visitor’s browser and not requested from the server)
  2. Show page views resulting from any other kind of caching — AOL’s cache of home pages, corporate proxy caches, and other local caches that display a saved page rather than make a fresh request to the server
  3. Show page views resulting from pages that have been copied to someone’s hard disk, then viewed directly from the hard disk.  Also, show the drive and path on which the page has been stored.
  4. Show page views resulting from page code that has been, um, copied and repurposed on someone else’s site without removing the tag.  Also, show the domain that is hosting those repurposed files.
  5. Capture clicks that jump to an anchor point in the same page (i.e. URLs such as /faq.html#item5)
  6. Collect information from the browser regarding the following aspects of the visitor’s computer:  screen resolution; size of window; enablement of (or version of) JavaScript, Java, Flash; local time zone of the visitor’s computer, …
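
Item 5 comes down to how URLs are sent:  the part after the # stays in the browser and never goes to the server.  Here is a small sketch using Python’s standard urllib.parse to show what is left for the server to log.  The helper name request_target is made up for illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path (plus query string) that is sent to the server.

    The fragment (the part after '#') is dropped, because browsers keep it
    local and use it only to scroll within the page.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


if __name__ == "__main__":
    for clicked in ("/faq.html", "/faq.html#item5", "/faq.html#item9"):
        print(clicked, "->", request_target(clicked))
```

All three clicks look identical from the server’s side, and a click on an in-page anchor normally doesn’t produce a new request at all.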

B.       SDC logs (using the ordinary SDC tag which is similar to the Google Analytics tag and many others) don’t:

  1. Show downloads of untaggable files such as pdfs, docs, swfs, etc that you might consider to be important
  2. Show requests for other kinds of files that most analysts don’t care about anyway – jpg, gif, css, js, etc
  3. Capture traffic from most spiders and bots
  4. Show 404 (Page Not Found) events
  5. Show 500 (Server Error) events
  6. Show time-to-serve (a standard field in server logs)
  7. Show KB sent or received as a result of the request (a standard field in server logs)
  8. Show POST events
  9. Show HEAD events
  10. Record page views where the visitor clicked away from the page immediately, before the SDC tag had time to load
  11. Capture virtual redirects

The Advanced SDC tag that comes out of the Tag Builder pretty much takes care of #1 in list B (downloads) and tracks other things like form button submit clicks that neither server nor tag logs normally get.

 http://www.webtrendsoutsider.com/2008/customized-sdc-tag-builder/

In future posts we’ll talk about a few other work-arounds. 


    5 comments

    1 Jacques Warren { 06.20.08 at 10:24 am }

    Hi rocky,

    “Users of On-Demand can only filter out this beast going forward” is not entirely exact, since being on WTOD means using SDC, thus not susceptible to the AVG bot effect (yet?).

    Great post by the way. I wonder however if that bot can have a huge impact on a site number, since it would visit the site only when it showed in search engine results, and providing it was used by the visitor. Personally, I haven’t seen suspicious activity from a ;1813 browser (i.e. would show up at the top of Top Visitors), but I will look it up.

    2 rocky { 06.20.08 at 10:43 am }

    Ow. I guess it shows that I don’t use On Demand. Of course you’re right and I’ve removed it from the original post.

    3 rocky { 06.20.08 at 11:20 am }

    And I’ve added some stats to the post showing relative numbers.

    4 MitchellT { 06.21.08 at 12:31 pm }

    This is a very good post and it is thought provoking.

    Regarding http 4xx and 5xx errors, it should be easy enough to code custom pages for each error and place an image ‘ping’ tag to the SDC server so those pages may be tabulated similar to the way they are using logs.

    5 rocky { 06.29.08 at 5:45 pm }

    Exactly. We have a post on that almost ready.
