Applies to: server log analytics, SDC analytics
The epitaph should be: “Here lies The Server Log. A pioneer and solid citizen for a very long time. It finally met its match.”
If you keep up with web analytics news, you may have heard about the recent fuss over the behavior of a free, widely used virus-protection program from the company AVG. In the last few weeks it has been noticed dumping tons of false hits (all with empty referrer fields) into server logs whenever your site comes up on a search results page. Its method of virus protection consists of following the links on search results pages, checking all of them for malware before the human searcher clicks on anything. As a result, every site whose link appears on a viewed search results page gets a non-human visit recorded in its server logs. (But not in its SDC or other tag-based logs, phew! The bot does not request images, so the SDC tag doesn’t get triggered. At this time.)
At this time, this bot leaves a very subtle calling card that can be (at this time, did we say that already?) the basis for an exclude hit filter. You’ll have to re-analyze your older data to get it out of your stats. We started seeing it in large quantities about May 1 although we could find it in April in lower numbers. The extent of false visits and hits depends on how much search activity your site gets. On one of our sites, on May 21, it accounted for:
- 1% of all hits
- 3% of all visits
- 9% of all “direct traffic” visits
- 12% of all visits identifiable as Pay-per-click by markers in the landing page
The hit filter is based on Browser and this string: ;1813
That’s semicolon one eight one three. That’s the only way to recognize it. And this method may not last long. Remember, the malware people want to recognize it too, and it won’t be hard for them to adjust their code to look for this marker in any file request and return a benign file, while still serving malicious stuff to browsers not so marked. So, it’s in the best interests of the anti-virus program to be absolutely indistinguishable from a human visitor. When they (AVG) completely anonymize the browser string, our ability to filter them out will be gone.
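For anyone re-analyzing older logs outside their analytics package, here is a minimal Python sketch of that exclude filter. It assumes your logs are in the W3C extended format (a "#Fields:" header line, space-separated values, and the user agent in the cs(User-Agent) field). The function names and the sample lines are made up for illustration, and the filter only helps for the period when the marker was still present.

```python
MARKER = ";1813"  # the calling card described above

def parse_w3c(lines):
    """Yield one dict per hit from W3C extended log lines (assumed format)."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))

def drop_marked_hits(records, ua_field="cs(User-Agent)"):
    """Keep only the hits whose browser string lacks the marker."""
    for rec in records:
        if MARKER not in rec.get(ua_field, ""):
            yield rec

# Invented sample lines, for illustration only.
sample = [
    "#Fields: cs-uri-stem cs(User-Agent) cs(Referer)",
    "/index.html Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;1813) -",
    "/index.html Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1) http://www.example.com/",
]

kept = list(drop_marked_hits(parse_w3c(sample)))
print(len(kept), "hit(s) kept")
```

A plain substring match is crude. If you want to be more careful, you could also require the empty referrer field that all of these hits share.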
[June 30 update – Well, they’ve anonymized their UA string. They’re now using one common for IE 6. If you filter this one out, you’ve filtered out many of your IE6 human visitors. But for the record, the string is: Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) ]
[July 6 update – as of July 9, AVG Linkscanner will no longer scan all results on a search page, instead scanning just those clicked on. Existing copies of the program should update themselves over the following few days]
If you want to know more, WebMasterWorld (http://www.webmasterworld.com) and AVG Watch (http://www.avg-watch.org) have good continuing discussions by savvy people as well as entertainment from a lot of newbie panickers. They’re good places for the latest, best, and most amusing info.
While we wait for it all to unfold, or perhaps the word is unravel, let’s go into the differences between server logs and SDC logs. You server log people should be aware of what you get if you switch and what you lose. You SDC people, probably pretty smug right now, should think about what you’ve already lost.
This part of this topic covers two questions about server logs and SDC logs –
- what are the good and bad differences?
- are there workarounds for the bad differences?
First, what are the differences? Here are all the ones we can think of, listed as things each kind of log doesn’t do that the other does:
A. Server logs don’t:
- Show page views resulting from using the back button (i.e. pages served from the cache in the visitor’s browser and not requested from the server)
- Show page views resulting from any other kind of caching — AOL’s cache of home pages, corporate proxy caches, and other local caches that display a saved page rather than make a fresh request to the server
- Show page views resulting from pages that have been copied to someone’s hard disk, then viewed directly from the hard disk. Nor do they show the drive and path on which the page has been stored.
- Show page views resulting from page code that has been, um, copied and repurposed on someone else’s site without removing the tag. Nor do they show the domain that is hosting those repurposed files.
- Capture clicks that jump to an anchor point in the same page (i.e. URLs such as /faq.html#item5)
B. SDC logs (using the ordinary SDC tag which is similar to the Google Analytics tag and many others) don’t:
- Show downloads of untaggable files such as pdfs, docs, swfs, etc that you might consider to be important
- Show requests for other kinds of files that most analysts don’t care about anyway – jpg, gif, css, js, etc
- Capture traffic from most spiders and bots
- Show 404 (Page Not Found) events
- Show 500 (Server Error) events
- Show time-to-serve (a standard field in server logs)
- Show KB sent or received as a result of the request (a standard field in server logs)
- Show POST events
- Show HEAD events
- Record page views where the visitor clicked away from the page immediately, before the SDC tag had time to load
- Capture virtual redirects
The Advanced SDC tag that comes out of the Tag Builder pretty much takes care of #1 and tracks other things, like form button submit clicks, that neither server nor tag logs normally get.
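To make a few of the list B items concrete, here is a minimal Python sketch that tallies status codes and request methods from a server log. That is where 404s, 500s, POSTs and HEADs show up while an ordinary tag-based log stays silent. It assumes the W3C extended format with the sc-status and cs-method fields present. The function names and sample lines are invented for illustration.

```python
from collections import Counter

def parse_w3c(lines):
    """Yield one dict per hit from W3C extended log lines (assumed format)."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))

def summarize(records):
    """Count hits by HTTP status code and by request method."""
    statuses = Counter()
    methods = Counter()
    for rec in records:
        statuses[rec.get("sc-status", "?")] += 1
        methods[rec.get("cs-method", "?")] += 1
    return statuses, methods

# Invented sample lines, for illustration only.
sample = [
    "#Fields: cs-method cs-uri-stem sc-status sc-bytes time-taken",
    "GET /index.html 200 5120 31",
    "GET /missing.html 404 512 4",
    "HEAD /index.html 200 0 2",
    "POST /form.asp 500 1024 250",
]

statuses, methods = summarize(parse_w3c(sample))
print("404s:", statuses["404"], "500s:", statuses["500"])
print("POSTs:", methods["POST"], "HEADs:", methods["HEAD"])
```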
In future posts we’ll talk about a few other work-arounds.