Doubles! Reasons for Discrepancies between WebTrends and Google Analytics Visit Counts

Three reasons why Google Analytics sees more visits than WebTrends does


Google Analytics usually shows more visits than WebTrends for the same site over the same time period.

There are three reasons:

  1. If a visit starts before midnight and finishes after midnight, Google Analytics counts two visits.  WebTrends counts one visit.
  2. If a page view arrives in the middle of a visit with a different campaign attached (organic search, paid search, or any hit with utm_campaign= in its URL), Google Analytics counts two visits.  In other words, if a visitor who has your site open in one tab uses a campaign or search link in another tab to reach the site again, Google Analytics treats that second arrival as the start of a new visit.  The same thing happens if the visitor backs out of the site and then returns via another search or campaign link.  In all of these cases, WebTrends counts one visit.  (Note: all of this assumes the visitor keeps moving with no gaps of 30 minutes or more.  See the sketch after this list.)
  3. If you have WT or GA tags on two or more domains, Google Analytics will start a new visit whenever the visitor crosses domains, unless the sites are linked and the links have been specially coded to transfer the Google Analytics visitor ID.  WebTrends counts one visit, except in Safari (and soon Firefox, and perhaps eventually other browsers) or in any other situation where third-party cookies are not accepted.
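
To make the two counting rules concrete, here is a minimal Python sketch of the behaviors described above.  The visit-splitting logic and the sample data are simplified assumptions for illustration, not either vendor's actual code:

    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)

    def count_visits(hits, ga_style=False):
        """Count visits in one visitor's stream of hits.

        hits: list of (timestamp, campaign-or-None), sorted by time.
        Both tools start a new visit after a 30-minute gap; GA (as
        modeled here) also splits at midnight and when a hit carries
        a different campaign.
        """
        visits = 0
        last_time = None
        last_campaign = None
        for ts, campaign in hits:
            new_visit = last_time is None or ts - last_time > TIMEOUT
            if ga_style and not new_visit:
                if ts.date() != last_time.date():
                    new_visit = True          # reason 1: crossed midnight
                elif campaign and campaign != last_campaign:
                    new_visit = True          # reason 2: new campaign mid-visit
            if new_visit:
                visits += 1
            last_time = ts
            if campaign:
                last_campaign = campaign
        return visits

    # One visitor: arrives via paid search at 23:50, clicks around past
    # midnight, then re-enters at 00:20 via an organic search link.
    start = datetime(2008, 6, 1, 23, 50)
    hits = [(start, "paid-search"),
            (start + timedelta(minutes=15), None),
            (start + timedelta(minutes=30), "organic-search")]
    print(count_visits(hits))                 # WebTrends-style: 1
    print(count_visits(hits, ga_style=True))  # GA-style: 3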

If you know of any extra wrinkles to this or other reasons for the higher visit count in Google Analytics, let us know!

Discrepancies between WebTrends and Google AdWords

To account for the inevitable discrepancies between reports from my main analytics program and Google AdWords (and the other sponsored-search programs), I’ve done a lot of research, including hit-by-hit comparisons.  Here is my current list of reasons for discrepancies.

Face it, they are never going to agree.   But that doesn’t mean the differences are random or stupid.

From time to time I compare my data and Google AdWords’ reports, day by day and keyword by keyword.  Sometimes I do it with log data.  I’ve gotta be sure I can have a clear conscience when I tell users that everybody’s doing the best they can.

From this work and things that others have talked about, I have tried to assemble a complete list of reasons for discrepancies. 

In all, I’ve seen differences in any month that range from -10% (WT lower) to +5% (WT higher).  Sometimes it’s spot-on, and it’s usually within 2% or so. 

Sometimes the discrepancies build up gradually, apparently involving long-term slow changes in visitor habits.  #1 below is an example.  Some are relatively sudden, due to changes by another site, by the PPC programs themselves, or by spiders and bots that come into existence and disappear quickly.

In a typical month, for the sites I watch, the effects of the factors I describe here tend to be modest and generally cancel each other out, except for #1 which is fairly constant and usually dominates the other statistical differences.  But over time it’s inevitable that we’ll have occasions when The Perfect Storm happens and the differences get really big.  I’ve even seen instances when the engines report a downturn and we see an uptick.  

Reasons for discrepancies (with Google AdWords or any of the PPC programs) include:

Related to WebTrends settings

  1. WebTrends reports on visits while the search engines report on clicks.  Or rather MY WebTrends reports on visits.  I set the paid-term dimension to record only the keywords seen on the first hit of a visit.  I’ve found that a surprising number of visitors do additional searches during their visit or they back out of the site then return by clicking on the same PPC link.  When they do that, the end result is a visit with two (or many) PPC clicks in it – one to start the visit, and others happening in the middle of the visit.  WebTrends (as I set it up) will report one visit while the search engines will report several clicks.  Result:  WebTrends is lower.
  2. WebTrends is filtering out some of the PPC visitors, such as your own IP address range.  Result:  WebTrends is lower. 

Invalid clicks

  1. Some fraudulent clicking software produces “hits” for search engines’ stats but the bots never actually reach the site.  Later on, the search engine may detect them and  give us a refund.  However, the search engines don’t retroactively change the click reports.  Result:  WebTrends is lower.  (Did you know that Google’s reporting allows you to get a count of the clicks they consider to be fraudulent, by keyword?)
  2. Some fraudulent clicking is caught immediately by the search engine and is removed from both billing and click reporting.  If those clicks do reach the site, the clicks still are seen as visits by WebTrends, which does not know they should be removed.  Result:  WebTrends is higher.
  3. Prefetch bots (notably the AVG Linkscanner bot we have been talking about for the last few weeks) can follow PPC links and produce false visits in WebTrends, although they may be caught and removed from reporting by the search engines.  We know that our detection methods for this bot are not as good as those used by the search engines.  Result:  WebTrends is higher.

Technical glitches

  1. Sometimes the search engines under-report for a day or so, usually because of technical difficulties at their end.  Result:  WebTrends is higher.  These discrepancies are occasional but can be large, often clustering in a two- or three-week period when the search engines make major changes to their systems.  It happens less than once a year, but the effect can be as much as 3% in a month.
  2. The search engines sometimes drop our tracking parameters from the URLs, usually because they misprocess one of our uploads or because of a flaw in the upload on our end.  This gets corrected when the search engines notice, or when we do.  Result:  WebTrends is lower.  It happens rarely, once or twice a year, and the size of the discrepancy varies.

Clicks that don’t get into SDC logs (server logs are not affected)

  1. A class of browser add-ins often called “web beacon blockers” will prevent a visitor from being tracked by SDC and other javascript, but won’t prevent that visitor from being counted by the PPC engine if they click on a sponsored search link.  One of the most popular is the Adblock Firefox plugin (there aren’t good numbers on how extensively it’s used).  Result:  WebTrends is lower.
  2. If your landing page is longish or slow, and the SDC tag is properly at the end of the page, the visitor may leave before the SDC tag has time to load.  Result:  WebTrends is lower.
  3. You have PPC ads directed to a site page (like a special landing page) that you neglected to tag for SDC data collection.  Result:  WebTrends is lower.  For a big campaign, a LOT lower.  Voice of experience here!

Mistakenly tagged links producing real traffic, but not from PPC

  1. Search engine spiders can follow PPC ads on other sites and record the entire link in their database, including its PPC tracking parameters.  As a result, they display in their natural search results some links that contain PPC tracking parameters.  If someone clicks on one of those links, the resulting visit is actually a natural search visit, but the landing page URL tells WebTrends it is a PPC visit.  In the past, we have seen this be self-correcting, because the search engines eventually notice the discrepancy and change the listed URL to the one they see most often, i.e. the one without tracking parameters.  Result:  WebTrends is higher.
  2. Owners of other sites can click on PPC ads and then add a link on their site using the landing page URL they saw, i.e. one that contains PPC tracking parameters.  The result is a visit that appears to be a PPC visit but is not.  These also tend to be self-correcting over time.  Result:  WebTrends is higher.

An epitaph for server logs? AVG Linkscanner

Applies to:  server log analytics, SDC analytics

The epitaph should be:  “Here lies The Server Log.  A pioneer and solid citizen for a very long time.  It  finally met its match.”

If you keep up with web analytics news, you may or may not know about the recent fuss over the behavior of a free, widely-used virus-protection program from the company AVG.  In the last few weeks it has been noticed dumping tons of false hits (all with empty referrer fields) into server logs whenever your site comes up on search results pages.  Its method of virus protection consists of following the links on search results pages, to check all of them for malware — before the human searcher clicks on anything.   As a result, every link appearing on a viewed search results page gets a non-human visit recorded in its server logs.  (But not in its SDC or other tag-based logs, phew!  The bot does not request images, so the SDC tag doesn’t get triggered.  At this time.)

At this time, this bot leaves a very subtle calling card that can be (at this time, did we say that already?) the basis for an exclude hit filter.  You’ll have to re-analyze your older data to get it out of your stats.  We started seeing it in large quantities about May 1 although we could find it in April in lower numbers.  The extent of false visits and hits depends on how much search activity your site gets.  On one of our sites, on May 21, it accounted for:

  • 1% of all hits
  • 3% of all visits
  • 9% of all “direct traffic” visits
  • 12% of all visits identifiable as Pay-per-click by markers in the landing page

The hit filter is based on Browser and this string:

;1813

That’s semicolon one eight one three.  That’s the only way to recognize it.  And this method may not last long.  Remember, the malware people want to recognize it too, and it won’t be hard for them to adjust their code to look for this marker in any file request and return a benign file, while still serving malicious stuff to browsers not so marked.  So, it’s in the best interests of the anti-virus program to be absolutely indistinguishable from a human visitor.  When they (AVG) completely anonymize the browser string, our ability to filter them out will be gone.
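
While the marker lasts, the same exclusion can be done outside WebTrends when re-analyzing older data.  Here’s a minimal Python sketch, assuming a combined-format log where the user-agent is the last quoted field (the file names are placeholders):

    import re

    AVG_MARKER = ";1813"   # the Linkscanner calling card described above
    UA_FIELD = re.compile(r'"([^"]*)"\s*$')   # user-agent: last quoted field

    def without_linkscanner(lines):
        """Yield only the log lines whose user-agent lacks the AVG marker."""
        for line in lines:
            match = UA_FIELD.search(line)
            if match and AVG_MARKER in match.group(1):
                continue   # drop the false hit
            yield line

    # Placeholder file names -- point these at your own logs.
    with open("access.log") as src, open("access.clean.log", "w") as dst:
        dst.writelines(without_linkscanner(src))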

[June 30 update – Well, they’ve anonymized their UA string.  They’re now using one common to IE 6.  If you filter this one out, you’ll filter out many of your IE6 human visitors.  But for the record, the string is:  Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)  ]

[July 6 update – as of July 9, AVG Linkscanner will no longer scan all results on a search page, instead scanning just those clicked on.  Existing copies of the program should update themselves over the following few days.]

If there ever was a good reason to change from server logs to SDC tags, this could be it.  Until, of course, programs like AVG start executing javascript (instead of just reading it) and requesting gifs.  At that point we may as well start ignoring single-page visits entirely, although in the server logs we looked at, the average individual had 3 of these hits.  Of course, the AVG program may further morph to follow links off the landing page.  At that point we don’t have a lot of options.  Yes, this has been called a doomsday scenario for analytics.  Some of us, however, are confident this is a blip.  What do you think?

[June 30 update:  for a short time, the program requested the no-javascript SDC gif.]  

If you want to know more, WebMasterWorld (http://www.webmasterworld.com) and AVG Watch (http://www.avg-watch.org) have good continuing discussions by savvy people as well as entertainment from a lot of newbie panickers.  It’s a good place for the latest, best, and most amusing info. 

While we wait for it all to unfold, or perhaps the word is unravel, let’s go into the differences between server logs and SDC logs.  You server log people should be aware of what you get if you switch and what you lose.  You SDC people, probably pretty smug right now, should think about what you’ve already lost.

The rest of this topic covers two questions about server logs and SDC logs –

  • what are the good and bad differences?
  • are there workarounds for the bad differences?

First, what are the differences?  Here are all the differences we can think of, expressed as things one doesn’t do that the other does:

A.       Server logs don’t:

  1. Show page views resulting from using the back button (i.e. the cache in the visitor’s browser and not requested from the server)
  2. Show page views resulting from any other kind of caching — AOL’s cache of home pages, corporate proxy caches, and other local caches that display a saved page rather than make a fresh request to the server
  3. Show page views of pages that have been copied to someone’s hard disk and then viewed directly from the disk.  (SDC can even show the drive and path where the page was stored.)
  4. Show page views resulting from page code that has been, um, copied and repurposed on someone else’s site without removing the tag.  (SDC also shows the domain hosting those repurposed files.)
  5. Capture clicks that jump to an anchor point in the same page (i.e. URLs such as /faq.html#item5)
  6. Collect information from the browser about the visitor’s computer:  screen resolution, window size, enablement (or version) of javascript, java, and flash, the local time zone of the visitor’s computer, and so on

B.       SDC logs (using the ordinary SDC tag which is similar to the Google Analytics tag and many others) don’t:

  1. Show downloads of untaggable files such as pdfs, docs, swfs, etc., that you might consider to be important
  2. Show requests for other kinds of files that most analysts don’t care about anyway – jpg, gif, css, js, etc.
  3. Capture traffic from most spiders and bots
  4. Show 404 (Page Not Found) events
  5. Show 500 (Server Error) events
  6. Show time-to-serve (a standard field in server logs)
  7. Show KB sent or received as a result of the request (a standard field in server logs)
  8. Show POST events
  9. Show HEAD events
  10. Record page views where the visitor clicked away from the page immediately, before the SDC tag had time to load
  11. Capture virtual redirects

The Advanced SDC tag that comes out of the Tag Builder pretty much takes care of #1, and it tracks other things, like form-button submit clicks, that neither server nor tag logs normally get.

 http://www.webtrendsoutsider.com/2008/customized-sdc-tag-builder/

In future posts we’ll talk about a few other work-arounds. 
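
In the meantime, as a rough do-it-yourself illustration of list B, here’s a Python sketch that reduces a server log to approximately the page views a tag would have recorded.  The log-format assumptions, extension list, and bot markers are simplified placeholders, not a complete workaround:

    # Reduce a combined-format server log to roughly the page views a
    # tag-based collector would record, per list B above.  The field
    # positions, extension list, and bot markers are assumptions.
    ASSET_EXTENSIONS = (".jpg", ".gif", ".png", ".ico", ".css", ".js")
    BOT_MARKERS = ("bot", "spider", "crawler")   # illustrative, incomplete

    def looks_like_tagged_page_view(line):
        parts = line.split('"')
        if len(parts) < 6:
            return False                      # not a combined-format line
        method, _, rest = parts[1].partition(" ")
        url = rest.split(" ")[0].lower()
        status = parts[2].split()[0]
        user_agent = parts[5].lower()
        if method != "GET":                   # B.8/B.9: tags don't see POST/HEAD
            return False
        if url.split("?")[0].endswith(ASSET_EXTENSIONS):   # B.2: asset files
            return False
        if not status.startswith("2"):        # B.4/B.5: errors never fire a tag
            return False
        if any(marker in user_agent for marker in BOT_MARKERS):   # B.3, crudely
            return False
        return True

    with open("access.log") as src:           # placeholder file name
        page_views = [line for line in src if looks_like_tagged_page_view(line)]
    print(len(page_views), "tag-like page views")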

Cool Custom Report: Actual vs Paid-for PPC Search Terms

If you have a PPC (pay-per-click) search effort happening, you may be allowing the PPC engines to loosely match your paid-for terms with various terms typed in by human searchers.

To keep abreast of what’s being matched to what (and maybe identify inappropriate matches that can be avoided by adding negative search terms, or get ideas for cheaper exact-match terms), try this report:

  1. Primary dimension:  the paid-for term, derived from the marker parameters in the landing page URL (see the related post linked below)
  2. Secondary dimension:  the actual term, derived from the referrer information, using the out-of-the-box dimension “Search Phrase”

You end up with a list of all your paid-for terms, and under each term a list of the actual typed-in terms.  If you have appreciable amounts of broad-match traffic, we guarantee you’ll find something astonishing or hilarious, or worthwhile at the very least.
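
If you’d like to prototype the same rollup outside WebTrends, here’s a minimal Python sketch.  The marker parameter name (ppckw) is hypothetical, and the referrer parsing assumes a Google-style q= parameter; adjust both for your own setup:

    from collections import defaultdict
    from urllib.parse import urlparse, parse_qs

    def paid_vs_actual(records, marker="ppckw"):
        """records: (landing_url, referrer_url) pairs for PPC visits.

        Groups the actual typed-in phrase (Google-style 'q' referrer
        parameter) under the paid-for term carried in the hypothetical
        'ppckw' marker parameter of the landing page URL.
        """
        report = defaultdict(lambda: defaultdict(int))
        for landing, referrer in records:
            paid = parse_qs(urlparse(landing).query).get(marker)
            actual = parse_qs(urlparse(referrer).query).get("q")
            if paid and actual:
                report[paid[0]][actual[0]] += 1
        return report

    records = [("http://example.com/land?ppckw=blue+widgets",
                "http://www.google.com/search?q=cheap+blue+widgets"),
               ("http://example.com/land?ppckw=blue+widgets",
                "http://www.google.com/search?q=blue+widget+jokes")]
    for paid, actuals in paid_vs_actual(records).items():
        print(paid)
        for phrase, count in sorted(actuals.items(), key=lambda kv: -kv[1]):
            print("   ", count, phrase)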

If you’re puzzled by the primary dimension described above, you probably don’t have those marker parameters.  See our topic on Yahoo PPC marker parameters to find out how to get Yahoo to supply them for you.  As for the other PPC engines, Google and MSN, they don’t make it easy.  You have to add the marker parameters yourself, either with explicit landing page URLs or with macros.  But it’s well worth the trouble.

Related post: Getting Yahoo PPC to add its own markers to landing page URLs

WebTrends dimensions: Visit dimensions versus hit dimensions and how they (don’t always) work together

There are unwritten rules about how dimensions can be combined in 2-D reports.  Well, the rules are not exactly unwritten.  I think they’re somewhere in the documentation.  But here, in the WebTrends Outsider, they are in plain language.  The rules have to do with the incredibly important distinctions between hit dimensions and visit dimensions.

Rule 1:
In two-dimension custom reports, the first dimension must be broader than the second.

Visit is broader than hit.  Hit is finer-grained than visit.

So you can use these combinations of Primary > Secondary dimensions in a custom report:

Visit > Visit
Visit > Hit
Hit > Hit

And you must NEVER use this combo:

Hit > Visit

For the latter, you will get a report with results, but it will freeze your soul if you look at it too closely, and astute consumers of your data will laugh at you, or worse.

Rule 2:
For Hit-Hit 2D reports, both events must happen in the same hit!

Of course, Rule 1 raises the question of what is a hit-based dimension and what is a visit-based dimension.  It’s not always intuitive, and it’s definitely not in the UI or, as far as I can tell, in the documentation.  Here’s the correct list for all the currently available dimension choices and how WebTrends categorizes them.  Pay attention, because you’d probably guess wrong on some.

Hit Dimensions:

  • browser
  • browser version
  • content group
  • cookie parameter
  • day of week
  • directory
  • download
  • extension
  • hour of day
  • pages
  • any custom drilldown
  • query parameter (when collected on “all hits” or “hits that match xxx” or “most recent value”)
  • query string (when collected on “all hits” or “hits that match xxx” or “most recent value”)
  • referrer (labeled “per hit”)
  • return code
  • server
  • time period
  • url
  • url with parameters

Visit Dimensions:

  • ad campaign
  • agent
  • area code
  • authenticated username
  • city
  • country
  • dma
  • domain name
  • duration
  • entry page
  • entry request
  • exit page
  • geography drilldown
  • MSA
  • network
  • network type
  • new vs returning
  • organization
  • page views
  • platform
  • PMSA
  • query parameter (when collected on “first occurrence” or “last occurrence”)
  • query string (when collected on “first occurrence” or “last occurrence”)
  • referring page (the one labeled “initial per visit”)
  • referring site
  • referring top level domain
  • search engine
  • search keywords
  • search phrase
  • state
  • throughput
  • time zone
  • top level domain
  • visitor
  • visitor segment (WT.seg and WT.vhseg)