Use URL Search/Replace to Undo Hard-Coded Content Groups

URL Search & Replace – using it to remove hard-coded content groups from your reports.

 

This post is about a specific use for the Webtrends “URL Search and Replace” functionality.  We wrote about URL S&R in a general way in this post.

You should know about URL S&R because once in a while it’s very helpful.  Irreplaceable, in fact (haha).

[Screenshot: the URL Search & Replace settings screen]

Basically, what URL Search/Replace does is this:

The first task the Webtrends processing engine performs is to look at the URL of the hit it’s about to process and to check whether any “Search and Replace” rule matches that URL.  If yes, it makes the specified change then sends the altered hit to be processed as usual.  If no, it sends the original hit, unchanged, to be processed as usual.  That’s it.  The important thing that makes it so useful is that Webtrends does this before absolutely any other processing of that hit.
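Webtrends' matching rules aren't literally regular expressions, but conceptually the pipeline looks like this sketch (the rule, the URLs, and the function names here are hypothetical stand-ins, not Webtrends internals):

    import re

    # Hypothetical stand-in for rules you'd define in the S&R interface.
    SEARCH_REPLACE_RULES = [
        (re.compile(r"sessionID=[^&]*&?"), ""),   # e.g. strip an archaic session parameter
    ]

    def process_as_usual(url):
        print("processing:", url)                  # stand-in for the normal analysis pipeline

    def handle_hit(raw_url):
        url = raw_url
        for pattern, replacement in SEARCH_REPLACE_RULES:
            url = pattern.sub(replacement, url)    # S&R runs FIRST, before anything else
        process_as_usual(url)                      # then the (possibly altered) hit is processed

    handle_hit("/page.asp?sessionID=12345&WT.ti=Home")   # processes /page.asp?WT.ti=Home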

I don’t know of any other web analytics tool that allows this, but I could be wrong.

Examples of uses:

  • Take a dedicated landing page URL and add the WT.mc_id parameter that you should’ve put there in the first place but forgot, so the traffic shows up in campaign reports that depend on seeing WT.mc_id.
  • Change “redir.jsp?url=othersite.com/whateverpage.asp” into “othersite.com/whateverpage.asp” so you can see redirects in pages reports in a less confusing way.
  • Remove the parameter “sessionID=whatever” from all URLs in case you have those kinds of archaic things happening.
  • (if you process server logs rather than SDC data) change an important image into a page file, i.e. change “importantimage.jpg” into “importantimage.html”.

And, the subject of today’s post …

  • Make Webtrends completely ignore any hard-coded content groups (WT.cg_n) and only use the UI-defined content groups you have turned on for that profile.

Why?  If you have hard-coded content groups, they will show up everywhere  – in content group reports and also in content group path reports.  If you want to look at back-and-forth travel among a few select content groups that you defined in the UI, those hard-coded groups mess up everything.  (I know some of you out there have discovered Content Group paths, so this post is for you!)

The answer to the mess is to devote a profile to those select UI-defined content groups and, in that profile, make Webtrends blind to the hard-coded ones.

Here’s how:

Since hard-coded content groups contain the text “WT.cg_n=<something>&”, you can “remove” them all with this configuration in the S&R interface:

    • Replace from: Start of first WT.cg_n=
    • Up to: End of next &
    • With: <empty field> (i.e. nothin’)
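If you want to sanity-check the rule against sample hit URLs before turning it on, the same edit is easy to express as a regular expression.  A rough Python equivalent (the UI rule itself is not a regex, and the sample URL is made up):

    import re

    # "Start of first WT.cg_n=" up to "end of next &", replaced with nothing.
    strip_cg = re.compile(r"WT\.cg_n=[^&]*&")

    url = "/page.asp?WT.cg_n=Products&WT.cg_s=Shoes&WT.ti=Some%20Page"
    print(strip_cg.sub("", url, count=1))
    # -> /page.asp?WT.cg_s=Shoes&WT.ti=Some%20Page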

Note that this will leave any content subgroups in place, which is not a big deal – these don’t show in Content Group reports, the Content Group dimension, Content Group paths, or anything else.  If you really want to suppress the subgroups also, use the specification below, which relies on the fact that the WT.sys parameter pretty much always follows WT.cg_n.  (You might want to check with a debugger or in an actual log file to be absolutely sure.)

    • Replace from: Start of first WT.cg_n=
    • Up to: Start of next WT.sys=
    • With: <empty field>
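The regex version of this one, for checking against a sample log line (again, the sample URL is invented):

    import re

    # "Start of first WT.cg_n=" up to "start of next WT.sys=", replaced with
    # nothing -- this removes the content group AND any WT.cg_s subgroups in between.
    strip_cg_and_subs = re.compile(r"WT\.cg_n=.*?(?=WT\.sys=)")

    url = "/page.asp?WT.cg_n=Products&WT.cg_s=Shoes&WT.sys=script"
    print(strip_cg_and_subs.sub("", url, count=1))
    # -> /page.asp?WT.sys=script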

That’s it.  Once you have made the S&R rule, just turn it on for the selected profile.  Make sure only the important UI-defined content groups are active in that profile.

If you have any other outrageous examples of using URL S&R, let us know!

Postscript:

I realize that Webtrends probably prefers that we only use hard-coded content groups and that they (Webtrends) are trying to lead us in that direction.  It’s true that UI-created content groups use processing time and may not make it easy to architect some functions and reports.  But I think that’s a bit wrong-headed, because the UI-based ones are so much more versatile.  Google Analytics’ recent addition of content groups to their UI is, I think, validation of this.  First of all, UI-defined content groups can be created really fast.  Second, they can be turned on and off as needed, just by assigning/unassigning them to a profile, individually.  Agree?  Disagree?  Feel free to write to us.

Use Webmaster Tools to Clean Up Organic Search URLs

You probably have listings in search indexes that incorrectly contain marketing parameters, resulting in incorrect campaign reporting. Get them out of the search index using Webmaster Tools.

If a search engine has, in its index, listings for your page URLs that contain campaign parameters such as WT.mc_id or even utm_campaign, you have a reporting problem.  Anybody who clicks on one of those organic listings will show up in your reporting as having come from the campaign AND having come from organic search.  The latter is good, the former is bad.  Your campaign stats will be inflated above the real numbers.

Yes, Google, Bing, and Yahoo frequently do have, in their indexes, URLs with campaign parameters.  See our Canonical URLs post for a description of several ways those superfluous parameters sneak into the index.

To see if your own site has any, take a break right now and run a search on this phrase and scan all the results:

site:yoursitenamehere.com (yes, include “site:”)

If you found more campaign-identified links than you wanted to see, there are things you can do.

One is to use Canonical URLs in your page code.  However, be warned that only Google actively honors this – Yahoo and Bing still are not complying with the Canonical URL tag after two years!

So … forget the Canonical URL tag.  Use Google and Bing/Yahoo Webmaster Tools to keep superfluous URL parameters out of the indexes.

(These instructions assume you already have Webmaster Tools accounts on both Google and Bing.  If not, you are missing out!)

Below are the steps we use.

  1. Go first to Google Webmaster Tools (GWT).  Do not go to Bing/Yahoo first.
  2. In Google Webmaster Tools, go to Site Configurations >> Settings >> Parameter Handling tab
  3. GWT will show you a list of parameters that it has found during its crawls.
  4. For those parameters you want Google to omit from the URLs in its index, change the Action to “Ignore.”
  5. Print the list to paper – you’ll need it for the next step.
  6. Save and close GWT
  7. Go to Bing Webmaster Tools
  8. In Bing Webmaster Tools, go to Crawl >> Crawl Settings
  9. Using the list you printed, enter all the parameters you want suppressed

The tip embedded in the above is to use Google Webmaster Tools to get a pretty complete list of all possible parameters. In fact, you’ll probably see parameters you aren’t aware of or have forgotten about.  Then, with your sure-to-be-complete list of parameters, it’s easy to fill up the Bing/Yahoo list, which is a blank slate with no starter list like Google has.

What if the Google list shows parameters that you don’t understand or are not familiar with?

Here’s a second tip that we discovered by accident.  You can coax information from the Google Webmaster Tools parameter list that will help you figure out what some of the parameters are all about.  It won’t be a complete answer, but it will help.

You have to be using Internet Explorer or Chrome (Firefox probably works too).  Opera, our favorite ultra-fast browser for home use, doesn’t work for this.

The GWT list of parameters looks something like this:

[Screenshot: the parameter list in Google Webmaster Tools’ Parameter Handling tab]

In the screen shot, note that the second column, Action, is all drop-down lists.  If any of your rows are NOT dropdown lists on your screen, click on the “Edit” or “Reset” link at the far right.  The second column item should turn into a dropdown list.

With your mouse, click on the heading or the first parameter, drag to select the whole list, and copy it to the clipboard.

[Screenshot: selecting and copying the parameter list]

Paste it into Word.  Not Paste Special, but Paste.  It should paste as messy HTML, with extra stuff, like this:

[Screenshot: the list pasted into Word, showing two dropdowns per row, with a red arrow pointing at the second one]

Note the red arrow above.  The MS Word pasted copy shows TWO dropdown menus per row, not one!  The second dropdown is live.  Click on it and you’ll see the known values of the parameter, as below:

[Screenshot: the second dropdown expanded, listing the parameter’s known values]

This second dropdown has been there all along, but was coded to not display in a browser window.  Copying and pasting it to Word just happens to make it visible.  Cool eh?  You can also use a debugger such as Fiddler to break the invisibility in the browser window, but using Word is much faster.

Now you have more information on what the mystery parameters are all about.  Some will have only one value and might be typos in the code.  Others, like “denomination” above, are revealed to have values of 10, 20, 50, 100 and so on … which we immediately recognized as denominations for gift card purchases.  Not a necessary parameter.

So, with the new information, you can set even more parameters to “ignore” and clean up your organic listings further.

While you’re in Webmaster Tools, especially the Google one, look around.  There is some very useful stuff in there.

Canonical URLs – Why You Should Care

Your site may have a page that can be reached through more than one URL variation – with and without WT.mc_id for example. This can cause search engine spiders to record more than one URL for that page, and that’s a bad thing. You can prevent the marker parameters (such as WT.mc_id) from being recorded by spiders, by using canonical link-tags.

Until Google published their Canonical URL link-tag standard in February 2009, we Outsiders hadn’t seen the word “canonical” in actual written form since grad school.

Anyway, one meaning of the word canonical is “the simplest form.”  In other words, a “canonical” mathematical model is the model with the fewest possible rules and variables, out of all possible mathematical models for a thing.

We love the irony of an obscure multisyllabic word that means “simple.”

They might have called it the “Standard for Preventing URL variations from being indexed.”  The SFPUVFBI.

You should care about Canonical URLs when you have a page on your site that can potentially be reached using multiple URLs, due to tracking parameters.

Example:

Your page http://yoursite.com/promo.asp can be reached through the plain URL.

But:

  • Because you want to track clicks from a special promo graphic on your pages, you’ve hard-coded the promo graphic’s link to go to /promo.asp?WT.ac=fromhomepage.  Or maybe /promo.asp?prevpage=homepage.
  • There are affiliate sites with links to your promo.asp page, and they have helped your tracking by hard-coding their links to go to /promo.asp?source=affiliatesitename
  • A search engine has followed a banner or affiliate or even paid search link that contains campaign parameters, and displays an organic search listing going to /promo.asp?WT.mc_id=oprahbanner or /promo.asp?WT.mc_id=paidsearchmsn
  • Somebody followed a link from one of your campaign emails and copied what they saw in their address bar into their blog, resulting in links going to /promo.asp?source=Feb2011email.
  • A search engine picks up that same link from that blog and puts it in the index.

See, that promo page can be reached through five or six different URLs, a plain one and several others with campaign parameters in them.

It should be obvious that the only one you want to be in the search engine index is the plain one.  If the wonky URLs are indexed and clicked on, your campaign reports will report on visits that appeared to come from a campaign … but actually came from an organic search listing.

This is bad.

There are other problems, too.  At best, multiple versions of a page’s URL will water down the ranks or PageRank for these pages.  At worst, the search engine will assume it’s seeing spamming, i.e. duplicate content on multiple pages.

The Canonical URL tag will fix all of the above

All of this can be avoided by adding code to your page’s <head> section.  This bit of code was announced by Google back in February and has since been adopted, at least in intention, by other search engines such as MSN-LiveSearch, Yahoo and Ask.

(February 2011 update – Bing/Yahoo still ignores the Canonical tags!  We’re not sure about Ask.  We’ve added another post that uses Webmaster Tools to do pretty much the same thing as Canonical tags, and it WILL affect Bing/Yahoo.)

The code snippet is used only by the search engine spiders.  It states the one and only way you want the page to appear in their indexes.

<link rel="canonical" href="http://www.yoursite.com/subdirectory/promo.asp" />

Remember, it goes inside the  <head> section of the page.

By the way, this tag doesn’t affect what WebTrends sees or SDC records at all.  It’s only used by search engine spiders.

Think about how many of your pages might be affected by this problem, considering your banners, pay-per-click, affiliates, and on-site advertising.  If you have a lot of them, like we do, you might want to program your content management system to automatically put the canonical link-tag into the header of every page.
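If your CMS templates are scriptable, the logic is tiny.  Here is a minimal sketch in Python; the list of tracking parameters is hypothetical, so substitute whatever your site actually uses:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Hypothetical list of parameters to strip; adjust for your site.
    TRACKING_PARAMS = {"WT.mc_id", "WT.ac", "source", "prevpage", "utm_campaign"}

    def canonical_link_tag(url):
        """Build a <link rel="canonical"> tag for the URL minus tracking parameters."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in TRACKING_PARAMS]
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
        return '<link rel="canonical" href="%s" />' % clean

    print(canonical_link_tag("http://www.yoursite.com/promo.asp?WT.mc_id=oprahbanner"))
    # -> <link rel="canonical" href="http://www.yoursite.com/promo.asp" />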

Page titles in reports – where do they come from?

(Applies to:  server log data sources, SDC)

If you are using server log files, have turned on “Retrieve HTML Page Titles,” and if WebTrends doesn’t already know the title, WebTrends actually visits your site to collect the title.  Well, it visits what you’ve told it is the site.  It goes to the domain that you entered in the “Web Site URL” field on the “Home” tab of the profile’s setup.  Once on the domain, it looks for the exact URL displayed in the report.  WebTrends does NOT use the domain it finds in your logfiles.  It uses the domain you specified in the setup.

So, if you put a non-existent domain name in the “Web Site URL” field, WebTrends will not find your web site and will not collect any titles.
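As a rough illustration, the retrieval amounts to something like this sketch (not Webtrends’ actual code; the function and variable names are invented):

    import re
    from urllib.request import urlopen

    def fetch_title(web_site_url, page_path):
        """Visit the configured domain + the report's URL and pull out the <title>."""
        # The domain comes from the profile's "Web Site URL" setting,
        # NOT from the domain recorded in the log files.
        with urlopen(web_site_url.rstrip("/") + page_path) as resp:
            html = resp.read(65536).decode("utf-8", errors="replace")
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        return match.group(1).strip() if match else None

    # fetch_title("http://yoursite.com", "/promo.asp")  ->  the page's title, or None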

I mentioned “… if WebTrends doesn’t already know the title.”  Here’s the deal.  For every profile, WebTrends creates a cache file of all the URLs and their titles, as it finds them.  WT checks in that file first, and then visits the site only if it doesn’t find the URL in the title cache file for the profile.

The file’s name is [profileGUID].wdb and it’s here: …/WebTrends/storage/config/wtm_wtx/datfiles/titles/

The advantage of having this file around, for server log analysis, is that WebTrends isn’t constantly visiting your site to get titles.   (Note:  when you do a fresh server log analysis of a profile and you’ve turned on HTML Title Retrieval, WebTrends will hit your server a LOT – once for each unique page.  To avoid frenzy on the part of your hosting people, you might want to run that first analysis at night or just turn off HTML Title Retrieval.  Or you can create a fake title cache file using the fresh profile’s GUID and fill it with titles from a different profile for the same site.)

Another nice thing about the file is that you can go into it and change the titles.   And if you’ve got a title cache file that you’re happy with, as I mentioned above, you can copy the entire contents into the title cache file belonging to a different profile.

A disadvantage of this file is that once WT has a title/URL combination in that title cache file, it won’t know about any later title changes you make. That is, until the individual line in the *.wdb cache expires. The default expiration time is 14 days and it can be set globally (for all profiles) here: Web Analysis >> Options >> General >> HTML Titles.  You can also force a fresh title-collecting effort by emptying that file.

For advanced people:  A quirk of Page Titles in reports is that if you’ve got parameter truncation turned on for a file type (i.e. your Pages report does not display parameters in the URL) and if your page titles vary according to the content of those parameters, WebTrends doesn’t really know which of the page title variations to use for a truncated file name.  Get it?  Suppose you have /product.asp?productID=123 that uses title “Product 123’s Page” and /product.asp?productID=456 that uses title “Product 456’s Page”.  If WebTrends is set up to display only /product.asp in the Pages report, it won’t know which title variation to display.  The answer:  WT is set to display and store (in the title cache file) the very first instance of the page title that it comes across in the logs.  And forever after, /product.asp will be displayed with that page title.  In that situation, it’s a really good idea to edit the title cache file so a non-confusing title is shown.
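The effect is the same as a write-once dictionary keyed by the truncated URL.  A quick sketch:

    title_cache = {}   # truncated URL -> title; the first title seen wins

    def remember_title(url, title):
        # setdefault stores the title only if the URL isn't cached yet, so the
        # FIRST title seen sticks (until the cache line expires or you edit the .wdb).
        return title_cache.setdefault(url, title)

    print(remember_title("/product.asp", "Product 123's Page"))   # -> Product 123's Page
    print(remember_title("/product.asp", "Product 456's Page"))   # -> Product 123's Page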

One final thing while we’re talking about page titles.  Suppose you want to suppress page titles completely in your reports.  If you use server logs, clear the titles file for that profile and turn off Retrieve HTML Page Titles in the UI. 

A postscript about SDC:

If you are using SDC for data collection, WebTrends looks for the title in the title cache file, as described above.  The title cache file was filled using the values of the WT.ti parameter in the SDC log.  If WebTrends doesn’t find the URL in the title cache file, it gets the title from the current SDC log file line.  WT.ti, in turn, obtained the title from the <title> tag of the page the SDC tag is on.  Or, if there’s a WT.ti <meta> tag in the head (put there by you, because you want to override the <title>), SDC will collect the <meta> WT.ti information instead of the page’s title.  The <meta> tag trumps the <title> tag.
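That precedence, sketched in Python (the names are illustrative, not Webtrends internals):

    def resolve_title(title_cache, url, wt_ti):
        """Title cache first; otherwise fall back to WT.ti from the current SDC hit."""
        if url in title_cache:
            return title_cache[url]
        # WT.ti itself was set by SDC from the page's WT.ti <meta> tag if present,
        # otherwise from the page's <title> tag (the <meta> trumps the <title>).
        if wt_ti is not None:
            title_cache[url] = wt_ti
        return wt_ti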

If you use SDC and want to suppress titles in reports, clear the titles file for that profile and modify the *.wlp file by adding a section called [autoconfig] plus this line:

[autoconfig]
HtmlTitleAutoConfig=WT.nonexistent

WT.nonexistent can be any WT. parameter name as long as it doesn’t exist. 

Five ways WebTrends can change how a page appears in reports

Betcha didn’t know there are so many ways to affect this.  Usually you won’t care, but you might.  I don’t think anybody has ever listed them all side-by-side. 

  1. Suppression of all query parameters — this is an on-off toggle in the Page File Types and Download File Types screens, called “truncation.”  When truncated, the Pages report (and others, like the Paths reports) display only the filename.  When not truncated, ALL parameters get displayed.  (This can be a horrible sight; you can get it under control using #2 below, and there’s a sketch of the difference right after this list.)
  2. URL Rebuilding — When parameters are NOT truncated (see #1),  this feature allows you to specify which parameters you want to appear or be hidden.  
  3. URL Search & Replace (find it in the Report Configuration area) — this allows limited edits of the URL and query parameters.   
  4. The site’s domain — Normally, each page’s domain will be displayed according to what’s in the Domain or Host field of the log for each page hit.  If the domain isn’t in the logs (for example, it isn’t being logged at all) or if you have set WebTrends to Override MultiHome, then you can control the domain-name part of the URL.  WebTrends will use the domain name you enter in the Web Site URL field of the Home tab.
  5. The home page filename — WebTrends wants to change all instances of your home page filename to a plain slash (/).  It will do so if  you disclose the name of your home page file, in the Home Page File Names field of the Home tab.  There’s actually a decent reason for this, having to do with the fact that both “/” and “/default.asp” are, really, the home page and it would be nice to aggregate their stats.  If you don’t want this to happen, leave that field empty or put a nonsense filename in there.
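To make #1 and #2 concrete, here is a sketch of the difference between truncating all parameters and rebuilding the URL with a chosen few (the sample URL and the whitelist are made up):

    from urllib.parse import urlsplit, parse_qsl, urlencode

    url = "/product.asp?productID=123&sessionID=abc&WT.mc_id=spring"
    parts = urlsplit(url)

    # 1. Truncation: the report shows only the filename, no parameters at all.
    print(parts.path)                                      # -> /product.asp

    # 2. URL Rebuilding: parameters stay on, but only whitelisted ones display.
    SHOW = {"productID"}
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in SHOW]
    print(parts.path + ("?" + urlencode(kept) if kept else ""))
    # -> /product.asp?productID=123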

Here’s an important distinction.  Some of the things on the list just suppress data in the displayed reports without actually removing the data from the back room or affecting analysis.  #1 and #2 for example don’t actually remove parameters from the analysis.  They are still there to be used as filter definitions or custom report dimensions or URL Parameter Analysis reports.  #1 and #2 just cause changes in how results are displayed in certain reports, specifically the Page reports, Paths reports, and so on.  Likewise, #5 just affects the display in those reports.

Others on the above list do affect the analysis.  If you alter a URL or parameter using URL Search & Replace, it’s altered for the entire analysis, and the entire profile.  If you suppress the domain name in the logs using Override MultiHome, you lose the ability to filter on domain name.