Tips, tricks, and pokes, just WebTrends Analytics

The fearsome, frabjous* Regular Expression

The other day, we saw a thing by Somebody-or-Other that finished with this warning: “and besides, with WebTrends you have to use regular expressions.”

Oooh, regular expressions.  Run and hide under the bed.

There are two things wrong with Somebody-or-Other’s statement.  First, we don’t know of anywhere in WebTrends where you HAVE to use regular expressions.  A regular expression is always an optional, alternative way to tell WebTrends what to filter, what to collect into content groups, and so on.

Second, it’s not a big deal.  If you’ve already mastered the asterisk (*) wildcard, extending that skill to regular expressions will take about two minutes.  At least, it will take that long for the baby version of regular expressions that will get you through about 99.99% of your WebTrends needs.

WebTrends does have a full regular expression engine, and there’s no question that full-fledged regular expressions can be magnificently cringe-inducing.  I, personally, get more intimidated by a hearty regex than by a whole pageful of Perl.

But as said above you’ll probably never need more than baby regexes. 

(If you do develop an advanced need just call a geeky friend.  Geeky friends love to puzzle out advanced regular expressions.  Or ask on the WebTrends user forum where several regex mavens hang out.)

So this post is dedicated to those newish users of WebTrends who perceive the Regular Expression checkbox as solid proof that WebTrends is too technical and who have avoided that checkbox like the plague.

(To be honest, the name “regular expression” could be the most complicated thing about regular expressions, at least in this context.  What a dumb name.  “Regular Expression” just means “flexible way of matching.”  If it were called “advanced wildcards” would it make you more comfortable?)

You really only need to know two things about regular expressions (regexes) to start using them in WebTrends.

  • The simplest regular expression is just the characters you want to match.  No fancy symbols.  Suppose you want a content group that contains everything that’s an article, and all article filenames have the word “article” in them (like “article1234.htm”).  This can be handled by the simplest form of a regular expression, which is just the text that’s common to everything you want to match: in this case, “article”.  In other words, it’s the same as a text match for *article* (which is NOT a regular expression, because * has a funky meaning in regular expressions).

    Not very impressive, right?  You’re thinking, “this simplest kind of regular expression doesn’t do anything that ordinary wildcards can’t do.”   Ha!  Note that it saves you from typing asterisks!  That counts a LOT. 

    There’s one little catch if there’s any punctuation in your regular expression text, for example if you’re using “article.doc”.  You need to put a backslash before the punctuation, like this: “article\.doc”.  Without the backslash, the period means “any single character,” so the match still works but is looser than you intended.  (There’s a lot more to it but remember we’re giving you the pablum version.)
  • The other majorly useful regex thingy is the pipe character “|” (vertical bar).  It means “OR”.  So, “article|document” will match anything that has either “article” or “document” in it.  Now it’s getting more interesting, right?  You can’t do that with asterisks.

    Do you see how helpful this is in setting up a content group that will contain all article and document files?   This is, IMHO, worth the price of admission right there.
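
If you’d like to watch those two tricks work outside WebTrends, here’s a small sketch in Python.  The assumptions are all mine: Python’s re module stands in for the WebTrends regex engine (for patterns this simple the two should agree, but that hasn’t been tested in WebTrends), and the URLs and the matching() helper are made up for illustration.

```python
import re

# Hypothetical page URLs, invented for this illustration.
urls = [
    "/news/article1234.htm",
    "/library/document-7.htm",
    "/downloads/article.doc",
    "/downloads/articleXdoc",
    "/about/contact.htm",
]


def matching(pattern, candidates):
    """Return the candidates that contain a match for pattern anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


# Plain text: behaves like the *article* wildcard, minus the asterisks.
plain = matching("article", urls)

# Escaped period: matches only a literal "article.doc".
escaped = matching(r"article\.doc", urls)

# Unescaped period: "." means any single character, so "articleXdoc" sneaks in.
unescaped = matching("article.doc", urls)

# Pipe: "article" OR "document".
either = matching("article|document", urls)

for name, hits in [("plain", plain), ("escaped", escaped),
                   ("unescaped", unescaped), ("either", either)]:
    print(name, hits)
```

The unescaped version is the one to watch: it still finds “article.doc”, which is why a forgotten backslash usually goes unnoticed until something unexpected turns up in the content group.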

You can stop here if you want.  But if you have the courage to know just a bit more, here are two others.

  • If you want your match to happen only when your string is found at the very beginning or the very end, you can use two other special characters, ^ and $.  The ^ goes only at the beginning of the regular expression and the $ only at the end.  If you want a filter that will match yahoo.com but not www.yahoo.com, use ^yahoo because the ^ requires matching at the very beginning.  And yahoo\.com$ (note the backslash before the dot) will match www.yahoo.com but not yahoo.com.au, because the $ demands that your string match the very end.  And yahoo without either ^ or $ will of course match all of the above.

    Okay.  Take a deep breath.  Read it again slowly.  You’ll get it, I promise.
  • Maybe you have variable stuff you want to ignore in the middle of what you want to match.  For example suppose you’re making a content group that includes all product articles, like “/products/vorpal-blades/article.htm” and “/products/slithy-toves/article.htm” but not anything like “/press-releases/vorpal-blades/article.htm”.  You need something that means “matches any kinda junk here in the middle.”  The something is .*  – that’s dot asterisk. 

    So, your choice of regular expressions for the above situation would be: 
          /products/.*/article 
          products/.*/article 
          oducts/.*/articl 
          /products/.*/artic
          cts/.*/artic
          /produ.*/artic
          cts/.*rticle
          and so forth … lotsa choices but I’d go with the first
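
Here’s the same idea as a quick sketch you can run, again in Python.  Same caveats as before: Python’s re module is only a stand-in for the WebTrends engine (an assumption, not something verified in WebTrends), and the matching() helper is hypothetical.  The hostnames and paths are the ones used in the examples above.

```python
import re


def matching(pattern, candidates):
    """Return the candidates that contain a match for pattern anywhere."""
    return [c for c in candidates if re.search(pattern, c)]


hosts = ["yahoo.com", "www.yahoo.com", "yahoo.com.au"]

# ^ pins the match to the very beginning of the string.
starts = matching(r"^yahoo", hosts)

# $ pins the match to the very end (the dot is escaped to keep it literal).
ends = matching(r"yahoo\.com$", hosts)

# No anchors: the text may appear anywhere.
anywhere = matching("yahoo", hosts)

paths = [
    "/products/vorpal-blades/article.htm",
    "/products/slithy-toves/article.htm",
    "/press-releases/vorpal-blades/article.htm",
]

# .* stands for "any kinda junk here in the middle".
product_articles = matching(r"/products/.*/article", paths)

# The shorter variants from the list above, checked against the same paths.
variants = [
    r"products/.*/article",
    r"oducts/.*/articl",
    r"/products/.*/artic",
    r"cts/.*/artic",
    r"/produ.*/artic",
    r"cts/.*rticle",
]
variant_results = [matching(v, paths) for v in variants]

print(starts, ends, anywhere, product_articles, sep="\n")
```

With these three sample paths, every variant picks out the same two product articles.  The first pattern is still the safest choice, because the shorter ones get riskier as a site grows and other folder names start to overlap.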

This just scratches the surface of regular expressions, but as a WebTrends user the surface is exactly where you’ll be most of the time.  Callooh!  Callay!

 

Oh.  About the asterisk in the title.  That’s a footnote.  Just having a little wildcard pun.

Lewis Carroll, if you didn’t know.

‘Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand:
Long time the manxome foe he sought –
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

“And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
He chortled in his joy.

‘Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.


    3 comments

    1 Jacques Warren { 07.29.08 at 7:20 pm }

    Oh yes, we love them regexes. Very useful indeed, and you are right: I haven’t read the big O’Reilly book on Regular Expressions (I heard it’s mandatory reading in Guantanamo), and can very well address the vast majority of situations with the regexes you mention. Rarely have I even had to resort to ranges.

    However, something still puzzles me about the Reg Ex checkbox! I have noticed over the years that some types of reports just work better, and often only work at all, if I choose the regex approach (Scenario Analysis steps, URL Parameter page, some others)!?

    I got a case just today where defining a Content Group with regex showed no results (and yes, I tested the structure with the lovely and useful “Test” button, as well as tested the targeted URL), whereas not resorting to regex did!

    Yes, Regexes are lovely and useful; it’s just how WebTrends handles them in some circumstances that is not always clear.

    But maybe it’s just me, as my wife would say…

    2 rocky { 07.30.08 at 6:18 am }

    Jacques, you’ve gotta send the details of that content group situation. I’ve never experienced anything like that.

    The Test button is a little weird. It’s sensitive to capitalization although capitalization won’t affect matching in WebTrends itself. In a really complicated regex, I wonder if there are other quirks. I’ve never seen a regex engine that didn’t have something odd about it, sooner or later. In TextPad, for example, the backslash escape character combined with parentheses turns the parentheses into operators, when the “rules” say the \ should preserve the non-operator status instead.

    This complicated regex talk is scaring me.

    Can you send me the details of your latest finding?

    3 Jacques Warren { 07.31.08 at 6:06 am }

    Hi rocky,

    Yep, we could talk about the details through another channel. However, I wanted to bring this weird little “factoid” to your readers’ attention, and maybe they could verify it themselves:

    You define something such as /pageblabla.asp (Scenario step, URL Param page, etc.) and run it. If you get “0” results, try again with the regex box checked (and escape the period, as in \.asp; or you know what? you may not even need to specify the extension in the first place) and run the test again. Boom! You get results. It seems that in some circumstances WT “sees” the data only when regex is used.

    OK, I’m pretty sure I haven’t dreamt this, but I will test it again, and let you know. Would be great if other readers tried it too.
