Archive for the ‘Digital Services’ Category

In a manner of speaking semantics won’t do (has been: Speculation in Web Analytics)

Thursday, September 30th, 2010

– Summary –
“Speculation in Web Analytics” is our original topic. This article deals with the growing limitations in clickstream data capturing and analysis, which seem to be rooted primarily in the semantic dimension of the whole Analytics discipline: Certain technical patterns are to be interpreted as equivalent for particular real-world events and entities.

The first part of the article questions some of the essential assumptions on what could make “a page view” or “a visitor”, and explicates the fundamental challenge of semantic data richness (or lack of).

Originally only a side track on the importance of bijective “real-person” attributions, the next part explicates the particular challenges posed by the anonymous nature of clickstream data and relates them to password-protected “walled gardens” (the example used is Facebook and its “Facebook Insights” service), which slightly twists the view on Analytics as we’ve practiced it for a decade now.

As the much loved band Tuxedomoon has put it earlier: “In a manner of speaking semantics won’t do” (hence the title of this post).

Thus the last part of this post is focusing on the syntactical side of things and the levels of analysis derived from that: how to attribute the presence of particular events within a sequence and what can go wrong here as our tools are increasingly unaware of their unawareness of disturbing influences happening online and offline.

– Here’s the full-length 12″ version of the article –

Although the idea sounds counter-intuitive at first: Web Analytics involves as much speculation as any other serious attempt at pattern recognition (such as, for example, horoscopes or the SETI program). “Data accuracy” has been a much discussed topic amongst Web Analysts for more than a decade now. And “distrusting the data” has become a commodity amongst business stakeholders around the world.

The reason for this is easy to imagine and is rooted in the deeply semantic (for some: semiotic) foundation and interwoven structure of the whole Analytics discipline: “A page view” in your favourite Web Analytics tool is never directly observed, but is the result of a technical attribution of the form: if a request is sent in a particular format to a specific server and the request is recognized in a particular manner, associated with a specific account and written to a dedicated database – that makes a page view. Pretty much everything in that attribution chain could possibly go wrong, aye?

To make matters worse: we all know that “a visitor” is just the result of yet another attribution: an http request is coming from a specific browser on a particular computer – and even if this event is recognized as a recurring pattern from that same browser, you still can’t be sure that it is in fact the same person operating the computer as during the earlier operations.

As everybody has had to learn – in particular those who lightheartedly have created the meme of “multi-device ownership” to increase their sales volumes:
These days, the “personal computer” (that goes for all other computing devices like abacuses and pocket calculators as well) is becoming less and less personal. In fact: at work you may use one machine, at home you may use yet another (if your spouse or pet is not occupying this particular machine for their very own purposes). In between you may use a mobile device for checking one or two things from the web while being on the bus – and in all cases a different visitor is recorded to any technical system involved in tracking clickstream data.
Only if you (as a user) are inclined to recurring patterns on the same machines can you make analysts and marketing managers happy, as they get recurring events from the same machines to analyze. This, and only this, can make a profile and a pattern. Yet it has nothing to do with the person behind the request.

For this precise reason smart people have thought of alternative service building mechanisms: walled gardens (or: communities), secured by a login, have become one way of taming the beast; focusing on the smaller and less complex “visit” entities instead of “visitors”, has been another. Let’s take a closer look at both.

These walled gardens and communities, secured by logins, are justified by their service core proposition as “personal services”. They are utilizing a specific value proposition which presupposes personalization. This, of course, excludes all those who don’t want to go through any signup process but just want to see what is so cool about the “social media thing”. You can’t really peek over the fences of social networks these days.

In return, social network sites make certain pieces of anonymized data exploitable for analysts, as you can see from services like “Facebook Insights”: As an analyst you get some superficial performance metrics (such as: page views. Yay!), combined with some superficial demographic data (that is: gender and age group. Yippee!). They (Facebook) have just set up a new dashboard format for Facebook Insights which allows you to see aggregated counts for wall posts, uploads, and certain other engagement metrics on a Facebook fan page (FBFP). Facebook can track and display anonymized user activities across different tabs, yes, and therefore their service could potentially qualify as an alternative to your regular average boring company home page by now.

The interesting thing with this is that on an operational level you still are dealing with people inside a community, while the anonymity of the web visit partly disappears (the “social” thing you can’t have without authentication. Not even, if you are using Snoobi, as they are limited by the very same borders of the technical attribution logic as explicated above).

People seeking to contact you within such a walled garden have to grant you certain access to their own contact data (that is pretty similar to a contact form) – but the new thing is that a second-order layer of contacts is created (friends of friends – as in LinkedIn) and that posts from you as an FBFP owner will appear in profiles of your primary contacts from now on.

The Facebook Insights dashboard is allowing you to follow genuinely anonymous amplification metrics on your contents (“Likes”, “Comments”) which you can simply use for monitoring the contents you are putting to your FBFP – but you can as well gain data from externally embedded “Like” buttons on your corporate web site (even copy-paste ready code is provided for that).

With a bit of tweaking you can make your complete Facebook presence trackable with tools like Google Analytics these days. The tracking goes amazingly deep (and of course I had to become a Facebook fan of the company (Web Digi) drilling into this topic and providing the corresponding code snippets).

However: any insight needs context, and that is yet lacking from the fenced community data in Facebook Insights. Although I much appreciate a timeline view on the increase of “fans” from a certain age group in a certain gender I surely would appreciate not having it as a stacked line graph from where I need to handpick and note down the timely developments per group.

Instead, I would actually love it to see a service retention clustering with it: are all of my fans actually freshmen (and: ~women!) to the service, or are they hangarounds for years by now? This way I could figure out whether the activity patterns are attributable to giggling fresh signups, keen on posting all sorts of unrelated crap on my FB page – or whether I am dealing with an audience which has an inventory sticker on their forehead already and which are gently, but quietly, appreciating posts and topics.
I would consider the latter behaviour much more appropriate for mature social network users. This participation mode was called “lurking” with regard to prevailing newbie behaviour in the discussion boards and IRCs in the Nineties. But contrary to the historic interpretation of this habit as a preceding step to participation, I guess these days this behaviour goes more for people who are beyond participation. Not exactly “retards”, but people who have given up on the idea that their active participation could make any difference.

The way analytics and insights are utilized on the Facebook Insights service still resembles a generic focus on the “Reach” dimension: “Do I reach the right audience?” Well: How could I know? If my target group is the infamous flock of over-eighty-year-old-homosexual-males-living-in-tents-in-Iceland I can only get two attributes out of five from these Insights statistics at this point.
I just may have a lot of young female fans who dig and appreciate eighty-year-old-homosexual-males-living-in-tents-in-Ireland. And they may ring up their grandpas having moved to Iceland after their coming-out. How could I know?
For the rest of the analysis currently possible I am thrown back to my own imagination and a friendly consultant would probably advise me to “generate more engaging content that will lift up the users’ participation. Y’know: makes them click”, before charging me EUR 5.000 for “social media guruship” with a cold, cold smile.
Bollocks!

Let’s rather look at the other last Analytics resort, then: the “visits”.
We can concede that there is one ineradicable assumption about visits: visits are said to have a purpose and a goal (telos). They are performed by a complex entity called “a visitor” for which we have seen how damn hard it is to track them. Visits appear to be simpler, and one visitor can perform a lot of visits. With regard to being a sense-making entity, a visit might currently be the better bet.

Well – indeed this could be true. But more and more often I come to witness: all of my colleagues (in the – duh – agency) have at least eight visits on different web sites going on at the same time. The phenomenon is called “tabbed browsing”, and the effect on Web Analytics data interpretation is disastrous.

As a result we see more and more visits, lasting shorter and shorter across all the click stream data we are collecting. For blogs like this one it is quite clear (it has no RSS feed): people come to the main page, check for new posts, find none, and leave. High bounce rates, short time on site (particularly amongst repeated visitors), all clear, case closed.

For plenty of other visits on other sites a load of more complex things happens: Opening two links in new tabs from the main page upon arrival on the site, browsing to the “Products” pages in one tab, to the “news” section in another, playing with the products here, closing this tab then after ten minutes and looking at the news articles for which a view has been opened in another tab ten minutes ago.
We can leave out all the countless visits expiring from unattended browser tabs altogether, but the poor analyst who has to analyze the user path for such an improperly sessionized visit! My oh my!

Minutes and minutes of “time on page” while the user was busy updating their pet’s Facebook status in yet another browser tab (we traditionally have had to interpret that as “careful reading” and genuine interest in the website contents for this particular page that we have so carefully crafted four years ago), occasional erratic “open in new tab”/“close tab” actions from users across a session (a reminder: these actions are taking place within the browser itself. Contemporary web analytics tools are blind to actions like that!), and any occasional call from mum for eighteen minutes (“You haven’t been visiting me for ages! And you never call back” – “Yes, mum… No mum… sure, mum”. Things like that…), which distracts the user completely from doing any browsing. No matter how clear the visit goal was originally, we can’t truly expect to reconstruct the visit’s goal from such a fragmented session log.

Most people today are still said to use the computer not in the aforesaid manner.
I answer: what would prevent them from behaving erratically? A flatrate has become a commodity these days. Mobile phones are all around us. Tabbed browsing has become a habit. Earlier I had to pay for my online activities by the minute (as I was using the phone line for being online in the pre-broadband and pre-mobile phone age, mum could not even call me as the line was busy, occupied by my Rockwell 2400 bps modem. So: no distraction from anything except the telly!). Today (particularly since I am using a Mac), I simply close the computer’s lid after having checked something small from a site for just two minutes. And it may happen that I open the lid once more an hour later to look at some other tiny detail on the same site. “Well – that’s just you. You are a geek!”, you say.
But a matter of fact is, actually: access ubiquity makes geeks!

A rather ideal situation (compared to the analyst’s nightmare I’ve just coloured before) would be that somebody with a goal sits in front of a computer and focuses entirely on reaching that goal, no matter what.
This data is collected together with all the other data within our favourite tools. And we could indeed assume that visits which have led to a conversion (that could be: a purchase, a download, or a sign-up) are motivated by purposes and aim at reaching that goal. Fair enough!
Look at that data for an hour. It tells you a lot about people’s determination, about how deep you have hidden contents that you thought nobody would ever look at, it tells you about how many page views were needed before the deed was done and the deal was closed. It tells you about how fast your site’s checkout process is to grasp and to go through.

But how could you utilize that data for identifying the underlying reasons for people NOT buying from your web shop? Which, at average conversion rates between two and three per cent, still is a major concern for most site operators.
Or: How could you figure out whether the goal-reaching experience was a pleasant one? Or: How could you determine whether the reason for an interrupted visit was mere user frustration or a call from mum?
From looking at clickstream data: you can’t. For the sake of sanity: don’t even try to!

Today’s filtering criteria for our clickstream data, and the richness of that data, do not sufficiently map what we would need to know: I want to get all sessions which contained a user’s inactivity period (that is: no clicks recorded!) of at least five minutes. I want to compare that to visits where the inactivity period was at least ten minutes. What I wanna know is: Is there a correlation between the length of inactivity and the patterns in visit conclusion? Is there an accumulation of the same non-action pages across different visits? And: Would this pattern probably depend on the nature of the disturbance (i.e. mum calling, Scarlett Johansson or the Chippendales walking past the computer)?
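The filter wished for above is simple enough to sketch once you have raw hit timestamps per session (which most tools don’t hand out that easily). The following is a minimal sketch only: the session ids, the timestamps and the function names are made up for illustration and are not taken from any particular tool.

```python
from datetime import datetime, timedelta


def max_inactivity(hit_times):
    """Return the longest gap between two consecutive hits of one session."""
    ordered = sorted(hit_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


def sessions_with_pause(sessions, minimum):
    """Return the ids of all sessions containing a gap of at least `minimum`."""
    return [sid for sid, hits in sessions.items()
            if max_inactivity(hits) >= minimum]


# Hypothetical sample data: three sessions with different pauses.
start = datetime(2010, 9, 30, 12, 0)
sessions = {
    "a": [start, start + timedelta(minutes=1), start + timedelta(minutes=2)],
    "b": [start, start + timedelta(minutes=6), start + timedelta(minutes=7)],
    "c": [start, start + timedelta(minutes=12)],
}

five_minutes = sessions_with_pause(sessions, timedelta(minutes=5))
ten_minutes = sessions_with_pause(sessions, timedelta(minutes=10))
```

With these two lists at hand you could compare exit pages or conversion counts per group. What caused the pause still remains invisible, of course.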

Well – supposedly all of that. As the disturbances are not recorded – we can’t know.
And even if the Chippendales, mum, and Scarlett Johansson could leave a cookie behind: we couldn’t see the users’ point of frustration, or defection from the service altogether – they may just as well have continued their browsing session after having been on the phone with mum for 30 minutes and 40 seconds – which would make this particular visit technically a new visit due to the industry standard of the 30 minutes visit timeout. And be warned: In your favourite web analytics tool you may see this particular visit start from a very strange, deep entry page, too.
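To illustrate the effect of the 30 minutes visit timeout, here is a minimal sketch of a sessionizer. It assumes that a new visit starts whenever the gap between two hits exceeds the timeout; real tools may treat the boundary differently, and the page paths are invented for the example.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # the common default mentioned above


def split_into_visits(hits, timeout=VISIT_TIMEOUT):
    """Group (timestamp, page) hits of one browser into visits."""
    visits = []
    for hit in sorted(hits, key=lambda h: h[0]):
        # Continue the current visit only if the gap is within the timeout.
        if visits and hit[0] - visits[-1][-1][0] <= timeout:
            visits[-1].append(hit)
        else:
            visits.append([hit])
    return visits


# Hypothetical clickstream: mum calls at 12:05 and talks for 30 min 40 s.
hits = [
    (datetime(2010, 9, 30, 12, 0, 0), "/"),
    (datetime(2010, 9, 30, 12, 5, 0), "/products/widgets/blue"),
    (datetime(2010, 9, 30, 12, 35, 40), "/products/widgets/blue/specs"),
    (datetime(2010, 9, 30, 12, 37, 0), "/checkout"),
]

visits = split_into_visits(hits)
entry_pages = [visit[0][1] for visit in visits]
```

The single browsing session ends up as two visits, and the second one “enters” the site on a deep specs page that nobody would ever have linked to from outside.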

We are not yet there. But the growing fragmentation of our data sources gives me a hard time, occasionally. Visits are getting shorter – that seems to be a genuine trend – and the process of making a purchase decision seems to be less and less observable these days.
The true consequence of these thoughts remains somewhat unclear to me. It seems as if the principle of telos is no longer applicable to once well anticipated sense-making entities in Web Analytics (like “visitors”, or even “visits”).
The “moving target” paradigm which has threatened marketing for over three decades now has undoubtedly led to fragmented usage patterns where an inherent pattern of sense-making is no longer clearly attributable to agents outside the medium itself.
The clickstream data still has ’nuff of clear and distinct goal seeking and goal reaching patterns – but these structures are in retreat, and (as so often) the eigenvalues will take over sooner or later. “We return to the icon.” (Marshall McLuhan)

When a customer next asks me: “Yes – but is the data accurate?” I may soon have no other answer than: “OK – let’s pretend that matters.”

Stages in Web Analytics data interpretation

Tuesday, June 15th, 2010

There are a couple of articles, books, and posts out there about how Web Analytics is gonna change the business world as we know it. This article, in contrast, tries to wrap up how the conceptualization of what Web Analytics data can do has changed over the past fifteen years.

Business organizations are today putting high hopes in Web Analytics systems and tend to re-construct their whole business realm from looking at their figures. Without doubt there is good ground for the assumption that “knowing your numbers” will reveal potential for improving web services and check the validity of underlying business and content models, but first and foremost Web Analytics serves as a sensorium for a company to know what’s going on on their web site.


When I was tapping into the field of building web sites in 1996 the “hard currency” of Web Analytics was “a hit”.
Technically speaking, a hit is a request to a web server and, depending on how many elements there are on a page, the number of hits can easily increase to several dozen per page – especially when considering that graphical layouts back then were done with tables, containing a lot of transparent 1×1 pixel gif files which were keeping the layout together.
The other thing which added some exotic flavour back in 1996 was the geographic origin of the request. After all we’ve been proud of dealing with a global infrastructure here – and receiving an http request for a European site from Australia or Vanuatu was a big thing back then.
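The gap between “hits” and pages is easy to show with a few request paths. This is a rough sketch under a loud assumption: a request counts as a page if its path has no file extension or ends in .html/.htm. Real log analyzers use their own, configurable rules, and the paths below are invented.

```python
from os.path import splitext

# Assumption for this sketch: these extensions mark a "page".
PAGE_EXTENSIONS = {"", ".html", ".htm"}


def count_hits_and_pages(request_paths):
    """Return (number of hits, number of page requests)."""
    hits = len(request_paths)
    pages = sum(1 for path in request_paths
                if splitext(path)[1].lower() in PAGE_EXTENSIONS)
    return hits, pages


# Hypothetical requests for two table-layout pages and their gif files.
requests = [
    "/index.html",
    "/img/logo.gif",
    "/img/spacer.gif",
    "/img/nav_left.gif",
    "/img/nav_top.gif",
    "/products.html",
    "/img/product_shot.gif",
]

hits, pages = count_hits_and_pages(requests)
```

Seven hits, two pages: the more spacer gifs the designer needed, the better the “traffic” looked.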

This fixation on “hits” and requests from the other end of the world was what I would label as “Web Analytics 0.8”. It was based on “Hey – we can see something in the data!” and what was to be seen was ridiculously over-interpreted.

Nevertheless: the Internet population grew (and still is growing), so the figures went up – and that was the only thing that mattered.

To a certain extent this is still today the predominant way to look at Web Analytics data. The misconception of “hits” was soon replaced with a focus on “pages” – distinguishable entities that represent content units on web sites.
The “Web Analytics 0.9” approach thus was still only counting things – but this time it was page views and visits, mostly.

The first calculated ratio that people tended to come across when dealing with Web Analytics data is the “Page views per visit” ratio. Early publications presented this number as a good indicator of “engagement”, but it only requires a little imagination to debunk this as misleading – you can’t tell whether people love to read so much on a web site, or whether they keep jumping around between pages because they simply can’t find what they are looking for.

Again: the most important thing was that the figures (any figures!) kept going up – large corporations’ marketing executives have been measuring success for traditional print campaigns in “reach volumes” and “contacts” for decades, so the emerging digital medium was measured in the same way. The focus for web sites in the days of “Web Analytics 0.9” was still to reach a large audience.

It took until 2006 before a standard approach was formulated for the entire Web Analytics world. The initial Web Analytics Definitions paper (“Big Three Definitions”) was a gigantic step forward – in particular, as it did not only list the essential metrics, but it as well gave an idea about the types of Web Analytics metrics (counts, ratios, Key Performance Indicators), together with an indicator on how the different data universes looked: aggregated, segmented, individual.

The “invention” of the Key Performance Indicators for web sites would be what I’d consider the “Web Analytics 1.0” stage – for the first time Web Analytics data was put into a unified concept framework.

A lot of people have contributed to the foundations and essentials before that. If you happen to come across, for example Eric Peterson’s books from 2005/2006 you will find that most of the definitions and the connotations have already been fleshed out back then. If you look at books written by Jim Novo or Jim Sterne you will find very elaborated frameworks for deriving insights from data. And if you come across Avinash Kaushik’s publications you will see that there is a lot more than just “contacts” to be counted from Web Analytics tools.

In a sense you could say that the distinct stages of the Web Analytics development were evolving from measuring things (in a technical sense as “counting”, primarily) to a more elaborated way for bringing together visitor deeds with business goals (attribution of relevance). Tying these two things together (business goals and Web Analytics) meant as well to free Web Analytics from the geek corner it has been residing in for over ten years.
The technicalities of the measurement are still left to the hairy guys in the IT department, mostly, but the conceptualization on what Web Analytics should be has led even to the idea of the “data-driven organization” for which the ambitious goal can be formulated as “replacing management’s gut feeling with solid performance measurement of marketing efforts”.

To develop Web Analytics further yet another enhancement was useful for getting the domain of technical measurement closer to the domain of business goals.
The development of e-commerce systems on the web (meaning: a new revenue source for a business) made the focus shift from reaching a growing audience of people on the web to making them purchase stuff online. The possibility to measure all different stages of a purchase process as a funnel revealed the improvement potential for the long way round from the “Add to cart” via the “Checkout” button to the “Thank you for purchasing” page.
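A funnel of this kind can be sketched in a few lines. The step names and the sample visits below are hypothetical, and the sketch assumes that a step only counts if the previous step has been reached earlier in the same visit.

```python
# Hypothetical funnel steps, named after the buttons and pages above.
FUNNEL = ["add_to_cart", "checkout", "thank_you"]


def funnel_counts(visits, steps=FUNNEL):
    """Count how many visits reached each funnel step, in order."""
    counts = [0] * len(steps)
    for pages in visits:
        position = 0  # index of the next step this visit has to reach
        for page in pages:
            if position < len(steps) and page == steps[position]:
                counts[position] += 1
                position += 1
    return counts


def step_rates(counts):
    """Share of visits carried over from each step to the next one."""
    return [after / before if before else 0.0
            for before, after in zip(counts, counts[1:])]


visits = [
    ["home", "add_to_cart", "checkout", "thank_you"],
    ["home", "add_to_cart", "checkout"],
    ["home", "add_to_cart"],
    ["home", "products"],
    ["add_to_cart", "home", "add_to_cart"],
]

counts = funnel_counts(visits)
rates = step_rates(counts)
```

The interesting figures are the rates between the steps: they show where on the long way round most people drop out.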

New questions arose from there: (1) what happens between “reaching” a person and counting a purchase (aka “a conversion”) from that person?, and (2) if somebody has bought from a shop once – would (s)he consider doing it again?

Unless people have an infinite amount of money and don’t have to think about it, we regular folks do have to consider purchases as spending on alternatives: putting your money into something particular means you can’t spend it on something else anymore. The decision making process is to a large extent oriented around the anticipated benefits from purchasing product X vs. the downsides of not being able to purchase product Y (funny enough: it’s not simply comparing product X vs. product Y).

Among psychologists, this phenomenon of comparing the benefits of apples (which people tend to buy anyway these days. No pun intended!) with the downsides of not buying oranges is known as a “post-purchase dissonance”.

It seems obvious that the seller of apples doesn’t precisely know whether the alternative purchase would be oranges, bananas, or grapes, thus the seller cannot anticipate the dissonances caused by other products. Instead, (s)he relies on exaggerating the upside from buying apples with the means of communication.

This communication, of course, is always only an offer. The prospect may or may not take it, but the missing link between reaching potential buyers and closing the deal is a very abstract thing called “engagement”.
Defining engagement is not a trivial thing to do. We can understand it as a prospect’s attempt to minimize his or her own post-purchase dissonance. The question is: how could that be measured?

Answer: it depends on the product. Consider your own way of making purchase decisions. If it is a simple, affordable product, it doesn’t take long to cross the line. If it is a car, or a house, it takes a hell of a lot more to make up your mind.

For books and CDs it’s quite common to have a look at reviews to find out whether they are any good. For cars and houses it’s not so much the product quality, but how affordable and desirable you perceive it. Are you able to sell your old car/house? If so: at what price? Do you need to take a loan for the new one? If so: at what price? Are there any additional costs coming with the purchase? For a stereo, or a veranda?

You see the point: engagement is nothing we could give a calculation or measurement for, as the threshold for the purchase decision is highly subjective and dependent on people’s own perception of affordability. We simply can’t observe it from the outside.

A repeated purchase, however, is relatively easy to measure, as normally some sort of user identification goes along with it.
I think it was Jim Novo who wrote some very inspiring chapters about user defection and how to avoid it. “Timing” is of course the essential thing here: somebody who just had a haircut is unlikely to get the next one the next day. But of course the frequency of repurchase is different from individual to individual. An intent to repurchase only ever comes along with a certain probability. The longer the time, the less likely a repurchase becomes, so it is important to study the purchase intervals for repeated customers and to address the different customer groups just before their respective defection points.
Admittedly: this may work well for haircuts and anti-virus software (the purchased product degrades naturally over time), but it may not work equally well for hardware such as music players or TV sets.
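Studying purchase intervals could start as simple as this. It is a sketch under assumptions of my own: a customer is flagged as “overdue” once the time since the last purchase exceeds 1.5 times his or her own median interval. Both the factor and the function names are invented for illustration and are not taken from Jim Novo’s writing.

```python
from datetime import date
from statistics import median


def purchase_intervals(purchase_dates):
    """Days between consecutive purchases of one customer."""
    ordered = sorted(purchase_dates)
    return [(later - earlier).days
            for earlier, later in zip(ordered, ordered[1:])]


def is_overdue(purchase_dates, today, factor=1.5):
    """True if the current silence exceeds factor x the median interval.

    The factor of 1.5 is an arbitrary assumption for this sketch.
    """
    intervals = purchase_intervals(purchase_dates)
    if not intervals:
        return False  # a single purchase tells nothing about a rhythm
    days_silent = (today - max(purchase_dates)).days
    return days_silent > factor * median(intervals)


# Hypothetical customer with a haircut every five weeks.
haircuts = [date(2010, 1, 4), date(2010, 2, 8), date(2010, 3, 15)]

intervals = purchase_intervals(haircuts)
on_time = is_overdue(haircuts, today=date(2010, 4, 12))
too_late = is_overdue(haircuts, today=date(2010, 5, 24))
```

In practice you would want to address the customer shortly before this point is reached, which means looking at the distribution of intervals per customer group rather than at a single threshold.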

Measuring user behaviour across visits must be considered one of the core principles of gaining a deeper understanding of visit and purchase patterns. I would label this attempt at bringing together user behaviour across visits with the generic purchase funnel as the “Web Analytics 2.0” approach.

Depending on the nature of the products being sold a vast proportion of users may purchase something already during their initial visit (we could say: they come with a clear purchase intention), but others need to re-visit a site several times to confirm and reassure themselves that they really are into buying this particular product.
We of course can call this visit repetition a form of “engagement”, too. And if we can measure and compare the time intervals between visits, and in particular the interval between the visit leading to a purchase and the visit immediately preceding it, we may derive valuable insights from that.

The look at data has changed over the years: from counting occurrences in a binary manner (somebody has or has not seen/clicked/bought) in the Web Analytics 0.9 and 1.0 domain to attributing complex behavioural patterns in dimensions of probability in the Web Analytics 2.0 domain a lot has happened.

The view on web site visits with a funneled approach, the consideration of generic purchase funnels with the stages of Reach, Engage, Convert, and Repeat (or Reach, Engage, Activate, and Nurture), and the idea of distinguishing different user segments based on behaviour has created an immensely rich interpretation framework with lots of intertwined layers and perspectives.

This increase in richness comes at a price: as the lack of a useful “engagement” definition shows there are more and more unobservable processes involved, to which Web Analytics can’t contribute much: has the quality of a visit been perceived as successful? Rewarding? Is the user likely to recommend the web service to his best friend? To his grandmother? Web Analytics data doesn’t tell about the subjective perceptions of users or user groups.

It doesn’t come as a surprise that particularly in tough economic times the shortening and optimization of the path from Reach to Conversion is the predominant mode of applying Web Analytics data: efficiency considerations, namely the awareness of cost and monetary outcomes, have put a General Management perspective on Web Analytics data, in which ROI (Return on Investment), AOV (Average Order Value) and CPA (Cost per Acquisition) calculations are essential key performance metrics.
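For reference, the three calculations in their most common textbook form, as a small sketch. The figures are invented, and the ROI here is computed on revenue for simplicity; a proper calculation would be based on margin.

```python
def average_order_value(revenue, orders):
    """AOV: revenue divided by the number of orders."""
    return revenue / orders


def cost_per_acquisition(campaign_cost, acquisitions):
    """CPA: campaign cost divided by the number of acquired customers."""
    return campaign_cost / acquisitions


def return_on_investment(gain, cost):
    """ROI: net gain relative to the cost (1.0 means 100 per cent)."""
    return (gain - cost) / cost


# Hypothetical campaign: EUR 3,000 spent, 150 orders, EUR 12,000 revenue.
aov = average_order_value(revenue=12000, orders=150)
cpa = cost_per_acquisition(campaign_cost=3000, acquisitions=150)
roi = return_on_investment(gain=12000, cost=3000)
```

An AOV of 80, a CPA of 20 and an ROI of 3.0 look nice on a management slide, but none of these figures says anything about how the users perceived their visit.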

Whether the assumption that users’ interests revolve around easy purchase is really true or not remains to be seen (watch out for the next exit survey of a web service of your trust!). Nevertheless I would expect a return of the “User-centered design” approach any time soon: successful web services have to provide more value for users than just a one-click-purchase.
What that could be depends a lot more on smart service designers than on the interpretation of Web Analytics data.

What can we know? Introduction, and: About Segmentation

Friday, June 11th, 2010

Analyzing web sites with Web Analytics data

The regular analysis steps in Web Analytics are pretty similar to other insight-generating processes, namely attribution and prediction.

One particular attribution technique in Web Analytics is Segmentation. It’s an interesting one, though it comes with its particular difficulties and limitations. We’ll cover that topic first in this article before we go to the more generic topic of Attribution and its peculiarities.
From there we will try to outline possible approaches for applying Predictive Techniques. We believe these techniques could greatly contribute to the improvement of web services, though currently the attempts to apply them seem to be aimed rather at driving sales.


When we segment data, we are applying a fundamental distinction to a population. In the beginning we can only hope that the attribute chosen for distinction holds and is not overwritten by another layer of determination which would neutralize the original distinction. Only time can tell throughout the analysis, but we stick to the original distinction as long as we can.

For that reason we Analytics people are fond of exclusive distinctions which are very likely to stick: first-time vs. repeat visitor, visitors who come in via a search engine vs. direct entries, residents from Romania vs. residents from Kirghizia.

For you as a business owner it would be a lot more interesting to apply an entirely different set of distinctions: prosperous vs. low on spending power, influencer or late adopter, curious prospect or disgruntled owner.

Your average Web Analytics tool doesn’t know anything about the sociodemographics of your visitors as persons. The Web Analytics tool runs on your web site (a web site is a technical structure which addresses the potential information need of an unknown audience). The click stream data that is collected on it is anonymous by nature (unless you force your users to authenticate on your site), and so you have no significant chance to identify, let’s say, Mr. E.R. Brantshaw, Napia Court, Black Lion Road, London, SE 14. In the words of the old Monty Python sketch: “He cannot be seen.”

Segmentation is a filter technique (coming from the Latin verb secare: to cut) in which you can combine several criteria to “slice” your audience into relevant groups. You can single out those visits which have crossed at least 5 pages and lasted longer than 120 seconds to see whether these visits are showing any particular patterns or content preferences when compared to other/all visits.
You may consider looking at visits made by visitors from a given country to see whether they have a shorter time on site than people from another, you can single out visits from a certain acquisition source vs. respective purchase volumes, you can slice your data into all sorts of fragments and special characteristics (too many filters combined may result in empty segments, though).
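Combining criteria into a segment can be sketched like this. The Visit fields, the country codes and the helper names are assumptions made for the example; the thresholds (at least 5 pages, longer than 120 seconds) are the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    pages: int      # number of page views in the visit
    seconds: int    # visit duration
    country: str    # resolved from the IP address
    source: str     # acquisition source


def segment(visits, *criteria):
    """Keep the visits which satisfy every single criterion."""
    return [v for v in visits if all(test(v) for test in criteria)]


def at_least_five_pages(visit):
    return visit.pages >= 5


def longer_than_two_minutes(visit):
    return visit.seconds > 120


def from_kg(visit):
    return visit.country == "KG"


def via_search(visit):
    return visit.source == "search"


# Hypothetical sample data.
visits = [
    Visit(pages=7, seconds=300, country="RO", source="search"),
    Visit(pages=2, seconds=40, country="RO", source="direct"),
    Visit(pages=6, seconds=90, country="KG", source="search"),
    Visit(pages=5, seconds=121, country="KG", source="direct"),
]

engaged = segment(visits, at_least_five_pages, longer_than_two_minutes)
too_narrow = segment(visits, at_least_five_pages, longer_than_two_minutes,
                     from_kg, via_search)
```

The first segment contains two visits, the second one none: every additional filter cuts the population further, until nothing is left to analyze.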

Still one thing remains unsolved:

Your Web Analytics tool is literally blind to any of your visitors’ personal details.
It requires attribution techniques which lie behind the mentioned distinctions. The primary dimension of the applicable distinction is a technical one: the geolocation can be resolved down to the city or urban district level from the IP address. Information about the connection speed, browser version and operating system is transmitted automatically by most systems and can be used to distinguish between Mac users on a high-speed connection with the latest Chrome browser and users of an old Windows 98 system with a modem and a ten-year-old version of Netscape Navigator.

But all the data you can obtain from your Web Analytics tool requires attribution techniques to make the data meaningful for your business: the data doesn’t tell you where to draw the line between an influencer and an early adopter and whether the average purchasing power is higher in Romania or in Kirghizia.

In other words: you collect data. And then you need to contrast this data with other data.

Continue reading: On Attribution

What can we know? About Attribution

Friday, June 11th, 2010

Maybe this is why the term “behavioural analysis” has gained much popularity recently. For sure: behaviour on a web site indicates different information retrieval patterns and strategies, which allow us to deduce levels of user engagement and the adequacy of navigational paths. But the entities the data relates to once more require attribution.

In case you have product information assembled on your site you’ve got a fair chance to get certain insights from checking which product pages have been viewed most often and how long visitors have been on each of the product pages on average.

You can easily start attributing distinctive behavioural patterns to certain characteristics you can segment for: new visitors vs. repeat visitors to identify potential weaknesses in your navigational paths, visits using internal search vs. visits where the navigation is used to identify optimal navigation path lengths, display and click counts for teasers on your site’s main page – but the genuine downside of click stream data remains this:
it is semantically rich – it can tell you a lot about a page and about aggregated counts for all of the measurement points, but syntactically poor – it doesn’t tell you much about the relative performance of a piece of content in the context of a visit.

Consider a visit on a web site. The visit starts on a certain page (landing page), continues for a while across several other pages, and then ends at yet another page (exit page).
The way that click stream data is organized in your Web Analytics tool is – with regard to content – centered around page entities, which are sorted by occurrence (highest first).

Interpreting this data can go in several ways: You can consider the pages with the highest rankings to be the most interesting ones – or you go the other way and assume that the likelihood of a page being crossed during any given visit declines the deeper it is hidden in your navigation structure.

Try applying the same logic to visits (in case your Analytics tool supports session or pathing information): you will find that the visits with the highest occurrence counts consist of two pages only, i.e. the home page and the product catalogue start page. It’s hard to attribute this to a particular user interest, isn’t it? Or it leaves you with the insight that something must be wrong with your product catalogue as people tend to not continue their visit from that page.
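If your tool lets you export page sequences per session, the ranking of visits by occurrence can be reproduced in a few lines. This is only a sketch: the session ids and URLs below are invented, and the dictionary layout is an assumption about what such an export might look like.

```python
from collections import Counter


def top_paths(sessions, n=3):
    """Count identical page sequences and return the n most frequent ones."""
    counts = Counter(tuple(pages) for pages in sessions.values())
    return counts.most_common(n)


# Invented sessions: session id -> ordered list of pages viewed.
sessions = {
    "s1": ["/home", "/catalogue"],
    "s2": ["/home", "/catalogue"],
    "s3": ["/home", "/catalogue", "/product/senator"],
    "s4": ["/home", "/contact"],
    "s5": ["/home", "/catalogue"],
}

ranking = top_paths(sessions)
most_common_path, occurrences = ranking[0]
```

In this sample the two-page path from the home page to the catalogue start page leads the ranking with three occurrences. The count alone tells you nothing about why those visits stopped there.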

The sad fact is: the more visit volume you have, the more the distribution of data starts to follow formal principles and statistical patterns. The likelihood of certain click sequences occurring is inextricably linked to the number of pages, elements tracked and navigational levels present on your site. Using only content-level explanations as the outcome of a volume analysis is misleading. Root-cause attribution is a tricky business.

Let’s look deeper:
Of course you have different page types on your site, too: your home page (top level in the navigation) where you may present latest news and announcements, hub pages which primarily serve the purpose of letting people transit to content pages which are deeper in the navigational structure, the “Contact us” page where you offer all possible contact methods, and your product pages where you have all descriptions and specifications of your cutting-edge products. From some pages you expect people to exit (“good exits”), from some you don’t want them to leave (“bad exits”).

Ask yourself: just from looking at the data of the Top 10 exit pages – can you tell which ones are “good exit” pages (could be the page with the title “How to find us”) and which ones are the “bad exit” pages (maybe the page with the title “Contact us”, if it only led to a contact form)?

Comparing the popularity of product pages may sound easy, compared to these page-type categorizations. But depending on the size of your product portfolio and the hierarchical organization of product category information on your site you may see only a very small fraction of visits ever coming across any of your product pages.

Consider adding a navigational short cut to each of your products, using a “select product” dropdown on your main page. Depending on the sequence of dropdown entries you may see lower volume figures for elements deeper down in the list – or then you may see other side effects, depending on how you structure the list entries.

If you happen to have cryptic product names for parts of your portfolio (like “LV2010EG 250G”) you may see organic traffic from search engines come in at high rates. Pages of your product line “Senator” might not get many entries from search engines at all.

You surely don’t need to dive all the way down to calculate the signal entropy for each of your product names to cater for data normalization, but you see that the range of possible attribution techniques to apply to your data is huge – and every step of attribution increases the fuzziness of what you are originally measuring.

Continue reading: on Predictive Techniques

What can we know? About Prediction

Friday, June 11th, 2010

To avoid what can be referred to as “analysis paralysis” in such a situation you may utilize a set of different techniques. Labeling them as “predictive techniques” may seem to be a bit far-fetched for some, but there’s a good reason for trying to shift the perspective: as root-cause analysis requires a lot of effort, a consolidated record of distinctive related user action patterns can be used for applying methods of advanced analytics.

A very light prediction method to start with is to formulate hypotheses for site improvement, and to test them one-by-one. Performing A/B tests to see the impact of a change is a suitable way for developing and evaluating prediction models – and it provides a framework for getting a sensorium for the volatility of user reactions to planned changes.
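One common way to read the outcome of such an A/B test is a two-proportion z-test. The post itself does not prescribe any particular statistic, so take the following as a sketch under that assumption; the function name and the visit and conversion numbers are invented.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test. Returns (lift, z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data, test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Invented numbers: variant A converts 100 of 1000 visits, variant B 150 of 1000.
lift, z, p_value = ab_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference is unlikely to be noise. It does not tell you why users reacted the way they did, so the hypothesis you started with still carries the explanation.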

Another (hypothetical) scenario for utilizing prediction techniques is to break the steps down into small pieces:
I occasionally go to airline web sites to check out options for flights. In nearly all the cases I either check departures from my current location (for myself) or for flights to my current location (for friends and family). On some rare occasions I check only parts of routes (to see whether there is any earlier/later connecting flight than the one suggested to me by the airline site I am looking at).

Five minutes ago I just hammered in a URL for an airline from Southern Europe for inspiration. Although my current location can be accurately determined from my IP address (I used whatismyipaddress.com for verifying) I am greeted on that web site in Polish, and the airport in Warsaw is entered as my departure point.

Right idea, wrong execution (that airline is operating from my current location, too). No matter how often or with which browser I return to that same site I encounter the same situation: site in Polish, departure point: Warsaw.

A slightly more powerful prediction model could work like this: (1) resolve IP address, (2) store departure point change in cookie, (3) read changed departure point cookie and display corrected departure point on next site entry.
Although the cookie method has its downsides, it can’t make matters worse than they are at the moment. Having a trace of the initially selected departure location as well as a notion of the user’s applied change within the Web Analytics data helps to anticipate how close or far off the prediction is. This offset (or: proximity) approach equips service development with indicative values for further reference.
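The three steps can be sketched roughly as follows. The cookie is modelled as a plain dictionary, the IP lookup is assumed to have happened elsewhere and to have produced `ip_guess`, and the airport codes are arbitrary examples; none of this is taken from a real airline site.

```python
def departure_for(ip_guess, cookie):
    """Step 3: prefer a departure point the visitor corrected earlier,
    otherwise fall back to the guess derived from the IP address (step 1)."""
    return cookie.get("departure", ip_guess)


def record_correction(suggested, chosen, cookie, log):
    """Step 2: store the visitor's choice in the cookie and leave a trace
    of suggestion vs. choice for the Web Analytics data."""
    cookie["departure"] = chosen
    log.append({"suggested": suggested, "chosen": chosen, "hit": suggested == chosen})


cookie = {}  # stands in for the browser cookie
log = []     # stands in for the analytics trace

first_visit = departure_for("WAW", cookie)          # nothing stored yet: IP guess
record_correction(first_visit, "HEL", cookie, log)  # visitor changes the departure point
next_visit = departure_for("WAW", cookie)           # corrected value wins
```

On the first visit the site still suggests the IP-based guess. On the next entry the corrected departure point is shown, and the log keeps the information that the original suggestion was off.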

The potential granularity of the captured data is refined in iterations: store the corrected departure points from all searches in the cookie. As soon as a corrected departure point appears twice, make this the new default departure point (“likely departure point”), put a value index of “0.8” to it and delete the others from the cookie. Start collecting data again, but leave a trace in your Web Analytics data that this session has shown a custom departure point with the index value 0.8 within the segment of all returning visitors.

Consider how this scenario could continue:
After I’ve successfully configured my starting point (departure, language) upon my first visit I may return to the site any time later (second visit). If I am checking flights from the same starting point, my preference for this location is ranked higher (0.9) and stored in my cookie, too. As soon as I have booked my first flight from that location the value index for this location on that machine is set to 1, indicating a high probability that my next research will start from that same location. Other destinations less-sought after can be kept but could be decreased in value over time.
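Put together, the value index could be maintained along these lines. The cookie layout and function names are made up, and treating any later search from the likely departure point as a confirmation worth 0.9 is a simplification of the second-visit rule. Decay of other locations over time is left out.

```python
def register_search(cookie, departure):
    """Update the cookie after a search from `departure`."""
    likely = cookie.get("likely")
    if likely and likely["code"] == departure:
        # Same starting point again: rank the preference higher.
        likely["score"] = max(likely["score"], 0.9)
        return cookie
    candidates = cookie.setdefault("candidates", [])
    if departure in candidates:
        # Seen twice: promote to "likely departure point", drop the others.
        cookie["likely"] = {"code": departure, "score": 0.8}
        cookie["candidates"] = []
    else:
        candidates.append(departure)
    return cookie


def register_booking(cookie, departure):
    """A booking from this location sets the value index to 1."""
    cookie["likely"] = {"code": departure, "score": 1.0}
    return cookie


cookie = {}  # stands in for the browser cookie
for code in ["HEL", "WAW", "HEL"]:  # arbitrary example airport codes
    register_search(cookie, code)
score_after_promotion = cookie["likely"]["score"]

register_search(cookie, "HEL")
score_after_confirmation = cookie["likely"]["score"]

register_booking(cookie, "HEL")
score_after_booking = cookie["likely"]["score"]
```

The three scores come out as 0.8, 0.9 and 1.0 in that order. Writing the current index value into the analytics data of each session is what later lets you segment returning visitors by how confident the prediction was.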

After having booked a flight, the booker’s first name/last name is confirmed as well as my gender. Next you could compare credit card holder name to passenger name(s) from the booking and you gain a good reference value for attributing this session’s browsing behaviour to a traveller on his/her own terms (as opposed to a booking which is purchased with a company credit card). Put a scoring on that, too.

Profiling and collecting personal data from other sources is a questionable thing to do (I’d recommend not to), but even if you stick to the data you are collecting yourself you may gain valuable input for scoring and valuing:
Business travelers tend to do more short term bookings (and travel in smaller groups, usually), common last names of passengers indicate families traveling etc.

So you could just sit there and collect data from your visitors’ habits, and “low-hanging fruits” will be the outcome.
Admittedly: that might take a long while, depending on how much savvy you apply to it, depending on return rates, and whether people have changed computers/cleaned up cookies since last appearance. But I believe that frequent Internet users are aware of some of the technical and conceptual shortcomings with cross-browser/cross-computer use.

As reference for the supposed language preference you may consider the OS language (tried this with another airline site a minute ago: got the language right, but screwed up the location, too). If you prefer any other method: fine. As long as you keep a record of how often your suggestions were rejected/adapted you have a valid means for determining the offset of your preferred method.
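Such a record can be as simple as a list of suggested vs. chosen values per method. The sketch below assumes that layout; the method names and log entries are invented for illustration.

```python
def rejection_rate(log):
    """Share of suggestions the visitor changed; None if nothing was logged."""
    if not log:
        return None
    rejected = sum(1 for entry in log if entry["suggested"] != entry["chosen"])
    return rejected / len(log)


def compare_methods(logs_by_method):
    """Rejection rate per suggestion method (lower means a smaller offset)."""
    return {name: rejection_rate(log) for name, log in logs_by_method.items()}


# Invented traces for two hypothetical ways of guessing the site language.
logs_by_method = {
    "ip_geolocation": [
        {"suggested": "pl", "chosen": "en"},
        {"suggested": "pl", "chosen": "en"},
        {"suggested": "pl", "chosen": "pl"},
        {"suggested": "pl", "chosen": "fi"},
    ],
    "os_language": [
        {"suggested": "en", "chosen": "en"},
        {"suggested": "en", "chosen": "en"},
        {"suggested": "fi", "chosen": "fi"},
        {"suggested": "en", "chosen": "fi"},
    ],
}

offsets = compare_methods(logs_by_method)
```

In this made-up sample the IP-based guess is rejected in three of four cases and the OS language in one of four. Whatever method you favour, this is the number that tells you how far off it is.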

If you determine that a user is online using a modem connection you may offer the possibility to change to a more lightweight plain HTML version of the booking site (consider asking a user only once for a start). It might be much appreciated and could greatly contribute to the value perception of your site.

No matter what: the use of small predictions to enhance browsing experiences on the basis of light attributions (including a generic fallback) is surely appreciated. If you save the results of the attributions you get a record of how well your predictions have worked for your users, and over time you can apply more thorough prediction patterns and methods.

A downside comes with it, though: you need to capture, read, process, and re-read a lot of data. But I do seriously think that it makes sense to start putting the emphasis on the second word of the term business intelligence.

Continue reading the last part of this article: on improving web services through better tagging