Noise to Signal


Market Basket Analysis using Google Analytics Data

Ever since I learned about Market Basket Analysis, my head was spinning with ideas on how it could be applied to web data. To back up for a second, Market Basket Analysis (MBA) is a data mining technique that catalogs the strength of the relationships between combinations of items placed together during a transaction. Applications often include:

  • Recommending content in the form of “Users who view X and Y also view Z”
  • Offering promotions for combinations of items to increase revenue
  • Better understanding of user behavior and intent
  • Updating editorial decisions based on popular combinations of items

The typical use case, and where the name is derived, is in the retail setting where marketers want to know what products are commonly associated with one another during checkout. The reason we need fancy algorithms for this type of analysis is due to the explosion of combinations to evaluate. As an example, if you wanted to look at every combination of 3 items out of a set of 50 items, you would have ~20,000 combinations to evaluate. That number expands immensely as you increase the number of unique items and the size of combinations.
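The combinatorics are easy to verify in R with the built-in `choose()` function:

```r
# Number of ways to choose 3 items from a set of 50
choose(50, 3)    # 19600 (~20,000, as noted above)

# The count explodes as the item set and combination size grow
choose(500, 3)   # 20708500 (~20.7 million)
choose(500, 5)   # ~2.55e11
```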

In the retail case, the “item” is a “product” and the “transaction” is “checkout”. However, the algorithm underlying MBA doesn’t care what you use as an “item” and “transaction”. We can just as easily run an analysis that looks at web pages as “items” and browsing sessions as “transactions”. Going further, if we have information related to webpage taxonomy and unique user IDs, we can abstract the analysis away from individual pages and look at taxonomy tags as “items” and users as “transactions”. Hopefully that gives some flavor for how flexible this analysis can be.

As a simple example, we’ll run MBA on my own personal blog. Given my small number of pages and limited amount of traffic, this analysis won’t do justice to the full power of MBA. Just be aware that this technique scales to thousands of items and tens of thousands of transactions without much effort.

The code used to generate this post can be found here.

Pulling Data from Google Analytics

I’m interested in understanding the combinations of pages that users visit during a session so that I might recommend new pages of interest during their journey. Perhaps I plan on asking my editorial team to manually attach these recommendations in WordPress or perhaps I plan to feed this information into some sort of automated personalization engine. The first step is to pull down our “items” (webpages) and “transactions” (session IDs). We’ll do this by calling the Google Analytics reporting API with the googleAnalyticsR library and grabbing pages, landing pages, and session ids.

I’ve included the session ‘landing page’ and its purpose becomes clear once you think about how we plan to use our results. On its own, MBA doesn’t provide any information about the sequence of items in a transaction, it simply indicates that “these items are associated”. Given that we want to recommend a new page to a user during their journey, we want to avoid recommending a page that is commonly associated with the start of a journey. In other words, let’s not recommend a landing page after they’ve landed!

To resolve this issue, we’ll tag the starting pages with ‘ENTRANCE-’ at the beginning. In the example session below, you can see how I differentiate someone starting their session on the ‘differential scroll tracking’ blog post. If they had not started their session there, the page path would not include ‘ENTRANCE-‘. We make no distinction regarding the ordering of the non-entrance pages.
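The tagging itself is a one-liner in base R. This is a sketch only: the column names and page paths are illustrative, and rows are assumed to be ordered so the first row of each session is the landing page.

```r
# Illustrative hit-level data: one row per pageview, ordered within session
df <- data.frame(
  session_id = c("s1", "s1", "s2"),
  page_path  = c("/scroll-tracking/", "/blog/", "/blog/"),
  stringsAsFactors = FALSE
)

# The first row per session is the entrance; prefix it with "ENTRANCE-"
first_hit <- !duplicated(df$session_id)
df$page_path[first_hit] <- paste0("ENTRANCE-", df$page_path[first_hit])

df$page_path
# "ENTRANCE-/scroll-tracking/" "/blog/" "ENTRANCE-/blog/"
```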

Session ID | Page Path

Looking at the table below, we can see that most sessions contained only 1 pageview and that the number of pageviews tapers off after that. It’s good to get a general sense of the shape of the data before running MBA because it will influence the size of the combinations that we can reasonably expect to find. For example, it would be unreasonable to look for combinations of 9 different pages because only 1 session generated a combination of that length.
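A quick way to get that shape from raw hit data is to tabulate pageviews per session and then tabulate the tabulation (toy data shown):

```r
# One entry per pageview, labeled with its session ID (illustrative)
sessions <- c("s1", "s1", "s2", "s3", "s3", "s3", "s4")

# Inner table() counts pageviews per session; outer table() counts
# how many sessions had each pageview count
tab <- table(table(sessions))
tab
# 1 2 3
# 2 1 1   -> two 1-pageview sessions, one 2-pageview, one 3-pageview
```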

Running Market Basket Analysis

To run our Market Basket Analysis, we’ll use the arules package in R. Before we look at any results, it might be helpful to cover some terminology that often appears in MBA:

  • Itemsets – these are combinations of items and are often associated with a count indicating how frequently the combination appeared in the transaction history. You’ll often see size-2 itemsets, size-3 itemsets, etc., indicating how many unique items appear in the itemset.
  • Support – This is the percentage of transactions in which the itemset (or association rule, covered next) appears
  • Association Rules – These are presented in the format of “{Left Hand Side} => {Right Hand Side}” and indicate that transactions that contain itemsets on the LHS also include the item on the RHS. Note that the RHS only ever contains 1 item while the LHS can contain an itemset of any size.
  • Confidence – This is a percentage indicating the strength of our association rule. It says “Out of the users who visited the items in the LHS, XX% visited the RHS”. This is helpful, but can be misleading when items in the RHS are ubiquitous and relevant to nearly every combination of items. To resolve this, we often look at both confidence and lift.
  • Lift – This number indicates how much more likely we are to see the LHS and RHS together in a transaction than we would expect if they were independent. A lift of 3 means we’re 3x more likely to see these items together, and a lift of .33 means we’re 1/3 as likely.
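These definitions are easy to verify by hand on a toy transaction set. The base-R sketch below is for intuition only; it is not how the arules package computes things internally.

```r
# Toy transactions: each element is one session's set of pages
transactions <- list(
  c("A", "B"), c("A", "B", "C"), c("A"), c("B", "C"), c("A", "B")
)
n <- length(transactions)

# How many transactions contain all of the given items?
contains <- function(items) {
  sum(sapply(transactions, function(t) all(items %in% t)))
}

# Metrics for the rule {A} => {B}
supp_AB <- contains(c("A", "B")) / n      # support: P(A and B)
conf    <- supp_AB / (contains("A") / n)  # confidence: P(B | A)
lift    <- conf / (contains("B") / n)     # confidence vs. baseline P(B)

c(support = supp_AB, confidence = conf, lift = lift)
# support = 0.6, confidence = 0.75, lift = 0.9375
```

A lift below 1, as here, means A and B co-occur slightly less often than independence would predict.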

With that, let’s get started. The following table shows the top 4 itemsets discovered via MBA, sorted by support.


Next, we run the Apriori algorithm to find association rules. Remember that we want to filter out any association rules where the ‘entrance’ page is on the RHS. This ensures that we never recommend an entrance page.
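In arules itself this filtering is typically done with `subset()` on the rules object; as a library-free sketch, the same idea applied to rules coerced into a data frame looks like this (column contents are illustrative):

```r
# Hypothetical rules coerced to a data frame with lhs/rhs columns
rules <- data.frame(
  lhs = c("{ENTRANCE-/a/}", "{/b/}", "{/c/}"),
  rhs = c("{/b/}", "{ENTRANCE-/a/}", "{/blog/}"),
  stringsAsFactors = FALSE
)

# Keep only rules whose right-hand side is NOT an entrance page
keep <- !grepl("^\\{ENTRANCE-", rules$rhs)
rules[keep, ]
```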

The best way to present association rules is often in a scatter chart that allows us to look at support, confidence, and lift in one view. Below, you can see that 4 association rules were generated that have a minimum support of 2% and a minimum confidence of 80%.

| # | Association Rule | Confidence | Support | Lift |
|---|------------------|------------|---------|------|
| 1 | {ENTRANCE-/2016/08/google-analytics-autotrack-js-updated-see-whats-new/} => {/2016/02/deploying-autotrack-js-through-google-tag-manager/} | 87.50% | 3.08% | 15.28 |
| 2 | {/2020/03/mobile-app-live-streaming-analytics-case-study-hope-channel/} => {/blog/} | 90.00% | 3.96% | 2.65 |
| 3 | {ENTRANCE-/2016/02/deploying-autotrack-js-through-google-tag-manager/} => {/2016/08/google-analytics-autotrack-js-updated-see-whats-new/} | 92.00% | 20.26% | 3.60 |
| 4 | {/2020/03/mobile-app-live-streaming-analytics-case-study-hope-channel/, ENTRANCE-/} => {/blog/} | 100.00% | 3.52% | 2.95 |

Analysis of Results

The scatter plot and table yield some interesting results. First, I should point out that finding an association rule with strong support, confidence, and lift is the holy grail, but exceedingly rare. Most commonly, you’ll find items with high confidence and low support, or high support and low confidence.

Notice that many of the itemsets we discovered previously, such as {Blog, ENTRANCE-/}, didn’t make the cut as association rules. This is because we’re filtering to search for association rules with a minimum confidence of 80%. This is important to avoid the situation where we recommend content that is broadly popular, but not tailored to the user’s unique viewing history.

So what can we determine from the graph and table above?

  • Rules #1 and #3 are nearly mirrors of one another; remember that the ‘entrance’ version of each page is considered a unique page. What stands out is the high lift – these 2 pages are clearly connected to one another in a way that stands apart from their connection to other pages.
  • Rule #3 is interesting because of the high confidence and high support. This is often hard to find. When I review some of the analytics underlying these figures, I see that my blog is generating a lot of SEO traffic to the ‘deploying auto track’ page and that those users are going onto the 2nd page 92% of the time. If we look at the page in question, we can see that I have an “Update” callout. It looks like that callout is working very well!
  • Rule #2 is notable because it doesn’t include an ‘entrance’ page. It’s a nice, broadly applicable rule stating that users who visit my live streaming case study are 90% likely to visit, or to have visited, the blog landing page.
  • Rule #4 is interesting given the 100% confidence (which I doubt you would ever see in a more realistic scenario). What this says is that if a user enters on the home page and at some point visits my live streaming case study then at some point they will (or will have already), with 100% certainty, visit the blog landing page. Notice that I have to emphasize the fact that this analysis gives no indication of the ordering of events. If we wanted to turn this rule into a content recommendation, we would likely want to check their browsing history first to avoid recommending a page they’ve already visited.

Closing Thoughts

Hopefully the analysis above shows how MBA can help someone dig deeper into user behavior and start looking at metrics for patterns as opposed to metrics for individual pages/products. While I used individual pages as the “items” above, websites with thousands of pages may benefit from an analysis centered on content taxonomy such as “content types”, “tags”, or “topics”. This makes the results much easier to interpret. One application of such an analysis may be feedback for the editorial team to focus on content that contains specific combinations of topics. Happy analyzing!

Viewing Google Analytics Segment Overlap in R

Google Analytics segments are a fantastic way to organize the results of an analysis. There are, however, a few limitations of using segments in GA:

  1. They cause reports to become sampled after 500,000 sessions (or 100M with GA360)
  2. Only 4 segments can be compared at one time
  3. Segments are saved under your Google account which makes sharing them a pain
  4. When comparing segments, it’s hard to tell how much they overlap

All of these limitations can be resolved by bringing your Google Analytics data into R with the googleAnalyticsR library, but this post will focus on #4 above: Understanding segment overlap. The code generating this blog post can be found here.

The Problem with Segment Overlap

Segments are fairly straightforward to create in GA, but can trip users up in a number of ways. One common issue arises when users fail to account for segment overlap. Why should you care whether your segments overlap? Because you’ll want to interpret your segment metrics entirely differently depending on the answer. Let me explain via a scenario I see fairly often.

Sally is a marketing director in charge of a major pet retailer’s website redesign. She worked with her branding agency to develop 3 different personas that they expect to find on their website: Cat Lovers, Dog Lovers, and Wholesale distributors. The UX of the website is tailored to these personas, and Sally is confronted with the question of how to report on website success. A natural decision is to frame the reporting KPIs around the personas developed earlier. She instructs her analytics team to create segments based on their personas.

Here’s where things start to break down. The analytics team is left to decide what behavior on the website indicates whether a user is one of those 3 personas. A very reasonable-seeming decision may be as follows:

  • Users who visit the /cats section are included in the ‘Cat lovers’ segment
  • Users who visit the /dogs section are included in the ‘Dog lovers’ segment
  • Users who log in and visit the /bulk-order section are included in the ‘Wholesalers’ segment.

A week after launch, the analytics team presents the following results:

  • Dog Lovers – 500 users, 5% conversion rate
  • Cat Lovers – 400 users, 4% conversion rate
  • Wholesalers – 200 users, 16% conversion rate

Amazing! Sally loves these numbers. The only problem is that they’re meaningless. What the analytics team failed to consider is that their wholesalers always browse the /cats or /dogs sections before making their bulk orders. This means that those 500 Dog Lovers and 400 Cat Lovers are polluted with 200 Wholesalers. Think about how the 16% conversion rate of the wholesalers might artificially inflate the conversion rates of the Dog and Cat Lovers segments.

The setup here is a bit contrived, but I’ve seen many flavors of it before. The original sin was attempting to convert UX personas into analytics segments. This encourages consumers of these reports to assume that the analytics segments are mutually exclusive when they are not. Analytics segments can only highlight behavior, not who the person is. Honestly naming segments, such as “Visited /cats Section”, is often the best way to emphasize this reality.

What does this have to do with overlap?

The problem above was that the report gave the impression that the segments were mutually exclusive when, in fact, they contained quite a bit of overlap. Without understanding the overlap, how can you interpret those numbers? Do we have 500+400+200=1100 users? Or do we have 200+(500-200)+(400-200)=700 users, as would be the case if the 200 wholesalers were represented in all segments? In a more extreme scenario, you may be looking at 3 segments that all report on the exact same set of users.
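With user IDs in hand, the de-duplication is a one-liner. The sketch below fabricates IDs to mirror the pet-retailer numbers, with the 200 wholesalers appearing in every segment:

```r
# Illustrative user IDs per segment; wholesalers appear in all three
wholesalers <- paste0("w", 1:200)
dog_lovers  <- c(wholesalers, paste0("d", 1:300))  # 500 users total
cat_lovers  <- c(wholesalers, paste0("c", 1:200))  # 400 users total

# Naive sum double-counts the wholesalers...
length(dog_lovers) + length(cat_lovers) + length(wholesalers)  # 1100

# ...while the de-duplicated union gives the true audience size
length(unique(c(dog_lovers, cat_lovers, wholesalers)))         # 700
```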

As an example, how might you interpret those numbers above given each of these scenarios?

Scenario 1: Small, Even Overlap

Scenario 2: Large, Even Overlap

Scenario 3: Large, Uneven Overlap

Scenario one is likely what the stakeholders at our pet company assumed would be the case – some slight overlap exists, but the metrics sufficiently indicate the behaviors of ‘Dog’ and ‘Cat’ lovers individually.

However, scenario two might be the reality. Perhaps 90% of their users love to compare prices across cat/dog products and visit each section at least once.

Or perhaps scenario 3 is the reality. Maybe a coupon link brought users to start their journey under /dogs which left the cat owners to then move over to /cats.

Unfortunately, there’s no way in standard GA to tell which scenario is actually occurring (though the new app+web version includes this feature). This is unfortunate, because each scenario would cause our stakeholders to interpret the segment metrics very differently.

So let’s move on to solving this issue in R.

Pulling GA Segment Data into R

I don’t have access to a pet retailer’s website, but I’m happy to share metrics from my own blog. In this scenario, I’ll create 3 segments:

  • Users who visit /blog
  • Users who visit /portfolio
  • Users who visit the home page (denoted as “/”)

Admittedly, these segments aren’t very interesting, but they mirror a common method of building segments based on page visits that are not necessarily mutually exclusive. With the googleAnalyticsR library, we can create these GA segments on the fly and pull down the appropriate data from GA. Note: for this to work, you’ll need access to a user ID, which could be the GA client ID. There’s a great article here on capturing client IDs in GA using custom dimensions.

The code below shows how we can define our GA segments and pull the data.

# Use a function to generate our segments because each of the 3 segments is defined very similarly
create_pagePath_segment <- function(pagePath, operator){
  se_visited_page <- segment_element("pagePath", operator = operator, type = "DIMENSION", expression = pagePath)
  sv_visited_page <- segment_vector_simple(list(list(se_visited_page)))
  sd_visited_page <- segment_define(list(sv_visited_page))
  segment_ga4(paste0("Visited Page: ", pagePath), session_segment = sd_visited_page)
}

# Generate our 3 segments (the home page uses EXACT so "/" doesn't match every path)
page_a <- "/blog"
page_b <- "/portfolio"
page_c <- "/"
s_visited_page_a <- create_pagePath_segment(page_a, "REGEX")
s_visited_page_b <- create_pagePath_segment(page_b, "REGEX")
s_visited_page_c <- create_pagePath_segment(page_c, "EXACT")

#Pull data from GA
ga <- google_analytics(viewId=view_id, date_range = c(Sys.Date()-300,Sys.Date()-1),
                       metrics = "sessions", dimensions = c(paste0("dimension",client_id_index)),
                       max=-1, segments = list(s_visited_page_a,s_visited_page_b, s_visited_page_c))

Visualizing Segment Overlap

Our next task is to visualize the overlap as a Venn diagram. We’ll use the VennDiagram library in R to do so.

# Define names of segments from the segment column
segment_names <- unique(ga$segment)
# Create a list of client IDs for each segment (dimension2 holds the client ID in this view, i.e. client_id_index is 2)
segments <- lapply(segment_names, function(x){ga %>% filter(segment == x) %>% select(dimension2) %>% pull()})
colors <- brewer.pal(length(segment_names), "Dark2")

# Generate Venn diagram
diag <- venn.diagram(segments, 
             category.names = segment_names,
             width = 600,
             height= 600,
             resolution = 130,
             imagetype="png" ,
             filename = "ga_venn.png",
             cat.fontfamily = "sans",
             fontfamily = "sans",
             cat.col = colors,
             col = colors,
             fill = colors,
             cat.dist = c(.1,.1,.05),
             margin = c(.15,.15,.15))

# By default, the VennDiagram package outputs to disk, so we load the generated image here for display

While the plot above doesn’t scale the circles based on the size of the segment, it’s easy to interpret the overlap between the segments. Here we can see that 176 users visit the homepage and that a little less than 10% of those users went on to visit the blog AND the portfolio section (as denoted by the “16” in the middle).

With that, I’ll leave you with a happy accident in exploring the capabilities of the VennDiagram R library. Something you can look forward to if you start using this on your own data: a Venn diagram with 5 segments!

Causal Impact + Google Analytics – Evaluating the Effect of COVID19 on Hospital Appointments

The CausalImpact R library measures the effects of an event on a response variable when establishing a traditional control group through a randomized trial is not a viable option. It does this by establishing a ‘synthetic control’ which serves as a baseline under which the actual data is compared.

In this tutorial, we’ll look at the effect that the Coronavirus outbreak had on the number of “Make an Appointment” forms completed on a hospital website. The code for this post can be found here. To begin, we must establish a “pre-period” before the event occurred and a “post-period” after the event occurred. The pre-period is used to train a Bayesian Structural Time Series model. In the post-period, the model is used to predict our synthetic control which indicates how the outcome may have performed were the event not to have occurred.

Our pre-period will be 10/1/2019 to 3/15/2020 and our post-period will be 3/16/2020 – 5/4/2020. Our predictor variables will be the number of sessions from organic, social, and referral sources. An important assumption made by the CausalImpact library is that our predictors are not affected by our event.

Gathering Data from Google Analytics

First, we must gather the data necessary for our analysis. Our response variable, as established earlier, will be “Make an Appointment” form completions, which is the goal1Completions metric in GA. Our predictor variables will come from the sessions metric, broken out by channel grouping.

We know that the hospital suspended paid media around the time of the outbreak so we’ll remove traffic from paid sources using the following filter:

channel_filter <- dim_filter(dimension="channelGrouping",operator="REGEXP",expressions="Paid Search|Display",not = T)

We call the Google Analytics reporting API twice. Once to gather the goal completion data:

# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = "goal1Completions",
                       dimensions = c("date"),
                       dim_filters = my_filter_clause,
                       max = -1)

and once to gather the channel session data:

df_sessions <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = c("sessions"),
                       dimensions = c("date","channelGrouping"),
                       max = -1,
                       dim_filters = my_filter_clause)

Pulling the goals separately avoids having to aggregate the goal data after pivoting the session data. Pivoting generates one sessions column per channel from our single channelGrouping column. Putting this all together is shown below.
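The pivot step turns long (date, channel) rows into one column per channel. A base-R `xtabs()` sketch of the same reshaping, with illustrative numbers:

```r
# Long format: one row per date/channel combination
df_long <- data.frame(
  date = rep(c("2020-03-01", "2020-03-02"), each = 2),
  channelGrouping = rep(c("Organic Search", "Referral"), 2),
  sessions = c(100, 20, 110, 25),
  stringsAsFactors = FALSE
)

# Wide format: one column per channel (what pivot_wider produces)
wide <- xtabs(sessions ~ date + channelGrouping, data = df_long)
wide
```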

date_range <- c("2019-10-01","2020-05-04")

# Remove paid traffic
channel_filter <- dim_filter(dimension="channelGrouping",operator="REGEXP",expressions="Paid Search|Display",not = T)
my_filter_clause <- filter_clause_ga4(list(channel_filter))

# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = "goal1Completions",
                       dimensions = c("date"),
                       dim_filters = my_filter_clause,
                       max = -1)
# Gather session data
df_sessions <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = c("sessions"),
                       dimensions = c("date","channelGrouping"),
                       max = -1,
                       dim_filters = my_filter_clause) %>% 
   pivot_wider(id_cols=date,names_from=channelGrouping,values_from=sessions)

# Merge the goal completion data into the sessions data
df <- df_sessions %>% mutate(y = df_goals$goal1Completions)

Create BSTS Model

The following code creates a Bayesian Structural Time Series model that will be used by the CausalImpact library to generate our synthetic control. It’s here that we input our pre-period and post-period as well as our predictor and response variables.

The BSTS package has several options for modifying our model. Here, we apply a “local level” component, which captures the slowly varying level of the response variable. We also capture the 7-day weekly pattern in our data using AddSeasonal().

df2 <- df # Create a copy of our DF so we can re-run after removing the response data from the prediction period

# Assign pre and post periods
pre.period <- c(1,which(df$date == "2020-03-15"))
post.period <- c(which(df$date == "2020-03-15")+1,length(df$date))
post.period.response <- df$y[post.period[1] : post.period[2]]

# Remove outcomes from the post-period. The BSTS model should be ignorant of the values we intend to predict
df2$y[post.period[1] : post.period[2]] <- NA

# Create a zoo object which adds dates to plot output
df_zoo <- read.zoo(df2, format = "%Y-%m-%d") 

# Add local and seasonal trends
ss <- AddLocalLevel(list(), df_zoo$y)
ss <- AddSeasonal(ss, df_zoo$y, nseasons = 7) # weekly seasonal trend
bsts.model <- bsts(y ~ ., ss, niter = 1000, data = df_zoo, family = "gaussian", ping = 0)


The blue dots are the actual data points and the black line underneath is our estimated posterior distribution. We can see that the model does a reasonable job of predicting form completions, though there are some outliers in late February that are not well predicted. This will increase our uncertainty in our predictions and thus widen our confidence interval (the shading around the black line).

Generate Causal Impact Analysis

Now that we have our model, we can compare our prediction to what actually happened and measure the impact of the event.

impact <- CausalImpact(bsts.model = bsts.model,
                       post.period.response = post.period.response)


The top plot shows the actual data in black and our predicted distribution of the response variable in blue with the median value as a dashed blue line. The 2nd plot subtracts the predicted data from the actual data to show the difference between the two values. If the event had no impact, we would expect the pointwise estimate to hover around 0. The last plot shows the cumulative impact of the event over time. Notice how our confidence interval (shown in blue) widens as time goes on.

Our causal impact model confirms a decrease in the number of form completions; however, the 95% confidence interval quickly includes 0, which means that we cannot say with certainty that the impact extends into April. While we weren’t able to find conclusive results, being able to measure our certainty is a major benefit of Bayesian models such as this one.

Validating Our Synthetic Control

One method of validating your model is to generate predictions before the event occurred. If our model is well-behaved, we should see little difference between the predicted and actual response data.

# Filter to include only pre-event data. Also reorder columns to place y after the date
df_compare <- df %>% filter(date < "2020-02-15") %>% select(date,last_col(),2:length(df))

df_zoo <- read.zoo(df_compare, format = "%Y-%m-%d")

pre.period <- c(index(df_zoo)[1],index(df_zoo)[which(df_compare$date == "2020-01-15")])
post.period <- c(index(df_zoo)[which(df_compare$date == "2020-01-15")+1],index(df_zoo)[length(df_compare$date)])

impact <- CausalImpact(df_zoo, pre.period, post.period)


Above we see that the model doesn’t do a great job of predicting the upper spikes of the form completions which likely explains the wide confidence interval seen earlier.

Comparison to the Naive Approach

Deploying advanced modeling techniques is only useful if there are advantages over much simpler techniques. The naive method would be to use our pre-intervention data to establish an average and continue that average into the post-period to estimate a synthetic control.

Before the event, we had about 19 form fills a day. After, we had about 8.5 a day. That’s a decrease of roughly 55%. CausalImpact estimated a decrease of 44% with a 95% confidence interval of 29%-63%. Had these numbers been substantially different, and had we confidence in our model, we would prefer the figures generated by CausalImpact.
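The naive estimate is just arithmetic on the daily averages quoted above (rounded figures, for illustration):

```r
# Daily averages of form completions before and after the event
pre_avg  <- 19
post_avg <- 8.5

# Naive percentage decrease, ignoring trend and seasonality
naive_decrease <- (pre_avg - post_avg) / pre_avg
round(naive_decrease * 100)  # 55
```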

There are some clear cases when modeling will outperform the naive approach described above:

  • If there is a trend in the response variable, then averaging the pre-period will not capture the continuation of that trend.
  • If evaluating the degree of confidence is important, the CausalImpact model is preferable due to its ability to measure uncertainty.

Mobile App & Live Streaming Analytics: A Case Study with Hope Channel International, Inc.

Over the last 3 years, Noise to Signal has had the pleasure of designing and implementing a robust analytics system for Hope Channel International, the media arm of the Seventh-Day Adventist Church. Hope Channel’s shows focus on faith, health, and community and reach millions of viewers across dozens of countries in just as many languages. When I was introduced to Hope Channel in 2017, they didn’t have any hard data related to video performance or live-stream viewership. They were, in effect, flying blind as it related to scheduling and programming decisions. Today, they have near real-time access to granular data related to show performance and viewership trends.

I was introduced to Hope Channel by way of Brightcove, a Boston-based company that provides an ecosystem of software and services that support online video. At this point, Hope Channel was about to launch an ambitious undertaking with Brightcove: the development of multimedia apps targeting 6 different mobile devices and set-top-boxes (ex: Roku). These apps would provide Hope Channel with over-the-top distribution capabilities, resulting in a closer connection between the church and its viewers. In this project, Hope Channel leadership saw an opportunity to finally collect the viewership analytics they were lacking and welcomed the introduction to Noise to Signal. That relationship has culminated in two major projects that I’ll describe here: 1) app analytics and 2) live stream analytics.

App Analytics

A key benefit of the over-the-top media distribution model is that the platform developer can own the user’s experience rather than relying on third party broadcasters. With this in mind, Hope Channel targeted 6 different app platforms that were likely to reach the broadest set of viewers: iPhone, Apple TV, Android, Android TV, Fire TV, and Roku. This presented an analytics design challenge: How would Noise to Signal ensure that the final reports showed data consolidated across all devices?

To solve this challenge, Noise to Signal made clever use of Google Analytics (GA), Google Firebase, Google Tag Manager (GTM), Google Data Studio, and GA’s Measurement Protocol. The final product is a Data Studio dashboard that shows data collected from each app and the ability to partition the data by dimensions such as app, region, language, date, and show. Example reporting events include: app installation, app open, screen view, video-on-demand video play, live-stream video play, and language change, among others. 

App Analytics Dashboard

The technical details are below, but it’s worth mentioning first the amount of documentation and communication that was necessary for each of the composite pieces described below to work together. Brightcove managed separate developers for each app platform, Hope Channel assisted with requirements and user acceptance testing, and two consultants from Noise to Signal provided project management and analytics implementation services. Each reporting event was painstakingly designed to behave consistently across apps and output the same data format. This meant validating the technical feasibility of each event across all app platforms and delivering precise instructions to each development team. The final outcome was truly a team effort.

Technical Details (if you are so inclined)

As mentioned previously, a key challenge was ensuring that data from each app could be collected consistently and stored and reported out of a single data source. The following describes how each technology was deployed to accomplish this purpose.

App Analytics System Diagram
  1. The Fire TV and Android TV implementations were perhaps the easiest to design because Brightcove developed these solutions as a wrapper around a webpage. Users of these devices would load a Fire TV “channel” and would be unaware that the content displayed was basic HTML. GA and GTM were built with websites in mind which made them a natural fit. However, because we were blending this “website” data with “mobile app” data from other sources, we had to find a way to transform our “website” hits into “mobile app” hits. To do this, we used custom tasks and some clever JavaScript to modify the hits as they were transported to GA. 
  2. The iOS and Android solutions benefited from the fact that both platforms are supported by Google Firebase, which provides a suite of mobile app development accelerators including crash reporting and analytics. Furthermore, Google provides a seamless integration between GTM’s mobile SDK and Firebase whereby Firebase events are automatically converted into GTM events. Those GTM events were then converted into GA hits. While this implementation led to a longer chain of events (Firebase → GTM → GA), Hope Channel was able to reap the benefits of Firebase which, among other things, includes the ability to segment viewers and generate lookalike audiences in Google Ads. 
  3. Roku and tvOS presented a conundrum. Neither is supported by Google Firebase or GTM, which meant the developers would need to work with GA’s measurement protocol directly. Fortunately, however, we were able to find community-built libraries for each platform that could be modified to suit our needs. This produced a more challenging documentation, implementation, and testing process, as each hit was built from scratch with parameters that shifted depending on the situation. It was during this process that I became all too familiar with the Charles HTTP Proxy, which is often seen as a rite of passage in the analytics and testing world. A tool used only by the downtrodden and desperate!
  4. The GA configuration, as mentioned above, was set to focus on mobile app analytics. This meant that “pageviews” were replaced with “screenviews,” among other changes. The brunt of the reporting depended on custom dimensions such as “App Name,” “Episode Title,” “Show Title,” and “Affiliate Name” (18 custom dimensions and 3 custom metrics in total). By including the “App Name” as a custom dimension, filters could be constructed that produced unique “views” for each app platform. A single, consolidated view could then be constructed to merge all app data.
  5. There were two categories of end-users for the data collected in the previous steps: Hope Channel’s administrators and their local “affiliate” program managers. The administrators were given a dashboard that showed data collected across all apps and all affiliates. Using “report filters,” individual dashboards were then created for each affiliate program manager showing data specific to their affiliate station only (e.g., Poland or Brazil).
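The “website hit to app hit” transformation from the Fire TV / Android TV implementation can be sketched with the analytics.js customTask hook. This is a minimal illustration, not the production code: the app name, page path, and stand-in model object are placeholders invented for this example, though `dataSource`, `appName`, and `screenName` are real analytics.js field names.

```javascript
// Sketch of a customTask that rewrites a "website" hit as a "mobile app" hit
// before it is transported to GA. App name and page path are placeholders.
function webHitToAppHit(model) {
  // Tell GA this hit came from a mobile app rather than a website
  model.set('dataSource', 'app', true);
  model.set('appName', 'Fire TV App', true);
  // Report the current page path as a screen name
  model.set('screenName', model.get('page'), true);
}

// Tiny stand-in for the analytics.js model object, for demonstration only
function makeModel(fields) {
  return {
    fields: fields,
    get: function (name) { return this.fields[name]; },
    set: function (name, value) { this.fields[name] = value; }
  };
}

var model = makeModel({ page: '/live/espanol' });
webHitToAppHit(model);
// In a real tag you would register the function instead:
// ga('set', 'customTask', webHitToAppHit);
```

In a live implementation, the function is registered via `ga('set', 'customTask', …)` so analytics.js applies it to every hit automatically.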

Live Stream Analytics

The work above met a majority of Hope Channel’s requirements, but we knew there would be a major reporting gap related to live stream analytics. Our reports showed when users watched live stream shows but not what they watched. This was unfortunate given that live stream play events were 30x more frequent than video-on-demand plays. The underlying problem was that the apps had no knowledge of what shows were playing on any live stream at any given time. The only system with that knowledge was the live stream broadcast server which, at that time, had no structured method of sharing its information with other systems.

This second phase of work focused on closing this gap. We devised a system whereby the individual apps would send out pings every 20 seconds to a custom data collection endpoint built in Google Cloud Platform. These pings answered “when” users were engaging with a live stream channel. The “what” question would be answered by a custom API, built by Hope Channel, that accessed their live stream broadcast schedule. With these two pieces of information available in a structured manner, Noise to Signal was then able to stitch these data points together and provide robust analytics related to live stream viewership and show popularity. Data Studio was once again deployed as the reporting method while Google BigQuery was used to store the raw and stitched data.

Live Stream Analytics – Viewers per Minute
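The 20-second ping can be sketched as a small URL builder. The endpoint and query parameter names below are assumptions made for this example; only the attribute list (timestamp, client ID, app version, channel ID, language, country) comes from the actual design.

```javascript
// Illustrative sketch of the 20-second live stream ping. The endpoint and
// query parameter names are assumptions, not the real collection schema.
function buildPingUrl(endpoint, ping) {
  var params = [
    'ts=' + encodeURIComponent(ping.timestamp),  // when the ping fired
    'cid=' + encodeURIComponent(ping.clientId),  // anonymous client ID
    'av=' + encodeURIComponent(ping.appVersion),
    'ch=' + encodeURIComponent(ping.channelId),  // live stream channel
    'lang=' + encodeURIComponent(ping.language),
    'cc=' + encodeURIComponent(ping.country)
  ];
  return endpoint + '?' + params.join('&');
}

var url = buildPingUrl('https://example.com/collect', {
  timestamp: '2020-06-01T12:00:00Z',
  clientId: 'abc123',
  appVersion: '1.2.0',
  channelId: 'espanol',
  language: 'es',
  country: 'MX'
});
// Each app would request a URL like this every 20 seconds during playback
```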

Technical Details (if you are so inclined)

The key challenge in designing this solution was one of scale. Our estimated data collection rates indicated that GA, with its 20 million hits per month limit, would not be an appropriate solution. We also had to consider how the collection rate might increase as Hope Channel continued to promote their newly developed apps. To address these challenges, we turned to Google Cloud Platform and specifically Cloud Functions and BigQuery.

  1. As part of this project, each of the 6 apps was updated by Brightcove to implement the 20-second ping procedure. Each ping included a small set of attributes to aid in downstream reporting: timestamp (most importantly), client ID, app version, live stream channel ID, language, and country.
  2. The broadcast system provided a key piece of information: what shows play at what time on what live stream channels. This data was provided by a custom-built API, built by Hope Channel, that accepted a date range and channel ID and returned the appropriate broadcast schedule.
  3. Noise to Signal’s implementation work began by constructing two HTTP Cloud Functions: one built to collect the app pings and another to collect the broadcast schedule. If you aren’t familiar with Cloud Functions, you may be familiar with Amazon’s Lambda solution or, more generally, the concept of serverless computing. Once the data is collected, it’s transformed and written to BigQuery. One main benefit of Cloud Functions is the ability to automatically scale based on demand. The chart below shows 165 virtual machines spun up at one time to collect data! One reasonable question may be, “but how much does it cost?” While I can’t go into specifics, the answer is: not a lot. More details can be found on the Cloud Functions pricing page.
Number of Concurrent Virtual Machines (Google Cloud Functions)
  4. Another question of scale was related to data storage. Given the volume of data (already known to be greater than 20 million hits per month), where would this be stored and could reports be generated in a reasonable amount of time? Here we turned to the combination of BigQuery and Data Studio. BigQuery partitioned tables were generated to collect the raw live stream data and broadcast schedules. Scheduled queries were then constructed to merge this data together. This is a computationally expensive operation that determines whether each live stream ping’s timestamp falls between the start and end time of any particular show. By computing this in advance and storing the results, downstream reports load more quickly. Finally, views were created to provide the metrics and dimensions used in the Google Data Studio reports.
  5. Finally, Data Studio was implemented as the dashboard and reporting solution. Importantly, a Data Studio parameter was created that controls the time zone displayed in the report (a feature I would love to see built into Data Studio; you can upvote the feature request here). The final dashboard is able to calculate important metrics such as average minutes watched per viewer, average number of shows watched per viewer, and minute-by-minute viewing trends split by app, country, or language.
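The stitching operation described above matches each ping to the show whose broadcast window contains the ping’s timestamp on the same channel. The production version ran as BigQuery scheduled queries, so the JavaScript below is purely illustrative, with made-up channel and show names:

```javascript
// Illustrative sketch of the ping-to-show stitching logic. In production
// this ran as BigQuery scheduled queries; names below are made up.
function stitch(pings, schedule) {
  return pings.map(function (ping) {
    var match = schedule.find(function (show) {
      return show.channelId === ping.channelId &&
        ping.timestamp >= show.start &&
        ping.timestamp < show.end;
    });
    return {
      timestamp: ping.timestamp,
      channelId: ping.channelId,
      show: match ? match.title : null // null if nothing was scheduled
    };
  });
}

var schedule = [
  { channelId: 'espanol', start: '2020-06-01T12:00:00Z',
    end: '2020-06-01T12:30:00Z', title: 'Morning Show' }
];
var pings = [{ timestamp: '2020-06-01T12:10:00Z', channelId: 'espanol' }];
// stitch(pings, schedule)[0].show === 'Morning Show'
```

Precomputing this join and storing the results is what keeps the downstream Data Studio reports fast.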


The impact of this project goes beyond generating a few visually appealing reports. Hope Channel employees work hard every day to produce the best content possible for their viewers. These reports validate and bring a deeper sense of impact for that work. Most importantly, this analytics system provides opportunities for increased organizational learning and decision making. 

From our interactions with Hope Channel and Brightcove, to the technical challenges overcome and the final work product, this has been by far the most rewarding project taken on by Noise to Signal to date. 

Many of the concepts presented in this case study are applicable to areas outside of media broadcasting. Do you or someone you know need a custom-built analytics system that stitches together data from multiple sources? If so, reach out!

Introducing Differential Scroll Tracking with GTM

One of the benefits of being a freelance analyst is that I have access to dozens of different client instances of Google Analytics and Google Tag Manager. One common implementation I find is scroll tracking. Whether through a custom plug-in or GTM’s out-of-the-box tracking, clients often implement events that look like this:

Event Category: Scroll
Event Action: {{Scroll Depth Threshold}}

From there, you can then add up the number of scroll events, divide them by the number of visits to the page, compare that number to other pages, and call it a day. You’ve successfully measured engagement across multiple pages.

Or have you?

A couple of things have bothered me about this approach:

  1. Pages have different sizes which means that a 75% scroll event may be very unimpressive on a smaller page and very impressive on a larger page.
  2. Users have different viewport sizes which means that a 75% event can trigger at different points on the page for different users.
  3. Users with larger viewports will, by definition, trigger more scroll events on page load than users with smaller viewports.

I don’t have any silver bullets yet for #1 or #2, but I came up with a fairly elegant solution to #3 that I’m calling Differential Scroll Tracking. The idea is simple: we only track scroll events beyond what the user initially views on page load.

[Differential Scroll Tracking] = [Total Scroll Events] – [Scroll Events Fired on Page Load]

This is accomplished through a simple JavaScript variable in GTM that calculates the percentage of the document that the user can see on page load. We then tell GTM to only fire scroll events for values above that number.

How about some examples? Here we see three users load the same responsive page, and assume the organization wants to track scroll depth in increments of 25%.

User A Valid Thresholds: 50%, 75%, 100%

User B Valid Thresholds: 75%, 100%

User C Valid Thresholds: 25%, 50%, 75%, 100%

We removed the possibility of a 25% event for Users A and B but allowed it for C. Additionally, User B will not trigger a 50% scroll event. Differential Scroll Tracking collects data on genuine user engagement, whereas out-of-the-box tracking is often simply a proxy for the size of the user’s device.
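The valid thresholds above can be reproduced with the filter the GTM variable computes: keep only the 25% increments strictly beyond the fraction of the page visible on load. The viewport fractions below are illustrative stand-ins for Users A, B, and C.

```javascript
// Sketch of the Differential Scroll Tracking filter: keep only the 25%
// thresholds strictly beyond the fraction of the page visible on load.
function validThresholds(initialViewportPct) {
  var thresholds = [];
  for (var i = 25; i <= 100; i += 25) {
    if (i / 100.0 > initialViewportPct) {
      thresholds.push(i);
    }
  }
  return thresholds;
}

// Illustrative viewport fractions for the three users above
validThresholds(0.40); // User A → [50, 75, 100]
validThresholds(0.60); // User B → [75, 100]
validThresholds(0.20); // User C → [25, 50, 75, 100]
```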

Here’s how you can configure this in GTM:

1. Create a new Custom JavaScript variable called “JS – Thresholds Greater than Initial Viewport” and input the following JavaScript:

  function() {
    var thresholds = "";
    // Calculate the % of the document the user can see on page load
    var initialViewportPct = document.documentElement.clientHeight / document.documentElement.scrollHeight;
    for (var i = 25; i <= 100; i += 25) {
      // Append a threshold only if it lies beyond the user's initial view
      if (i / 100.0 > initialViewportPct) {
        thresholds = thresholds + i + ",";
      }
    }
    // Remove the trailing comma
    return thresholds.replace(/,\s*$/, ""); // Example output: "50,75,100"
  }

2. Create a new Scroll Depth trigger configured as follows. Note that we use our GTM variable from above in the “Percentages” field.

3. Create your GA scroll tracking tag as normal. Be sure to associate it with your new trigger.

I haven’t tried this in the wild. Let me know if you do, as I’d love to hear any comments or feedback.

How to Add GA Segments to Google Data Studio Reports

**Update: Google Data Studio now includes native support for GA Segments. The post below may still be relevant if you are looking to combine data from multiple sources into a single Data Studio report.**

Ever since Google released Data Studio in mid-2016, I’ve received a lot of interest from clients who find its data visualization and data sharing capabilities much easier to grasp than the standard Google Analytics reports. However, anyone who has put together a Data Studio report has noticed that its simplicity is both its strength and weakness. You can easily create visually compelling reports in minutes, but it lacks the sophistication of more feature-rich tools such as Tableau. One missing feature that I’ve seen users complain about is its lack of support for GA Segments. Fortunately, with the Google Sheets connector and Google Analytics add-on for Sheets we’re able to work around this limitation. Note that this same process works (and is slightly easier) with Supermetrics, but I’ll demonstrate my solution with the GA add-on for Sheets because it’s free.

Read More

Using Google Analytics to Predict Clicks and Speed Up Your Website

Google Analytics holds a trove of information regarding the path that each user takes on your website. It’s not a leap, then, to imagine using past user behavior to predict the path that a current user will take on your website. What if we could use these predictions to download and render assets before the user requests them? Thanks to the HTML5 prerender command, we can! In this post I’ll discuss how creative applications of Google Analytics, R, Google Tag Manager, and the HTML5 prerender hint were used to create a snappier browsing experience for users of
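As a rough sketch of the mechanism, here is the kind of prerender hint such a system would inject; the URL is a placeholder, and in practice a GTM Custom HTML tag would append the element to the page’s head.

```javascript
// Sketch: build the HTML5 prerender hint for a predicted next page.
// The URL below is a placeholder.
function buildPrerenderHint(url) {
  return '<link rel="prerender" href="' + url + '">';
}

buildPrerenderHint('/pricing'); // → '<link rel="prerender" href="/pricing">'
```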

Read More

Google Analytics Autotrack.js Updated – See What’s New

On August 2nd, Google announced the release of an updated and much improved autotrack.js plug-in that solves many common challenges that people face when implementing Google Analytics. One major change is that the autotrack library is broken out into 9 different discrete plug-ins that can be included in your solution independently of one another through the “Require” command. While there is thorough documentation from Google, I couldn’t find a nice concise description of each plug-in so I’ve provided that here.

Read More

Digital Analytics Meetup on 8/24 @ Northeastern

Last year, Todd Belcher and I started the Boston Digital Analytics meetup in order to bring together our peers in the marketing analytics industry for networking and knowledge sharing. This month, we’re hosting our 5th Web Analytics Wednesday on 8/24 at Northeastern where Sharon Bernstein will be presenting on the topic of Data Storytelling. If you’re in the Boston area, come out and meet some local analytics enthusiasts!

RSVP here –

Knowing is at Least, if not More Than, 40% of the Battle

When I send out weekly performance summaries to my clients, I often focus on just a few key take-aways and insights. For instance:

Campaign A is providing leads at $5/lead while Campaign B is converting at $15/lead. I’ve shifted most of the budget from Campaign B to Campaign A, but started an A/B test on Campaign B’s landing page to see if its performance can be improved.

These reports focus on what happened and what is about to happen. What’s missing in these emails, and discussions around measurement in general, is what didn’t happen.  In other words, what mistakes did we avoid because we had data pointing us in another direction? Read More



Bringing clarity to marketers in a noisy world © 2020