Noise to Signal


Viewing Google Analytics Segment Overlap in R

Google Analytics segments are a fantastic way to organize the results of an analysis. There are, however, a few limitations of using segments in GA:

  1. They cause reports to become sampled after 500,000 sessions (or 100M with GA360)
  2. Only 4 segments can be compared at one time
  3. Segments are saved under your Google account which makes sharing them a pain
  4. When comparing segments, it’s hard to tell how much they overlap

All of these limitations can be resolved by bringing your Google Analytics data into R with the googleAnalyticsR library, but this post will focus on #4 above: Understanding segment overlap. The code generating this blog post can be found here.

The Problem with Segment Overlap

Segments are fairly straightforward to create in GA, but they can trip users up in a number of ways. One common issue is failing to account for segment overlap. Why should you care whether your segments overlap? Because you’ll want to interpret your segment metrics entirely differently depending on the answer. Let me explain via a scenario I see fairly often.

Sally is a marketing director in charge of a major pet retailer’s website redesign. She worked with her branding agency to develop 3 different personas that they expect to find on their website: Cat Lovers, Dog Lovers, and Wholesale distributors. The UX of the website is tailored to these personas, and Sally is confronted with the question of how to report on website success. A natural decision is to frame the reporting KPIs around the personas developed earlier. She instructs her analytics team to create segments based on their personas.

Here’s where things start to break down. The analytics team is left to decide what behavior on the website indicates whether a user is one of those 3 personas. A very reasonable-seeming decision may be as follows:

  • Users who visit the /cats section are included in the ‘Cat lovers’ segment
  • Users who visit the /dogs section are included in the ‘Dog lovers’ segment
  • Users who log in and visit the /bulk-order section are included in the ‘Wholesalers’ segment.

A week after launch, the analytics team presents the following results:

  • Dog Lovers – 500 users, 5% conversion rate
  • Cat Lovers – 400 users, 4% conversion rate
  • Wholesalers – 200 users, 16% conversion rate

Amazing! Sally loves these numbers. The only problem is that they’re meaningless. What the analytics team failed to consider is that their wholesalers always browse the /cats or /dogs sections before making their bulk orders. This means that those 500 Dog Lovers and 400 Cat Lovers are polluted with 200 Wholesalers. Think about how the 16% conversion rate of the wholesalers might artificially inflate the conversion rates of the Dog and Cat Lovers segments.
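To put a rough number on that inflation, here is a minimal sketch in base R. The function name and the split of wholesalers are assumptions made purely for illustration: the story above doesn't say how many wholesalers landed in each segment, so the sketch assumes 100 of the 200 were counted as Dog Lovers.

```r
# Back out a segment's conversion rate once a known sub-group is removed.
# All inputs are illustrative; nothing here comes from real GA data.
rate_without_subgroup <- function(seg_users, seg_rate, sub_users, sub_rate) {
  seg_conversions <- seg_users * seg_rate
  sub_conversions <- sub_users * sub_rate
  (seg_conversions - sub_conversions) / (seg_users - sub_users)
}

# Reported: Dog Lovers = 500 users at 5%; Wholesalers convert at 16%.
# ASSUMPTION: 100 of the 200 wholesalers were counted under Dog Lovers.
dog_only_rate <- rate_without_subgroup(500, 0.05, 100, 0.16)
print(dog_only_rate)
```

Under that assumption, the visitors who are in the Dog Lovers segment but are not wholesalers convert at 2.25%, less than half of the 5% that was reported.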

The setup here is a bit contrived, but I’ve seen many flavors of it before. The original sin was attempting to convert UX personas into analytics segments. This encourages consumers of these reports to assume that the analytics segments are mutually exclusive when they are not. Analytics segments can only highlight behavior, not who the person is. Honestly naming segments, such as “Visited /cats Section”, is often the best way to emphasize this reality.

What does this have to do with overlap?

The problem above was that the report gave off the impression that segments were mutually exclusive when, in fact, they contained quite a bit of overlap. Without understanding the overlap, how can you interpret those numbers? Do we have 500+400+200=1100 users? Or do we have 200+(500-200)+(400-200)=700 users, as would be the case if the 200 wholesalers were represented in all segments? In a more extreme scenario, you may be looking at 3 segments which all report on the exact same set of users.
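The same arithmetic can be checked with a few lines of base R. The user IDs below are invented and arranged so that all 200 wholesalers appear in every segment, matching the second calculation above.

```r
# Hypothetical user IDs arranged so all 200 wholesalers sit in every segment.
wholesalers <- 1:200
dog_lovers  <- 1:500                 # 200 wholesalers + 300 others
cat_lovers  <- c(1:200, 501:700)     # 200 wholesalers + 200 others

# Adding segment sizes counts the overlapping users more than once
naive_total    <- length(dog_lovers) + length(cat_lovers) + length(wholesalers)
# Counting distinct IDs does not
distinct_total <- length(unique(c(dog_lovers, cat_lovers, wholesalers)))

print(c(naive = naive_total, distinct = distinct_total))
```

The naive sum is 1100 while the distinct count is 700. The gap between the two is a quick first signal of how much the segments overlap.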

As an example, how might you interpret those numbers above given each of these scenarios?

Scenario 1: Small, Even Overlap

Scenario 2: Large, Even Overlap

Scenario 3: Large, Uneven Overlap

Scenario one is likely what the stakeholders at our pet company assumed would be the case – some slight overlap exists, but the metrics sufficiently indicate the behaviors of ‘Dog’ and ‘Cat’ lovers individually.

However, scenario two might be the reality. Perhaps 90% of their users love to compare prices across cat/dog products and visit each section at least once.

Or perhaps scenario 3 is the reality. Maybe a coupon link brought users to start their journey under /dogs which left the cat owners to then move over to /cats.

Unfortunately, there’s no way in standard GA to tell which scenario is actually occurring (though the new app+web version includes this feature). That matters, because each scenario would lead our stakeholders to interpret the segment metrics very differently.

So let’s move on to solving this issue in R.

Pulling GA Segment Data into R

I don’t have access to a pet retailer’s website, but I’m happy to share metrics from my own blog. In this scenario, I’ll create 3 segments:

  • Users who visit /blog
  • Users who visit /portfolio
  • Users who visit the home page (denoted as “/”)

Admittedly, these segments aren’t very interesting, but they mirror a common method of building segments based on page visits that are not necessarily mutually exclusive. With the googleAnalyticsR library, we can create these GA segments on the fly and pull down the appropriate data from GA. Note: For this to work, you’ll need access to a user ID, which could be the GA client ID. There’s a great article here on capturing client IDs in GA using custom dimensions.

The code below shows how we can define our GA segments and pull the data.

library(googleAnalyticsR)

# Use a function to generate our segments because each of the 3 segments is defined very similarly
create_pagePath_segment <- function(pagePath, operator){
  # Match sessions whose pagePath satisfies the given expression
  se_visited_page <- segment_element("pagePath", operator = operator, type = "DIMENSION", expression = pagePath)
  sv_visited_page <- segment_vector_simple(list(list(se_visited_page)))
  sd_visited_page <- segment_define(list(sv_visited_page))
  segment_ga4(paste0("Visited Page: ",pagePath), session_segment = sd_visited_page)
}

# Generate our 3 segments
s_visited_page_a <- create_pagePath_segment(page_a,"REGEX")
s_visited_page_b <- create_pagePath_segment(page_b,"REGEX")
s_visited_page_c <- create_pagePath_segment(page_c,"EXACT")

#Pull data from GA
ga <- google_analytics(viewId=view_id, date_range = c(Sys.Date()-300,Sys.Date()-1),
                       metrics = "sessions", dimensions = c(paste0("dimension",client_id_index)),
                       max=-1, segments = list(s_visited_page_a,s_visited_page_b, s_visited_page_c))

Visualizing Segment Overlap

Our next task is to visualize the overlap as a Venn diagram. We’ll use the VennDiagram library in R to do so.

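library(dplyr)
library(RColorBrewer)
library(VennDiagram)

# Define names of segments from the segment column
segment_names <- unique(ga$segment)
# Create a list of client IDs for each segment
# (dimension2 is the client ID custom dimension here, i.e. client_id_index is assumed to be 2)
segments <- lapply(segment_names, function(x){ga %>% filter(segment == x) %>% select(dimension2) %>% pull()})
colors <- brewer.pal(length(segment_names), "Dark2")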

# Generate Venn diagram
diag <- venn.diagram(segments, 
             category.names = segment_names,
             width = 600,
             height= 600,
             resolution = 130,
             imagetype="png" ,
             filename = "ga_venn.png",
             cat.fontfamily = "sans",
             fontfamily = "sans",
             cat.col = colors,
             col = colors,
             fill = colors,
             cat.dist = c(.1,.1,.05),
             margin = c(.15,.15,.15))

# By default, the VennDiagram package outputs to disk, so we load the generated image here for display

While the plot above doesn’t scale the circles based on the size of the segment, it’s easy to interpret the overlap between the segments. Here we can see that 176 users visit the homepage and that a little less than 10% of those users went on to visit the blog AND the portfolio section (as denoted by the “16” in the middle).
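If you want the numbers behind each region of the diagram, they can be recomputed directly from the list of client IDs. The sketch below uses base R and a made-up `segments` list with hypothetical names. In the code above the list is unnamed, so you would need to name it first (for example with `names(segments) <- segment_names`).

```r
# Made-up client IDs per segment; in the post these come from the `ga` data frame.
segments <- list(
  home      = c("a", "b", "c", "d", "e"),
  blog      = c("c", "d", "e", "f"),
  portfolio = c("e", "g")
)

all_ids <- unique(unlist(segments))

# One row per client ID, one TRUE/FALSE column per segment
membership <- sapply(segments, function(ids) all_ids %in% ids)

# Label each client ID with the exact combination of segments it belongs to
region <- apply(membership, 1, function(row) {
  paste(names(segments)[row], collapse = " & ")
})

# Count client IDs per region of the Venn diagram
overlap <- table(region)
print(overlap)
```

Each entry of `overlap` corresponds to one region of the Venn diagram, which makes it easy to compute figures such as the share of homepage visitors who also reached both other sections.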

With that, I’ll leave you with a happy accident in exploring the capabilities of the VennDiagram R library. Something you can look forward to if you start using this on your own data: a Venn diagram with 5 segments!

Causal Impact + Google Analytics – Evaluating the Effect of COVID19 on Hospital Appointments

The CausalImpact R library measures the effects of an event on a response variable when establishing a traditional control group through a randomized trial is not a viable option. It does this by establishing a ‘synthetic control’ which serves as a baseline against which the actual data is compared.

In this tutorial, we’ll look at the effect that the Coronavirus outbreak had on the number of “Make an Appointment” forms completed on a hospital website. The code for this post can be found here. To begin, we must establish a “pre-period” before the event occurred and a “post-period” after the event occurred. The pre-period is used to train a Bayesian Structural Time Series model. In the post-period, the model is used to predict our synthetic control which indicates how the outcome may have performed were the event not to have occurred.

Our pre-period will be 10/1/2019 to 3/15/2020 and our post-period will be 3/16/2020 – 5/4/2020. Our predictor variables will be the number of sessions from organic, social, and referral sources. An important assumption made by the CausalImpact library is that our predictors are not affected by our event.

Gathering Data from Google Analytics

First, we must gather the data necessary for our analysis. Our response variable, as established earlier, will be “Make an Appointment” form completions, which is the goal1Completions metric in GA. Our predictor variables will come from the sessions metric, broken out by the channelGrouping dimension.

We know that the hospital suspended paid media around the time of the outbreak so we’ll remove traffic from paid sources using the following filter:

channel_filter <- dim_filter(dimension="channelGrouping",operator="REGEXP",expressions="Paid Search|Display",not = T)

We call the Google Analytics reporting API twice. Once to gather the goal completion data:

# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = "goal1Completions",
                       dimensions = c("date"),
                       dim_filters = my_filter_clause,
                       max = -1)

and once to gather the channel session data:

df_sessions <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = c("sessions"),
                       dimensions = c("date","channelGrouping"),
                       max = -1,
                       dim_filters = my_filter_clause)

Requesting the goal data separately means we don’t have to re-aggregate it after pivoting the session data. Pivoting turns our single channelGrouping column into one column of session counts per channel. The full code is shown below.

date_range <- c("2019-10-01","2020-05-04")

# Remove paid traffic
channel_filter <- dim_filter(dimension="channelGrouping",operator="REGEXP",expressions="Paid Search|Display",not = T)
my_filter_clause <- filter_clause_ga4(list(channel_filter))

# Gather goal data
df_goals <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = "goal1Completions",
                       dimensions = c("date"),
                       dim_filters = my_filter_clause,
                       max = -1)
# Gather session data
df_sessions <- google_analytics(viewId = view_id,
                       date_range = date_range,
                       metrics = c("sessions"),
                       dimensions = c("date","channelGrouping"),
                       max = -1,
                       dim_filters = my_filter_clause) %>% 
   pivot_wider(id_cols=date,names_from=channelGrouping,values_from=sessions)

# Merge the goal completion data into the sessions data
df <- df_sessions %>% mutate(y = df_goals$goal1Completions)
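To make the pivot step concrete, here is a small sketch on invented data. It assumes the tidyr package is installed; the channel names and numbers are placeholders rather than the hospital's figures.

```r
library(tidyr)

# Made-up long-format data shaped like the GA response: one row per date and channel
long <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02")),
  channelGrouping = c("Organic Search", "Social", "Organic Search", "Social"),
  sessions = c(120, 30, 110, 25)
)

# One row per date, one column per channel
wide <- pivot_wider(long, id_cols = date,
                    names_from = channelGrouping, values_from = sessions)

# Goal completions are already one row per date, so they attach as a single column
wide$y <- c(19, 17)
print(wide)
```

Note that attaching `y` by position, as the code above does with `mutate()`, only works when both data frames cover the same dates in the same order. A join on `date` would be the safer choice if that is ever in doubt.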

Create BSTS Model

The following code creates a Bayesian Structural Time Series model that will be used by the CausalImpact library to generate our synthetic control. It’s here that we input our pre-period and post-period as well as our predictor and response variables.

The BSTS package has several options for modifying our model. Here, we apply a “local level” component, which captures the high-level trend in the response variable. We also capture the 7-day weekly trend in our data using AddSeasonal().

df2 <- df # Create a copy of our DF so we can re-run after we remove the response data from the prediction period

# Assign pre and post periods
pre.period <- c(1,which(df$date == "2020-03-15"))
post.period <- c(which(df$date == "2020-03-15")+1,length(df$date))
post.period.response <- df$y[post.period[1] : post.period[2]]

# Remove outcomes from the post-period. The BSTS model should be ignorant of the values we intend to predict
df2$y[post.period[1] : post.period[2]] <- NA

# Create a zoo object which adds dates to plot output
df_zoo <- read.zoo(df2, format = "%Y-%m-%d") 

# Add local and seasonal trends
ss <- AddLocalLevel(list(), df_zoo$y)
ss <- AddSeasonal(ss, df_zoo$y, nseasons = 7) # weekly seasonal trend
# Fit on df_zoo, the zoo object created above
bsts.model <- bsts(y ~ ., ss, niter = 1000, data = df_zoo, family = "gaussian", ping=0)


The blue dots are the actual data points and the black line underneath is our estimated posterior distribution. We can see that the model does a reasonable job of predicting form completions, though there are some outliers in late February that are not well predicted. This will increase our uncertainty in our predictions and thus widen our confidence interval (the shading around the black line).

Generate Causal Impact Analysis

Now that we have our model, we can compare our prediction to what actually happened and measure the impact of the event.

impact <- CausalImpact(bsts.model = bsts.model,
                       post.period.response = post.period.response)


The top plot shows the actual data in black and our predicted distribution of the response variable in blue with the median value as a dashed blue line. The 2nd plot subtracts the predicted data from the actual data to show the difference between the two values. If the event had no impact, we would expect the pointwise estimates to hover around 0. The last plot shows the cumulative impact of the event over time. Notice how our confidence interval (shown in blue) widens as time goes on.
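The second and third panels are simple arithmetic on two series. The sketch below shows the idea in base R with invented point values; it leaves out the uncertainty bands, which CausalImpact derives from the posterior draws of the model.

```r
# Made-up daily values for a three-day post-period
actual    <- c(8, 9, 7)      # observed form completions
predicted <- c(18, 19, 20)   # synthetic control (what the model expected)

pointwise  <- actual - predicted   # second panel: actual minus predicted
cumulative <- cumsum(pointwise)    # third panel: running total of the differences

print(data.frame(actual, predicted, pointwise, cumulative))
```

Because the cumulative series adds up every daily difference, the uncertainty of each day accumulates as well, which is one way to see why the interval in the last panel keeps widening.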

Our causal impact model confirms a decrease in the number of form completions. However, the 95% confidence interval quickly includes 0, which means that we cannot say with certainty that the impact extends into April. While we weren’t able to find conclusive results, being able to measure our certainty is a major benefit of Bayesian models such as this one.

Validating Our Synthetic Control

One method of validating your model is to generate predictions for a period before the event occurred. If our model is well-behaved, we should see little difference between the predicted and actual response data.

# Filter to include only pre-event data. Also reorder columns to place y after the date
df_compare <- df %>% filter(date < "2020-02-15") %>% select(date,last_col(),2:length(df))

df_zoo <- read.zoo(df_compare, format = "%Y-%m-%d")

pre.period <- c(index(df_zoo)[1],index(df_zoo)[which(df_compare$date == "2020-01-15")])
post.period <- c(index(df_zoo)[which(df_compare$date == "2020-01-15")+1],index(df_zoo)[length(df_compare$date)])

impact <- CausalImpact(df_zoo, pre.period, post.period)


Above we see that the model doesn’t do a great job of predicting the upper spikes of the form completions which likely explains the wide confidence interval seen earlier.

Comparison to the Naive Approach

Deploying advanced modeling techniques is only useful if there are advantages over much simpler techniques. The naive method would be to use our pre-intervention data to establish an average and continue that average into the post-period to estimate a synthetic control.

Before the event, we had about 19 form fills a day. After, we had 8.5 a day. That’s a decrease of about 52%. CausalImpact estimated a decrease of 44% with a 95% confidence interval of 29%-63%. Had these numbers differed substantially, and assuming we had confidence in our model, we would prefer the figures generated by CausalImpact.
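For reference, the naive calculation amounts to a few lines of base R. The daily values below are invented for illustration and are not the hospital's data.

```r
# Made-up daily form completions; not the hospital's data
pre  <- c(20, 18, 22, 20)   # before the event
post <- c(10, 12, 8, 10)    # after the event

baseline        <- mean(pre)               # naive synthetic control: a flat line
absolute_change <- mean(post) - baseline   # average daily change
relative_change <- absolute_change / baseline

print(c(baseline = baseline, relative_change = relative_change))
```

With these placeholder numbers the baseline is 20 completions a day and the relative change is -0.5, a 50% drop. The calculation yields a single point estimate and says nothing about how certain it is.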

There are some clear cases when modeling will outperform the naive approach described above:

  • If there is a trend in the response variable, then averaging the pre-period will not capture the continuation of that trend.
  • If evaluating the degree of confidence is important, the CausalImpact model is preferable due to its ability to measure uncertainty.


Bringing clarity to marketers in a noisy world © 2020