Incrementality & measurement in the era of digital marketing

Assessing the true impact of digital marketing campaigns

“Half the money I spend on advertising is wasted; the trouble is, I don’t know which half,” once said John Wanamaker, the famous marketing pioneer. More than a century later, the issue is the same in modern-day marketing: measuring the true impact of a campaign remains an idealistic goal, one that marketing experts seldom reach.

At Numberly, we believe that incrementality measurement, especially through experimentation, provides more reliable insights than the market standard (attribution measurement). However, incrementality is not perfect and has its drawbacks: the complexity of the experimentation process and the access to the required data can make it difficult to conduct. That is why we take pride in the expertise we have developed in data collaboration, incrementality measurement methodology, and bias analysis to provide our customers with the best possible insights on their marketing campaigns.

So, what is your magical method anyway?

The concept is pretty simple: when using attribution models, you introduce some uncertainty into what you measure. If someone makes a purchase right after being exposed to a campaign, that purchase is counted as a result of the campaign, so you never know how many of your exposed customers would have made a purchase anyway.
The generic solution is to define a group of customers, called the “Test group” or “Treatment group”, that you target with your marketing campaign. Then you compare their turnover to the turnover that would have been generated without your campaign. The difference is logically the uplift generated by your campaign on your turnover.

Of course, “the turnover that would have been generated without your campaign” is the tricky part. Ideally, we want a group of customers similar to the test group, which we will call the “Control” or “Holdout” group, and measure their turnover over the same period as the test group, without exposing them to our campaign. This gives us the baseline to compare against. But we will see that creating this group is more complicated than it seems.
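To make the picture concrete, here is a minimal sketch of the uplift computation once such a control group exists; the revenue figures and group size below are purely illustrative.

```python
# Minimal sketch of the uplift computation, with purely illustrative numbers.
test_revenue_per_person = 12.0      # average turnover of the targeted (test) group
control_revenue_per_person = 10.0   # average turnover of the comparable control group
test_group_size = 100_000

# The control group estimates what the test group would have generated without the campaign.
uplift_per_person = test_revenue_per_person - control_revenue_per_person
incremental_revenue = uplift_per_person * test_group_size
relative_uplift = uplift_per_person / control_revenue_per_person

print(f"Incremental revenue: ${incremental_revenue:,.0f} ({relative_uplift:.0%} uplift)")
```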

1. The biases - why we do incrementality

The “naive” solution one might want to implement is to compare the turnover generated by people exposed to the campaign to the turnover generated by unexposed people.

In this example, we see that the group of people who saw our ad had an 83% conversion rate (striped people are those who converted), whereas the unexposed group had a 25% conversion rate.


First problem: sometimes you want to target a customer but fail to reach them (an error in the email address or phone number, a cookie you can’t find through programmatic campaigns, people who didn’t log into their social media accounts during the period of the campaign, etc.).

Second problem: sometimes, you specifically target customers who have a higher conversion rate. In this case, there is no way of distinguishing the marketing effect from the targeting effect.

Example: I am promoting a new video game, and I know that my typical customer is under 30. I launch a social media campaign across TikTok, Instagram, and Meta.

I want to know if my campaign was useful or not.

  • On the audience “18 to 30”, I spent 80% of my budget and generated $10/person.
  • On the audience “30 to 50”, I spent 15% of my budget and generated $8/person.
  • On the audience “50+”, I spent 5% of my budget and generated $2/person.

I did that because, as an experienced marketing professional, I know my product is more suited to a very young audience, and I spent my budget according to each target’s perceived potential.

Let’s assume that our campaign was completely ineffective. This means that exposed and unexposed customers behaved the same: on a given audience, my “natural” turnover would be almost the same as what I measured on exposed customers.

In this situation, if we just compare total exposed customers versus total unexposed customers, we will see a much higher turnover on the exposed customers. This is because we mainly exposed the 18-30 segment, and this segment is inherently better for our product:

Let’s do the math:

  • Exposed customers: 80%*$10 + 15%*$8 + 5%*$2 = $9.3/person on average.

Assuming all three audiences have the same size:

  • Unexposed customers: ($10 + $8 + $2)/3 ≈ $6.7/person on average.

If we compare exposed and unexposed customers within a given audience, say “18 to 30”, we don’t see any difference (average turnover of $10/person) because our campaign had no impact. We successfully targeted the best people overall, but we had no incremental impact on them. Comparing exposed and unexposed customers is therefore biased and wrong.
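Here is the same arithmetic as a small sketch, assuming the three audiences have the same size and the campaign has no effect (exposed and unexposed people behave identically within an audience):

```python
# Turnover per person is identical for exposed and unexposed people within each audience,
# because the campaign has no effect in this scenario.
audiences = {
    "18-30": {"revenue_per_person": 10.0, "budget_share": 0.80},
    "30-50": {"revenue_per_person": 8.0,  "budget_share": 0.15},
    "50+":   {"revenue_per_person": 2.0,  "budget_share": 0.05},
}

# Exposed people are distributed according to the budget spent on each audience.
exposed_avg = sum(a["revenue_per_person"] * a["budget_share"] for a in audiences.values())

# Unexposed people are spread evenly across the three audiences (same-size assumption).
unexposed_avg = sum(a["revenue_per_person"] for a in audiences.values()) / len(audiences)

print(f"Exposed:   ${exposed_avg:.2f}/person")    # $9.30
print(f"Unexposed: ${unexposed_avg:.2f}/person")  # $6.67
# The gap is entirely a targeting effect, not a marketing effect.
```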

To avoid that, you need to create a “control” group that is similar to the test group and use it as your baseline. This makes your experiment look like this:

The “slightly less naive but still too naive” approach would compare the people you exposed to the control group. In the previous example, this would lead you to believe that your campaign was a success (5 conversions out of 6 is better than 7 out of 14).
However, if you look at your control group in more detail, here is what you realize:

This campaign was a failure (the treatment and control groups are absolutely identical), but the exposure bias can lead many people to believe it was a success.

Having a similar group is a complex task that mixes marketing & mathematical expertise. One solution that works in most cases is to divide your customers into two groups randomly. The more customers you have, the more likely you are to obtain two similar groups thanks to randomness.

2. The methodology

a. Creation of the populations

From what we’ve learned by now, we know that the only option left is to compare two identical populations. One solution would be to make sure that you can isolate such populations by testing for equal distributions of gender, age, socio-economic status, CLV, purchase behavior, etc. This is a viable option in some cases, but it has some limitations:

  • You don’t always know what the determining factors in a population are: do we want to correct our population for gender, age, visits to the site, purchases, geography, all of the above? What if there is a criterion you don’t know about, such as a specific interest, that is decisive when comparing populations?
  • You get less precise results than with inherently similar populations.

This methodology will be detailed in part 4, as it allows us to measure incrementality after a campaign has already been run, which is a frequent business case for our clients.

This is why the best method, when possible, is to create your test and control groups prior to your campaign and assign every individual to one group at random. The law of large numbers guarantees that, with enough customers in your audience (a threshold almost always reached in real marketing use cases), you will almost always get two statistically equivalent sets.
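A minimal sketch of such a random assignment, here done reproducibly with a fixed seed (the 90/10 split and the column name are illustrative choices, not a recommendation):

```python
import numpy as np
import pandas as pd

def split_test_control(customers: pd.DataFrame, test_share: float = 0.9, seed: int = 42):
    """Randomly assign each customer to the test group or the control (holdout) group."""
    rng = np.random.default_rng(seed)
    in_test = rng.random(len(customers)) < test_share
    return customers[in_test], customers[~in_test]

# Illustrative usage on a fake customer base.
customers = pd.DataFrame({"customer_id": range(1_000_000)})
test_group, control_group = split_test_control(customers, test_share=0.9)
print(len(test_group), len(control_group))
```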

b. Compare what’s comparable

Whatever method you use to create your two populations, you need to check that they had similar purchase behaviors before your campaign. This verification confirms that the difference you observe after the campaign is purely due to the incremental effect of your marketing.
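One simple way to run that check is a two-sample test on pre-campaign revenue; the sketch below uses Welch’s t-test and an assumed `pre_campaign_revenue` column, with simulated data for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

def check_pre_campaign_balance(test, control, column="pre_campaign_revenue", alpha=0.05):
    """Compare pre-campaign behavior of the two groups with Welch's t-test."""
    _, p_value = stats.ttest_ind(test[column], control[column], equal_var=False)
    print(f"test mean = {test[column].mean():.2f}, "
          f"control mean = {control[column].mean():.2f}, p-value = {p_value:.3f}")
    if p_value < alpha:
        print("Warning: the groups already differed before the campaign.")

# Illustrative usage on simulated pre-campaign revenue.
rng = np.random.default_rng(0)
test_group = pd.DataFrame({"pre_campaign_revenue": rng.gamma(2.0, 10.0, 90_000)})
control_group = pd.DataFrame({"pre_campaign_revenue": rng.gamma(2.0, 10.0, 10_000)})
check_pre_campaign_balance(test_group, control_group)
```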

c. Take into account the dilution

Now that we have set up a proper experiment, run our campaign, and collected our results, it’s time to dive into statistics and make sure we understand which effect is measured. For that, we must introduce some useful terminology from the medical field.

In pharmaceutical studies, biases can interfere with measuring the true effect of a given drug. The reader is probably familiar with the placebo effect, for example: taking a drug, even an ineffective one, can have a significant psychological (and then physical) effect. Another less famous phenomenon is the possibility for the patient to refuse medication, or to pretend to follow a treatment while actually not taking the drugs, whether because of some pathology or by the patient’s own choice. While the placebo effect is not that relevant to our marketing study, the latter phenomenon is crucial: just as people can refuse to take a pill even though they received the true treatment, people can miss exposure to an ad even though they are in the targeted test group (for example, if the person is on holiday during the campaign and no impression can be delivered).
We then distinguish two measured effects:

  • Intent-to-treat (ITT): This represents how the prescription of the drug affects the health of a patient, without knowing if they will actually take it or not.
  • Local Average Treatment Effect (LATE): This captures the actual effect of taking the drug on the health of a patient.

In our case, the ITT corresponds to our targeting effect: what is the incremental effect of targeting a population? The LATE is the exposure effect: what is the incremental effect on those who actually saw our ad?
The ITT is the most global and least detailed indicator: it combines the LATE and the probability of exposure in one number.
It is easy to measure: you simply compare your test and control groups. The LATE is trickier: you cannot compare exposed to unexposed customers because, as we saw earlier, your exposable customers (those who are easy to reach, such as opt-in customers for CRM programs or digital natives for media audiences) are inherently better than the rest.

For example, let’s take a mailing campaign that generates a 10% increase in sales.
You know that people who always open emails from your brand are your engaged customers. Let’s assume they generate 50% more sales in general than your regular customers.
Those customers will generate (1 + 10%) × (1 + 50%) − 1 = 65% more sales than the others. But you will only be able to measure that 65% (the cumulative effect of the ad and of the bias on engaged customers), whereas what you want to measure is the 10% increase in sales.

This is why we take into account the exposure rate, pollution (customers in the control group who were exposed unintentionally), technical limitations, and several parameters of the campaign (total number and frequency of exposures, visibility, duration) to compute the actual estimated effect of the campaign, our Local Average Treatment Effect.
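As an illustration of the dilution correction, here is a sketch of a standard Wald-style adjustment (scaling the intent-to-treat effect by the gap in exposure rates between the two groups); this is a textbook simplification, not the full set of corrections described above:

```python
def late_estimate(test_revenue_per_person, control_revenue_per_person,
                  exposure_rate_test, exposure_rate_control=0.0):
    """Wald-style estimate: scale the intent-to-treat effect by the exposure gap.

    exposure_rate_control accounts for "pollution" (control customers exposed by mistake).
    """
    itt = test_revenue_per_person - control_revenue_per_person
    exposure_gap = exposure_rate_test - exposure_rate_control
    return itt / exposure_gap

# Illustrative numbers: targeting lifted revenue by $0.60/person overall,
# but only 60% of the test group actually saw the ad (and 5% of the control did).
late = late_estimate(10.6, 10.0, exposure_rate_test=0.60, exposure_rate_control=0.05)
print(f"Estimated effect on exposed customers: ${late:.2f}/person")  # ≈ $1.09/person
```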

3. Get a grip on significance

When people speak about incrementality, you often hear phrases like “my results are 70% significant”. Have you ever wondered what that means?

Let’s dive a bit into the philosophy of why we do statistics.
When we measure something, anything, in the real world, chances are what we measure is not what will happen again in the same conditions in the future.
Imagine you are a famous zoologist, and you are filling up your encyclopedia about ducks.
Starting with the Pekin duck: you get one and weigh it; it is 4.2 kg. Now, how confident would you be writing down “the average Pekin duck weighs 4.2 kg”? Maybe yours is especially heavy, having been spoiled living in your garden. Or maybe you found it in the wild, somewhere with scarce food, and it is significantly lighter than it should be.

You decide you don’t have enough data to conclude, so you get two more ducks.

  • Scenario 1: The second duck is 4kg, and the third one is 4.3kg. You are significantly more confident in your measure.
  • Scenario 2: The second duck weighs 2kg, the third one is 3.4kg. Not only are you doubting that your first duck is representative of its species, but you have no idea what number to write in your encyclopedia!

In statistics, significance is a way to measure precisely, given a dataset, how confident you are in your numbers, meaning how certain you are this is a real effect and you have not just stumbled upon a group of customers (or ducks) behaving abnormally.

The idea behind this is the following: we make the hypothesis that our campaign was a failure and had no effect. Then we draw the classical (normal) probability distribution for our campaign, based on the standard deviation of our main performance KPI. It looks like this:

μ represents the average value you would measure, in our case the true incremental effect of the campaign (zero under our “no effect” hypothesis).
σ is the standard deviation, a statistical indicator that can be computed on any given dataset.

This shows the possible observations depending on your actual effect.

For example, on this graphic (with a calculated standard deviation of σ around the value μ): if your campaign had no effect, you have a 68% chance of measuring an effect between μ-σ and μ+σ, which means you have a 32% chance of measuring an effect outside that range. This is because, in real life, the measured effect always deviates from the theoretical effect, depending on the characteristics of your population and on randomness.

In incrementality, we try to see if our campaign had a positive effect:

  • Let’s take our statement:
    • “There is a 32% chance of measuring an effect outside [μ-σ, μ+σ] if the campaign had no effect”.
  • This can be rewritten as:
    • “If I measure an effect larger than μ+σ, there was less than a 32% chance of observing such a result if my campaign had no effect”.
    • Or, put more loosely: “I am 68% sure that my campaign had a positive impact, since I measured an impact of μ+σ”.

In the same way, in this example, a measured effect of μ+2σ shows a 95% confidence that your campaign had an effect. This is called significance.

Warning: from the previous explanation, you understand that significance is the confidence that your campaign had an incremental effect at all. It gives you no information on the size of that effect.

Saying you measured an uplift of 25% with 95% significance, or an uplift of 80% with 95% significance, gives you the same level of certainty about whether the campaign had an effect; the significance alone does not pin down its size. For more details, ask for a confidence interval on your result (this takes the form of “I am 90% sure that my campaign has an uplift between 12% and 34%”).
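For readers who want to see the mechanics, here is a minimal sketch of how significance and a confidence interval could be computed for a conversion-rate uplift with a two-proportion z-test; the figures are made up, and real campaign measurements involve more corrections than this:

```python
import numpy as np
from scipy import stats

def uplift_significance(conv_test, n_test, conv_control, n_control, confidence=0.90):
    """Two-proportion z-test on conversion rates, plus a confidence interval on the uplift."""
    p_t, p_c = conv_test / n_test, conv_control / n_control
    uplift = p_t - p_c

    # Standard error of the difference between the two observed conversion rates.
    se = np.sqrt(p_t * (1 - p_t) / n_test + p_c * (1 - p_c) / n_control)
    significance = stats.norm.cdf(uplift / se)  # one-sided confidence that the effect is positive
    margin = stats.norm.ppf(0.5 + confidence / 2) * se
    return uplift, significance, (uplift - margin, uplift + margin)

uplift, sig, ci = uplift_significance(conv_test=5_300, n_test=100_000,
                                      conv_control=5_000, n_control=100_000)
print(f"uplift = {uplift:.2%}, significance = {sig:.1%}, 90% CI = ({ci[0]:.2%}, {ci[1]:.2%})")
```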

4. The other ways

There are several techniques to measure incrementality without a proper experimental setup.

The Difference in differences

One method consists in predicting the behavior of the exposed group during the campaign period with regression models trained on historical data.
If the prediction is accurate, the difference between the model’s prediction and the actuals should be the effect of your campaign. All the complexity of this method lies in testing the precision of the predictions.
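A minimal sketch of that idea, using a simple linear trend fitted on simulated pre-campaign data (a real implementation would use richer features and would validate the prediction error on a period without any campaign):

```python
import numpy as np

# Simulated daily revenue: 100 pre-campaign days with a slight trend, then a 20-day campaign.
rng = np.random.default_rng(1)
days = np.arange(120)
revenue = 1_000 + 2.0 * days + rng.normal(0, 20, 120)
revenue[100:] += 80  # simulated campaign effect on the last 20 days

pre_days, campaign_days = days[:100], days[100:]

# Fit a simple trend on the pre-campaign period, then project it over the campaign period.
slope, intercept = np.polyfit(pre_days, revenue[:100], deg=1)
counterfactual = intercept + slope * campaign_days

# The gap between actuals and the projection is the estimated incremental revenue.
incremental = revenue[100:] - counterfactual
print(f"Estimated incremental revenue per day: {incremental.mean():.0f}")  # ≈ 80
```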

The geo-incrementality

When it is too complicated to gather precise enough data to compare our test and control groups, a solution is to compare postal codes that are statistically similar, and to compare turnover only at the level of a town or larger, without having to attribute each specific conversion to one group or the other.
This is quite difficult in France, because Parisian behavior is really different from the rest of the country, but the US provides an excellent setting.
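A rough sketch of that logic: pair geographic areas on their pre-campaign turnover, expose one area of each pair, and compare turnover at the area level during the campaign (the postal codes and figures below are simulated):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Simulated turnover per postal code, before and during the campaign.
geos = pd.DataFrame({
    "postal_code": [f"geo_{i}" for i in range(200)],
    "pre_turnover": rng.gamma(5, 10_000, 200),
})
geos["campaign_turnover"] = geos["pre_turnover"] * rng.normal(1.0, 0.02, 200)

# Pair statistically similar areas: sort by pre-campaign turnover, then alternate test/control.
geos = geos.sort_values("pre_turnover").reset_index(drop=True)
geos["group"] = np.where(geos.index % 2 == 0, "test", "control")
geos.loc[geos["group"] == "test", "campaign_turnover"] *= 1.03  # simulated 3% geo-level uplift

sums = geos.groupby("group")["campaign_turnover"].sum()
print(f"Measured geo uplift: {sums['test'] / sums['control'] - 1:.1%}")
```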

The “twin group”

Another method consists in building a synthetic “twin” group out of the unexposed people that is statistically similar to the exposed group. This group can serve as a good control group to measure our marketing effects:

This method is less precise than experimental ones, and it requires enough data on the customers to build a relevant twin group. However, it has the advantage of requiring no setup and no holdout group (which means no impact on the campaign’s delivery).
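One possible way to build such a twin group is nearest-neighbor matching on behavioral features, sketched below with two made-up variables (a real construction would use many more variables and standardize them before matching):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)

# Simulated features: exposed customers skew younger and more active than unexposed ones.
exposed = pd.DataFrame({"age": rng.normal(28, 5, 5_000),
                        "visits_last_month": rng.poisson(6, 5_000)})
unexposed = pd.DataFrame({"age": rng.normal(40, 12, 50_000),
                          "visits_last_month": rng.poisson(3, 50_000)})

# For each exposed customer, pick the most similar unexposed customer (matching with replacement).
# A real implementation would standardize the features so that none of them dominates the distance.
features = ["age", "visits_last_month"]
matcher = NearestNeighbors(n_neighbors=1).fit(unexposed[features])
_, idx = matcher.kneighbors(exposed[features])
twin_group = unexposed.iloc[idx.ravel()]

print(exposed[features].mean())     # the twin group's averages should now be close
print(twin_group[features].mean())  # to those of the exposed group
```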

5. Conclusion

Measurement in marketing is a much more complicated process than it looks, requiring a deep understanding of statistics, marketing biases and insights, experience from comparable projects, and technical limitations.

But measuring our incremental effect is only the first step towards building an action plan. The work of the analyst begins after obtaining that first measure: which subpopulations are the most interesting? What is our seasonal effect (meaning: would our campaign have the same impact in summer as in winter)? What insights can I get from this measurement to better understand my customers and how to reach them? How can I optimize my campaign parameters, or combine different levers such as media and email to create the best possible synergy?