Real-Time Bidding (RTB) seen through the lens of machine learning research

Machine learning researchers have been looking at the issue of Real-time bidding, find out the main ideas from their work.

Why has the issue of bid selection in RTB attracted the attention of machine learning researchers?

Online advertising has been a central motivation for machine learning research for a decade. It has grown considerably since the 2010s (since a boom in 2014, global RTB spending has been steadily increasing) and gives rise to data collection that can be used by advertisers and publishers alike. Particularly, Search Engine Advertising (SEA) and RTB advertising are operated through different forms of bidding: for the former, keyword auctions, and for the latter, first or second price auctions historically. While game theorists have long been interested in the various bidding mechanisms, the choice of RTB or SEA bids poses a new kind of problem. Indeed, auctions are now repeated, and each player can learn from historical data to adjust its strategy.

From the advertisers’ point of view, this question of bidding choice is all the more crucial today in a context where the improvement of the bidding process (still mostly manual) could compensate for targeting efficiency losses linked to cookieless issues.

How is the issue of bid selection in RTB modeled?

In a simplified way, an RTB auction goes like this:

  • An advertising space is auctioned by a publisher
  • Advertisers place their bids
  • If the advertiser’s bid price is above the reserve price set by the publisher, the advertiser with the highest bid wins the right to display the banner of their choice.
  • He pays a price that depends on the auction model in place:

– The proposed price or a reserve price set by the publisher, if the proposed price is too low for a first price auction (the majority today),

– The second highest price or a reserve price set by the publisher, if the second highest price is too low for a second price auction.

  • It observes the user’s reaction to the banner (a click, a purchase, or none of the above) and derives some value from it.

From the advertiser’s side, the main focus is to learn how to propose an optimal bid.

The feedback linked to RTB is unique: when an advertiser loses a bid, he learns nothing about the value he would have earned from the site (similarly, if the reserve price is too high, the publisher learns nothing about the distribution of the prices offered by the advertisers).
Thus, once the auction is over, the player does not necessarily know the reward that would have been given by offering a different bid (respectively, a different reserve price, in the case of the publisher). This feedback model is reminiscent of the one-armed bandit model (see here for a gentle introduction), where one observes only the reward for the chosen action.
The bandit model has therefore often been associated with the optimal bid choice problem.

Two kinds of assumptions can be made about the value of the location and the bids proposed by the other advertisers. The simplest assumption is that these two quantities are random variables drawn from the same law independently: this is the “iid” assumption (Online learning for Repeated Auctions, Real-time bidding with side information, Efficient Algorithms for Stochastic Repeated Second-price Auctions). In particular, it implies that the way the main advertiser plays has no influence on the environment. We can also assume that these variables can be arbitrary: this is the adversarial hypothesis (Online learning for Repeated Auctions, Learning to Bid without Knowing your Value). This last hypothesis leads to much more defensive strategies.
On the other hand, competing bids can be assumed to be observed: for every move, only in some cases, or never.

A more elaborate model considers that the advertiser also has context data (on the user or the location) on which the reward depends in a linear way (real-time bidding with side information).

Why not use any classic K-arm bandit algorithm?

The simplest way to solve the bid selection problem is to consider that we have a choice between a finite number of bid values and to ignore the feedback structure specific to RTB. We can then use any K-arm bandit algorithm. The limitations of this approach are the following:

  • In reality the number of possible bids is immense, which implies that the exploration phase would last too long
  • This method deprives oneself of a large part of the information. When the auction is won, we know that any higher bid would also have won, and we know the value it would have brought. We therefore know the reward associated with any higher bid. In the same way, when one has lost, one knows the reward associated with any lower bid.

Why did most of the work initially focus on the second price?

Second-price bidding was predominant in RTB until about 2018.
They have gradually been replaced by first-price auctions, with the introduction of header bidding. The historical predominance of second price auctions only partly explains the concentration of searches on this type of auctions. Indeed, the study of the second price bids is much simpler, because these bids prompt to propose a bid equal to the value of the investment (we say that they are truthful). Competitors’ bids then have little influence on the way rational advertisers bid, and learning the value of the location is enough to get you closer to the optimal bid. On the contrary, in first-price auctions, the optimal bid depends on the bid of others as much as on the value of the location.

What are the main ideas used in this work?

The central issue is to find a way to balance exploration, which is materialized by high bids allowing to learn the value, and exploitation, which is materialized by bids close to the estimated optimal bid (i.e. the estimated value in the case of the second price).

In the iid models, we use algorithms close to UCB or linUCB in the contextual case (the most classical strategies for K-armed bandits and linear bandits) which propose as bid upper bounds of confidence of the optimal bid.

In adversarial modeling (Online learning for Repeated Auctions, Learning to Bid without Knowing your Value), methods closer to Exp3 (for adversarial bandits) have been developed. The transition from a choice between a finite number of bids to a choice between continuous bids is then more complex.


The problem of bid selection has given rise to several different models, many of which are related to the bandit model. At Numberly, we looked at a stochastic “iid” model, in Efficient Algorithms for Stochastic Repeated Second-price Auctions and Fast Rate Learning in Stochastic First Price Bidding. These papers show, among other things, that it is possible to implement algorithms that take advantage of the auction structure to achieve much better performance than generic algorithms. The model chosen is certainly very simple, but in our opinion it is quite close to reality. However, some limitations have not been taken into account: the delay between the impression of a banner and the click or purchase, or the behavior of algorithms that do not require the update of the strategy at each auction for example.

1 Header bidding is a technique that allows several ad-exchanges to compete on selected inventories
2 UCB (for Upper Confidence Bound) is a classical algorithm for K-armed bandits first proposed by Auer and which consists in choosing the arm whose upper confidence bound on the reward is the highest. LinUCB (see here) is a generalization to the contextual case.