In simplified form, an RTB auction proceeds as follows:
- An advertising space is auctioned by a publisher
- Advertisers place their bids
- If at least one bid exceeds the reserve price set by the publisher, the advertiser with the highest bid wins the right to display the banner of their choice.
- The winner pays a price that depends on the auction model in place (sketched in code after this list):
– their own bid, for a first-price auction (the majority today),
– the second-highest bid, or the publisher’s reserve price if the second-highest bid falls below it, for a second-price auction.
- The advertiser then observes the user’s reaction to the banner (a click, a purchase, or neither) and derives some value from it.
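As a concrete illustration of these payment rules, here is a minimal sketch of winner determination and pricing under both models; the function name, the bidder names, and the numbers are all hypothetical.

```python
def run_auction(bids: dict[str, float], reserve: float, first_price: bool = True):
    """Return (winner, payment), or None if no bid clears the reserve."""
    # Only bids at or above the publisher's reserve price are eligible.
    eligible = {name: b for name, b in bids.items() if b >= reserve}
    if not eligible:
        return None  # the impression goes unsold
    winner = max(eligible, key=eligible.get)
    if first_price:
        payment = eligible[winner]  # first price: the winner pays their own bid
    else:
        others = sorted(eligible.values(), reverse=True)[1:]
        # Second price: the second-highest bid, floored at the reserve.
        payment = max(others[0], reserve) if others else reserve
    return winner, payment

# Second-price example: the runner-up's bid sets the price.
print(run_auction({"a": 2.0, "b": 1.5, "c": 0.8}, reserve=1.0, first_price=False))
# -> ('a', 1.5)
```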
On the advertiser’s side, the main focus is to learn how to bid optimally.
The feedback in RTB is unusual: when an advertiser loses an auction, they learn nothing about the value they would have earned from the placement (symmetrically, if the reserve price is set too high, the publisher learns nothing about the distribution of the prices offered by the advertisers).
Thus, once the auction is over, the player does not necessarily know the reward that a different bid (or, for the publisher, a different reserve price) would have yielded. This feedback structure is reminiscent of the multi-armed bandit model (see here for a gentle introduction), where one only observes the reward of the chosen action.
The bandit model has therefore often been applied to the problem of choosing an optimal bid.
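To make the connection concrete, here is a toy sketch (my own illustration, not an algorithm from the papers cited below): each candidate bid on a grid is treated as a bandit “arm”, and at each round the bidder observes the reward of the one bid it submitted, nothing else. The simulator assumes a second-price rule and i.i.d. draws of the value and the highest competing bid, an assumption discussed just below; all names and parameters are hypothetical.

```python
import random

def epsilon_greedy_bidder(bid_grid, get_reward, n_rounds=10_000, eps=0.1):
    """Treat each bid in `bid_grid` as an arm; only the chosen bid's reward
    is ever observed (bandit feedback)."""
    counts = [0] * len(bid_grid)
    means = [0.0] * len(bid_grid)
    for _ in range(n_rounds):
        if random.random() < eps:      # explore: try a random bid
            i = random.randrange(len(bid_grid))
        else:                          # exploit: best empirical bid so far
            i = max(range(len(bid_grid)), key=lambda j: means[j])
        r = get_reward(bid_grid[i])    # reward of the submitted bid only
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # running average update
    return bid_grid[max(range(len(bid_grid)), key=lambda j: means[j])]

# Hypothetical i.i.d. environment with a second-price payment rule: the value
# of the impression and the highest competing bid are redrawn every round,
# and a lost auction reveals nothing.
def get_reward(bid):
    value = random.uniform(0, 1)        # worth of the impression if won
    competitor = random.uniform(0, 1)   # highest competing bid (unobserved)
    return value - competitor if bid > competitor else 0.0

best_bid = epsilon_greedy_bidder([i / 10 for i in range(11)], get_reward)
```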
Two kinds of assumptions can be made about the value of the placement and the bids submitted by the other advertisers. The simplest is that both quantities are random variables drawn independently from a fixed distribution: this is the “i.i.d.” assumption (Online learning for Repeated Auctions, Real-time bidding with side information, Efficient Algorithms for Stochastic Repeated Second-price Auctions). In particular, it implies that the way the main advertiser plays has no influence on the environment. Alternatively, these variables can be arbitrary: this is the adversarial assumption (Online learning for Repeated Auctions, Learning to Bid without Knowing your Value). The adversarial assumption leads to much more defensive strategies.
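The distinction matters algorithmically: the ε-greedy sketch above implicitly relies on the i.i.d. assumption (empirical means only converge if the environment is stationary), whereas under the adversarial assumption the classical tool is a randomized algorithm in the spirit of Exp3. Below is a minimal loss-based Exp3 sketch over the same bid grid; it is my own illustration, and the step size `eta` and the assumption that rewards lie in [0, 1] are mine, not from the cited papers.

```python
import math
import random

def exp3_bidder(bid_grid, get_reward, n_rounds=10_000, eta=0.05):
    """Exp3 over a grid of candidate bids, using a loss-based,
    importance-weighted update so that only the played bid's reward is
    needed. Rewards are assumed to lie in [0, 1]."""
    k = len(bid_grid)
    weights = [1.0] * k
    for _ in range(n_rounds):
        total = sum(weights)
        probs = [w / total for w in weights]
        i = random.choices(range(k), weights=probs)[0]  # randomized bid
        r = get_reward(bid_grid[i])                     # bandit feedback only
        loss_hat = (1.0 - r) / probs[i]   # importance-weighted loss estimate
        weights[i] *= math.exp(-eta * loss_hat)
        m = max(weights)
        weights = [w / m for w in weights]  # renormalize to avoid underflow
    return bid_grid[max(range(k), key=lambda j: weights[j])]
```

The randomization is what makes the strategy “defensive”: against arbitrary sequences of values and competing bids, a deterministic bid schedule can be exploited, while a randomized one can still guarantee low regret.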
Orthogonally, different assumptions can be made about whether the competing bids are observed: after every auction, only in some cases (for instance, only when the auction is won), or never.
A more elaborate model assumes that the advertiser also has access to context data (about the user or the placement) on which the expected reward depends linearly (Real-time bidding with side information).
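Under that linear assumption, a natural strategy is LinUCB-style optimism: fit a ridge regression of observed rewards on features of the (bid, context) pair, and play the bid whose optimistic estimate is highest. The sketch below is my own illustration of this generic idea, not the algorithm of the cited paper; the feature map `phi`, the bid grid, and the parameters `alpha` and `lam` are all hypothetical.

```python
import numpy as np

def lin_ucb_bidder(phi, bid_grid, contexts, get_reward, alpha=1.0, lam=1.0):
    """LinUCB-style bidding: assumes the expected reward of bid b in context
    x is <theta, phi(b, x)> for an unknown vector theta."""
    d = phi(bid_grid[0], contexts[0]).shape[0]
    A = lam * np.eye(d)       # ridge-regularized Gram matrix
    b_vec = np.zeros(d)       # accumulated reward-weighted features
    for x in contexts:
        theta_hat = np.linalg.solve(A, b_vec)   # ridge estimate of theta
        A_inv = np.linalg.inv(A)

        def score(bid):
            f = phi(bid, x)
            # Predicted reward plus a confidence bonus (optimism).
            return f @ theta_hat + alpha * np.sqrt(f @ A_inv @ f)

        chosen = max(bid_grid, key=score)
        r = get_reward(chosen, x)               # bandit feedback, as before
        f = phi(chosen, x)
        A += np.outer(f, f)
        b_vec += r * f
    return np.linalg.solve(A, b_vec)            # final estimate of theta
```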