Authors:
(1) Avrim Blum, Toyota Technological Institute at Chicago, IL, USA;
(2) Melissa Dutz, Toyota Technological Institute at Chicago, IL, USA.
2 Setting and 2.1 Models of behaviorally-biased opponents
4.1 Myopic Best Responder and 4.2 Gambler’s Fallacy Opponent
4.3 Win-Stay, Lose-Shift Opponent
4.4 Follow-the-Leader Opponent and 4.5 Highest Average Payoff Opponent
5 Generalizing
5.1 Other Behaviorally-Biased Strategies
5.2 Exploiting an Unknown Strategy from a Known Set of Strategies
A Appendix
A.1 Win-Stay Lose-Shift Variant: Tie-Stay
A.2 Follow-the-Leader Variant: Limited History
A.4 Highest Average Payoff Opponent
We consider two-player, repeated games with the following properties:
Symmetric
Zero-sum
Payoffs are limited to {1, 0, −1} (representing win, tie, and loss respectively)
Each action beats at least one other action
Each action loses to at least one other action
▶ Definition 1 (Permissible Game). We refer to a game with the above properties as a permissible game for convenience.
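For concreteness, these conditions can be checked mechanically. The sketch below is our own illustration, not code from the paper; the function name `is_permissible` and the matrix encoding are assumptions. It treats the game as a square matrix M where M[i][j] ∈ {1, 0, −1} is the payoff to action i when played against action j, so symmetry together with the zero-sum property amounts to M = −M⊤.

```python
def is_permissible(M):
    """Check whether a payoff matrix defines a permissible game.

    M[i][j] is the row player's payoff (1 = win, 0 = tie, -1 = loss) when
    action i is played against action j.
    """
    n = len(M)
    if any(len(row) != n for row in M):
        return False                                       # must be square
    for i in range(n):
        for j in range(n):
            if M[i][j] not in (1, 0, -1):                  # payoffs limited to {1, 0, -1}
                return False
            if M[i][j] != -M[j][i]:                        # symmetric zero-sum: M = -M^T
                return False
        if not any(M[i][j] == 1 for j in range(n)):        # each action beats some action
            return False
        if not any(M[i][j] == -1 for j in range(n)):       # each action loses to some action
            return False
    return True


# Example: rock-paper-scissors (rock = 0, paper = 1, scissors = 2) is permissible.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
assert is_permissible(RPS)
```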
We consider how well we can do without knowing the payoff matrix M or even observing payoffs when playing against a behaviorally-biased opponent who does know M. Our goal will be to win nearly every round.
We assume all opponents know the payoff matrix and play a deterministic, behaviorally-biased strategy. The following are the particular biased adversaries we consider here:
Myopic Best Responder
The opponent plays a best response to our last action. If there are multiple best responses to our last action, we assume the opponent breaks ties according to some fixed but unknown order over actions. On the first round, since there is no previous action, we assume the opponent plays the first action in that ordering.
This is our simplest opponent, who just assumes that tomorrow will be exactly like today.
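As an illustrative sketch (our own, not the paper's; the function name and argument conventions are assumptions), this rule can be written as follows, with the opponent's fixed but unknown tie-breaking order represented as a permutation `order` of the actions and M[i][j] the payoff to action i against action j:

```python
def myopic_best_response(M, order, our_last_action):
    """Opponent's move: a best response to our last action.

    M[i][j] is the payoff to action i against action j, and `order` is the
    opponent's fixed (but, to us, unknown) tie-breaking order over actions.
    """
    if our_last_action is None:                 # first round: no previous action
        return order[0]
    best = max(M[a][our_last_action] for a in order)
    # Among all best responses, take the earliest in the tie-breaking order.
    return next(a for a in order if M[a][our_last_action] == best)


# Example with rock-paper-scissors: the best response to our rock (0) is paper (1).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
assert myopic_best_response(RPS, [0, 1, 2], 0) == 1
```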
Gambler’s Fallacy
The opponent plays a best response to our “most overdue” (least-frequently-played) action. This opponent is motivated by the Gambler’s Fallacy, in which one assumes an event which hasn’t happened as often as expected is “overdue” and is more likely to occur (for example, when tossing a fair coin, the belief that Tails is more likely following a series of Heads).
If there are multiple best responses to our “most overdue” action, or if there are multiple “most overdue” actions the opponent might try to counter (for example, at the start of the game when all actions are equally “overdue” since they have all been played zero times), the opponent breaks ties using a fixed but unknown order over actions.
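Under the same conventions, a short sketch of this opponent (again our own illustration; `our_play_counts` and the function name are assumptions) looks as follows:

```python
def gamblers_fallacy_response(M, order, our_play_counts):
    """Opponent's move: a best response to our least-frequently-played action.

    `our_play_counts[a]` is the number of times we have played action a so far;
    ties (among "most overdue" actions and among best responses) are broken
    using the opponent's fixed order over actions.
    """
    fewest = min(our_play_counts[a] for a in order)
    overdue = next(a for a in order if our_play_counts[a] == fewest)   # "most overdue" action
    best = max(M[a][overdue] for a in order)
    return next(a for a in order if M[a][overdue] == best)             # best response to it


# Example: paper (1) is our least-played action, so the opponent plays scissors (2).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
assert gamblers_fallacy_response(RPS, [0, 1, 2], [3, 1, 2]) == 2
```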
Win-Stay, Lose-Shift
The opponent continues to play the same action following a win, and switches to a new action following a loss. We consider two variants of this strategy: one in which the opponent behaves as if a tie is a win, continuing to play the same action after tying, and one in which the opponent behaves as if a tie is a loss, switching actions after tying.
We assume that when the opponent shifts actions, they shift to the action immediately following the current action in some fixed but unknown order over actions. In the first round, we assume the opponent plays the first action in that ordering.
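Both variants fit in one small sketch (ours, not the paper's; the function name and the `tie_is_win` flag selecting tie-stay versus tie-shift are assumptions):

```python
def win_stay_lose_shift(order, prev_action, prev_payoff, tie_is_win=True):
    """Opponent's move under win-stay, lose-shift.

    `prev_action` and `prev_payoff` are the opponent's action and payoff from the
    previous round (None on the first round).  After a win the opponent repeats its
    action; after a loss it shifts to the next action in its fixed `order`.  The
    `tie_is_win` flag selects between the tie-stay and tie-shift variants.
    """
    if prev_action is None:                       # first round: first action in the order
        return order[0]
    stay_threshold = 0 if tie_is_win else 1       # minimum payoff needed to stay
    if prev_payoff >= stay_threshold:
        return prev_action
    i = order.index(prev_action)                  # shift to the next action in the order
    return order[(i + 1) % len(order)]


# Example: after losing (payoff -1) with rock (0), the opponent shifts to paper (1).
assert win_stay_lose_shift([0, 1, 2], 0, -1) == 1
```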
Win-stay lose-shift was initially proposed as an early algorithm for bandit problems by Robbins [15]. There is also some empirical research indicating that people may play rock-paper-scissors according to an approximate version of this strategy: Wang et al. [19] observed that players are more likely to play the same action again after winning, and more likely to shift to the next action in the sequence rock-paper-scissors following a loss. More broadly, win-stay lose-shift has been proposed as a potential explanation for the evolution of cooperative behavior [13].
This strategy can also be viewed as similar to the hot-hand fallacy, in which one believes an action that has just won is on a “hot” streak and is more likely to win in the future, and conversely that an action that has just lost is on a “cold” streak and is more likely to lose in the future.
Follow the Leader
The opponent plays the best action in retrospect, where the best action in retrospect is the action that would have achieved the highest net payoff against our history of play. If multiple actions are tied for the highest payoff, the opponent breaks ties using some fixed but unknown order over actions. In the first round, when there is no history on which to base a choice, the opponent plays the first action in that ordering.
It has been hypothesized that irrational strategic behavior may be explained by agents playing a misspecified or simplified version of a game rationally [7]. The above strategy is reasonable for an opponent who erroneously models our play as stochastic. In online learning, this strategy is called Follow-the-Leader or FTL (see [16]).
We also consider a limited-history variant of this opponent, in which the opponent only considers the last r rounds, where r > 0.
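A sketch covering both the full-history rule and the limited-history variant (our own illustration; the function name and the optional `window` parameter encoding r are assumptions):

```python
def follow_the_leader(M, order, our_history, window=None):
    """Opponent's move under follow-the-leader.

    `our_history` lists our past actions.  The opponent plays the action with the
    highest net payoff against that history (restricted to the last `window` rounds
    in the limited-history variant), breaking ties with its fixed `order`; on the
    first round it plays the first action in the order.
    """
    history = our_history if window is None else our_history[-window:]
    if not history:
        return order[0]

    def net_payoff(a):
        return sum(M[a][b] for b in history)      # what a would have earned against our play

    best = max(net_payoff(a) for a in order)
    return next(a for a in order if net_payoff(a) == best)


# Example: against a history of all rock (0), paper (1) is the best action in retrospect.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
assert follow_the_leader(RPS, [0, 1, 2], [0, 0, 0]) == 1
```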
Highest Average Payoff
The opponent plays the action that has achieved the highest average payoff over the times they have played it. We assume the opponent initializes the average payoff of each action as 0. If multiple actions are tied for the highest average payoff, the opponent breaks ties using some fixed but unknown order over actions.
This is a reasonable strategy for an opponent who believes our play is stochastic, in which case its own actions can be viewed as coins with various biases. It also corresponds to a well-known reinforcement learning algorithm, ϵ-greedy Q-learning with ϵ = 0 (see [17]).
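A minimal sketch of this opponent (our own illustration; the function name and the `payoff_history` mapping from each action to the payoffs the opponent received when playing it are assumptions):

```python
def highest_average_payoff(order, payoff_history):
    """Opponent's move: the action with the highest average realized payoff.

    `payoff_history[a]` is the list of payoffs the opponent received on rounds in
    which it played action a.  Actions never played keep their initial average of
    0, and ties are broken by the opponent's fixed `order`.
    """
    def avg(a):
        plays = payoff_history.get(a, [])
        return sum(plays) / len(plays) if plays else 0.0   # averages initialized to 0

    best = max(avg(a) for a in order)
    return next(a for a in order if avg(a) == best)


# Example: rock (0) has averaged better than paper (1), and scissors (2) is untried
# (average 0), so the opponent plays rock.
assert highest_average_payoff([0, 1, 2], {0: [1, 1, -1], 1: [-1, -1]}) == 0
```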
This paper is available on arxiv under CC BY 4.0 DEED license.