
Understanding Bias-Driven Opponent Models in Competitive Gameplay

by Algorithmic Bias, January 24th, 2025

Too Long; Didn't Read

This section introduces the game setting: two-player, repeated, symmetric, zero-sum games with payoffs limited to win, tie, and loss. It then presents models of behaviorally-biased opponents, including the Myopic Best Responder, Gambler’s Fallacy, Win-Stay Lose-Shift, Follow-the-Leader, and Highest Average Payoff opponents. Each model exhibits a distinct bias that can be exploited to win nearly every round.

Authors:

(1) Avrim Blum, Toyota Technological Institute at Chicago, IL, USA;

(2) Melissa Dutz, Toyota Technological Institute at Chicago, IL, USA.

Abstract and 1 Introduction

2 Setting and 2.1 Models of behaviorally-biased opponents

3 Preliminaries and Intuition

4.1 Myopic Best Responder and 4.2 Gambler’s Fallacy Opponent

4.3 Win-Stay, Lose-Shift Opponent

4.4 Follow-the-Leader Opponent and 4.5 Highest Average Payoff Opponent

5 Generalizing

5.1 Other Behaviorally-Biased Strategies

5.2 Exploiting an Unknown Strategy from a Known Set of Strategies

6 Future Work and References

A Appendix

A.1 Win-Stay Lose-Shift Variant: Tie-Stay

A.2 Follow-the-Leader Variant: Limited History

A.3 Ellipsoid Mistake Bounds

A.4 Highest Average Payoff Opponent

2. Setting

We consider two-player, repeated games with the following properties:


  • Symmetric
  • Zero-sum
  • Payoffs are limited to {1, 0, −1} (representing win, tie, and loss respectively)
  • Each action beats at least one other action
  • Each action loses to at least one other action


▶ Definition 1 (Permissible Game). We refer to a game with the above properties as a permissible game for convenience.
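For concreteness, these properties are easy to check mechanically. Below is a minimal sketch (ours, not from the paper), assuming the payoff matrix M is given as a square list of lists of the row player's payoffs; since the game is symmetric and zero-sum, M[i][j] must equal -M[j][i]:

    def is_permissible(M):
        """Check whether a row-player payoff matrix defines a permissible game."""
        n = len(M)
        for i in range(n):
            for j in range(n):
                # Payoffs are limited to win (1), tie (0), and loss (-1).
                if M[i][j] not in (1, 0, -1):
                    return False
                # Symmetric + zero-sum together force M[i][j] == -M[j][i].
                if M[i][j] != -M[j][i]:
                    return False
        # Each action beats at least one other action and loses to at least one other action.
        return all(1 in M[i] and -1 in M[i] for i in range(n))

    # Rock-paper-scissors is the canonical example of a permissible game.
    rps = [[0, -1, 1],
           [1,  0, -1],
           [-1, 1,  0]]
    assert is_permissible(rps)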



We consider how well we can do without knowing the payoff matrix M or even observing payoffs when playing against a behaviorally-biased opponent who does know M. Our goal will be to win nearly every round.

2.1 Models of behaviorally-biased opponents

We assume all opponents know the payoff matrix and play a deterministic, behaviorally-biased strategy. The following are the particular biased adversaries we consider here:


Myopic Best Responder


The opponent plays a best response to our last action. If there are multiple best responses to our last action, we assume the opponent breaks ties according to some fixed but unknown order over actions. On the first round, since there is no previous action, we assume the opponent plays the first action in that ordering.


This is our simplest opponent, who just assumes that tomorrow will be exactly like today.
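As a concrete sketch (ours, not the paper's), the rule can be written as follows, where M[a][b] denotes the opponent's payoff for playing a against our action b, tie_order is the opponent's fixed but (to us) unknown ordering over actions, and our_last_action is None in the first round:

    def myopic_best_response(M, tie_order, our_last_action):
        # First round: nothing to respond to, so play the first action in the ordering.
        if our_last_action is None:
            return tie_order[0]
        # Highest payoff achievable against our last action.
        best = max(M[a][our_last_action] for a in tie_order)
        # Among the best responses, the fixed ordering breaks ties.
        return next(a for a in tie_order if M[a][our_last_action] == best)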


Gambler’s Fallacy


The opponent plays a best response to our “most overdue” (least-frequently-played) action. This opponent is motivated by the Gambler’s Fallacy, in which one assumes an event which hasn’t happened as often as expected is “overdue” and is more likely to occur (for example, when tossing a fair coin, the belief that Tails is more likely following a series of Heads).


If there are multiple best responses to our “most overdue” action, or if there are multiple “most overdue” actions the opponent might try to counter (for example, at the start of the game when all actions are equally “overdue” since they have all been played zero times), the opponent breaks ties using a fixed but unknown order over actions.
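In the same illustrative style (with the same assumed M and tie_order as above), this opponent's rule might look like the following, where our_play_counts maps each of our actions to the number of times we have played it:

    def gamblers_fallacy_response(M, tie_order, our_play_counts):
        # "Most overdue" action: the one we have played least often; ties broken by the fixed ordering.
        least = min(our_play_counts[a] for a in tie_order)
        overdue = next(a for a in tie_order if our_play_counts[a] == least)
        # Best response to that action, again with ties broken by the fixed ordering.
        best = max(M[a][overdue] for a in tie_order)
        return next(a for a in tie_order if M[a][overdue] == best)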


Win-Stay, Lose-Shift


The opponent continues to play the same action following a win, and switches to a new action following a loss. We consider two variants of this strategy: one in which the opponent behaves as if a tie is a win, continuing to play the same action after tying, and one in which the opponent behaves as if a tie is a loss, switching actions after tying.


We assume that when the opponent shifts actions, they shift to the action immediately following the current action in some fixed but unknown order over actions. In the first round, we assume the opponent plays the first action in that ordering.
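A sketch covering both variants (ours; last_payoff is the opponent's payoff in the previous round, tie_is_win selects the tie-stay variant, and wrapping around at the end of the ordering is our own assumption):

    def win_stay_lose_shift(tie_order, last_action, last_payoff, tie_is_win=True):
        # First round: play the first action in the fixed ordering.
        if last_action is None:
            return tie_order[0]
        # Stay after a win (and, in the tie-stay variant, after a tie).
        if last_payoff == 1 or (last_payoff == 0 and tie_is_win):
            return last_action
        # Otherwise shift to the next action in the fixed ordering, wrapping around.
        i = tie_order.index(last_action)
        return tie_order[(i + 1) % len(tie_order)]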


Win-stay lose-shift was initially proposed as an early algorithm for bandit problems by Robbins [15]. There is also some empirical research indicating that people may play rock-paper-scissors according to an approximate version of this strategy: Wang et al. [19] observed that players are more likely to play the same action again after winning, and more likely to shift to the next action in the sequence rock-paper-scissors following a loss. More broadly, win-stay lose-shift has been proposed as a potential explanation for the evolution of cooperative behavior [13].


This strategy can also be viewed as similar to the hot-hand fallacy, in which one believes an action which has just won is on a “hot streak” and is more likely to win in the future, and conversely believes an action which has just lost is on a “cold streak” and is more likely to lose in the future.


Follow the Leader


The opponent plays the best action in retrospect, where the best action in retrospect is the action that would have achieved the highest net payoff against our history of play. If multiple actions are tied for the highest payoff, the opponent breaks ties using some fixed but unknown order over actions. In the first round, when there is no history to choose based on, the opponent plays the first action in that ordering.
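Sketched in the same illustrative way (where our_history is the list of actions we have played so far):

    def follow_the_leader(M, tie_order, our_history):
        # First round: no history to judge by, so play the first action in the ordering.
        if not our_history:
            return tie_order[0]
        # Net payoff each of the opponent's actions would have earned against our history of play.
        net = {a: sum(M[a][b] for b in our_history) for a in tie_order}
        best = max(net.values())
        # Ties for best-in-retrospect are broken by the fixed ordering.
        return next(a for a in tie_order if net[a] == best)

The limited-history variant discussed below (and in Appendix A.2) would apply the same rule to only the last r entries of our_history.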


It has been hypothesized that irrational strategic behavior may be explained by agents playing a misspecified or simplified version of a game rationally [7]. The above strategy is reasonable for an opponent who erroneously models our play as stochastic. In online learning, this strategy is called Follow-the-Leader or FTL (see [16]).


We also consider a limited-history variant of this opponent, in which the opponent only considers the last r rounds, where r > 0.


Highest Average Payoff


The opponent plays the action that has achieved the highest average payoff over the times they have played it. We assume the opponent initializes the average payoff of each action as 0. If multiple actions are tied for the highest average payoff, the opponent breaks ties using some fixed but unknown order over actions.
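A sketch of this rule (ours; totals and counts track the opponent's cumulative payoff and number of plays for each of their actions):

    def highest_average_payoff(tie_order, totals, counts):
        # Average payoff per action, treated as 0 for actions the opponent has not yet played.
        avg = {a: totals[a] / counts[a] if counts[a] else 0.0 for a in tie_order}
        best = max(avg.values())
        # Ties for the highest average are broken by the fixed ordering.
        return next(a for a in tie_order if avg[a] == best)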


This is a reasonable strategy for an opponent who believes our play is stochastic, in which case each of the opponent’s actions can be viewed as a coin with some fixed bias. It also corresponds to a well-known reinforcement learning algorithm, epsilon-greedy Q-learning, with ϵ = 0 (see [17]).


This paper is available on arXiv under the CC BY 4.0 DEED license.