The Secretary Problem

5 minute read

Published:

When perusing Cross Validated, I came across someone seeking to simulate the Secretary Problem which was summed up quite nicely by this problem text:

When applying for a secretarial position, $n$ candidates line up in random order for interviews, where the manager after each interview has to decide to take or decline that candidate. His assessment is qualitative in the sense that at interview $k\geq 2$, he observes only $Y_{k,n}$, the indicator that candidate $k$ is better than candidates $1,\ldots,k−1,$ and his goal is to find a strategy whose probability of selecting the best candidate is close to the maximal value $p_n^*$. It is easy to see that the optimal strategy is to select candidate

for some $k\in{2,\ldots,n}$. Let $k_n^*$ denote the optimal such $k$. Plot simulation estimates of $k_n^*$ and $p_n^*$ for $n = 5,\ldots, 50,$ and compare with the known asymptotics $k_n^*\sim n/e, p_n^*\sim 1/e$.

His questions were as follows:

  1. Why would you choose to simulate $k_n^*$?
  2. How would you do it?

Simulation is a hobby of mine so I decided to take up this question and answer it, the results of which will be here.

Why Simulate?

Quick research into the problem will reveal that the optimal algorithm to solve this problem is to rank the first $\frac{1}{e}$ applicants, not hire them, and then choose the first candidate who is better than your previous ones. For example, if there are ten candidates each with unique rankings, you might observe the following sequence.

We observe that in the first $\lfloor \frac{1}{e} \rfloor$ candidates, or first 3 candidates, the max score was a six. So, we check the fourth candidate and observe a 5, so we don’t hire. Then we observe a 4, so we don’t hire. Then we observe a 9, so he hire them and halt the interview process. In this case, we didn’t select the optimal candidate, but we were close.


But what if we don’t know this algorithm and we want to determine which $k$ is best? It isn’t immediately intuitive that $k \approx \frac{1}{e}$ is the best number of candidates to skip and so simulating can assist us in finding for which $k$ our probabilility of picking the best candidate is maximized. In the general case, simulations aid in intuition. If we can guess an optimal solution, or perhaps even prove an optimal solution through mathematics, then simulation helps to verify our results.

How would you simulate?

This is perhaps my favorite part. I will be using R to write this simulation and go through the steps on how to do this simulation.

We will start with a single simulation for arbitrary $k$, where we want to create 1000 candidates each with distinct rankings.

x <- sample(1:1000, 1000)

Now that we have our sample, we want to find the best candidate among the first $k$ candidates, which can be done with

init_best <- max(x[1:k])

Next, we begin by looping through the remaining candidates until we find one who is better than the best of our first $k$ candidates. Note that we only simuluate up to $n-1$ candidates for $k$, because it doesn’t make sense to skip every candidate as that always results in no hire.

for (i in (k+1):999) {
	if (x[i] > init_best) {
		candidate_score <- x[i]
		candidate_num <- i
		break
	}
}

So, from here, we have recorded the candidates score. We can quickly verify if this candidate is the best candidate by checking if their ranking is the max ranking. Since we have 1000 candidates, we do this with

if (candidate_score == 1000)
	success = success + 1

That is, we record that we successfully chose the best candidate and increment some value that keeps track of that. Using these, we can write a few loops to run this simulation several times which is shown below

sims <- 10000
p <- c()
p[1] <- 0
cand_ave <- c()
cand_ave[1] <- 0
for (k in 2:999) {
  success <- 0
  for (n in 1:sims) {
    x <- sample(1:1000, 1000)
    init_best <- max(x[1:k])
    candidate_score = -1
    for (i in (k+1):1000) {
      if (x[i] > init_best) {
        candidate_score <- x[i]
        break
      }
    }
    if (candidate_score == 1000)
      success <- success + 1
  }
  p[k] <- success/sims
  cat(k, " complete \n")
}
plot(1:999, p, type = "l", xlab = "Candidates Skipped",
     ylab = "Probability of selecting Best Candidate",
     main = "The Secretary Problem")

So, we create a collection of probabilities so that later on, we can plot these values. We also initialize candidate_score at -1 before each loop so as to signify the case where we end up not hiring anyone. After running this simulation 10000 times for each $k$, the results are as follows

Within this simulation, we find that the $k$ value that maximizes $p$ is $k = 332$ or $k = 374$ with $p = .3789$.

We know that the asymptotic $k_n^*$ and $p_n^*$ values are $\dfrac{n}{e}$ and $\dfrac{1}{e}$ respectively and these results are fairly close to those values which would be in this case

So the simulation approximately confirms the asymptotic values.