## The Coupon (Sticker) Collector

Super­mar­ket chain Edeka cur­rently runs this sticker col­lect­ing racket – which used to be smurf col­lect­ing – where you get four stick­ers with ani­mal pic­tures for every €10 spent. The kids are col­lect­ing like crazy, I am told, even though the pic­tures are mediocre at best. Any­way, as geeks we don’t care much about the pic­tures but rather the com­bi­na­torics and prob­a­bil­ity laws that make this scheme so suc­cess­ful.

#### Expec­ta­tions

The sim­plest quan­tity to cal­cu­late is the expected num­ber of stick­ers needed to have a full col­lec­tion of all $m$ stick­ers (for the cur­rent Edeka cam­paign there are $m$ = 180 dif­fer­ent stick­ers). Pre­sum­ably, the stick­ers are ran­domly drawn from a set $M=\bigl\lbrace 1,2,\ldots,m\bigr\rbrace$ with uni­form prob­a­bil­ity (each sticker is equally prob­a­ble), where we sim­ply assign a num­ber for each unique sticker instead of writ­ing out the ani­mal names. We also neglect the fact that the stick­ers come in packs of four – this will affect the prob­a­bil­i­ties slightly as the packs are ran­domly mixed but will likely not con­tain dou­bles.

The prob­a­bil­ity dis­tri­b­u­tion of the num­ber of (Bernoulli) draws nec­es­sary to obtain a “suc­cess” given suc­cess prob­a­bil­ity $p_i$ is given by the geo­met­ric dis­tri­b­u­tion [1]. Here, suc­cess is defined as obtain­ing a new sticker not yet within the col­lec­tion. This prob­a­bil­ity $p_i$ reduces with each new sticker obtained. It is obvi­ously 1 in the first draw, but the sec­ond sticker drawn will be new only with prob­a­bil­ity $m-1\big/m$ since there are only $m-1$ stick­ers new out of the $m$ total. Gen­er­ally,

$$p_i = \frac{m-\bigl(i-1\bigr)}{m} \quad \text{for} \quad i\in\bigl\lbrace 1,2,…,m \bigr\rbrace \label{geometric-probability}$$

The total num­ber of stick­ers drawn, $n_\Sigma$, is then just the sum of the num­ber $n_i$ of stick­ers drawn until suc­cess $i$:

$$n_\Sigma = \sum\limits_{i=1}^{m} n_i \label{n_Sigma}$$

The expected value of the geo­met­ric dis­tri­b­u­tion is the inverse of $p_i$, and since the expec­ta­tion of a sum is just the sum of expec­ta­tions,

$$\mathcal{E}\Bigl[n_\Sigma\Bigr] = \sum\limits_{i=1}^{m} \frac{1}{p_i} = \sum\limits_{i=1}^{m} \frac{m}{m-\bigl(i-1\bigr)} \label{expected-necessary-draws}$$

For the $m$ = 180 ani­mal stick­ers this means that we need to col­lect approx­i­mately 1039 stick­ers on aver­age to have the full set.

Deriv­ing the dis­tri­b­u­tion of $N$ is some­what more dif­fi­cult, since it involves the con­vo­lu­tion of $m$ prob­a­bil­ity dis­tri­b­u­tions (alter­na­tively, mul­ti­pli­ca­tion of $m$ probability-generating func­tions with sub­se­quent inverse z-transform of the result).

#### Com­bi­na­torics

A dif­fer­ent way to cal­cu­late the var­i­ous prob­a­bil­i­ties and their dis­tri­b­u­tions is from a com­bi­na­tor­i­cal point-of-view. To find the prob­a­bil­ity of $j$ stick­ers miss­ing (out of $m$ pos­si­ble) in a ran­dom draw of $n$ stick­ers – or alter­na­tively, the prob­a­bil­ity of $r=m-j$ dif­fer­ent stick­ers being present in a drawn sam­ple – it is suf­fi­cient to find the num­ber of dif­fer­ent draws of $n$ stick­ers in which exactly $j$ stick­ers are miss­ing and nor­mal­ize it by the total num­ber of pos­si­ble sam­ples. The lat­ter is sim­ply $m^n$.

To find the num­ber of pos­si­ble draws with exactly $j$ stick­ers miss­ing, we can deter­mine the num­ber of pos­si­ble draws of $n$ stick­ers which con­tain all stick­ers from a reduced set of size $m-j$ and mul­ti­ply then this by the num­ber of com­bi­na­tions of the $j$ miss­ing stick­ers (since we do not care which stick­ers are miss­ing). This prob­lem is very closely related to the birth­day prob­lem in which we’d like to know how many dif­fer­ent birth­days occur on aver­age in a group of $n$ peo­ple. For the birth­day prob­lem, $m=365$ dis­count­ing leap years. The birth­day prob­lem is very well explained and its com­bi­na­torics derived in [2] (see item 13 for the deriva­tion of the most inter­est­ing quan­tity). The prob­a­bil­ity of exactly $j$ stick­ers miss­ing out of a set $M$ of size $m$ in a ran­dom draw of $n$ stick­ers is$^1$

$$P_\text{missing}(j,m,n) = \binom{m}{j} \sum\limits_{k=0}^{m-j} (-1)^k \binom{m-j}{k}\left(1 - \frac{j + k}{m}\right)^n \label{P_missing}$$
with
$$j \in \Bigl\lbrace\max\Bigl[m-n,0\Bigr], \ldots, m-1\Bigr\rbrace \notag$$

since there have to be at least $m-n$ stick­ers miss­ing if $n\lt m$. The prob­a­bil­ity mass func­tion for the num­ber $r=m-j$ of col­lected stick­ers out of $m$ total after $n$ draws (which is also given by $P_\text{missing}(j,m,n)$) is shown below.

Prob­a­bil­ity mass func­tion for the num­ber of unique stick­ers obtained after ran­domly col­lect­ing $n$ stick­ers. The total num­ber of dif­fer­ent stick­ers avail­able is $m$ = 180.

Inter­est­ingly, the spread of this dis­tri­b­u­tion is not very large; the num­ber of unique stick­ers one can expect to own at pretty much any one point does not vary much. By cal­cu­lat­ing the dis­tri­b­u­tion for every $n$ we can plot the expected num­ber of unique stick­ers over $n$:

Aver­age num­ber of unique stick­ers obtained vs. total num­ber of stick­ers col­lected with $m$ = 180 dif­fer­ent stick­ers avail­able (blue line). Also shown is the 10–90 per­centile range (yel­low bars).

#### A Full Set

So, then, how many stick­ers to we need to col­lect to achieve $j=0$ (a full set) with some prob­a­bil­ity $p$? This can be deter­mined with the help of $\eqref{P_missing}$ by not­ing that in order to com­plete the col­lec­tion on the $n$th sticker, there must have been one sticker miss­ing on the $n-1$th draw. The prob­a­bil­ity of this case is $P_\text{missing}(1,m,n-1)$ and the prob­a­bil­ity of get­ting the miss­ing sticker on the $n$th draw is $1/m$ as in $\eqref{geometric-probability}$, so that the prob­a­bil­ity of hav­ing the full set of stick­ers after exactly $n$ draws is

$$P_\text{full}(m,n) = \sum_{k=0}^{m-1} (-1)^k \binom{m - 1}{k} \left(1 - \frac{1+k}{m}\right)^{n-1}$$
with
$$n \in \Bigl\lbrace m, m + 1, \ldots \Bigr\rbrace\notag$$

This is the prob­a­bil­ity mass func­tion (PMF) for $n_\Sigma$ as defined ear­lier in $\eqref{n_Sigma}$. The prob­a­bil­ity of hav­ing a com­plete set after $n_\Sigma$ draws also includes the prob­a­bil­i­ties of hav­ing com­pleted the set on an ear­lier draw, so we sim­ply inte­grate the PMF to obtain the cumu­la­tive den­sity func­tion:

Prob­a­bil­ity of hav­ing a full set of stick­ers vs. num­ber of stick­ers col­lected so far with $m$ = 180 dif­fer­ent stick­ers avail­able.

For­tu­nately, there are swap meets and places like eBay, or a lot of mom­mies would need to spend loads of money at Edeka (which I assume is the whole pur­pose of this exer­cise) – to get the 1039 stick­ers needed (on aver­age) means spend­ing about €2,600 in the super­mar­ket chain.

#### Other Cases

Unfor­tu­nately, it is not rea­son­ably pos­si­ble to make $m$ or $n$ arbi­trar­ily vari­able within this page, as browser math­e­mat­ics is too lim­ited to cor­rectly com­pute the desired quan­ti­ties on-the-fly (a lot of almost equal num­bers are added and sub­tracted within the sums which poses great prob­lems for float­ing point cal­cu­la­tors). So we are lim­ited to choos­ing from a few pre-computed cases. Click on the but­tons below to see the graph­ics above change accord­ingly.

Choose a value for $m$ (num­ber of dif­fer­ent items avail­able):

Foot­notes:
1 As of Octo­ber 5, 2012 the for­mula given in [2] has the index of the sum run­ning from 1 instead of 0, which I believe is an error.

Ref­er­ences:
[1] Geo­met­ric Dis­tri­b­u­tion [Wikipedia]
[2] The Birth­day Prob­lem [Uni­ver­sity of Alabama]

last posts in mathematics: