## The Coupon (Sticker) Collector

Super­mar­ket chain Ede­ka cur­rent­ly runs this stick­er col­lect­ing rack­et – which used to be smurf col­lect­ing – where you get four stick­ers with ani­mal pic­tures for every €10 spent. The kids are col­lect­ing like crazy, I am told, even though the pic­tures are mediocre at best. Any­way, as geeks we don’t care much about the pic­tures but rather the com­bi­na­torics and prob­a­bil­i­ty laws that make this scheme so suc­cess­ful.

#### Expectations

The sim­plest quan­ti­ty to cal­cu­late is the expect­ed num­ber of stick­ers need­ed to have a full col­lec­tion of all $m$ stick­ers (for the cur­rent Ede­ka cam­paign there are $m$ = 180 dif­fer­ent stick­ers). Pre­sum­ably, the stick­ers are ran­dom­ly drawn from a set $M=\bigl\lbrace 1,2,\ldots,m\bigr\rbrace$ with uni­form prob­a­bil­i­ty (each stick­er is equal­ly prob­a­ble), where we sim­ply assign a num­ber for each unique stick­er instead of writ­ing out the ani­mal names. We also neglect the fact that the stick­ers come in packs of four – this will affect the prob­a­bil­i­ties slight­ly as the packs are ran­dom­ly mixed but will like­ly not con­tain dou­bles.

The prob­a­bil­i­ty dis­tri­b­u­tion of the num­ber of (Bernoul­li) draws nec­es­sary to obtain a “suc­cess” giv­en suc­cess prob­a­bil­i­ty $p_i$ is giv­en by the geo­met­ric dis­tri­b­u­tion [1]. Here, suc­cess is defined as obtain­ing a new stick­er not yet with­in the col­lec­tion. This prob­a­bil­i­ty $p_i$ reduces with each new stick­er obtained. It is obvi­ous­ly 1 in the first draw, but the sec­ond stick­er drawn will be new only with prob­a­bil­i­ty $m-1\big/m$ since there are only $m-1$ stick­ers new out of the $m$ total. Gen­er­al­ly,

$$p_i = \frac{m-\bigl(i-1\bigr)}{m} \quad \text{for} \quad i\in\bigl\lbrace 1,2,…,m \bigr\rbrace \label{geometric-probability}$$

The total num­ber of stick­ers drawn, $n_\Sigma$, is then just the sum of the num­ber $n_i$ of stick­ers drawn until suc­cess $i$:

$$n_\Sigma = \sum\limits_{i=1}^{m} n_i \label{n_Sigma}$$

The expect­ed val­ue of the geo­met­ric dis­tri­b­u­tion is the inverse of $p_i$, and since the expec­ta­tion of a sum is just the sum of expec­ta­tions,

$$\mathcal{E}\Bigl[n_\Sigma\Bigr] = \sum\limits_{i=1}^{m} \frac{1}{p_i} = \sum\limits_{i=1}^{m} \frac{m}{m-\bigl(i-1\bigr)} \label{expected-necessary-draws}$$

For the $m$ = 180 ani­mal stick­ers this means that we need to col­lect approx­i­mate­ly 1039 stick­ers on aver­age to have the full set.

Deriv­ing the dis­tri­b­u­tion of $N$ is some­what more dif­fi­cult, since it involves the con­vo­lu­tion of $m$ prob­a­bil­i­ty dis­tri­b­u­tions (alter­na­tive­ly, mul­ti­pli­ca­tion of $m$ prob­a­bil­i­ty-gen­er­at­ing func­tions with sub­se­quent inverse z-trans­form of the result).

#### Combinatorics

A dif­fer­ent way to cal­cu­late the var­i­ous prob­a­bil­i­ties and their dis­tri­b­u­tions is from a com­bi­na­tor­i­cal point-of-view. To find the prob­a­bil­i­ty of $j$ stick­ers miss­ing (out of $m$ pos­si­ble) in a ran­dom draw of $n$ stick­ers – or alter­na­tive­ly, the prob­a­bil­i­ty of $r=m-j$ dif­fer­ent stick­ers being present in a drawn sam­ple – it is suf­fi­cient to find the num­ber of dif­fer­ent draws of $n$ stick­ers in which exact­ly $j$ stick­ers are miss­ing and nor­mal­ize it by the total num­ber of pos­si­ble sam­ples. The lat­ter is sim­ply $m^n$.

To find the num­ber of pos­si­ble draws with exact­ly $j$ stick­ers miss­ing, we can deter­mine the num­ber of pos­si­ble draws of $n$ stick­ers which con­tain all stick­ers from a reduced set of size $m-j$ and mul­ti­ply then this by the num­ber of com­bi­na­tions of the $j$ miss­ing stick­ers (since we do not care which stick­ers are miss­ing). This prob­lem is very close­ly relat­ed to the birth­day prob­lem in which we’d like to know how many dif­fer­ent birth­days occur on aver­age in a group of $n$ peo­ple. For the birth­day prob­lem, $m=365$ dis­count­ing leap years. The birth­day prob­lem is very well explained and its com­bi­na­torics derived in [2] (see item 13 for the deriva­tion of the most inter­est­ing quan­ti­ty). The prob­a­bil­i­ty of exact­ly $j$ stick­ers miss­ing out of a set $M$ of size $m$ in a ran­dom draw of $n$ stick­ers is$^1$

$$P_\text{missing}(j,m,n) = \binom{m}{j} \sum\limits_{k=0}^{m-j} (-1)^k \binom{m-j}{k}\left(1 - \frac{j + k}{m}\right)^n \label{P_missing}$$
with
$$j \in \Bigl\lbrace\max\Bigl[m-n,0\Bigr], \ldots, m-1\Bigr\rbrace \notag$$

since there have to be at least $m-n$ stick­ers miss­ing if $n\lt m$. The prob­a­bil­i­ty mass func­tion for the num­ber $r=m-j$ of col­lect­ed stick­ers out of $m$ total after $n$ draws (which is also giv­en by $P_\text{missing}(j,m,n)$) is shown below.

Prob­a­bil­i­ty mass func­tion for the num­ber of unique stick­ers obtained after ran­dom­ly col­lect­ing $n$ stick­ers. The total num­ber of dif­fer­ent stick­ers avail­able is $m$ = 180.

Inter­est­ing­ly, the spread of this dis­tri­b­u­tion is not very large; the num­ber of unique stick­ers one can expect to own at pret­ty much any one point does not vary much. By cal­cu­lat­ing the dis­tri­b­u­tion for every $n$ we can plot the expect­ed num­ber of unique stick­ers over $n$:

Aver­age num­ber of unique stick­ers obtained vs. total num­ber of stick­ers col­lect­ed with $m$ = 180 dif­fer­ent stick­ers avail­able (blue line). Also shown is the 10–90 per­centile range (yel­low bars).

#### A Full Set

So, then, how many stick­ers to we need to col­lect to achieve $j=0$ (a full set) with some prob­a­bil­i­ty $p$? This can be deter­mined with the help of $\eqref{P_missing}$ by not­ing that in order to com­plete the col­lec­tion on the $n$th stick­er, there must have been one stick­er miss­ing on the $n-1$th draw. The prob­a­bil­i­ty of this case is $P_\text{missing}(1,m,n-1)$ and the prob­a­bil­i­ty of get­ting the miss­ing stick­er on the $n$th draw is $1/m$ as in $\eqref{geometric-probability}$, so that the prob­a­bil­i­ty of hav­ing the full set of stick­ers after exact­ly $n$ draws is

$$P_\text{full}(m,n) = \sum_{k=0}^{m-1} (-1)^k \binom{m - 1}{k} \left(1 - \frac{1+k}{m}\right)^{n-1}$$
with
$$n \in \Bigl\lbrace m, m + 1, \ldots \Bigr\rbrace\notag$$

This is the prob­a­bil­i­ty mass func­tion (PMF) for $n_\Sigma$ as defined ear­li­er in $\eqref{n_Sigma}$. The prob­a­bil­i­ty of hav­ing a com­plete set after $n_\Sigma$ draws also includes the prob­a­bil­i­ties of hav­ing com­plet­ed the set on an ear­li­er draw, so we sim­ply inte­grate the PMF to obtain the cumu­la­tive den­si­ty func­tion:

Prob­a­bil­i­ty of hav­ing a full set of stick­ers vs. num­ber of stick­ers col­lect­ed so far with $m$ = 180 dif­fer­ent stick­ers avail­able.

For­tu­nate­ly, there are swap meets and places like eBay, or a lot of mom­mies would need to spend loads of mon­ey at Ede­ka (which I assume is the whole pur­pose of this exer­cise) – to get the 1039 stick­ers need­ed (on aver­age) means spend­ing about €2,600 in the super­mar­ket chain.

#### Other Cases

Unfor­tu­nate­ly, it is not rea­son­ably pos­si­ble to make $m$ or $n$ arbi­trar­i­ly vari­able with­in this page, as brows­er math­e­mat­ics is too lim­it­ed to cor­rect­ly com­pute the desired quan­ti­ties on-the-fly (a lot of almost equal num­bers are added and sub­tract­ed with­in the sums which pos­es great prob­lems for float­ing point cal­cu­la­tors). So we are lim­it­ed to choos­ing from a few pre-com­put­ed cas­es. Click on the but­tons below to see the graph­ics above change accord­ing­ly.

Choose a val­ue for $m$ (num­ber of dif­fer­ent items avail­able):

Foot­notes:
1 As of Octo­ber 5, 2012 the for­mu­la giv­en in [2] has the index of the sum run­ning from 1 instead of 0, which I believe is an error.

Ref­er­ences:
[1] Geo­met­ric Dis­tri­b­u­tion [Wikipedia]
[2] The Birth­day Prob­lem [Uni­ver­si­ty of Alaba­ma]

last posts in mathematics: