From: Paul E. Black on
On Wednesday 02 December 2009 02:03, Barb Knox wrote:
> Barb Knox <see(a)sig.below> wrote:
>> Golabi Doon <golabidoon(a)gmail.com> wrote:
>> > Given n strings S_1 through S_n, where each S_k is simply a
>> > concatenation a of letter, say "a", but of arbitrary yet finite
>> > length. For example, S_1=aa , S_2=a , S_3=aa , S_4=aaa.
>> >
>> > Consider a tape of length T. We start from the head of the tape,
>> > randomly choose a S_k, put it on the tape and move fto the next free
>> > cell, and then randomly choose another S_k and paste it in and so on.
>> > In other words, we concatenate randomly chosen strings from the given
>> > set of strings until we reach a point where the size of the chosen S_k
>> > is larger than the free space left on the tape.
>> >
>> > Example: T=7.
>> > A randomly chosen string happens to be S_1, the tape becomes aa.
>> > Next string happens to be S_2, the tape becomes aaa.
>> > Next string happens to be S_1, the tape becomes aaaaa
>> > Next string happens to be S_4, the tape is too full for that, so halt.
>> >
>> > Let's denote the number of times that S_k appears on the tape by
>> > #S_k.In the previous example #S_1=2, #S_2=1, #S_3=#S_4=0.
>> >
>> > I believe, as T approaches infinity, the ratio #S_k / sum_t (#S_t)
>> > converges to fixed number or each k. By fixed I mean, if I repeat this
>> > process, by generating another randomly concatenation of the given
>> > string set on the tape, I get the same ratios.
>> >
>> > Is my conjecture correct? If so, is it possible to express #S_k /
>> > sum_t (#S_t) in terms of L_1 through L_n where L_k is the length of
>> > S_k? Naturally, a longer S_k has less chance to appear on the tape as
>> > opposed to a smaller S_k. However, I do not know the exact
>> > relationship.
>> Golabi Doon <golabidoon(a)gmail.com> wrote:
>> > I forgot to mention the simplying assumption that when choosing a
>> > string to put it on the tape, all the strings have equal chance to be
>> > chosen, i.e. the probablity of choosing any string S_k is equal to 1/
>> > n.
>>
>> Your conjecture is true, but trivially so. (#S_k)/sum_t(#S_t)
>> approaches 1/n for every k. As the tape gets longer,
>> the length of the final string (which, as you say,
>> is more likely to be a short one than a long one)
>
> Oh bother. That turns out to be false, even though I found it
> intuitively compelling. Certainly the final+1 string (i.e., the one
> which runs off the end of the tape and is thus discarded) is more likely
> to be long than short, but ALL the other strings, including the final
> non-discarded one, have uniformly distributed lengths. That is, even
> for *small* tape lengths T, (#S_k)/sum_t(#S_t) averages 1/n for all k.
> The only constraint is that T not be < n.
>
> I was surprised by this. I would like to hear from you (yes, *you* too,
> gentle reader) what *your* intuitions say about the expected length of
> the final non-discarded string.

I worked out all the possiblities for T=4, S_1 = aa and S_2 = b (I
used different letters so I could keep track of which strings had been
chosen.) I was surprised, too. The ratio is *exactly* 50/50.

The average distributions (expected ratio) are always exactly the
probability of choosing, that is, 1/n. Here's an intuition why.

Suppose you're doing the above as a form of musical chairs. That is,
the tape is unbounded and you always stop adding strings when the
music ends, never when you run out of tape. What are the ratios?
Clearly 1/n.

When does the music stop? It doesn't really matter what causes you to
stop adding strings! The point is that up until then, the chances of
adding a string is 1/n, no more, no less. It sure doesn't matter how
much tape is left.

If the problem says keep adding until you can't add any more, then
certainly there will be more shorter strings than longer strings. But
that isn't this problem.

An analog is the baby gender ratio paradox. Suppose a society values
boys babies, but not girl babies. Specifically when a couple has a
girl, they don't have any more children. Assuming equal chance of
having a girl or boy, what's the percentage in the overall population?
An incorrect intuition is that one will see families of 3 boys and 1
girl or 4 boys and 1 girl, so there must be more boys. Actually the
ratio is exactly 1/2. How could that be? Families with several boys
will be scarce.

-paul-
--
Paul E. Black (p.black(a)acm.org)
From: Barb Knox on
In article <slrnhhcgi4.n6v.tim(a)soprano.little-possums.net>,
Tim Little <tim(a)little-possums.net> wrote:

> >> > I believe, as T approaches infinity, the ratio #S_k / sum_t (#S_t)
> >> > converges to fixed number or each k.
>
> Your proposition is slightly ambiguous: you mention the ratio
> (#S_k)/sum_t(#S_t), but do not mention whether this should be
> interpreted as a ratio of expected values, or an expected value of the
> ratio.
>
> For example, suppose |S_1| = 1, |S_2| = 5, T = 5. Then E(#S_2) = 1/2
> (it appears iff it appears first), while E(#S_1) = 31/32. So then
> E(#S_1) / E(sum_t(#S_t)) = 31/47 and E(#S_2) / E(sum_t(#S_t)) = 16/47.
>
> However, E(#S_1 / sum_t(#S_t)) = E(#S_2 / sum_t(#S_t)) = 1/2.
>
>
> On 2009-12-02, Barb Knox <see(a)sig.below> wrote:
> > Does anyone reading this have a good intuitive combinatorial (i.e.,
> > not probabilistic) argument showing that considering all possible
> > completed tapes, every S_k occurs with equal frequency regardless of
> > the tape length?
>
> As the above example indicates, they don't. Short strings do occur
> more frequently - in that example, the shorter string appears 31/16
> times more frequently overall.

My bad. I misread the original problem as saying #S_i = i. Since Paul
Black used an example isomorphic to this he too got equal frequencies.

I expect there is a simple argument why when #S_i = i the frequencies
must be equal, but I can't see it. Any ideas?


> If you disregard the probabilities of each completed tape and just
> look at the combinatorial possibilities, then the disparity is even
> greater: the possibilities are 1, 11, 111, 1111, 11111, and 2.
>
> It is true that the final strings are uniformly distributed, by
> reversal symmetry and the condition that the initial strings are.
>
>
> - Tim



--
---------------------------
| BBB b \ Barbara at LivingHistory stop co stop uk
| B B aa rrr b |
| BBB a a r bbb | Quidquid latine dictum sit,
| B B a a r b b | altum viditur.
| BBB aa a r bbb |
-----------------------------
From: Barb Knox on
In article <see-09016E.22562204122009(a)mail.eternal-september.org>,
Barb Knox <see(a)sig.below> wrote:

> In article <slrnhhcgi4.n6v.tim(a)soprano.little-possums.net>,
> Tim Little <tim(a)little-possums.net> wrote:
>
> > >> > I believe, as T approaches infinity, the ratio #S_k / sum_t (#S_t)
> > >> > converges to fixed number or each k.
> >
> > Your proposition is slightly ambiguous: you mention the ratio
> > (#S_k)/sum_t(#S_t), but do not mention whether this should be
> > interpreted as a ratio of expected values, or an expected value of the
> > ratio.
> >
> > For example, suppose |S_1| = 1, |S_2| = 5, T = 5. Then E(#S_2) = 1/2
> > (it appears iff it appears first), while E(#S_1) = 31/32. So then
> > E(#S_1) / E(sum_t(#S_t)) = 31/47 and E(#S_2) / E(sum_t(#S_t)) = 16/47.
> >
> > However, E(#S_1 / sum_t(#S_t)) = E(#S_2 / sum_t(#S_t)) = 1/2.
> >
> >
> > On 2009-12-02, Barb Knox <see(a)sig.below> wrote:
> > > Does anyone reading this have a good intuitive combinatorial (i.e.,
> > > not probabilistic) argument showing that considering all possible
> > > completed tapes, every S_k occurs with equal frequency regardless of
> > > the tape length?
> >
> > As the above example indicates, they don't. Short strings do occur
> > more frequently - in that example, the shorter string appears 31/16
> > times more frequently overall.
>
> My bad. I misread the original problem as saying #S_i = i. Since Paul
> Black used an example isomorphic to this he too got equal frequencies.

Typo: should be |S_i| = i.

> I expect there is a simple argument why when #S_i = i the frequencies
> must be equal, but I can't see it. Any ideas?

Ditto.


> > If you disregard the probabilities of each completed tape and just
> > look at the combinatorial possibilities, then the disparity is even
> > greater: the possibilities are 1, 11, 111, 1111, 11111, and 2.
> >
> > It is true that the final strings are uniformly distributed, by
> > reversal symmetry and the condition that the initial strings are.
> >
> >
> > - Tim



--
---------------------------
| BBB b \ Barbara at LivingHistory stop co stop uk
| B B aa rrr b |
| BBB a a r bbb | Quidquid latine dictum sit,
| B B a a r b b | altum viditur.
| BBB aa a r bbb |
-----------------------------
From: Tim Little on
On 2009-12-04, Barb Knox <see(a)sig.below> wrote:
> I expect there is a simple argument why when #S_i = i the frequencies
> must be equal, but I can't see it. Any ideas?

I don't see it at all. Suppose T = n = 2. Then E(#S_2) = 1/2, but
E(#S_1) = 3/4: 1/4 chance of appearing once, plus 1/4 chance of
appearing twice.


- Tim
From: Ben Bacarisse on
Tim Little <tim(a)little-possums.net> writes:

> On 2009-12-04, Barb Knox <see(a)sig.below> wrote:
>> I expect there is a simple argument why when #S_i = i the frequencies
>> must be equal, but I can't see it. Any ideas?
>
> I don't see it at all. Suppose T = n = 2. Then E(#S_2) = 1/2, but
> E(#S_1) = 3/4: 1/4 chance of appearing once, plus 1/4 chance of
> appearing twice.

But the expected value we want is for the ratio #S_k/sum_t(#S_t) and
the sum is 2 when S_1 is chosen twice.

As I see it, half the time choose S_2 first. In this case sum_t(#S_t)
is 1 and #S_1 = 0 and #S_2 = 1. Half the time #S_1/sum_t(#S_t) is 0
and #S_2/sum_t(#S_t) is 1.

The other half of the time we pick S_1 first. There is then an equal
probability of choosing S_1 again or S_2. In both cases we stop. In
the first case #S_1/sum_t(#S_t) = 2/2 = 1 and in the second case
#S_1/sum_t(#S_t) = 1/1 = 1. I.e. half the time #S_1/sum_t(#S_t) is
one and half the time is it zero.

In both of these cases #S_2 = 0 so #S_2/sum_t(#S_t) is zero half the
time and one the other half.

--
Ben.