From: Colin Guthrie on
Hi,

OK, this is really just a sounding board for a couple of ideas I'm
mulling over regarding a pseudo-randomisation system for some websites
I'm doing. Any thoughts on the subject are greatly appreciated!

Back Story:

We have a system that lists things. The things are broken down by
category, but you can still end up at the leaf of a category with a
couple of hundred things to list, which are shown via a pagination
system (let's say 50 per page).


Now, the people who own the things pay to have their things on the site.
Let's say there are three levels of listing: gold, silver and bronze.
The default order is gold things, then silver things, then bronze
things. Within each level, the things are listed alphabetically (again,
this is just the default).


Now if 100 things in one category have a gold level listing, those in
the second half of the alphabet will be on page two by default. They
don't like this and they question why they are paying for gold at all.

My client would like to present things in a more random way to give all
gold level things a chance to be on the first page of results in a
fairer way than just what they happen to be named.

Right that's the back story. It's more complex than that, but the above
is a nice and simple abstraction.


Problems:

There are numerous problems with randomised listings: you can't
actually truly randomise results on every request, otherwise pagination
breaks. Server-side caching/denormalisation is affected, as there is no
longer "one listing" but "many random listings". Sharing a link with a
friend over IM or email and saying things like "the third one down looks
best" is obviously broken too, but this is something my client accepts
and can live with. Also, if the intention is to reassure the thing
owners that their listing will appear further up the listings at times,
the fact that a simple refresh will not reorder things for a given
session will make that point harder to get across to less web-educated
clients (that's a nice way of saying it!). Caching proxies and other
similar things sitting between the webserver and the user will also
come into play.


So to me there are only really two options:

1. Random-per-user (or session): each user session gets some kind of
randomisation key and a fresh set of random numbers is generated for
each thing. The listing can then be reliably "randomised" for a given
user. The fact that each user has their own unique randomisation is
good, but it doesn't help things like server-side full-page caching, and
thus more "work" needs to be done to support this approach.

2. Random bank + user/session assignment: with this approach we have a
simple table of numbers. The first column is an id, sequential from 1 to
<very big number>. The table also has lots of other columns, say 32,
each storing a random number. Once generated, this table acts as an
orderer: it can be joined into our thing lookup query and the results
ordered by one of the columns. Which column to use for ordering is
picked by a cookie stored on the user's machine, so the user will always
get the same "random" result, even if they revisit the site some time
later. (Users not accepting cookies is not a huge deal, but I would
suggest the "pick a random column" algorithm, used to set the cookie
initially, is actually based on the source IP address; that way even
cookieless folks should get a consistent listing unless they change
their IP.) A rough sketch of this follows below.
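
To make that concrete, here is roughly what I have in mind in PHP
(table and column names like rand_bank, r1..r32, things and level are
just placeholders for now, and this is untested):

<?php
// Prefer the cookie; fall back to a hash of the source IP so
// cookieless visitors still get a stable ordering.
if (isset($_COOKIE['rand_col'])) {
    $col = (int) $_COOKIE['rand_col'];
} else {
    $col = ((crc32($_SERVER['REMOTE_ADDR']) & 0x7fffffff) % 32) + 1;
    setcookie('rand_col', (string) $col, time() + 30 * 86400);
}
$col = max(1, min(32, $col)); // never trust the cookie blindly

$sql = "SELECT t.*
          FROM things t
          JOIN rand_bank r ON r.id = t.id
      ORDER BY t.level, r.r$col
         LIMIT 50 OFFSET 0";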



I'm obviously leaning towards the second approach. If I have 32
"pre-randomised" columns, this should give a pretty good end result, I
think. If we re-randomise periodically (e.g. once a week or month) then
this can be extended further (or more columns can simply be added).

I think it's the lowest-impact option, but there are still some concerns:

Server-side caching is still problematic. Instead of storing one page
per "result" I now have to store 32. This will significantly lower the
cache hit rate and perhaps make full result caching somewhat redundant.
If that is the case, then so be it, but the load will have to be managed.


So my question for the lazy-web:

Are there any other approaches I've missed? Is there some cunning
cleverness that eludes me?

Are there any problems with the above approach? Would a caching proxy
ultimately cause problems for some users (e.g. serving page 1 and page 2
of the same listing from caches with different randomisations)? And if
so, can this be mitigated?

Thanks for reading and any insights you may have!


Col






--

Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited [http://www.tribalogic.net/]
Open Source:
Mandriva Linux Contributor [http://www.mandriva.com/]
PulseAudio Hacker [http://www.pulseaudio.org/]
Trac Hacker [http://trac.edgewall.org/]

From: "Jon Haworth" on
Hi Col,

Interesting problem.

> Are there any other approaches I've missed?

Off the top of my head, how about this:

1. Add a new unsigned int column called "SortOrder" to the table of widgets
or whatever it is you're listing

2. Fill this column with randomly-generated numbers between 0 and the
unsigned int max (4,294,967,295, so about 4.3 billion)

3. Add the SortOrder column to the end of all your ORDER BY clauses - SELECT
foo ORDER BY TypeOfListing, SortOrder will give you widgets sorted by
Gold/Silver/Bronze type, but in a random order for each type

4. Every hour/day/week/whatever, update this column with different random
numbers (a rough sketch of all four steps follows below)
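
In PHP/SQL terms, something like this (untested; "widgets" and
"TypeOfListing" are placeholder names, and $db is assumed to be an
existing PDO connection):

<?php
// 1. One-off: add the column.
$db->exec('ALTER TABLE widgets ADD COLUMN SortOrder INT UNSIGNED NOT NULL DEFAULT 0');

// 2 & 4. Run this from cron every hour/day/week to (re)fill it.
$db->exec('UPDATE widgets SET SortOrder = FLOOR(RAND() * 4294967295)');

// 3. Listing query: random order within each paid level.
$stmt = $db->query('SELECT * FROM widgets ORDER BY TypeOfListing, SortOrder LIMIT 50 OFFSET 0');
$widgets = $stmt->fetchAll(PDO::FETCH_ASSOC);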

Advantages: practically no hassle/overhead/maintenance for you; provides
the same ordering sequence for all users at the same time; only breaks
"third one down"-type references when you refresh the SortOrder column,
rather than on each session or page view; reasonably proxy- and
cache-friendly, especially if you send a meaningful HTTP Expires header.
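
For the Expires part, something along these lines might do it (purely
illustrative, assuming the SortOrder refresh runs daily at midnight):

<?php
// Tell browsers/proxies the page is valid until the next SortOrder refresh.
$nextRefresh = strtotime('tomorrow'); // i.e. the next daily refresh
header('Expires: ' . gmdate('D, d M Y H:i:s', $nextRefresh) . ' GMT');
header('Cache-Control: public, max-age=' . max(0, $nextRefresh - time()));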

Disadvantages: breaks user persistence if they visit before and after a
SortOrder refresh ("I'm sure the one I wanted was at the top of the list
yesterday..."); more effort to demonstrate randomness to the client ("OK,
see how you're in ninety-third place today? Well, check again tomorrow and
you should be somewhere else on the list").

Hopefully food for thought anyway.

Cheers
Jon

From: Nathan Rixham on
Colin Guthrie wrote:
> [...]
>
> Are there any other approaches I've missed? Is there some cunning
> cleverness that eludes me?
>
> Thanks for reading and any insights you may have!

If you use MySQL you can seed RAND() with a number to get the same
random results out each time (for that seed number):

SELECT * FROM table ORDER BY RAND(234)

Then just use LIMIT and OFFSET as normal.

Thus, assign each user / session a simple random int, and use it in the
query.
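
For example (untested; "things" and "level" are placeholder names and
$pdo is assumed to be an existing PDO connection):

<?php
// Give each session one stable seed, then reuse it on every page view
// so pagination stays consistent for that visitor.
session_start();
if (!isset($_SESSION['rand_seed'])) {
    $_SESSION['rand_seed'] = mt_rand(1, 1000000);
}
$seed   = (int) $_SESSION['rand_seed'];
$page   = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$offset = ($page - 1) * 50;

$stmt = $pdo->query("SELECT * FROM things ORDER BY level, RAND($seed) LIMIT 50 OFFSET $offset");
$things = $stmt->fetchAll(PDO::FETCH_ASSOC);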

On a semi-related note: if you need real random data, then you'll be
wanting random.org.

Best,

Nathan


From: Colin Guthrie on
Thanks everyone for responses.

'Twas brillig, and Nathan Rixham at 20/08/10 13:17 did gyre and gimble:
> if you use mysql you can seed rand() with a number to get the same
> random results out each time (for that seed number)
>
> SELECT * from table ORDER BY RAND(234)
>
> Then just use limit and offset as normal.

This is a neat trick! Yeah that will avoid the need for the static
lookup table with 32 randomised columns.

Jon's strategy is more or less a simplified version of my 32-column
randomising table (i.e. just one column of random data rather than 32).
I would personally prefer to refresh that data less often, as I don't
want to annoy people when the change-over happens.

The RAND(seed) approach will probably work well (I'm not sure of the
performance versus an indexed table, but I can easily experiment with this).

If I use the numbers 1..32 as my seed, then I still get the same net
result as a 32 column table. If I just change my "seed offset" then I
get the same result as re-generating my random data tables.
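
In other words, something like this hypothetical sketch (the IP hash is
just one way to pick a stable per-user bucket):

<?php
// 32 possible seeds stand in for the 32 pre-randomised columns; bumping
// $seedOffset "re-randomises" everyone at once, like regenerating the table.
$seedOffset = 1000; // change this weekly/monthly to reshuffle
$bucket     = ((crc32($_SERVER['REMOTE_ADDR']) & 0x7fffffff) % 32) + 1;
$seed       = $seedOffset + $bucket;
$sql        = "SELECT * FROM things ORDER BY level, RAND($seed) LIMIT 50 OFFSET 0";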

From an operational perspective, RAND(seed) is certainly easier.

I'll certainly look into this. Many thanks.

Col



From: Colin Guthrie on
'Twas brillig, and Andrew Ballard at 20/08/10 14:24 did gyre and gimble:
> Would it work to return a list of some limited number of randomly
> ordered "featured" listings/items on the page, while leaving the full
> list ordered by whatever natural ordering (by date, order entered,
> alphabetical, etc.)? That gives every owner a chance to appear in a
> prominent spot on the page while solving the issue you cited about
> page breaks (and SEO if that is a concern). You can still use any of
> the suggestions that have been discussed to determine how frequently
> the featured items list is reseeded to help make caching practical.

Yeah, we've tried to push this as an option too, but so far our clients
are not biting on this suggestion. They like the idea... but they want
it in addition to randomised listings, not instead of them!

Speaking of SEO, that was one of our concerns about randomising listings
too. What impact do you think such randomised listings will have on SEO?

Obviously if a search term matches a listing page that contains a
thing, but when the user visits that page the thing itself is no longer
in the listing, then the user will be disappointed. But will this
actually result in SEO penalties?

Col




