From: Captain Obvious on
TKP> I don't know much about these things, but I think that the best
TKP> solution would be a database of some kind. I am wondering what would
TKP> be the simplest and most hassle-free way to do this in CL (if that
TKP> matters, I am using SBCL).

Juho Snellman has described a rather hackish way of working with
large data sets here:

http://jsnell.iki.fi/blog/archive/2006-10-15-netflix-prize.html

As I understand it, with this kind of solution you can mmap your
vectors to files once, and the OS will do the rest for you -- it
will automatically load data from disk, write it back, and evict
memory that is not currently in use.
On a 32-bit machine you will only be able to mmap a portion of the
data at a time, so you'll need a wrapper of some sort.
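
A minimal sketch of that approach on SBCL, assuming the sb-posix and
sb-sys packages; the file is treated as a flat array of raw single
floats and read through the pointer returned by mmap. The names
MAP-FLOATS and MAPPED-REF are just illustrative, and error handling
is omitted:

(require :sb-posix)

;; Map an existing file of raw single floats read-only into memory
;; and return a system area pointer (SAP) to its start.
(defun map-floats (path length)
  (let* ((fd (sb-posix:open path sb-posix:o-rdonly))
         (sap (sb-posix:mmap nil (* 4 length)
                             sb-posix:prot-read
                             sb-posix:map-shared
                             fd 0)))
    (sb-posix:close fd)   ; the mapping survives the close
    sap))

;; Read the Nth single float out of the mapping.
(defun mapped-ref (sap n)
  (sb-sys:sap-ref-single sap (* 4 n)))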


From: Mario S. Mommer on

Hi Tamas,

Tamas K Papp <tkpapp(a)gmail.com> writes:
> I have a good 64 bit machine with tons of ram, but in a momentary
> lapse of reason, I installed 32 bit ubuntu on it in the past. Maybe a
> reinstall would be less hassle than a DB.

I'm quite sure that this is so. You'll probably upgrade eventually
anyway.

> I notice that you are using SBCL. I posted the message below to the
> SBCL list, but got no reply so far. I wonder if you could help me:
[...]
> - how big is ARRAY-TOTAL-SIZE-LIMIT on 64-bit SBCL? Will this allow
> me to use larger arrays? Is there another limit (provided that I
> take enough memory with a --dynamic-space-size)?

; SLIME 2009-05-19
CL-USER> ARRAY-TOTAL-SIZE-LIMIT
1152921504606846973
CL-USER> (log * 2)
60.0
CL-USER>

No idea if there are other limits. I've not bumped into any.

> - Does 64-bit result in significantly higher memory consumption? I
> understand that fixnums will now take twice the space, but does
> anything else take up more memory?

Conses are wider too.

> - Does 64 vs 32 bit have any impact on speed (positively or
> negatively)? Can single floats be unboxed in 64-bit?"

No idea about single floats; I'd be very surprised if they were not
unboxable, as the SBCL developers really pay a lot of attention to
performance (and my thanks go to them for that!). I have no accurate
information on the speed issue either, but again I would be very
surprised if the 64-bit version were slower.

Now to the FASL thing. It is a hack, but it works. See below. The fasls
are not portable, so one has to migrate them from one version to the
next, or from one implementation to the next. But they load fast, so
there are cases where this is a good solution.

Mario

(defpackage #:faslstore
  (:export #:bindump #:binload)
  (:nicknames #:fs)
  (:use :cl))

(in-package #:faslstore)

(defparameter *hook* nil)

(defun gentempname ()
  (format nil "~Afaslize.lisp" (get-universal-time)))

(defun bindump (data fname)
  (let ((tmp (gentempname)))
    (setq *hook* data)
    (with-open-file (str tmp
                         :direction :output
                         :if-exists :supersede)
      ;; #. splices the data into the generated source at read time,
      ;; so the compiled fasl contains it as a literal constant.
      (format str "(in-package #:faslstore)~%~
                   (let ((c #.*hook*))~%~
                     (defun returner nil~%~
                       (prog1 c (setf c nil))))"))
    (compile-file tmp :output-file fname
                  :verbose nil :print nil)
    (delete-file tmp)))

;; Placeholder; loading the fasl redefines RETURNER to hand back the data.
(defun returner () nil)

(defun binload (fname)
  (load fname)
  (returner))
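
Hypothetical usage, assuming the data fits in memory while dumping:

;; Dump an array to a fasl once, reload it quickly later.
(fs:bindump (make-array 1000000 :element-type 'single-float
                                :initial-element 0.0)
            "draws.fasl")
(defvar *draws* (fs:binload "draws.fasl"))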
From: Alberto Riva on
Tamas K Papp wrote:
> Hi,
>
> I am doing Markov-Chain Monte Carlo in CL. Specifically, I draw a
> vector (of about 10^5 elements) from a distribution. I need about
> 10^4 draws. This makes a huge table --- I am not sure I would like to
> fit that in memory, even if I could (single float would take 4e9
> bytes, but 1e9 is not a fixnum any more, so plain vanilla Lisp arrays
> would not work on my 32-bit platform).
>
> I don't know much about these things, but I think that the best
> solution would be a database of some kind. I am wondering what would
> be the simplest and most hassle-free way to do this in CL (if that
> matters, I am using SBCL).
>
> If I think of this table as a matrix, I will save data along one
> dimension (eg rows, each draw), but I will retrieve data along the
> other (eg columns, multiple draws for each variable).

You could simply write all your numbers out to a file (using an
appropriate encoding), and since the number of bytes per number and the
number of columns are constant, you can calculate the offset in the file
based on row and column number, and then use FILE-POSITION to jump to
that location directly.
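
A sketch of that idea, SBCL-specific: the floats are stored as their raw
32-bit patterns via the internal SB-KERNEL:SINGLE-FLOAT-BITS and
SB-KERNEL:MAKE-SINGLE-FLOAT helpers (portable code would use something
like the ieee-floats library instead). The names *NCOLS*, WRITE-DRAW and
READ-CELL are just illustrative:

(defparameter *ncols* 100000)   ; elements per draw (one row)

;; The stream should be opened with a 32-bit element type, e.g.
;; (open "draws.bin" :direction :io :element-type '(signed-byte 32)
;;       :if-exists :overwrite :if-does-not-exist :create)
;; so that FILE-POSITION counts whole floats.

(defun write-draw (stream row vector)
  "Write VECTOR (one draw) as row ROW of the file behind STREAM."
  (file-position stream (* row *ncols*))
  (loop for x across vector
        do (write-byte (sb-kernel:single-float-bits x) stream)))

(defun read-cell (stream row col)
  "Read back the single float stored at (ROW, COL)."
  (file-position stream (+ (* row *ncols*) col))
  (sb-kernel:make-single-float (read-byte stream)))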

Alberto
From: Waldek Hebisch on
Tamas K Papp <tkpapp(a)gmail.com> wrote:
>
> I have a good 64 bit machine with tons of ram, but in a momentary
> lapse of reason, I installed 32 bit ubuntu on it in the past. Maybe a
> reinstall would be less hassle than a DB.
>
> I notice that you are using SBCL. I posted the message below to the
> SBCL list, but got no reply so far. I wonder if you could help me:
>
> "Currently, I am using SBCL on 32-bit Ubuntu (x86). I ran into a
> specific limitation (fixnum limits my array size), so am wondering
> whether to switch to 64-bit SBCL. This would require a reinstall,
> which is not a major issue but a minor PITA which would surely take a
> few hours. Before I undertake this, I have a few questions:
>
> - how big is ARRAY-TOTAL-SIZE-LIMIT on 64-bit SBCL?

1.0.16 reports 1152921504606846975

> Will this allow
> me to use larger arrays?

(defvar a)
(progn (setf a (make-array (list (expt 10 4) (expt 10 5))
                           :element-type 'single-float))
       nil)

works OK.

> Is there another limit (provided that I
> take enough memory with a --dynamic-space-size)?
>

Yes, for me the most significant limit is the number of virtual
mappings (see the thread started by Martin Rubey). This is a combined
limit of SBCL and Linux. Basically, in the default configuration
you should always be able to use 256 MB. Typically you can
use much more, but I have seen SBCL running out of virtual
mappings already at 640 MB. Basically, if you need a lot of
small mutable objects of varying lifetimes, then expect trouble.
OTOH a few huge arrays should be OK. (I solved my problem by
switching from millions of small vectors to thousands of bigger
ones.)

> - Does 64-bit result in significantly higher memory consumption? I
> understand that fixnums will now take twice the space, but does
> anything else take up more memory?

Pointers, and consequently "general" Lisp data like conses,
closures, general arrays, and structures, take twice the space.

Code should be of similar size (you may even see smaller code
due to the higher number of available registers and less need
for reloads). Specialized arrays have a bigger header, but
otherwise take the same space. In particular, long strings
should take similar space (for short ones the header is more
significant). General single floats should also take less
space in the 64-bit version: in the 32-bit version you have a pointer
(32 bits) plus boxed data with a header, giving 96 bits. The 64-bit
version uses a direct (immediate) representation, taking 64 bits.

>
> - Does 64 vs 32 bit have any impact on speed (positively or
> negatively)? Can single floats be unboxed in 64-bit?"

For me 64 bits has a large positive impact, the biggest factor
being that I have a lot of integers that are bignums on 32
bit but fixnums on 64 bit. Getting native code for
64-bit integers also helps. AFAIK on 64 bit single floats
do not require memory allocation.

Old measurements in C indicated that 64 bit is about 10%
faster than 32 bit. Lisp and Java are frequently memory
bound, so bigger data may mean less speed. Also, newer
machines have better SSE, so performance-critical
parts now use SSE, which works the same in both the 64-bit
and the 32-bit version. Finally, Intel processors have
some improvements that are only active in 32-bit
mode (actually maybe only one thing: instruction
fusion) -- I do not know how this affects performance.

--
Waldek Hebisch
hebisch(a)math.uni.wroc.pl
From: Thomas A. Russ on
Tamas K Papp <tkpapp(a)gmail.com> writes:

> Hi,
>
> I am doing Markov-Chain Monte Carlo in CL. Specifically, I draw a
> vector (of about 10^5 elements) from a distribution. I need about
> 10^4 draws. This makes a huge table --- I am not sure I would like to
> fit that in memory, even if I could (single float would take 4e9
> bytes, but 1e9 is not a fixnum any more, so plain vanilla Lisp arrays
> would not work on my 32-bit platform).
....
> If I think of this table as a matrix, I will save data along one
> dimension (eg rows, each draw), but I will retrieve data along the
> other (eg columns, multiple draws for each variable). The second step
> will be done more often, so I want that to be fast. Does it matter
> for speed which dimension I consider rows or columns?

Yes, that will matter.

If you use arrays in Common Lisp, they are stored in row-major format.
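
A quick illustration of what row-major order means for a 2D array (the
last index varies fastest, so the elements of one row sit next to each
other in memory):

(let ((a (make-array '(2 3) :initial-contents '((1 2 3) (4 5 6)))))
  (list (aref a 1 2)                        ; => 6
        (row-major-aref a (+ (* 1 3) 2))))  ; => 6, the same element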

But for your application, I would seriously consider changing the order
in which you do things a bit. Assuming that the draw procedure is the
same for each item you draw, and that you have a good random number
source, it shouldn't matter whether you do all of the draws for one
trial in a single pass or instead do all of the draws for a single
column.

Since the columns (variables? features?) are accessed more frequently,
you would want them to be contiguous in memory. It would seem,
therefore, that you would perhaps want to make the column memory the
primary dimension and have the draws be secondary.

That would suggest storing each column's values in a separate vector of
length 10^4, and having a collection of these column vectors. You could
do this in memory by using a 64-bit Lisp system and just keeping a
collection of vectors. Assuming that you really process the columns
independently, you would only be working on one of these 10^4-element
vectors at a time. That should give you good locality of reference, and
allow for both cache and paging efficiency.
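
A sketch of that layout, assuming a 64-bit image started with enough
dynamic space; *COLUMNS*, STORE-DRAW and COLUMN are just illustrative
names:

(defparameter *n-variables* 100000)  ; 10^5 variables (columns)
(defparameter *n-draws*     10000)   ; 10^4 draws per variable

;; One single-float vector of length 10^4 per variable
;; (about 4 GB of data in total).
(defparameter *columns*
  (let ((cols (make-array *n-variables*)))
    (dotimes (i *n-variables* cols)
      (setf (aref cols i)
            (make-array *n-draws* :element-type 'single-float
                                  :initial-element 0.0)))))

;; Scatter one draw (a 10^5-element vector) into the columns.
(defun store-draw (draw-index draw)
  (dotimes (v *n-variables*)
    (setf (aref (aref *columns* v) draw-index) (aref draw v))))

;; All 10^4 draws for variable V, contiguous in memory.
(defun column (v)
  (aref *columns* v))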

If need be, you could store the information externally using a (binary)
file format or a database. But for the processing, you would want to
have a contiguous vector allocated for the entire data set.



--
Thomas A. Russ, USC/Information Sciences Institute