From: Joseph M. Newcomer on
See below...
On Mon, 5 Apr 2010 15:35:28 -0500, "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>message news:ro9kr5lk8kad3anflhhcj0iecrvosf381n(a)4ax.com...
>> See below...
>> On Sat, 3 Apr 2010 18:27:00 -0500, "Peter Olcott"
>> <NoSpam(a)OCR4Screen.com> wrote:
>>
>
>>>I like to fully understand the underlying infrastructure
>>>before I am fully confident of a design. For example, I
>>>now
>>>know the underlying details of exactly how SQLite can
>>>fully
>>>recover from a power loss. Pretty simple stuff really.
>> *****
>> Ask if it is a fully-transacted database and what recovery
>> techniques are implemented in
>> it. Talk to a MySQL expert. Look into what a rollback
>> of a transaction means. These
>> are specified for most databases (my experience in looking
>> at these predates MySQL, so I
>> don't know what it does; I haven't looked at this
>> technology since 1985 or 1986)
>>
>> That's all the understanding you need. Intellectual
>> curiosity may suggest that you
>> understand how they implement this, but such understanding
>> is not critical to the decision
>> process.
>
>No. I need a much deeper understanding to approximate an
>optimal mix of varied technologies. A transacted database
>only solves one aspect of one problem, it does not even
>solve every aspect of even this one problem.
****
No, it does not handle the case where the disk melts down, or the entire computer room
catches fire and every machine is destroyed either by heat or by water damage.

How high an exponent do you think you have to support in the 1 in 10**n probabilities?

The simple fallback is: (a) don't charge for work not delivered (b) in the case of any
failure, require the transaction be resubmitted [and see (a)].

If you need offsite storage for file backup, this may mean that in the case of a disaster,
you lose all the income from the last backup to the time of the disaster, and that tells
you how often you need to do offsite backups. If you lose $50, this may be acceptable; if
you lose $500, this probably isn't.
joe

>
>
>> ****
>>>
>>>>>
>>>>>I ALWAYS determine feasibility BEFORE proceeding with
>>>>>any
>>>>>further analysis.
>>>> ****
>>>> No, you have been tossing buzzwords around as if they
>>>> are
>>>> presenting feasible solutions,
>>>> without justifying why you think they actually solve the
>>>> problem!
>>>
>>>On-the-fly transaction by transaction offsite backups may
>>>still be a good idea, even if it does not fit any
>>>pre-existing notions of conventional wisdom.
>> ****
>> Actually, it does, and "remote mirrored transactions"
>> covers the concept. This is a very
>> old idea, and right now major New York investment firms I
>> know of are mirroring every
>> transaction on servers 50 miles away, just in case of
>> another 9/11 attack. And they were
>> doing this in the 1990s (the ones who weren't are now
>> doing it!). So the idea is very
>> old, and you are just discovering it. So why not
>> investigate what is available in
>> mirrored database support (it costs!)?
>> ****
>
>No need to buy this. It is easy enough to build from scratch.
>It may double the complexity of my system, but, then this
>system is really pretty simple anyway. Now that you gave me
>the right terminology "remote mirrored transactions", I
>could do a little search to see if there are any good
>pre-existing design patterns. I already have a relatively
>simple one in my head. The main piece that I had forgotten
>about is the design pattern that SQLite provides on exactly
>how to go about protecting against a power loss. I already
>knew this one, but forgot about it.
>
>>
>> >I start with
>>>the most often false premise that all conventional wisdom is
>>>pure hooey.
>> ****
>> So you invent new hooey in its place?
>> ****
>>>As this conventional wisdom proves itself item
>>>by item point by point, I accept the validity of this
>>>conventional wisdom only on those items and points where it
>>>has specifically proved itself.
>> ****
>> Let's see if I have this right:
>>
>> (a) assume everyone else is wrong
>> (b) propose bizarre designs based on superficial
>> understanding and buzzword gathering
>> (c) wait for someone to refute them
>
>Basically, only count on statements to the extent that they
>are completely understood. Initially this process is (as you
>have seen) quite chaotic. As comprehension grows and the
>boundaries are better understood, the process becomes much
>more reasonable. Eventually a near-optimal solution is
>converged upon.
>
>>
>> At which point, you forgot
>> (d) accuse the people who refute me of being in refute
>> mode and not listening to what I am
>> saying.
>> ****
>
>This process necessarily must reject credibility as a basis
>of truth. To the extent the supporting reasoning is not
>provided, statements must be tentatively rejected.
>
>>>This process makes those aspects
>>>of conventional wisdom that have room for improvement very
>>>explicit.
>> ****
>> I have no idea what "conventional wisdom" is here; to me,
>> the obvious situation is
>> solvable by a transacted database, and if you want to have
>> 100% recovery in the face of
>> incredibly unlikely events (e.g., power failure), you have
>> to use more and more complex
>> (and expensive) solutions to address these low-probability
>> events.
>
>Yet another false assumption; this is the main source of
>your mistakes. SQLite is absolutely free and its
>architecture inherently provides for a power loss fault
>recovery.
>
>>
>> Perhaps in your world, power failures matter; in my world,
>> they happen once a year, under
>> carefully controlled conditions that allow for graceful
>> shutdown (the once-a-decade
>> windstorm or once-a-century blizzard that drop me back to
>> battery backup power, at which
>> point I execute graceful shutdowns; nearby lightning hits
>> that take out the entire block,
>> or something else that is going to last for an hour or
>> more...the 1-second failures that
>> earned our power company the nickname "Duquesne Flicker &
>> Flash" are covered by my UPS
>> units)
>
>That sounds perfectly reasonable to me, and exactly what I
>would expect. If one can also protect against a power loss
>failure, and it only costs a tiny little bit of execution
>time, then why not do this?
>
>>
>> Why have you fastened on the incredibly-low-probability
>> event "power failure" and why have
>> you decided to treat it as the most common catastrophe?
>> ****
>
>It is one element on the list of possible faults. It might
>be helpful if you could provide a list in order of
>probability of the most frequently occurring faults. I am
>sure that you could do this much better than I could right
>now.
>
>>>
>>>>
>>>> I use a well-known and well-understood concept, "atomic
>>>> transaction", you see the word
>>>> "atomic" used in a completely different context, and
>>>> latch
>>>> onto the idea that the use you
>>>> saw corresponds to the use I had, which is simply not
>>>> true. An atomic file operation does
>>>
>>>I understood both well. My mind was not fresh on the
>>>atomicity of transaction until I thought about it again
>>>for
>>>a few minutes.
>> ****
>> It isn't because we haven't tried to explain it to you.
>> ****
>>>
>>>> NOT guarantee transactional integrity. File locks
>>>> provide
>>>> a level of atomicity with
>>>> respect to record updates, but they do not in and of
>>>> themselves guarantee transactional
>>>> integrity. The fundamental issue here is integrity of
>>>> the file image (which might be in
>>>
>>>They do provide one key aspect of exactly how SQLite
>>>provides transactional integrity.
>>>
>>>> the file system cache) and integrity of the file itself
>>>> (what you see after a crash, when
>>>> the file system cache may NOT have been flushed to
>>>> disk!)
>>>> ****
>>>
>>>There are simple ways to force this in Unix/Linux. I don't
>>>bother cluttering my head with their names; I will look
>>>them
>>>up again when the time comes.
>> ****
>> sync
>>
>> Which actually doesn't, if you read it closely and
>> understand what it does and does not
>> guarantee. I worked in Unix for 15 years, I know
>> something about the reliability of its
>> file system. And I went to talks by people (Satyanariana,
>> Ousterhout) who build reliable
>> file systems on top of Unix in spite of its fundamental
>> limitations.
>> ****
>>>There are even ways to flush
>>>the hard drives on-board buffer.
>> ****
>> And one vendor I talked to at a trade show assured me that
>> they had no way to flush the
>> onboard hard drive buffers, and when I asked "how do you
>> handle transacted file systems?"
>> he simply said "We just blame Microsoft." So I know that
>> there is at least one vendor for
>> which this is not supported. I presume you have talked
>> with the hard drive vendors'
>> technical support people before you made this statement
>> (given the evidence I have, I
>> would not trust such a statement until I had verified that
>> the hard drive model we were
>> using actually supported this capability, and the file
>> system used it, and that the OS had
>> the necessary SCSI/ATAPI command pass-thru to allow an
>> application to invoke it. But
>> then, since we had to get a patch to Win2K to make our
>> transacted file system work [the
>> problem was elevated to a "mission-critical bug" within
>> Microsoft, and the company I
>> worked for had enough clout to get the patch], maybe I
>> just have a lot more experience in
>> this area and am consequently a lot more distrustful of
>> silly statements which do not seem
>> to have a basis in reality)
>
>One might skip all of this and simply not count a
>transaction as completed until another process sees this
>transaction in the file. I would estimate from my somewhat
>limited knowledge that this might work.
>
>>>Since I can not count on something not screwing up it
>>>seems
>>>that at least the financial transactions must have
>>>off-site
>>>backup. I would prefer this to be on a transaction by
>>>transaction basis, rather than once a period-of-time.
>> ****
>> But the point is that it should have been OBVIOUS to you
>> that this could not work! Because
>> if you had done the design that I say you have to do, to
>> identify the state machine that
>> records transactions and identify each of the cut-points,
>> it would be obvious that
>> implementing another incredibly complex state machine
>> within this would lead only to MORE
>> COMPLEX recovery, not less complex! Assume your
>> transacted database is completely
>> reliable, look at its recovery/rollback protocols, and see
>> how well they meet your needs
>> at the cutpoints that involve the transacted database!
>> Compare the state diagram you get
>> with FTP to the state diagram you have without FTP! This
>> is pretty elementary design
>> stuff, which should be derivable from basic principles
>> (you don't need to have built a lot
>> of systems to understand this).
>
>
>>
>> You know DFAs. Simply express your transaction model as a
>> DFA, and at every state
>> transition, you add a new state, "failure". Every state
>> can transition to "failure".
>> Then, your recovery consists of examining the persistent
>> state up to that point, and
>> deriving a NEW set of states that essentially returns you
>> to a known point in the state
>> diagram, where you resume computations and attempt to
>> reach a final state. That's all
>> there is to it.
>> joe
>> ****
>
>Sounds good.
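
To make that concrete, here is a minimal sketch of such a state machine
(the states and recovery actions are purely illustrative, not your actual
design); the point is that every arc can fail, and recovery just maps the
last persisted state back to a known point in the diagram:

enum TxState { TX_RECEIVED, TX_COMPUTED, TX_DELIVERED, TX_CHARGED,
               TX_MIRRORED, TX_DONE };

// Persisted after every successful transition, so recovery can inspect it.
struct TxRecord {
    long    id;
    TxState state;
};

// Recovery: look at the last state that made it to disk and decide where
// to resume; each arc that can fail is one of the "cut points".
TxState Recover(const TxRecord &r)
{
    switch (r.state) {
    case TX_RECEIVED:  return TX_RECEIVED;   // redo the computation
    case TX_COMPUTED:  return TX_COMPUTED;   // redeliver the result
    case TX_DELIVERED: return TX_DELIVERED;  // charge never recorded: charge now
    case TX_CHARGED:   return TX_CHARGED;    // resend to the offsite mirror
    default:           return TX_DONE;       // nothing left to do
    }
}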
>
>>>
>>>> be a more effective recovery-from-cut-point or just more
>>>> complexity? You have failed to
>>>> take the correct approach to doing the design, so the
>>>> output of the design process is
>>>> going to be flawed.
>>>>
>>>
>>>Not at all. I have identified a functional requirement and
>>>provided a first-guess solution. The proposed solution is
>>>mutable, the requirement is not so mutable.
>> ****
>> The point is that a requirements document leads to a
>> specification document, and a
>> specification document does NOT specify the implementation
>> details. A specification
>> document would include an analysis of all the failure
>> cut-points and the necessary
>> recovery actions. It is then up to someone to derive an
>> implementation from that
>> specification. When you start saying "FTP" and "pwrite"
>> you are at the implementation
>> level!
>> ****
>
>This may generally be the preferred approach. This case is
>different. A big part of this whole process is me learning
>the boundaries of the set of categories of solutions. Quite
>often (in this case) the nature of the solution feeds back
>into the requirements thus changing the requirements.
>For example, if one can also protect against a power failure with
>little effort or expense, then let's do this too. This only
>requires using the SQLite design pattern, or tools that use
>this pattern.
>
>I am still considering using a file as the primary means of
>inter-process communication, specifically because a file is
>persistent. To do this well I must fully understand things
>such as the SQLite fault tolerant design pattern, and many
>other things.
>
>>>
>>>> Build a state machine of the transactions. At each
>>>> state
>>>> transition, assume that the
>>>> machine will fail *while executing that arc* of the
>>>> graph.
>>>> Then show how you can analyze
>>>> the resulting intermediate state to determine the
>>>> correct
>>>> recovery procedure. If you do
>>>> this, concepts like "FTP" become demonstrably
>>>> inappropriate, because FTP adds dozens of cut
>>>> points to the state transition diagram, making the
>>>> recovery that much more complex.
>>>> *****
>>>
>>>More like this:
>>>(1) I wait until the client gets their final result data.
>>>(2) Then deduct the dime from their account balance as a
>>>single atomic transaction.
>>>(3) Then I send a copy of this transaction to offsite
>>>backup.
>> ****
>> Not an unreasonable design. Key here is that you are
>> extending them credit on the
>> computation, and not charging them until it is delivered;
>> this is akin to having an
>> inventory you buy in anticipation of sales.
>> joe
>> ****
>
>No, I don't generally extend credit. They must pay in advance
>in at least one dollar increments. When the transaction
>begins I check their balance. Because of sequencing issues
>this simple design could result in a negative balance some
>of the time. I would rather err on the client's side and on
>the side of simplicity, at least initially. In the long run
>I would still err on the client's side, but, maybe have some
>added complexity.
>
>For example I could deduct the full amount at the beginning
>and then possibly have to roll back the transaction at many
>possible failure points.
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:OBKVtoR1KHA.224(a)TK2MSFTNGP06.phx.gbl...
> Peter Olcott wrote:
>
>>> Serializing these operations to a single thread does not
>>> guarantee transactional integrity.
>>> See my previous discussion of the use of a hypothetical
>>> BEGIN_TRANSACTION primitive.
>>>
>>> If you believe a single thread guarantees transactional
>>> integrity, you are wrong.
>>> joe
>>
>> It does guarantee that:
>> (1) Read balance
>> (2) Deduct 10 cents from balance
>> (3) Write updated balance
>> don't have any intervening updates between steps.
>
>
> He's right Joe. For a Single Thread Access only SQLITE
> implementation, he can guarantee no write contention
> issues. Perfect for any application that doesn't have more
> than one thread. <grin>
>
> But what he doesn't understand is that when step 3 is running,
> step 1 is locked for any other thread. But he doesn't
> need that, it's a FIFO-based accessor :)
>
>>> Actually, at the detail level, it IS hard. The
>>> assumption is that for some pair of
>>
>> I don't see what is hard about using this design pattern,
>> and I don't see how this could fail:
>
>
> It isn't hard at all. That is why I suggested SQLITE for
> your simple application idea with a single accessor
> concept.
>
>> http://sqlite.org/atomiccommit.html
>> You have a complete audit trail of every detail up to the
>> point of failure.
>
>
> HA! One thing I like about SQLITE is that they keep it
> real. They know their place in SQL and will make no excuse
> for it, no incorrect fantasy about what it CAN NOT do.
>
> Basically what it means is that WRITE/FLUSHING are all
> done at the same time because as I said above, the
> DATAFILE is LOCKED during Write/update operations, hence
> you get the idea of "SnapShot" journal for the integrity
> of the data file where there is no other contention. But
> it doesn't have record-level ideas to even CONSIDER sector
> and cluster operations. It's a WHOLE or NOT at all: 0% or
> 100%, ALL or nothing. It can only do that with a SINGLE
> WRITE ACCESSOR idea. See:
>
> http://www.sqlite.org/atomiccommit.html#atomicsector
>
> Step 3.7 is required - FLUSH. If your machine crashed
> before that - you lose integrity! See 3.4 on LOCKING:
>
> 3.4 Obtaining A Reserved Lock
>
>    A single reserve lock can coexist with multiple shared
>    locks from other processes. However, there can only be a
>    single reserved lock on the database file. Hence only a
>    single process can be attempting to write to the database
>    at one time.
>
> And one thing it doesn't mention is the OPPOSITE. If a
> process is doing a SELECT (read only access), the database
> is LOCKED for any write (INSERT, DELETE, UPDATE) access
> until the SELECT is complete.
>
> Which is good; simple, not hard, easy for 99% of the
> people to implement but I doubt you can, and will work
> very nicely for a single accessor FIFO application.
>
> Just in case you don't understand the limitations, read
> the Appropriate Usages page:
>
> http://sqlite.org/whentouse.html
>
> You don't need a real SQL server or RDBMS since you don't
> have any need for any one of the following:
>
> - Multi-access Client/Server Application
> - High-volume Website
> - Very large dataset
> - High Concurrency
>
> --
> HLS

I would envision using anything as heavyweight as
SQLite only for the financial aspect of the transaction. The
queue of HTTP requests would use a lighter-weight simple
file. I would use some sort of IPC to inform the OCR process that a
request is available, to eliminate the need for a polled
interface. The OCR process would retrieve its jobs from this
simple file.
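
For the financial part, a sketch of what I have in mind (assuming a
SQLite table accounts(cid, balance) with balances kept in cents; SQLite
wraps the single statement in its own transaction):

#include <sqlite3.h>

// Returns true only if exactly one row was debited, i.e. the customer
// actually had at least ten cents of balance.
bool DebitDime(sqlite3 *db, long cid)
{
    sqlite3_stmt *stmt = 0;
    const char *sql =
        "UPDATE accounts SET balance = balance - 10 "
        "WHERE cid = ? AND balance >= 10";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, 0) != SQLITE_OK)
        return false;
    sqlite3_bind_int64(stmt, 1, cid);
    bool ok = (sqlite3_step(stmt) == SQLITE_DONE) && (sqlite3_changes(db) == 1);
    sqlite3_finalize(stmt);
    return ok;
}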

According to the Unix/Linux docs, multiple threads can append
to this file without causing corruption. If this is not the
case, then a single writer thread could be fed through some sort
of FIFO (which in Unix/Linux is implemented as a named
pipe), with each of the web server threads writing to the
FIFO.
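
A sketch of the shape I am picturing on Linux (the file names are only
illustrative): each web server thread appends one fixed-size record with
O_APPEND, then writes a single byte to a named pipe so the OCR process
wakes up instead of polling.

#include <fcntl.h>
#include <unistd.h>

struct Request { long cid; char jobfile[256]; };  // one fixed-size record per request

bool QueueRequest(const Request &r)
{
    // O_APPEND positions each write() at end-of-file atomically, which is
    // the guarantee the Unix docs describe for multiple appenders.
    int fd = open("requests.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, &r, sizeof r) == (ssize_t)sizeof r;
    close(fd);

    // Poke the OCR process through a named pipe created with mkfifo(1);
    // if no reader has the pipe open yet, we simply skip the wake-up.
    int pfd = open("/tmp/ocr_wakeup", O_WRONLY | O_NONBLOCK);
    if (pfd >= 0) { char c = 1; (void)write(pfd, &c, 1); close(pfd); }
    return ok;
}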


From: Peter Olcott on

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eVZ1n$R1KHA.5212(a)TK2MSFTNGP05.phx.gbl...
> Hector Santos wrote:
>
>>>> If you believe a single thread guarantees transactional
>>>> integrity, you are wrong.
>>>> joe
>>>
>>> It does guarantee that:
>>> (1) Read balance
>>> (2) Deduct 10 cents from balance
>>> (3) Write updated balance
>>> don't have any intervening updates between steps.
>>
>> He's right Joe. For a Single Thread Access only SQLITE
>> implementation, he can guarantee no write contention
>> issues. Perfect for any application that doesn't have
>> more than one thread. <grin>
>>
>> But what he doesn't understand is that when step 3 is running,
>> step 1 is locked for any other thread. But he doesn't
>> need that, it's a FIFO-based accessor :)
>
>
> You know Joe, with his low volume, he really doesn't need
> any SQL engine at all!
>

I need the SQL engine to keep track of user accounts,
including free user accounts. Every user must have an
account. Free users will be restricted to Times New Roman 10
point, and their jobs will run only when no other jobs are
pending.

> He can easily just write a single text FILE per request
> and per USER account file system.

Yes, that is the sort of thing that I have envisioned. The
output text will be UTF-8.

>
> That will help with his target 100ms single transaction at
> a time FIFO design need and he will have almost 100% crash
> restart integrity!
>
> I would consider using an old school simple X12-like EDI
> format for its transaction codes and user data fields and
> he might be able to sell this idea for his B2B web service
> considerations with traditional companies familiar with and
> using EDI!
>
> And what's good about using POTF (plain old text files) is that he
> can leverage existing tools in all OSes:
>
> - He can edit the text files with NOTEPAD or vi.
> - He can delete accounts with DEL * or rm *
> - He can back it up using zip or cab or gz!
> - He can search using dir or ls!
>
> Completely PORTABLE! SIMPLE! CHEAP! FAST! FAULT TOLERANCE!
> NETWORK SHARABLE! ATOMIC FOR APPENDS! EXCLUSIVE, READ,
> WRITE FILE LOCKING!
>

Yes, that is the sort of system I have been envisioning.
I still need SQL to map the email address login ID
to a customer number.
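
If that mapping ends up in SQLite, it is one prepared statement (a sketch
only; customers(id, email) is an assumed schema):

#include <sqlite3.h>
#include <string>

// Returns the customer number for an email login, or -1 if not found.
long LookupCustomer(sqlite3 *db, const std::string &email)
{
    sqlite3_stmt *stmt = 0;
    const char *sql = "SELECT id FROM customers WHERE email = ?";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, 0) != SQLITE_OK)
        return -1;
    sqlite3_bind_text(stmt, 1, email.c_str(), -1, SQLITE_TRANSIENT);
    long id = (sqlite3_step(stmt) == SQLITE_ROW)
                ? (long)sqlite3_column_int64(stmt, 0) : -1;
    sqlite3_finalize(stmt);
    return id;
}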

I have been envisioning the primary means of IPC as a
single binary file with fixed-length records. I have also
envisioned how to easily split this binary file so that it
does not grow too large: for example, automatically split it
every day and archive the older portion.
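
A sketch of the fixed-length-record idea with a per-day file name (the
record fields are only illustrative); the daily split and archive then
falls out of the naming scheme:

#include <cstdio>
#include <ctime>

struct JobRecord {     // fixed length, so record N lives at offset N * sizeof(JobRecord)
    long cid;
    long jobid;
    char status;
    char pad[7];
};

// Appends one record to today's file, e.g. "jobs-20100406.dat".
bool AppendJob(const JobRecord &rec)
{
    char name[64];
    time_t now = time(0);
    struct tm t;
    localtime_r(&now, &t);
    strftime(name, sizeof name, "jobs-%Y%m%d.dat", &t);

    FILE *f = fopen(name, "ab");
    if (!f) return false;
    bool ok = fwrite(&rec, sizeof rec, 1, f) == 1;
    fclose(f);
    return ok;
}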

> <grin>
>
> --
> HLS


From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:e41lr59chafrfs27uakv7b8ob1iv9dqq2i(a)4ax.com...
> See below...
> On Mon, 5 Apr 2010 15:35:28 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>
>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>message news:ro9kr5lk8kad3anflhhcj0iecrvosf381n(a)4ax.com...
>>> See below...
>>> On Sat, 3 Apr 2010 18:27:00 -0500, "Peter Olcott"
>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>
>>
>>>>I like to fully understand the underlying infrastructure
>>>>before I am fully confident of a design. For example, I
>>>>now
>>>>know the underlying details of exactly how SQLite can
>>>>fully
>>>>recover from a power loss. Pretty simple stuff really.
>>> *****
>>> Ask if it is a fully-transacted database and what recovery
>>> techniques are implemented in
>>> it. Talk to a MySQL expert. Look into what a
>>> rollback
>>> of a transaction means. These
>>> are specified for most databases (my experience in
>>> looking
>>> at these predates MySQL, so I
>>> don't know what it does; I haven't looked at this
>>> technology since 1985 or 1986)
>>>
>>> That's all the understanding you need. Intellectual
>>> curiosity may suggest that you
>>> understand how they implement this, but such
>>> understanding
>>> is not critical to the decision
>>> process.
>>
>>No. I need a much deeper understanding to approximate an
>>optimal mix of varied technologies. A transacted database
>>only solves one aspect of one problem, it does not even
>>solve every aspect of even this one problem.
> ****
> No, it does not handle the case where the disk melts down,
> or the entire computer room
> catches fire and every machine is destroyed either by heat
> or by water damage.
>

Ah, but then you are ignoring the proposed aspect of my
design that would handle all those things. What did you call
it, "mirrored transactions"? I called it on-the-fly
transaction-by-transaction offsite backup.
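
A sketch of what I mean by transaction-by-transaction (the host, port, and
record layout are all made up for illustration): after the local commit, the
same record is pushed to a mirror machine, and the transaction is counted as
complete only when the mirror acknowledges it.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

struct LedgerEntry { long cid; long cents; long txid; };

// Returns true only after the mirror has acknowledged the record.
bool MirrorTransaction(const LedgerEntry &tx)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return false;

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                        // illustrative port
    inet_pton(AF_INET, "203.0.113.7", &addr.sin_addr);  // illustrative mirror address

    bool ok = false;
    if (connect(s, (sockaddr *)&addr, sizeof addr) == 0 &&
        write(s, &tx, sizeof tx) == (ssize_t)sizeof tx) {
        char ack = 0;
        ok = (read(s, &ack, 1) == 1 && ack == 1);       // wait for the mirror's ack
    }
    close(s);
    return ok;
}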

> How high an exponent do you think you have to support in
> the 1 in 10**n probabilities?
>
> The simple fallback is: (a) don't charge for work not
> delivered (b) in the case of any
> failure, require the transaction be resubmitted [and see
> (a)].

Yes, I like that idea.

>
> If you need offsite storage for file backup, this may mean
> that in the case of a disaster,
> you lose all the income from the last backup to the time
> of the disaster, and that tells
> you how often you need to do offsite backups. If you lose
> $50, this may be acceptable; if
> you lose $500, this probably isn't.
> joe

I don't want to ever lose any data pertaining to customers
adding money to their account. I don't want to have to rely
on the payment processor keeping track of this. Maybe there
are already mechanisms in place that can be completely
relied upon for this.



From: Hector Santos on
Peter Olcott wrote:

> I would envision using anything as heavyweight as
> SQLite only for the financial aspect of the transaction.

SQLITE is not "heavyweight," it's lightweight and only good for
single-accessor applications. Very popular for application
configurations or user records, but only THEY have access and no one
else.

You can handle multiple access, but at the expense of speed. The
SQLITE people make no bones about that. SQLITE works because the
target market doesn't have any sort of critical speed requirement and
can afford the latency in DATAFILE sharing.

SQLITE uses what is called a Reader/Writer Lock, a technique very common
in synchronizing a common resource among threads:

You can have many readers, but only one writer.
If readers are active, the writer must wait until there are no more readers;
if a writer is active, readers must wait until there are no more writers.

If you use OOP with a class-based reader/writer lock, it makes the
programming easier:

Get
{
CReader LOCK()
get record
}


Put
{
CWriter LOCK()
put record
}

The nice thing is that when you lose local scope, the destructor of
the reader/writer lock will release/decrement the lock references.
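
On the *nix side the same primitive exists in pthreads; a minimal RAII
sketch of the CReader/CWriter idea above (purely illustrative, not any
particular library):

#include <pthread.h>

pthread_rwlock_t g_lock = PTHREAD_RWLOCK_INITIALIZER;

// Tiny RAII helpers: the destructor releases the lock when local scope is lost.
struct CReader { CReader()  { pthread_rwlock_rdlock(&g_lock); }
                 ~CReader() { pthread_rwlock_unlock(&g_lock); } };
struct CWriter { CWriter()  { pthread_rwlock_wrlock(&g_lock); }
                 ~CWriter() { pthread_rwlock_unlock(&g_lock); } };

long GetBalance(const long *table, int idx)
{
    CReader lock;            // many readers may hold the lock at once
    return table[idx];
}

void PutBalance(long *table, int idx, long value)
{
    CWriter lock;            // exclusive: one writer, and only when no readers
    table[idx] = value;
}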

Now in Windows, thread synchronization is generally done using what are
called Kernel Objects. They are SEMAPHORES; a MUTEX is a special type
of semaphore.

For unix, I am very rusty here, but it MIGHT still use the old-school
method which was also used in DOS, what I called "File
Semaphores." In other words, a FILE is used to signify a LOCK.

So one process will create a temporary file:

process-id.LCK

and the other processes will wait on that file disappearing; only
the OWNER (creator of the lock) can release/delete it.
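
A sketch of that old-school file-semaphore trick on *nix (the lock file
name is illustrative): O_CREAT|O_EXCL fails if the file already exists,
which is what makes the create an atomic "take the lock".

#include <fcntl.h>
#include <unistd.h>

// Returns a file descriptor if we now own the lock, or -1 if another
// process already holds it.
int AcquireLock()
{
    return open("/tmp/ocr-queue.LCK", O_CREAT | O_EXCL | O_WRONLY, 0644);
}

// Only the owner (the process that created the file) should do this.
void ReleaseLock(int fd)
{
    close(fd);
    unlink("/tmp/ocr-queue.LCK");
}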

As I understood it, pthreads was an augmented technology and library
to allow unix based applications to begin using threads. I can't tell
you the details but as I always understood it they all - WINDOWS and
UNIX - are conceptually the same when it comes to common resource
sharing models. In other words, you look for the same type of things
in both.

> The queue of HTTP requests would use a lighter-weight simple
> file.

For you, you can use a single log file or individual *.REQ files which
might be better/easier using a File Notification event concept. Can't
tell you about *nix, but for Windows:

FindFirstChangeNotification()
ReadDirectoryChangesW()

The former might have an equivalent under *nix since it's the older idea. The
latter was introduced for NT 3.51 so it's available for all NT-based
OSes. It is usually used with IOCP designs for scalability and
performance.

In fact, one can use ReadDirectoryChangesW() along with Interlocked
Singly Linked Lists:

http://msdn.microsoft.com/en-us/library/ms684121(v=VS.85).aspx

to give you a highly optimized, high performance atomic FIFO concept.
However, there is a note I see for 64bit operations.
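
A sketch of the simpler of the two, FindFirstChangeNotification(), watching
a directory of *.REQ files (the directory name is illustrative):

#include <windows.h>

// Blocks until something changes in the request directory, then returns so
// the caller can scan for new *.REQ files. Returns FALSE on setup failure.
BOOL WaitForNewRequest()
{
    HANDLE h = FindFirstChangeNotification("C:\\ocr\\requests", FALSE,
                                           FILE_NOTIFY_CHANGE_FILE_NAME |
                                           FILE_NOTIFY_CHANGE_LAST_WRITE);
    if (h == INVALID_HANDLE_VALUE) return FALSE;

    BOOL ok = (WaitForSingleObject(h, INFINITE) == WAIT_OBJECT_0);
    FindCloseChangeNotification(h);
    return ok;
}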

> I would use some sort of IPC to inform the OCR process that a
> request is available, to eliminate the need for a polled
> interface. The OCR process would retrieve its jobs from this
> simple file.

See above.


> According to the Unix/Linux docs, multiple threads can append
> to this file without causing corruption.

So does Windows. However, there could be a dependency on the storage
device and file drivers.

In general, as long as you open for append, write, and close, don't
leave it open, and don't do any file stat reads or seeking on your
own, it works very nicely:

FILE *fv = fopen("request.log","at");   // "a" = append mode
if (fv) {
   fprintf(fv,"%s\n",whatever);
   fclose(fv);
}

However, if you really wanted a guarantee, then you can use a
critical section, a named kernel object (named so it can be shared
among processes), or use sharing mode open file functions with a READ
ONLY sharing attribute. Using CreateFile(), it would look like this:

BOOL AppendRequest(const TYourData &data)
{
HANDLE h = INVALID_HANDLE_VALUE;
DWORD maxTime = GetTickCount()+ 20*1000; // 20 seconds max wait
while (1)
{
h = CreateFile("request.log",
GENERIC_WRITE,
FILE_SHARE_READ,
NULL,
OPEN_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
if (h != INVALID_HANDLE_VALUE) break; // We got a good handle
      int err = GetLastError();
      if (err != 5 && err != 32) {  // 5 = ERROR_ACCESS_DENIED, 32 = ERROR_SHARING_VIOLATION
         return FALSE;              // anything else is fatal; stop retrying
      }
      if (GetTickCount() > maxTime) {
         SetLastError(err); // make sure error is preserved
         return FALSE;
      }
      _cprintf("- waiting: %d\n",maxTime-GetTickCount());
      Sleep(50);
}
SetFilePointer(h,0,NULL,FILE_END);

DWORD dw = 0;
if (!WriteFile(h,(void *)&data,sizeof(data),&dw,NULL)) {
// something unexpected happen
CloseHandle(h);
return FALSE;
}

CloseHandle(h);
return TRUE;
}


> If this is not the
> case, then a single writer thread could be fed through some sort
> of FIFO (which in Unix/Linux is implemented as a named
> pipe), with each of the web server threads writing to the
> FIFO.

If that is all *nix has to offer, historically, using named pipes can
be unreliable, especially under multiple threads.

But since you continue to mix up your engineering designs, you need
to get that straight: process vs. threads. That decision will determine
what to use.

Let's say you listen and ultimately design a multi-thread-ready EXE,
and you also want to allow multiple EXEs to run, either on the same
machine or on another machine, and want to keep this dumb FIFO design
for your OCR; then by definition you need a FILE-BASED sharing system.

While there are methods to do cross-machine MESSAGING, like named
pipes, they are still fundamentally based on a file concept behind the
scenes; they are just "special files".

You need to trust my 30 years of designing servers with HUGE IPC
requirements. You can write your OWN "messaging queue" with ideas
based on the above AppendRequest(), just change the file name to some
shared resource location:

\\SERVER_MACHINE\SharedFolder\request.log

and you got your Intra and Inter Process communications, Local,
Remote, Multi-threads, etc.!

Of course, you could use a shared SQL database with tables like the
above to do the same thing.

Your goal as a good "Software Engineer" is to outline the functional
requirements and also use BLACK BOX interfacing. You could just
outline this using an abstract OOP class:

class CRequestHandlerAbstract {
public:
virtual bool Append(const TYourData &yd) = 0;
virtual bool GetNext(TYourData &yd) = 0;
virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

struct TYourData {
..fields...
};
protected:
virtual bool OpenFile() = 0;
virtual bool CloseFile() = 0;
string sfn;
};

and that is all you basically need to know. The implementation of
this abstract class will be for the specific method and OS you will be
using. What doesn't change is your Web server and OCR. It will use
the abstract methods as the interface points.

> Yes, that is the sort of system I have been envisioning.
> I still need SQL to map the email address login ID
> to a customer number.


That will depend on how you wish to define your customer number. If it's
purely numeric and serial, i.e., starting at 1, then you can define an
auto-increment id field in your SQL database table schema, which the
SQL engine will increment for you when you first create the user
account with the INSERT command.

Example, a table "CUSTOMERS" in the database is create:

CREATE TABLE customers (
id int auto_increment,
Name text,
Email Text,
Password text
)

When you create the account, the insert will look like this:

INSERT INTO customers values
(NULL,'Peter','pete(a)abc.com','some_hash_value')

By using the NULL for the first ID field, SQL will automatically use
the next ID number.
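
For what it's worth, if the engine ends up being SQLite, the equivalent of
auto_increment is an INTEGER PRIMARY KEY column, and the id it assigned can
be read back right after the INSERT. A sketch, using the column names from
the CREATE TABLE above:

#include <sqlite3.h>

// Inserts a customer row and returns the id SQLite assigned to it,
// or -1 on failure.
long CreateCustomer(sqlite3 *db, const char *name, const char *email,
                    const char *pwhash)
{
    sqlite3_stmt *stmt = 0;
    const char *sql =
        "INSERT INTO customers (Name, Email, Password) VALUES (?,?,?)";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, 0) != SQLITE_OK)
        return -1;
    sqlite3_bind_text(stmt, 1, name,   -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, email,  -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 3, pwhash, -1, SQLITE_TRANSIENT);
    long id = (sqlite3_step(stmt) == SQLITE_DONE)
                ? (long)sqlite3_last_insert_rowid(db) : -1;
    sqlite3_finalize(stmt);
    return id;
}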

In general, a typical SQL table layout uses auto-increment ID fields
as the primary or secondary key for each table, which allows you to
avoid duplicating data. So you can have a SESSIONS table for currently
logged-in users:

CREATE TABLE sessions (
id int auto_increment, <<--- view it as your transaction session id
cid int,
StartTime DateTime,
EndTime DataTime,
..
..
)

where the link is Customers.id = Sessions.cid.

WARNING:

One thing to remember is that DBAs (Database Admins) value their work
and are highly paid. Do not argue or dispute with them as you
normally do; they most certainly will not have the patience shown here
to you. SQL setup is a HIGHLY complex subject, and it can be easy if you
keep it simple. Don't get LOST in optimization until the need
arises, but using common-sense table designs should be a no-brainer
upfront. Also, while there is a standard "SQL language," there
are differences between SQL engines; the above CREATE statements, for
example, are generally slightly different for different SQL engines. So I
advise you to use common SQL data types and avoid special definitions
unless you have made the final decision to stick with one vendor's SQL engine.

Yours is a standard design; all you will need at a minimum for tables are:

customers customer table
auto-increment primary key: cid

products customer products limits, etc, table
auto-increment primary key: pid
secondary key: cid

This would be a one to many table.

customers.cid <---o products.cid

select * from customers, products
where customers.cid = products.cid

You can use a JOIN here too which a DBA will
tell you to do, but the above is the BASIC
concept.

sessions sessions management table
can server as session history log as well

auto-increment primary key: sid
secondary key: cid

requests Your "FIFO"
auto-increment primary key: rid
secondary key: cid
secondary key: sid

Some DBAs might suggest combining tables, using or not using indices
or secondary keys, etc. There is no real answer, and it depends highly
on the SQL engine when it comes to optimization. So DON'T get lost
in it. You can ALWAYS create indices if need be.

> I have been envisioning the primary means of IPC as a
> single binary file with fixed-length records. I have also
> envisioned how to easily split this binary file so that it
> does not grow too large: for example, automatically split it
> every day and archive the older portion.


Well, to do that you have no choice but to implement your own file
sharing class as shown above. The concept is basically a Log Rotator.
You can now update the CRequestHandlerAbstract class with one more
method requirement:

class CRequestHandlerAbstract {
public:
virtual bool Append(const TYourData &yd) = 0;
virtual bool GetNext(TYourData &yd) = 0;
virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

virtual bool RotateLog() = 0; // << NEW REQUIREMENT

struct TYourData {
..fields...
};
protected:
virtual bool OpenFile() = 0;
virtual bool CloseFile() = 0;
string sfn;
};

But you can also achieve rotation if you use a special file-naming
nomenclature; this is called Log Periods. It could be based on
today's date.

"request-{yyyymmdd}.log"

That will guarantee a daily log, or you can do it for other periods:

"request-{yyyy-mm}.log" monthly
"request-{yyyy-ww}.log" week number
"request-{yyyy-mm}.log" monthly
"request-{yyyymmddhh}.log" hourly

and so on; you can also couple it with a size limit.

This can be handled by adding LogPeriod, FileNameFormat, and MaxSize
variables which OpenFile() can use:

class CRequestHandlerAbstract {
public:
virtual bool Append(const TYourData &yd) = 0;
virtual bool GetNext(TYourData &yd) = 0;
virtual bool SetFileName(const char *sz) { sfn = sz; return true; }

virtual bool RotateLog() = 0; // << NEW REQUIREMENT

struct TYourData {
..fields...
};
protected:
virtual bool OpenFile() = 0;
virtual bool CloseFile() = 0;
string sfn;

public:
int LogPeriod; // none, hourly, daily, weekly, monthly...
int MaxLogSize;
CString FileNameFormat;
};

and by using a template idea for the file name you can use string
replacements very easily.

SYSTEMTIME st;
GetSystemTime(&st);

CString logfn = FileNameFormat;
if (logfn.Find("yyyy") >= 0) logfn.Replace("yyyy",Int2Str(st.wYear));
if (logfn.Find("mm") >= 0)   logfn.Replace("mm",Int2Str(st.wMonth));
... etc ...

if (MaxLogSize > 0) {
   DWORD fs = GetFileSizeByName(logfn,NULL);   // helper: file size, or -1 if missing
   if (fs != (DWORD)-1 && fs >= (DWORD)MaxLogSize) {
      // Rename file with unique serial number
      // "request-yyyymm-1.log"
      // "request-yyyymm-2.log"
      // etc.
      // finding highest #.

      RenameFileWithASerialNumberAppended(logfn);
   }
}

etc.

--
HLS