Possible read/write conflict within awk. [Shell]

From: Kenny McCormack on 2 Apr 2010 08:14

In article <2056697.Hz00ifERbk(a)xkzjympik>, pk <pk(a)pk.invalid> wrote:
>Hongyi Zhao wrote:
>
>> 2- If I do the following things:
>>
>> $ echo aa > file1
>> $ echo bb > file2
>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2
>> bb
>>
>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2 | sort -u > file2
>> $ cat file2
>> bb
>>
>> This time, the operation will be finished successfully.
>>
>> Any hints on this issue?
>
>Luck.

Exactly. And this is the true idea of why the CLC guys get so uppity
about "UB" (undefined behavior). This is the sort of situation where
something that works most of the time (just because of luck), is assumed
to be working by design.

I've also seen posts where people put a 'sleep' command in there, in
order to get the delay needed (see below). Again, this is something
that works most of the time, but is never guaranteed to work. This is
not, of course, to say that it isn't a clever hack.

Something like: ... | (sleep 5;cat > oneofmyinputfiles)
The problem, of course, is that there's no way to be sure of what number
to put in (for the sleep duration).

--
(This discussion group is about C, ...)

Wrong. It is only OCCASIONALLY a discussion group
about C; mostly, like most "discussion" groups, it is
off-topic Rorschach revelations of the childhood
traumas of the participants...

From: pk on 2 Apr 2010 08:28

Kenny McCormack wrote:

> In article <2056697.Hz00ifERbk(a)xkzjympik>, pk <pk(a)pk.invalid> wrote:
>>Hongyi Zhao wrote:
>>
>>> 2- If I do the following things:
>>>
>>> $ echo aa > file1
>>> $ echo bb > file2
>>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2
>>> bb
>>>
>>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2 | sort -u > file2
>>> $ cat file2
>>> bb
>>>
>>> This time, the operation will be finished successfully.
>>>
>>> Any hints on this issue?
>>
>>Luck.
>
> Exactly. And this is the true idea of why the CLC guys get so uppity
> about "UB" (undefined behavior). This is the sort of situation where
> something that works most of the time (just because of luck), is assumed
> to be working by design.
>
> I've also seen posts where people put a 'sleep' command in there, in
> order to get the delay needed (see below). Again, this is something
> that works most of the time, but is never guaranteed to work. This is
> not, of course, to say that it isn't a clever hack.
>
> Something like: ... | (sleep 5;cat > oneofmyinputfiles)
> The problem, of course, is that there's no way to be sure of what number
> to put in (for the sleep duration).

The problem with using sleep and a pipe is that it can still go wrong, no
matter how many seconds you specify.

Let's assume you're trying to do something like

somecommand < file | ( sleep 10; cat > file )

Now, the pipe can only contain so much data, 64K bytes in many cases. Now if
"somecommand" isn't particularly smart, and "file" is bigger than 64K, what
may happen is that the pipe gets full (because sleep is still running), and
thus writes performed by "somecommand" block, which in turn block the whole
command and prevent it from reading further lines from "file".
The whole thing stays in that state until the sleep ends, at which point
anything can happen, depending on what kicks in first. I suppose you might
either end up with writing only a pipe's worth of data to the file, or
starting a self-feeding endless loop.

From: Mark Hobley on 2 Apr 2010 08:52

Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
> Any hints on this issue?

You cannot use a round bobbin in a Unix shell. If the input file goes above a
certain size (which is quite small) it will become truncated, before it is
read.

http://markhobley.yi.org/shell/solutions/bobbin.html

I am told that there is a tool called "buffer" which is part of the brlcad
suite, which can be added as a bobbin between the input and the output,
allowing this to be done.

Someone offered to repackage this as a separate tool once, but I never got
round to following that through. It would be useful for this to be split off
from the main suite though. I think I raised a request with the upstream
package maintainers to split this off, but they would not do it. (It might be
worth trying again though. There seem to be problems getting brlcad into
mainstream distros, and I think the maintainers may have changed since I made
the original request. Maybe the new ones are more cooperative. If not, you can
always take the source code, and split it yourself :)

I am always interested in seeing bundles becoming split.
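
For small inputs, the "bobbin" idea above can be approximated in plain shell without any extra tool. The following is only a minimal sketch, not the brlcad "buffer" program: it assumes the whole output fits comfortably in a shell variable, and the working directory and file contents are made up for illustration. Command substitution returns only after the command inside it has exited, so the input file has been read completely before the redirection on the next line truncates it.

```shell
#!/bin/sh
# Sketch: use a shell variable as the buffer between reader and writer.
# Assumptions: output is small, contains no NUL bytes, and is not empty.
set -eu

workdir=$(mktemp -d)      # scratch directory, illustrative only
cd "$workdir"
printf 'bb\naa\nbb\n' > file2

# sort reads all of file2 and exits before the assignment completes.
buffered=$(sort -u < file2)

# Only now is file2 opened for writing (and truncated).
printf '%s\n' "$buffered" > file2

cat file2
```

Caveats: command substitution strips trailing newlines (the printf puts one back), and an empty result would be written as a single blank line, so a temporary file is still the more robust choice.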

Mark.

--
Mark Hobley
Linux User: #370818 http://markhobley.yi.org/

From: Ed Morton on 2 Apr 2010 09:33

On 4/1/2010 10:23 PM, Hongyi Zhao wrote:
> Hi all,
>
> I use the following code to obtain the lines existing in file2 but not in
> file1, and then store the results into file2 as follows:
>
> awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2> file2
>
> I've a question about the above operation: will file2 be exposed to a
> read/write conflict in this case? In detail, the file we redirect the
> result into is also one of the input files that awk reads.
>
> Any hints on this issue?

You got your answer, so hopefully it's clear now that you don't ever want to
direct your output to the same file you're reading. You can do this instead:

cmd file > tmp && mv tmp file

wrt your script, though, it'd more commonly be written using "next" than
comparing line numbers twice, e.g.:

awk 'NR==FNR{a[$0]++;next} !a[$0]' file1 file2 > tmp && mv tmp file2
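
A slightly more defensive variant of the same pattern is sketched below. It assumes the `mktemp` utility is available (it is common but not specified by POSIX), and the scratch directory and sample file contents are invented for the example. Creating the temporary file next to the target keeps the final `mv` on one filesystem, and the `&&` means file2 is replaced only if awk succeeded.

```shell
#!/bin/sh
# Sketch: replace file2 with its filtered contents via a temporary file.
set -eu

workdir=$(mktemp -d)      # scratch directory, illustrative only
cd "$workdir"
printf 'aa\n' > file1
printf 'aa\nbb\n' > file2

# Temporary file in the same directory as the target.
tmp=$(mktemp ./file2.XXXXXX)

# Keep the lines of file2 that do not occur in file1;
# overwrite file2 only when awk exits successfully.
awk 'NR==FNR{a[$0]++;next} !a[$0]' file1 file2 > "$tmp" && mv "$tmp" file2

cat file2
```

If awk fails, the temporary file is left behind and file2 is untouched; a real script would probably remove the leftover in a trap.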

Regards,

Ed.

From: Eric on 2 Apr 2010 09:26

On 2010-04-02, pk <pk(a)pk.invalid> wrote:
> Kenny McCormack wrote:
>
>> In article <2056697.Hz00ifERbk(a)xkzjympik>, pk <pk(a)pk.invalid> wrote:
>>>Hongyi Zhao wrote:
>>>
>>>> 2- If I do the following things:
>>>>
>>>> $ echo aa > file1
>>>> $ echo bb > file2
>>>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2
>>>> bb
>>>>
>>>> $ awk 'NR==FNR{a[$0]++} NR>FNR&&!a[$0]' file1 file2 | sort -u > file2
>>>> $ cat file2
>>>> bb
>>>>
>>>> This time, the operation will be finished successfully.
>>>>
>>>> Any hints on this issue?
>>>
>>>Luck.
>>
>> Exactly. And this is the true idea of why the CLC guys get so uppity
>> about "UB" (undefined behavior). This is the sort of situation where
>> something that works most of the time (just because of luck), is assumed
>> to be working by design.
>>
>> I've also seen posts where people put a 'sleep' command in there, in
>> order to get the delay needed (see below). Again, this is something
>> that works most of the time, but is never guaranteed to work. This is
>> not, of course, to say that it isn't a clever hack.
>>
>> Something like: ... | (sleep 5;cat > oneofmyinputfiles)
>> The problem, of course, is that there's no way to be sure of what number
>> to put in (for the sleep duration).
>
> The problem with using sleep and a pipe is that it can still go wrong, no
> matter how many seconds you specify.
>
> Let's assume you're trying to do something like
>
> somecommand < file | ( sleep 10; cat > file )
>
> Now, the pipe can only contain so much data, 64K bytes in many cases. Now if
> "somecommand" isn't particularly smart, and "file" is bigger than 64K, what
> may happen is that the pipe gets full (because sleep is still running), and
> thus writes performed by "somecommand" block, which in turn block the whole
> command and prevent it from reading further lines from "file".
> The whole thing stays in that state until the sleep ends, at which point
> anything can happen, depending on what kicks in first. I suppose you might
> either end up with writing only a pipe's worth of data to the file, or
> starting a self-feeding endless loop.

The basic answer is "Don't do that". Write to file3 instead of file2,
then

mv file3 file2

as the next (separate) command.

The problem is that the final outcome depends on the timing of starting
new processes, opening files, and opening files for re-direction. The
order in which the various steps are done depends on which shell you
are using, and the timing depends on how the OS kernel handles context
switching as well as how long each step takes. The size of the file will
have an impact, as will the way the OS treats a "truncate-and-write"
open for a file that another process has open for reading (Unixes don't
care, in general).
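
For contrast, the plain redirection from the original question (no pipe involved) does not depend on timing: the shell performs the redirection, truncating the file, before it starts awk, so awk finds file2 already empty. A small sketch that shows this, using a scratch directory and sample contents made up for the example:

```shell
#!/bin/sh
# Sketch: "cmd file > file" without a pipe empties the file before cmd runs.
set -eu

workdir=$(mktemp -d)      # scratch directory, illustrative only
cd "$workdir"
printf 'aa\n' > file1
printf 'bb\n' > file2

# The shell opens file2 for writing (truncating it) first, then runs awk.
# awk therefore reads "aa" from file1 and nothing at all from file2.
awk 'NR==FNR{a[$0]++;next} !a[$0]' file1 file2 > file2

# Report the size of file2 in bytes.
wc -c < file2
```

The "bb" line is lost every time here. It is only once a pipe puts the reader and the writer into separate processes that the result becomes a race.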

The last sample

somecommand < file | ( sleep 10; cat > file )

will be different from

somecommand file | ( sleep 10; cat > file )

and will also depend on whether the shell starts a subprocess for the
bracketed commands (which then starts processes for sleep and cat) or just
sets up an internal context.

There are too many variations: even if you _know_ how your shell and OS
behave, there will still be timing variations. So, once again:

**Don't do that!**

Eric
