Diff of Text [Python]

Prev: questions about how to parse a string and put it in a dictionary
Next: great

From: Lie Ryan on 5 Jun 2010 03:53

On 06/05/10 15:43, GZ wrote:
> On Jun 4, 8:37 pm, Lie Ryan <lie.1...(a)gmail.com> wrote:
>> On06/05/10 07:51, GZ wrote:
>>> No, rsync does not solve my problem.
>>
>>> I want a library that does unix 'diff' like function, i.e. compare two
>>> strings line by line and output the difference. Python's difflib does
>>> not work perfectly for me, because the resulting differences are
>>> pretty big. I would like an algorithm that generates the smallest
>>> differences.
>>
>> is n=0 not short enough?
>>
>> pprint.pprint(list(difflib.context_diff(s, t, n=0)))
>
> This still does not do what I want it to do. It only displays the diff
> results in a different format. I want a different algorithm to
> generate a smaller diff -- in other words less differences

No, I meant I was confirming that you already have turned off context
lines (i.e. the n=0 part), right?

Also, what's the nature of the changes? You might be able to minimize
difflib's output by using word-based or character-based diff-ing instead
of traditional line-based diff-ing.

diff output is fairly compressable, so you might want to look at zipping
the output.

From: Steven D'Aprano on 5 Jun 2010 04:04

On Fri, 04 Jun 2010 22:43:48 -0700, GZ wrote:

> This still does not do what I want it to do. It only displays the diff
> results in a different format. I want a different algorithm to generate
> a smaller diff -- in other words less differences

Can you give a *short* example, showing the output from Python's diff and
what you would prefer instead?

--
Steven

From: GZ on 5 Jun 2010 21:32

Hi Lie,

On Jun 5, 2:53 am, Lie Ryan <lie.1...(a)gmail.com> wrote:
> On 06/05/10 15:43, GZ wrote:
>
>
>
>
>
> > On Jun 4, 8:37 pm, Lie Ryan <lie.1...(a)gmail.com> wrote:
> >> On06/05/10 07:51, GZ wrote:
> >>> No, rsync does not solve my problem.
>
> >>> I want a library that does unix 'diff' like function, i.e. compare two
> >>> strings line by line and output the difference. Python's difflib does
> >>> not work perfectly for me, because the resulting differences are
> >>> pretty big. I would like an algorithm that generates the smallest
> >>> differences.
>
> >> is n=0 not short enough?
>
> >> pprint.pprint(list(difflib.context_diff(s, t, n=0)))
>
> > This still does not do what I want it to do. It only displays the diff
> > results in a different format. I want a different algorithm to
> > generate a smaller diff -- in other words less differences
>
> No, I meant I was confirming that you already have turned off context
> lines (i.e. the n=0 part), right?
>
> Also, what's the nature of the changes? You might be able to minimize
> difflib's output by using word-based or character-based diff-ing instead
> of traditional line-based diff-ing.
>
> diff output is fairly compressable, so you might want to look at zipping
> the output.- Hide quoted text -
>
> - Show quoted text -

Thanks for your response.

The verboseness of the format is not really my problem and I only care
about line by line comparison for now.

Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:

AAA
- BBB
+ CCC
+ DDD
- EEE

It means the first line does not change, the second line is replaced
by the third line, the forth line is new, and the fifth line is
deleted.

I define the "smallness" of the diff algorithm as "the sum of the
total number of minuses and pluses". In my above example, it is 4 (two
minuses and 2 pluses). Note that no matter what format we use to
represent the diff, this number is the same.

Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that look
right to people at the cost of increasing this number. (http://
docs.python.org/library/difflib.html).

What I am looking for is an algo that can really minimize this number.

From: Ben Finney on 5 Jun 2010 21:42

GZ <zyzhu2000(a)gmail.com> writes:

> Let me think of a better way to express what I mean by a "smaller
> diff." After I diff the two strings, I will have something like this:
>
> AAA
> - BBB
> + CCC
> + DDD
> - EEE
>
> It means the first line does not change, the second line is replaced
> by the third line, the forth line is new, and the fifth line is
> deleted.

Are you drawing a distinction between:

* “line FOO is replaced by line BAR”
* “line FOO is deleted, line BAR is added”

Your wording seems to make that distinction, but I don't see how it's
useful or meaningful in a discussion about diff. Are they not exactly
the same?

--
\ “Injustice is relatively easy to bear; what stings is justice.” |
`\ —Henry L. Mencken |
_o__) |
Ben Finney

From: Dave Angel on 5 Jun 2010 22:09

GZ wrote:
> <snip>
>>>>> I want a library that does unix 'diff' like function, i.e. compare two
>>>>> strings line by line and output the difference. Python's difflib does
>>>>> not work perfectly for me, because the resulting differences are
>>>>> pretty big. I would like an algorithm that generates the smallest
>>>>> differences.
>>>>>
>> <snip>
> Thanks for your response.
>
> The verboseness of the format is not really my problem and I only care
> about line by line comparison for now.
>
> Let me think of a better way to express what I mean by a "smaller
> diff." After I diff the two strings, I will have something like this:
>
> AAA
> - BBB
> + CCC
> + DDD
> - EEE
>
> It means the first line does not change, the second line is replaced
> by the third line, the forth line is new, and the fifth line is
> deleted.
>
> I define the "smallness" of the diff algorithm as "the sum of the
> total number of minuses and pluses". In my above example, it is 4 (two
> minuses and 2 pluses). Note that no matter what format we use to
> represent the diff, this number is the same.
>
> Python's difflib does not really minimize this number. It tries to
> make this number small, but also tries to yield matches that �look
> right� to people at the cost of increasing this number. (http://
> docs.python.org/library/difflib.html).
>
> What I am looking for is an algo that can really minimize this number.
>
>
The - lines aren't needed, any more than the context lines are needed,
so that will typically cut your results in half. But perhaps the real
algorithm you're describing is one that permits lines to be moved as
well as inserted and deleted. If you then consider that information to
be free (it's not part of the measure you're specifying), you may do
best with the following algorithm:

Scan the two files looking for lines that appear exactly once in each
file. Consider those lines the first invariants, and everything else a
potential change. Now, starting with each pair (which I call a bridge
between islands), look at the line previous and the line following, and
if either or both are also matched, expand the two islands to include
them. As an island grows to butt up against another island, merge them.

Now, each bridge is a "move" operation, and every line left over in
either file is either a plus or a minus, an insert or a delete. For most
editing transactions, there will be relatively few of these.

For example, if three functions were reordered, from A, B, C, to A, C,
B, there would be no plus or minus lines at all.

DaveA

First | Prev | Next | Last
Pages: 1 2 3
Prev: questions about how to parse a string and put it in a dictionary
Next: great