From: Steve Holden on
Tim Chase wrote:
> vsoler wrote:
>> Hence, I need to parse Excel formulas. Can I do it by means only of re
>> (regular expressions)?
>>
>> I know that for simple formulas such as "=3*A7+5" it is indeed
>> possible. What about complex for formulas that include functions,
>> sheet names and possibly other *.xls files?
>
> Where things start getting ugly is when you have nested function calls,
> such as
>
> =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
> (Min(C1:C25)+3)*18,Max(B1:B25)))
>
> Regular expressions don't do well with nested parens (especially
> arbitrarily-nesting-depth such as are possible), so I'd suggest going
> for a full-blown parsing solution like pyparsing.
>
> If you have fair control over what can be contained in the formulas and
> you know they won't contain nested parens/functions, you might be able
> to formulate some sort of "kinda, sorta, maybe parses some forms of
> formulas" regexp.
>
And don't forget about named ranges, which can reference cells without
using anything but a plain identifier ...

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
Holden Web LLC http://www.holdenweb.com/
UPCOMING EVENTS: http://holdenweb.eventbrite.com/

From: John Machin on
On Jan 6, 6:54 am, vsoler <vicente.so...(a)gmail.com> wrote:
> On 5 ene, 20:21, vsoler <vicente.so...(a)gmail.com> wrote:
>
>
>
> > On 5 ene, 20:05, Mensanator <mensana...(a)aol.com> wrote:
>
> > > On Jan 5, 12:35 pm, MRAB <pyt...(a)mrabarnett.plus.com> wrote:
>
> > > > vsoler wrote:
> > > > > Hello,
>
> > > > > I am acessing an Excel file by means of Win 32 COM technology.
> > > > > For a given cell, I am able to read its formula. I want to make a map
> > > > > of how cells reference one another, how different sheets reference one
> > > > > another, how workbooks reference one another, etc.
>
> > > > > Hence, I need to parse Excel formulas. Can I do it by means only of re
> > > > > (regular expressions)?
>
> > > > > I know that for simple formulas such as "=3*A7+5" it is indeed
> > > > > possible. What about complex for formulas that include functions,
> > > > > sheet names and possibly other *.xls files?
>
> > > > > For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
> > > > > "A5","+","8"]
>
> > > > > Can anybody help? Any suggestions?
>
> > > > Do you mean "how" or do you really mean "whether", ie, get a list of the
> > > > other cells that are referred to by a certain cell, for example,
> > > > "=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]
>
> > > Ok, although "Book1" would be the default name of a workbook, with
> > > default
> > > worksheets labeled "Sheet1". "Sheet2", etc.
>
> > > If I had a worksheet named "Sheety" that wanted to reference a cell on
> > > "Sheetx"
> > > OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
> > > a completely
> > > different workbook (say Book1 with worksheets labeled "Sheet1",
> > > "Sheet2") then
> > > the cell might have =[Book1]Sheet1!A7.
>
> > > And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.
>
> > Yes, Mensanator, but...  what re should I use? I'm looking for the re
> > statement. No doubt you can help!
>
> > Thank you.
>
> Let me give you an example:
>
> >>> import re
> >>> re.split("([^0-9])", "123+456*/")
>
> [’123’, ’+’, ’456’, ’*’, ’’, ’/’, ’’]
>
> I find it excellent that one single statement is able to do a lexical
> analysis of an expression!

That is NOT lexical analysis.
>
> If the expression contains variables, such as A12 or B9, I can try
> another re expression. Which one should I use?
>
> And if my expression contains parenthesis?   And the sin() function?

You need a proper lexical analysis, followed by a parser. What you
are trying to do can NOT be accomplished in any generality with a
single regex. The Excel formula syntax has several tricky bits. E.g.
IIRC whether TAX09 is a (macro) name or a cell reference depends on
what version of Excel you are targetting but if it appears like TAX09!
A1:B2 then it's a sheet name.

The xlwt package (of which I am the maintainer) has a lexer and parser
for a largish subset of the syntax ... see http://pypi.python.org/pypi/xlwt

From: Chris Withers on
John Machin wrote:
> The xlwt package (of which I am the maintainer) has a lexer and parser
> for a largish subset of the syntax ... see http://pypi.python.org/pypi/xlwt

xlrd, no?

Also worth pointing out that the topic of Python and Excel has its own
web site:

http://www.python-excel.org

....which links to a more specialist group.

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

From: John Machin on
On 12/01/2010 6:26 PM, Chris Withers wrote:
> John Machin wrote:
>> The xlwt package (of which I am the maintainer) has a lexer and parser
>> for a largish subset of the syntax ... see
>> http://pypi.python.org/pypi/xlwt
>
> xlrd, no?

A facility in xlrd to decompile Excel formula bytecode into a text
formula is currently *under discussion*.

The OP was planning to dig the formula text out using COM then parse the
formula text looking for cell references and appeared to have a rather
simplistic view of the ease of parsing Excel formula text -- that's why
I pointed him at those facilities (existing, released, proven in the
field) in xlwt.



From: Paul McGuire on
On Jan 5, 1:49 pm, Tim Chase <python.l...(a)tim.thechases.com> wrote:
> vsoler wrote:
> > Hence, I need toparseExcel formulas. Can I do it by means only of re
> > (regular expressions)?
>
> > I know that for simple formulas such as "=3*A7+5" it is indeed
> > possible. What about complex for formulas that include functions,
> > sheet names and possibly other *.xls files?
>
> Where things start getting ugly is when you have nested function
> calls, such as
>
>    =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
> (Min(C1:C25)+3)*18,Max(B1:B25)))
>
> Regular expressions don't do well with nested parens (especially
> arbitrarily-nesting-depth such as are possible), so I'd suggest
> going for a full-blown parsing solution like pyparsing.
>
> If you have fair control over what can be contained in the
> formulas and you know they won't contain nested parens/functions,
> you might be able to formulate some sort of "kinda, sorta, maybe
> parses some forms of formulas" regexp.
>
> -tkc

This might give the OP a running start:

from pyparsing import (CaselessKeyword, Suppress, Word, alphas,
alphanums, nums, Optional, Group, oneOf, Forward, Regex,
operatorPrecedence, opAssoc, dblQuotedString)

test1 = "=3*A7+5"
test2 = "=3*Sheet1!$A$7+5"
test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \
"if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))"

EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$')
sheetRef = Word(alphas, alphanums)
colRef = Optional(DOLLAR) + Word(alphas,max=2)
rowRef = Optional(DOLLAR) + Word(nums)
cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") +
rowRef("row"))

cellRange = (Group(cellRef("start") + COLON + cellRef("end"))
("range")
| cellRef )

expr = Forward()

COMPARISON_OP = oneOf("< = > >= <= != <>")
condExpr = expr + COMPARISON_OP + expr

ifFunc = (CaselessKeyword("if") +
LPAR +
Group(condExpr)("condition") +
COMMA + expr("if_true") +
COMMA + expr("if_false") + RPAR)
statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange +
RPAR
sumFunc = statFunc("sum")
minFunc = statFunc("min")
maxFunc = statFunc("max")
aveFunc = statFunc("ave")
funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc

multOp = oneOf("* /")
addOp = oneOf("+ -")
numericLiteral = Regex(r"\-?\d+(\.\d+)?")
operand = numericLiteral | funcCall | cellRange | cellRef
arithExpr = operatorPrecedence(operand,
[
(multOp, 2, opAssoc.LEFT),
(addOp, 2, opAssoc.LEFT),
])

textOperand = dblQuotedString | cellRef
textExpr = operatorPrecedence(textOperand,
[
('&', 2, opAssoc.LEFT),
])
expr << (arithExpr | textExpr)

import pprint
for test in (test1,test2, test3):
print test
pprint.pprint( (EQ + expr).parseString(test).asList() )
print


Prints:

=3*A7+5
[[['3', '*', ['A', '7']], '+', '5']]

=3*Sheet1!$A$7+5
[[['3', '*', ['Sheet1', 'A', '7']], '+', '5']]

=if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)
*18,Max(B1:B25)))
['if',
['sum', [['A', '1'], ['A', '25']], '>', '42'],
'min',
[['B', '1'], ['B', '25']],
'if',
['sum', [['C', '1'], ['C', '25']], '>', '3.14'],
[['min', [['C', '1'], ['C', '25']], '+', '3'], '*', '18'],
'max',
[['B', '1'], ['B', '25']]]


-- Paul
First  |  Prev  |  Next  |  Last
Pages: 1 2 3 4 5
Prev: embedded python on mac - linking problem
Next: TypeError