From: Oncaphillis on 19 Jun 2008 07:07

Hi,

I'd like to do regex matching on istreams. Browsing through the Boost.Regex docs tells me that the regex_(search|match|replace) family of functions needs at least a BidirectionalIterator to work on, so istream_iterator seems to be a no-go. All the examples I could find in the Boost docs read in the complete file and do the match afterwards. This is fine as long as you know that the file is of moderate size.

So my questions:

1. Is there an alternative lib around which works on ForwardIterators? My requirements are quite high: it should be a templated lib which supports char and wchar_t and submatching.

- or (preferable) -

2. Is there an elegant way to get Boost.Regex to work on istreams, besides the obvious one of reading in the whole file and working on a string?

The only thing I can come up with is a custom streambuf which stuffs all CHAR_T already read into a temporary file. This way one could implement real (g|s)etpos() and tell() methods, in contrast to the methods provided by fstream, which work on the underlying stream of chars (e.g. UTF-8). Based on this it should be possible to create a BidirectionalIterator. But maybe something like this has been done already.

Thank you very much indeed

O.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
From: Alex on 19 Jun 2008 16:58

On Jun 19, 6:07 pm, Oncaphillis <oncaphil...(a)snafu.de> wrote:
> 1. Is there a alternative lib around which works on

You may want to look into Poco::BufferedBidirectionalStreamBuf:

http://poco.svn.sourceforge.net/viewvc/poco/poco/trunk/Foundation/include/Poco/BufferedBidirectionalStreamBuf.h?view=markup

and Poco::RegularExpression:

http://poco.svn.sourceforge.net/viewvc/poco/poco/trunk/Foundation/include/Poco/RegularExpression.h?view=markup

There is no readily available class in POCO but it should be rather straightforward to implement it. See Poco::FileStream for an implementation example:

http://poco.svn.sourceforge.net/viewvc/poco/poco/trunk/Foundation/include/Poco/FileStream.h?view=markup

Alex
From: Martin T. on 20 Jun 2008 04:38

Oncaphillis wrote:
> Hi,
>
> I'd like to do regex matching on istreams. Browsing
> through the boost.regex docs tells me that the
> regex_(search|match|replace) family of functions
> at least needs a BidirectionalIterator concept to work
> on. So istream_iterator seems to be a no go.
> All examples I could find in the boost docs seem to
> read in the complete file and do the match afterwards.
> This is fine as long as you know that the file is of
> moderate size.
>
> So my question
>
> 1. Is there a alternative lib around which works on
> ForwardIterators ? My requirements are quite high.
> It should be a templated lib which supports char and
> wchar_t and submatching.
>

I cannot imagine that a regexp parser could work with only a ForwardIterator ...

> - or (preferable) -
>
> 2. Is there a elegant way to talk boost.regex to work on
> istreams. Beside the obvious of reading in the whole
> file and working on the a string.
>
> The only thing I can come up with is a custom streambuf which
> stuffs all CHAR_T already read into a temporary file. This
> way one could implement real (g|s)etpos(), tell() methods in
> contrast to the methods provided by the fstream which
> work on the underlying stream of chars (e.g. utf-8 etc).
>

I'm afraid I do not quite see what kind of problem a temporary file would solve?

> Based on this it should be possible to create a
> BidirectionalIterator.
>

fstream already provides even random access methods, so it should be possible to write a BidirectionalIterator for ifstream. (Even a RandomAccessIterator should be possible.)
Or so I thought :-)

The main problem I see with files and Bidir/Random Iterators is that they're always used in pairs and both the begin() and the end() may well be modified.

If you have an iterator pointing to a certain file position and you need a second iterator pointing to a second _independent_ file position then the only way to achieve this is to open the file _twice_ and have two different ifstream objects.

I think if you accept that, it should not be too hard to write a Bidir iterator for ifstream. (Iterator equality would then be defined by file position equality.)

br,
Martin
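A minimal sketch of the kind of iterator Martin describes, assuming a plain byte-oriented file opened in binary mode. The names file_byte_iterator and file_size are hypothetical and not taken from any library. Every iterator object owns its own ifstream (copying reopens the file), and equality is defined purely by file position. Dereferencing returns a char by value, so this is not a strictly conforming forward iterator, and seeking on every read is slow; it only illustrates the idea.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <iterator>
#include <string>

// Hypothetical helper: size of the file in bytes, used as the end position.
inline std::streamoff file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    assert(in.is_open());
    return static_cast<std::streamoff>(in.tellg());
}

// Hypothetical read-only iterator over the bytes of a file.
class file_byte_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type        = char;
    using difference_type   = std::streamoff;
    using pointer           = const char*;
    using reference         = char;  // by value, not a true reference

    file_byte_iterator() = default;

    file_byte_iterator(const std::string& path, std::streamoff pos)
        : path_(path), pos_(pos), in_(path, std::ios::binary) {}

    // Copying opens the file again so the copy has an independent stream.
    file_byte_iterator(const file_byte_iterator& other)
        : path_(other.path_), pos_(other.pos_),
          in_(other.path_, std::ios::binary) {}

    file_byte_iterator& operator=(const file_byte_iterator& other) {
        if (this != &other) {
            if (path_ != other.path_) {
                path_ = other.path_;
                in_.close();
                in_.open(path_, std::ios::binary);
            }
            pos_ = other.pos_;
        }
        return *this;
    }

    // Seek to the stored position and read one byte.
    char operator*() const {
        in_.clear();  // drop any eof/fail state left by earlier reads
        in_.seekg(pos_);
        return static_cast<char>(in_.get());
    }

    file_byte_iterator& operator++() { ++pos_; return *this; }
    file_byte_iterator operator++(int) { file_byte_iterator t(*this); ++pos_; return t; }
    file_byte_iterator& operator--() { --pos_; return *this; }
    file_byte_iterator operator--(int) { file_byte_iterator t(*this); --pos_; return t; }

    // Equality is file position equality; the streams are not compared.
    friend bool operator==(const file_byte_iterator& a, const file_byte_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const file_byte_iterator& a, const file_byte_iterator& b) {
        return !(a == b);
    }

private:
    std::string path_;
    std::streamoff pos_ = 0;
    mutable std::ifstream in_;  // mutable: reading moves the get pointer
};
```

A begin/end pair would be file_byte_iterator(path, 0) and file_byte_iterator(path, file_size(path)). This only works at the byte level; it does not address the variable-width encoding issue raised elsewhere in the thread.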
From: Oncaphillis on 20 Jun 2008 12:51

Martin T. wrote:
> I cannot imagine that a regexp parser could work with only a
> ForwardIterator ...

So maybe I should call it a crippled BackwardIterator. In theory, a nondeterministic finite state automaton only needs a ForwardIterator in order to call something a match: it consumes one symbol at a time, and if it enters one of its accepting states it's done. If it doesn't match, there has to be a way to re-enter parsing after the previous starting position, but it never has to go back past the previous starting position. (Maybe that doesn't hold true for back references like (a.*b)(\1), I'm not so sure. Are these provided by Boost.Regex?)

So my understanding is that this "crippled backward iterator" should be able to be repositioned at one previously remembered location or any place further downstream, but never anywhere further upstream. But I may be awfully wrong.

So I can think of an implementation which eats up symbols and stores them in a string. If it doesn't match, it starts again at a cleverly chosen position after the last starting position. From then on, symbols are taken from the string until it has to consume new symbols from the file again. As far as I can tell, it would only ever have to step forward within the file.

Of course you still run the risk of exceeding available memory when a potential match grows and grows, e.g. when you're trying to match the file "aa---(190 Gbyte)--bb" against (aa.*bb).

> I'm afraid I do not quite see what kind of problem a temporary file
> would solve?
> fstream already provides even random access methods, so it should be
> possible to write a BidirectionalIterator for ifstream. (Even a
> RandomAccessIterator should be possible.)
> Or so I thought :-)

The problem with seeking in an fstream is that it actually breaks the abstraction of a "stream of symbols".

If, for example, you read UTF-32 (wchar_t on my system) from a wfstream and the underlying file is UTF-8, you do not seek to a given symbol position but to a given byte position. There is no way of seeking to a given symbol unless you remember every symbol position on your way through the file, since a single symbol might be encoded by 1-4 bytes. My "temporary file" approach would save those (fixed size) symbols provided by the frontend of the fstream.

O.
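To make the byte-versus-symbol point concrete, here is a small sketch, assuming well-formed UTF-8 held in a std::string (no validation is done). The helper names are hypothetical. It records the byte offset at which each code point starts, which is exactly the bookkeeping needed to seek to "symbol n" in a UTF-8 file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True for the first byte of an encoded code point; UTF-8 continuation
// bytes always have the bit pattern 10xxxxxx.
inline bool is_utf8_lead_byte(unsigned char c) {
    return (c & 0xC0) != 0x80;
}

// Hypothetical helper: byte offset of the start of every code point, with
// the total byte length appended as a final one-past-the-end entry.
// Assumes `utf8` is well-formed.
inline std::vector<std::size_t> code_point_offsets(const std::string& utf8) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (is_utf8_lead_byte(static_cast<unsigned char>(utf8[i]))) {
            offsets.push_back(i);
        }
    }
    offsets.push_back(utf8.size());
    return offsets;
}

// Byte position to seek to in order to reach code point number `index`.
inline std::size_t byte_offset_of(const std::vector<std::size_t>& offsets,
                                  std::size_t index) {
    assert(index < offsets.size());
    return offsets[index];
}
```

For a text such as "a", "é", "€", "z" (1, 2, 3 and 1 bytes respectively), the fourth symbol starts at byte 6 rather than byte 3, and that offset cannot be computed without scanning everything before it. A file of fixed-width symbols avoids the table because the offset is just the index times the symbol size.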
From: Éric Malenfant on 20 Jun 2008 12:51

On Jun 19, 6:07 pm, Oncaphillis <oncaphil...(a)snafu.de> wrote:
>
> 2. Is there a elegant way to talk boost.regex to work on
> istreams. Beside the obvious of reading in the whole
> file and working on the a string.

Boost.Regex supports "partial matches". One of the things this allows is applying the regex to "chunks" of the file, without having to load the entire file's contents into memory. See the second example on this page for... an example:

http://www.boost.org/doc/libs/1_35_0/libs/regex/doc/html/boost_regex/partial_matches.html