@REGEX question

May 20, 2008
12,178
133
Syracuse, NY, USA
How am I to interpret the return value of @REGEX[] below? It doesn't seem to be the (documented) "number of matching groups".

Code:
e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
4
 

[Working with this Oniguruma stuff gives me a headache!]

The snippet below comes closer to what the documentation describes: it always returns a count of the matches.

Code:
    UChar    *mstart = (UChar*)szString,
            *mend = (UChar*)szString + 2 * lstrlen(szString);    // 2 bytes per WCHAR

    OnigRegion *region = onig_region_new();

    // here's the interesting stuff
    INT matches = 0, i;
    while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )
    {
        matches += 1;

        // find the match and move past it
        // first see if the match was a group
        for ( i=1; i < region->num_regs; i++ )
        {
            if ( region->beg[i] >= 0 ) // match was a group
            {
                mstart += region->end[i];
                break;    // keep looking (continue the while)
            }
        }

        if ( i == region->num_regs ) // match was not a group (region 0)
        {
            mstart += region->end[0];
        }
        // keep looking (continue the while)
    }
    onig_region_free(region, 1);    // release the region and its buffers
    Sprintf(psz, L"%d", matches);
Here are a few examples.

Code:
g:\projects\4utils\release> echo %@regex[o|g,doggiepoo]
5

g:\projects\4utils\release> echo %@regex[(oo)|(g),doggiepoo]
3

g:\projects\4utils\release> echo %@regex[(oo)|(gg),doggiepoo]
2

g:\projects\4utils\release> echo %@regex[(o)|(g),doggie]
3

g:\projects\4utils\release> echo %@regex[(o)|g,doggie]
3

g:\projects\4utils\release> echo %@regex[o|g,doggie]
3

g:\projects\4utils\release> echo %@regex[(s)|f,doggie]
0

g:\projects\4utils\release> echo %@regex[o|h,dog]
1

g:\projects\4utils\release> echo %@regex[(foo),foozzz]
1

g:\projects\4utils\release> echo %@regex[(foo),foozzzfoo]
2
 
You can shorten that by looping backwards so the region 0 match only gets counted if no group match was found.

Code:
    INT matches = 0;
    while ( onig_search(regex, mstart, mend, mstart, mend, region, 0) >= 0 )
    {
        matches += 1;

        // walk the regions from last to first; region 0 (the whole match)
        // is reached only when no group matched
        for ( INT i = region->num_regs-1; i >= 0; i-- )
        {
            if ( region->beg[i] >= 0 )
            {
                mstart += region->end[i];
                break;
            }
        }
    }
    Sprintf(psz, L"%d", matches);
 

rconn

Administrator
Staff member
May 14, 2008
12,557
167
> How am I to interpret the return value of @REGEX[] below? It doesn't
> seem to be the (documented) "number of matching groups".
>
>
> Code:
> ---------
> e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
> 4
> ---------

I tried that on several regular expression testers, and got results of 0, 1,
or 4, depending on the RE emulation desired.

So -- what are you trying to do, and what language syntax are you using?

Rex Conn
JP Software
 
On Sun, 11 Jul 2010 22:25:42 -0400, rconn wrote:

|---Quote---
|> How am I to interpret the return value of @REGEX[] below? It doesn't
|> seem to be the (documented) "number of matching groups".
|>
|>
|> Code:
|> ---------
|> e:\logs\mercury> echo %@REGEX["(efused)|(uthor)|(known)",refused]
|> 4
|> ---------
|---End Quote---
|I tried that on several regular expression testers, and got results of 0, 1,
|or 4, depending on the RE emulation desired.
|
|So -- what are you trying to do, and what language syntax are you using?

I use Perl syntax. Your return value doesn't seem to depend on how
many matches are found. Are you returning region.num_regs? That is
always the number of parenthesized groups in the regex plus 1, and
that is what the results below look like. You have to loop to get
all the matches.

Code:
v:\> echo %@regex[(a)|(b)|(c),cat]
4

v:\> echo %@regex[(a)|(b)|(c),ccaat]
4

v:\> echo %@regex[(a)|(b)|(c),cccaaat]
4

v:\> echo %@regex[(a)|(b)|(c)|(d),cccaaat]
5

v:\> echo %@regex[(a)|(b)|(c)|(d),ccaat]
5

v:\> echo %@regex[(a)|(b)|(c)|(d),cat]
5
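
The point about region.num_regs can be illustrated without Oniguruma. The sketch below is only an analogy: it uses POSIX regcomp from <regex.h> as a stand-in, where re_nsub plays the role of num_regs minus 1. The helper name slot_count is hypothetical. The value is fixed when the pattern is compiled, so no subject string is involved at all.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of match slots a compiled pattern reserves,
   i.e. one slot for the whole match plus one per parenthesized group.
   POSIX regcomp is used here as a stand-in for Oniguruma.
   Returns -1 if the pattern does not compile. */
static int slot_count(const char *pattern)
{
    assert(pattern != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* re_nsub is set by regcomp and depends only on the pattern text. */
    int slots = (int)re.re_nsub + 1;

    regfree(&re);
    return slots;
}
```

Under these assumptions, slot_count("(a)|(b)|(c)") gives 4 and slot_count("(a)|(b)|(c)|(d)") gives 5, the same numbers @REGEX printed above regardless of the string being searched.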
 
On Sun, 11 Jul 2010 22:25:42 -0400, rconn wrote:

|So -- what are you trying to do

I was just pointing out that, contrary to the help, @REGEX[] doesn't
return the number of matching groups. The code I posted (and the
complete version I emailed you) simply always returns the number of
matches. As far as counting matches is concerned, groups are not
significant; there are 3 matches here [a|b|c,cab] as well as here
[(a)|(b)|(c),cab] ... also here [(a|b|c),cab]. I'm not even sure
whether there's any point in using groups in a simple "find_a_match"
or "count_the_matches" function.
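
A minimal sketch of that claim, assuming POSIX regexec as a stand-in for onig_search (this is not the plugin code, and count_matches is a hypothetical name). It searches, counts, steps past the whole match (slot 0), and repeats. Groups are never consulted.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count non-overlapping matches of a POSIX extended
   regex in subject. Returns -1 if the pattern does not compile. */
static int count_matches(const char *pattern, const char *subject)
{
    assert(pattern != NULL && subject != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int matches = 0;
    const char *at = subject;
    regmatch_t whole;               /* slot 0 only: the entire match */

    while (*at != '\0') {
        /* Past the first search, "at" is no longer the start of the line. */
        int flags = (at == subject) ? 0 : REG_NOTBOL;
        if (regexec(&re, at, 1, &whole, flags) != 0)
            break;                  /* no further match */
        matches += 1;
        /* Step past the match; step one char on an empty match so the
           loop always makes progress. */
        at += (whole.rm_eo > 0) ? whole.rm_eo : 1;
    }

    regfree(&re);
    return matches;
}
```

With this sketch, "a|b|c", "(a)|(b)|(c)" and "(a|b|c)" all give 3 for the string "cab".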
 

Here's a way to count matches that is simpler, faster, and much more intuitive than the code I posted earlier.

Code:
    UChar    *at = (UChar*) pString,
            *mend=(UChar*)pString + lstrlen(pString) * sizeof(WCHAR);
    INT        matches = 0,
            matchlen;

    while ( at < mend )
    {
        matchlen = onig_match(regex, (UChar*) pString, mend, at, NULL, option);
        if ( matchlen >= 0 )
        {
            matches += 1;
            at += matchlen;
        }
        else
        {
            at += 2;
        }
    }

    Sprintf(psz, L"%d", matches);
To count matches you have to work through the string, looking for each subsequent one. The onig_match function is a bit odd: it checks whether a match starts exactly at "at". The parameter that marks the beginning of the whole string (pString, above) appears to be irrelevant; the function works even if that parameter is NULL or greater than "at", so it does not seem to be used at all.
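
The same probe-at-each-position idea can be sketched with POSIX regex, which has no direct equivalent of onig_match. As an assumption for illustration, the pattern is wrapped as ^(pattern) so that each regexec call can only match at the start of the string it is handed. The name count_by_probing and the 256-byte pattern buffer are hypothetical, and the step on a miss is one char here rather than the two bytes of a WCHAR.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count matches by testing, at each position, whether
   a match starts exactly there. Returns -1 if the pattern is too long for
   the local buffer or does not compile. */
static int count_by_probing(const char *pattern, const char *subject)
{
    assert(pattern != NULL && subject != NULL);

    /* "^(" + pattern + ")" + NUL must fit in the buffer. */
    char anchored[256];
    if (strlen(pattern) + 4 > sizeof anchored)
        return -1;
    snprintf(anchored, sizeof anchored, "^(%s)", pattern);

    regex_t re;
    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return -1;

    int matches = 0;
    const char *at = subject;
    regmatch_t whole;

    while (*at != '\0') {
        if (regexec(&re, at, 1, &whole, 0) == 0 && whole.rm_eo > 0) {
            matches += 1;
            at += whole.rm_eo;      /* jump over the match */
        } else {
            at += 1;                /* nothing (non-empty) starts here */
        }
    }

    regfree(&re);
    return matches;
}
```

Under these assumptions the sketch reproduces the counts shown earlier in the thread, for example 5 for "o|g" and 3 for "(oo)|(g)" against "doggiepoo". Empty matches are skipped rather than counted, which also keeps the loop from stalling.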
 
