Find or compare string using regular expression

About regular expressions

Regular expression syntax

str.replacerx

 

Syntax

int findrx(string pattern [from] [flags] [result] [submatch])

or

int findrx(string pattern rf [flags] [result] [submatch])

 

Parameters

string - string to search in.

pattern - regular expression that matches substring to find. String.

from - 0-based character index, from which to start search. Default 0.

rf - variable of type FINDRX. Use it to specify the part of string to search in, and a callout callback function.

flags:

1 Case insensitive.
2 Whole word. This adds \b to the beginning and end of pattern.
4 Find all. Valid only if result is array.
8 Multiline. If this flag is set (or (?m) is used in pattern), ^ and $ match the beginning and end of line. Default: ^ and $ match the beginning and end of whole string.
16 Don't need submatches. This flag makes this function faster when result is array.
32 QM 2.3.0. Convert pattern from UTF-8 to ANSI. Used when QM is running in Unicode mode (ignored otherwise). Set this flag if pattern contains non-ASCII characters but string is ANSI (not UTF-8). It is needed because in UTF-8 these characters in pattern normally consist of 2 or 3 bytes, whereas characters in an ANSI string consist of 1 byte.
128 Only compile pattern.
pcre flags  
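
Flag values can be combined with the | operator. A minimal sketch, assuming the sample strings below (they are only for illustration); [] in a QM string is a new line:

 Case insensitive whole word search (flags 1 and 2)
str subject="Error: disk full"
str s
if(findrx(subject "error" 0 1|2 s)>=0) out s ;;Error

 Multiline (flag 8): ^ matches at the beginning of each line
str lines="one[]two[]three"
if(findrx(lines "^t\w+" 0 8 s)>=0) out s ;;two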

result - variable of type str, int, ARRAY(str) or ARRAY(CHARRANGE).

submatch - submatch to find. Integer. If 0 (default), finds whole match. Before QM 2.4.3 - not used if result is array or 0.

 

Remarks

Finds a substring in string. The substring is specified by a regular expression (pattern). The function can find a whole match, a submatch, or all matches and submatches. A match is the part of string that matches pattern. A submatch is the part of the match that matches a captured subpattern. A captured subpattern is a part of pattern that is enclosed in parentheses and does not begin with ?.

 

The return value depends on flags and other arguments:

default - 0-based index of the first character of the match in string, or -1 if not found.
nonzero submatch - 0-based index of the first character of the submatch in string, or -1 if not found.
flag 4 - number of matches, or 0 if not found.
flag 128 - not used.
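
The return values listed above can be sketched as follows. The subject string and patterns are chosen only for illustration:

str subject="abc10 100 def"
str s; ARRAY(str) a
 Default: index of the match "10"
out findrx(subject "\d+") ;;3
 Not found
out findrx(subject "xyz") ;;-1
 Nonzero submatch: index of submatch 2 ("0" in "10")
out findrx(subject "(\d)(\d+)" 0 0 s 2) ;;4
 Flag 4 with array result: number of matches ("10" and "100")
out findrx(subject "\d+" 0 4 a) ;;2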

 

result can be used to get more information about the found match and submatches. This table shows what the function stores in result variable depending on its type. Assume that flag 4 is not used.

str - result receives the match, or the submatch if submatch is nonzero. If flag 128 is set, receives the compiled pattern.
int - result receives the length of the match or submatch.
ARRAY(str) - result receives the match in element 0 and submatches in subsequent elements. If flag 16 is set, receives only the match. QM 2.4.3: if submatch is not 0, receives only the submatch.
ARRAY(CHARRANGE) - result receives start and end offsets of the match and submatches. If flag 16 is set, receives only the match. QM 2.4.3: if submatch is not 0, receives only the submatch. QM 2.4.3: ARRAY(POINT) can be used too (POINT is the same as CHARRANGE, just with shorter member names).
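
For the two simplest result types, a minimal sketch with an illustrative subject string. The return value gives the offset, and an int result gives the length:

str subject="abc10 100 def"
 int result receives the length of the match
int length
int offset=findrx(subject "\d+" 0 0 length)
out "offset=%i length=%i" offset length ;;offset=3 length=2
 str result receives the match itself
str s
if(findrx(subject "\d+" 0 0 s)>=0) out s ;;10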

 

The CHARRANGE type is used to store start and end positions of a substring in a string.

 

type CHARRANGE cpMin cpMax

 

cpMin - start of substring (match or submatch) in string. It is 0-based index of first character of substring in string.

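cpMax - end of substring. It is the 0-based index of the character that follows the last character of the substring, so cpMax-cpMin is the length of the substring.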

 

If flag 4 is set and result is an array, the function finds all matches and creates a two-dimensional array. To access an element, use result[x y], where y is the match index (0 - first match, 1 - second match, ...), and x is 0 or the submatch index (0 - whole match, 1 - first submatch, ...). For example, result[0 0] contains the first match, result[0 1] the second match, and result[1 0] the first submatch of the first match.
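
The indexing can be sketched with a small self-contained example. The subject string and pattern are only illustrative. The loop count is the return value of findrx, which with flag 4 is the number of matches:

str subject="width=10 height=20"
ARRAY(str) a
int i n
n=findrx(subject "(\w+)=(\d+)" 0 4 a)
for i 0 n
	 a[0 i] is the whole match, a[1 i] and a[2 i] are its submatches
	out "%s: name=%s value=%s" a[0 i] a[1 i] a[2 i]
 Output:
 width=10: name=width value=10
 height=20: name=height value=20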

 

If flag 128 (only compile) is set and result is a str variable, the function does not search. It only compiles pattern and stores the compiled data in the result variable. You can later pass that variable as pattern to findrx and str.replacerx. If multiple operations are performed with the same pattern, using a compiled pattern is about 2 times faster, because the pattern does not have to be compiled each time. When compiling, only pattern, flags and result are used. You should use the same flags value when compiling and later.
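
A minimal sketch of compiling once and searching several times. It assumes that an empty string can be passed as string when compiling (string is not used then), and that flag 128 itself is omitted in the later calls. No other flags are used here, so the flags otherwise match:

 Compile the pattern into rx
str rx s
findrx("" "\d+" 0 128 rx)
 Use rx instead of the pattern string
if(findrx("abc10 def" rx 0 0 s)>=0) out s ;;10
if(findrx("x 7 y" rx 0 0 s)>=0) out s ;;7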

 

Examples

 Find digits (10)
str subject="abc10 100 def"
out findrx(subject "\d+")

 Find digits as whole word (100), and store into s
str subject="abc10 100 def"
str s
if(findrx(subject "\d+" 0 2 s)>=0) out s

 Extract HTML tags (simplified; useful only as "find all" example)
str html
IntGetFile("http://www.google.com" html)
str pattern="<(.*)>.*<\/\1>" ;;matches a HTML tag
ARRAY(str) a
findrx(html pattern 0 4 a)
int i
for(i 0 a.len)
	out "submatch=%s, whole=%s" a[1 i] a[0 i]

 Extract URL components
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
int i; ARRAY(str) a
if(findrx(subject pattern 0 0 a)<0) out "does not match"; ret
for i 0 a.len
	out a[i]

 Extract URL components; show offsets and lengths
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
int i; ARRAY(CHARRANGE) a
if(findrx(subject pattern 0 0 a)<0) out "does not match"; ret
for i 0 a.len
	int offset(a[i].cpMin) length(a[i].cpMax-a[i].cpMin)
	str s.get(subject offset length)
	out "offset=%i length=%i %s" offset length s

 Extract only server from URL 
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
str server
if(findrx(subject pattern 0 0 server 2)>=0) out server