Tokenize (split) string

Syntax

int tok(string arr [n] [delim] [flags] [arr2])

 

Parameters

string - string to tokenize. Usually str variable.

arr - receives tokens. Variable of type ARRAY(str) or ARRAY(lpstr). Can also be a pointer-based array of str or lpstr. Can be 0 if not needed.

n - max number of tokens required. If omitted or -1, gets all.

delim - delimiters.

flags:

1 Modify string: replace the first delimiter character after each token with 0 (null character).

  • Useful when arr is an array of lpstr, because each token then ends with a terminating null character.
  • string must be of type str, not lpstr.
2 If there are more than n tokens, get the whole remaining right part as the last token (index n-1).

  • For example, if string is "a b c" and n is 2, you will get "a" and "b c" instead of "a" and "b".
4 Don't split parts enclosed in " " (double quotation marks).

  • For example, tok "a, ''b, c''" a -1 ", ''" 4 gets "a" and "b, c", not "a", "b", "c".
8 Don't split parts enclosed in ( ).
16 Don't split parts enclosed in [ ].
32 Don't split parts enclosed in { }.
64 Don't split parts enclosed in < >.
128 Don't split parts enclosed in ' '.
0x100 delim is table of delimiters.
0x200 QM 2.3.1. Recursive parsing of parts enclosed in ()[]{}<>.

  • For example, parsing the string "<a (b > c) d>" with flags 8|64 gives 3 tokens: "a (b ", "c" and "d". With flags 8|64|0x200 it gives 1 token: "a (b > c) d".
0x400 QM 2.3.1. Don't apply this default behavior when parsing parts enclosed in ()[]{}<>:

1. Characters )]}> in parts enclosed in "" are ignored.

2. A single character )]}> enclosed in ' ' is ignored.

0x1000 QM 2.3.3. Delimiters are blanks (space, tab, new line, control characters) and delim characters.
0x2000 QM 2.3.5. Always trim blanks around tokens. Also removes blank tokens.

  • For example, tok " a , b " a -1 "," 0x2000 gets "a" and "b", not " a " and " b ".

arr2 - array for parts between (after) tokens. Will have the same length as arr. Can be 0 if not needed.
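
The effect of flags 2 and 0x2000 can be sketched with the strings used in their descriptions above. The explicit " " delimiter in the first call is an assumption made for this sketch; the description of flag 2 does not say which delimiters its example uses.

```qm
str s = "a b c"
ARRAY(str) arr
int i nt
 Flag 2: with n = 2, the second token receives the whole remaining part.
nt = tok(s arr 2 " " 2)
for(i 0 nt) out arr[i]
 Output, as described for flag 2:
 a
 b c

 Flag 0x2000: blanks around tokens are trimmed.
s = " a , b "
nt = tok(s arr -1 "," 0x2000)
for(i 0 nt) out "'%s'" arr[i]
 Output, as described for flag 0x2000:
 'a'
 'b'
```

The quotes in the second format string are there only to make any leading or trailing blanks visible in the output.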

 

Remarks

Parses string and stores tokens in arr. Returns number of tokens.
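
When only the number of tokens matters, arr can be 0, as noted in Parameters. A minimal sketch, assuming the default delimiters split this string the same way as in Example1:

```qm
str s = "one two three"
int nt
 Pass 0 instead of an array: tokens are counted but not stored.
nt = tok(s 0)
out nt
 Expected from the description above (three tokens, as in Example1):
 3
```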

 

If arr is an array of str, it receives copies of tokens. If it is an array of lpstr, it receives pointers to tokens within string, which is faster.
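
The lpstr variant is typically combined with flag 1, so that each pointer refers to a separate null-terminated token. The following is only a sketch under that assumption; the " " delimiter is chosen for illustration.

```qm
str s = "one two three"
ARRAY(lpstr) arr
int i nt
 Flag 1 writes 0 into s after each token, so s itself is modified.
 arr then holds pointers into s, not copies.
nt = tok(s arr -1 " " 1)
for(i 0 nt) out arr[i]
```

Because the pointers refer to memory owned by s, they should be used only while s is unchanged. Use an array of str instead if the tokens must outlive or be independent of s.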

 

QM 2.3.5. Applies flags 4-128 even if delim does not contain the enclosing characters. In that case, tokens include these characters.

 

QM 2.3.5. Fixed bug: flags 4-128 ignored when the enclosed part is preceded by a non-delimiter character.

 

Tips

Although tok can be used to get lines of a multiline string, there are simpler ways. See Example3, foreach, findl, str.getl.

Regular expressions (findrx, str.replacerx) and other string functions, such as find, findc and findw, can also be used to parse strings.

 

Example1

str s = "one two three"
ARRAY(str) arr
int i nt
nt = tok(s arr)
for(i 0 nt) out arr[i]
 Output:
 one
 two
 three

 

Example2

str s = "one, (two + three) four five"
ARRAY(str) arr arr2
int i nt
nt = tok(s arr 3 ", ()" 8 arr2)
for(i 0 nt) out "'%s' '%s'" arr[i] arr2[i]
 Output:
 'one' ', ('
 'two + three' ') '
 'four' ' '

 

Example3

str s = "one[]two[]three"
ARRAY(str) arr = s
for(int'i 0 arr.len) out arr[i]
 Output:
 one
 two
 three

 

Example4

str s="abcdef"
int i
 Split s into characters as strings:
ARRAY(str) a.create(s.len)
for(i 0 a.len) a[i].get(s i 1)
 Split s into characters as character codes:
ARRAY(int) b.create(s.len)
for(i 0 b.len) b[i]=s[i]