Dynamic (?:Regex Highlighting)++ with Javascript!

Version 20110421_2300

Interactive tester: DynamicRegexHighlighterTester.html.

This page demonstrates, documents and tests the Javascript dynamic regex highlighter script: DynamicRegexHighlighter.js. To see the highlighting in action, place the mouse cursor over the various parts of the regular expressions below. The following components are highlighted: matching character class delimiters (and any quantifiers), comments, comment groups, and matching (possibly nested) group delimiters (and their quantifiers). When moused over, these regex components are highlighted in THIS COLOR (or more precisely, in whatever style is defined by the CSS class .regex_hl). The script also identifies unbalanced parentheses and displays them in RED (or more precisely, in whatever style is defined by the CSS class .regex_err). When the mouse is placed over a numbered capturing group, a tooltip appears indicating the capture group number (e.g. "Capture group $3"). Note, however, that if the regex contains a "branch reset" construct, (?|(like)|(this)|(one)), the script cannot compute capture group numbers beyond that point. In this case, the affected capture group numbers are simply omitted from the tooltips.
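
The capture group numbering described above can be sketched roughly as follows. This is a minimal illustration, not the code from DynamicRegexHighlighter.js: the function name numberCaptureGroups and its return shape are hypothetical, and the sketch ignores free-spacing #comments, parentheses inside (?#...) comment groups, and the "]"-as-first-character rule for character classes.

```javascript
// Hypothetical sketch: assign numbers to capturing groups, left to right.
// Numbers become null after a "branch reset" group, mirroring the
// limitation described in the text.
function numberCaptureGroups(regexSource) {
  var result = [];          // one entry per "(" that opens a capturing group
  var count = 0;
  var inClass = false;      // true while inside [...]
  var numberingKnown = true;
  for (var i = 0; i < regexSource.length; i++) {
    var ch = regexSource.charAt(i);
    if (ch === "\\") { i++; continue; }           // skip the escaped char
    if (inClass) { if (ch === "]") inClass = false; continue; }
    if (ch === "[") { inClass = true; continue; }
    if (ch !== "(") continue;
    var rest = regexSource.slice(i + 1);
    if (/^\?\|/.test(rest)) { numberingKnown = false; continue; } // (?|...)
    if (/^\?\(/.test(rest)) { i += 2; continue; } // conditional: skip "(?("
    // Plain "(" captures; so do (?<name>, (?'name' and (?P<name>.
    var capturing = !/^\?/.test(rest) || /^\?(?:P?<(?![=!])|')/.test(rest);
    if (capturing) {
      count++;
      result.push({ index: i, number: numberingKnown ? count : null });
    }
  }
  return result;
}
```

For example, numberCaptureGroups("(a)(?:b)(?<n>c)") reports groups 1 and 2 at indexes 0 and 8, while every capturing group that follows a (?|...) construct is reported with number null.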

Each regular expression from the script is presented below in two formats: 1.) fully commented in free-spacing format, and 2.) uncommented in native Javascript format. Additionally, an extended example pseudo-regex is provided which tests the various PCRE regex constructs that this script recognizes. While the page loads (which can take quite some time on pages that contain many large, highly complex regexes), progress information is displayed in the browser's status bar at the bottom of the page (if the browser is set up to allow this). Once the page is loaded, the script is idle except for mouseover and mouseout events, which are handled very quickly. When the page is unloaded, the script comes back to life to free the memory it allocated and to null out all its references to DOM node objects (to prevent the memory leaks that otherwise occur in IE).

Usage: (easy as one two three!)

1.) Add a script tag to the document head element to include: DynamicRegexHighlighter.js.
    e.g. <script type="text/javascript" src="DynamicRegexHighlighter.js"></script>
2.) Add a .regex_hl class selector to the stylesheet to define what the highlighted regex text should look like. To show unbalanced parentheses as visible errors, add a .regex_err class selector to the stylesheet before the .regex_hl rule.
    e.g. <style type="text/css">.regex_err {color: #FFF; background-color: #F00;} .regex_hl {color: #333; background-color: #0F0;}</style>
3.) Wrap the regex to be highlighted in an element having either class="regex" or class="regex_x". The regular expression should be valid, written in native regex format, and have all "<", ">" and "&" characters converted to HTML entities so that the web page is valid. If the regex is written in free-spacing mode with #comments (i.e. with the Perl-syntax "x" modifier set), use the "regex_x" class variation and wrap it in a PRE tag (to preserve whitespace for IE).
    e.g. <h1 class="regex">Dynamic (?:Regex Highlighting)++ with Javascript!</h1>
    e.g. <pre class="regex_x">Free spacing ("x" mode) regex with #comments here.</pre>
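
The entity conversion required in the last step can be done by hand or with a small helper. The following is a sketch only: escapeRegexForHtml and wrapRegex are hypothetical names that are not part of DynamicRegexHighlighter.js, and the choice of a CODE element for the non-free-spacing case is an assumption (any element with class="regex" should do).

```javascript
// Hypothetical helper: convert the three HTML-significant characters in a
// regex source string to entities so the page stays valid.
function escapeRegexForHtml(regexSource) {
  return regexSource
    .replace(/&/g, "&amp;")   // must run first so "&lt;" is not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap an escaped regex in an element carrying the
// class name the highlighter looks for.
function wrapRegex(regexSource, freeSpacing) {
  var escaped = escapeRegexForHtml(regexSource);
  return freeSpacing
    ? '<pre class="regex_x">' + escaped + "</pre>"   // "x" mode, keeps whitespace
    : '<code class="regex">' + escaped + "</code>";  // native format
}
```

For instance, escapeRegexForHtml("(?<=a&b)>") returns "(?&lt;=a&amp;b)&gt;".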

Adding HTML markup to the Regexes:

You can apply HTML markup to the regex and the dynamic highlighter script will still work correctly as long as you follow a few rules. There are several atomic multi-character regex tokens which must not be split up by an HTML opening or closing tag. These not-to-be-interrupted regex tokens include the following:

The opening sequence of a group is atomic:
OK: (?:...), (?:...), (?:...), (?:...), (?:...).
BAD: (?:...), (?:...), (?:...), (?:...).
MORE BAD: (?>...), (?=...), (?<=...), (?!...), (?<!...), etc.
These escaped metacharacters "()|[]\#" are atomic:
OK: \(, \), \|, \[, \], \\, \#.
ALSO OK: You may split up other escaped metacharacters: \s, \w, \2, \u89AF, etc.
BAD: \(, \), \|, \[, \], \\, \#.
The opening sequence of a character class is atomic:
OK: [^...], []...], [^]...], [^...], [^...], [^...], [^...].
BAD: [^...], []...], [^]...], [^]...], etc.
An embedded POSIX character class is atomic:
OK: [...[:alpha:]...], [...[:^alpha:]...], [...[:alpha:]...].
BAD: [...[:alpha:]...], [...[:alpha:]...], [...[:^alpha:]...], [...[:alpha:]...], etc.
Quantifiers applied to character classes and groups are atomic:
OK: [A-F]++, (A|B)*?, [ABC]{10,}, (\s\w+){3,5}+.
ALSO OK: You may split up other quantifiers: \w++, \s*?, \\{10,}, X{3,5}+.
BAD: [A-F]++, (A|B)*?, [ABC]{10,}, (\s\w+){3,5}+.

Note that these rules apply only to the DynamicRegexHighlighter.js script and have nothing to do with regular expression syntax itself. For the ultimate regular expression tutorial and reference, Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl is the definitive guide and is highly recommended by this author.

Colorization note: Steven Levithan's color JavaScript Regex Syntax Highlighter script does not allow HTML markup in its input text. However, its generated output (which does contain lots of HTML markup) may be fed into this script. Thus, if you are using both scripts, be sure to apply the colorizer script first (as is demonstrated on this page and the tester page). Note, however, that if a non-Javascript regex is fed into the colorizer script, it will flag many common valid regex constructs (such as possessive quantifiers and lookbehind) as errors, and will mark up these errors in a way that violates the markup rules described above (e.g. it will split up a "++" possessive quantifier to show the second plus as an error). The dynamic highlighter will then misinterpret these expressions. Bottom line: when using the colorizer, only feed it Javascript-syntax regular expressions!

Regular Expression Goodness:

Following are all the regexes from the DynamicRegexHighlighter.js script. First up is the Phase 1 regex used to parse regexes having commenting turned on (i.e. regexes with the Perl style "x" modifier set). This first listing is presented in verbose format with a liberal sprinkling of comments and indentation for clarity.

# Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_1_cmt: Match character classes, comment groups, HTML tags, and comments.
  ( [^[(#<\\]+(?:\\[^<][^[(#<\\]*)*       # $1: Everything else (starting w/non-escape)
  |           (?:\\[^<][^[(#<\\]*)+       #  or everything else (starting w/escape).
  )                                       # End $1. (Note: No escaped "\<" allowed.)
| (\[\^?)                                 # $2: Character class opening delim.
  (                                       # $3: Character class contents.
    \]?                                   # Unescaped ] allowed if first char.
    [^[\]\\]*(?:\\[\S\s][^[\]\\]*)*       # Non-[], escaped-anything (normal*).
    (?: \[                                # Allow a non-escaped "[", and it
      (?::\^?\w+:\])?                     # may be embedded POSIX char class.
      [^[\]\\]*(?:\\[\S\s][^[\]\\]*)*     # More non-[], escaped-anything.
    )*                                    # Unroll-the-loop (special normal*)*
  )                                       # End $3. Character class contents.
  \]                                      # Character class closing delimiter.
  ((?:</?\w+\b[^>]*>)*)                   # $4: HTML tags between "]" and quantifier.
  ((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)  # $5: Optional char class quantifier.
| (\((?!\?\#))                            # $6: Opening "(" (non comment group).
| (\(\?\#[^)]*\))                         # $7: Comment group (cmt_grp).
| ((?:</?\w+\b[^>]*>)+)                   # $8: Embedded HTML tags (open or close).
| (\#.*)                                  # $9: Comment (cmt).

Here is the exact same regex in its raw, uncommented native Javascript format as it appears in the script (with a few added newlines to avoid it going off screen).

var re_1_cmt = /([^[(#<\\]+(?:\\[^<][^[(#<\\]*)*|(?:\\[^<][^[(#<\\]*)+)|
(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*
(?:\\[\S\s][^[\]\\]*)*)*)\]((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?
\})[+?]?)?)|(\((?!\?#))|(\(\?#[^)]*\))|((?:<\/?\w+\b[^>]*>)+)|(#.*)/g;
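
To see how the character class portion of this regex behaves in isolation, the sketch below lifts the $2, $3 and $5 sub-patterns into a standalone expression. The name reCharClass is hypothetical, and the $4 group (HTML tags between "]" and the quantifier) is deliberately left out to keep the example short, so the capture numbers here differ from those in re_1_cmt.

```javascript
// Character class sub-pattern taken from re_1_cmt (without the HTML tag group).
// m[1]: opening delimiter, m[2]: class contents, m[3]: optional quantifier.
var reCharClass = /(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*)*)\]((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)/;

// Hypothetical helper: return the three parts of the first character class
// found in the text, or null if there is none.
function parseCharClass(text) {
  var m = reCharClass.exec(text);
  if (m === null) return null;
  return { open: m[1], contents: m[2], quantifier: m[3] };
}

// A class with a leading "]", an escaped "]", a POSIX class and a
// possessive quantifier:
var parts = parseCharClass("x[^]a\\]b[:alpha:]c]{2,}+y");
// parts.open       is "[^"
// parts.contents   is "]a\\]b[:alpha:]c"  (i.e. ]a\]b[:alpha:]c)
// parts.quantifier is "{2,}+"
```

The contents sub-pattern follows the "unrolled loop" shape noted in the comments: a run of ordinary characters, then zero or more repetitions of one special item (an escape or an unescaped "[") followed by another run of ordinary characters.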

Following are the remaining regular expressions from the script in both commented and non-commented formats.

# Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_1_nocmt: Match character classes and comment groups (no comments).
  ( [^[(\\]+(?:\\[\S\s][^[(\\]*)*         # $1: Everything else (starting w/non-escape)
  |         (?:\\[\S\s][^[(\\]*)+         #  or everything else (starting w/escape).
  )                                       # End $1.
| (\[\^?)                                 # $2: Character class opening delim.
  (                                       # $3: Character class contents.
    \]?                                   # Unescaped ] allowed if first char.
    [^[\]\\]*(?:\\[\S\s][^[\]\\]*)*       # Non-[], escaped-anything (normal*).
    (?: \[                                # Allow a non-escaped "[", and it
      (?::\^?\w+:\])?                     # may be embedded POSIX char class.
      [^[\]\\]*(?:\\[\S\s][^[\]\\]*)*     # More non-[], escaped-anything.
    )*                                    # Unroll-the-loop (special normal*)*
  )                                       # End $3. Character class contents.
  \]                                      # Character class closing delimiter.
  ((?:</?\w+\b[^>]*>)*)                   # $4: HTML tags between "]" and quantifier.
  ((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)  # $5: Optional char class quantifier.
| (\((?!\?\#))                            # $6: Opening "(" (non comment group).
| (\(\?\#[^)]*\))                         # $7: Comment group (cmt_grp).

var re_1_nocmt = /([^[(\\]+(?:\\[\S\s][^[(\\]*)*|(?:\\[\S\s][^[(\\]*)+)|
(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*
(?:\\[\S\s][^[\]\\]*)*)*)\]((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?
\})[+?]?)?)|(\((?!\?#))|(\(\?#[^)]*\))/g;

# Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_2: Match inner (non-nested) PCRE syntax regex groups.
\(                         # Regex group opening "(" delimiter.
(                          # $1: Optional group type specification.
  \?                       # All special group types start with a "?".
  (?:                      # Non-capture group for group types alternatives.
    [:|>=!]                # Types specified with a single character.
  | &gt;                   # Atomic group (HTML entity).
  | &lt;[=!]               # Look behind (HTML entity).
  | <[=!]                  # Look behind (Note 1).
  | P?&lt;\w+&gt;          # Named capture group (Python/Perl) (HTML entity).
  | P?<\w+>                # Named capture group (Python/Perl) (Note 1).
  | '\w+'                  # Named capturing group (Perl).
  | (?=<span[^>]*>&\#40;)  # Previously-marked nested generic conditional.
  | \(                     # Begin conditional group with "(" delimiter.
    (?:                    # Non-capture group for conditional alternatives.
      [+\-]?\d+            # Absolute/+-relative reference condition.
    | &lt;\w+&gt;          # Named reference condition (Perl) (HTML entity).
    | <\w+>                # Named reference condition (Perl) (Note 1).
    | '\w+'                # Named reference condition (Perl).
    | R&amp;\w+            # specific recursion condition (HTML entity).
    | R&\w+                # specific recursion condition (Note 1).
    | \w+                  # Named reference condition (PCRE)
    ) \)                   # End conditional group with ")" delimiter.
  | (?:                    # Group types that must have zero content.
      R                    # Recurse whole pattern.
    | (?:-?[iJmsUx])+      # Flag modifiers (PCRE).
    | [+\-]?\d+            # Call subpattern by absolute/+-relative number.
    | &amp;\w+             # Call subpattern by name (Perl) (HTML entity).
    | &\w+                 # Call subpattern by name (Perl) (Note 1).
    | P&gt;\w+             # Call subpattern by name (Python) (HTML entity).
    | P>\w+                # Call subpattern by name (Python) (Note 1).
    | P=\w+                # Reference by name (Python).
    )(?=\))                # Ensure this group type has no contents.
  )                        # End non-capture group of group types alternatives.
)?                         # End $1: Optional group type specification.
([^()]*)                   # $2: Inner group contents.
\)                         # Regex group closing ")" delimiter.
((?:</?\w+\b[^>]*>)*)                   # $3: HTML tags between ")" and quantifier.
((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)  # $4: Optional quantifier.
# Note 1: Handle "<", ">" and "&", even if not converted to HTML entities.

var re_2 = /\((\?(?:[:|>=!]|&gt;|&lt;[=!]|<[=!]|P?&lt;\w+&gt;|P?<\w+>|'\w+'|
(?=<span[^>]*>&#40;)|\((?:[+\-]?\d+|&lt;\w+&gt;|<\w+>|'\w+'|R&amp;\w+|R&\w+|\w+)
\)|(?:R|(?:-?[iJmsUx])+|[+\-]?\d+|&amp;\w+|&\w+|P&gt;\w+|P>\w+|P=\w+)(?=\))))?
([^()]*)\)((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)/g;
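
Because re_2 only matches innermost groups (its contents sub-pattern is [^()]*), nested groups have to be processed from the inside out. The sketch below illustrates that general idea with a much simpler stand-in pattern; it is not the script's actual algorithm. The names reInnerGroup and findUnbalancedParens are hypothetical, and the sketch assumes that escaped parentheses and character classes have already been neutralized by earlier steps.

```javascript
// Simplified stand-in for re_2: "(" + anything without parens + ")".
var reInnerGroup = /\([^()]*\)/g;

// Hypothetical helper: peel off innermost groups repeatedly; whatever
// parentheses remain afterwards have no partner.
function findUnbalancedParens(regexSource) {
  var text = regexSource;
  var prev;
  do {
    prev = text;
    // Replace each innermost group with same-length filler so that
    // positions in the original string stay valid.
    text = text.replace(reInnerGroup, function (m) {
      return new Array(m.length + 1).join("_");
    });
  } while (text !== prev);
  var positions = [];
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (ch === "(" || ch === ")") positions.push(i);
  }
  return positions;
}
```

With this sketch, findUnbalancedParens("a(b(c)d)e") returns an empty array, while findUnbalancedParens("(a))(") returns [3, 4]. Each pass resolves one nesting level, so the number of passes grows with the nesting depth.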

# Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_escapedgroupdelims: Convert escaped group delimiter chars to HTML entities.
  ( [^\\]+(?:\\[^()|][^\\]*)*  # $1: Everything else (starting with non-escape),
  |       (?:\\[^()|][^\\]*)+  #  or everything else (starting with escape).
  )                            # End $1.
| \\([()|])                    # $2: Escaped "(", ")" or "|".

/([^\\]+(?:\\[^()|][^\\]*)*|(?:\\[^()|][^\\]*)+)|\\([()|])/g
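
A regex of this two-alternative shape is typically applied with a replacement callback: $1 text is passed through unchanged and only the $2 character is rewritten. The sketch below shows one way this could look. The function name hideEscapedGroupDelims is hypothetical, and the exact replacement text (backslash kept, numeric entity emitted) is an assumption rather than the script's confirmed behavior.

```javascript
// Regex copied from the listing above.
var re_escapedgroupdelims = /([^\\]+(?:\\[^()|][^\\]*)*|(?:\\[^()|][^\\]*)+)|\\([()|])/g;

// Hypothetical helper: replace \( \) \| with entity forms so that later
// group matching does not mistake them for real delimiters.
function hideEscapedGroupDelims(text) {
  return text.replace(re_escapedgroupdelims, function (m0, m1, m2) {
    if (m1) return m1;                        // everything else: unchanged
    // Assumption: keep the backslash, emit the character as &#NN;
    return "\\&#" + m2.charCodeAt(0) + ";";
  });
}
```

Applied to the text a\(b\|c\)\d(e), this sketch yields a\&#40;b\&#124;c\&#41;\d(e): the three escaped delimiters become entities, while the escape \d and the real group (e) are left alone.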

# Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_open_html_tag: Match HTML opening tag with at least one attribute.
<                  # Opening tag opening "<" delimiter.
(                  # $1: Opening tag name and attribute contents.
  \w+\b            # Tag name.
  (?:              # Non-capture group for required attribute(s).
    \s+            # Attributes must be separated by whitespace.
    [\w\-.:]+      # Attribute name is required for attr=value pair.
    (?:            # Non-capture group for optional attribute value.
      \s*=\s*      # Name and value separated by "=" and optional ws.
      (?:          # Non-capture group for attrib value alternatives.
        "[^"]*"    # Double quoted string (Note: may contain "&<>").
      | '[^']*'    # Single quoted string (Note: may contain "&<>").
      | [\w\-.:]+  # Non-quoted attrib value can be A-Z0-9-._:
      )            # End of attribute value
    )?             # Attribute value is optional.
  )+               # One or more attributes required.
  \s* /?           # Optional whitespace and "/" before ">".
)                  # End $1. Opening tag name and attribute contents.
>                  # Opening tag closing ">" delimiter.

/<(\w+\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)+\s*\/?)>/g
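
The following sketch exercises this regex against a small HTML fragment to show which tags it does and does not match. The helper name listTagsWithAttributes is hypothetical and exists only for this illustration.

```javascript
// Regex copied from the listing above.
var re_open_html_tag = /<(\w+\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)+\s*\/?)>/g;

// Hypothetical helper: collect the $1 capture (tag name plus attributes)
// of every opening tag that has at least one attribute.
function listTagsWithAttributes(html) {
  var found = [];
  var m;
  re_open_html_tag.lastIndex = 0;   // global regex: start from the beginning
  while ((m = re_open_html_tag.exec(html)) !== null) {
    found.push(m[1]);
  }
  return found;
}

var sample = "<b>x</b><span class=\"a|b(c)\" title='>'>y</span><br />";
var tags = listTagsWithAttributes(sample);
// tags holds one entry: span class="a|b(c)" title='>'
```

Tags without attributes (such as the B and BR tags here) and closing tags are not matched. The quoted-value alternatives let an attribute value contain ">", "(", ")" or "|" without ending the match early.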

A Script Test Chamber:

# Example pseudo-regex demonstrating all recognized PCRE component types.

(?# CHARACTER CLASSES)
[...]                          # positive character class
[^...]                         # negative character class
[]...]                         # unescaped ] allowed if first char
[^]...]                        # unescaped ] allowed if first char
[x-y]                          # range (can be used for hex characters)
[[:xxx:]]                      # positive POSIX named set
[[:^xxx:]]                     # negative POSIX named set
[[:alpha:][:alpha:][:alpha:]]  # can have multiple embedded POSIX cc
[[[[[:alpha:][[[[:alpha:][[[]  # can have unescaped non-POSIX class "["

(?# QUANTIFIERS applied to character classes and simple capture groups.)
[x]?         (x)?         # 0 or 1, greedy
[x]?+        (x)?+        # 0 or 1, possessive
[x]??        (x)??        # 0 or 1, lazy
[x]*         (x)*         # 0 or more, greedy
[x]*+        (x)*+        # 0 or more, possessive
[x]*?        (x)*?        # 0 or more, lazy
[x]+         (x)+         # 1 or more, greedy
[x]++        (x)++        # 1 or more, possessive
[x]+?        (x)+?        # 1 or more, lazy
[x]{1}       (x){1}       # exactly n
[x]{1,2}     (x){1,2}     # at least n, no more than m, greedy
[x]{1,2}+    (x){1,2}+    # at least n, no more than m, possessive
[x]{1,2}?    (x){1,2}?    # at least n, no more than m, lazy
[x]{1,}      (x){1,}      # n or more, greedy
[x]{1,}+     (x){1,}+     # n or more, possessive
[x]{1,}?     (x){1,}?     # n or more, lazy
[x]{10}      (x){10}      # exactly nn (multiple digits)
[x]{10,20}   (x){10,20}   # at least nn, no more than mm, greedy
[x]{10,20}+  (x){10,20}+  # at least nn, no more than mm, possessive
[x]{10,20}?  (x){10,20}?  # at least nn, no more than mm, lazy
[x]{10,}     (x){10,}     # nn or more, greedy
[x]{10,}+    (x){10,}+    # nn or more, possessive
[x]{10,}?    (x){10,}?    # nn or more, lazy

(?# CAPTURING)
(...)           # capturing group
(?<name>...)    # named capturing group (Perl)
(?'name'...)    # named capturing group (Perl)
(?P<name>...)   # named capturing group (Python)
(?:...)         # non-capturing group
(?|(...)|(...)) # "branch reset" non-capturing group; reset group
                # numbers for capturing groups in each alternative

(?# ATOMIC GROUPS)
(?>...)         # atomic, non-capturing group

(?# OPTION SETTING)
(?i)            # caseless
(?J)            # allow duplicate names
(?m)            # multiline
(?s)            # single line (dotall)
(?U)            # default ungreedy (lazy)
(?x)            # extended (ignore white space)
(?-i)           # NOT caseless
(?-J)           # NOT allow duplicate names
(?-m)           # NOT multiline
(?-s)           # NOT single line (dotall)
(?-U)           # NOT default ungreedy (lazy)
(?-x)           # NOT extended (ignore white space)
(?i-Jm-sU-x)    # multiple options at once.
(?-iJ-ms-Ux)    # multiple options at once.

(?# LOOKAHEAD AND LOOKBEHIND ASSERTIONS)
(?=...)         # positive look ahead
(?!...)         # negative look ahead
(?<=...)        # positive look behind
(?<!...)        # negative look behind

(?# BACKREFERENCES)
(?P=name)       # reference by name (Python)

(?# SUBROUTINE REFERENCES {POSSIBLY RECURSIVE})
(?R)            # recurse whole pattern
(?1)            # call subpattern by absolute number
(?+1)           # call subpattern by relative number
(?-1)           # call subpattern by relative number
(?&name)        # call subpattern by name (Perl)
(?P>name)       # call subpattern by name (Python)

(?# CONDITIONAL PATTERNS)
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(1)...)        # absolute reference condition
(?(+1)...)       # relative reference condition
(?(-1)...)       # relative reference condition
(?(<name>)...)   # named reference condition (Perl)
(?('name')...)   # named reference condition (Perl)
(?(name)...)     # named reference condition (PCRE)
(?(R)...)        # overall recursion condition
(?(R1)...)       # specific group recursion condition
(?(R&name)...)   # specific recursion condition
(?(DEFINE)...)   # define subpattern for reference
(?(?=...)...)    # assertion condition (positive lookahead)
(?(?!...)...)    # assertion condition (negative lookahead)
(?(?<=...)...)   # assertion condition (positive lookbehind)
(?(?<!...)...)   # assertion condition (negative lookbehind)

(?# MISCELLANEOUS TESTS)
# test HTML tags having "&<>()|[]" delimiter chars in attribute values.
HTML TAG            # in open regex
(?# HTML TAG)       # in comment group
# HTML TAG          # in comment
[HTML TAG in character class]
(HTML TAG in group)
\HTML TAG           # with \ escape immediately before <

# character class regexes with HTML tags
[charclass] [charclass] [charclass] [charclass]
[charclass] [charclass] [charclass] [charclass]
[charclass]++ [charclass]++ [charclass]++ [charclass]++ [charclass]++
[charclass]++ [charclass]++ [charclass]++ [charclass]++ [charclass]++

# character class regexes with multiple HTML tags
[charclass] [charclass] [charclass] [charclass]
[charclass] [charclass] [charclass] [charclass]
[charclass]++ [charclass]++ [charclass]++ [charclass]++ [charclass]++
[charclass]++ [charclass]++ [charclass]++ [charclass]++ [charclass]++

# group regexes with HTML tags
(?:group) (?:group) (?:group) (?:group)
(?:group) (?:group) (?:group) (?:group)
(?:group)++ (?:group)++ (?:group)++ (?:group)++ (?:group)++
(?:group)++ (?:group)++ (?:group)++ (?:group)++ (?:group)++

# group regexes with multiple HTML tags
(?:group) (?:group) (?:group) (?:group)
(?:group) (?:group) (?:group) (?:group)
(?:group)++ (?:group)++ (?:group)++ (?:group)++ (?:group)++
(?:group)++ (?:group)++ (?:group)++ (?:group)++ (?:group)++

[  (   )   | ]   # unescaped group delimiters inside char class
[ \(  \)  \| ]   # escaped group delimiters inside char class
( \(  \)  \| )   # escaped group delimiters inside group
  \(  \)  \|     # escaped group delimiters outside
) ) ( (          # unbalanced parentheses
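
The quantifier forms listed in the test chamber above can be checked against the quantifier sub-pattern shared by re_1_cmt, re_1_nocmt and re_2. The sketch below anchors that sub-pattern with ^ and $ and makes it mandatory; the names reQuantifier and isQuantifier are hypothetical.

```javascript
// Quantifier sub-pattern from the listings above, anchored so that the
// whole string must be a single quantifier.
var reQuantifier = /^(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?$/;

// Hypothetical helper: true if text is one complete quantifier
// (greedy, lazy or possessive).
function isQuantifier(text) {
  return reQuantifier.test(text);
}

// All forms from the QUANTIFIERS section of the test chamber:
var samples = ["?", "?+", "??", "*", "*+", "*?", "+", "++", "+?",
               "{1}", "{1,2}", "{1,2}+", "{1,2}?", "{1,}", "{1,}+", "{1,}?",
               "{10}", "{10,20}", "{10,20}+", "{10,20}?",
               "{10,}", "{10,}+", "{10,}?"];
var allValid = samples.every(isQuantifier);   // true
```

Forms such as "{,5}" or "+++" are rejected, since the sub-pattern requires a leading digit inside the braces and allows at most one trailing "+" or "?" modifier.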

Notes and limitations:

During Phase 1 processing, the re_1_cmt and re_1_nocmt regexes will match (invalid) empty character classes (i.e. /[]/ or /[^]/). It is best to use only valid regexes.
The HTML document should not have any element having id="xREx" as this may cause an error during parsing when running Internet Explorer. This problem should be very rare and is easily avoided.
Firefox 2 refuses to break up very long words and will display them as one very long line (with a horizontal scroll bar). It is best to add some line breaks to long regexes.
When using the interactive tester, it is important to choose the correct value for the Perl "x" free-spacing mode checkbox option. If you fail to check this option for a regex having #comments, the parser will get confused by any unbalanced metacharacters within the comments and may erroneously report unbalanced parentheses.
When using the color syntax highlighting option, remember that the colorization script is designed to handle only regexes written in the Javascript regex flavor (i.e. no lookbehind, possessive quantifiers, named capture groups, atomic grouping, etc.). Most importantly, the Javascript flavor does not allow comments. For this reason, the interactive tester will not allow you to select both the "x" flag option and the color option at the same time.