EggRegex improvements

Marco Barisione
Hi,
we have written a new syntax highlighting engine for GtkSourceView (used
by gedit, MonoDevelop and other programs). We are using EggRegex for
regular expression matching and it seems to work well. I have made some
changes to EggRegex, listed below; the patch is attached.
Sorry for putting all the changes in a single patch.

[This is on bugzilla.gnome.org as bug #306941]

- EggRegex now works with offsets (character counts) instead of indexes
(byte counts), so it is consistent with glib. It is still possible to use
indexes if desired, to avoid double conversions (for instance indexes
used internally by GtkSourceView --> offsets --> indexes used by PCRE).

- I have fixed some memory leaks and some other bugs.
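
To make the offset/index distinction concrete, here is a minimal sketch in plain C of the two conversions involved. The helper names utf8_offset_to_index and utf8_index_to_offset are hypothetical and are not part of EggRegex; the patch itself relies on glib's g_utf8_offset_to_pointer() and g_utf8_pointer_to_offset(). The sketch assumes valid, NUL-terminated UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of EggRegex: converts a character
 * offset into a byte index. Assumes str is valid UTF-8. */
static size_t
utf8_offset_to_index (const char *str, size_t offset)
{
  size_t index = 0;

  assert (str != NULL);

  while (offset > 0 && str[index] != '\0')
    {
      /* step over the lead byte, then over any continuation
       * bytes (those of the form 10xxxxxx) */
      index++;
      while (((unsigned char) str[index] & 0xC0) == 0x80)
        index++;
      offset--;
    }

  return index;
}

/* Hypothetical inverse: converts a byte index into a character
 * offset. The index is assumed to fall on a character boundary. */
static size_t
utf8_index_to_offset (const char *str, size_t index)
{
  size_t offset = 0;
  size_t i;

  assert (str != NULL);

  for (i = 0; i < index && str[i] != '\0'; i++)
    if (((unsigned char) str[i] & 0xC0) != 0x80)
      offset++; /* count every byte that starts a character */

  return offset;
}
```

Both conversions walk the string from the beginning, which is why a caller that already holds byte indexes may prefer index mode over converting twice.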


Modified functions:
- egg_regex_new accepts an argument use_offsets to choose between offset
and index mode.

- egg_regex_match returns a boolean instead of the number of matched
substrings: it is more intuitive to write "if (egg_regex_match (...))"
instead of "if (egg_regex_match (...) >= 0)". To obtain the number of
matched substrings you can use egg_regex_get_match_count().

- egg_regex_match no longer accepts the length argument, so the function
is very easy to use in the most common case. If a length is needed,
egg_regex_match_extended can be used.

- egg_regex_fetch_pos returns a boolean so it is possible to know if the
position has been retrieved.

- egg_regex_replace_eval: the regex passed to the eval function is now
constant, as modifying it could lead to unexpected behaviour.

- All the functions that perform a match accept a start_position
argument. This differs from just passing a shortened string and setting
EGG_REGEX_MATCH_NOTBOL in the case of a pattern that begins with any
kind of lookbehind assertion, such as "\b".
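
The reason start_position is not the same as passing a shortened string can be sketched with a hand-written word-boundary check in plain C. This is only an illustration under stated assumptions: word_boundary_at and is_word_char are hypothetical names, the check is ASCII-only, and it is not how PCRE implements "\b". The point is that the test at a given position looks at the character before it, which a shortened string no longer contains.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: ASCII-only notion of a "word" character. */
static int
is_word_char (char c)
{
  return isalnum ((unsigned char) c) || c == '_';
}

/* Hypothetical helper: returns 1 if there is a word boundary at byte
 * position pos of the len-byte buffer str, 0 otherwise. The character
 * before pos is consulted, as a lookbehind would do. */
static int
word_boundary_at (const char *str, size_t len, size_t pos)
{
  int before, after;

  assert (str != NULL && pos <= len);

  before = pos > 0 && is_word_char (str[pos - 1]);
  after = pos < len && is_word_char (str[pos]);

  return before != after;
}
```

With the full string "foobar" and a start position of 3, the check sees the "o" before "bar" and reports no boundary. With the shortened string "bar" and position 0, nothing precedes the "b", so a boundary is reported and a pattern such as "\bbar" would match where it should not.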


New functions:
- egg_regex_copy: returns a copy of an EggRegex structure.

- egg_regex_get_pattern: returns the regular expression pattern.

- egg_regex_match_extended: like egg_regex_match but allows a starting
offset different from 0 (useful when using lookbehind assertions.)
Accepts a GError to report errors.

- egg_regex_match_next_extended: like egg_regex_match_next but accepts a
starting offset and a GError.

- egg_regex_get_match_count: returns the number of matched substrings.

- egg_regex_fetch_named_pos: like egg_regex_fetch_pos but for named
subexpressions.

- egg_regex_expression_number_from_name: returns the number of a named
subexpression.

- egg_regex_replace_literal: like egg_regex_replace() but the
replacement is included literally, so "\1" is not a backreference but
just a "\" followed by "1".

- egg_regex_escape_string: escapes a string, useful to dynamically
create regular expressions.
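
As a rough idea of what escaping a string for use in a pattern involves, here is a minimal sketch in plain C. The function escape_for_regex is a hypothetical stand-in and is not the code from the patch; the set of characters it escapes is an assumption based on the usual PCRE metacharacters, and the real egg_regex_escape_string may treat more or fewer characters as special.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the idea behind egg_regex_escape_string():
 * returns a newly allocated copy of str in which every character from
 * an assumed set of metacharacters is preceded by a backslash.
 * The caller frees the result with free(). Returns NULL if the
 * allocation fails. */
static char *
escape_for_regex (const char *str)
{
  static const char special[] = "\\^$.[]|()*+?{}";
  size_t len;
  char *out, *p;

  assert (str != NULL);

  /* worst case: every character needs a backslash in front of it */
  len = strlen (str);
  out = malloc (2 * len + 1);
  if (out == NULL)
    return NULL;

  for (p = out; *str != '\0'; str++)
    {
      if (strchr (special, *str) != NULL)
        *p++ = '\\';
      *p++ = *str;
    }
  *p = '\0';

  return out;
}
```

A caller would escape user-supplied text this way before concatenating it into a larger pattern, so that a search for "1+1" looks for those three characters and not for one or more "1" followed by "1".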


I have also written an (ugly and incomplete) automatic test for
EggRegex; it is attached as testregexauto.c.


--
Marco Barisione


--- libegg/libegg/regex/eggregex.c 2004-07-31 23:57:06.000000000 +0200
+++ eggregex.c 2005-06-08 21:19:01.333030256 +0200
@@ -1,6 +1,7 @@
 /* EggRegex -- regular expression API wrapper around PCRE.
  * Copyright (C) 1999, 2000 Scott Wimer
  * Copyright (C) 2004 Matthias Clasen <[hidden email]>
+ * Copyright (C) 2005 Marco Barisione <[hidden email]>
  *
  * This is basically an ease of user wrapper around the functionality of
  * PCRE.
@@ -47,18 +48,43 @@
 /* FIXME when this is in glib */
 #define _(s) s
 
+/* FIXME move this to the sgml file:
+<!-- ##### USER_FUNCTION EggRegexEvalCallback ##### -->
+<para>
+Specifies the type of the function passed to egg_regex_replace_eval().
+It is called for each occurrence of the pattern @regex in @string, and it
+should append the replacement to @result.
+</para>
+
+<para>
+Do not call on @regex functions that modify its internal state, such as
+egg_regex_match(); if you need it you can create a temporary copy of
+@regex using egg_regex_copy().
+</para>
+
+@regex: a #EggRegex.
+@string: the string used to perform matches against.
+@result: a #GString containing the new string.
+@user_data: user data passed to egg_regex_replace_eval().
+Returns: %FALSE to continue the replacement process, %TRUE to stop it.
+FIXME: I would prefer to use %FALSE to stop the process.
+FIXME: We could return the replacement string or NULL to stop the process.
+*/
+
 struct _EggRegex
 {
   gchar *pattern;       /* the pattern */
   pcre *regex; /* compiled form of the pattern */
   pcre_extra *extra; /* data stored when egg_regex_optimize() is used */
+  gboolean use_offsets; /* use offsets or indexes */
   gint matches; /* number of matching sub patterns */
   gint pos; /* position in the string where last match left off */
   gint *offsets; /* array of offsets paired 0,1 ; 2,3 ; 3,4 etc */
   gint n_offsets; /* number of offsets */
   EggRegexCompileFlags compile_opts; /* options used at compile time on the pattern */
   EggRegexMatchFlags match_opts; /* options used at match time on the regex */
-  gint string_len; /* length of the string last used against */
+  gssize string_len; /* length of the string last used against */
+  gint start_position; /* starting index in the string */
   GSList *delims; /* delimiter sub strings from split next */
 };
 
@@ -75,20 +101,29 @@
 
 /**
  * egg_regex_new:
- * @pattern: the regular expression
- * @compile_options: compile options for the regular expression
- * @match_options: match options for the regular expression
- * @error: return location for a #GError
+ * @pattern: the regular expression.
+ * @compile_options: compile options for the regular expression.
+ * @match_options: match options for the regular expression.
+ * @use_offsets: use offsets or indexes
+ * @error: return location for a #GError.
  *
  * Compiles the regular expression to an internal form, and does the initial
  * setup of the #EggRegex structure.  
  *
- * Returns: a #EggRegex structure
+ * If @use_offsets is %TRUE then all the lengths and positions are offsets,
+ * i.e. they are character counts; if %FALSE, indexes (i.e. byte counts) are
+ * used. You should normally pass %TRUE, as glib uses offsets and this will
+ * avoid lots of errors; use indexes only if you already have indexes, to
+ * improve performance by avoiding a double conversion.
+ * Once the structure is created you cannot change this option anymore.
+ *
+ * Returns: a #EggRegex structure.
  */
 EggRegex *
 egg_regex_new (const gchar         *pattern,
      EggRegexCompileFlags   compile_options,
      EggRegexMatchFlags     match_options,
+     gboolean             use_offsets,
      GError             **error)
 {
   EggRegex *regex = g_new0 (EggRegex, 1);
@@ -96,12 +131,16 @@
   gint erroffset;
   gint capture_count;
   
+  g_return_val_if_fail (pattern != NULL, NULL);
+
   /* preset the parts of gregex that need to be set, regardless of the
    * type of match that will be checked */
   regex->pattern = g_strdup (pattern);
   regex->extra = NULL;
   regex->pos = 0;
   regex->string_len = -1; /* not set yet */
+  regex->start_position = -1; /* not set yet */
+  regex->use_offsets = use_offsets;
 
   /* set the options */
   regex->compile_opts = compile_options | PCRE_UTF8 | PCRE_NO_UTF8_CHECK;
@@ -122,7 +161,10 @@
        pattern, erroffset, errmsg);
       g_propagate_error (error, tmp_error);
 
-      return regex;
+      g_free (regex->pattern);
+      g_free (regex);
+
+      return NULL;
     }
 
   /* otherwise, find out how many sub patterns exist in this pattern,
@@ -135,16 +177,18 @@
   return regex;
 }
 
-
 /**
  * egg_regex_free:
- * @regex: a #EggRegex structure from egg_regex_new()
+ * @regex: a #EggRegex structure from egg_regex_new().
  *
  * Frees all the memory associated with the regex structure.
  */
 void
 egg_regex_free (EggRegex *regex)
 {
+  if (regex == NULL)
+    return;
+
   g_free (regex->pattern);
   g_slist_free (regex->delims);
   g_free (regex->offsets);
@@ -155,10 +199,53 @@
   g_free (regex);
 }
 
+/**
+ * egg_regex_copy:
+ * @regex: a #EggRegex structure from egg_regex_new().
+ *
+ * Copies a #EggRegex.
+ *
+ * Returns: a newly allocated copy of @regex.
+ */
+EggRegex *
+egg_regex_copy (const EggRegex *regex)
+{
+  EggRegex *copy;
+
+  g_return_val_if_fail (regex != NULL, NULL);
+
+  copy = egg_regex_new (regex->pattern, regex->compile_opts,
+      regex->match_opts, regex->use_offsets, NULL);
+  /* egg_regex_new() should not fail */
+  g_return_val_if_fail (copy != NULL, NULL);
+
+  /* if the regex has been studied we also study the copy */
+  if (regex->extra != NULL)
+    egg_regex_optimize (copy, NULL);
+
+  return copy;
+}
+
+/**
+ * egg_regex_get_pattern:
+ * @regex: a #EggRegex structure.
+ *
+ * Returns a reference to the regular expression pattern of
+ * @regex.
+ *
+ * Returns: the pattern string passed to egg_regex_new().
+ */
+const gchar *
+egg_regex_get_pattern (const EggRegex *regex)
+{
+  g_return_val_if_fail (regex != NULL, NULL);
+
+  return regex->pattern;
+}
 
 /**
  * egg_regex_clear:
- * @regex: a #EggRegex structure
+ * @regex: a #EggRegex structure.
  *
  * Clears out the members of @regex that are holding information about the
  * last set of matches for this pattern.  egg_regex_clear() needs to be
@@ -168,8 +255,11 @@
 void
 egg_regex_clear (EggRegex *regex)
 {
+  g_return_if_fail (regex != NULL);
+
   regex->matches = -1;
   regex->string_len = -1;
+  regex->start_position = -1;
   regex->pos = 0;
 
   /* if the pattern was used with egg_regex_split_next(), it may have
@@ -180,8 +270,8 @@
 
 /**
  * egg_regex_optimize:
- * @regex: a #EggRegex structure
- * @error: return location for a #GError
+ * @regex: a #EggRegex structure.
+ * @error: return location for a #GError.
  *
  * If the pattern will be used many times, then it may be worth the
  * effort to optimize it to improve the speed of matches.
@@ -192,6 +282,11 @@
 {
   const gchar *errmsg;
 
+  g_return_if_fail (regex != NULL);
+
+  if (regex->extra != NULL)
+    return;
+
   regex->extra = _pcre_study (regex->regex, 0, &errmsg);
 
   if (errmsg)
@@ -208,39 +303,82 @@
 
 /**
  * egg_regex_match:
- * @regex: a #EggRegex structure from egg_regex_new()
- * @string: the string to scan for matches
- * @string_len: the length of @string, or -1 to use strlen()
- * @match_options:  match options
- *
- * Scans for a match in string for the pattern in @regex. The starting index
- * of the match goes into the pos member of the @regex struct. The indexes
- * of the full match, and all matches get stored off in the offsets array.
- *
- * The @match_options are combined with the match options specified when the
- * @regex structure was created, letting you have more flexibility in reusing
- * #EggRegex structures.
- *
- * Returns:  Number of matched substrings + 1, or 1 if the pattern has no
- *           substrings in it.  Returns #GREGEX_NOMATCH if the pattern
- *           did not match.
+ * @regex: a #EggRegex structure from egg_regex_new().
+ * @string: the string to scan for matches.
+ * @match_options:  match options.
+ *
+ * Scans for a match in string for the pattern in @regex. The @match_options
+ * are combined with the match options specified when the @regex structure
+ * was created, letting you have more flexibility in reusing #EggRegex
+ * structures.
+ *
+ * Returns: %TRUE if the string matched, %FALSE otherwise.
  */
-gint
+gboolean
 egg_regex_match (EggRegex          *regex,
        const gchar     *string,
-       gssize           string_len,
        EggRegexMatchFlags match_options)
 {
+  return egg_regex_match_extended (regex, string, -1, 0,
+   match_options, NULL);
+}
+
+/**
+ * egg_regex_match_extended:
+ * @regex: a #EggRegex structure from egg_regex_new().
+ * @string: the string to scan for matches.
+ * @string_len: the length of @string in bytes, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @match_options:  match options.
+ * @error: location to store the error occurring, or NULL to ignore errors.
+ *
+ * Scans for a match in string for the pattern in @regex. The @match_options
+ * are combined with the match options specified when the @regex structure
+ * was created, letting you have more flexibility in reusing #EggRegex
+ * structures.
+ *
+ * Setting @start_position differs from just passing over a shortened string
+ * and  setting EGG_REGEX_MATCH_NOTBOL in the case of a pattern that begins
+ * with any kind of lookbehind assertion, such as "\b".
+ *
+ * Returns: %TRUE if the string matched, %FALSE otherwise.
+ */
+gboolean
+egg_regex_match_extended (EggRegex          *regex,
+        const gchar       *string,
+ gssize             string_len,
+ gint               start_position,
+ EggRegexMatchFlags match_options,
+ GError           **error)
+{
+  g_return_val_if_fail (regex != NULL, FALSE);
+  g_return_val_if_fail (string != NULL, FALSE);
+  g_return_val_if_fail (start_position >= 0, FALSE);
+
   if (string_len < 0)
     string_len = strlen (string);
+  else if (regex->use_offsets)
+    string_len = g_utf8_offset_to_pointer (string, string_len) - string;
+
+  if (regex->use_offsets)
+    start_position = g_utf8_offset_to_pointer (string, start_position) - string;
 
   regex->string_len = string_len;
+  regex->start_position = start_position;
 
   /* perform the match */
   regex->matches = _pcre_exec (regex->regex, regex->extra,
-       string, regex->string_len, 0,
+       string, regex->string_len,
+       regex->start_position,
        regex->match_opts | match_options,
        regex->offsets, regex->n_offsets);
+  if (regex->matches < PCRE_ERROR_NOMATCH)
+  {
+    g_set_error (error, EGG_REGEX_ERROR, EGG_REGEX_ERROR_MATCH,
+ _("Error while matching regular expression %s"),
+ regex->pattern);
+    return FALSE;
+  }
 
   /* if the regex matched, set regex->pos to the character past the
    * end of the match.
@@ -248,40 +386,72 @@
   if (regex->matches > 0)
     regex->pos = regex->offsets[1];
 
-  return regex->matches; /* return what pcre_exec() returned */
+  return regex->matches >= 0;
 }
 
-
 /**
  * egg_regex_match_next:
- * @regex: a #EggRegex structure
- * @string: the string to scan for matches
- * @string_len: the length of @string, or -1 to use strlen()
- * @match_options: the match options
- *
- * Scans for the next match in @string of the pattern in @regex.  The starting
- * index of the match goes into the pos member of the @regex struct.  The
- * indexes of the full match, and all matches get stored off in the offsets
- * array.  The match options are ored with the match options set when
+ * @regex: a #EggRegex structure.
+ * @string: the string to scan for matches.
+ * @match_options: the match options.
+ *
+ * Scans for the next match in @string of the pattern in @regex.
+ * The match options are combined with the match options set when
  * the @regex was created.
  *
- * You have to call egg_regex_clear() to reuse the same pattern on a new string.
- * This is especially true for use with egg_regex_match_next().
+ * You have to call egg_regex_clear() to reuse the same pattern on a new
+ * string. This is especially true for use with egg_regex_match_next().
  *
- * Returns:  Number of matched substrings + 1, or 1 if the pattern has no
- *           substrings in it.  Returns #GREGEX_NOMATCH if the pattern
- *           did not match.
+ * Returns: %TRUE if the string matched, %FALSE otherwise.
  */
-gint
+gboolean
 egg_regex_match_next (EggRegex          *regex,
     const gchar     *string,
-    gssize           string_len,
     EggRegexMatchFlags match_options)
 {
+  return egg_regex_match_next_extended (regex, string, -1, 0,
+ match_options, NULL);
+}
+
+/**
+ * egg_regex_match_next_extended:
+ * @regex: a #EggRegex structure.
+ * @string: the string to scan for matches.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @match_options: the match options.
+ * @error: location to store the error occurring, or NULL to ignore errors.
+ *
+ * Scans for the next match in @string of the pattern in @regex.
+ * The match options are combined with the match options set when
+ * the @regex was created.
+ *
+ * You have to call egg_regex_clear() to reuse the same pattern on a new
+ * string. This is especially true for use with
+ * egg_regex_match_next_extended().
+ *
+ * Setting @start_position differs from just passing over a shortened string
+ * and  setting EGG_REGEX_MATCH_NOTBOL in the case of a pattern that begins
+ * with any kind of lookbehind assertion, such as "\b".
+ *
+ * Returns: %TRUE if the string matched, %FALSE otherwise.
+ */
+gboolean
+egg_regex_match_next_extended (EggRegex          *regex,
+     const gchar       *string,
+     gssize             string_len,
+     gint               start_position,
+     EggRegexMatchFlags match_options,
+     GError           **error)
+{
+  g_return_val_if_fail (regex != NULL, FALSE);
+  g_return_val_if_fail (string != NULL, FALSE);
+  g_return_val_if_fail (start_position >= 0, FALSE);
+
   /* if this regex hasn't been used on this string before, then we
    * need to calculate the length of the string, and set pos to the
    * start of it.  
-   * Knowing if this regex has been used on this string is a bit of
+   * Knowing if this regex has been used on this string is a bit of
    * a challenge.  For now, we require the user to call egg_regex_clear()
    * in between usages on a new string.  Not perfect, but not such a
    * bad solution either.
@@ -290,140 +460,227 @@
     {
       if (string_len < 0)
  string_len = strlen (string);
-      
+      else if (regex->use_offsets)
+        string_len = g_utf8_offset_to_pointer (string, string_len) - string;
+
       regex->string_len = string_len;
+
+      if (regex->use_offsets)
+        start_position = g_utf8_offset_to_pointer (string, start_position) - string;
+      regex->start_position = start_position;
     }
 
+
   /* perform the match */
   regex->matches = _pcre_exec (regex->regex, regex->extra,
-       string + regex->pos,
-       regex->string_len - regex->pos,
-       0, regex->match_opts | match_options,
+       string, regex->string_len,
+       regex->start_position + regex->pos,
+       regex->match_opts | match_options,
        regex->offsets, regex->n_offsets);
 
-  /* if the regex matched, adjust the offsets array to take into account
-   * the fact that the string they're out of is shorter than the string
-   * that the caller passed us, by regex->pos to be exact.
-   * Then, update regex->pos to take into account the new starting point.
-   */
-  if (regex->matches > 0)
+  /* avoid infinite loops if regex is an empty string or something
+   * equivalent */
+  if (regex->pos == regex->offsets[1])
+    {
+      regex->pos++;
+      if (regex->pos > regex->string_len)
+        /* we have reached the end of the string */
+        return FALSE;
+    }
+  else
     {
-      gint i, pieces;
-      pieces = (regex->matches * 2) - 1;
-
-      for (i = 0; i <= pieces; i++)
- regex->offsets[i] += regex->pos;
-
       regex->pos = regex->offsets[1];
     }
 
-  return regex->matches;
+  return regex->matches >= 0;
 }
 
+/**
+ * egg_regex_get_match_count:
+ * @regex: a #EggRegex structure.
+ *
+ * Returns:  Number of matched substrings + 1 in the last call to
+ *           egg_regex_match*(), or 1 if the pattern has no
+ *           substrings in it. Returns -1 if the pattern did not
+ *           match.
+ */
+gint
+egg_regex_get_match_count (const EggRegex *regex)
+{
+  g_return_val_if_fail (regex != NULL, -1);
+
+  return regex->matches;
+}
 
 /**
  * egg_regex_fetch:
- * @regex: #EggRegex structure used in last match
- * @string: the string on which the last match was made
- * @match_num: number of the sub expression
+ * @regex: #EggRegex structure used in last match.
+ * @string: the string on which the last match was made.
+ * @match_num: number of the sub expression.
  *
  * Retrieves the text matching the @match_num<!-- -->'th capturing parentheses.
  * 0 is the full text of the match, 1 is the first paren set, 2 the second,
  * and so on.
  *
- * Returns: The matched substring.  You have to free it yourself.
+ * Returns: The matched substring, or %NULL if an error occurred.
+ *          You have to free the string yourself.
  */
 gchar *
-egg_regex_fetch (EggRegex      *regex,
+egg_regex_fetch (const EggRegex *regex,
        const gchar *string,
        gint         match_num)
 {
-  gchar *match;
+  gchar *match = NULL;
+
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (match_num >= 0, NULL);
+  g_return_val_if_fail (regex->start_position >= 0, NULL);
 
   /* make sure the sub expression number they're requesting is less than
    * the total number of sub expressions that were matched. */
   if (match_num >= regex->matches)
     return NULL;
 
-  _pcre_get_substring (string, regex->offsets, regex->matches,
-       match_num, (const char **)&match);
+  _pcre_get_substring (string, regex->offsets, regex->matches, match_num,
+                       (const char **)&match);
 
   return match;
 }
 
 /**
  * egg_regex_fetch_pos:
- * @regex: #EggRegex structure used in last match
- * @string: the string on which the last match was made
- * @match_num: number of the sub expression
- * @start_pos: pointer to location where to store the start position
- * @end_pos: pointer to location where to store the end position
+ * @regex: #EggRegex structure used in last match.
+ * @string: the string on which the last match was made.
+ * @match_num: number of the sub expression.
+ * @start_pos: pointer to location where to store the start position.
+ * @end_pos: pointer to location where to store the end position.
  *
  * Retrieves the position of the @match_num<!-- -->'th capturing parentheses.
  * 0 is the full text of the match, 1 is the first paren set, 2 the second,
  * and so on.
+ *
+ * @string is needed only to convert from indexes to offsets, so if
+ * you are using indexes (i.e. you passed %FALSE as 4th argument to
+ * egg_regex_new), you can set @string to %NULL.
+ *
+ * Returns: %TRUE if the position was fetched, %FALSE otherwise. If the
+ *          position cannot be fetched, @start_pos and @end_pos are left
+ *          unchanged.
  */
-void
-egg_regex_fetch_pos (EggRegex      *regex,
-     const gchar *string,
-     gint         match_num,
-     gint        *start_pos,
-     gint        *end_pos)
-{
+gboolean
+egg_regex_fetch_pos (const EggRegex    *regex,
+   const gchar *string,
+   gint         match_num,
+   gint        *start_pos,
+   gint        *end_pos)
+{
+  g_return_val_if_fail (regex != NULL, FALSE);
+  g_return_val_if_fail (match_num >= 0, FALSE);
+  g_return_val_if_fail (!regex->use_offsets || string != NULL, FALSE);
+
   /* make sure the sub expression number they're requesting is less than
    * the total number of sub expressions that were matched. */
   if (match_num >= regex->matches)
-    return;
+    return FALSE;
 
   if (start_pos)
-    *start_pos = regex->offsets[2 * match_num];
+    {
+      *start_pos = regex->offsets[2 * match_num];
+      if (regex->use_offsets)
+        *start_pos = g_utf8_pointer_to_offset (string, &string[*start_pos]);
+    }
 
   if (end_pos)
-    *end_pos = regex->offsets[2 * match_num + 1];
+    {
+      *end_pos = regex->offsets[2 * match_num + 1];
+      if (regex->use_offsets)
+        *end_pos = g_utf8_pointer_to_offset (string, &string[*end_pos]);
+    }
+
+  return TRUE;
 }
 
 /**
  * egg_regex_fetch_named:
- * @regex: #EggRegex structure used in last match
- * @string: the string on which the last match was made
- * @name: name of the subexpression
+ * @regex: #EggRegex structure used in last match.
+ * @string: the string on which the last match was made.
+ * @name: name of the subexpression.
  *
  * Retrieves the text matching the capturing parentheses named @name.
  *
- * Returns: The matched substring.  You have to free it yourself.
+ * Returns: The matched substring, or %NULL if an error occurred.
+ *          You have to free the string yourself.
  */
 gchar *
-egg_regex_fetch_named (EggRegex      *regex,
-     const gchar *string,
-     const gchar *name)
-{
-  gchar *match;
+egg_regex_fetch_named (const EggRegex *regex,
+     const gchar  *string,
+     const gchar  *name)
+{
+  gchar *match = NULL;
+
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (string != NULL, NULL);
+  g_return_val_if_fail (name != NULL, NULL);
 
-  _pcre_get_named_substring (regex->regex,
-     string, regex->offsets, regex->matches,
+  _pcre_get_named_substring (regex->regex,
+     string, regex->offsets, regex->matches,
      name, (const char **)&match);
 
   return match;
 }
 
 /**
+ * egg_regex_fetch_named_pos:
+ * @regex: #EggRegex structure used in last match.
+ * @string: the string on which the last match was made.
+ * @name: name of the subexpression.
+ * @start_pos: pointer to location where to store the start position.
+ * @end_pos: pointer to location where to store the end position.
+ *
+ * Retrieves the position of the capturing parentheses named @name.
+ *
+ * Returns: %TRUE if the position was fetched, %FALSE otherwise. If the
+ *          position cannot be fetched, @start_pos and @end_pos are left
+ *          unchanged.
+ */
+gboolean
+egg_regex_fetch_named_pos (const EggRegex *regex,
+ const gchar  *string,
+ const gchar  *name,
+ gint         *start_pos,
+ gint         *end_pos)
+{
+  gint num;
+
+  num = egg_regex_expression_number_from_name (regex, name);
+  if (num == -1)
+    return FALSE;
+
+  return egg_regex_fetch_pos (regex, string, num, start_pos, end_pos);
+}
+
+/**
  * egg_regex_fetch_all:
- * @regex: a #EggRegex structure
- * @string: the string on which the last match was made
+ * @regex: a #EggRegex structure.
+ * @string: the string on which the last match was made.
  *
- * Bundles up pointers to each of the matching substrings from a match
+ * Bundles up pointers to each of the matching substrings from a match
  * and stores then in an array of gchar pointers.
  *
- * Returns: a %NULL-terminated array of gchar * pointers. It must be freed using
- * g_strfreev(). If the memory can't be allocated, returns %NULL.
+ * Returns: a %NULL-terminated array of gchar * pointers. It must be freed
+ *          using g_strfreev(). If the memory can't be allocated, returns
+ *          %NULL.
  */
 gchar **
-egg_regex_fetch_all (EggRegex      *regex,
-   const gchar *string)
+egg_regex_fetch_all (const EggRegex *regex,
+   const gchar  *string)
 {
   gchar **listptr = NULL; /* the list pcre_get_substring_list() will fill */
   gchar **result;
 
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (string != NULL, NULL);
+
   if (regex->matches < 0)
     return NULL;
   
@@ -444,17 +701,46 @@
   return result;
 }
 
+/**
+ * egg_regex_expression_number_from_name:
+ * @regex: #EggRegex structure.
+ * @name: name of the subexpression.
+ *
+ * Retrieves the number of the subexpression named @name.
+ *
+ * Returns: The number of the subexpression or -1 if @name does not exist.
+ */
+gint
+egg_regex_expression_number_from_name (const EggRegex *regex,
+     const gchar  *name)
+{
+  gint num;
+
+  g_return_val_if_fail (regex != NULL, -1);
+  g_return_val_if_fail (name != NULL, -1);
+
+  num = _pcre_get_stringnumber (regex->regex, name);
+  if (num == PCRE_ERROR_NOSUBSTRING)
+    num = -1;
+
+  return num;
+}
 
 /**
  * egg_regex_split:
- * @regex:  a #EggRegex structure
- * @string:  the string to split with the pattern
- * @string_len: the length of @string, or -1 to use strlen()
- * @match_options:  match time option flags
- * @max_pieces:  maximum number of pieces to split the string into,
- *    or 0 for no limit
- *
- * Breaks the string on the pattern, and returns an array of the pieces.  
+ * @regex:  a #EggRegex structure.
+ * @string:  the string to split with the pattern.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @match_options:  match time option flags.
+ * @max_pieces:  maximum number of pieces to split the string into,
+ *    or 0 for no limit.
+ *
+ * Breaks the string on the pattern, and returns an array of the pieces.
+ *
+ * Setting @start_position differs from just passing over a shortened string
+ * and  setting EGG_REGEX_MATCH_NOTBOL in the case of a pattern that begins
+ * with any kind of lookbehind assertion, such as "\b".
  *
  * Returns: a %NULL-terminated gchar ** array. Free it using g_strfreev().
  **/
@@ -465,40 +751,48 @@
        EggRegexMatchFlags  match_options,
        gint              max_pieces)
 {
+  /* FIXME: add a start_position argument */
   gchar **string_list; /* The array of char **s worked on */
   gint pos;
-  gint match_ret;
+  gboolean match_ok;
+  gint match_count;
   gint pieces;
-  gint start_pos;
+  gint new_pos;
   gchar *piece;
   GList *list, *last;
 
-  start_pos = 0;
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (string != NULL, NULL);
+  g_return_val_if_fail (max_pieces >= 0, NULL);
+
+  new_pos = 0;
   pieces = 0;
   list = NULL;
   while (TRUE)
     {
-      match_ret = egg_regex_match_next (regex, string, string_len, match_options);
-      if ((match_ret > 0) && ((max_pieces == 0) || (pieces < max_pieces)))
+      match_ok = egg_regex_match_next_extended (regex, string, string_len, 0,
+                                              match_options, NULL);
+      if (match_ok && ((max_pieces == 0) || (pieces < max_pieces)))
  {
-  piece = g_strndup (string + start_pos, regex->offsets[0] - start_pos);
+  piece = g_strndup (string + new_pos, regex->offsets[0] - new_pos);
   list = g_list_prepend (list, piece);
 
   /* if there were substrings, these need to get added to the
    * list as well */
-  if (match_ret > 1)
+  match_count = egg_regex_get_match_count (regex);
+  if (match_count > 1)
     {
       int i;
-      for (i = 1; i < match_ret; i++)
+      for (i = 1; i < match_count; i++)
  list = g_list_prepend (list, egg_regex_fetch (regex, string, i));
     }
 
-  start_pos = regex->pos; /* move start_pos to end of match */
+  new_pos = regex->pos; /* move new_pos to end of match */
   pieces++;
  }
       else /* if there was no match, copy to end of string, and break */
  {
-  piece = g_strndup (string + start_pos, regex->string_len - start_pos);
+  piece = g_strndup (string + new_pos, regex->string_len - new_pos);
   list = g_list_prepend (list, piece);
   break;
  }
@@ -514,31 +808,40 @@
   return string_list;
 }
 
-
 /**
  * egg_regex_split_next:
- * @pattern:  gchar pointer to the pattern
- * @string:  the string to split on pattern
- * @string_len: the length of @string, or -1 to use strlen()
- * @match_options:  match time options for the regex
+ * @pattern:  gchar pointer to the pattern.
+ * @string:  the string to split on pattern.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @match_options:  match time options for the regex.
  *
- * egg_regex_split_next() breaks the string on pattern, and returns the  
- * pieces, one per call.  If the pattern contains capturing parentheses,
+ * egg_regex_split_next() breaks the string on pattern, and returns the
+ * pieces, one per call.  If the pattern contains capturing parentheses,
  * then the text for each of the substrings will also be returned.
- * If the pattern does not match anywhere in the string, then the whole
+ * If the pattern does not match anywhere in the string, then the whole
  * string is returned as the first piece.
  *
- * Returns:  a gchar * to the next piece of the string
+ * Setting @start_position differs from just passing over a shortened string
+ * and  setting EGG_REGEX_MATCH_NOTBOL in the case of a pattern that begins
+ * with any kind of lookbehind assertion, such as "\b".
+ *
+ * Returns:  a gchar * to the next piece of the string.
  */
 gchar *
-egg_regex_split_next (EggRegex      *regex,
-    const gchar *string,
-    gssize       string_len,
+egg_regex_split_next (EggRegex          *regex,
+    const gchar     *string,
+    gssize           string_len,
     EggRegexMatchFlags match_options)
 {
-  gint start_pos = regex->pos;
+  /* FIXME: add a start_position argument */
+  gint new_pos = regex->pos;
   gchar *piece = NULL;
-  gint match_ret;
+  gboolean match_ok;
+  gint match_count;
+
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (string != NULL, NULL);
 
   /* if there are delimiter substrings stored, return those one at a
    * time.  
@@ -552,43 +855,35 @@
 
   /* otherwise...
    * use egg_regex_match_next() to find the next occurance of the pattern
-   * in the string.  We use start_pos to keep track of where the stuff
+   * in the string.  We use new_pos to keep track of where the stuff
    * up to the current match starts.  Copy that piece of the string off
    * and append it to the buffer using strncpy.  We have to NUL term the
    * piece we copied off before returning it.
    */
-  match_ret = egg_regex_match_next (regex, string, string_len, match_options);
-  if (match_ret > 0)
+  match_ok = egg_regex_match_next_extended (regex, string, string_len,
+                                          0, match_options,
+                                          NULL);
+  if (match_ok)
     {
-      piece = g_strndup (string + start_pos, regex->offsets[0] - start_pos);
+      piece = g_strndup (string + new_pos, regex->offsets[0] - new_pos);
 
       /* if there were substrings, these need to get added to the
        * list of delims */
-      if (match_ret > 1)
+      match_count = egg_regex_get_match_count (regex);
+      if (match_count > 1)
  {
   gint i;
-  for (i = 1; i < match_ret; i++)
+  for (i = 1; i < match_count; i++)
     regex->delims = g_slist_append (regex->delims,
-     egg_regex_fetch (regex, string, i));
+    egg_regex_fetch (regex, string, i));
  }
     }
   else /* if there was no match, copy to end of string */
-    piece = g_strndup (string + start_pos, regex->string_len - start_pos);
+    piece = g_strndup (string + new_pos, regex->string_len - new_pos);
 
   return piece;
 }
 
-static gboolean
-copy_replacement (EggRegex      *regex,
-  const gchar *string,
-  GString     *result,
-          gpointer     data)
-{
-  g_string_append (result, (gchar *)data);
-
-  return FALSE;
-}
-
 enum
 {
   REPL_TYPE_STRING,
@@ -776,7 +1071,13 @@
       p++;
       break;
     case '0':
-      base = 8;
+      /* if \0 is followed by a digit it is an octal number representing a
+       * character, otherwise it is a numeric reference. */
+      if (g_ascii_digit_value (*g_utf8_next_char (p)) >= 0)
+        {
+          base = 8;
+          p = g_utf8_next_char (p);
+        }
     case '1':
     case '2':
     case '3':
@@ -888,10 +1189,10 @@
 }
 
 static gboolean
-interpolate_replacement (EggRegex      *regex,
- const gchar *string,
- GString     *result,
- gpointer     data)
+interpolate_replacement (const EggRegex *regex,
+ const gchar  *string,
+ GString      *result,
+ gpointer      data)
 {
   GList *list;
   InterpolationData *idata;
@@ -927,23 +1228,31 @@
  }
     }
 
-  return FALSE;  
+  return FALSE;
 }
 
 /**
  * egg_regex_replace:
- * @regex:  a #EggRegex structure
- * @string:  the string to perform matches against
- * @string_len: the length of @string, or -1 to use strlen()
- * @replacement:  text to replace each match with
- * @match_options:  options for the match
- *
- * Replaces all occurances of the pattern in @regex with the
- * replacement text. Backreferences of the form '\number' or '\g<number>'
- * in the replacement text are interpolated by the number-th captured
+ * @regex:  a #EggRegex structure.
+ * @string:  the string to perform matches against.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @replacement:  text to replace each match with.
+ * @match_options:  options for the match.
+ * @error: location to store the error occurring, or NULL to ignore errors.
+ *
+ * Replaces all occurrences of the pattern in @regex with the
+ * replacement text. Backreferences of the form '\number' or '\g<number>'
+ * in the replacement text are interpolated by the number-th captured
  * subexpression of the match, '\g<name>' refers to the captured subexpression
- * with the given name. '\0' refers to the complete match. To include a
- * literal '\' in the replacement, write '\\'.
+ * with the given name. '\0' refers to the complete match, but '\0' followed
+ * by a number is the octal representation of a character. To include a
+ * literal '\' in the replacement, write '\\'. If you do not need
+ * backreferences, use egg_regex_replace_literal().
+ *
+ * Setting @start_position differs from passing a shortened string and
+ * setting EGG_REGEX_MATCH_NOTBOL when the pattern begins with an assertion
+ * that looks at the preceding text, such as "\b".
  *
  * Returns: a newly allocated string containing the replacements.
  */
@@ -951,6 +1260,7 @@
 egg_regex_replace (EggRegex            *regex,
  const gchar       *string,
  gssize             string_len,
+ gint               start_position,
  const gchar       *replacement,
  EggRegexMatchFlags   match_options,
  GError           **error)
@@ -958,64 +1268,198 @@
   gchar *result;
   GList *list;
 
+  g_return_val_if_fail (replacement != NULL, NULL);
+
   list = split_replacement (replacement, error);
   result = egg_regex_replace_eval (regex,
- string, string_len,
+ string, string_len, start_position,
  interpolate_replacement,
  (gpointer)list,
  match_options);
   g_list_foreach (list, (GFunc)free_interpolation_data, NULL);
   g_list_free (list);
-  
+
   return result;
 }
 
+static gboolean
+literal_replacement (const EggRegex *regex,
+     const gchar  *string,
+     GString      *result,
+     gpointer      data)
+{
+  g_string_append (result, data);
+  return FALSE;
+}
+
+/**
+ * egg_regex_replace_literal:
+ * @regex:  a #EggRegex structure.
+ * @string:  the string to perform matches against.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @replacement:  text to replace each match with.
+ * @match_options:  options for the match.
+ *
+ * Replaces all occurrences of the pattern in @regex with the
+ * replacement text. @replacement is inserted literally; to
+ * include backreferences use egg_regex_replace().
+ *
+ * Setting @start_position differs from passing a shortened string and
+ * setting EGG_REGEX_MATCH_NOTBOL when the pattern begins with an assertion
+ * that looks at the preceding text, such as "\b".
+ *
+ * Returns: a newly allocated string containing the replacements.
+ */
+gchar *
+egg_regex_replace_literal (EggRegex          *regex,
+ const gchar     *string,
+ gssize           string_len,
+ gint             start_position,
+ const gchar     *replacement,
+ EggRegexMatchFlags match_options)
+{
+  g_return_val_if_fail (replacement != NULL, NULL);
+
+  return egg_regex_replace_eval (regex,
+       string, string_len, start_position,
+       literal_replacement,
+       (gpointer)replacement,
+       match_options);
+}
 
 /**
  * egg_regex_replace_eval:
- * @gregex:  a #EggRegex structure
- * @string:  string to perform matches against
- * @string_len: the length of @string, or -1 to use strlen()
- * @eval: a function to call for each match
- * @match_options:  Options for the match
- *
- * Replaces occurances of the pattern in regex with
- * the output of @eval for that occurance.
+ * @regex:  a #EggRegex structure.
+ * @string:  string to perform matches against.
+ * @string_len: the length of @string, or -1 to use strlen().
+ * @start_position: starting index of the string to match.
+ * @eval: a function to call for each match.
+ * @match_options:  Options for the match.
+ *
+ * Replaces occurrences of the pattern in regex with the output of @eval
+ * for that occurrence.
+ *
+ * Setting @start_position differs from passing a shortened string and
+ * setting EGG_REGEX_MATCH_NOTBOL when the pattern begins with an assertion
+ * that looks at the preceding text, such as "\b".
  *
  * Returns: a newly allocated string containing the replacements.
  */
 gchar *
-egg_regex_replace_eval (EggRegex             *regex,
-      const gchar        *string,
-      gssize              string_len,
-      EggRegexEvalCallback  eval,
-      gpointer            user_data,
-      EggRegexMatchFlags match_options)
+egg_regex_replace_eval (EggRegex            *regex,
+      const gchar       *string,
+      gssize             string_len,
+      gint               start_position,
+      EggRegexEvalCallback eval,
+      gpointer           user_data,
+      EggRegexMatchFlags   match_options)
 {
   GString *result;
   gint str_pos = 0;
   gboolean done = FALSE;
+  gboolean tmp_use_offsets;
+
+  g_return_val_if_fail (regex != NULL, NULL);
+  g_return_val_if_fail (string != NULL, NULL);
 
   if (string_len < 0)
     string_len = strlen (string);
+  else if (regex->use_offsets)
+    string_len = g_utf8_offset_to_pointer (string, string_len) - string;
+
+  if (regex->use_offsets)
+    start_position = g_utf8_offset_to_pointer (string, start_position) - string;
 
   /* clear out the regex for reuse, just in case */
   egg_regex_clear (regex);
 
   result = g_string_sized_new (string_len);
 
+  /* run in index mode */
+  tmp_use_offsets = regex->use_offsets;
+  regex->use_offsets = FALSE;
+
   /* run down the string making matches. */
-  while (egg_regex_match_next (regex, string, string_len, match_options) > 0 && !done)
+  while (!done &&
+ egg_regex_match_next_extended (regex, string, string_len,
+      start_position, match_options, NULL))
     {
-      g_string_append_len (result,
-   string + str_pos,
+      g_string_append_len (result,
+   string + str_pos,
    regex->offsets[0] - str_pos);
+      /* restore use_offsets for the user supplied function */
+      regex->use_offsets = tmp_use_offsets;
       done = (*eval) (regex, string, result, user_data);
+      regex->use_offsets = FALSE;
       str_pos = regex->offsets[1];
     }
-  
+
   g_string_append_len (result, string + str_pos, string_len - str_pos);
 
+  regex->use_offsets = tmp_use_offsets;
+
   return g_string_free (result, FALSE);
 }
 
+/**
+ * egg_regex_escape_string:
+ * @string: the string to escape.
+ * @length: the length of @string in characters or -1 to use g_utf8_strlen().
+ *
+ * Escapes the special characters used for regular expressions in @string,
+ * for instance "a.b*c" becomes "a\.b\*c". This function is useful to
+ * dynamically generate regular expressions.
+ *
+ * @string can contain NUL characters that are replaced with "\0"; in this
+ * case remember to specify the correct length of @string in @length.
+ *
+ * Returns: a newly allocated escaped string.
+ */
+gchar *
+egg_regex_escape_string (const gchar *string,
+       gint         length)
+{
+  GString *escaped;
+  gchar *tmp;
+  gint i;
+
+  g_return_val_if_fail (string != NULL, NULL);
+
+  if (length < 0)
+    length = g_utf8_strlen (string, -1);
+
+  escaped = g_string_new ("");
+  tmp = (gchar*) string;
+  for (i = 0; i < length; i++)
+  {
+    gunichar wc = g_utf8_get_char (tmp);
+    switch (wc)
+    {
+      case '\0':
+        g_string_append (escaped, "\\0");
+        break;
+      case '\\':
+      case '|':
+      case '(':
+      case ')':
+      case '[':
+      case ']':
+      case '{':
+      case '}':
+      case '^':
+      case '$':
+      case '*':
+      case '+':
+      case '?':
+      case '.':
+        g_string_append_unichar (escaped, '\\'); /* fall through */
+      default:
+        g_string_append_unichar (escaped, wc);
+    }
+    tmp = g_utf8_next_char (tmp);
+  }
+
+  return g_string_free (escaped, FALSE);
+}
+
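
For readers who want to try the escaping rule from egg_regex_escape_string()
without building libegg, here is a minimal sketch in plain C. It is not the
code from the patch: escape_regex_bytes() is a hypothetical name, and it
takes the length in bytes rather than in characters. Working on bytes is
safe for UTF-8 input because every character that gets escaped is ASCII.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of EggRegex: returns a newly allocated
 * copy of the first len bytes of src with regex metacharacters escaped.
 * The caller frees the result with free(). */
static char *
escape_regex_bytes (const char *src, size_t len)
{
  char *out;
  size_t i, j = 0;

  assert (src != NULL);

  /* worst case: every byte becomes two bytes, plus the terminator */
  out = malloc (2 * len + 1);
  if (out == NULL)
    return NULL;

  for (i = 0; i < len; i++)
    {
      char c = src[i];

      if (c == '\0')
        {
          /* an embedded NUL becomes the two characters '\' and '0' */
          out[j++] = '\\';
          out[j++] = '0';
          continue;
        }
      /* same set of special characters as the switch in the patch */
      if (strchr ("\\|()[]{}^$*+?.", c) != NULL)
        out[j++] = '\\';
      out[j++] = c;
    }
  out[j] = '\0';
  return out;
}
```

Calling escape_regex_bytes ("a.b*c", 5) yields the string a\.b\*c, the same
result the documentation comment gives for egg_regex_escape_string().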

--- libegg/libegg/regex/eggregex.h 2004-07-05 07:57:25.000000000 +0200
+++ eggregex.h 2005-06-08 16:39:49.000000000 +0200
@@ -1,6 +1,7 @@
 /* EggRegex -- regular expression API wrapper around PCRE.
  * Copyright (C) 1999 Scott Wimer
  * Copyright (C) 2004 Matthias Clasen
+ * Copyright (C) 2005 Marco Barisione <[hidden email]>
  *
  * This is basically an ease of user wrapper around the functionality of
  * PCRE.
@@ -48,7 +49,8 @@
 {
   EGG_REGEX_ERROR_COMPILE,
   EGG_REGEX_ERROR_OPTIMIZE,
-  EGG_REGEX_ERROR_REPLACE
+  EGG_REGEX_ERROR_REPLACE,
+  EGG_REGEX_ERROR_MATCH
 } EggRegexError;
 
 #define EGG_REGEX_ERROR egg_regex_error_quark ()
@@ -77,51 +79,65 @@
 
 typedef struct _EggRegex  EggRegex;
 
-typedef gboolean (*EggRegexEvalCallback) (EggRegex*, const gchar*, GString*, gpointer);
+typedef gboolean (*EggRegexEvalCallback) (const EggRegex*, const gchar*, GString*, gpointer);
 
-/* Really quick outline of features... functions are preceded by 'egg_regex_'
- *   new         - compile a pattern and put it in a egg_regex structure
- *   free        - free up the memory used by the egg_regex structure
- *   clear       - clear out the structure to match against a new string
- *   optimize    - study the pattern to make matching more efficient
- *   match       - try matching a pattern in the string
- *   match_next  - try matching pattern again in the string
- *   fetch       - fetch a particular matching sub pattern
- *   fetch_all   - get all of the matching sub patterns
- *   split       - split the string on a regex
- *   split_next  - for using split as an iterator of sorts
- *   replace     - replace occurances of a pattern with some text
- */
 
 EggRegex  *egg_regex_new          (const gchar           *pattern,
    EggRegexCompileFlags   compile_options,
    EggRegexMatchFlags     match_options,
+   gboolean               use_offsets,
    GError               **error);
 void       egg_regex_optimize     (EggRegex              *regex,
    GError               **error);
 void       egg_regex_free         (EggRegex              *regex);
+EggRegex  *egg_regex_copy  (const EggRegex        *regex);
+const gchar * egg_regex_get_pattern
+  (const EggRegex        *regex);
 void       egg_regex_clear        (EggRegex              *regex);
 gint       egg_regex_match        (EggRegex              *regex,
    const gchar           *string,
-   gssize                 string_len,
    EggRegexMatchFlags     match_options);
-gint       egg_regex_match_next   (EggRegex              *regex,
+gint       egg_regex_match_extended
+  (EggRegex              *regex,
    const gchar           *string,
    gssize                 string_len,
+   gint                   start_position,
+   EggRegexMatchFlags     match_options,
+   GError               **error);
+gint       egg_regex_match_next   (EggRegex              *regex,
+   const gchar           *string,
    EggRegexMatchFlags     match_options);
-gchar     *egg_regex_fetch        (EggRegex              *regex,
+gint       egg_regex_match_next_extended
+  (EggRegex              *regex,
+   const gchar           *string,
+   gssize                 string_len,
+   gint                   start_position,
+   EggRegexMatchFlags     match_options,
+   GError               **error);
+gint       egg_regex_get_match_count
+  (const EggRegex        *regex);
+gchar     *egg_regex_fetch        (const EggRegex        *regex,
    const gchar           *string,
    gint                   match_num);
-void       egg_regex_fetch_pos    (EggRegex              *regex,
+gboolean   egg_regex_fetch_pos    (const EggRegex        *regex,
    const gchar           *string,
    gint                   match_num,
    gint                  *start_pos,
    gint                  *end_pos);
-gchar     *egg_regex_fetch_named  (EggRegex              *regex,
+gchar     *egg_regex_fetch_named  (const EggRegex        *regex,
    const gchar           *string,
    const gchar           *name);
-gchar    **egg_regex_fetch_all    (EggRegex              *regex,
+gboolean   egg_regex_fetch_named_pos
+  (const EggRegex        *regex,
+   const gchar           *string,
+   const gchar           *name,
+   gint                  *start_pos,
+   gint                  *end_pos);
+gchar    **egg_regex_fetch_all    (const EggRegex        *regex,
    const gchar           *string);
+gint       egg_regex_expression_number_from_name
+  (const EggRegex        *regex,
+   const gchar           *name);
 gchar    **egg_regex_split        (EggRegex              *regex,
    const gchar           *string,
    gssize                 string_len,
@@ -134,16 +150,27 @@
 gchar     *egg_regex_replace      (EggRegex              *regex,
    const gchar           *string,
    gssize                 string_len,
+   gint                   start_position,
    const gchar           *replacement,
    EggRegexMatchFlags     match_options,
    GError               **error);
+gchar     *egg_regex_replace_literal
+  (EggRegex              *regex,
+   const gchar           *string,
+   gssize                 string_len,
+   gint                   start_position,
+   const gchar           *replacement,
+   EggRegexMatchFlags     match_options);
 gchar     *egg_regex_replace_eval (EggRegex              *regex,
    const gchar           *string,
    gssize                 string_len,
+   gint                   start_position,
    EggRegexEvalCallback   eval,
    gpointer               user_data,
    EggRegexMatchFlags     match_options);
-
+gchar     *egg_regex_escape_string
+  (const gchar           *string,
+   gint                   length);
 
 
 G_END_DECLS
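
A note on offsets versus indexes, since the test program that follows
checks both modes: an offset counts UTF-8 characters, an index counts bytes.
The sketch below shows the conversion in plain C. utf8_offset_to_index() is
a hypothetical helper written for this illustration (inside the patch the
conversion is done with g_utf8_offset_to_pointer()), and it assumes valid,
NUL-terminated UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of EggRegex or GLib: returns the byte
 * index of the character at the given offset in a valid, NUL-terminated
 * UTF-8 string.  Stops at the terminator if offset is past the end. */
static size_t
utf8_offset_to_index (const char *str, size_t offset)
{
  size_t index = 0;

  assert (str != NULL);

  while (offset > 0 && str[index] != '\0')
    {
      /* step over the lead byte, then over continuation bytes (10xxxxxx) */
      index++;
      while (((unsigned char) str[index] & 0xC0) == 0x80)
        index++;
      offset--;
    }
  return index;
}
```

In the string "€Ab" the euro sign takes three bytes, so the character "b"
sits at offset 2 but at index 4. That is why the same match is expected at
2..3 in the tests with use_offsets set to TRUE and at 4..5 in the tests with
use_offsets set to FALSE.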

/*
 * Copyright (C) 2005  Marco Barisione <[hidden email]>
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Library General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Library General Public License for more details.
 *
 * You should have received a copy of the GNU Library General Public
 * License along with this library; if not, write to the
 * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 * Boston, MA 02111-1307, USA.
 */

#include <glib.h>
#include <glib/gprintf.h>
#include <string.h>
#include "eggregex.h"

static gint error_count = 0;

static gboolean
test_value (gboolean cond, const gchar *format, ...)
{
  va_list args;
  va_start (args, format);
  if (!cond)
    {
      g_vprintf (format, args);
      g_printf ("\n");
      error_count++;
    }
  va_end (args);
  return cond;
}


/* structures used to test egg_regex_new(), egg_regex_copy(),
 * egg_regex_get_pattern(), egg_regex_match(). */
struct _TestMatchElement
  {
    gboolean result;
    const gchar *text;
  };
typedef struct _TestMatchElement TestMatchElement;

struct _TestMatch
  {
    const gchar *pattern;
    EggRegexCompileFlags compile_opts;
    TestMatchElement tests [10];
  };
typedef struct _TestMatch TestMatch;

TestMatch match_tests [] =
  {
    {"a", 0,
     {{TRUE,  "a"},
      {FALSE, "A"},
      {TRUE,  "bab"},
      {FALSE, "b"},
      {FALSE, NULL}}
    },
    {"a", EGG_REGEX_CASELESS,
     {{TRUE,  "a"},
      {TRUE,  "A"},
      {TRUE,  "bab"},
      {FALSE, "b"},
      {FALSE, NULL}}
    },
    {"a", EGG_REGEX_ANCHORED,
     {{TRUE,  "a"},
      {FALSE, "A"},
      {FALSE, "bab"},
      {FALSE, "b"},
      {FALSE, NULL}}
    },
    {NULL, 0,
     {{FALSE, NULL}}
    }
  };


/* structures used to test egg_regex_match_extended(),
 * egg_regex_fetch(), egg_regex_fetch_pos(). */
struct _TestMatchSubElement
  {
    gint start_position;
    gboolean match_result;
    gint sub_number;
    gint start, end;
    const gchar *sub_text;
    const gchar *text;
  };
typedef struct _TestMatchSubElement TestMatchSubElement;

struct _TestMatchSub
  {
    const gchar *pattern;
    EggRegexCompileFlags compile_opts;
    gboolean use_offsets;
    TestMatchSubElement tests [10];
  };
typedef struct _TestMatchSub TestMatchSub;

TestMatchSub match_sub_tests [] =
  {
    {"a", 0, TRUE,
     {{0, TRUE,  0, 0,  1,  "a", "a"},
      {1, FALSE, 0, -1, -1, NULL, "abc"},
      {0, FALSE, 0, 0,  0,  NULL, NULL}}
    },
    {"a(.)", EGG_REGEX_CASELESS, TRUE,
     {{0, TRUE,  1, 2,  3,  "b", "€Ab"},
      {0, TRUE,  1, 2,  3,  "è", "€Aè"},
      {1, TRUE,  1, 3,  4,  "è", "a€Aè"},
      {1, FALSE, 1, -1, -1, NULL, "ab"},
      {0, FALSE, 0, 0,  0,  NULL, NULL}}
    },
    {"a(.)", EGG_REGEX_CASELESS, FALSE,
     {{0, TRUE,  1, 4,  5,  "b", "€Ab"},
      {0, TRUE,  1, 4,  6,  "è", "€Aè"},
      {1, TRUE,  1, 5,  7,  "è", "a€Aè"},
      {1, FALSE, 1, -1, -1, NULL, "ab"},
      {0, FALSE, 0, 0,  0,  NULL, NULL}}
    },
    {NULL, 0, FALSE,
     {{0, FALSE, 0, 0,  0,  NULL, NULL}}
    }
  };


/* structures used to test egg_regex_fetch_named() and
 * egg_regex_fetch_named_pos(). */
struct _TestMatchNamedElement
  {
    gint start_position;
    gboolean match_result;
    gint start, end;
    const gchar *name;
    const gchar *sub_text;
    const gchar *text;
  };
typedef struct _TestMatchNamedElement TestMatchNamedElement;

struct _TestMatchNamed
  {
    const gchar *pattern;
    EggRegexCompileFlags compile_opts;
    gboolean use_offsets;
    TestMatchNamedElement tests [10];
  };
typedef struct _TestMatchNamed TestMatchNamed;

TestMatchNamed match_named_tests [] =
  {
    {"a(?P<first>.)(?P<second>.)?", EGG_REGEX_CASELESS, TRUE,
     {{0, TRUE,   2,  3, "first", "b", "€Ab"},
      {0, TRUE,  -1, -1, "second", NULL, "€Aè"},
      {1, TRUE,   3,  4, "first", "è", "a€Aèx"},
      {1, TRUE,   4,  5, "second", "x", "a€Aèx"},
      {1, TRUE,  -1, -1, "third", NULL, "a€Aèx"},
      {0, FALSE, -1, -1, "first", NULL, "xb"},
      {1, FALSE, -1, -1, "first", NULL, "ab"},
      {0, FALSE,  0,  0, NULL, NULL, NULL}}
    },
    {"a(?P<first>.)(?P<second>.)?", EGG_REGEX_CASELESS, FALSE,
     {{0, TRUE,   4,  5, "first", "b", "€Ab"},
      {0, TRUE,  -1, -1, "second", NULL, "€Aè"},
      {1, TRUE,  -1, -1, "third", NULL, "a€Aèx"},
      {1, FALSE, -1, -1, "first", NULL, "ab"},
      {0, FALSE,  0,  0, NULL, NULL, NULL}}
    },
    {NULL, 0, FALSE,
     {{0, FALSE, 0,  0,  NULL, NULL, NULL}}
    }
  };


/* tests used for egg_regex_escape_string(). */
struct _EscapeTest
  {
    gint len;
    const gchar *src;
    const gchar *res;
  };
typedef struct _EscapeTest EscapeTest;

EscapeTest escape_test[] =
  {{-1, "hello world", "hello world"},
   {5,  "hello world", "hello"},
   {-1, "hello.world", "hello\\.world"},
   {-1, "a(b\\b.$", "a\\(b\\\\b\\.\\$"},
   {-1, "hello\0world", "hello"},
   {11, "hello\0world", "hello\\0world"}};


/* tests */

static void
test_new ()
{
  TestMatch *test = match_tests;

  while (test->pattern != NULL)
    {
      EggRegex *regex = egg_regex_new (test->pattern, test->compile_opts,
                                       0, TRUE, NULL);
      test_value (regex != NULL,
                  "egg_regex_new() failed with pattern %s",
                  test->pattern);
      egg_regex_free (regex);
      test++;
    }
}

static void
test_copy ()
{
  TestMatch *test = match_tests;

  while (test->pattern != NULL)
    {
      EggRegex *regex = egg_regex_new (test->pattern, test->compile_opts,
                                       0, TRUE, NULL);
      EggRegex *copy = egg_regex_copy (regex);
      test_value (copy != NULL &&
                  strcmp (egg_regex_get_pattern (regex),
                          egg_regex_get_pattern (copy)) == 0,
                  "egg_regex_copy() failed with pattern %s",
                  egg_regex_get_pattern (regex));
      egg_regex_free (regex);
      egg_regex_free (copy);
      test++;
    }
}

static void
test_match ()
{
  TestMatch *test = match_tests;

  while (test->pattern != NULL)
    {
      EggRegex *regex = egg_regex_new (test->pattern, test->compile_opts,
                                       0, TRUE, NULL);
      TestMatchElement *element = test->tests;
      while (element->text != NULL)
        {
          gboolean ret = egg_regex_match (regex, element->text, 0);
          test_value (ret == element->result,
                      "egg_regex_match() failed with regex %s and text %s (flags: %d)",
                      egg_regex_get_pattern (regex), element->text,
                      test->compile_opts);
          element++;
        }
      egg_regex_free (regex);
      test++;
    }
}

static void
test_sub ()
{
  TestMatchSub *test = match_sub_tests;

  while (test->pattern != NULL)
    {
      EggRegex *regex = egg_regex_new (test->pattern, test->compile_opts,
                                       0, test->use_offsets, NULL);
      TestMatchSubElement *element = test->tests;
      while (element->text != NULL)
        {
          gboolean ret = egg_regex_match_extended (regex, element->text, -1,
                                                   element->start_position,
                                                   0, NULL);
          if (test_value (ret == element->match_result,
                          "egg_regex_match_extended() failed with regex %s "
                          "and text %s (flags: %d, start: %d, use_offsets: %s)",
                          egg_regex_get_pattern (regex), element->text,
                          test->compile_opts, element->start_position,
                          test->use_offsets ? "TRUE" : "FALSE"))
            {
              gint start = -1, end = -1;
              gchar *sub_expr = egg_regex_fetch (regex, element->text,
                                                 element->sub_number);
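              /* compare without passing NULL to strcmp() when only one of
               * the two strings is NULL */
              test_value ((sub_expr == NULL && element->sub_text == NULL) ||
                          (sub_expr != NULL && element->sub_text != NULL &&
                           strcmp (sub_expr, element->sub_text) == 0),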
                          "egg_regex_fetch() failed to fetch subexpression %d "
                          "from the regex %s", element->sub_number,
                          egg_regex_get_pattern (regex));
              g_free (sub_expr);
              egg_regex_fetch_pos (regex, element->text, element->sub_number,
                                   &start, &end);
              test_value (start == element->start && end == element->end,
                          "egg_regex_fetch_pos() failed to fetch the position "
                          "of subexpression %d from the regex %s",
                          element->sub_number, egg_regex_get_pattern (regex));
            }
          element++;
        }
      egg_regex_free (regex);
      test++;
    }
}

static void
test_named ()
{
  TestMatchNamed *test = match_named_tests;

  while (test->pattern != NULL)
    {
      EggRegex *regex = egg_regex_new