auto-upgrading strings to utf8

squentin
With gtk2-perl, most strings passed as arguments to a glib/gtk function
are automatically upgraded to utf8.
That's not the case in C, where arguments must already be in the proper
encoding before they are passed to a gtk/glib function.
This leads to a problem with the GStreamer bindings, where the filename
is a string property of a Glib object and is thus auto-upgraded to
utf8, although it shouldn't be.

So, following a chat with muppet on IRC, we were wondering if
automatically upgrading text to utf8 is the right thing to do?

I tried disabling auto-upgrading in Glib, and my program (a very complex
jukebox) runs fine because all the data I use are utf8, so there is no
need to upgrade strings to utf8 in this case.

The problem is how to keep existing code working...

Any thoughts on how to fix the problem?

_______________________________________________
gtk-perl-list mailing list
[hidden email]
http://mail.gnome.org/mailman/listinfo/gtk-perl-list

Re: auto-upgrading strings to utf8

Christian Borup
On Sun, 2005-06-05 at 04:28 +0200, Quentin wrote:
> With gtk2-perl, most strings passed as arguments to a glib/gtk function
> are automatically upgraded to utf8.
> That's not the case in C, where arguments must already be in the proper
> encoding before they are passed to a gtk/glib function.

> This leads to a problem with the GStreamer bindings, where the filename
> is a string property of a Glib object and is thus auto-upgraded to
> utf8, although it shouldn't be.
> So, following a chat with muppet on IRC, we were wondering if
> automatically upgrading text to utf8 is the right thing to do?

It is :-)

> I tried disabling auto-upgrading in Glib, and my program (a very complex
> jukebox) runs fine because all the data I use are utf8, so there is no
> need to upgrade strings to utf8 in this case.

The problem here is the data, not the upgrade.

Your strings are utf8 but you don't let perl know. That will break
things all over the place, not just in Glib/Gtk2 (regular expressions and
pretty much every other string operator won't work on your data).

If you know for sure that your data is utf8, call Encode::_utf8_on(...)
on your string. _utf8_on is very cheap; it just flips a bit.
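
To illustrate the advice above, here is a minimal sketch using only the core Encode module. The sample bytes are made up for the example, and Encode::decode() is shown as a validating alternative that is not mentioned in the message itself.

```perl
use strict;
use warnings;
use Encode ();

# Two bytes that happen to be the UTF-8 encoding of U+00E9 ("e" with acute),
# as they might arrive from a file or a socket. Perl sees two characters.
my $raw = "\xc3\xa9";
print length($raw), "\n";

# Flip the flag: the same bytes are now interpreted as one character.
# Nothing is checked, so this is only safe if the data really is utf8.
my $flagged = $raw;
Encode::_utf8_on($flagged);
print length($flagged), "\n";

# Encode::decode() reaches the same result but inspects the input first.
my $decoded = Encode::decode('UTF-8', $raw);
print $decoded eq $flagged ? "same\n" : "different\n";
```

The trade-off is speed against safety: _utf8_on trusts the caller completely, while decode() has to walk the bytes.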

So from a Perl point of view your strings are broken. Please note that
I'm not saying that Perl's POV is right, but it's the way it is, and it
isn't likely to change before Perl 6. In perl today a string can be either
in the encoding of the locale (usually iso-8859-1 or iso-8859-15 in France
and Denmark) or utf8, in which case the utf8 flag is set on the string.
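
A small sketch of what "upgrading" does to such a string, assuming a one-byte Latin-1 sample value. It uses only the built-in utf8:: functions; the hex dump is just a way to show the bytes a C function would be handed afterwards.

```perl
use strict;
use warnings;

# U+00E9 stored as a single Latin-1 byte; the utf8 flag is off.
my $latin1 = "\xe9";
print utf8::is_utf8($latin1) ? "flag on\n" : "flag off\n";

# Upgrading changes the internal storage, not the character content.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
print utf8::is_utf8($upgraded) ? "flag on\n" : "flag off\n";
print $upgraded eq $latin1 ? "equal\n" : "not equal\n";

# Expose the internal bytes of the upgraded string.
my $octets = $upgraded;
utf8::encode($octets);
my $hex = join ' ', map { sprintf '%02x', ord } split //, $octets;
print "$hex\n";   # c3 a9
```

From Perl's side the two strings compare equal; only code that looks at the raw bytes, such as a C library, sees the difference.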

There are two separate issues here. One is that gtk+ requires strings to
be utf8 (this one is a non-issue for us because Perl knows about utf8, so
with our typemap all is good, unless the strings are not valid).

The second issue is filenames. This one is harder. Some applications use
utf8 filenames regardless of locale (personally I think that is a bug -
but there can be valid reasons to do so I suppose).

> The problem is how to keep existing code working...

That's easy. Don't break the typemap. IMHO the typemap is right as it is
(not a big surprise seeing how I made the first version of it.)

I do not think that it's a good idea to break the common case to get
filenames right. I don't think it's acceptable to not be able to print
the same string as you would put in a Label.

> Any thoughts on how to fix the problem ?

Ideally the glib filename functions should be fixed. Quite a few glib
based programs have had problems with filenames, which become utf8 even
though the locale is say iso-8859-15.

I think we should provide a filename helper of some sort. Either as a
function that takes a perl string and returns a filename suited for the
locale, or perhaps by handling the conversion in the wrappers for the
functions that access the filesystem.

./borup

Re: auto-upgrading strings to utf8

squentin
> > I tried disabling auto-upgrading in Glib, and my program (a very complex
> > jukebox) runs fine because all the data I use are utf8, so there is no
> > need to upgrade strings to utf8 in this case.
>
> The problem here is the data not the upgrade.
Maybe I haven't been clear: my program runs fine with or without
auto-upgrading.

> Ideally the glib filename functions should be fixed. Quite a few glib
> based programs have had problems with filenames, which become utf8 even
> though the locale is say iso-8859-15.
>
> I think we should provide a filename helper of some sort. Either as a
> function that takes a perl string and returns a filename suited for the
> locale, or perhaps by handling the conversion in the wrappers for the
> functions that access the filesystem.
There is already Glib->filename_from_unicode.
The problem here is that gstreamer puts a filename in a glib object
property of type string, and all string properties are auto-upgraded to
utf8.
The function called is $source->set(location => $file);
which is a generic glib function to set a property.

(Now that I think about it, a dirty trick to make it work for now would
be to cheat and turn on the utf8 flag on the non-utf8 filename, so it
doesn't get upgraded.)


Re: auto-upgrading strings to utf8

muppet-6

On Jun 5, 2005, at 9:48 AM, Quentin wrote:

>> Ideally the glib filename functions should be fixed. Quite a few glib
>> based programs have had problems with filenames, which become utf8
>> even though the locale is say iso-8859-15.
>>
>> I think we should provide a filename helper of some sort. Either as a
>> function that takes a perl string and returns a filename suited for
>> the locale, or perhaps by handling the conversion in the wrappers for
>> the functions that access the filesystem.
>>
> There is already Glib->filename_from_unicode.
> The problem here is that gstreamer puts a filename in a glib object
> property of type string, and all string properties are auto-upgraded
> to utf8.
> The function called is $source->set(location => $file);
> which is a generic glib function to set a property.

To clarify:

    # filename from command line is in proper filename encoding.
    $filename = $ARGV[$n];

    $source->set (location => $filename);

invokes Glib::Object::set(), which contains something like this:

      foreach key/val pair:
          SV * key = ST (i);   # ith item from the stack, "location"
          SV * val = ST (i+1);  # i+1th item from the stack, $filename

          # look up the "location" property on this class to find the
          # value type
          pspec = g_object_class_find_property (class, key);

          # initialize the GValue container to hold that type:
          g_value_init (gvalue, G_PARAM_SPEC_VALUE_TYPE (pspec));

          # the location property is G_TYPE_STRING, so the GValue
          # is now prepared to hold gchar* strings.

          # marshal the SV into the GValue:
          gperl_value_from_sv (gvalue, val);
          # this function contains a great big switch on GType, and
          # for G_TYPE_STRING, it does this:
                g_value_set_string (value, SvGChar (sv));
          # SvGChar() upgrades the sv to utf8.
          # now the GValue contains a utf8-encoded version of
          # $filename, which isn't what we actually need.


The GstFileSrcElement's set_property handler does nothing special to  
the string -- it just copies it, and then later passes it, unaltered  
to open().  So, it expects the string to have been a valid filename.

But since it was passed through a G_TYPE_STRING property, and we  
consider G_TYPE_STRING to mean "utf8 text", we mangled it.
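
The mangling can be reproduced in plain Perl without the bindings. This is only a sketch: upgraded_octets() and hex_dump() are hypothetical helpers written for this illustration, and the upgrade-then-expose-bytes step is assumed to approximate what SvGChar() hands to C.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octets a C function would see after an
# SvGChar-style upgrade of the given Perl string.
sub upgraded_octets {
    my ($string) = @_;
    my $copy = $string;
    utf8::upgrade($copy);   # no-op if the utf8 flag is already on
    utf8::encode($copy);    # expose the internal UTF-8 bytes
    return $copy;
}

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# The same filename as it would appear in @ARGV under two locales;
# in both cases the utf8 flag is off.
my $utf8_name   = "\xc3\xa9.ogg";   # UTF-8 locale
my $latin1_name = "\xe9.ogg";       # iso-8859-1 locale

print hex_dump(upgraded_octets($utf8_name)),   "\n";  # c3 83 c2 a9 2e 6f 67 67
print hex_dump(upgraded_octets($latin1_name)), "\n";  # c3 a9 2e 6f 67 67
```

In both cases the bytes that would reach open() differ from the bytes of the name on disk, so the file is not found.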


I can think of three fixes for this:

a) if it's the case that a G_TYPE_STRING really is supposed to be  
utf8, then GstFileSrcElement is broken, and should do something like  
g_filename_from_utf8() on the string it gets from the location  
property.  the bindings would have to do nothing.  this risks  
breaking C programs, but GStreamer is still at a nonstable version...

b) turn off auto-upgrading, and push the burden of ensuring utf8-ness  
of text onto perl developers.  this risks breaking lots of existing  
code.

c) add infrastructure to the bindings to allow per-property overrides  
for marshaling.  this would slow down the general case and take up  
even more memory (another hash table and lookup per property), but  
would allow problems like this (and G_TYPE_POINTER properties) to be  
fixed.


--
Holy crap, dude, we have kids!
     -- Elysse, six days after giving birth to twins

Re: auto-upgrading strings to utf8

Aristotle Pagaltzis
* muppet <[hidden email]> [2005-06-05 22:05]:
> I can think of three fixes for this:
>
> b) turn off auto-upgrading, and push the burden of ensuring
> utf8-ness of text onto perl developers. this risks breaking
> lots of existing code.

That is so wrong. In the default case, auto-upgrading is the
right thing to do. We know there are some edge cases which require
special treatment. Making the default case require extra work so
that the edge cases do not is just ass backwards and would
introduce subtle bugs where people forget these things.

Think of SQL injection vulnerabilities; people constantly forget
where they have to quote things and where not. Of course, for SQL
queries this is difficult to solve; but why introduce the same
kind of subtlety into an API where it’s simple to avoid?
 
> a) if it's the case that a G_TYPE_STRING really is supposed to
> be utf8, then GstFileSrcElement is broken, and should do
> something like g_filename_from_utf8() on the string it gets
> from the location property. the bindings would have to do
> nothing. this risks breaking C programs, but GStreamer is still
> at a nonstable version...

Unless there’s an actual, non-obvious, but compelling reason for
GStreamer to do otherwise, that’s clearly the correct solution.

> c) add infrastructure to the bindings to allow per-property
> overrides for marshaling. this would slow down the general case
> and take up even more memory (another hash table and lookup per
> property), but would allow problems like this (and
> G_TYPE_POINTER properties) to be fixed.

That would be the right thing to do if GStreamer does have reason
for doing things the way it does (rather than it being a simple
oversight).

Would there be a way to do this specifically within the GStreamer
bindings (a special overridden version of Glib::Object::set(),
say) rather than penalizing all of Gtk2-Perl for it?

If there is a way to do this only within GStreamer, maybe the
same mechanism could be available from base Gtk2-Perl to all
bindings which need to make use of such, but as an opt-in option
that does not penalize those which don’t?

Regards,
--
#Aristotle
*AUTOLOAD=*_=sub{s/(.*)::(.*)/print$2,(",$\/"," ")[defined wantarray]/e;$1};
&Just->another->Perl->hacker;

Re: auto-upgrading strings to utf8

Christian Borup
On Mon, 2005-06-06 at 03:09 +0200, A. Pagaltzis wrote:

> * muppet <[hidden email]> [2005-06-05 22:05]:
> > I can think of three fixes for this:
> >
> > b) turn off auto-upgrading, and push the burden of ensuring
> > utf8-ness of text onto perl developers. this risks breaking
> > lots of existing code.
>
> That is so wrong. In the default case, auto-upgrading is the
> right thing to do. We know there are some edge cases which require
> special treatment. Making the default case require extra work so
> that the edge cases do not is just ass backwards and would
> introduce subtle bugs where people forget these things.

Obviously I agree here. Gtk2 is doing the right thing here.
Unfortunately, lots of other perl modules do not.

> Think of SQL injection vulnerabilities; people constantly forget
> where they have to quote things and where not. Of course, for SQL
> queries this is difficult to solve; but why introduce the same
> kind of subtlety into an API where it’s simple to avoid?
>  
> > a) if it's the case that a G_TYPE_STRING really is supposed to
> > be utf8, then GstFileSrcElement is broken, and should do
> > something like g_filename_from_utf8() on the string it gets
> > from the location property. the bindings would have to do
> > nothing. this risks breaking C programs, but GStreamer is still
> > at a nonstable version...
>
> Unless there’s an actual, non-obvious, but compelling reason for
> GStreamer to do otherwise, that’s clearly the correct solution.

Unfortunately there is a reason. Filenames really should be treated as
opaque data. The reason for this is that there is no way to be sure of
the encoding of filenames (it might not even be consistent). After all
locales can be set per user. To make matters worse not all scripts can
be represented in unicode.

But we all know that filenames are not treated as opaque. If nothing
else, they are concatenated with other bits of the path, which in perl may
lead to an upgrade.
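
A minimal sketch of such an accidental upgrade, assuming a directory name that is pure ASCII but carries the utf8 flag (for instance because it was decoded earlier). The names are made up for the example.

```perl
use strict;
use warnings;

my $name = "\xc3\xa9.ogg";   # filename bytes (UTF-8 on disk), flag off
my $dir  = "music/";
utf8::upgrade($dir);          # ASCII only, but now flagged

# Concatenating a flagged and an unflagged string upgrades the result,
# so every byte of $name is treated as a Latin-1 character.
my $path = $dir . $name;
print utf8::is_utf8($path) ? "flag on\n" : "flag off\n";
print length($path), "\n";    # 12 characters

# The internal bytes of the result are longer than the original bytes.
my $octets = $path;
utf8::encode($octets);
print length($octets), "\n";  # 14 bytes
```

The two extra bytes come from the two non-ASCII bytes of the filename, each of which now takes two bytes internally.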

> > c) add infrastructure to the bindings to allow per-property
> > overrides for marshaling. this would slow down the general case
> > and take up even more memory (another hash table and lookup per
> > property), but would allow problems like this (and
> > G_TYPE_POINTER properties) to be fixed.
>
> That would be the right thing to do if GStreamer does have reason
> for doing things the way it does (rather than it being a simple
> oversight).
>
> Would there be a way to do this specifically within the GStreamer
> bindings (a special overridden version of Glib::Object::set(),
> say) rather than penalizing all of Gtk2-Perl for it?

After a brief discussion on IRC last night I think Torsten will see if
this solution flies. I.e. a ->set(...) which splits out location and sets
it without going through the standard set.
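
The splitting step could look roughly like the following in plain Perl. This is a hypothetical sketch, not the code under discussion: split_location() and the "other_property" key are invented for the illustration, and the actual hand-off to GStreamer is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: separate the "location" value from all other
# key/value pairs passed to a set()-style call.
sub split_location {
    my @pairs = @_;
    my ($location, @rest);
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        if ($key eq 'location') {
            $location = $value;          # to be passed on byte for byte
        } else {
            push @rest, $key, $value;    # goes through the standard set
        }
    }
    return ($location, @rest);
}

my ($location, @rest) =
    split_location(location => "\xe9.ogg", other_property => 4096);

print defined $location ? "location found\n" : "no location\n";
print scalar(@rest), " remaining arguments\n";
```

An overriding set() would then hand $location to the element without upgrading it and forward @rest to the inherited method.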

./borup

Re: auto-upgrading strings to utf8

Jan Hudec
In reply to this post by muppet-6
On Sun, Jun 05, 2005 at 15:59:15 -0400, muppet wrote:

> [...]
> The GstFileSrcElement's set_property handler does nothing special to  
> the string -- it just copies it, and then later passes it, unaltered  
> to open().  So, it expects the string to have been a valid filename.
>
> But since it was passed through a G_TYPE_STRING property, and we  
> consider G_TYPE_STRING to mean "utf8 text", we mangled it.
>
>
> I can think of three fixes for this:
>
> a) if it's the case that a G_TYPE_STRING really is supposed to be  
> utf8, then GstFileSrcElement is broken, and should do something like  
> g_filename_from_utf8() on the string it gets from the location  
> property.  the bindings would have to do nothing.  this risks  
> breaking C programs, but GStreamer is still at a nonstable version...
May I bring the G_BROKEN_FILENAMES environment variable to your
attention...? *The code can quite well call g_filename_from_utf8*. It's
just that that function is a NOP unless G_BROKEN_FILENAMES is set. It is
arguable whether having filenames in locale encoding is broken, but the
Gtk people think so.

Now yes, this is the right solution -- G_TYPE_STRING must be
assumed to be utf8 for the sake of general sanity.

> b) turn off auto-upgrading, and push the burden of ensuring utf8-ness  
> of text onto perl developers.  this risks breaking lots of existing  
> code.
>
> c) add infrastructure to the bindings to allow per-property overrides  
> for marshaling.  this would slow down the general case and take up  
> even more memory (another hash table and lookup per property), but  
> would allow problems like this (and G_TYPE_POINTER properties) to be  
> fixed.
-------------------------------------------------------------------------------
                                                 Jan 'Bulb' Hudec <[hidden email]>

Re: auto-upgrading strings to utf8

Torsten Schoenfeld
On Mon, 2005-06-06 at 15:40 +0200, Jan Hudec wrote:

> Now yes, this is the right solution -- G_TYPE_STRING must be
> assumed to be utf8 for the sake of general sanity.

I'm unable to find a single word in the docs that would support this.
GstFileSrc's "location" property gets set up via g_param_spec_string,
which creates a GParamSpecString instance.  The corresponding GValues
for GParamSpecString are of type G_TYPE_STRING.  g_value_get_string
returns a const gchar*.  gchar is just a typedef for char, and the docs
say about it:

  Corresponds to the standard C char type.

Then there is GtkFileChooser.  gtk_file_chooser_get_filename returns a
gchar*, too.  The docs say:

  This means that while you can pass the result of
  gtk_file_chooser_get_filename() to open(2) or fopen(3), you may not
  be able to directly set it as the text of a GtkLabel widget unless
  you convert it first to UTF-8, which all GTK+ widgets expect. You
  should use g_filename_to_utf8() to convert filenames into strings
  that can be passed to GTK+ widgets.

So, no, I don't think the GStreamer library is wrong in assuming that
filenames set via the "location" property are in the correct (i.e.
filesystem) encoding already.  But yes, I think we should fix this in
the GStreamer bindings as opposed to Glib or Gtk2.  I'll see if the
previously mentioned approach works out.

--
Bye,
-Torsten

Re: auto-upgrading strings to utf8

Torsten Schoenfeld
In reply to this post by Christian Borup
On Mon, 2005-06-06 at 08:48 +0200, Christian Borup wrote:

> > Would there be a way to do this specifically within the GStreamer
> > bindings (a special overridden version of Glib::Object::set(),
> > say) rather than penalizing all of Gtk2-Perl for it?
>
> After a brief discussion on IRC last night I think Torsten will see if
> this solution flies. I.e. a ->set(...) which splits out location and sets
> it without going through the standard set.

Alright, here's a patch that implements this:

  http://home.arcor.de/kaffeetisch/tmp/location.patch

In my (not very exhaustive) testing, it seems to work.  I tested
filenames with umlauts and accented characters a) passed via the command
line, b) retrieved from a file chooser and c) retrieved from a file
chooser and mangled with Glib::filename_(to/from)_unicode.  Everything
was tested in both the C and the de_DE.UTF-8 locales.

Could you guys poke at it and see if it *really* fixes the issue?  Each
and every corner case is important, I think.

--
Bye,
-Torsten
