Unicode normalization and text display differences

Benjamin Kiessling
Hi everybody,

I am trying to build a simple line generation tool for training neural
networks for OCR, and everything is working fine except for an oddity in
display that depends on Unicode normalization, in particular in diacritic
placement.

In [0] (text normalized to NFC) diacritics are placed correctly, while
in [1] (text normalized to NFD) diacritics are rendered next to the
preceding code point instead of being attached to it.

If I understand Unicode correctly there should be no difference in
display and there is a presentation about Pango from 2004 claiming that
there shouldn't be one.
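
To make the two forms concrete, here is a small sketch using only Python's
standard unicodedata module. The sample character is an assumption chosen
for illustration; it is not taken from the screenshots. It shows that NFC
and NFD store different code point sequences for the same canonically
equivalent text, which is why a renderer can end up treating them
differently.

```python
import unicodedata


def codepoints(s):
    """Format every code point of s as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]


sample = "\u1f00"  # GREEK SMALL LETTER ALPHA WITH PSILI (illustrative choice)

nfc = unicodedata.normalize("NFC", sample)
nfd = unicodedata.normalize("NFD", sample)

print(codepoints(nfc))  # ['U+1F00']
print(codepoints(nfd))  # ['U+03B1', 'U+0313']

# Canonical equivalence: recomposing the NFD string gives the NFC string back.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```

In NFD the diacritic is a separate combining mark (U+0313) that the font
and the shaper have to position on the base letter; in NFC a single
precomposed glyph is enough.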

Is this a known issue or expected behavior? Is there some preprocessing
necessary before using pango_layout_set_text()?

All Best,
Ben

[0] http://l.unchti.me/dump/nfc.png
[1] http://l.unchti.me/dump/nfd.png
_______________________________________________
gtk-i18n-list mailing list
[hidden email]
https://mail.gnome.org/mailman/listinfo/gtk-i18n-list

Re: Unicode normalization and text display differences

Behdad Esfahbod-5
Hi Ben,

You at least need to tell us on what platform you are testing this and with
what fonts.


Re: Unicode normalization and text display differences

Benjamin Kiessling
On 01/06, Behdad Esfahbod wrote:
> Hi Ben,
>
> You at least need to tell us on what platform you are testing this and with
> what fonts.

Sorry, I'm running Debian testing (libpango-1.0 package version
1.36.8-3) and using Pango through GObject introspection from Python
(although that's probably not the issue here). Apart from setting up the
Cairo surface and the layout, the code boils down to
set_font_description/set_text/show_layout.

The font family used is GFS Philostratos, although, as I'm realizing just
now, the issue occurs only with GFS fonts. With DejaVu Sans, the output
is independent of the normalization form. Is this to be expected? The
documentation is silent on such points (and in general on how Pango is
supposed to be used).

All Best,
Ben

Re: Unicode normalization and text display differences

Behdad Esfahbod-5
On 16-01-06 02:55 PM, Benjamin Kiessling wrote:

> Sorry, I'm running Debian testing (libpango-1.0 package version
> 1.36.8-3) and using Pango through GObject introspection from Python
> (although that's probably not the issue here). Apart from setting up the
> Cairo surface and the layout, the code boils down to
> set_font_description/set_text/show_layout.
>
> The font family used is GFS Philostratos, although, as I'm realizing just
> now, the issue occurs only with GFS fonts. With DejaVu Sans, the output
> is independent of the normalization form. Is this to be expected? The
> documentation is silent on such points (and in general on how Pango is
> supposed to be used).

It's probably the case that the GFS fonts don't support the combining marks,
and as such Pango picks them from a different font, which then breaks shaping.
Ideally that would not happen, but that's the way it is currently.
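
One way to check this hypothesis is to compare the code points of the text
in each normalization form against the font's character coverage. The
sketch below assumes you already have the coverage as a set of integers
(for example from fontTools' TTFont.getBestCmap() or from fc-query). The
coverage used here is made up for illustration and is not the real
coverage of any GFS font.

```python
import unicodedata


def uncovered_codepoints(text, coverage):
    """Return the code points of text missing from coverage, in order of first use."""
    missing = []
    for ch in text:
        cp = ord(ch)
        if cp not in coverage and cp not in missing:
            missing.append(cp)
    return missing


# Hypothetical coverage: the Greek and Greek Extended blocks (precomposed
# letters), but none of the combining marks in U+0300..U+036F.
coverage = set(range(0x0370, 0x0400)) | set(range(0x1F00, 0x2000))

word = "\u1f00\u03bd\u03ae\u03c1"  # sample polytonic Greek word (illustrative)

results = {}
for form in ("NFC", "NFD"):
    text = unicodedata.normalize(form, word)
    results[form] = uncovered_codepoints(text, coverage)
    print(form, [f"U+{cp:04X}" for cp in results[form]])
```

With such a font the NFC text is fully covered, while the NFD text needs
U+0313 and U+0301 from somewhere else, which is where fallback would kick
in.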

behdad

Re: Unicode normalization and text display differences

Benjamin Kiessling
On 01/06, Behdad Esfahbod wrote:

> It's probably the case that the GFS fonts don't support the combining marks,
> and as such Pango picks them from a different font, which then breaks shaping.

Makes sense. Is there a way to detect fallback other than disabling it
completely with pango_attr_fallback_new() or setting up a separate
fontconfig environment for a particular font?

All Best,
Ben

Re: Unicode normalization and text display differences

Behdad Esfahbod-5
On 16-01-06 03:07 PM, Benjamin Kiessling wrote:

> Makes sense. Is there a way to detect fallback other than disabling it
> completely with pango_attr_fallback_new() or setting up a separate
> fontconfig environment for a particular font?

If you care to walk the layout info using a layout iterator, you can check the
font used for each run of text, and under fallback, more than one font is used.
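
A minimal sketch of that approach from Python, assuming PyGObject and
pycairo are installed. The helper names fonts_per_run, uses_fallback and
demo are made up for this example; the Pango calls they rely on are
pango_layout_get_iter(), pango_layout_iter_get_run_readonly(),
pango_layout_iter_next_run() and pango_font_describe() as exposed through
GObject introspection.

```python
def fonts_per_run(layout):
    """Return [(font_family, run_text), ...] for every run of a Pango layout."""
    # Item offsets and lengths are byte offsets into the UTF-8 text.
    raw = layout.get_text().encode("utf-8")
    runs = []
    it = layout.get_iter()
    while True:
        run = it.get_run_readonly()  # None marks the empty run at a line end
        if run is not None:
            item = run.item
            family = item.analysis.font.describe().get_family()
            chunk = raw[item.offset:item.offset + item.length].decode("utf-8")
            runs.append((family, chunk))
        if not it.next_run():
            break
    return runs


def uses_fallback(layout):
    """True if more than one font family was needed to shape the text."""
    return len({family for family, _ in fonts_per_run(layout)}) > 1


def demo(text, font="GFS Philostratos 24"):
    """Build a layout the usual way and report the font of each run."""
    # Imports live here so the helpers above do not require PyGObject.
    import cairo
    import gi
    gi.require_version("Pango", "1.0")
    gi.require_version("PangoCairo", "1.0")
    from gi.repository import Pango, PangoCairo

    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 600, 100)
    layout = PangoCairo.create_layout(cairo.Context(surface))
    layout.set_font_description(Pango.font_description_from_string(font))
    layout.set_text(text, -1)
    return fonts_per_run(layout)
```

Calling demo() on the NFD text should then list the combining marks in
runs of their own with a different family if fallback is what happens
here. Note that comparing family names will not distinguish two faces of
the same family.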