Sunday, February 05, 2012

Firebird and Unicode databases

The first encounter with unicode databases and collations on Firebird may be confusing, as it was for me and it may take some time to grasp the concepts and settings. This article describes the principles and catches and gives the examples how to set up a UTF-8 database in Firebird SQL server 2.5, how to setup ICU libraries and how to use national collations (eg. czech).




Unicode database concepts

Firebird is a great database server that supports both single-byte and unicode databases and clients. But if you have been using single-byte databases for some time and you want to migrate to unicode, you better be aware of several conceptual changes.

Firebird uses ICU libraries to handle Unicode data. This is a good idea since Unicode is a complex topic and ICU libraries are free software developed and maintained by IBM. The thing is that ICU libraries are advancing very fast and different ICU versions generate different keys used for Firebird indexes.

This means that if you generate your collation according to any locale, your Unicode database is bound to the ICU version you have used at the time of creation and cannot be used with another version (older or newer).

You can move single-byte databases freely from server to server (plain copy) and they always work. For Unicode databases this may not be the case. Using default collations (UCS_BASIC, UNICODE, UNICODE_CI) may be safe but I have not tested these cases extensively. The situation changes if you are using national collations. You can still move your database freely but only when the same ICU version is used on both servers.

If you move the database to a server with different ICU, you can connect to the database but every select returns en error

collation is not installed

The good news is that GBAK dump and restore always works and this way you can copy data from any server to any server. Another good news is that Firebird can use several ICU versions at the same time.

 Bundled and system ICU

If you download Firebird from the the project website, it uses so-called bundled ICU. It means that it carries an ICU library and uses always this one. All the Firebird 2.x versions come with ICU version 3.0, it is located in the firebird/lib directory and consists of three files (libicudata, libiuui18n, libicuuc).

But this is not a full-featured ICU. In order to keep it small, almost all locales are stripped (original libicudata is about 9MB while the bundled one is about 1.5MB). In order to be able to create your collations according to locales you will need to replace these libraries with full versions (see below).


Linux distributions (Debian, Ubuntu for sure) compile Firebird wirh so-called system ICU. This means that it uses the ICU library that comes with your system. This is always the full version of the library but the version numbers may differ wildly.

I prefer to use Firebird release from the project website and to configure ICU libraries myself as I describe below.

Please note that Firebird 3.0 will bring in some changes. It is reported that the bundled ICU version may be abandoned. I have no information on configuration options' changes.

Before we start

The next steps presume that you have Firebird 2.5.1 downloaded and installed in the /opt directory. These steps are tested with Firebird release from the Firebird project website, on Ubuntu 11.10-64.

Note that Firebird 2.5.0 may not work with ICU versions newer than 4.2 since there was a naming change in the ICU APIs with the next release. Firebird 2.5.1 works fine with both old and new ICU versions.

Step 1: Downloading and Setting Up the ICU

ICU can be downloaded from the ICU-project web site at http://site.icu-project.org/download

There are binaries for many systems as well as sources. In this example we will download the binaries for ICU 4.8 (icu4c-4_8_1_1-RHEL6-x64.tgz) and sources for ICU 3.0 since there are no binaries for 64bit Linux (icu-3.0.tgz).

It is simple to install the binaries. Decompress the file, go to the usr/local/lib directory and copy the needed files into /opt/firebird/lib directory.

cd icu4c-4_8_1_1-RHEL6-x64/usr/local/lib

cp libicudata.so.48*   /opt/firebird/lib/
cp libicui18n.so.48*   /opt/firebird/lib/
cp libicuuc.so.48*     /opt/firebird/lib/

Then open /opt/firebird/intl/fbintl.conf and set up the ICU versions you want to enable. We will tell Firebird to use both the default and the new ICU 4.8 version.

Before:
 
<intl_module builtin>
    icu_versions    default
</intl_module>

After:

<intl_module builtin>
    icu_versions    default 4.8
</intl_module>

It is important to make changes in the builtin section not in the fbintl section. The changes in the fbintl section have no influence.

More examples how the icu_versions line may look like (these all work):
icu_versions    default 
icu_versions    default 4.8
icu_versions    3.0
icu_versions    3.0 4.8

When you have saved the changes, restart the Firebird service.

/etc/init.d/firebird restart

Step 1a: compiling ICU 3.0 from sources

If you need to compile ICU 3.0, it is quite easy, there is just one little catch.

Decompress the downloaded file and do the usual thing:

cd icu/sources
./configure
make

The catch is that the compilation may stop with an error. It is usually just a small mistake in the file layoutex/ParagraphLayout.cpp. If this is the case, open the file layoutex/ParagraphLayout.cpp, go the line 748 and change the #elif directive to #else. Save and make again.

When the make is finished you will find the libraries in the icu/sources/lib directory. Copy them the same way as described above, replacing the Firebird original ICU libraries version 3.0.

Step 2: Create the database 

create database 
   'localhost:/path/to/database.fdb' 
   default character set UTF8;

Step 3: Create the national collation

First connect to the database. Then:

create collation UNICODE_CSCZ_CI 
   for UTF8 
   from UNICODE 
   case insensitive 
   'LOCALE=cs_CZ;ICU-VERSION=4.8';

This command creates a czech (cs_CZ) case insensitive collation using ICU version 4.8. 
  • Note: quoted attributes on the last line are case sensitive
  • Note: if the given locale is not present in the ICU library, you get an error invalid collation attributes
  • Note: if the given ICU version is not configured in Firebird or libraries are missing, you get an error invalid collation attributes 
  • Note: ICU libraries must be located in the firebird/lib directory.

Step 4: Use the national collation

This command sets up the collation as the default for the database.

alter character set UTF8 
   set default collation UNICODE_CSCZ_CI;

Conclusion

And this is all there is to it. It really boils down to just a 3 simple commands (create database, create collation, alter character set) provided that the correct versions of ICU libraries are present in the right directories and correctly set up ;-)

εὕρηκα


Useful links

1 comment:

Popa Adrian Marius said...

There was also an issue with
icu 52.1 in Debian and Ubuntu
but now is solved at least in Debian/Ubuntu packages

http://www.firebirdnews.org/?p=9201