samedi 28 février 2015

MD5 value mismatch between SQL server and PostgreSQL

In order to write some code to do consistency check of data stored in both sql-server and PostgreSQL, I plan to calculate the MD5 on table data for both the databases, and verify if they are equal. This works fine as long as data is plain text ( ANSI ) as below:



sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', 'a'));
0x0cc175b9c0f1b6a831c399e269772661


postgres=# select MD5('a');
0cc175b9c0f1b6a831c399e269772661


Now, If I try to use some hangul(korean) characters, MD5 match fails:



sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', '무'));
0x0cc175b9c0f1b6a831c399e269772661


postgres=# select MD5('무');
cb3e9be1a3a28b355eabae1fa1e291b3


As per my understanding, reason of mismatch is that unicode characters are stored as UCS-2 encoding (fixed 16 bits encoding) in sql-server and UTF-8 encoding in PostgreSQL. And as MD5 works on character bits, the character bits sequence would be different in both SQL server and PostgreSQL.


AS I have been dealing mostly with hangul character-set, the workaround I used in PostgreSQL is to convert the encoding from UTF-8 to UHC ( Universal Hangul Character-set) before calculating hash as below:



postgres=# select MD5(CONVERT('무'::bytea,'UTF8','UHC'));
7827b52f65d9f7777d37071cbbbf7f2d


As you can see, the above hash value is same as that for SQL server.


All is fine as long as I am dealing with Hangul characters. But some tables contains mix of Hangul and Chinese characters, and the conversion fails in that case:



postgres=# select MD5(CONVERT('무么'::bytea,'UTF8','UHC'));
ERROR: character 0xe4b988 of encoding "UTF8" has no equivalent in "UHC"
postgres=#


The error makes sense as there are no equivalent of Chinese characters in UHC character-set.


How can I make it work? Basically, I need to find way to convert UCS-2 to UTF-8 in SQL server, or to convert UTF-8 to UCS-2 in PostgreSQL before calculating MD5. I want to perform all these operations within database engine, and not load data in external application to calculate MD5, as some tables has huge data set.


Aucun commentaire:

Enregistrer un commentaire