i learned something today

2009-03-25 · by jsnby · Read in about 2 min · (324 Words)
Computers

Today, I learned the hard way about character encodings and file conversion utilities.

At work, I’m building a system that takes a data dump from system A (a Windows server) and imports it into system B (a Linux server). System A outputs a file, which I’ll call input.txt, in UTF-16. System B wants its input in UTF-8.

At first, I didn’t realize that input.txt was in UTF-16. I looked at the file in vi and saw that the first byte was garbage, and then every other byte was garbage. No problem, I thought, and proceeded to write a Perl script to throw away the garbage bytes. This worked well on small files, but input.txt was 2.8 GB (2885966416 bytes, to be exact), and the script took this long to run (measured with the time command):

real     39m38.253s
user    39m27.890s
sys     0m9.298s

Holy crap! That took 40 minutes. When I talked to the person who generated the file, they told me it was encoded in UTF-16. I then tried modifying my Perl script to use the Encode module, but that didn’t work out well, and I kept getting errors complaining about the encoding.
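For comparison, here is a minimal sketch of what a decode-and-re-encode script can look like. It is in Python rather than Perl and is not the script described above; the function name, the chunk size, and the assumption that the file starts with a byte order mark are all illustrative.

```python
def utf16_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Transcode a UTF-16 file (with a BOM) to UTF-8, one chunk at a time."""
    # encoding="utf-16" uses the BOM to pick the byte order and drops it.
    # newline="" turns off newline translation, so CRLF endings pass through.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # size is in characters, not bytes
            if not chunk:
                break
            dst.write(chunk)
```

Reading in chunks keeps memory use flat no matter how large the input is, which matters for a multi-gigabyte dump.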

Enough was enough. I needed to know the details about this file. I used the file command to find out exactly what was going on:

$ file input.txt
input.txt: Little-endian UTF-16 Unicode character data, with CRLF line terminators
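The "Little-endian UTF-16" part of that answer can be roughly reproduced by looking at the first two bytes of the file, the byte order mark (BOM). The sketch below is a rough heuristic under that assumption, far simpler than what the file command really does, and the function name is made up for illustration.

```python
import codecs

def sniff_utf16_bom(path):
    """Return 'utf-16-le', 'utf-16-be', or None based on the first two bytes."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head == codecs.BOM_UTF16_LE:  # bytes FF FE
        # Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts this way.
        return "utf-16-le"
    if head == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "utf-16-be"
    return None  # no UTF-16 BOM; the file may still be UTF-16 without one
```

Those two leading bytes are probably what showed up as garbage at the very start of the file in vi.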

Great. Now I knew for certain that it was UTF-16. I figured that there had to be a Linux command-line file conversion utility, so I consulted Google. Sure enough, I found the ‘iconv’ utility.

$ time iconv -f utf16 -t utf8 -o output.txt input.txt
real    0m24.807s
user    0m13.495s
sys     0m7.362s

Wow…that was fast: about 25 seconds instead of nearly 40 minutes. I diffed my original output file (the one from my Perl script) against the new one, and they matched.
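A caveat on that match: throwing away every other byte gives the same result as a real conversion only when every character is plain ASCII, which the matching diff suggests was the case here. The sketch below is a Python stand-in for the byte-dropping approach, not the original Perl script, and the sample strings are invented for illustration.

```python
import codecs

def strip_every_other_byte(raw):
    """Drop the 2-byte BOM, then keep only the low byte of each 16-bit unit."""
    return raw[2:][::2]

def convert_properly(raw):
    """Decode UTF-16 (byte order taken from the BOM) and re-encode as UTF-8."""
    return raw.decode("utf-16").encode("utf-8")

def as_utf16_le_file(text):
    """Build the bytes of a little-endian UTF-16 file with a BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

ascii_raw = as_utf16_le_file("id,name\r\n1,Ann\r\n")
accent_raw = as_utf16_le_file("caf\u00e9\r\n")  # contains one non-ASCII letter

print(strip_every_other_byte(ascii_raw) == convert_properly(ascii_raw))    # True
print(strip_every_other_byte(accent_raw) == convert_properly(accent_raw))  # False
```

With the accented letter, the byte-dropping version emits the single byte E9, while UTF-8 needs the two bytes C3 A9, so the outputs stop matching as soon as the data leaves ASCII.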

After all was said and done, I felt like a total idiot for not using the file command first to determine exactly what I was working with.