Today, I learned the hard way about character encodings and file conversion utilities.
At work, I’m building a system that takes a data dump from system A (a Windows server) and imports it into system B (a Linux server). System A outputs a file, which I’ll call input.txt, in UTF-16 format. System B wants the input in UTF-8.
At first, I didn’t realize that input.txt was in UTF-16 format. I looked at the file in vi and saw that the first byte was garbage, and then every other byte was garbage. No problem, I thought, and proceeded to write a Perl script to throw away the garbage bytes. This worked well on small files, but input.txt was 2.8GB (2,885,966,416 bytes, to be exact), and it took the following amount of time to run (I used the time command):
real    39m38.253s
user    39m27.890s
sys     0m9.298s
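The script itself was basically an every-other-byte filter, something along these lines (a rough sketch of the idea rather than the exact code; it only works because plain ASCII stored as little-endian UTF-16 puts the character in the first byte of each pair and a null in the second):

#!/usr/bin/perl
use strict;
use warnings;

# Naive fix: treat input.txt as raw bytes and keep only the first byte
# of every 16-bit pair. Only valid for pure-ASCII UTF-16LE data.
open my $in,  '<:raw', 'input.txt'  or die "input.txt: $!";
open my $out, '>:raw', 'output.txt' or die "output.txt: $!";

# Skip the two-byte byte-order mark (FF FE) if the file starts with one.
read($in, my $bom, 2);
seek($in, 0, 0) unless $bom eq "\xFF\xFE";

# An even chunk size keeps the 16-bit pairs aligned across reads.
while (read($in, my $buf, 65536)) {
    $buf =~ s/(.)./$1/sg;    # keep the low byte, drop the (null) high byte
    print {$out} $buf;
}

close $in;
close $out;

It also falls apart the moment the data contains anything outside ASCII.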
Holy crap! That took 40 minutes. After discussing it with the person who generated the file, they told me it was encoded in UTF-16. I then tried modifying my Perl script to use the Encode module, but that didn’t work out well; I kept getting errors complaining about the encoding.
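For the record, a standard way to do this kind of conversion in Perl is to put an encoding layer on the filehandles rather than calling Encode by hand; a working version would presumably have looked something like this sketch (a reconstruction, not the script I was actually fighting with):

#!/usr/bin/perl
use strict;
use warnings;

# Decode UTF-16LE on the way in, encode UTF-8 on the way out.
# The CRLF line endings pass through unchanged.
open my $in,  '<:raw:encoding(UTF-16LE)', 'input.txt'  or die "input.txt: $!";
open my $out, '>:encoding(UTF-8)',        'output.txt' or die "output.txt: $!";

while (my $line = <$in>) {
    $line =~ s/^\x{FEFF}//;    # drop the byte-order mark if it survives decoding
    print {$out} $line;
}

close $in;
close $out;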
Enough was enough. I needed to know the details about this file, so I used the file command to find out exactly what was going on:
$ file input.txt
input.txt: Little-endian UTF-16 Unicode character data, with CRLF line terminators
Great. Now I know for certain that it’s UTF-16. I figured there had to be a Linux command-line file conversion utility, so I consulted Google. Sure enough, I found the ‘iconv’ utility.
$ time iconv -f utf16 -t utf8 -o output.txt input.txt

real    0m24.807s
user    0m13.495s
sys     0m7.362s
Wow… that was fast. I diffed my original output file (the one from my Perl script) against the new one, and they matched.
After all was said and done, I felt like a total idiot for not using the file command first to determine exactly what I was working with.