I had over a thousand HTML files in a directory. They were unfortunately encoded in Windows-1252 and I wanted them all converted to UTF-8, but I wasn’t willing to open the files one by one or feed their names to a script (there were too many), so I needed a script that would operate on the whole directory and spit out the converted files in one fell swoop.
If you’re not familiar with encodings, the visible symptom is that Firefox displays the Unicode replacement character, a little black diamond with a question mark inside it, for bytes it can’t decode (I think they were mostly non-breaking spaces, curly quotes, and em-dashes in this case; plain tabs and spaces are ASCII and survive either encoding unchanged).
With help from friends and the internet I learned about the GNU/Linux command-line tool iconv, which handled this perfectly. Here’s the bash script I used to make it work on the entire directory at once:
#!/bin/bash
# Convert every .html file in the current directory from Windows-1252 to UTF-8
for i in *.html; do
    # iconv won't overwrite its input in place, so write to a temporary file first
    iconv -f WINDOWS-1252 -t UTF-8 "$i" -o "$i.utf8"
    # then move the converted copy back over the original
    mv "$i.utf8" "$i"
done
iconv won’t convert a file in place (writing over the file it’s still reading would destroy it), so the above script temporarily names the converted files *.utf8 and then moves them back over the original .html files. Hopefully this helps someone else.