Unicode Conversion Gateway for Indian Language Newspapers

Posted by themischord on December 22, 2006

The lack of usable content in Indian languages has been a major problem for a long time. Quite a few Indian language newspapers and magazines have online portals, but unfortunately they use proprietary fonts and encodings. This makes the pages good only for viewing. They cannot be indexed by search engines or used to build a text corpus, which makes valuable content practically useless. For instance, if one searches for a Telugu word in Google, there will be no results from which is a Telugu newspaper.

The Swecha team at TCS has set up a Unicode Conversion Gateway for some popular Indian language newspapers in Hindi, Telugu, Tamil, Gujarati, Kannada and Malayalam. For the tech-savvy: the gateway is a proxy server that fetches pages from the original server, converts them into a Unicode-based encoding (Unicode being a standard), and serves the converted pages to the user. It uses a modified version of the Padma Firefox extension to do the conversion.
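To give a rough idea of what the conversion step involves, here is a minimal Python sketch. It is not the gateway's or Padma's actual code. The table `FONT_TO_UNICODE` and the function `to_unicode` are made up for illustration, and the glyph codes in the table do not belong to any real newspaper font. The idea is simply that a proprietary font stores Telugu glyphs under ordinary Latin character codes, so the converter looks up each code (longest match first) and emits the corresponding Unicode characters.

```python
# Hypothetical glyph-code table. Real proprietary fonts use their own,
# much larger tables; these four entries exist only for illustration.
FONT_TO_UNICODE = {
    "kA": "\u0C15\u0C3E",  # TELUGU LETTER KA + TELUGU VOWEL SIGN AA
    "k": "\u0C15",         # TELUGU LETTER KA
    "m": "\u0C2E",         # TELUGU LETTER MA
    "l": "\u0C32",         # TELUGU LETTER LA
}


def to_unicode(text, table=FONT_TO_UNICODE):
    """Convert font-encoded text to Unicode using greedy longest-match lookup."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kA" wins over "k".
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry: pass the character through unchanged
            # (spaces, punctuation, digits and so on).
            out.append(text[i])
            i += 1
    return "".join(out)


converted = to_unicode("kAm l")
```

A real gateway would run something like this over the text of each fetched page and then serve the result with a Unicode charset such as UTF-8 declared, so that browsers and crawlers interpret it correctly.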

Since search engines understand Unicode-based encodings (like UTF-8), they will be able to index the pages through this gateway. A Unicode-encoded page has other advantages too: one can search for a word within the page, and copy and paste the Indian language text from it just like regular English text.
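The difference is easy to see in a small Python sketch. The string `font_encoded_page` below is an invented stand-in for what a page in a proprietary font encoding might contain (Latin character codes that only look like Telugu when the right font is installed); it is not taken from any real site. A Telugu query matches the Unicode page but not the font-encoded one.

```python
# "Telugu news" written in Telugu script, as real Unicode code points.
unicode_page = (
    "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"          # the word "Telugu"
    " "
    "\u0C35\u0C3E\u0C30\u0C4D\u0C24\u0C32\u0C41"    # the word "vaartalu" (news)
)

# Invented example of font-encoded content: just Latin codes underneath.
font_encoded_page = "elugu vArtalu"

# The word "Telugu" in Telugu script, as a user would type it in a search box.
query = "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"


def contains(page, word):
    """Plain substring search, which is roughly what find-in-page does."""
    return word in page


found_in_unicode = contains(unicode_page, query)
found_in_font_encoded = contains(font_encoded_page, query)

# The same text as UTF-8 bytes, which is what actually travels over the wire.
utf8_bytes = unicode_page.encode("utf-8")
```

Here `found_in_unicode` is `True` and `found_in_font_encoded` is `False`. The UTF-8 bytes decode back to exactly the same text, which is why copy and paste works as expected.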

