Zpravodaj Československého sdružení uživatelů TEXu, ISSN 1211-6661, eISSN 1213-8185, http://bulletin.cstug.cz/
Issue homepage: http://bulletin.cstug.cz/bul20032.shtml, URL to PDF: http://bulletin.cstug.cz/pdf/bul_032.pdf
Volume 13, Number 2, pages 98-106, 2003. BibTEX source; DOI:10.5300/2003-2/98
Published by CSTUG, printed and distributed by Konvoj, s. r. o.
New encTEX -- the UTF-8 encoding in TEX (in Czech)
Abstract: The UTF-8 encoding keeps the standard ASCII characters unchanged and encodes the accented letters of our alphabets in two bytes. Standard 8-bit TEX is not ready for UTF-8 input because it has to handle each such character as two tokens. This means that you cannot set the catcode, uccode, etc. of these characters, and you cannot apply futurelet to the next character in the usual sense. The second version of my encTEX solves these problems.
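The byte counts mentioned above can be checked with a short Python sketch. It is only an illustration of UTF-8 itself and not part of encTEX; the helper name utf8_bytes is hypothetical.

```python
def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 byte values of a single character."""
    return list(ch.encode("utf-8"))


# "a" is plain ASCII; the other three are accented Czech letters
# (written as escapes so the source file itself stays ASCII).
samples = ["a", "\u00e1", "\u010d", "\u017e"]  # a, a-acute, c-caron, z-caron

for ch in samples:
    codes = " ".join(f"{b:02X}" for b in utf8_bytes(ch))
    print(f"U+{ord(ch):04X}: {len(utf8_bytes(ch))} byte(s): {codes}")
```

An 8-bit engine that reads such input byte by byte sees each accented letter as two separate characters, which is the problem the abstract describes.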
encTEX is fully backward compatible with the original TEX. It adds ten new primitives with which you can set or read the conversion tables used by the input processor of TEX or used during output to the terminal, log, and write files.
The second version makes it possible to convert multi-byte sequences to one byte or to a control sequence. You can implement up to 256 UTF-8 codes as one byte and an unlimited number of other UTF-8 codes as control sequences. All internals of 8-bit TEX work in the same way as if a normal one-byte encoding of input files were used.
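The idea of such a conversion can be sketched in Python. This is a minimal illustration under stated assumptions, not encTEX's actual implementation or syntax: the table contents, the ISO-8859-2 target codes, the control sequence name \euro, and the function convert_line are all hypothetical.

```python
# Hypothetical conversion table: a UTF-8 byte sequence maps either to
# one byte (int) or to a control sequence name (str).
TABLE = {
    b"\xc3\xa1": 0xE1,          # a-acute -> 0xE1 (ISO-8859-2 assumed internally)
    b"\xc4\x8d": 0xE8,          # c-caron -> 0xE8 (ISO-8859-2 assumed internally)
    b"\xe2\x82\xac": "\\euro",  # euro sign -> assumed control sequence
}


def convert_line(raw: bytes, table=TABLE) -> list:
    """Convert an input line; ints are single bytes, strs are control sequences."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(raw):
        # Try the longest multi-byte sequence first (length 2 or more).
        for n in range(min(max_len, len(raw) - i), 1, -1):
            chunk = raw[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            # No multi-byte match: pass the byte through unchanged.
            out.append(raw[i])
            i += 1
    return out


line = "\u010d\u20aca".encode("utf-8")  # c-caron, euro sign, "a"
print(convert_line(line))
```

After such a conversion, every accented letter in the table reaches the rest of the engine as one byte, so per-character settings apply to it as they would with a one-byte input encoding.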
I think that the UTF-8 encoding will come into more common use. In such a situation, there is no other way than to modify the input processor of TEX; otherwise 8-bit TEX will die in a short time.
Nový encTEX -- kódování UTF-8 v TEXu
Author
Petr Olšák
Webpage prepared by editors of the journal.