TL;DR

  1. Is there a way to support display of U+X?
  2. How many character blocks will it take to display U+X?
  3. Will U+X be displayed in independent multicolor?

Introduction

I’ve been on a bit of a research kick in Unicode lately, and I’ve long been a C language user, since before UTF-8 became the “winning” encoding for Unicode. I’ve also long been a bit of an enthusiast for Virtual Terminals, especially as I started programming before fully integrated development environments for every language were the default. That is, I learned to program C on a DOS console, and have since mostly had jobs creating code on a remote UNIX/Linux system using a virtual terminal of some sort.

Unicode and Terminals

When creating an interface meant to run on a terminal, there are some limitations that are not often thought about when dealing with Unicode glyphs from non-European languages. In the early days of Unicode, this was dealt with mostly by Japanese computer manufacturers, and the work put into that back in the early 1990s is captured in the ubiquitous wcwidth() and wcswidth() functions (part of POSIX as an XSI extension, but not of ISO C), which are, in turn, dependent upon data from Unicode. There are exceptions, but those functions are primarily built from the data in the Unicode file EastAsianWidth.txt.

Both of these functions and the concept behind EastAsianWidth were created when “Internationalization” meant UCS-2 (or 16-bit wide character encoding), and, at the time, there were no Unicode points past the 16-bit UCS-2 size. Most importantly, that model basically says that a character can take up 0, 1, or 2 terminal character widths, with -1 meaning non-printable or undetermined.
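To make the 0/1/2/-1 model concrete, here is a minimal sketch in Python (which has no wcwidth() in its standard library) that derives a width from the Unicode data shipped with the interpreter. The function name approx_width is my own, and the rules are a simplification, not a copy of any libc implementation; real wcwidth() tables differ in details such as the soft hyphen.

```python
import unicodedata

def approx_width(ch: str) -> int:
    """Approximate a wcwidth()-style answer: -1, 0, 1, or 2 cells.

    Hypothetical helper for illustration; not the algorithm of any real libc.
    """
    cp = ord(ch)
    if cp == 0:
        return 0   # NUL occupies no cell
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1  # C0/C1 control characters are not printable
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0   # combining marks and format characters (simplified)
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2   # Wide and Fullwidth
    return 1       # everything else, including Neutral and Ambiguous

# Letter, combining acute, CJK ideograph, WATCH, BEL.
for ch in ("a", "\u0301", "\u4e2d", "\u231a", "\x07"):
    print(f"U+{ord(ch):04X} -> {approx_width(ch)}")
```

Note that nothing in this model can ever return 3 or more, and nothing in it knows which font the terminal will actually use.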

UCS-2 has been impractical for nearly 25 years, with Unicode code points now running up to U+10FFFF (commonly stored in 32 bits), and many language supplements (and the even more common Emoji) sitting outside of the old UCS-2/16-bit range. That is, the width model was never updated to include archaic languages, or languages out of the global south.

This is even a problem with Emoji: EastAsianWidth.txt declares that U+231A “WATCH” is 2 character blocks wide, but many font sets will happily render it in only 1, as it can be fully rendered that way. That is, EastAsianWidth.txt is not actually a reliable predictor of what a given terminal will draw.

More importantly, there are several languages in Unicode that have single glyphs that take MORE than 2 character blocks.

         123
U+111E5  𑇥
         12345
U+1242E  𒐮

Above, the first example takes 3 blocks (on my screen, with my fonts), and the second example uses 5.
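As a rough check of what the Unicode data says about the characters mentioned so far, the following sketch looks up the East Asian Width property with Python's unicodedata module and maps it to the 1-or-2 answer a data-driven function would give. The helper name data_width is hypothetical, and the result depends on the Unicode version bundled with your Python.

```python
import unicodedata

def data_width(cp: int):
    """Return (East Asian Width class, cells a data-driven lookup would claim)."""
    eaw = unicodedata.east_asian_width(chr(cp))
    cells = 2 if eaw in ("W", "F") else 1
    return eaw, cells

for cp in (0x231A, 0x111E5, 0x1242E):
    # The name lookup falls back to a placeholder on older Unicode data.
    name = unicodedata.name(chr(cp), "(no name in this Unicode version)")
    eaw, cells = data_width(cp)
    print(f"U+{cp:04X} {name}: class {eaw}, claimed width {cells}")
```

On the data I have, neither of the two wide glyphs above is classed as Wide, so a lookup like this claims 1 block for each, against the 3 and 5 blocks I actually see.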

Missing Unicode Terminal Functions

I’ve talked about this before, but, by far, the most common virtual terminal emulators that are available for current hardware are descendants of the ANSI X3.64 (ECMA-48) terminal control standards, importantly with non-standard extensions from Digital Equipment Corporation for their VTxxx line of terminals.

What I think is needed is for even one relatively popular terminal emulator to add a few functions to specifically support a few new queries.

  1. Is there a way to support display of X?
  2. How many character blocks will it take to display X?
  3. Will this be displayed in independent multicolor?

Why In Terminal

These are each questions about locally installed and configured fonts, which cannot be known by a remote algorithm. Not all fonts even follow the EastAsianWidth.txt widths. I’ve seen Chinese 3, ㆔ (U+3194), rendered in 1 or 2 blocks in different terminals on the same system because of font configuration differences. These are also things that a terminal literally has to figure out when it is asked to render that character, so allowing a program to ask before display would be a useful addition.
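For contrast, the closest thing available today is to ask after the fact: print the glyph, then use the standard cursor position report (send ESC [ 6 n, receive ESC [ row ; col R) to see how far the cursor moved. The sketch below shows that idea with the terminal I/O passed in as two callables, write and read_reply, which are placeholders of my own; a real program would need the tty in raw mode and would have to handle line wrapping, which this ignores.

```python
import re
from typing import Callable, Tuple

CPR_QUERY = b"\x1b[6n"  # DSR 6: ask the terminal to report the cursor position
CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")

def parse_cpr(reply: bytes) -> Tuple[int, int]:
    """Parse a cursor position report into (row, column), both 1-based."""
    m = CPR_REPLY.search(reply)
    if m is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(m.group(1)), int(m.group(2))

def measure_cells(text: str,
                  write: Callable[[bytes], object],
                  read_reply: Callable[[], bytes]) -> int:
    """Print text at the start of the line and return how many columns it used.

    Assumes the text fits on one line (no wrap at the right margin).
    """
    write(b"\r" + CPR_QUERY)                 # go to column 1, ask where we are
    _, before = parse_cpr(read_reply())
    write(text.encode("utf-8") + CPR_QUERY)  # draw the probe, ask again
    _, after = parse_cpr(read_reply())
    write(b"\r\x1b[K")                       # erase the probe
    return after - before

# Simulated exchange: a terminal that advanced three columns for the glyph.
sent = []
replies = iter([b"\x1b[5;1R", b"\x1b[5;4R"])
print(measure_cells("\U000111E5", sent.append, lambda: next(replies)))
```

This works, but only by drawing the glyph first and cleaning up afterwards, which is exactly the kind of workaround a proper query would make unnecessary.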

What Good Would That Do?

This is a thing that’s been happening with Terminals, post-hardware. New, unique features introduced in one decent terminal tend to migrate into others. 24-bit color support never existed in hardware terminals, but has spread to become fully supported across a wide number of terminals. “Set the Window title” was made available on xterm on a few UNIX systems, and has become available on almost every terminal in use now. Command marks (shell integration) were introduced in iTerm2, and have been adopted into WezTerm and KiTTY.
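Each of those features is, on the wire, nothing more than an escape sequence that terminals agreed to recognize. A small sketch of the three, as I understand the commonly used forms (the helper names are mine, and the strings are only built and shown with repr here, not sent to a terminal):

```python
ESC = "\x1b"
BEL = "\x07"

def truecolor_fg(r: int, g: int, b: int) -> str:
    """SGR 38;2: set the foreground to a 24-bit RGB color."""
    return f"{ESC}[38;2;{r};{g};{b}m"

def set_title(title: str) -> str:
    """OSC 2: set the window title, terminated here with BEL."""
    return f"{ESC}]2;{title}{BEL}"

def prompt_mark() -> str:
    """OSC 133;A: the 'prompt starts here' mark used for shell integration."""
    return f"{ESC}]133;A{BEL}"

for seq in (truecolor_fg(255, 128, 0), set_title("build log"), prompt_mark()):
    print(repr(seq))
```

A width or font-support query could spread the same way: one terminal defines a sequence and a reply format, and others pick it up if it proves useful.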

Outro

Anyway, those are my thoughts.