Signed-off-by: Dan Fandrich <dan@coneharvesters.com> Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
		
			
				
	
	
		
			72 lines
		
	
	
		
			2.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			72 lines
		
	
	
		
			2.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
	Unicode support in busybox
 | 
						|
 | 
						|
There are several scenarios where we need to handle unicode
 | 
						|
correctly.
 | 
						|
 | 
						|
	Shell input
 | 
						|
 | 
						|
We want to correctly handle input of unicode characters.
 | 
						|
There are several problems with it. Just handling input
 | 
						|
as sequence of bytes would break any editing. This was fixed
 | 
						|
and now lineedit operates on the array of wchar_t's.
 | 
						|
But we also need to handle the following problematic moments:
 | 
						|
 | 
						|
* It is unreasonable to expect that output device supports
 | 
						|
  _any_ unicode chars. Perhaps we need to avoid printing
 | 
						|
  those chars which are not supported by output device.
 | 
						|
  Examples: chars which are not present in the font,
 | 
						|
  chars which are not assigned in unicode,
 | 
						|
  combining chars (especially trying to combine bad pairs:
 | 
						|
  a_chinese_symbol + "combining grave accent" = ??!)
 | 
						|
 | 
						|
* We need to account for the fact that unicode chars have
 | 
						|
  different widths: 0 for combining chars, 1 for usual,
 | 
						|
  2 for ideograms (are there 3+ wide chars?).
 | 
						|
 | 
						|
* Bidirectional handling. If user wants to echo a phrase
 | 
						|
  in Hebrew, he types: echo "srettel werbeH"
 | 
						|
 | 
						|
	Editors (vi, ed)
 | 
						|
 | 
						|
This case is a bit similar to "shell input", but unlike shell,
 | 
						|
editors may encounter many more unexpected unicode sequences
 | 
						|
(try to load a random binary file...), and they need to preserve
 | 
						|
them, unlike shell which can afford to drop bogus input.
 | 
						|
 | 
						|
	more, less
 | 
						|
 | 
						|
Need to correctly display any input file. Ideally, with
 | 
						|
ASCII/unicode/filtered_unicode option or keyboard switch.
 | 
						|
Note: need to handle tabs and backspaces specially
 | 
						|
(bksp is for manpage compat).
 | 
						|
 | 
						|
	cut, fold, watch
 | 
						|
 | 
						|
May need ability to cut unicode string to specified number of wchars
 | 
						|
and/or to specified screen width. Need to handle tabs specially.
 | 
						|
 | 
						|
	sed, awk, grep
 | 
						|
 | 
						|
Handle unicode-aware regexp match
 | 
						|
 | 
						|
	ls (multi-column display)
 | 
						|
 | 
						|
ls will fail to line up columnar output if it will not account
 | 
						|
for character widths (and maybe filter out some of them, see
 | 
						|
above). OTOH, non-columnar views (ls -1, ls -l, ls | car)
 | 
						|
should NOT filter out bad unicode (but need to filter out
 | 
						|
control chars (coreutils does that). Note that unlike more/less,
 | 
						|
tabs and backspaces need not special handling.
 | 
						|
 | 
						|
	top, ps
 | 
						|
 | 
						|
Need to perform filtering similar to ls.
 | 
						|
 | 
						|
	Filename display (in error messages and elsewhere)
 | 
						|
 | 
						|
Need to perform filtering similar to ls.
 | 
						|
 | 
						|
 | 
						|
TODO: write an email to Asmus Freytag (asmus@unicode.org),
 | 
						|
author of http://unicode.org/reports/tr11/
 |