There are a number of utilities that search and display the contents of a file, similar to Unix grep, from the command console (or DOS prompt). These are very fast and return results from a 25,000-line file almost immediately after hitting "Enter". Unfortunately, the DOS prompt uses the DOS character mapping for character codes above decimal 127 (the "extended ASCII" range), and these differ from the Windows character maps.
Files edited with Windows programs that use Windows characters with decimal values above 127 will display incorrectly in the DOS window. For example, character 235 is the small "e" with a diaeresis (or umlaut) in Windows, but it is the lowercase Greek "delta" in the IBM PC character set, and that is how it displays in the DOS window. Here's an example using the following text from a German web page discussing Draeseke's Mass in A minor: "Draesekes Weg als Künstler war ein fortwährender Kampf um Anerkennung. Der von allen seinen Schülern am Dresdener Konservatorium hochgeachtete Professor für Komposition und Musikgeschichte fand mit seinen Werken nicht das Echo in der musikalischen Öffentlichkeit, wie es dem in dieser Hinsicht glücklicheren Zeitgenossen Brahms beschieden war." Notice that none of the characters with a diacritic mark displays correctly (see the capital Greek sigma instead of the lowercase "a" with an umlaut in the first line). This can be a real nuisance for users of console applications.
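The mismatch can be reproduced away from the console. The following is a minimal Python sketch, assuming the file was saved by a Windows program in code page 1252 and is then read the way a code page 437 console would read it. The helper name `describe` is made up for this illustration; `cp1252` and `cp437` are the names of the corresponding codecs in the Python standard library.

```python
import unicodedata


def describe(byte_value, codec):
    """Return the Unicode name of the character a codec assigns to one byte."""
    char = bytes([byte_value]).decode(codec)
    return unicodedata.name(char)


# The same byte, read with the Windows set and with the old DOS set.
print(describe(235, "cp1252"))  # LATIN SMALL LETTER E WITH DIAERESIS
print(describe(235, "cp437"))   # GREEK SMALL LETTER DELTA

# A word from the German sample: stored as cp1252 bytes, shown as cp437.
sample = "fortwährender"
misread = sample.encode("cp1252").decode("cp437")
print(unicodedata.name(misread[5]))  # GREEK CAPITAL LETTER SIGMA
```

Printing the character names rather than the characters themselves keeps the output readable even on a console that is still set to the wrong code page.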
The solution is to set the console code page. The default console in Windows (at least in versions with American English set as the default) uses code page 437, the old DOS/IBM PC character set. The command "mode con" will display the code page in use. It can be changed with the command "mode con cp select=nnnn", where nnnn is the code page to be selected. nnnn=850 gives the IBM "international set" (also pretty lame). nnnn=1252 sets it to the Windows (Western European/US) set, nnnn=28591 sets it to ISO 8859-1 Latin I, and nnnn=28592 sets it to ISO 8859-2 Eastern Europe (both ISO sets place the German umlauts at the same positions in the 128-255 range as the Windows set). Here are some code pages of interest:
Code page | Description
437 | MS-DOS United States
708 | Arabic (ASMO 708)
709 | Arabic (ASMO 449+, BCON V4)
710 | Arabic (Transparent Arabic)
720 | Arabic (Transparent ASMO)
737 | Greek (formerly 437G)
775 | Baltic
850 | MS-DOS Multilingual (Latin I)
852 | MS-DOS Slavic (Latin II)
855 | IBM Cyrillic (primarily Russian)
857 | IBM Turkish
860 | MS-DOS Portuguese
861 | MS-DOS Icelandic
862 | Hebrew
863 | MS-DOS Canadian-French
864 | Arabic
865 | MS-DOS Nordic
866 | MS-DOS Russian (former USSR)
869 | IBM Modern Greek
874 | Thai
932 | Japan
936 | Chinese (PRC, Singapore)
949 | Korean
950 | Chinese (Taiwan; Hong Kong SAR, PRC)
1200 | Unicode (BMP of ISO 10646)
1250 | Windows 3.1 Eastern European
1251 | Windows 3.1 Cyrillic
1252 | Windows 3.1 Latin 1 (US, Western Europe)
1253 | Windows 3.1 Greek
1254 | Windows 3.1 Turkish
1255 | Hebrew
1256 | Arabic
1257 | Baltic
1258 | Vietnamese
1361 | Korean (Johab)
20000 | CNS - Taiwan
20001 | TCA - Taiwan
20002 | Eten - Taiwan
20003 | IBM5550 - Taiwan
20004 | TeleText - Taiwan
20005 | Wang - Taiwan
20127 | US ASCII
20261 | T.61
20269 | ISO-6937
20866 | Russian - KOI8
21027 | Ext Alpha Lowercase
21866 | Ukrainian - KOI8-U
28591 | ISO 8859-1 Latin I
28592 | ISO 8859-2 Eastern Europe
28593 | ISO 8859-3 Turkish
28594 | ISO 8859-4 Baltic
28595 | ISO 8859-5 Cyrillic
28596 | ISO 8859-6 Arabic
28597 | ISO 8859-7 Greek
28598 | ISO 8859-8 Hebrew
28599 | ISO 8859-9 Latin Alphabet No.5
29001 | Europa 3
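To get a feel for which of these code pages would show a given Windows-edited file correctly, the comparison can be tried out in Python first. This is only a sketch: the table `CODECS` and the function `shows_correctly` are names invented for this example, the mapping covers just the five code pages discussed above, and it assumes the file was saved in code page 1252.

```python
# Windows code page number -> name of the matching Python standard library codec.
CODECS = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    28591: "iso8859_1",
    28592: "iso8859_2",
}


def shows_correctly(text, console_cp, file_cp=1252):
    """True if text saved in file_cp reads back unchanged under console_cp."""
    raw = text.encode(CODECS[file_cp])
    return raw.decode(CODECS[console_cp], errors="replace") == text


sample = "Künstler fortwährender für Öffentlichkeit glücklicheren"
for cp in CODECS:
    print(cp, shows_correctly(sample, cp))
```

For the German sample, 1252, 28591 and 28592 come out True while 437 and 850 come out False. Note that 28592 is not a general substitute for the Windows set: `shows_correctly("è", 28592)` is False, because ISO 8859-2 assigns that byte to a different letter.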
There are registry entries under the category NLS (National Language Support) that list the code pages and defaults, but simply adding the line "mode con cp select=1252" or "mode con cp select=28592" to the batch file that launches the grep-like application will change the console code page. Here's the same German text displayed after manually setting the code page to 28592. Umlauts are now displayed in their full glory.
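As an illustration of what such a launcher batch file might contain, here is a small Python sketch that builds its text. The function name `launcher_script` and the tool name "grep.exe" are placeholders rather than anything from a particular utility; the "mode con cp select=" line is the command described above, and "%*" is the usual batch-file way of passing all arguments along.

```python
def launcher_script(code_page, command):
    """Build the text of a batch file that sets the code page, then runs command."""
    lines = [
        "@echo off",
        # Switch the console code page; ">nul" hides the status message.
        f"mode con cp select={code_page} >nul",
        # Run the wrapped tool with whatever arguments the launcher received.
        f"{command} %*",
    ]
    return "\r\n".join(lines) + "\r\n"


# "grep.exe" stands in for whichever search utility is being wrapped.
script = launcher_script(1252, "grep.exe")
print(script)
```

Saving the printed text as, say, search.bat and calling that instead of the utility itself would set the code page each time before the search runs.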
In addition, the console should be set to display Unicode fonts. This can be made the default by right-clicking the title bar of the command console, clicking Properties, and then clicking the Font tab. "Lucida Console" is used here. Then apply the change using the "Save properties for future windows with same title" option. Other console parameters can be changed here or with the "mode con" command; type "mode con /?" at the DOS prompt for a list.
Useful sites:
Character Sets
Microsoft on "SetConsoleOutputCP"
ASCII Diacritics (ISO-8859-1 Latin-1)
Command line MODE command