Done Error: the BDEBUGGER displays the cyrillic CP 866 characters incorrectly

I tried to open a .TXT file containing Russian text encoded in code page 866 (the output of a .BAT file I'm developing) in TCC's IDE. The IDE displays the text incorrectly and has no setting to correct this. Interestingly, .CMD files with CP 866 characters are displayed correctly in the IDE.
 

Attachments

  • BDEBUGGER.with CP866 chars.error.window.gif
  • Notepad.with CP866 chars.window.gif
Without one of the above, your file will be treated as ASCII, and the characters will be displayed based on the active codepage.

Interpreting high-order OEM characters according to the current code page would be the Right Thing. But it seems BDEBUGGER/IDE/TCEDIT is not interpreting them per the code page; it displays them as hex numbers in reverse video.
 
Adding: .BAT / .BTM / .CMD files should be interpreted per the current OEM code page, since that's how TCC sees them in batch files.

Whether other extensions, e.g. .TXT, should use the Windows (“ANSI”) code page instead... I could argue that one either way!
 
Most editors let users set the code page used for displaying text files. Could you add this option in a future version? It seems very strange when a program displays the same text differently depending on the filename extension.

My .CMD file builds some e-mail text in Russian using redirected ECHO commands, and I wanted to see how that text is added to the output file. I opened the output file in TCC's IDE and saw garbage in place of the Russian words.
 
After some testing, I find that BDEBUGGER / IDE / TCEDIT consistently display inverted hex for all high-order characters in 8-bit-encoded files. The file extension and code page don't matter; it always happens. Dmitry, I have to wonder if your .CMD file isn't actually UTF-16.

TCEdit - High OEM characters.png


My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.

On the bright side, TCEDIT et al. seem quite happy dealing with high Unicode characters — those outside the BMP. Many Windows programs don't even try:

TCEdit - Smilies.png
 
My OEM code page is 437, and my Windows code page is 1252. I'm guessing that the editor control isn't trying to remap these, just making them Unicode code points with the same values — C1 control characters.
But in that case, 0xA0 through 0xFF should be printable Latin-1 characters. They aren't. So... I don't know what the editor is doing with high-order characters in OEM files, but it isn't right.
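A quick way to check that reasoning, as a minimal Python sketch: it only illustrates the "same value as a Unicode code point" hypothesis and says nothing about what the editor control really does. The helper name is made up for this example.

```python
import unicodedata

def category_of_byte_as_code_point(byte_value: int) -> str:
    """Unicode general category of the code point with the same value as the byte."""
    return unicodedata.category(chr(byte_value))

# Bytes 0x80-0x9F taken as code points U+0080-U+009F: the C1 control block.
c1_categories = {category_of_byte_as_code_point(b) for b in range(0x80, 0xA0)}

# Bytes 0xA0-0xFF taken as code points U+00A0-U+00FF: the Latin-1 Supplement.
latin1_categories = {category_of_byte_as_code_point(b) for b in range(0xA0, 0x100)}

print(c1_categories)              # {'Cc'}  (all control characters)
print("Cc" in latin1_categories)  # False   (no control characters here)
```

So if the editor were simply reusing byte values as code points, only 0x80 through 0x9F would be controls, and everything from 0xA0 up would show as some Latin-1 glyph rather than as hex.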
 
Yes, you are right! I had forgotten that I had written that .CMD file in Unicode! Sorry!

But an option to set the code page for displaying foreign files would still be convenient.
 
The standard Windows font selection dialog lets you set the "character set". Both your IDE.EXE and TCEDIT.EXE call this dialog but, for some hard-to-understand reason, continue to show hex codes instead of Cyrillic characters even with the "Cyrillic" character set selected in the dialog (see the attached screenshot, BDEBUGGER.with CP1251 chars & Font Selection dialog.gif).
 
Both your IDE.EXE and TCEDIT.EXE call this dialog but, for some hard-to-understand reason, continue to show hex codes instead of Cyrillic characters even with the "Cyrillic" character set selected in the dialog (see the attached screenshot)

Just to clarify a little bit, this isn't a problem only with Cyrillic or Russian or code page 866. The editor control does this to all high-order characters in non-Unicode files. Anybody using non-ASCII characters in an 8-bit text file will have the same issue. Here's a text file using some CP1252 characters:

TCEdit - CP1252.png
 
Okay, I've been looking over the Scintilla documentation and I think I have a better understanding of what's going on. Their docs on "Character representations" say that:
Invalid bytes are shown in a similar way with an 'x' followed by their value in hexadecimal, like "xFE".
That sounds familiar. But... what's an "invalid byte"? In an 8-bit text file, any byte should be a valid character. So... maybe Scintilla is thinking this file uses some other encoding. If it were UTF-8, then yeah; these scattered high-order bytes probably would not form any valid character.

I downloaded the demo text editor, SciTE, and tried opening my silly text file in that:

SciTE-1.png


Looks great. But SciTE has a File / Encoding submenu, which TCEdit et al. lack. I check, and it's set to "Code Page Property" — which sounds good — but I can change it. And if I set it to UTF-8:

SciTE-2.png


That looks very familiar. So my theory is that TCEdit (IDE, BDEBUGGER) is not correctly recognizing OEM text files; they get misinterpreted as UTF-8.

Which explains another mystery, one that I'd been ignoring. Dmitry, in your first screen shot, somehow there is a Chinese character amongst the hexadecimal. The three Cyrillic letters ч и к are encoded in CP866 as 0xE7 0xA8 0xAA. Which just happens to form a valid UTF-8 sequence. It works out to U+7A2A, which is 稪. Or, in Japanese, mojibake.
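The arithmetic is easy to reproduce. Here is a small Python sketch using only the standard `cp866` and `utf-8` codecs; the variable names are just for illustration.

```python
# Encode the three Cyrillic letters the way a CP866 (OEM) file stores them.
cp866_bytes = "чик".encode("cp866")
print(cp866_bytes.hex(" "))  # e7 a8 aa

# Now misread those same three bytes as UTF-8: they form one valid
# three-byte sequence, so they decode without error to a single character.
misread = cp866_bytes.decode("utf-8")
print(len(misread), f"U+{ord(misread):04X}")  # 1 U+7A2A
```

Most runs of CP866 text will not form valid UTF-8 by accident, which would be why the rest of the line shows up as hex instead.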

@Rex: Shall I send you my LooksLikeUTF8() function?
 
Well, I understand the underlying issue now. May I ask when this will be corrected? Or don't you consider it an error?
I don't speak for Rex. But in my opinion, this is (A) a bug, and (B) easily fixable.

Most batch files use OEM encoding. No batch files, so far as I know, use UTF-8. So if there were absolutely no way to distinguish between the two, BDEBUGGER should assume OEM.

(But in fact, it's not difficult to recognize UTF-8, even without a BOM.)
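For illustration only, here is one way such a check could look in Python. This is not the LooksLikeUTF8() function mentioned earlier in the thread, which was never posted; the name `looks_like_utf8` and the exact rules are assumptions made for this sketch.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic guess: does this byte string look like UTF-8 text?

    Assumed rules for this sketch (not taken from any real editor):
      * a UTF-8 BOM counts as UTF-8;
      * pure 7-bit data gives no evidence either way, so report False
        and let the caller fall back to the OEM code page;
      * otherwise every high-order byte must belong to a valid UTF-8 sequence.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return True
    if all(byte < 0x80 for byte in data):
        return False
    try:
        data.decode("utf-8")  # strict decoding rejects stray high bytes
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Привет".encode("utf-8")))  # True
print(looks_like_utf8("Привет".encode("cp866")))  # False
print(looks_like_utf8(b"echo hello"))             # False
```

A check like this can still be fooled by very short inputs: the three CP866 bytes for "чик" on their own are valid UTF-8. On a whole file the odds of that are small, but a caller might still want to require that the entire file validates, not just one line.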
 
How can I make TC's IDE correctly display my .CMD file with CP 1251 Cyrillic characters in the ECHO and PAUSE commands? Now it displays .
The font settings in IDE.EXE have no effect on the display. internationalisation problem of the TCC's IDE.screenshot.gif
 
If you start CMD.EXE from the Win-R "Run" dialog (not from within Take Command or TCC), what does the CHCP command report?
 
Then I don't know where IDE is getting code page 437 from!

TCEDIT seems to do the same thing, by the way.
 
It seems the IDE tries to display the file with code page 866, not 437. Code page 866 is set by my TCSTART.BAT file. But it is not correct to assume that all batch files use the code page set by TCSTART.BAT. The Russian code pages are 866 and 1251, and both may be used on one computer. The IDE should let the user choose the code page independently of what is set in the startup files.
 
It seems the IDE tries to display the file with code page 866, not 437.

Quite right, so it is. My mistake.

If IDE and TCEDIT assume the system OEM code page by default -- that seems like a sensible design decision to me. But it would be nice if there was a menu option to select the code page.
 
Can you please add this option in some future version? There is an option Настройки → Шрифт [Settings → Font] in the IDE's menu now, but it does not change the code page.
 
The editor in IDE and TCEDIT does everything in UTF8.

If you edit a UTF16 file, it is converted to UTF8 and everything is displayed as expected.

If you edit a UTF8 file, it doesn't need to convert anything and everything is displayed as expected.

If you edit an ASCII file, it has to be converted to UTF8. The editor does this using CP_OEMCP.
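To make that last case concrete, here is a rough Python sketch of an "8-bit file to UTF-8" step. The editor itself presumably does this through the Windows API; here an explicit codec name stands in for CP_OEMCP, and the function name `oem_to_utf8` and its `oem_codec` parameter are hypothetical.

```python
def oem_to_utf8(raw: bytes, oem_codec: str = "cp866") -> bytes:
    """Reinterpret 8-bit text using an assumed OEM code page and return UTF-8."""
    return raw.decode(oem_codec).encode("utf-8")


text = "Привет"

# A file that really is CP866 converts cleanly.
from_cp866 = oem_to_utf8(text.encode("cp866")).decode("utf-8")
print(from_cp866 == text)   # True

# A CP1251 file pushed through the same CP866 assumption converts without
# any error, but the result is different (wrong) characters.
from_cp1251 = oem_to_utf8(text.encode("cp1251")).decode("utf-8")
print(from_cp1251 == text)  # False
```

The second case is the one a per-file code page option would address: the conversion cannot tell which 8-bit code page the file was written in, so it needs to be told.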

The only good solution is to use Unicode. The awkward solution is to add an option to the IDE to specify (either on a per-file basis or for all subsequent files) a code page. Code pages are definitely the old tech way to handle it -- particularly since Windows cannot always convert reliably from ASCII -> Unicode -> ASCII and get the same results.
 
The default Russian OEM (terminal) code page is CP866.
CP1251 is the default Russian ANSI (Windows GUI, non-Unicode) code page. It is not normally used in the terminal.
If, as @rconn said, the IDE is using OEMCP-to-UTF-8 conversion and you see wrong results, it may mean that your OEM code page is wrong.
If you are using Windows 10, please check that your "locale for non-Unicode programs" is not set to "Unicode".
 
