HPFS and FAT filename characters

0. Contents of filechar.zip:

   FILECHAR.ABS    this text
   FILECHAR.CMD    OS/2 REXX script to create FILECHAR.nnn
   FILECHAR.437    file name characters for codepage 437
   FILECOLD.850    ditto codepage  850 (old 850 without euro)
   FILECHAR.850    ditto codepage  858 (new 850 with    euro)
   FILECHAR.004    ditto codepage 1004
   FILECHAR.W2K    ditto codepage 1252 among others on W2K
   FILECHAR.REX    NT ooREXX script to create FILECHAR.W2K

1. Introduction

   You don't need this file (FILECHAR.ABS) to use FILECHAR.CMD.

   FILECHAR.CMD is a trivial OS/2 REXX script used to determine
   all legal filename characters and their AKAs on HPFS and FAT.

   If you are only interested in my results see the five files
        FILECHAR.437            (new result after CHCP 437)
        FILECHAR.850            (new result after CHCP 850)
        FILECOLD.850            (old result after CHCP 850)
        FILECHAR.004            (new result after CHCP 1004)
        FILECHAR.W2K            (see below, result for NTFS)

   These files have been created on my system by commands like
        CHCP 437 & FILECHAR > FILECHAR.437
        CHCP 850 & FILECHAR > FILECHAR.850

   The old FILECOLD.850 reflects results before installing the
   new "Euro-codepage" (codepage 850 with Euro-symbol hex. D5).
   On my WARP 3 system "old" is fixpack 17, and "new" is e.g.
   fixpack 40.  I never intended to publish FILECHAR.CMD, but
   the differences between the results with and without the
   Euro symbol are IMHO quite alarming.

2. Configuration

   If all legal filename characters depending on file system
   (FAT vs. HPFS etc.), codepage (437 vs. 850 etc.), and even
   installed fixpack are documented somewhere, then please
   tell me where...  Until then FILECHAR.CMD works by trial
   and error.  You have to "configure" FILECHAR.CMD for your
   system by editing two lines, replace...

        HPFS.. = 'D:\TMP\'   /* HPFS directory */
        OFAT.. = 'F:\TMP\'   /*  FAT directory */

   ... by existing HPFS- and FAT-directories on your system.
   You may use root-directories, e.g. OFAT.. = 'C:', or even
   other file systems, as long as you have write access and
   know how to interpret the results.

   Hint:  FILECHAR.CMD deletes all created temporary files
   ---?---$, and this works faster on drives without "DELDIR".

3. Operation

   FILECHAR.CMD simply tries to create 255 files ---?---$ in
   both directories, where ? is hex. 01 .. hex. FF (255), by
   appending the character ? to the file ---?---$.  For some
   characters like # the file ---#---$ ends up containing only
   # on both FAT and HPFS in codepage 437 or 850.

   The file ---Z---$ probably contains Z and z for HPFS and
   FAT:  OS/2 would treat files zzz, ZZZ, zZz, etc. as the same
   file, although HPFS supports mixed case filenames.  If you
   have write access on a *NIX-filesystem then zzz, ZZZ, and
   zZz would be three different files.  The most (in)famous
   examples are makefile, Makefile, MAKEFILE, etc. ;-)

   The file ---E---$ may contain 6 characters in codepage 437:
   E, e, é, ê, ë, and è are treated as identical in file names.
   Of course FILECHAR.CMD does not only create ---?---$ files,
   it also evaluates and finally deletes these files.
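
   The trial-and-error idea above can be sketched in a few lines
   of Python.  This is not the REXX code of FILECHAR.CMD, only an
   illustration under assumptions: the function name
   probe_filename_chars is hypothetical, the candidates are
   Unicode characters rather than codepage bytes, and any error
   while creating or reading a file counts as "not supported".

```python
import os


def probe_filename_chars(directory, chars):
    """Probe which characters work in a file name of the form ---?---$.

    Returns (aliases, rejected): aliases maps each accepted character to
    the content of its file, i.e. all candidates that ended up in it.
    """
    def name_for(ch):
        return os.path.join(directory, "---" + ch + "---$")

    rejected = []
    for ch in chars:
        try:
            # Append mode: characters that the file system treats as
            # identical accumulate in one and the same file.
            with open(name_for(ch), "a", encoding="utf-8") as handle:
                handle.write(ch)
        except (OSError, ValueError):
            rejected.append(ch)

    aliases = {}
    for ch in chars:
        if ch in rejected:
            continue
        try:
            with open(name_for(ch), "r", encoding="utf-8") as handle:
                aliases[ch] = handle.read()
        except OSError:
            rejected.append(ch)

    # Finally delete the temporary files; an aliased name may refer to
    # a file that is already gone.
    for ch in aliases:
        try:
            os.remove(name_for(ch))
        except FileNotFoundError:
            pass
    return aliases, rejected
```

   On a case-insensitive file system the entry for Z would be
   expected to read Zz, on a typical *NIX file system just Z.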

4. Usage

        FILECHAR -h        usage info (ditto -u, -?, etc.)
        FILECHAR --        long result lines (up to 255 columns)
        FILECHAR           short result lines (up to  79 columns)

   The short format skips 0 .. 9 (known unique legal characters)
   to get fewer result lines.  The long format contains all valid
   filename characters, about 164 columns in "new" codepage 850.

   The output format should be obvious.  Characters in a line
   marked by HPFS (or FAT) are valid in HPFS (FAT) filenames.

   Characters in a line marked by "aka" are treated as identical
   to the character(s) in the same column, namely the character
   in the nearest HPFS- or FAT-line above it.  Short format only:
   If there is no aka-line below an HPFS- or FAT-line, then
   these characters are legal and unique.

   In long format you get exactly one long HPFS-line and one
   long FAT-line with as many aka-lines as needed in the worst
   case.  In the "new" codepage 850 there is only one aka-line,
   i.e. at most lower and upper case are treated as identical.
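
   The distinction between unique characters and aka-groups can
   be illustrated with a small Python sketch.  It assumes that
   the probe results are available as a mapping from each legal
   character to the content of its ---?---$ file; this layout and
   the function name split_unique_and_aliased are hypothetical,
   and the sketch does not reproduce the output of FILECHAR.CMD.

```python
def split_unique_and_aliased(aliases):
    """Split probe results into unique characters and alias groups."""
    unique = []
    groups = {}
    for ch, content in aliases.items():
        if content == ch:
            # The file holds only its own character: legal and unique.
            unique.append(ch)
        else:
            # Characters sharing one file see the same content, so the
            # content itself identifies the group.
            groups.setdefault(content, []).append(ch)
    return unique, sorted(groups.values())
```

   For the mapping {"#": "#", "E": "Ee", "e": "Ee"} this yields
   the unique character # and one group containing E and e.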

   Characters in a line marked by "not" are not supported on
   HPFS (or FAT).  HPFS does not support "/:<>\| in addition
   to anything below hex. 20 (32, space).  FAT does not support
   "+,./:;<=>[\]|.  Programs often have difficulties with the
   characters +,.;=[], which work on HPFS but not on FAT.
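
   The troublesome characters are simply the difference of the
   two "not" sets quoted above.  A short Python sketch, with the
   sets copied from this text and the control characters below
   hex. 20 left out; the names HPFS_NOT, FAT_NOT, and
   hpfs_only_chars are hypothetical:

```python
# Sets copied from the text; control characters are not included.
HPFS_NOT = set('"/:<>\\|')
FAT_NOT = set('"+,./:;<=>[\\]|')


def hpfs_only_chars(name):
    """Return the characters of name that HPFS accepts but FAT rejects."""
    return sorted(set(name) & (FAT_NOT - HPFS_NOT))
```

   For example hpfs_only_chars("notes+draft;v=2") reports +, ;,
   and =.  Like FILECHAR.CMD this ignores the position of a
   character within the name.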

5. Caveats

   FILECHAR.CMD only tests ---?---$.  So if the validity of a
   character depends on its position within a filename, then
   FILECHAR.CMD cannot detect it.  Examples:  leading or trailing
   spaces generally don't work, but spaces within a name are okay
   (even in a FAT, compare "WP ROOT. SF" etc.).  Trailing dots
   don't work on HPFS, a leading dot may have a special meaning
   (*NIX), many programs treat the last dot as THE DOT, and in a
   FAT dots are not supported (except for the implicit 8+3 dot).
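
   The positional examples above could be checked separately.
   The following Python sketch only encodes the rules of thumb
   from this section; the function name positional_warnings and
   the wording of the warnings are hypothetical, and real file
   systems may behave differently.

```python
def positional_warnings(name):
    """List position-dependent problems that ---?---$ cannot reveal."""
    warnings = []
    if name != name.strip(" "):
        warnings.append("leading or trailing space")
    if name.endswith("."):
        warnings.append("trailing dot (HPFS)")
    if name.startswith("."):
        warnings.append("leading dot (special meaning on *NIX)")
    return warnings
```

   A name like "WP ROOT. SF" passes without warnings, because
   its spaces and its dot are inside the name.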

   For "FAT" read "good old DOS FAT"; all I know about FAT32 is
   that it exists.

6. W2K and ooREXX

   The text above was written in 2002.  Six years later I
   repeated this test on W2K using the ooREXX script FILECHAR.REX.
   In essence it is the same old script:  I only renamed
   FILECHAR.CMD to FILECHAR.REX, replaced all "HPFS" by "NTFS",
   and used the directories C:\TMP for the FAT16-tests and D:\TMP
   for the NTFS-tests.

   For the result see FILECHAR.W2K.  Good news, apparently the
   supported characters do NOT depend on the actual codepage.
   In other words the results after CHCP 850 and CHCP 1252 were
   identical.  It is still interesting to read FILECHAR.W2K after
   a CHCP 1252 (or 1004 on OS/2), as this shows the simple logic:

   All windows-1252 letters are treated as case-insensitive, but
   on a FAT four are mapped to similar US-ASCII characters.  The
   four special pairs are umlauted Y, Scaron, Zcaron, and OElig,
   i.e. all pairs with "ANSI" letters in the range 0x80 to 0x9F.
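
   The four pairs can be cross-checked without any file system.
   The Python sketch below relies on Python's own cp1252 codec
   and Unicode case mapping, which is an assumption about what
   NT considers a case pair; the function name cp1252_case_pairs
   is hypothetical.  It lists all upper/lower case pairs with at
   least one member in the range 0x80 to 0x9F.

```python
def cp1252_case_pairs(lo=0x80, hi=0x9F):
    """Return (byte, byte) case pairs touching the range lo..hi."""
    pairs = set()
    for code in range(lo, hi + 1):
        try:
            ch = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is undefined in Python's cp1252 codec
        other = ch.lower() if ch.isupper() else ch.upper()
        if other == ch or len(other) != 1:
            continue  # no case counterpart
        try:
            other_code = other.encode("cp1252")[0]
        except UnicodeEncodeError:
            continue  # counterpart does not exist in windows-1252
        pairs.add((min(code, other_code), max(code, other_code)))
    return sorted(pairs)
```

   With these assumptions the result consists of 0x8A/0x9A
   (Scaron), 0x8C/0x9C (OElig), 0x8E/0x9E (Zcaron), and 0x9F/0xFF
   (umlauted Y).  The sketch says nothing about the US-ASCII
   characters they are mapped to on a FAT.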

   Ten non-letters in the range 0x80 to 0x9F are also mapped to
   similar US-ASCII characters on a FAT, permille to percent is
   an example.

   On a FAT 0x85 (hellip, three dots) is an oddity; I'm not sure
   how to interpret the result.  Creating ---?---$ for ? := 0x85
   fails in the SysFileTree() existence test, therefore 0x85 is
   noted as "not permitted" on a FAT.  But a file with long name
   ---.---$ was created for the ordinary 0x2E dot, 0x85 ended up
   in this ---.---$ file, and was counted as an alias for 0x2E.
   Unless you know what you are doing, better stay away from
   using 0x85 in NT file names on a FAT… :-)

Last update: 26 Sep 2008 12:00 by F.Ellermann