This document should be read in good humor. It is well known that writing technical docs is boring. But I don't agree ;-)

PW32: Philosophy & Implementation

General principles

PW32 has its roots in frustration with CygWin and amusement with DJGPP. From DJGPP it takes its runtime library as a base, its structure, its packaging conventions, and its debugger. From familiarity with CygWin it takes the desire to be more efficient and sensible, as well as attentive to Win9x chores. But even all of the above is not enough for such a child library to mature. So, where DJGPP doesn't show a good example, and CygWin shows a bad one, Linux is used as the reference Unix/POSIX implementation. That goes so far as to shying away from its Win32 nature and pretending to be Linux itself. (Why so? Because, in our times, GNU software is studded with such pearls as

#ifdef _WIN32
#define getpid() GetCurrentProcessId()
#endif
). Of course, that's a little insolent (I mean masquerading as Linux), but let it stay that way for now.

Files and Filesystems

Filesystem Extensions

The first thing worth noting about PW32's filesystem philosophy and implementation is that it supports DJGPP's Filesystem Extensions (FSEXT). This means that the semantics of POSIX-level file-handling functions may be completely redefined on a per-object basis by user-level code. This is a very powerful feature of DJGPP, but with PW32, running in a multitasking environment with dynamic loading and networking, the capabilities are really amazing. Imagine filesystem drivers built as DLLs and loaded by highly configurable means (such as a system-supported per-application config file or the environment). Then any existing program may use /dev/hd??, /dev/random, /dev/fb, FTP, HTTP, AFS, etc., while those which don't need this array will work efficiently by the underlying system-provided means. As a dumber example of what can be done, those folks who don't agree with the just solution of the so-called 'binary vs text files problem' (see below) can develop a data-tampering extension which will distort your files (what a horror!).
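To make the per-object idea concrete, here is a minimal sketch of hook dispatch. All names in it (set_read_hook, dispatch_read, os_read, zero_read) are hypothetical and invented for illustration; this is not the actual FSEXT interface of DJGPP or PW32, only the general shape of "the extension answers if one is installed, otherwise the OS does".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: NOT the real FSEXT API. */
#define MAX_FD 64

typedef long (*read_hook)(int fd, void *buf, size_t len);

/* One optional hook per descriptor; NULL means "no extension here". */
static read_hook hooks[MAX_FD];

/* Stand-in for the request the underlying OS would serve. */
long os_read(int fd, void *buf, size_t len)
{
    (void)fd;
    memset(buf, 'o', len);
    return (long)len;
}

/* Example extension: a /dev/zero-like object yielding only zero bytes. */
long zero_read(int fd, void *buf, size_t len)
{
    (void)fd;
    memset(buf, 0, len);
    return (long)len;
}

/* Attach (or, with NULL, detach) an extension for one descriptor. */
int set_read_hook(int fd, read_hook h)
{
    if (fd < 0 || fd >= MAX_FD)
        return -1;
    hooks[fd] = h;
    return 0;
}

/* POSIX-level entry point: the hook wins if present, else the OS serves. */
long dispatch_read(int fd, void *buf, size_t len)
{
    if (fd >= 0 && fd < MAX_FD && hooks[fd] != NULL)
        return hooks[fd](fd, buf, len);
    return os_read(fd, buf, len);
}
```

Descriptors without a hook pay only for one table lookup, which is the point of "those which don't need this array will work efficiently".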

File access levels and so-called 'text vs binary problem'

DJGPP offers 3 levels of file access, starting from the lowest:

Underscore-prefixed POSIX file access functions (e.g. _open, _read)
Filesystem extensions are implemented on this level. If none of them is active, the request is performed by the underlying OS. There's no concept of text/binary files on this level.
POSIX file access functions (e.g. open, read)
The addition compared with the lower level is the text/binary distortion stuff. Note that this stuff is an arbitrary extension to POSIX (reference: SUSV2).
ANSI stdio file access functions (e.g. fopen, fgetc)
The highest level, which exists on any ANSI C-compliant system. The most inefficient level: the default file accessors read/write information by single bytes. An ideal place to do data peek-poking. Explicitly, by ANSI C, it supports the notion of text and binary files. Moreover, ANSI C deprecates opening files without explicitly specifying the file type.

The picture above represents the common approach in the MS DOS/Windows world: POSIX-level functions take the arbitrary decision to alter data for files opened in 'text' mode in the following way: on reading, the byte sequence "\r\n" is replaced with "\n"; on writing, vice versa. Why such a transformation? The reason is that, in DOS/Windows, lines of text files are terminated with the sequence "\r\n", while in the rest of the world (with the exception of some enclaves having other local conventions) with "\n". Whether that statement is true and, if yes, what the reason for that is, is still argued by historians. For example, some argue that the claim above is not true: it's unnatural to associate line-ending conventions with operating systems; they should rather be linked with the applications which produce those files, and hence, they continue, there's no big infringement if there are other applications producing other line ends, especially if the latter line ends are those used in the rest of the world. Others think the claim is true; some of this group think it's the first precedent of the 'decommoditizing standards' technique later actively used by the vendor of the systems in question; while others argue that it's done right, since physical devices have separate notions of carriage return and line feed; still others point a finger at the sky.

One way or another, it's done that way. But that's only half the story: the most interesting part is that 'text mode' is applied by default to the 3 default POSIX file descriptors and to all files accessed via the ANSI stdio level. The result is overall corruption of data streams processed by applications on those systems. Even such a simple command as 'gzip -dc a.tar.gz | tar xv' is unable to run without errors. To remedy the situation, PW32 returns to its parental sources: there's no such notion as 'text/binary' on the POSIX level, and all data is read and written intact. Also, binary is the default mode for stdio access. Text stdio mode means converting "\r\n" to "\n" on reading, with no output conversion. OK, but does Win32 support such a way? Sure. When writing to the console in cooked (default) mode, "\n" is automagically prepended with CR. Things are not so fine with reading from the console: it returns "\r\n" as EOL. As of version 0.5.0, a builtin filesystem extension is applied to fd 0 if it is a terminal device on process startup.
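The read-side conversion of text stdio mode can be sketched in a few lines. The function name crlf_to_lf is made up for this illustration and is not taken from the PW32 sources; the sketch also assumes that a lone "\r" is passed through unchanged, which the text above does not specify.

```c
#include <stddef.h>

/* Collapse every "\r\n" pair in buf[0..len) to "\n", in place.
   Returns the new length. A lone '\r' is left untouched (assumption). */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        /* Drop a CR only when an LF immediately follows it. */
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;
        buf[out++] = buf[in];
    }
    return out;
}
```

A real stream implementation would also have to handle a "\r" that falls on the last byte of one buffer with its "\n" arriving in the next; this sketch works on a single complete buffer only. Nothing is done on the write side, matching "no output conversion".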

And at last, the three levels of file access with PW32. As you see, they differ from DJGPP's.

Underscore-prefixed POSIX file access functions (e.g. _open, _read)
These functions take the underlying OS's handles as arguments. This is useful for interfacing with other systems and with the OS itself.
POSIX file access functions (e.g. open, read)
The standard level for efficient file access. It's highly recommended to use stdio functions to access files containing textual information. Filesystem extensions are applied on this level.
ANSI stdio file access functions (e.g. fopen, fgetc)
This level should be used for accessing text files. In such a case, fopen should be explicitly used with the "t" flag. The default is binary mode.

Summing up, PW32 implements the file access technique of the reference implementation (which, as was explained, is Linux), plus some non-intrusive support for native conventions. This leads not only to data integrity, but also to undegraded performance.

File- and path- naming conventions

Note that the possibilities above are due to PW32 itself; a particular application may limit you to the standard-defined one only (as any Unix shell will).

File permissions

PW32 implements a simplified version of POSIX access control for files and directories. Specifically, the only supported permissions are those for the owner; the owner's permissions are propagated to the group's and others' ones. However, it works even on win9x (of course, on NT the full POSIX permission system can be implemented - not (yet?) done). The following mapping of POSIX permissions to standard win32 (or, to be precise, FAT) file attributes is used:

perm    attribute
r       not hidden
w       not readonly
x       for files: archive
        for directories: not archive

This mapping is not arbitrary: using the READONLY attribute as the negation of w is obvious. When a file doesn't have r permission, it's impossible to look into it, hence HIDDEN. The SYSTEM attribute is used to mark symlinks, so the only remaining one is ARCHIVE. But its usage for x is dictated not only by inevitability; it suits that need rather well. While the ARCHIVE bit was intended for a purpose of its own, it's not really used, and all typical native files have it set. So, PW32 won't have problems recognizing native applications, since all files are 'executable'. For directories, however, the picture is the opposite: they typically have the ARCHIVE bit reset. From here stems the distinction in mapping x for directories, so PW32 apps won't have problems searching in natively created dirs.
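The mapping can be written down as a small function. This is a sketch under assumptions: the function name attrs_to_mode and the ATTR_* macro names are made up here, while the bit values are those of the Win32 FILE_ATTRIBUTE_* constants.

```c
/* Attribute bit values as used by the Win32 FILE_ATTRIBUTE_* constants. */
#define ATTR_READONLY  0x01u
#define ATTR_HIDDEN    0x02u
#define ATTR_DIRECTORY 0x10u
#define ATTR_ARCHIVE   0x20u

/* Hypothetical helper: FAT attributes -> POSIX permission bits. */
unsigned attrs_to_mode(unsigned attrs)
{
    unsigned owner = 0;
    int is_dir  = (attrs & ATTR_DIRECTORY) != 0;
    int archive = (attrs & ATTR_ARCHIVE) != 0;

    if (!(attrs & ATTR_HIDDEN))
        owner |= 4;                       /* r: not hidden */
    if (!(attrs & ATTR_READONLY))
        owner |= 2;                       /* w: not readonly */
    if (is_dir ? !archive : archive)
        owner |= 1;                       /* x: archive (files), not archive (dirs) */

    /* Only owner permissions exist; copy them to group and others. */
    return (owner << 6) | (owner << 3) | owner;
}
```

With this mapping a typical native file (ARCHIVE set, nothing else) comes out as 0777, and so does a typical native directory (ARCHIVE reset).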

It should be stated that the whole permission system is implemented on top of the win32 API and may lead to noticeable overhead. For files it's not a big deal, since PW32 gets file attributes anyway to check for a symlink. However, to implement search permissions for directories, it is required to traverse the path and check the attributes of all directories on it. I never thought I would implement a permission system. But I noticed that many testsuites check how some apps behave when presented with non-accessible files. So, to make these testcases pass, I decided to implement perms for files - as I said, it doesn't pose much overhead. I tried that on the tar testsuite, just to find out that the testcase in question still fails because it expects directory perms to work too. Then, I just did it all. Well, regression tests passing is a nice thing, but introducing overhead for real work is bad. I benchmarked the performance of the GetFileAttributes() and CreateFile() Win32 API functions and got a sustained 100,000 and 140,000 Pentium ticks, respectively, for a 50-file directory under original win95/FAT. So, while it's milliseconds even for a P5-100, the overhead is really manifold. Fortunately, the solution came automagically: there's no need to check permissions for root. So, all that is needed is a way to specify the uid to run PW32 under. It is not yet done. Still, to implement directory symlinks, the path will need to be traversed anyway...
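Where the directory overhead comes from is easy to see in a sketch of the traversal: one attribute query per directory component. The names path_searchable, can_search and toy_can_search are hypothetical; can_search stands in for whatever attribute check the library performs, and '/' is assumed as the only separator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every directory on the way to `path`
   grants search (x) permission, 0 otherwise. The leaf itself is not checked. */
int path_searchable(const char *path, int (*can_search)(const char *dir))
{
    char prefix[260];
    size_t i, len = strlen(path);

    if (len >= sizeof prefix)
        return 0;
    for (i = 1; i < len; i++) {
        if (path[i] != '/')
            continue;
        /* Everything before this separator names a directory on the path. */
        memcpy(prefix, path, i);
        prefix[i] = '\0';
        if (!can_search(prefix))     /* one attribute lookup per component */
            return 0;
    }
    return 1;
}

/* Toy policy for demonstration: only "/a/locked" denies search. */
int toy_can_search(const char *dir)
{
    return strcmp(dir, "/a/locked") != 0;
}
```

A path with five directory components costs five lookups before the real operation even starts, which is why skipping the check altogether for root is attractive.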

Note that to preserve file permissions across archiving, you must use native PW32 tar. However, using win32 InfoZip (i.e. zip.exe & unzip.exe) with -S switch (for packing) will preserve r and w permissions; x will be set unconditionally.

Executables suffixes

You think PW32 is really orthodox and does everything to break "compatibility" with native conventions? No, it does not. One thing a POSIX implementation for Win32 has to deal with is the native convention of suffixing executable names with '.exe', while POSIX has no such stipulation. The most straightforward solution would be to subdue that Win32 idiosyncrasy, but I haven't dared to do that, for the following reasons:

So, if the suffixes are still there, how are they dealt with? Let's first survey how it is done in other implementations:

I was amazed by Mikey's solution, and decided to adopt it for the stat and access functions. However, studying the behaviour of install showed that this won't help much: yes, it stats the suffixless-named file successfully, but then it tries to open it and fails. So, consistent lookup for every filename-accessing function is required. Surprisingly, that can be done with zero additional overhead in PW32: it already looks up each filename to see whether it is a symlink, so what it does is as follows: it looks up the filename in the filesystem; if it exists, it does the usual symlink lookup; if not, it appends '.exe' to it and returns (without looking it up at this time), so the specific operation will be tried on the exe-ized name.
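The lookup rule can be sketched as follows. The names resolve_exe_name and toy_exists are hypothetical; the exists callback stands in for the filesystem lookup that is done anyway for symlinks, and the symlink handling itself is left out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the name the operation should be tried on
   into `out`. Returns 0 on success, -1 if `out` is too small. */
int resolve_exe_name(const char *name, int (*exists)(const char *),
                     char *out, size_t outsz)
{
    int n;

    if (exists(name))
        n = snprintf(out, outsz, "%s", name);      /* found: use as-is */
    else
        n = snprintf(out, outsz, "%s.exe", name);  /* not found: exe-ize, no second lookup */
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/* Toy lookup for demonstration: the directory holds only "README". */
int toy_exists(const char *path)
{
    return strcmp(path, "README") == 0;
}
```

Note that the exe-ized name is returned without checking that it exists; the caller's own operation (open, stat, unlink) reports the error if it does not.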

Of course, this may potentially lead to problems (scenario: you thought you had your file, 'file', somewhere, and you wanted to delete it. However, it was already gone. By another coincidence, 'file.exe' lay around. Trying to delete 'file' will kill 'file.exe'). The '.exe' lookup is a recent addition to PW32 and may be refined in the future.

Hard and symbolic links

PW32 implements symlinks compatible with CygWin. Hard links are aliased to symlinks. Currently, symlinks apply only to leaves of filesystem (misfeature). Hard link implementation for NTFS is welcome.

Inodes and other filesystem features having no direct Win32 counterparts

There's a static counter; each *stat() call returns the next successive number. It at least allows fileutils not to complain about circular dependencies in the filesystem. A better solution is welcome.

Memory handling

Arena

Yes, there's an arena reserved in the address space of the process, with brk attached to it. 128 MB by default. You may define unsigned int __arena_sz in your sources, set to the value you need.
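For example, a program that is happy with a smaller arena might contain a definition like the one below. The unit is an assumption: the text above does not say whether the value is in bytes, and bytes are assumed here.

```c
/* Request a 64 MB arena instead of the default (value assumed to be in bytes). */
unsigned int __arena_sz = 64u * 1024u * 1024u;
```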

Memory-mapped files

Code is written but not even tested once. (Fortunately, configure finds that mmap() does not really work ;-) )

Environment

One of the issues with Win32 is its idiosyncratic ';' pathname separator in the PATH environment string. PW32 deals with that bravely: if the environment contains PATH in Win/DOS format (a heuristic is used to determine whether it does), it's converted to POSIX format, so the application will see a decent POSIX environment, as it expects. On exec, conversion goes in the reverse direction, so if by chance the launched program is a native one, it won't be confused. If the ';'-separated path doesn't contain an entry for '.', one is prepended. This effectively means that the path cannot lack an entry for the current dir. This is a misfeature.
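The '.'-prepending step can be sketched on its own. The function names path_has_dot and path_with_dot are hypothetical, and the sketch assumes that only an entry spelled exactly "." counts as the current directory; the format heuristic and the conversion to POSIX form are not shown.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the ';'-separated list has an entry that is exactly ".". */
int path_has_dot(const char *path)
{
    const char *p = path;

    for (;;) {
        size_t n = strcspn(p, ";");   /* length of the current entry */
        if (n == 1 && p[0] == '.')
            return 1;
        if (p[n] == '\0')
            return 0;
        p += n + 1;                   /* skip the entry and its ';' */
    }
}

/* Returns a malloc'ed copy of `path`, with ".;" prepended if it has no
   "." entry. The caller frees the result. Returns NULL if out of memory. */
char *path_with_dot(const char *path)
{
    size_t skip = path_has_dot(path) ? 0 : 2;
    size_t len = strlen(path);
    char *out = malloc(skip + len + 1);

    if (out == NULL)
        return NULL;
    if (skip)
        memcpy(out, ".;", 2);
    memcpy(out + skip, path, len + 1);   /* copies the terminating NUL too */
    return out;
}
```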

Processes

pids

PW32's processes are first-class citizens in Win32 and vice versa (unlike CygWin, which sees only its own processes). However, PW32 doesn't always use pids as provided by the underlying system. That's because pids on Win9x systems are known to be negative integers, while in the Unix world pids are known to be small (not much bigger than 16 bits) positive numbers. (NT also conforms to this de facto convention.) Some process-related functions rely on that positivity and, in fact, treat negative pids in a special way (ref: waitpid()). What PW32 does on Win9x is just negate the number returned by the system and use that as the pid. So, a simple (even for a human) association with real system pids is maintained. Note, however, that the resulting numbers are bigger than the ones used on Unix. For example, I don't remember having seen on Linux a pid with more than 5 decimal digits, while with PW32 and Win9x a value around 900000 is where it starts. It's no wonder to get an 8-digit value. (Diving into dirty implementation details: the pid, as returned by Win9x, is a pointer to the internal process information block, which may reside within the range 0x80000000-0xbfffffff, xored with a fixed value ('obfuscated', in MS terms; ref: industrial anecdote, taking place with 'Inside' or 'Internals' of Petzold or other author). That fixed value is something like 0x7fxxxxxx, so in the worst case a pid as returned by PW32 may have 10 decimal digits.)
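The negation is a one-liner, sketched here with a hypothetical function name (pw32_pid) and a flag that stands in for however the library detects Win9x. The raw value is the 32-bit number handed out by the system.

```c
#include <stdint.h>

/* Hypothetical helper: map the system's raw process id to the pid
   presented to POSIX code. */
int32_t pw32_pid(uint32_t raw, int on_win9x)
{
    if (on_win9x) {
        /* The raw value is negative when viewed as a signed 32-bit number;
           negate it in unsigned arithmetic to avoid signed overflow. */
        return (int32_t)(0u - raw);
    }
    return (int32_t)raw;   /* NT: already a small positive number */
}
```

For instance, a raw Win9x value of 0xFFF1A2B3 becomes 941389, and 0xC0000000 becomes 1073741824, a 10-digit pid.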

So, the least unpleasant thing is that ps should be patched to use more width for the pid column. A far more unpleasant one is old software which uses shorts to store pids (ref: ash from Slackware 3.2).

ppids

As of version 0.5.0, getppid() is still a dummy returning 1. However, I know ways to get the ppid on both 9x and NT, so in the next versions it will be implemented correctly. The other problem is that many processes want to get the ppid in order to communicate with their parent via signals. There's a problem, however: since under win32 it's impossible to overlay the current process with a new image, a separate process is started for each exec(). This means there's an extra process in the fork-exec-child chain, and it should forward signals between the real fork parent and the exec'ed child. This is also on the TODO list.

Exit codes

Another issue with processes is their exit codes. The following is true for both Unix and Win/DOS: if the exit code is 0, the program has terminated successfully. But for other codes there's a distinction: native processes terminate with the exit code passed to exit() as-is. But Unix, by de facto standard, uses a 16-bit process exit code, the low byte of which contains the number of the signal by which the app was terminated, or zero otherwise, and the high byte the value passed to exit().

To smooth over this difference, special utilities are provided:

run-w32 cmdline
This will run cmdline (searching for the executable on the PATH) and convert the exit code to PW32's convention (by shifting it left by 8).
run-pw32 cmdline
This will run cmdline (searching for the executable on the PATH) and convert the exit code from PW32's convention to the native one (if there is no terminating signal, return the high byte; else, return 256+signo).
exec-w32
This, being renamed to <something>.exe, will try to execute the file <something>.w32 in the same directory where <something>.exe resides, with the rest of the args, and convert the exit code to PW32's.
exec-pw32
This, being renamed to <something>.exe, will try to execute the file <something>.pw32 in the same directory where <something>.exe resides, with the rest of the args, and convert the exit code to native.
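The two conversions performed by these utilities amount to a little bit arithmetic. The sketch below uses made-up function names (native_to_pw32, pw32_to_native) and is not taken from the utilities' sources; masking the native code to one byte before shifting is an assumption made here to keep the result within 16 bits.

```c
/* Direction of run-w32 / exec-w32: native exit code -> PW32 status.
   The value passed to exit() goes to the high byte; the low (signal)
   byte stays zero. Masking to 8 bits is an assumption of this sketch. */
unsigned native_to_pw32(unsigned native_code)
{
    return (native_code & 0xFFu) << 8;
}

/* Direction of run-pw32 / exec-pw32: PW32 status -> native exit code. */
unsigned pw32_to_native(unsigned status)
{
    unsigned signo = status & 0xFFu;     /* low byte: terminating signal */

    if (signo == 0)
        return (status >> 8) & 0xFFu;    /* normal exit: the high byte */
    return 256u + signo;                 /* killed by a signal */
}
```

So a native program exiting with 3 is seen by PW32 as status 0x300, and a PW32 program killed by signal 9 is reported to native callers as 265.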

Intra- and inter-process communication

Signals

There exists such a notion as signals. They can be received across processes. They are implemented in a way that is sensible to native processes (besides the ability to SIGKILL any process, GUI processes can be closed gracefully with SIGTERM). However, currently Win32 exceptions and events are not mapped to signals.

System V IPC

Currently not supported

Unix domain sockets

Currently not supported

Networking

Currently not supported

Dynamic libraries/loading

Overall status

Dynamic libraries are implemented via Win32 DLLs. It is well known that the Win32 DLL model has a number of idiosyncrasies which render it, at first sight, largely incompatible and underfunctional with respect to the standard *nix shared library model. However, investigation and carefully worked out techniques allow DLLs to be used in ways very similar to the usage of shared libraries.

Dynamic libraries search path

Due to the strange design of DLLs, they are searched for in the same directories where executables are, i.e. on $PATH, while standard Unices have a separate environment variable for that.

Dynamic loading - dl*() family

Currently not supported

Locale

I contributed basic CTYPE locale support to djgpp, it should be just thrown in to PW32.

Misc

There's an itimer implementation. Dunno whether it works.

Security

Security? Sorry, if you need security, you should really get a Real OS. However, the following may be said:

Summary of design decisions/devices employed to be compatible with / support native features

Summary of design features which lead to incompatibility with native way of doing and/or thinking

Summary of not implemented/omitted POSIX/*nix features and differences in implementation

Not implemented

Differences


Paul Sokolovsky