Thursday, July 30, 2015

The Collapse of the Pass By Value versus Pass by Reference Distinction

In the beginning, there were two fundamental ways to pass information
between parts of a program, variously called functions, procedures, routines,
and subroutines.

1) By Value: The subroutine was told the value of a variable, but the caller
kept the location where it was stored to itself.

2) By Reference: The calling routine told the subroutine where to find the
value.

Even then, arrays were always passed by reference. That is, the memory
location where the array's first element was stored was handed off to the
subroutine. Combined with information about the type of value that was
stored in the array and the number of elements it contained (the latter
usually passed separately, by value, except in BASIC), the subroutine could
extract individual elements from it. This worked well enough for BASIC, and
C even pretended to play along, by passing pointers to structures and
objects by value. The subroutine could mess with the data, but at the very
least it couldn't change the caller's pointer to it.

This approach worked pretty well when a good percentage of everyday programs
were written in some dialect of BASIC, leaving the serious work, including
most of the plumbing, to be written in C or C++ by "real programmers," who
were assumed to know what they were doing. Mind you, it isn't entirely
their fault that a lot of buggy C and C++ code made it into production.
Pressure from upper management to get things done faster, with fewer people,
played a significant role, too, but I digress.

In 2002 came the Microsoft .NET Framework, which was supposed to put an end
to all of that with its managed heap and garbage collection. With the .NET
Framework came something else that has received little coverage: the
distinction between pass by value and pass by reference has pretty much
collapsed into a heap of rubble. But nobody noticed.

Before I go further, allow me to illustrate with an example from some of my
own working C# code.

private static void RecordInitialStateInLog (
    Dictionary<StreamID , StreamStateInfo> pdctStreamStates ,
    StateManager psmTheApp ,
    string pstrBOJMessage )

The code snippet shown above is the signature of a subroutine,
RecordInitialStateInLog, that takes three arguments.

1) pdctStreamStates is a Dictionary (associative array) of StreamStateInfo
objects, indexed by StreamID, an enumerated type.

2) psmTheApp is a reference to a StateManager, a Singleton object that
exposes properties and methods to support operations commonly required of
character mode applications.

3) pstrBOJMessage is a garden variety string, if you can say there is such a
thing in a .NET application.

The first and simplest of the two complex objects, pdctStreamStates,
exposes the data that the subroutine needs through its Keys and Values
properties, both of which can be enumerated. A Count property tells us right
off how many items we can expect to find in each of those collections.

Likewise, the psmTheApp argument exposes its data through properties. Among
them are AppErrorMessages, an array of strings that we can read but cannot
change; AppExceptionLogger, an instance of
WizardWrx.DLLServices2.ExceptionLogger that we can use to report and log
exceptions; and AppReturnCode, a read/write integer that holds the exit code
returned by the program when one of several methods on the StateManager is
called.

Nowhere in the method signature does the word "reference" appear, or any
similar word. Neither does the phrase "By Value" appear anywhere.

Watch what happens when the main program calls the routine. Below are the
machine instructions that make it happen.

RecordInitialStateInLog (
    dctStreamStates ,
    s_smTheApp ,
    strBOJMessage );
0047400E push dword ptr [ebp-44h]
00474011 mov edx,dword ptr ds:[3567254h]
00474017 mov ecx,dword ptr [ebp-48h]
0047401A call 0040C6B0

The first instruction is as follows.

push dword ptr [ebp-44h]

This instruction pushes onto the stack the machine address where string
strBOJMessage, a counted Unicode string, is stored.

The next instruction puts a pointer to the state manager into CPU register
EDX, where the compiled subroutine expects to find its second argument.

mov edx,dword ptr ds:[3567254h]

The third step in the setup of the call stores the location of the
Dictionary object, dctStreamStates, in another machine register, ECX.

mov ecx,dword ptr [ebp-48h]

Finally, the subroutine is called.

call 0040C6B0

Without delving further into the machine code and getting off topic, what
just happened here? The main routine just told a subroutine,
RecordInitialStateInLog, where, within its memory, it can find three bits of
information that the subroutine needs to do its work. Armed with this
information, the subroutine can do anything with those objects that each
permits. That last phrase is significant; I shall return to it shortly.

1) Starting with the string, about all it can do with strBOJMessage is take
its length, copy some or all of it into a new string, and convert it to all
upper or lower case characters, also yielding a new string.

2) Dictionary dctStreamStates allows its values to be enumerated and copied.
Since the dictionary isn't marked as read only (which the called routine can
ascertain by evaluating its IsReadOnly property), the subroutine can even
append items to it, replace existing items, and delete items.

3) Finally, StateManager psmTheApp is a mixed bag; some of its properties
(e. g., AppReturnCode) can be changed to inform the main routine that the
program should report an error when it ends, while AppErrorMessages is a
read only array of strings, and AppRootAssemblyFileDirName is a read only
string.
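The three cases above can be sketched in a few lines of C#. This is a hypothetical stand-in, not the actual RecordInitialStateInLog routine: the Inspect method, the string keys, and the sample values are all assumptions made for illustration. One detail worth knowing is that Dictionary implements IsReadOnly explicitly, so the sketch reaches it through the ICollection<T> interface.

```csharp
using System;
using System.Collections.Generic;

static class ArgumentDemo
{
    // Hypothetical subroutine: does to its arguments whatever they permit.
    public static string Inspect(Dictionary<string, int> pdctStates, string pstrMessage)
    {
        // Dictionary implements IsReadOnly explicitly, hence the cast.
        bool fReadOnly = ((ICollection<KeyValuePair<string, int>>)pdctStates).IsReadOnly;

        if (!fReadOnly)
        {
            pdctStates["StdErr"] = 2;     // append a new item
            pdctStates["StdOut"] = 99;    // replace an existing item
            pdctStates.Remove("StdIn");   // delete an item
        }

        // A string offers only read operations; each call yields a new string.
        return pstrMessage.Substring(0, 5).ToUpper() + ":" + pstrMessage.Length;
    }

    public static void Main()
    {
        var dctStates = new Dictionary<string, int> { { "StdIn", 0 }, { "StdOut", 1 } };
        string strBOJMessage = "Begin job";

        string strResult = Inspect(dctStates, strBOJMessage);

        Console.WriteLine(strResult);             // BEGIN:9
        Console.WriteLine(dctStates.Count);       // 2
        Console.WriteLine(dctStates["StdOut"]);   // 99
        Console.WriteLine(strBOJMessage);         // Begin job
    }
}
```

When Inspect returns, the caller's dictionary reflects every change, while the caller's string is exactly as it was.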

You have probably realized by now that StateManager is a custom class, and
its design determines what users are allowed to do with its properties.
Which properties are read/write and which are read only are the result of
deliberate decisions about which properties a consuming assembly should be
allowed to change, and which should be protected against changes. The
AppReturnCode is fair game for the application to change at will, but you
wouldn't want the application to be able to change the text of the error
messages or the name of the program directory.
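A minimal sketch of how such a class might be laid out follows. It is not the WizardWrx StateManager, whose source is not shown here; the class name StateManagerSketch, the GetTheSingleInstance method, and the sample messages are assumptions. The sketch wraps the messages in a ReadOnlyCollection because a property that returns a bare array, even through a getter alone, would still let callers overwrite individual elements.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical sketch of a Singleton with mixed read only and read/write properties.
sealed class StateManagerSketch
{
    private static readonly StateManagerSketch s_instance = new StateManagerSketch();

    private readonly ReadOnlyCollection<string> _errorMessages;
    private readonly string _rootDirName;

    private StateManagerSketch()
    {
        _errorMessages = Array.AsReadOnly(new[] { "Success", "Run-time exception" });
        _rootDirName = AppDomain.CurrentDomain.BaseDirectory;
    }

    public static StateManagerSketch GetTheSingleInstance() { return s_instance; }

    // Read only: consumers can enumerate the messages, but cannot replace them.
    public ReadOnlyCollection<string> AppErrorMessages { get { return _errorMessages; } }

    // Read only: there is no setter, so the directory name is protected.
    public string AppRootAssemblyFileDirName { get { return _rootDirName; } }

    // Read/write: fair game for the application to change at will.
    public int AppReturnCode { get; set; }
}

static class StateManagerDemo
{
    // Any routine that holds the reference can change the read/write property.
    public static void ReportFailure(StateManagerSketch psmTheApp)
    {
        psmTheApp.AppReturnCode = 1;
    }

    public static void Main()
    {
        StateManagerSketch smTheApp = StateManagerSketch.GetTheSingleInstance();
        ReportFailure(smTheApp);
        Console.WriteLine(smTheApp.AppReturnCode);                               // 1
        Console.WriteLine(smTheApp.AppErrorMessages[smTheApp.AppReturnCode]);    // Run-time exception
    }
}
```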

Elsewhere in the code, an exception handler sets the error code to
MagicNumbers.ERROR_RUNTIME (+1), a nonzero value, to signal that a run-time
exception has been caught and reported, causing the task to fail.

The final statement executed by the main program makes a decision based on
the value of the s_smTheApp.AppReturnCode property.

Environment.Exit (
    s_smTheApp.AppReturnCode > MagicNumbers.ERROR_SUCCESS
        ? s_smTheApp.AppReturnCode
        : MagicNumbers.ERROR_SUCCESS );

Astute observers will notice that this could be simplified by eliminating
the decision, and passing the value of the AppReturnCode property straight
into the Environment.Exit routine. The reason that I didn't do so is that
this example came from a work in progress, and the final version will
substitute a different routine that makes better use of that decision to set
one of its arguments. I wrote it this way to remind me to make the
replacement in the final version.

There are two noteworthy things about this example.

First, the subroutine got a reference to each of its three arguments. Within
the limits imposed by the objects themselves, it can plunder their
properties more or less at will.

The second consequence follows from the first: when the caller regains
control, some or all of the properties may have been changed.

The exception is the string, strBOJMessage, which is immutable, meaning that
the subroutine cannot alter its characters. Assigning a new value to it
within the subroutine points only the subroutine's own copy of the reference
at a brand new string, leaving the caller's original intact. Conversely,
when the new value is assigned in the main routine, the caller's variable is
pointed at the new string, and the old one is lost. Strings are always the
odd duck on the pond.
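The odd duck can be watched in action with a short sketch. The method names Reassign and ReassignByRef are made up for this illustration; the second one uses the C# ref keyword, which is the explicit way to let a subroutine rebind the caller's own variable.

```csharp
using System;

static class StringDemo
{
    // Rebinds only the subroutine's copy of the reference.
    public static void Reassign(string pstrMessage)
    {
        pstrMessage = pstrMessage.ToUpper();
    }

    // With ref, the caller's own variable is rebound to the new string.
    public static void ReassignByRef(ref string pstrMessage)
    {
        pstrMessage = pstrMessage.ToUpper();
    }

    public static void Main()
    {
        string strBOJMessage = "begin job";

        Reassign(strBOJMessage);
        Console.WriteLine(strBOJMessage);   // begin job

        ReassignByRef(ref strBOJMessage);
        Console.WriteLine(strBOJMessage);   // BEGIN JOB
    }
}
```

In neither case is the original string object modified; the only question is which variable ends up pointing at the new one.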

The .NET Framework specification really muddies things, especially if you
are accustomed to thinking of structures the way C and C++ programmers use
the term. According to "Value Types (C# Reference)," at
https://msdn.microsoft.com/en-us/library/s1ax56ch.aspx, a Value Type is
either an Enumeration or a Struct. Wait a minute, you say, I can understand
how an enumerated type can be a value type, because it boils down to an
integer, but how can a Struct be a value type? I thought the other value
types were the simple numeric types (integer, long, float, double). A closer
look reveals that the intrinsic value types do, indeed, include those four,
along with the other simple types, such as bool, char, and decimal, plus
enumerations. Call it Microsoft Magic; every one of the simple types is
classified as a Struct!
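The classification is easy to confirm with reflection. The following is a small probe, not anything taken from the documentation cited above; the StreamKind enumeration is a hypothetical stand-in for any enumerated type.

```csharp
using System;

static class ValueTypeProbe
{
    enum StreamKind { Input, Output }   // hypothetical enumeration

    public static bool[] Probe()
    {
        return new[]
        {
            typeof(int) == typeof(System.Int32),   // int is an alias for the Int32 struct
            typeof(int).IsValueType,               // a simple numeric type
            typeof(double).IsValueType,            // another simple numeric type
            typeof(StreamKind).IsValueType,        // an enumeration
            typeof(string).IsValueType             // the string is not a value type
        };
    }

    public static void Main()
    {
        foreach (bool fAnswer in Probe())
            Console.WriteLine(fAnswer);            // True, True, True, True, False
    }
}
```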

I have a theory about why this is so, but my proof is limited to
semi-scientific observation of the machine code that implements my .NET code
running in the Visual Studio debugger. If my guess is correct, though,
Microsoft has successfully future-proofed all four basic numeric types by
making their internal representation opaque, for which there is ample
precedent. For example, a C implementation of almost any major encryption
algorithm is made more portable by specifying variables that represent
integers of a specific bit width (usually 32 or 64), as a typedef, and a
Win32 HANDLE is an opaque, pointer sized value whose meaning is known only
to the system.

There is nothing in the definition of struct that requires it to contain two
or more members. Obviously, to be useful, it needs one member, but that's
all it really needs. Hence, the following structure is legal.

typedef struct _int32 {
    int Value ;
} int32 ;

Since the machine address of the first member of a structure and the address
of the structure itself are the same, defining a value type as a structure
hides the implementation details without affecting user code. If I pass my
int32 structure to a routine that knows how to handle such a thing, it can
find the other members, if any, without anything from me beyond the address
of the first member. Concrete examples of this abound, even in the Win32
API. For example, the LARGE_INTEGER type represents a long (64 bit) integer
as a structure that stores the lower and upper 32 bits in machine word sized
chunks. Floating point numbers are similar in spirit; each stores a mantissa
and an exponent, which are passed around and mostly treated as a unit, until
it comes time to format the number for printing or use it in a mathematical
operation. Those chores fall to routines in system libraries that you can
safely treat as black boxes, whether your code is written in C#, VB.NET,
C++, or something else.

But why 32 bit integers, too? What happens when the processor architecture
is 64 bits? Your 32 bit integer occupies only half a machine register, a
detail that matters only at the very lowest levels of the code, in the
native code generated behind the scenes by the just-in-time (JIT) compiler,
or ahead of time by NGEN, the Native Image Generator. Since it's a
structure, the 64 bit runtime just handles it, transparently, and you carry
on. That's why you see System.Int32 in the Locals window of your debugger,
and in the argument lists displayed in a stack trace. This simple device
abstracts away the hardware dependency. Intermediate Language sees only
System.Int32, and the native code generator knows exactly what to do with
it, whether your CPU architecture is 32 bits, 64 bits, 128 bits, or more.
Your code just works, without any changes.
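A quick way to see the abstraction at work is to compare the size of an int with the size of a native pointer. This is a rough probe, with a class name and helper invented for the purpose; build it once as a 32 bit program and once as a 64 bit program, and only the pointer size should differ.

```csharp
using System;

static class WidthProbe
{
    // True when int is four bytes and the process is either 32 or 64 bit.
    public static bool IntIsAlwaysFourBytes()
    {
        return sizeof(int) == 4 && (IntPtr.Size == 4 || IntPtr.Size == 8);
    }

    public static void Main()
    {
        Console.WriteLine(typeof(int).FullName);   // System.Int32
        Console.WriteLine(sizeof(int));            // 4, regardless of architecture
        Console.WriteLine(sizeof(long));           // 8, regardless of architecture
        Console.WriteLine(IntPtr.Size);            // 4 in a 32 bit process, 8 in a 64 bit process
    }
}
```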

Even this is not entirely new; for years, we have had 16 bit integers, known
by various names (WORD, Short, and so on) that were treated in much the same
way by 32 bit hardware, in which a 16 bit integer occupies only half of the
32 bit register. Indeed, present day assemblers still recognize 16 bit
registers AX, BX, CX, DX, DI, and SI, which correspond to the lower halves
of 32 bit registers EAX, EBX, ECX, EDX, EDI, and ESI.

As an aside, code that manipulates ANSI characters uses the subdivisions of
the original 16 bit registers, AH, AL, BH, BL, CH, CL, DH, and DL, all of
which behave like 8 bit registers.

WHAT PRACTICAL USE IS THIS?

The topic "Main Features of Value Types" says "Variables that are based on
value types directly contain values. Assigning one value type variable to
another copies the contained value." Ignore the first sentence; it's the one
that stirs up the mud. Treat value types AS IF they directly contain values,
because, in truth, a structure cannot directly contain a value. Only a
structure member can do that. The point is that value types are
sufficiently small that making a copy is computationally cheap, whereas
copying a reference type is neither computationally cheap, nor good
engineering, since it defeats the purpose of defining the object in the
first place.

The preceding statement yields the following practical distinctions.

1) Unless a parameter is explicitly marked with the ref or out keyword,
value types are effectively passed by value, period. With respect to value
types, what changes in the subroutine stays in the subroutine.

2) Reference types are effectively passed by reference, period. With respect
to reference types, there are no secrets. The calling routine sees all
changes made to the properties of the reference types that it passed into
the subroutine.
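Both distinctions can be demonstrated side by side. In the sketch below, the CounterValue and CounterRef types and the method names are hypothetical. It also shows two edge cases: replacing a reference type argument outright inside the subroutine does not affect the caller, and the ref keyword is the way to let a subroutine change the caller's value type.

```csharp
using System;

struct CounterValue { public int Count; }   // value type
class CounterRef { public int Count; }      // reference type

static class PassingDemo
{
    public static void Bump(CounterValue c) { c.Count++; }   // changes a copy
    public static void Bump(CounterRef c) { c.Count++; }     // changes the caller's object

    // Rebinds the local parameter only; the caller still holds the old object.
    public static void Replace(CounterRef c) { c = new CounterRef { Count = -1 }; }

    // ref hands over the caller's own variable.
    public static void BumpByRef(ref CounterValue c) { c.Count++; }

    public static int[] Run()
    {
        var v = new CounterValue { Count = 1 };
        var r = new CounterRef { Count = 1 };

        Bump(v);
        Bump(r);
        Replace(r);
        int afterByValue = v.Count;   // still 1

        BumpByRef(ref v);             // now 2

        return new[] { afterByValue, r.Count, v.Count };
    }

    public static void Main()
    {
        foreach (int n in Run())
            Console.WriteLine(n);     // 1, 2, 2
    }
}
```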

Value types represent a very small subset of the objects that inhabit a
typical .NET assembly. Enumerations, integers, floating point numbers,
decimal numbers, a relatively small number of system types (e. g.,
System.DateTime, System.TimeSpan, System.Guid, and a few others) and user
defined structures are value types. Everything else is a reference type,
including the string, which merely behaves like a value because it is
immutable.

The second distinction, that reference types are effectively passed by
reference, has significant consequences for both robustness and security of
applications.

You have heard it said that one of the tenets of good object oriented design
is data hiding. The example above should make abundantly clear why this is
important; the read/write properties of any object that is visible to a
routine can be changed by it. From this precept, I draw two rules.

1) Unless consumers of an object _must_ be able to change the value of a
property, make it read only. If the value must be updateable, consider a
method, instead of a write property. Methods offer two advantages over
properties: they can take arguments that supply additional information that
the method can use to decide whether to allow the update, and it is
considered acceptable for a method to fail by returning a distinct status
code or raising an exception.

2) Unless a routine needs access to most or all of the properties of an
object, consider passing in only the properties that it needs as individual
arguments. This also decouples the routine from the object.

In security terms, the preceding two rules implement the Need To Know
principle. In addition to making the routine more secure by reducing its
attack surface, they reduce the risk of unintended changes to object
properties that may not surface until the application is in production.
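Here is one way the two rules might look in practice. Everything in this sketch is hypothetical: the JobStatus class, the TrySetReturnCode method, and the validation rule it enforces (a code may rise but never fall, and every change needs a reason) are assumptions chosen only to show the shape of the idea.

```csharp
using System;

sealed class JobStatus
{
    // Rule 1: consumers can read the code, but cannot assign it directly.
    public int ReturnCode { get; private set; }

    // The method takes extra information and may refuse the update.
    public bool TrySetReturnCode(int code, string reason)
    {
        if (code < ReturnCode || string.IsNullOrEmpty(reason))
            return false;             // the update is rejected

        ReturnCode = code;
        return true;
    }
}

static class NeedToKnowDemo
{
    // Rule 2: the routine receives only the values it needs, not the object.
    public static string FormatExitLine(string programName, int returnCode)
    {
        return string.Format("{0} exited with code {1}", programName, returnCode);
    }

    public static void Main()
    {
        var status = new JobStatus();

        Console.WriteLine(status.TrySetReturnCode(1, "Run-time exception"));   // True
        Console.WriteLine(status.TrySetReturnCode(0, "Cover it up"));          // False
        Console.WriteLine(FormatExitLine("MyApp", status.ReturnCode));         // MyApp exited with code 1
    }
}
```

Because FormatExitLine never sees the JobStatus object, it cannot change the return code even by accident.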

HOW DOES THIS AFFECT THE OVERALL DESIGN?

Rigorous application of the Need To Know principle is mostly old school
design. The main routine makes a few key decisions, and calls one or more
subroutines that do the real work. Each of those subroutines makes a few
more decisions, and calls more specialized subroutines to perform the
required tasks. This process continues until no more decisions remain to be
made, and the routines that make up the leaves of the program's process
flow are pretty much drop-through routines that perform a series of actions,
with few, if any, decisions. When objects are brought into these routines
as, and only when, needed, every routine conforms to the Need To Know
principle, and the overall attack surface is minimized by design.
