Wednesday 4 December 2013

Missing HOMEDRIVE and HOMEPATH Variables Under Jenkins

TL;DR - If you’re looking for the reason why the variables are AWOL then I can’t help, but if you want to know what I tried and how I worked around my specific problem, then read on…

Back when we first set up Jenkins I soon noticed that all the files in the workspace had LF style line endings instead of CR/LF as you would expect on Windows; this was despite following the advice in the often-cited GitHub page to set things up correctly. I’d managed to live with it to date but now we were suddenly involved in a tactical project and were deploying to staging & production where trying to edit application .config files on servers that only had Notepad was like pulling teeth. I really needed to sort it out once and for all.

1st Attempt

The last time I tried to fix this I logged into the Jenkins build server, fired up a command prompt (as the Jenkins user account using RUNAS) and then executed “git config --global core.autocrlf true”. For good measure I also wiped the Jenkins workspace to ensure that I’d get a freshly cloned repo when the build job next ran. It didn’t work and I’d already put a few hours into the problem so had to put my spade down and go do some work instead.

2nd Attempt

Coming to the problem once again all fresh and reinvigorated some months later I surmised that the problem must be down to the Jenkins build account not picking up the --global scoped setting. After a quick recap on the differences between --system, --global and --local and where the files were stored, my next guess was that the account might be being used without a profile. So I checked again, this time using the /noprofile switch with RUNAS:-

C:\> runas /noprofile /user:DOMAIN\BuildAccount cmd.exe
<new command prompt>
E:\Jenkins> git config --global core.autocrlf
true

The setting was exactly as I had left it last time. The next stop was to inject an extra build step into the Jenkins job and see what it thought was going on. For good measure I dumped all the different scoped values to see what exactly was set and where:-

E:\Jenkins> "C:\Program Files (x86)\git\bin\git.exe" config --local core.autocrlf
(blank)
E:\Jenkins> "C:\Program Files (x86)\git\bin\git.exe" config --global core.autocrlf
(blank)
E:\Jenkins> "C:\Program Files (x86)\git\bin\git.exe" config --system core.autocrlf
false

Suspicion confirmed - the Jenkins job doesn’t see the setting. But why?

I started reading a little more closely about where the .gitconfig file would be picked up from and used “git config --global --edit” to see what path the editor [1] had loaded the file from. Sure enough, from my command prompt it loaded the correct file, although the HOMEDRIVE was set to C: and HOMEPATH was set to \Windows\system32 which seemed a little odd. The USERPROFILE variable on the other hand pointed correctly to the profile folder, not that it’s used by Git, but it was a useful check.

So I decided to just go ahead and set the autocrlf setting via the Jenkins job, hoping that at least it would stick even if I didn’t know at that moment where the setting would end up. To my surprise I got the following weird error:-

E:\Jenkins> "c:\Program Files (x86)\Git\bin\git.exe" config --global core.autocrlf true
error: could not lock config file (null)/(null)/.gitconfig: No such file or directory

I didn’t twig at first what the error in the path was telling me so naturally I Googled it. I got a hit that was related to Jenkins (JENKINS-19249) and was quite excited. When I read the issue I found that it was a similar problem, superficially, but there were no comments; not even so much as a “me too!” to at least show me I wasn’t the only one. I did a bit more Googling about how the path to the --global .gitconfig file is derived and it dawned on me what might be happening, so I dumped out all the environment variables the Jenkins job sees with a call to SET.

Guess what - the HOMEDRIVE and HOMEPATH variables are not set. The way Git forms the path is with %HOMEDRIVE%/%HOMEPATH%. In C/C++ if you printf(“%s”, NULL); by accident you’ll often see the value “(null)” instead of it crashing [2] - hence there is one “(null)” for the HOMEDRIVE and another “(null)” for the HOMEPATH. For what it’s worth the USERPROFILE variable was still set correctly.

Solving My Problem

Ultimately I just wanted the line endings to be correct and so I took the rather heavy-handed approach and used “git config --system core.autocrlf true” from an elevated command prompt, as that meant altering the gitconfig file in the C:\Program Files (x86)\git\etc folder. Actually, I forgot to use an elevated command prompt the first time around and the file-system redirection added back in Vista kicked in and wrote to a per-user virtualised version of the gitconfig file instead, doh!

I don’t particularly like this solution but I’m modifying the build server which is a carefully controlled environment anyway and it’s highly unlikely to be used for mixing-and-matching Git repos with different line endings.

But Why?

As I mentioned right back at the very start - I don’t know why the variables aren’t set. At first I assumed it might have been down to the way the user profile might have been loaded as Jenkins runs as a service, but I couldn’t find anything in the MSDN Knowledge Base that suggested why this might happen.

Failing that it might be a Jenkins issue. It’s a CI server and so perhaps this is done on purpose to ensure builds are user agnostic or something. If that was the case though I’d expect there to be a very popular StackOverflow post asking why their variables have been silently nobbled, but I didn’t stumble across one.

As always, once you know the answer, Googling for it is so much easier. As a Jenkins newbie too I’ll probably find out what’s really going on as I get into it more.

 

[1] Watch out for this as it’ll use VI as the editor by default. Luckily the one key sequence I can still remember after 25 years is how to exit it!

[2] I’m guessing this behaviour isn’t defined by the C/C++ standard but is a common implementation choice?

Tuesday 26 November 2013

if errorlevel 1 vs if %errorlevel% neq 0

The problem with having used batch files since the days of DOS is that certain constructs are just way too embedded in your brain. For example the way I check whether a command in a batch file has failed is by using the special “errorlevel” version of the “if” statement:-

ExternalProgram.exe
if errorlevel 1 exit /b 1

This style of “if” says that if the exit code from the last program run was greater than or equal to 1 the script should terminate, with an exit code of 1. The general convention for exit codes is that 0 means success and anything else just means “something else happened” [1] and it’s entirely dependent on the program. For example the ROBOCOPY utility has a variety of different codes that may or may not be an error depending on what you asked it to do. The MSI installer (MSIEXEC.EXE) is another program that has a few exit codes that you soon get to know if you’re trying to script a deployment process, e.g.

msiexec /x <GUID>
if %errorlevel% equ 1605 goto :not_installed

This form of “if” statement (with a variable called errorlevel) is the newer form that was introduced in Windows NT (I think) and it allows you to do an equality comparison with a single exit code, which was less than intuitive before. This form is also required when you have anti-social processes that return negative exit codes [2]. In fact the earlier form should probably be considered defunct (if only my muscle memory would let go) and the newer form used by default instead:-

ExternalProgram.exe
if %errorlevel% neq 0 exit /b %errorlevel%

If you can’t remember what the operators are use “C:\> HELP IF” to list them [3].

[1] C & C++ programmers will of course already be used to using the macros EXIT_SUCCESS and EXIT_FAILURE from <stdlib.h>. I don’t think .Net has an equivalent and so I often end up creating a class called ExitCode with the same two constants.
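
For what it’s worth, the kind of thing I have in mind is barely worth showing, but here it is anyway (just a sketch; the naming mirrors the C macros rather than anything official in .Net):-

public static class ExitCode
{
  // Named after the <stdlib.h> macros; these constants are my own
  // convention, not something provided by the .Net framework.
  public const int Success = 0;
  public const int Failure = 1;
}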

[2] Yes SCHTASKS (the command line tool for managing scheduled tasks) I’m looking at you. The .Net runtime can also chuck out negative exit codes if something really serious goes wrong, e.g. the process fails to start with an out of memory or assembly loading problem.

[3] They’re different from PowerShell too which is a shame.

Logging Stack Traces Should be Unnecessary

I like a nice clean log file. The left hand margin should be a fixed width and easy on the eye so that my built-in pattern recognition gets to work as effectively as possible. It should also be easy to parse with the age old UNIX command line tools (and LogParser) without me having to pre-process the file first to get shot of the noise.

What really messes with this is when I see stack traces in the log file. They are always huge and contain far too much duplication because modern software design principles suggest we write short, simple methods, with overloads and reuse by forwarding calls instead of cutting-and-pasting functionality:-

. . .
void DoSomething(string[], int, int)
void DoSomething(string[])
void DoSomething(IEnumerable<string>)
. . .

So, seriously, has that level of detail ever enabled you to solve a problem? Without knowing what the parameter values are how much do stack traces even tell you? Agreed, if all you’ve got is a crash dump then a stack trace is invaluable, but I’m talking about logging stack traces which by definition means that you’re already writing other diagnostic information too.

Design Smell?

I’ve always favoured keeping stack traces out of log files on the basis that they are of little value in comparison to other techniques, and so far I’ve found that I don’t miss them. In my experience, if the design of the code is right and the error message (e.g. exception message) is well written it should be fairly easy to reason about where in the code the problem is, which is effectively what a stack trace tells you. In short that means a simple GREP on the source code to find where the message is generated.

You might argue that a stack trace tells you that up front so why make more effort than necessary, which is of course true, but you’re also going to need the context, which a stack trace will not tell you unless it logs all its parameter values too. And for that we need to visit the log file, and if we’re going to do that how much time and effort are we really saving at the cost of extra background noise? More importantly this is the moment when the quality of our log message entries will begin to shine or we find ourselves lost in the trees looking for the wood. Hopefully during development you’ve already been dog-fooding your own logs to get a feel for how well you can solve real support problems when using them.

Test Infrastructure

The second part of the puzzle of avoiding needing all this gratuitous text is an ability to reproduce the problem easily within a debugger. Hopefully from the context you should be able to explore the problem in isolation - varying different inputs to see how the code is reacting. If the design is simple you should easily be able to step through an existing test case and see where the points of trouble might be, e.g. some missing or dodgy error handling.

At this stage, while the overarching goal is to fix the problem at hand, the fact that a bug has crept in means that the development process has failed to some degree and therefore I’d be taking this as an opportunity to compensate by doing a review. It’s likely I won’t action anything there-and-then, instead favouring making some notes so that any eventual action can be triaged and prioritised separately.

Depending on the complexity of the system, this is the point at which I might rely on any custom tooling I’ve built to help isolate certain aspects of the system so that they can be exercised in a tightly controlled and deterministic environment, e.g. console app test harness that hosts the various services in-process.

Minimal Traces

What I despise most about many stack traces I see in forum posts is the sheer volume of noise. There is clearly some value in them, more so for “generic” unhandled exceptions like a NullReferenceException that have no message, but do they have to extend to reams and reams of text? When I log an exception I write the entire exception chain with both the type and the message; all on a single line. This is done using an extension method for the Exception type in C#. The same could easily be done for the stack trace, the only reason I haven’t done it is because I haven’t needed to, yet. But if I did write one what would it do?
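
For illustration, the extension method I have in mind amounts to little more than this (a sketch rather than the exact code; the separator and formatting are just personal preference):-

using System;
using System.Text;

public static class ExceptionExtensions
{
  // Walks the InnerException chain and renders each "{Type} message" pair
  // on a single line; the exact formatting here is illustrative only.
  public static string ToSingleLine(this Exception error)
  {
    var buffer = new StringBuilder();

    for (var e = error; e != null; e = e.InnerException)
    {
      if (buffer.Length != 0)
        buffer.Append(" --> ");

      buffer.AppendFormat("{{{0}}} {1}", e.GetType().Name, e.Message);
    }

    return buffer.ToString();
  }
}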

Firstly I’d strip away all the arguments as they are fairly meaningless without their values. I’d also collapse all overloads into a single method as forwarding calls are uninteresting too. The bottom part of any stack trace is a load of boilerplate system code, so I’d chuck that away and make the entry point to my code the first interesting point of reference, which I should be able to find because assembly names and namespaces tend to follow a corporate convention. The same is probably true for the top of the stack, but the very first frame may be meaningful so perhaps I’d keep that, although if I had to keep just one method name it would be the last method of my code I’d keep as that is the first point that has any real meaning. Finally, I’d rotate what’s left and stitch it together with pipes, probably something like this (ignoring the unfortunate word-wrap):-

2001-01-01 ERR Unhandled exception - OrderController.GetOrder|>ProductService.FindProduct {NullReferenceException}

I definitely don’t need the full namespace names, just the class and method, although I’d argue that with decent method names even the classes might easily be inferred from just the method name and context. Doesn’t that look a whole lot less noisy?
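
If I ever do get around to writing it, I imagine the guts would be a rough sketch along these lines (the “MyCompany” namespace prefix and the frame-filtering rules are assumptions, not a recipe):-

using System;
using System.Diagnostics;
using System.Linq;

public static class StackTraceExtensions
{
  // Rough sketch only: keep just the frames from our own code (assumed to
  // live under a "MyCompany" namespace prefix), collapse overloads and
  // forwarding calls, then stitch what's left together with pipes.
  public static string ToMinimalTrace(this Exception error)
  {
    var frames = new StackTrace(error).GetFrames() ?? new StackFrame[0];

    var interesting = frames
      .Select(f => f.GetMethod())
      .Where(m => m != null && m.DeclaringType != null
                  && m.DeclaringType.FullName.StartsWith("MyCompany."))
      .Select(m => m.DeclaringType.Name + "." + m.Name)
      .Distinct()   // overloads and forwarding calls collapse to one entry
      .Reverse();   // frames arrive failure-first; flip to read top-down

    return String.Join("|>", interesting);
  }
}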

Whilst I might not convince you to drop stack traces entirely from your logs, at least entertain the idea that you can represent them in a far more support-friendly fashion than what the runtime throws out by default.

Friday 22 November 2013

The 3 Faces of PowerShell Collections - 0, 1 & Many

There is a classic rule of thumb in programming that says there are only three useful numbers - zero, one and many. I’ve found this concept very useful when writing tests as code that deals with collections or item counts sometimes needs to handle these 3 cases in different ways. As a simple example imagine generating a log message about how many items you’re going to process. The lazy approach would be to just print the number and append a “(s)” to the noun to make it appear as though you’ve made an effort:-

Found 2 file(s) to process…

If you wanted to spell it out properly you’d write 3 separate messages:-

  1. No files need processing
  2. Found 1 file to process…
  3. Found 2 files to process…

A PowerShell Gotcha

This idea of 0, 1 & many is also the way I remember how PowerShell collections work when they are returned from a cmdlet. I was reminded of this idiom once again after debugging a colleague’s script that was failing because they had written this:-

$items = Get-SomeItems . . .

if ($items.Count -gt 0) {
. . .

For those not well versed in PowerShell this kind of construct will generate an error when no item or just 1 item is returned. The error will tell you “Count” is not a property on the variable - something like this in fact:-

Property 'Count' cannot be found on this object. Make sure that it exists.
At line:1 char:55
+ . . .
    + CategoryInfo : InvalidOperation: . . . 
    + FullyQualifiedErrorId : PropertyNotFoundStrict

You won’t see this error unless you have Strict Mode turned on (hence the PropertyNotFoundStrict in the error message). For one-liners this might be acceptable, but when I’m writing a production grade PowerShell script I always start it with these two lines (plus a few others that I covered in “PowerShell, Throwing Exceptions & Exit Codes”):-

Set-StrictMode -Version Latest
$ErrorActionPreference="stop"

For those used to the family of Visual Basic languages the former is akin to the “Option Explicit” statement you probably learned to add after misspelling variable names a few times and then scratched your head as you tried to work out what on earth was going on.

PowerShell Collections

To help illustrate these three manifestations of a collection you might come across we can create 3 folders - an empty one, one with a single file and one with many files [1]:-

PS C:\temp> mkdir Empty | Out-Null
PS C:\temp> mkdir Single | Out-Null
PS C:\temp> echo single > .\Single\one-file.txt
PS C:\temp> mkdir Many | Out-Null
PS C:\temp> echo many > .\Many\1st-file.txt
PS C:\temp> echo many > .\Many\2nd-file.txt

Now, using Get-ChildItem we can explore what happens by invoking the GetType() method on the resulting value from the cmdlet to see exactly what we’re getting [2]:-

PS> $items = Get-ChildItem Empty; $items.GetType()
You cannot call a method on a null-valued expression.

PS> $items = Get-ChildItem Single; $items.GetType()
IsPublic IsSerial Name     BaseType
-------- -------- ----     --------
True     True     FileInfo System.IO.FileSystemInfo

PS> $items = Get-ChildItem Many; $items.GetType()
IsPublic IsSerial Name     BaseType
-------- -------- ----     --------
True     True     Object[] System.Array

As you can see in the first case we get a null reference, or in PowerShell terms, a $null. In the second case we get a single item of the expected type, and in the third an array of objects. Only the final type, the array, will have a property called “Count” on it. Curiously enough, as you might have deduced from earlier, you don’t get a warning about a “null-valued expression” if you try and access the missing Count property on a $null value; you get the “invalid property” error instead:-

PS C:\temp> $null.Count
Property 'Count' cannot be found on this object. Make sure that it exists.

Forcing a ‘Many’ Result

The idiomatic way to deal with this in PowerShell is not to try and do it in the first place. It is expected that you will just create a pipeline and pass the objects along from one stage to the next letting the PowerShell machinery hide this idiosyncrasy for you:-

PS C:\temp> Get-ChildItem Empty | Measure-Object |
            Select Count
Count
-----
    0

However, if you do need to store the result in a variable and then act on it directly [3] you’ll want to ensure that the variable definitely contains a collection. And to do that you wrap the expression in “@(...)”, like so:-

PS> $items = @(Get-ChildItem Empty);
    Write-Output $items.Count
0
PS> $items = @(Get-ChildItem Single);
    Write-Output $items.Count
1
PS> $items = @(Get-ChildItem Many);
    Write-Output $items.Count
2

 

[1] Apologies for the old-skool syntax; I still work a lot with batch files and the PowerShell syntax for creating directories just hasn’t bedded in yet. The blatant use of ECHO instead of Write-Output was me just being perversely consistent.

[2] Whilst Get-Member is the usual tool for inspecting the details of objects coming through a pipeline it will hide the difference between a single value and a collection of values.

[3] For example diagnostic logging, which I tackled in “Logging & Redirection With PowerShell”.

Wednesday 13 November 2013

Self-Terminating From Inside a TCL Script DDE Handler

Like all good bugs, the one you discover is not always the one that is actually causing the real problem the user is reporting. In my last post “DDE XTYP_EXECUTE Command Corruption” I described a problem I ran into when sending a DDE XTYP_EXECUTE message from a Unicode client to an ANSI server. Whilst this had become a problem, it turned out this wasn’t the actual problem that the user had originally reported.

Hanging on Exit

The real problem the user was experiencing was that when they sent a DDE command to their TCL script to terminate itself, the calling script, which was written in VBScript, was timing out with a DMLERR_EXECACKTIMEOUT error. What made things more curious was that the user had managed to find a workaround using another DDE tool (a command line application) that did seem to terminate the TCL script without generating an error.

Although I knew nothing about TCL at all at that point, my spider-senses were already tingling when I saw this bit of the TCL script in their email:-

proc TclDdeServerHandler {args} {
  . . .
  switch -exact -- $cmd {
    . . .
    exit {
      exit
    }
  }
}

The code path for the “exit” command was causing the TCL script to terminate whilst still inside the DDE handler. Although it may not actually be a bad thing to do in a script, I’ve always tended to try and let any stacks unwind before terminating a process or script to make sure the “runtime” remains in “a good place”. Maybe I’ve just spent too many hours porting native libraries that use “exit()” instead of “return” as an error handling strategy [1].

I raised this as a concern, but given the other developer knew TCL and I didn’t I was happy to accept their answer that this wasn’t an issue.

My First TCL Script

After taking a crash course in TCL, which really just involved hacking around the script I’d already been given, I managed to create a simple one that acted as a trivial DDE server to print a popular message:-

proc RunServer {} { 
  package require dde
  dde servername TestTopic
}

proc SayHello {} { 
  puts "Hello world"
}

RunServer
vwait forever

I ran this using TCLSH.EXE MyScript.tcl and poked it remotely using a similar nugget of VBScript:-

Dim client
Set client = CreateObject("DDECOMClient.DDEClient")

client.ExecuteTextCommand "TclEval","TestTopic",
                          "SayHello"

The hardest bit about getting this working was making the script sit in a loop processing Windows messages instead of terminating, and that’s what the “vwait forever” does. The only way to exit this though is to use Ctrl+C in the console.

To test the configurable timeout behaviour I’d added to my COM component I added a sleep in the SayHello function like so.

global alarm
proc sleep {time} { 
  after $time set alarm 1 
  vwait alarm
}
. . .
proc SayHello {} { 
  puts "Waiting..." 
  sleep 2000 
  puts "Hello world"
}

Reproducing the Real Issue

My “improved” DDE COM Component went back to the original developer so they could then bump the timeout themselves to something sensible in their script. They came straight back to say that increasing the timeout didn’t work. They had bumped it up to 60 secs which, after further discussion, was revealed to be 59.99 secs longer than the operation should really take.

With a bit more TCL knowledge under my belt I started to hack around with their script, which took a while as I couldn’t even get it to run under TCLSH.EXE. A bit of random commenting out and I was finally able to reproduce the timeout problem. At this point I assumed the real issue might lie with some interaction between myself and the VBScript engine or be a subtle bug in my COM component as it was being hosted in-process and had already made the other DDE calls.

However what I couldn’t reproduce was their ability to terminate the script using another process. At first I used my own DDECmd tool as I like to dog-food my own stuff when possible. No go. Luckily they shipped me the execdde tool they were using and lo and behold it failed with a timeout exception too. Huh?

Trust Nothing

At this point I was totally bemused and was sure I was a victim of some kind of environmental difference, after all, how else could the bug reporter experience one thing whilst I saw another unless there was a difference in the tools themselves or the way they were being used? Time to start again and re-evaluate the problem…

Luckily I had been sent a copy of the scripts being run by the other developer and so I started to look more closely at what they were doing. Whilst reviewing this I noticed that the call to this 3rd party execdde tool was being done in such a way as to ignore the return code from it. In fact the lines above had the error reporting in, but it was now commented out. In the heat of battle it’s all too easy to just keep trying lots of different things in the hope that something eventually works and then we quickly lose sight of what we have and haven’t tried up to that point.

This caused me to also re-evaluate my theory about calling “exit” from inside the TCL DDE handler and so I did some more Googling and came up with this variation of the TCL DDE handler that sets an exit “flag” instead, unwinds, and then finally exits by dropping out the bottom:-

set forever 1

proc TclDdeServerHandler {args} {
  . . .
  switch -exact -- $cmd {
  . . .
    exit    {
      global forever
      set forever 0 
    } 
  }
}

package require dde 1.3
dde servername -handler ::TclDdeServerHandler TestTopic
vwait forever

This worked a treat, although there was much gnashing of teeth at first wondering why the “forever” variable wasn’t changing. Apparently I needed to tell TCL that the “forever” variable I was changing was the global one, otherwise it would silently change a local one instead.

TCLSH vs WISH

That was the end of it, or so I thought. All along there had actually been an “environmental” difference between what I was working with and what the other developer was using - the TCL interpreter. I didn’t realise that TCLSH.EXE and WISH.EXE are two slightly different variations of the TCL scripting host. I had wondered why I needed to comment out the line “console show” when trying to run their script, but just as they had done I moved on and forgot to evaluate how I had got there.

The main difference, at least as far as my proposed solution was concerned, was that when my script is run under WISH.EXE it doesn’t actually terminate, oops! Although I didn’t manage to confirm it, my suspicion was that WISH behaves more like a GUI and so it probably has an implied “vwait forever” at the end (but waiting on some internal variable instead). The solution of course is as simple as appending the call to “exit” that we took out of the DDE handler to the bottom of the script:-

. . .
package require dde 1.3
dde servername -handler ::TclDdeServerHandler TestTopic

vwait forever
exit

Is there a Bug?

I managed to find a workaround that allows the person that reported the problem to do what it was they needed to do. Sadly I don’t know whether their original idea was sound (being able to call exit directly) or if it’s a bug in the TCL interpreter or DDE package. Presumably being open source I have the opportunity to download the sources, build and debug it myself. Maybe one day I’ll have the time.

 

[1] Back in the mid 1990’s when the JPEG library was in its infancy I had the pleasure of taking this Unix-style C library that just called exit() all over the place when anything failed and tried to turn it into a DLL to run under 16-bit Windows. Because it was going to be part of a bigger application it had to return an error code instead of just bailing out; along with dealing with NEAR and FAR pointers, etc.

Tuesday 12 November 2013

DDE XTYP_EXECUTE Command Corruption

A few months back I had a request to add support for the XTYP_EXECUTE transaction type to my COM based DDE client. The question came from someone who was using the component via VBScript to automate a TCL script which in turn was automating some other processes! My underlying C++ based DDE library already had support for this, along with my DDECmd tool so I thought it was going to be a simple job. What I didn’t expect was to get into the murky depths of what happens when you mix-and-match ANSI [1] and Unicode build DDE clients and servers...

Adding XTYP_EXECUTE Support

Due to the nature of COM the API to my COMDDEClient component is essentially Unicode. When I went into my code to check how the command strings were handled in my C++ library I discovered (after spelunking the last 10 years of my VSS repo) that when I ported the library to allow it to be dual build (ANSI & Unicode) I just took the command string as is (either char or wchar_t) and pushed it out directly as the value for the XTYP_EXECUTE transaction data.

The XTYP_EXECUTE DDE transaction type is an oddity in that the “format” argument to the DdeClientTransaction() function is ignored [2]. The format of this command is documented as being a string, but the exact semantics around whether it’s ANSI or Unicode appear left unsaid. I therefore decided to correct what I thought was my naive implementation and allow the format to be specified by the caller along with the value - this felt more like a belt-and-braces approach.

Naturally when I came to implement my new ExecuteTextCommand() method on the COM DDE server IDDEConversation interface, I followed the same pattern I had used in the rest of the COM server and defaulted to CF_TEXT as the wire format [3]. I tested it using the new support I’d added to my DDECmd tool via the --format switch and thought everything was great.

Impedance Mismatch

Quite unexpectedly I quickly got a reply from the original poster to say that it wasn’t working. After doing the obvious checks that the build I provided worked as expected (still using my own DDECmd tool as the test harness), I took a crash course in TCL. I put together a simple TCL script that acted as a DDE server and printed the command string sent to it. When I tried it out I noticed I didn’t get what I expected - the string appeared to be empty, huh?

Naturally I Googled “TCL XTYP_EXECUTE” looking for someone who’s had the same issue in the past. Nothing. But I did find something better - the source code to TCL. It took seconds to track down “win/tclWinDde.c” which contains the implementation of the TCL DDE support and it shows that TCL does some shenanigans to try and guess whether the command string is ANSI or Unicode text (quite a recent change). The implementation makes sense given the fact that the uFmt parameter is documented as being ignored. What then occurred to me as I was reading the TCL source code (which is written in C) was that it was ANSI specific. I was actually looking at slightly older code it later transpired, but that was enough for the light bulb in head to go on.

A consequence of me using my own tools to test my XTYP_EXECUTE support was that I had actually only tested matched build pairs of the DDE client and server, i.e. ANSI <-> ANSI and Unicode <-> Unicode. What I had not tested was mixing-and-matching the two. I quickly discovered that one permutation always worked (ANSI client to Unicode server) but not the other way (a Unicode client sending CF_TEXT to an ANSI server failed).

Guess what, my DDECOMClient component was a Unicode build DDE client sending CF_TEXT command strings to a TCL DDE server that was an ANSI build. Here is a little table showing the results of my mix-and-match tests:-

Client    Sending          Server    Receives         Works?
-------   --------------   -------   --------------   ------
ANSI      CF_TEXT          Unicode   CF_UNICODETEXT   Yes
ANSI      CF_UNICODETEXT   Unicode   CF_UNICODETEXT   Yes
Unicode   CF_TEXT          ANSI      ???              No
Unicode   CF_UNICODETEXT   ANSI      CF_TEXT          Yes

Forward Compatible Only

I looked at the command string buffer received by the DDE server in the debugger and as far as I can tell DDE will attempt to convert the command string based on the target build of the DDE server, which it knows based on whether you called the ANSI or Unicode version of DdeInitialize() [4]. That means it will automatically convert between CF_TEXT and CF_UNICODETEXT format command strings to ensure interoperability of direct ports of DDE components from ANSI to Unicode.

In fact it does even better than that because it will also correctly convert a CF_TEXT format command sent from a Unicode build DDE client to a Unicode build DDE server. The scenario where it does fail though, and the one that I created by accident through trying to be flexible, is sending a CF_TEXT format command string from a Unicode build DDE client to an ANSI build DDE server.

Once I discovered that DDE was already doing The Right Thing I just backed out all my DDE library changes and went back to the implementation I had before! And hey presto, it now works.

 

[1] For “ANSI” you should probably read “Multi-byte Character Set” (MBCS), but that’s awkward to type and so I’m going to stick with the much simpler (and much abused) term ANSI. This means I’m now as guilty as many others about using this term incorrectly. Sorry Raymond.

[2] Not only is this documented but I also observed it myself whilst investigating my later changes on Windows XP - the uFormat argument always seemed to reach the DDE Server callback proc as 0 no matter what I set it to in the client.

[3] Given the uses for DDE that I know of (mostly financial and simple automation) CF_TEXT is more than adequate and is the lowest common denominator. Nobody has written to me yet complaining that it needs to support requests in CF_UNICODETEXT format.

[4] It is not obvious from looking at the DdeInitialize() signature that the function even has an ANSI and Unicode variant because there are no direct string parameters.

Friday 8 November 2013

Extract Method is About More Than Code Reuse

I read an interesting article recently from Reginald Braithwaite titled “Defactoring”. Sadly the article was published in a format where no comments can be added directly, so I’m going to publish mine here instead.

The premise of the article was about performing the reverse action of “factoring” on a codebase - essentially manually inlining methods again. The reason the author cited for doing this was largely because the promise of reuse, for which the original factoring had apparently taken place, had never materialised. My problem with this assertion is that the code was factored out solely for the purpose of reuse [1], which is rarely the reason I do it. For me, after having factored code out, if I should happen to need to reuse it later then it’s merely a fortuitous side-effect that it has already been simplified.

Introduce Explaining Variable

In Martin Fowler’s book on Refactoring, one of the items (Ch. 6, p124) is about breaking up long expressions into smaller ones and giving the result a name, i.e. using a variable (or constant [2]), e.g. instead of writing:-

var pt = new Point(((right-left)/2)+left,
                   ((bottom-top)/2)+top);

...you could break the expression down and name different parts:-

var width = right - left;
var height = bottom - top;
var centreX = left + (width / 2);
var centreY = top + (height / 2);
var centrePoint = new Point(centreX, centreY);

OK, so I’ve gone to the extreme here to make a point (pun intended) and in the real world the expressions are usually longer and have a few nested uses of the ternary operator thrown in for good measure.

The main reason I would do this decomposition is to explain my thinking. If there is a bug in the original example it might not be obvious what that bug is because you can’t see how I derived the expression. In the latter case, by trying to name the elements of the expression I’m cementing in my own mind exactly how it is I’ve arrived at the solution.

Show Your Workings

With my eldest child just about to go through his exams, I’m reminded about how important it is to “show your workings”. At school it’s all very well knowing you know something, but in an exam it’s important to show the examiner that you know it too. Sometimes we have to learn it the hard way by staring at a maths problem we know we’ve done wrong but can’t see where we went astray because all we wrote down was the final answer.

Magic Numbers

If you take an expression and reduce it down to its smallest elements you eventually arrive at creating names for literal values, or as they are often referred to: magic numbers. The most primitive of these are boolean values and anyone who has come across code that uses the CreateEvent function in the Windows API will have met this kind of code:-

CreateEvent(..., true, false, ...);

You have to look up the documentation to see what the arguments refer to; some simple local constants can make this much more readable at the call site:-

const bool manualEvent = true;
const bool notSignalled = false;

CreateEvent(..., manualEvent, notSignalled, ...);

Introduce Explaining Method

The alternative to introducing explaining variables would be to introduce explaining methods where we wrap up the functionality into lots of small, well named methods instead. Going back to our earlier example we could refactor our expression this way instead:-

public int CentreX(int left, int right)
{
  return ((right - left) / 2) + left;
}

Then we can use the methods directly:-

var centrePoint = new Point(CentreX(left, right),
                            CentreY(top, bottom));

On such a simple and obvious example as this it probably seems overkill, but I’ve often come across many other small expressions that just seem to take way too much brain power to understand - both what they represent and, more importantly, if they’re correct.

Turning once again to the Refactoring book and the Extract Method item in particular (Ch. 6, p 110) we find the following advice in the Motivation section:-

“If extracting improves clarity, do it, even if the name is longer than the code you have extracted”.

Too Many Layers of Abstraction?

Once again I find a criticism of this degree of breakdown being that it adds too many levels to the code. I suspect the people who say this are hell-bent on trying to squeeze as much code as possible into the smallest area on the screen. When there are too many layers the calls are across multiple classes, whereas what I’m talking about means lots of little private methods in the same source file.

Additionally, the point of naming is to convey something to the reader, and that’s often hard to do in programming, so much so I can understand why it seems easier to just not do it.

Decades ago there might also have been an argument against doing this on the grounds that it affects performance, and it may well do in dynamic languages, but modern statically compiled languages can easily inline this stuff.

There is a famous quote by Sir Tony Hoare that goes:-

“One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies”

To me, breaking a problem down as much as possible helps to create a program that approaches the former state rather than the latter.

 

[1] Of course it’s entirely plausible that the author has chosen to focus solely on this particular use case for factoring out methods and is perfectly aware of others. Either way, I’m going to add my own £0.02 to the debate.

[2] See the Single Assignment Pattern.

Friday 18 October 2013

Start C# Enumerations at 1

Coming from a background in C/C++ I was brought up with the problems of uninitialised variables, which, through the use of better compiler warnings and static code analysis tools I managed to keep under control. Sadly uninitialised member variables are still the domain of the more specialist tools as far as I know. With C# the problem still exists, to some extent. Whilst C#, nay .Net, guarantees that variables and class fields will be initialised to a known value (0 or the moral equivalent, e.g. false, null, etc.) that does not always help in the fight against bugs due to uninitialised variables [1].

Reflection

In C#, unlike C++, you have Reflection which pretty much allows you to do whatever you like and ignore whatever access controls and constraints the class designer put in place by poking data directly into class fields. When you create an object via reflection it only initialises values to their defaults; there must be some other protocol in place to allow a post-initialisation method to be called, e.g. the [OnDeserialized] attribute.

Enums

With the enum type essentially being implemented as a primitive, an integer, it gets the value 0 after construction by reflection. Whereas a real integer always has the value 0 as part of its range, an enum may not - it’s usually constrained to a small subset of the entire range of integral values. What this means in practice is that the bit pattern for 0 may or may not fall onto one of the values in the enumeration depending on how you defined it.

enum Colour
{
  Red,   // first is 0 by default
  White,
  Blue,
}
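
To see why the reflection case matters, here’s a contrived example of an object being conjured up without its constructor running, which is essentially what many serialisation frameworks do (FormatterServices is just standing in for whatever your framework uses internally):-

using System;
using System.Runtime.Serialization;

public class Order
{
  public Colour Colour;   // never explicitly assigned anywhere
}

public static class Example
{
  public static void Main()
  {
    var order = (Order)FormatterServices.GetUninitializedObject(typeof(Order));

    // Prints "Red" because the zeroed field happens to coincide with the
    // first enumeration value - the fact it was never set is silently masked.
    Console.WriteLine(order.Colour);
  }
}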

The C# documentation suggests that you make “the default value” of your enumeration the first. What they are really saying though is that you should try and ensure that an uninitialised variable of your enumeration type happens to coincide with a valid value, and therefore a value that will cause the code to behave sensibly.

Maybe I’ve been doing C++ for too long because frankly that advice disturbs me.

Special Values

I’m sure there are cases where a default value might make sense in some enumerations but I’ve rarely come across a case where it does [2]. Also I thought we’d got past the whole “special” value problem and put the likes of -1, NaN and NULL behind us? If the value is optional, then use an optional-style type, like Nullable<T>. Only, it’s not actually optional, it’s just optional when you want to create the object via reflection [3].
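
By way of illustration, if the colour genuinely were optional I’d rather see that modelled explicitly (the Paint() method here is entirely hypothetical):-

Colour? colour = null;    // genuinely optional, and the type says so

if (colour.HasValue)
  Paint(colour.Value);    // Paint() is a made-up method for illustration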

Don’t Mask The Error

So, what’s the alternative? Well you could start your enumeration from 1 instead of 0, by being explicit on the first value:-

enum Colour
{
  Red = 1,
  White,
  Blue,
}

At least now you can detect an uninitialised value when you come to process a value from the enumeration because both the static variation through a switch statement:-

switch (colour)
{
  case Colour.Red: . . .
  case Colour.White: . . .
  case Colour.Blue: . . .
  default: throw new ArgumentOutOfRangeException();
}

…and the dynamic variation via a dictionary:-

var map = new Dictionary<Colour, Method>();
. . .
if (!map.TryGetValue(colour, out method))
  throw new ArgumentOutOfRangeException();

…allow a check to be made and an invalid value to be detected. Also don’t forget that you can use the Enum.IsDefined() static method to validate an integer value before casting it to its enum equivalent. In fact this kind of thing is just perfect for an extension method:-

public static T AsEnum<T>(this int value)
  where T : struct
{
  if (!typeof(T).IsEnum)
    throw new ArgumentException();

  if (!Enum.IsDefined(typeof(T), value))
    throw new ArgumentOutOfRangeException();

  return (T)Enum.ToObject(typeof(T), value);
}
. . .
int value = 1;
Colour colour = value.AsEnum<Colour>();
Assert.That(colour, Is.EqualTo(Colour.Red));

Sadly C# generics aren’t quite good enough to constrain the type specifically to an enum; a value type is the best we can do at compile time, hence the runtime check at the start of the method. Also I’d be wary of the performance of using Enum.ToObject() to perform the cast as it seems a very ham-fisted way of just doing a (T)value style cast. As always there’s plenty more depth to this topic on StackOverflow.

Tests - Simples

I know the stock answer to my paranoia is just to write some unit tests and make sure the correct behaviour is covered, but it still feels like there is a subtlety here that is just waiting to trip up the unwary maintenance programmer down the line if you go with the party line.

 

[1] My definition of “uninitialised” is based on a variable not having the value it was intended to have, that means it has to be a conscious decision to allow a variable to take on the default value, not an unconscious one. 

[2] Using an enumeration to define a set of bitfield flags is possibly one case. But that in itself is a special type of enumeration that is hardly different from an integer and a class of constants. An enumerated type is commonly used as a type identifier which is a different sort of beast to a set of boolean flags.

[3] I’m currently working with a web API and MongoDB back-end and all the marshalling is being handled automatically via reflection, which is all new ground for me.

Thursday 17 October 2013

Codebase Stability With Feature Toggles

Whilst at Agile Cambridge a few weeks ago I got involved in a brief discussion about using Feature Toggles instead of Feature Branches. The scepticism about using them seemed to focus on the instability that it potentially creates within the codebase.

Development Branch

To me a “development” branch can either be a private (nay, feature) branch or an integration branch where complex change is allowed to take place. If the change is potentially disruptive and cannot be broken down, just maybe a private branch is needed to avoid screwing up “the build” and blocking the team. However, even with Continuous Integration and Continuous Deployment any non-trivial change has the potential to destabilise the product by introducing bugs, that is not a side-effect of feature branches or toggles in particular - just change.

Release Branch

If you want stability then that is what a release branch is for. The policy of change on a release branch is radically different than for a development branch - notably more care is taken with any change to ensure that the potential for new bugs is minimised. In short, I would not expect any formal development on a release branch at all, only surgical changes to fix any late problems. Consequently I wouldn’t expect any work on unreleased features to continue once the branch had been cut, therefore all feature toggles would be “off” and the unused code sits in stasis unless it has any unintended side-effects on the release itself.

Bedding In Changes

Before the days of branching, a software product would often go through the process of a “code freeze” where all new development would halt and the focus would switch to finding and squashing any outstanding bugs before release. Release branches help avoid that to a large degree by allowing the stability work to be done in parallel and in isolation whilst the product continues to evolve in the background on the main integration (development) branch.

Whilst in theory there is no reason not to put a potentially dangerous change in just before you cut your release branch, I would question whether you will give yourself enough time to root out any subtle problems, e.g. performance. Whilst continuously deploying is great for pushing new features out ready for formal QA it destroys any stability of the deployment itself. Unless you have automated stress testing on every deployment you risk the danger of not exposing any changes in the non-functional side, such as resource leaks, badly performing caches, error recovery, etc.

As I mentioned in “Feature Branch or Feature Toggle?” I try to break down my changes into the smallest chunks possible [1]. This has the added benefit that I can then try to get the most dangerous part of my changes into the development branch as soon as possible to give them as long as possible to “bed in”. The same goes for any other nasty changes that might have unanticipated side-effects, like upgrading 3rd party components. As time marches on the development branch moves ever closer to another release and so I would have a preference for ever lower risk changes. As long as the release cycle is frequent enough, it will not be long before the riskier changes can be picked up again.

Cherry Picking

One time when this “anything goes” attitude might bite you is if you need to cherry pick changes from your development branch and apply them to an impending release. As a rule of thumb you should only merge branches in the direction of stable to volatile, i.e. release to development. But sometimes a feature has to get dragged into a release late and if you’re pulling code from a volatile branch you run the risk of needing to drag in more than you bargained for. If the feature you need depends on some earlier refactoring that didn’t make it to the release branch originally you’re in trouble - you either have to drop the feature, pull the refactoring in too (and any changes that it depends on, and so on and so forth) or you have to re-write the change to only depend on the release version of the code. This is not really what you want to be dealing with late in the day.

So, whilst feature toggles are one way to evolve software with the benefits of early integration, there is a time and a place for them, and that is where the other riskier changes are - a development branch.

 

[1] But not so small that I would appear hypocritical and fall foul of my own advice - “What’s the Check-In Frequency, Kenneth?”.

Tuesday 15 October 2013

Concrete Interfaces

Back at the end of last year I wrote a blog post titled “Interfaces Are Not the Only Abstract Fruit”. At the time I was bemoaning the use of interfaces that mirror the public interface of classes that implement them, and in particular where those types were value types. More recently I’ve seen an alternate side to this where a concrete class has had an interface extracted from it, albeit again one which just mirrors the existing public interface of the class. The reason it seems is that the concrete class must now become a seam for unit testing…

I’ve written before in “Don’t Let Your Tools Pwn You” about the dangers of letting your tools dictate the code you produce. In this case, whilst the human is seemingly in control, the ease with which it’s possible to perform this refactoring without seeing it through to its logical conclusion is all too easy. So what is that logical conclusion I’m alluding to?

Implementation != Abstraction

Let’s start with the simpler case. Say you have a proxy for a remote service, that might be implemented in a concrete class like this:-

public class RemoteProductServiceProxy
{
  public Product[] FetchProducts(string startDate,
                                 string endDate)
  { . . . }
}

The refactoring tool, which only knows about the concrete class and not what it represents will probably lead you into creating an interface like this, i.e. stick an I on the front of the class name and expose all the public methods:-

public interface IRemoteProductServiceProxy
{
  Product[] FetchProducts(string startDate,
                          string endDate);
}

But an IRemoteProductServiceProxy is not an abstraction, it is the implementation. From the name I’m going to guess that the thing your client really wants to interact with is a service for querying about products. The fact that it’s hosted in or out of process is almost certainly of little consequence for the client [1].

Secondly the types of the parameters for the FetchProducts() method are also leaky, in the abstraction sense, if you’re pushing the burden onto the client for marshalling any input parameters, e.g. if it’s a web API call where the parameters will ultimately be passed as string values. The interface should traffic in the domain types of the abstraction, not the implementation, after all if you do host the service in-process you’d be marshalling the parameters unnecessarily.

With that all said, I’d be looking at something more like this [2]:-

public interface IProductService
{
  Product[] FetchProducts(Date start,
                          Date end);
}

Hopefully it should be obvious now why I called the original refactoring a “Concrete Interface”. I did canvas Twitter a few weeks back to see if there was another more formal (or at least polite!) term for this anti-pattern, but it didn’t come to anything. Perhaps the comments to this post will shed some light?

Mock the Abstraction, Not the Technology

There is another variation of this story where, instead of creating an interface for the service, which then allows the client to be entirely isolated from the implementation, the underlying technology is abstracted instead. The premise seems to be that to get a piece of code under test you mock out “the external dependencies”. In this instance the dependency would be the various classes that allow an HTTP request to be made.

However, before creating interfaces and factories that allow you to mock a technology at a low-level, ask yourself what this will actually buy you. Personally I see very little value in sensing all the miniscule interactions between your code and a 3rd party library because all the complexity is in their library, so it often ends up being like tests for getters and setters. For example, mocking all the database classes just to sense that you’ve put the right value in the right parameter seems less valuable to me than actually making a database call that tests the integration between the two (See “C#/SQL Integration Testing With NUnit”).

The reason why the underlying 3rd party technology is often complicated is exactly because it does many different things. For instance a web based API has to consider proxies, authentication, content of different types, different protocols, etc. But I bet you are probably only using a very small part of it. This is especially true in the enterprise arena where you have considerable control over the infrastructure.

Assuming I still want to write unit tests for some aspects of this code, e.g. verifying the URL format for a web API call, I’ll try and create some kind of simple facade that wraps up all that complexity so I’m dealing with an interface tailored to my needs:-

public interface IHttpRequester
{
  string RequestData(string url);
}
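
For what it’s worth, the implementation behind that facade needn’t be much more than this (WebClient is just one plausible choice; proxies, credentials and friends would get configured here, once, instead of leaking into every caller):-

using System.Net;

public class HttpRequester : IHttpRequester
{
  public string RequestData(string url)
  {
    using (var client = new WebClient())
    {
      return client.DownloadString(url);
    }
  }
}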

Oftentimes much of the grunge of dealing with technology like this is not in the actual execution of the task-in-hand, but in the dance required to create and configure all the various objects that are needed to make it happen. For me that complexity is best handled once elsewhere and tested through separate integration tests.

The Smell

So, if you find yourself creating an interface which matches a single concrete class in both name and public methods ask yourself if there is not really a more formal abstraction trying to get out.

 

[1] Yes, the network is evil, but the interface will likely take that into account by not being too chatty. If some sort of error recovery is possible even in the absence of the service then that might also be the case too, but these are generally the exception in my experience.

[2] I still don’t like the “I” prefix or the “weak” Service suffix, but it’s a common convention. I really like “Requester”, “Locator” or “Finder” for query based services as that sits nicely with the suggestions I made in “Standard Method Name Verb Semantics”.

Tuesday 24 September 2013

Extension Methods Should Behave Like Real Methods

The other day I was involved in two discussions about Extension Methods in C#. The second was an extension (pun intended) of the first as a by-product of me tweeting about it. My argument, in both cases, is that the observable outcome of invoking an extension method with a null “this” argument should be the same as if that method was directly implemented on the class.

This != null

What kick-started the whole discussion was a change in the implementation of an extension method I wrote for the System.String class that checks if a string value [1] is empty:-

public static bool IsEmpty(this string value)
{
  return (value.Length == 0);
}

It had been replaced with this implementation:-

public static bool IsEmpty(this string value)
{
  return String.IsNullOrEmpty(value);
}

Ironically, the reason it was changed was because it caused a NullReferenceException to be thrown when provided with a null string reference. This, in my opinion, was the correct behaviour.

Whenever I come across a bug like this I ask myself what is it that is wrong - the caller or the callee? There are essentially two possible bugs here:-

  1. The callee is incorrectly implemented and does not support the scenario invoked by the caller
  2. The caller has violated the interface contract of the callee by invoking it with unsupported arguments

The code change was based on assumption 1, whereas (as implementer the extension method) I knew the answer to be 2. In this specific case the bug was still mine and down to me misinterpreting how “empty” string values are passed to MVC controllers [2].

My rationale for saying that the “this” argument cannot be null and that the method must throw, irrespective of whether the functionality can be implemented without referencing the “this” argument or not, is that if the method were to be implemented directly in the class in the future, it would fail when attempting to dispatch the method call. This could be classified as a breaking change if you somehow relied on the exception type.

Throw a NullReferenceException

This leads me to my second point and the one that came up via Twitter. If you do decide to check the “this” argument make sure you throw the correct exception type. Whilst it might be common to throw some type of ArgumentException, such as an ArgumentNullException when validating your inputs, in this case you are attempting to emulate a real method and so you should throw a NullReferenceException as that is what would be thrown if the method was real:-

public static bool MyMethod(this object value, . . .)
{
  if (value == null)
    throw new NullReferenceException();
  . . .
}

The reason I didn’t explicitly check for a null reference in my extension method was because I was going to call an instance method anyway and so the observable effect was the same. In most cases this is exactly what happens anyway and so doing nothing is probably the right thing to do - putting an additional check in and throwing an ArgumentNullException would be wrong.

Test The Behaviour

As ever if you’re unsure and want to document your choice then write a unit test for it; this is something I missed out originally. Of course you cannot legislate for someone also deleting the test because they believe it’s bogus, but it should at least provide a speed bump that may cause them to question their motives.

In NUnit you could write a simple test like so:-

[Test]
public void is_empty_throws_when_value_is_null()
{
  string nullValue = null;

  Assert.That(() => nullValue.IsEmpty(),
         Throws.InstanceOf<NullReferenceException>());
}

Is It Just Academic?

As I’ve already suggested, the correct behaviour is the likely outcome most of the time due to the very nature of extension methods and how they’re used. So, is there a plausible scenario where someone could rely on the exception type to make a different error recovery decision? I think so, if the module boundary contains a Big Outer Try Block. Depending on what role the throwing code plays a NullReferenceException could be interpreted by the error handler as an indication that something logical has screwed up inside the service and that it should terminate itself. In a stateful service this could be an indication of data corruption and so shutting itself down ASAP might be the best course of action. Conversely an ArgumentException may be treated less suspiciously because it’s the result of pro-active validation.
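
As a sketch of what that might look like (ProcessNextRequest is a hypothetical unit of work and the policy shown is just one illustration), a service’s Big Outer Try Block could choose a different recovery strategy based on the exception type:-

try
{
  ProcessNextRequest(); // Hypothetical unit of work within the service.
}
catch (ArgumentException e)
{
  // Pro-active validation caught a bad input - log it, reject the
  // request and carry on servicing other callers.
  Console.Error.WriteLine(e);
}
catch (NullReferenceException e)
{
  // Something has gone logically wrong inside the service - assume the
  // worst (e.g. corrupted state) and shut the process down ASAP.
  Environment.FailFast(e.Message);
}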

 

[1] If you’re curious about why I’d even bother to go to such lengths (another pun intended) then the following older blog posts of mine may explain my thinking - “Null String Reference vs Empty String Value” and “Null Checks vs Null Objects”.

[2] It appears that you get an empty string on older MVC versions and a null reference on newer ones. At least I think so (and it’s what I observed) but it’s quite hard to work out as a lot of Stack Overflow posts and blogs don’t always state what version of MVC they are talking about.

Saturday 21 September 2013

The Battle Between Continuity & Progress

One of the hardest decisions I think we have to make when maintaining software is whether the change we’re making can be done in a more modern style, or using a new idiom, and if so how to balance that against following the patterns in the existing code.

Vive la Différence!

In some cases the idiom can be a language feature. When I was writing mostly C++ a colleague started using the newer “bind” helper functions. They were part of the standard and I wasn’t used to seeing them so the code looked weird. In this particular case I still think they look weird and lead to unreadable code (when heavily used), but you have to live with something like that for a while to see if it eventually pans out.

Around the same time another colleague was working on our custom grid control [1] and he came up with this weird way of chaining method calls together when creating a cell style programmatically:-

Style style = StyleManager.Create().SetBold().SetItalic().SetFace("Arial").SetColour(0, 0, 0).Set...;

At the time I was appalled at the idea because multiple statements like this were a nightmare to put breakpoints on if you needed to debug them. Of course these days Fluent Interfaces are all the rage and we use techniques like TDD that makes debugging far less of an occurrence. I myself tried to add SQL-like support to an in-memory database library we had written that relied on overloading operators and other stuff. It looked weird too and was slightly dangerous due to classic C++ lifetime issues, but C# now has LINQ and I’m writing code like that every day.
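
The trick that makes chaining like that possible is simply that each setter returns the object being configured. A rough sketch of the idea (shown in C# for brevity, with class and method names of my own choosing rather than those of the original control) might look like this:-

public class Style
{
  public bool Bold { get; private set; }
  public string Face { get; private set; }

  public Style SetBold()
  {
    Bold = true;
    return this; // Returning "this" is what allows the calls to be chained.
  }

  public Style SetFace(string name)
  {
    Face = name;
    return this;
  }
}

public static class StyleManager
{
  public static Style Create()
  {
    return new Style();
  }
}

// Usage: Style style = StyleManager.Create().SetBold().SetFace("Consolas");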

They say that programmers should try and learn a number of different programming languages because it’s by seeing how we solve problems in other languages that we realise how we can cross-pollinate and solve things in our “primary” languages in a better way. The rise in interest in functional programming has probably had one of the most significant effects on how we write code as we learn the hard way how to deal with the need to embrace parallelism. It’s only really through using lambdas in C# that I’ve begun to truly appreciate function objects in C++. Similarly pipelines in PowerShell brought a new perspective on LINQ, despite having written command shell one-liners for years!

Another side-effect is how it might affect your coding style - the way you lay out your code. Whilst Java and C# have fairly established styles, C++ code has a mixture. The standard obviously has its underscores but Visual C++ based code (MFC/ATL) favours the Pascal style. I’ve worked on a few C++ codebases where the Java camelCasing style has been used and from what I see posted on the Internet from other languages I’d say that the Java style seems to have become the de-facto style. My personal C++ codebase has shifted over the last 20 years from its Microsoft/Hungarian Notation beginnings to a Java-esque style as I finally learnt the error of my ways.

When In Rome…

The counter argument to all this is consistency. When we’re touching a piece of code it’s generally a good idea not to go around reformatting it all as it makes reading diffs really hard. Adopting new idioms is probably less invasive and in the case of something like unique_ptr / make_shared it could well make your code “more correct”. There is often also the time element - we have more “important” things to do; there are always more reasons not to change something than to change it and so it’s easier to stick with what we know.

The use of refactoring within a codebase makes some of this more palatable and with good test coverage and decent refactoring tools it can be pretty easy much of the time. Unless it’s part of the team culture though this may well just be seen as another waste of time. For me, when something is out of place and continues to be replicated, making the problem worse, I find more impetus to change it.

For example, when I started on my first C# project it was a greenfield project and we had limited tools (i.e. just Visual Studio). A number of the team members had a background in C++ and so we adopted a few classic C/C++ styles, such as using all UPPERCASE names for constants and a “m_” prefix on member variables. Once we eventually started using ReSharper we found that these early choices were beginning to create noise which was annoying. We wanted to adopt a policy of ensuring there were no serious warnings from ReSharper [2] in the source code as we were maintaining it, and also that as we went forward it was with the accepted C# style. We didn’t want to just #pragma stuff to silence R# as we also knew that without the sidebar bugging us with warnings things would never change.

One particularly large set of constants (key names for various settings) continued to remain in all upper case because no one wanted to take the hit and sort them out. My suggestion that we adopt the correct convention for new constants going forward was met with the argument that it would make the code inconsistent. Stalemate! In the end I started factoring them out into separate classes anyway as they had no business being lumped together in the first place. This paved the way for a smaller refactoring in the future.
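
As an illustration of the kind of change involved (the setting names below are made up), the shift is from one grab-bag of C-style constants to smaller, focused classes using the accepted C# convention:-

// Before: one grab-bag of SCREAMING_CAPS constants.
public static class Settings
{
  public const string DATABASE_CONNECTION_STRING = "database.connectionString";
  public const string REPORT_OUTPUT_FOLDER = "report.outputFolder";
}

// After: smaller, focused classes with conventionally named constants.
public static class DatabaseSettings
{
  public const string ConnectionString = "database.connectionString";
}

public static class ReportSettings
{
  public const string OutputFolder = "report.outputFolder";
}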

Whilst this particular example seems quite trivial, others, such as adopting immutability for data model classes or using an Optional&lt;T&gt; style class for cases where a null reference might be used, are both more contentious and probably require a longer period of inconsistency whilst the new world order is adopted. They also tend not to bring any direct functional benefit and so it’s incredibly hard to put a value on them. Consistent code doesn’t stick out like a sore thumb and so it’s less distracting to work with, but there is also no sense in doing things inefficiently or even less correctly just to maintain the status quo.
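
For instance, a bare-bones Optional&lt;T&gt; style class (this is my own sketch, not a reference to any particular library) might look something like this:-

using System;

public struct Optional<T>
{
  private readonly T value;
  private readonly bool hasValue;

  public Optional(T value)
  {
    this.value = value;
    this.hasValue = true;
  }

  public bool HasValue
  {
    get { return hasValue; }
  }

  public T Value
  {
    get
    {
      // Make the "no value" case explicit instead of handing back null.
      if (!hasValue)
        throw new InvalidOperationException("No value present");

      return value;
    }
  }
}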

 

[1] Writing a grid control is like writing a unit test framework - it’s a Rite of Passage for programmers. Actually ours was a lot less flickery than the off-the-shelf controls so it was worth it.

[2] The serious ones that is. We wanted to leverage the static analysis side of R# to point out potential bugs, like the “modified closure” warning that would have saved me the time I lost in “Lured Into the foreach + lambda Trap”.

Wednesday 18 September 2013

Letting Go - Player/Manager Syndrome

I’ve been following Joe Duffy’s blog for some time and this year he started a series on Software Leadership. In his first post he covered a number of different types of managers but he didn’t cover the one that I feel can be the most dangerous: the ex-programmer turned manager.

While I think it’s great that Joe wants to maintain his coding skills and act as a mentor and leader towards his team, this is an ideal that would struggle to exist for long in the Enterprise Arena, IMHO. In the small companies where I’ve worked the team sizes have not demanded a full-time project manager and so they have the ability to remain first-and-foremost a quality developer. In the corporate world the volume of paperwork and meetings takes away time they might have to contribute and so their skills eventually atrophy. In our fast-paced industry technologies and techniques move on rapidly and so what might once have been considered cutting-edge skills may now just be a relic of the past, especially when a heavy reliance on tooling is needed to remain productive.

There are some parallels here with the amateur football leagues, e.g. Sunday football [1], which I played in for many years. Someone (usually a long standing player) takes the role of manager because someone needs to do it, but ultimately what they really want to do is play football. As the team gets more successful and rises up the divisions the management burden takes hold and eventually they are spending more time managing and less time playing. At some point the quality of the players in the team changes such that the part-time player/manager cannot justify picking himself any longer because as a player he has become a liability - his skills now lie in managing them instead.

In software teams every full-time developer is “match fit” [2], but the danger creeps in when the team is under the cosh and the manager decides they can “muck in” and help take on some of the load. Depending on how long it’s been since they last developed Production Code this could be beneficial or a hindrance. If they end up producing sub-standard code then you’ve just bought yourself a whole load of Technical Debt. Unless time-to-market is of the utmost priority it’s going to be a false economy as the team will eventually find itself trying to maintain poor quality code.

No matter how honourable this sentiment might be, the best thing they can do is to be the professional manager they need to be and let the professional developers get on and do their job.

[1] Often affectionately known as pub football because it’s commonly made up of teams of blokes whose only connection is that they drink in the same boozer.
[2] Deliberate Practice is a common technique used to maintain a “level of fitness”.