R – Tricks to refactor a piece of code with many branches (if/then/else)

language-agnosticrefactoring

I'm having a hard time trying to refactor some pieces of code with many branches. There are many if/then/else blocks, some of them are nested.

Are there any tricks that can be used to refactor the code without wasting a lot of time trying to understand every minor aspect of the functionality first?

For now I'm basically using boolean algebra (De Morgan's laws). I'm trying to modify the conditions in if statements so I could pop up small code snippets outside if/then/else blocks. This helps somehow, but the code still remains branched. I know at some point I will eventually have to break the biggies into smaller class methods, but this is complicated because the code contains calls to many other class methods and has a lot of local-scope variables, so I will have to deal with passing many arguments to new methods. I wonder If I could use some other tricks to improve the code quality before I start breaking it into distinct smaller pieces (class methods).

Best Answer

In my experience, it's almost always more productive to functionally decompose (break things into smaller functions that do one thing and do it well) before getting into branches and equivalent logical expressions.

without wasting a lot of time trying to understand every minor aspect of the functionality first?

Among other things, the act of functionally decomposing will lead to your understanding what the code does, which leads to further chances to decompose. You want to get to functions that take only a few arguments, and return no, one or few values.

And by decomposing, you'll gain a better understanding of what parts are algorithmically essential (it must be done this way to be correct) and what parts are merely implementation details (which must be done, but can be done in various different ways without changing the output).

Often, you'll find that many of those deeply nested ifs are either unnecessary, or are repeated redundantly, or are similar enough they can be reduced to one or a few functions if you factor out minor differences in otherwise repeated code.

In other words, you want to get a handle not on the minutia of the code, but the abstract thing the code is dealing with. Many many problems boil down to some data structure (a list, a queue, a tree, a set) and some operation(s) on that structure. To the extent that you can tease apart data structures and algorithms, you can gain a level of abstraction that simplifies things. Or you can discover that whatever you have to do is better suited to other structures or algorithms, and then you can blow away a lot of cruft.

I remember years ago a colleague who always wrote database cursors, to do anything, because that's all he knew how to do. Setting up and tearing down a cursor requires some boilerplate, and a loop with some error checking, so his code always looked superficially complicated. He'd do it in a stored proc, which of course added more boilerplate.

Now, as it happened, this was in Sybase T-SQL, which has a global for errors and a global for cursor status, both of which he'd check twice, once when he got the first cursor row, then once in a loop that iterated over all the other rows. And he consistently confused the error variable (@@error) with the cursor state variable (@@sqlstatus, which is zero on success (got a cursor row), 1 on failure, and 2 if there are no more rows in the cursor). His code worked on the nominal path, because both were zero if he'd gotten a row, and when he tried to get the row after the last one, he'd get an error. So if you looked closely at his code, you'd groan (once again!) and fix that for him.

But then you'd look closer, and see that he was doing something like cursoring through the set of all rows where x = 1 and for each row setting y = y * 2, and you'd end up telling him, "just write an update statement!".

Your correcting of his checking of globals, while correct, didn't address the real problem, which was that he was using cursor and a stored proc for no good reason, when he just could have sent an update statement from the client code.

Figuring that out was more difficult, because instead of just looking at the local use of the globals, you had to look two places: where the cursor was declared (declare cursor_foo cursor for select * from table where x = 1 for update;) and twenty lines down where the update happened (update table set y = y * 2 where current of cursor_foo). (And all of these would be multi-line in a very boilerplate way.)

Figuring it out was also more difficult because you just assumed no one would go through all this just to do an update; surely all this boilerplate had to be because something special was happening in the update, right? Or because the where predicate was dynamic or something? So you'd look at this, and being a modest coder who respects his colleagues, your first instinct would be, "what am I missing here, there must be a reason a cursor was used?"

(Despite both me and his boss explaining the difference between @@error and @@sqlstatus, he just never got it, much less the idea that he could almost always just do an update; he thought in terms of procedural code, and never "got" databases, despite ostensible experience as a "database programmer".)

The lesson is to not get caught up in minutia until you have confirmed there's a need for the minutia (the implementation details) in the first place. By decomposing your code first, you have a better chance of understanding the abstractions you're dealing with, and really improving the code.

Related Topic