Here’s a post I wrote for NDepend a while back. The original post is here, on their blog. NDepend’s tools for .NET are very cool and made me jealous. We don’t have anything close to that powerful for Java.
This post was a lot of fun to write, since counting lines of code is one of those things you don’t think about until you start to think about it. Then you can’t stop.
There are a few ways to count lines of code, and they each have their advantages and disadvantages.
Most of the differences come down to defining what a “line” is. Is a line a literal line in the source file, a logical statement in the language we’re using, or an executable instruction?
Let’s take a look at three metrics:
- Source lines of code—the number of lines of code in a method, skipping comments and blank lines
- Logical lines of code—the number of statements, ignoring formatting and often counting a line as more than one statement
- IL instructions—the number of instructions that the code compiles to
Is one better than the others? It depends on what you’re trying to measure.
Source Lines of Code
The most direct way to count lines of code (LOC) is to, well, count lines of code.
Our IDE tells us how many lines of text a file has and displays a count in one of the margins. It’s a useful metric to have: a quick way to see how long a given method is or how many lines an object has. It gives us an ongoing indicator as to when things might need to be broken down into smaller parts or refactored a bit. Chances are we have a feel for when things are getting too long, but seeing an ongoing count on the side of the screen often helps.
It’s also relatively easy to use an external tool such as wc from GNU Coreutils or any one of many other utilities to get a quick count of the number of lines in a group of files.
But regardless of how we count raw lines of code, we’re still including comments, package statements, using statements, and even blank lines. If we’re working in our editor, we have to do some quick arithmetic in our heads (or on a post-it) if we want an accurate count. If we’re inheriting a large codebase, lines of code might be indicative of the sheer scale of what we’ve just been handed, but it’s not necessarily a good indicator of complexity. Maybe the previous owner liked vertical white space and fully qualified namespaces.
So the next step is to filter out comments and empty lines and count what is often called source lines of code (SLOC). This is more accurate than a raw count, but there’s still some built-in uncertainty. We’re still counting package and using directives, which may be interesting to know about but are mostly noise.
And of course, there’s the matter of coding style. Is this two lines of code or four?
for(int j=0; j < 100; j++)
Should it be counted differently than a file written by someone who prefers K&R-style indenting?
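To make the formatting problem concrete, here is a minimal Python sketch of a SLOC counter. It isn't how any particular tool does it: the function name count_sloc and the loop body are made up for illustration, and it assumes C-style comments and that comment markers never show up inside string literals.

```python
import re


def count_sloc(source: str) -> int:
    """Count lines that still contain text once comments are stripped."""

    def keep_newlines(match: re.Match) -> str:
        # Replace a block comment with just its newlines so that the
        # line boundaries of the surrounding code are preserved.
        return "\n" * match.group(0).count("\n")

    without_blocks = re.sub(r"/\*.*?\*/", keep_newlines, source, flags=re.DOTALL)

    count = 0
    for line in without_blocks.splitlines():
        code = line.split("//", 1)[0]  # drop a trailing // comment
        if code.strip():
            count += 1
    return count


# The same loop in two brace styles. The body is hypothetical.
ALLMAN = """\
// sum the numbers below 100
for(int j=0; j < 100; j++)
{
    total += j;
}
"""

KR = """\
// sum the numbers below 100
for(int j=0; j < 100; j++) {
    total += j;
}
"""

if __name__ == "__main__":
    for name, text in (("Allman", ALLMAN), ("K&R", KR)):
        print(name, "raw:", len(text.splitlines()), "sloc:", count_sloc(text))
    # Allman raw: 5 sloc: 4
    # K&R raw: 4 sloc: 3
```

Same loop, same behavior, different count. The only thing that changed is where the opening brace lives.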
Logical Lines of Code
Rather than trying to count lines that contain code, a different approach is to calculate statements.
A simple algorithm for C/C++, Java, and C# is to count semicolons rather than carriage returns. This method filters out comments and blank lines and renders different formatting conventions moot. But it introduces a few quirks of its own. The header of a for loop is counted as two statements, while the header of a while loop doesn’t count at all. Neither do the parts of switch and if/else statements.
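Here is a rough Python sketch of the semicolon approach, mostly to show the quirks. The name count_semicolons is just for illustration, and the comment stripping is deliberately naive: it assumes C-style comments and ignores string literals altogether.

```python
import re

# Matches /* ... */ block comments and // line comments.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", flags=re.DOTALL)


def count_semicolons(source: str) -> int:
    """Approximate the number of statements by counting semicolons."""
    return COMMENT.sub("", source).count(";")


if __name__ == "__main__":
    # The for header alone contributes two semicolons...
    print(count_semicolons("for (int j = 0; j < 100; j++) { }"))  # 2
    # ...while a while header contributes none.
    print(count_semicolons("while (running) { }"))  # 0
    # Formatting no longer matters: one line or three, the count is 2.
    print(count_semicolons("int a = 1; int b = 2;"))  # 2
    print(count_semicolons("int a = 1;\n\nint b = 2;"))  # 2
```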
This leads to developing a program that understands the language it’s counting. It has to recognize keywords and intelligently tally them.
Answering the question of how many lines a for loop is worth means assigning weights to each keyword. Logically, a for loop is worth whatever a while loop is, which is the same as a do/while. Very quickly, counting lines becomes parsing source code.
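A toy version of that idea might look like the following Python sketch. The weights table and the function name are assumptions made for this example, not taken from any real tool. It expects comment-free input with no string literals, and it doesn't try to handle do/while, which is exactly the kind of case that pushes a real counter toward a full parser.

```python
import re

# Hypothetical weights: each control construct is worth one statement.
KEYWORD_WEIGHTS = {"for": 1, "while": 1, "if": 1, "switch": 1}

# Identifiers/keywords, parentheses, and semicolons are all we look at.
TOKEN = re.compile(r"[A-Za-z_]\w*|[();]")


def count_logical_lines(source: str) -> int:
    total = 0
    depth = 0  # parenthesis nesting; > 0 means we are inside a (...) header
    for token in TOKEN.findall(source):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif token == ";":
            # Semicolons inside a for (...) header separate clauses,
            # so only count the ones outside parentheses.
            if depth == 0:
                total += 1
        else:
            total += KEYWORD_WEIGHTS.get(token, 0)
    return total


if __name__ == "__main__":
    # One loop keyword plus one body statement in both cases.
    print(count_logical_lines("for (int j = 0; j < 100; j++) { total += j; }"))  # 2
    print(count_logical_lines("while (j < 100) { total += j; }"))  # 2
```

With the weights in place, a for loop and a while loop with the same body come out equal, which the plain semicolon count couldn’t manage.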
There are many tools for counting logical lines of code. A quick Google search brings up at least a dozen in various states of disrepair.
That’s because there’s a better way to count statements, especially if you’re working in the .NET environment.
IL Instructions
.NET compiles code into Microsoft’s Common Intermediate Language, or CIL, often shortened to IL. Counting instructions in CIL avoids many of the problems presented by trying to count by parsing source files.
The most apparent advantage is there are no longer any formatting questions. While IL is still human-readable, it’s machine-generated with consistent formatting.
Counting IL instructions gives us a reliable metric without having to worry about the specifics of the high-level language. Rather than counting lines of source code, we measure the number of executable instructions our code generates in the runtime environment. So, the question of how to count different styles of loops and comparisons is answered. We do it by counting the number of instructions each of these constructs compiles to.
There are things to consider when we’re using this metric, though. While IL is human-readable, it’s not a high-level language like C#. One line of C# often compiles to several IL instructions.
At the same time, IL does not include essential parts of our code. Interface definitions and abstract methods don’t compile directly to code, but they do contribute to complexity and human overhead.
NDepend combines IL instructions with your source code to give you an accurate count of logical lines of code in a method.
Program database (PDB) files contain sequence points that correspond to locations in the source file for operations such as setting breakpoints in the debugger. They also provide NDepend with a way to correlate lines in the source file with executable code.
NbLinesOfCode is a correlation, not just the count of sequence points relabeled as something else. NDepend takes each sequence point in the PDB, examines the associated line in the source code, and then counts the line if it’s relevant. For example, curly braces entering and exiting a loop can have a breakpoint assigned to them, but they’re not a line of code.
Since NDepend derives the count from IL, it doesn’t include interfaces, abstract methods, or enumerations. It’s a count of lines associated with executable code only.
This metric is a best-of-both-worlds approach; it completely bypasses issues with differences in formatting while providing a very accurate count of how many executable statements a unit of code contains. Since NDepend makes it simple to see this number for individual methods, it’s easy to detect when a method is doing too much.
NDepend also uses the PDB to identify comments. It calculates a count of comments for methods, types, namespaces, and assemblies and makes the count available as NbLinesOfComment. The number calculated for each item includes the comments in their implementation or definitions—in other words, in between the braces—so comments outside of these spaces don’t count.
For convenience, PercentageComment is also available. The formula is what you would expect:
PercentageComment = 100*NbLinesOfComment / ( NbLinesOfComment + NbLinesOfCode)
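As a quick worked example, here is the same formula as a small Python function. The function is only an illustration of the arithmetic, not NDepend’s code, and returning 0 when there are no lines at all is my own assumption.

```python
def percentage_comment(nb_lines_of_comment: int, nb_lines_of_code: int) -> float:
    """Share of comment lines among all counted lines, as a percentage."""
    total = nb_lines_of_comment + nb_lines_of_code
    if total == 0:
        return 0.0  # assumption: nothing to measure means 0%
    return 100 * nb_lines_of_comment / total


if __name__ == "__main__":
    # A method with 20 lines of comment and 60 lines of code.
    print(percentage_comment(20, 60))  # 25.0
```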
The number of IL instructions is available as NbILInstructions. This count varies based on compiler options and can be orders of magnitude higher than the number of lines of code. According to NDepend, a seven-to-one ratio of instructions to lines of code is a decent one to aim for.
In this post, we went over a few different ways to count code and the advantages and disadvantages of each approach. Then we saw how NDepend provides three metrics that help us measure lines of code and comments.
So what should you do with this knowledge? Well, rather than simply watching a number increment in your editor as you write, decide what you need to measure. Then you can use that information to improve your code.