An emerging class of tools makes it easy to automatically detect copying of copyrighted software source code, even if it came from one of the hundreds of thousands of open source packages.
I am presently providing litigation support in a case of alleged software copyright infringement. In a nutshell, the plaintiff brought suit against the defendant for allegedly continuing to use plaintiff’s copyrighted software source code in defendant’s products after termination of a license agreement between the parties. Fortunately, automated tools are making it easier than ever to quickly and inexpensively detect copying of software source code.
Some of the most powerful tools for doing direct comparisons between a pair of source code sets are from S.A.F.E. Their CodeMatch tool works by comparing each file of source code in the first set with every file of code in the second set. Results are presented in a table that is sorted by the relative amount of matching code in the files. And CodeMatch is clever enough to detect copying in which variable and function names and other details were subsequently changed; CodeMatch can even detect code that was copied from one programming language into another. The only weakness of CodeMatch is that you have to have the source code for each product, which is not always possible early in litigation.
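The every-file-against-every-file comparison and the sorted results table can be pictured with a short sketch. This is only an illustration under my own assumptions, not CodeMatch's actual algorithm: the function names `similarity` and `compare_sets` are hypothetical, and the scoring uses the line-matching ratio from Python's standard `difflib` module, which would not catch renamed variables or code translated into another language.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(text_a: str, text_b: str) -> float:
    """Return a score from 0.0 to 1.0 for how many lines the two texts share."""
    return SequenceMatcher(None, text_a.splitlines(), text_b.splitlines()).ratio()


def compare_sets(first: dict[str, str], second: dict[str, str]) -> list[tuple[float, str, str]]:
    """Score every file in `first` against every file in `second`.

    Both arguments map a file name to that file's text. The result is a list
    of (score, name_in_first, name_in_second) rows, best matches first.
    """
    rows = [
        (similarity(text_a, text_b), name_a, name_b)
        for (name_a, text_a), (name_b, text_b) in product(first.items(), second.items())
    ]
    return sorted(rows, reverse=True)
```

Note that the number of comparisons is the product of the two file counts, which is one reason a dedicated tool is preferable to doing this by hand.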
Other tools from S.A.F.E. provide additional help. For example, BitMatch can compare a pair of executable binary programs or one party’s source code against another’s executable code. It works by matching strings that appear in both programs. Meanwhile, SourceDetective helps rule out the possibility that the two programs are similar only because they both borrowed from some third program. It does this by automatically searching the Internet for hundreds or thousands of matching phrases. CodeMatch, BitMatch, and SourceDetective are part of a suite of related tools called CodeSuite. CodeSuite is a free download that runs on Microsoft Windows, with license keys sold based on the amount of code to be compared.
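To make the idea of matching strings across two executables concrete, here is a minimal sketch. It is my own illustration rather than BitMatch's implementation: the helper names and the six-character minimum length are assumptions, and the extraction step simply pulls runs of printable ASCII out of the raw bytes, much as the Unix `strings` utility does.

```python
import re

# Assumption: a "string" is a run of six or more printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def extract_strings(blob: bytes) -> set[bytes]:
    """Collect every printable run found in a binary image."""
    return set(PRINTABLE_RUN.findall(blob))


def shared_strings(blob_a: bytes, blob_b: bytes) -> set[bytes]:
    """Return the strings that appear in both binary images."""
    return extract_strings(blob_a) & extract_strings(blob_b)
```

In practice, each blob would be the contents of an executable read in binary mode. Strings that come from common libraries or compilers will show up in both programs whether or not anything was copied, so a list like this is a starting point for review rather than proof.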
Of course, sometimes code may be copied from open source software. Much open source software is subject to so-called copyleft licenses, a special type of copyright license that keeps the source code open to the public. Copyleft language is drafted to ensure that the source code for certain categories of derived work is also open to the public. This creates problems for companies that wish to keep their source code private but also rely upon open source software.
Fortunately, there are also tools to detect the presence of part or all of an open source software package within a proprietary program. I have used such tools from Black Duck Software and Protecode. Both work similarly: each company maintains a database of hundreds of thousands of known open source packages against which the source code you provide is tested. Results are presented as a list of open source packages from which code may have been copied. This testing can be done entirely on a personal computer running Microsoft Windows, so that proprietary source code need not be sent outside a trusted network. Both tools are generally licensed for an expected level of use on an annual basis.
Unfortunately, the precision of CodeMatch is lost in trying to cast such a broad net for potential copying. The tools from Black Duck and Protecode don’t actually compare your code against each and every one of the millions of source code files in their database. Instead, they reduce each file of your source code to a simpler representation of its structure and then compute a mathematical signature for that new file. This signature is subsequently compared to signatures computed the same way for the files in their database. Because the simplified representation discards detail, unrelated files can end up with matching signatures. In plain English, this means that you get lots of false positives. Some open source packages that weren’t actually copied usually turn up in the results list.
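The reduce-then-sign approach can be sketched in a few lines. I do not know the vendors' actual methods, so everything here is an assumption made for illustration: the function names, the tiny keyword list, the choice to replace identifiers and numbers with placeholders, and the use of SHA-256 as the signature.

```python
import hashlib
import re

# Assumption: a deliberately small keyword list for C-like code.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source: str) -> str:
    """Reduce source text to its structure: keywords and punctuation survive,
    while comments, whitespace, identifiers, and numbers are abstracted away."""
    tokens = []
    for tok in TOKEN.findall(COMMENT.sub(" ", source)):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")   # any identifier
        elif tok[0].isdigit():
            tokens.append("NUM")  # any numeric literal
        else:
            tokens.append(tok)    # punctuation and operators
    return " ".join(tokens)


def signature(source: str) -> str:
    """Hash the normalized form so files can be compared by signature alone."""
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
```

Under this sketch, `int total = 0; // running sum` and `int count = 42;` both normalize to `int ID = NUM ;` and so share a signature. That is what allows renamed copies to be caught, and it is also why independently written code with the same shape can appear in the results.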
When searching for potential copying of open source code, I recommend searching the database from Black Duck or Protecode first. Then, to eliminate the false positives, a more thorough analysis should be performed by obtaining the listed open source packages and using CodeMatch to compare the proprietary code against them file-by-file.
With the help of tools like those mentioned here, it is possible to quickly ascertain whether source code copying has taken place. Prior to the appearance of these tools, it was necessary for an expert in software development to manually perform dozens of searching and comparison steps. Automated analysis can be used early in litigation, with the benefit of dramatically reducing the cost of that work. The same tools can also be employed proactively by companies seeking to reduce their risks of copyright infringement litigation.