New computer tools uncover illegally copied software code
Two international companies. An accusation of plagiarized software. A court trial. Two expert witnesses who offer directly conflicting opinions. A judge who has never owned a computer must decide who’s right.
Just a few years ago, the experts’ testimony would have been the only technical evidence the judge would have considered. But now a third point of view is available: that of a sophisticated software forensics program. By interpreting the program’s results, an expert computer scientist can give a definitive, quantitative answer.
In recent years, litigation over software in the United States and elsewhere has skyrocketed. Partly that’s due to the 1998 case State Street Bank & Trust v. Signature Financial Group, which established that most kinds of software are patentable. Partly it’s due to the fact that you can easily and surreptitiously make a copy of copyrighted source code and put it on a flash drive or send it by e-mail. And partly it’s due to our increasing reliance on computers and the increasing value of the software that runs our businesses and our equipment.
Clearly, it’s in society’s best interest to resolve these lawsuits as efficiently and equitably as possible. But settling such disputes can get exceedingly technical, and few people have the expertise to parse source code—the human-readable form of a program—to determine what, if anything, has been illegally reproduced. A program that runs something as simple as a clock radio can have thousands of lines of code; a more complicated device, such as an airplane, can have millions. That’s why automatic software forensics tools are so useful. Just as software for analyzing DNA has become crucial in resolving criminal cases and paternity suits, tools that can quickly and accurately uncover illicit software copying are becoming key to copyright infringement litigation.
Detecting illegal copying without the aid of forensics software can be like finding the proverbial needle in a haystack. In one high-profile case that some colleagues of mine worked on, Cadence Design Systems, one of the largest developers of software for electronic design automation, sued Avanti Corp., a much smaller competitor. As is common in the tech business, Avanti had been founded by high-level engineers and executives who’d previously worked for Cadence.
In the mid-1980s Cadence introduced a product called Symbad for laying out the physical structure of integrated circuits. In a surprisingly short time Avanti came out with its own circuit-layout program, called ArcCell. Not only did the product development time seem too short, but a Cadence engineer noticed that ArcCell exhibited a very strange bug that was identical to a bug in Symbad. Given these suspicious circumstances, Cadence filed a motion with the court and convinced a judge that there was reason to believe its software copyright had been infringed.
Avanti had a lot of financial backing, though, and so it was able to delay the trial for some years. By the time the case reached the discovery phase, during which attorneys turned over relevant documents, including source code, to the opposing side, Avanti’s software had gone through many revisions.
At that time, no sophisticated forensics software yet existed that could spot illegal copying. Teams of experts spent months manually poring over the code, but they found few signs of copying. Eventually, though, they turned up one curious comment in both programs. The comment was a description in which a single word was spelled incorrectly. It was known that some of the same programmers had worked on both programs, which is completely legal as long as the programmers don’t literally copy the code. Perhaps one of them wasn’t a very good speller.
But this comment stood out. What were the chances that the same misspelling would show up in the same comment in nearly identical positions in both programs? Practically zero. Based largely on that seemingly tiny concurrence, Avanti lost both the civil and criminal lawsuits, and several of its executives went to jail. After paying fines that effectively put it out of business, what was left of Avanti was bought by a Cadence rival, Synopsys.
In that case, justice prevailed, but much of the time and expense involved in trying the case could have been avoided with forensics software.
Computer scientists have studied software copying since at least the late 1970s. In 1987, J.A.W. Faidhi and S.K. Robinson of Brunel University, in England, published a paper in the journal Computers & Education on detecting plagiarism in students’ programming projects. The paper characterized six types of source-code modifications that students tended to make, but it didn’t really define what constituted plagiarism or provide useful measurements for determining whether or not it had occurred.
Later research sought to fill that gap. Many of these efforts were based on the earlier work of computer scientist Maurice H. Halstead. Halstead wasn’t interested in plagiarism but in ways of measuring the complexity of code and the “mental effort” required to create it. He devised quantitative measurements, later called Halstead metrics, that counted the number of unique operators and operands as well as the number of operator and operand occurrences in source code.
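To make the four Halstead counts concrete, here is a minimal sketch in Python. It assumes a toy C-like language: the keyword list, the token pattern, and the function name `halstead_counts` are all illustrative choices, not part of Halstead's definitions or of any tool named in this article. Keywords and punctuation are treated as operators; names and numbers are treated as operands.

```python
import re
from collections import Counter

# Assumed token classes for a toy C-like language (illustration only).
KEYWORDS = {"if", "else", "while", "for", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>(){};,]")


def halstead_counts(source):
    """Count unique and total operators and operands in a source string."""
    operators = Counter()
    operands = Counter()
    for token in TOKEN_RE.findall(source):
        is_word = token[0].isalnum() or token[0] == "_"
        if token in KEYWORDS or not is_word:
            operators[token] += 1   # keywords and punctuation
        else:
            operands[token] += 1    # identifiers and numeric literals
    return {
        "unique_operators": len(operators),
        "unique_operands": len(operands),
        "total_operators": sum(operators.values()),
        "total_operands": sum(operands.values()),
    }


print(halstead_counts("x = x + 1;"))
```

For the statement `x = x + 1;` the operators are `=`, `+`, and `;`, and the operands are `x` (twice) and `1`, so the sketch reports three unique operators, two unique operands, and three occurrences of each kind.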
Starting in the late 1970s, various researchers used the Halstead metrics to create more sophisticated metrics that were intended to detect plagiarism. If two computer programs produced similar values for these metrics, the conclusion was that plagiarism was likely to have taken place. In 1989, Alan Parker and James Hamblen of Georgia Tech documented at least seven plagiarism-detection algorithms that relied on Halstead metrics.
Although these algorithms vary somewhat, they are similar in that they all yield a single score. If the score exceeds a certain threshold value, it indicates that plagiarism has probably occurred. But having one score for an entire program means that small sections of plagiarized code could be missed entirely. Algorithms of this kind reflect their creator’s aim: They are written by university professors trying to spot plagiarism in student projects, and so they are mainly concerned with flagrant incidents of cheating.
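The weakness of a single whole-program score can be seen in a small sketch. The scoring formula, the 0.9 threshold, and the sample numbers below are assumptions made up for illustration; they do not reproduce any of the published algorithms. Each program is summarized by four Halstead-style counts, and the score is 1 minus the average relative difference between corresponding counts.

```python
THRESHOLD = 0.9  # assumed cutoff, for illustration only


def single_score(metrics_a, metrics_b):
    """Hypothetical similarity in [0, 1] between two tuples of metric values."""
    diffs = []
    for a, b in zip(metrics_a, metrics_b):
        largest = max(a, b)
        diffs.append(0.0 if largest == 0 else abs(a - b) / largest)
    return 1.0 - sum(diffs) / len(diffs)


def flag(metrics_a, metrics_b, threshold=THRESHOLD):
    """Report likely plagiarism when the single score reaches the threshold."""
    return single_score(metrics_a, metrics_b) >= threshold


# Made-up counts: a small routine, and a large program that contains a copy of it.
small_routine = (10, 20, 50, 80)
large_program = (40, 200, 900, 1500)

print(flag(small_routine, small_routine))   # identical metrics are flagged
print(flag(small_routine, large_program))   # the embedded copy goes unnoticed
```

Because the large program's totals swamp those of the copied routine, the whole-program score falls far below the threshold even though every line of the routine was copied.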
Investigating illegal copying in commercial software is quite different. Copied code is not necessarily there illegally; it may have been purchased from a third-party vendor, or it may be open-source code. So an algorithm that assumes that any instance of copying is illegal—that is, plagiarized—ignores the fact that sometimes the copying was authorized. There can be many other reasons that one source-code file is similar to another, most of which are not due to copying. More about that later.
The other key difference between plagiarism-detection software and forensics software is that the former is designed to execute quickly and, by design, favors false negatives over false positives. In other words, a professor would rather miss a few cheaters than falsely accuse a student of cheating. One professor explained to me that just the threat of a plagiarism-detection program, whether it worked or not, was enough to discourage cheating.
The goals of a software forensics tool used in intellectual-property litigation have to be very different. In these cases, there may be hundreds of millions of dollars at stake. The tool must favor false positives over false negatives so that it does not miss any cases of copying; an expert can then examine the results and eliminate those false positives. And it must be fast, but not at the expense of accuracy; in these high-stakes cases, it’s fine to dedicate a computer or set of computers to analyze the code for a day, a week, or even a month if necessary.
So how do you go about creating such a software forensics tool? I spent about a year developing CodeMatch, for use in copyright infringement cases. My company released the first version in 2003, and the program has been evolving ever since. Here are the basic principles I followed in developing CodeMatch. First, the tool can analyze source code in a way that’s independent of the programming language. That’s extremely important, given the wide variety of software cases in which the program is used. To be able to do that, CodeMatch focuses on characteristics in the code that are generic to all types of source code. If the program were comparing automobile designs instead, it might look at characteristics such as gas mileage, rather than unique features, such as the number of tail fins. And rather than rendering a definitive verdict that illegal copying has or hasn’t occurred, CodeMatch relies on measurable quantities that can be used to judge the likelihood of copying.
CodeMatch works by gauging the statistical correlation of two variables. A correlation of 0 indicates that the two variables are unrelated. A correlation of 1 indicates that a change in one variable is always accompanied by a similar change in the other. And –1 indicates a completely opposite correspondence between their variations. For source-code files that have no similarities, the correlation is 0; for identical files, it’s 1.
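The three reference values of 0, 1, and –1 are those of the standard Pearson correlation coefficient. The sketch below is a generic illustration of that statistic only; the article does not say which formula CodeMatch actually uses, and the function name and sample data are chosen here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


xs = [1, 2, 3, 4]
print(pearson(xs, [2, 4, 6, 8]))    # rises in step with xs
print(pearson(xs, [8, 6, 4, 2]))    # falls as xs rises
print(pearson(xs, [1, -1, -1, 1]))  # no linear relationship with xs
```

The three calls land at the top, bottom, and middle of the scale, respectively. Note that the function is undefined when either sequence is constant, because the denominator is then zero.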
CodeMatch looks for such correlations by examining and comparing two collections of source-code files. Source code consists of statements, which include the instructions and identifiers that guide the program, as well as comments and strings that serve to document the code but cause no action to occur. A single line of source code may include one or more statements and one or more comments.
CodeMatch spots correlations in four places: statements, comments/strings, identifiers, and instruction sequences. Each type of correlation offers a clue to plagiarism that the others may miss. Statement correlation finds the percentage of matching statements. Comment/string correlation finds the percentage of matching comments and strings. Identifier correlation finds the percentage of matching or partially matching identifiers. Identifier names in copied code are often changed to disguise the copying, but because they contain useful information for debugging and maintaining the code, the new names are usually similar to the original names. For example, the identifier “count” may be renamed “count5” or “cnt.” CodeMatch flags such similarities.
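A rough sketch of two of these ideas follows: the percentage of matching items (statements, comments, or strings) and the flagging of partially matching identifiers. It is not CodeMatch's implementation. The normalization by the smaller collection, the 0.7 similarity cutoff, and the function names are assumptions, and the partial matching relies on Python's standard `difflib` module rather than on whatever method the tool uses.

```python
from difflib import SequenceMatcher


def matching_percentage(items_a, items_b):
    """Percentage of the smaller collection's distinct items found in both."""
    set_a, set_b = set(items_a), set(items_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(set_a & set_b) / smaller


def similar_identifiers(ids_a, ids_b, cutoff=0.7):
    """Pairs of identifiers whose spelling similarity reaches the assumed cutoff."""
    pairs = []
    for a in sorted(set(ids_a)):
        for b in sorted(set(ids_b)):
            if SequenceMatcher(None, a, b).ratio() >= cutoff:
                pairs.append((a, b))
    return pairs


statements_a = ["i = 0;", "total += x;", "return total;", "k = 2;"]
statements_b = ["i = 0;", "return total;", "show(total);", "j = 1;", "m = 3;"]
print(matching_percentage(statements_a, statements_b))

print(similar_identifiers(["count"], ["count5", "cnt", "result"]))
```

In this example, two of the four statements in the smaller file also appear in the other file, and "count" is paired with both "cnt" and "count5" but not with the unrelated name "result".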
Instruction-sequence correlation compares long sequences of instructions. Even if the identifiers, comments, strings, labels, and operators are completely different, the sequence of instructions will likely be preserved to maintain the functionality of the original code. For instance, an unscrupulous programmer may decide to use another party’s copyrighted code as a reference, perhaps even rewriting the code in a different programming language to evade detection. If the programmer duplicates only the program concepts, that’s not copyright infringement. But if he takes anything literal, like a sequence of instructions, and converts it directly into the new language, that could be infringement, because the programmer is making an unauthorized derivative of the work, which is protected by the copyright. CodeMatch can spot such similarities in code, even when they’re written in different languages. Copyright infringement can also involve copying nonliteral elements like software architecture or organization, but CodeMatch is not as useful in detecting these forms of copyright infringement.
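One simple way to picture instruction-sequence comparison is to reduce each file to an ordered list of generic instruction labels and then look for the longest run the two lists share. The sketch below does this with Python's standard `difflib` module. The labels ("for", "if", "call", and so on) and the function name are hypothetical, and the article does not describe CodeMatch's actual algorithm, so this is an illustration of the idea only.

```python
from difflib import SequenceMatcher


def longest_shared_run(seq_a, seq_b):
    """Longest contiguous run of instruction labels common to both sequences."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    match = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
    return seq_a[match.a:match.a + match.size]


# Hypothetical instruction labels extracted from two files, possibly written
# in different languages, after identifiers and comments are stripped away.
file_a = ["for", "if", "call", "assign", "return"]
file_b = ["assign", "for", "if", "call", "assign", "print"]

print(longest_shared_run(file_a, file_b))
```

Here the two files share a run of four instructions in the same order, even though nothing about their identifiers, comments, or strings was compared. A long shared run is only a lead for the expert to examine, not proof of copying.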
Once these four correlations have been determined, CodeMatch combines them into a single overall correlation value. Even then, the work is not done. A software expert must still go through the results and rule out any reasons for the correlation other than copyright infringement.
As mentioned earlier, a correlation may occur when two programs use the same widely available open-source code or code purchased from a third-party vendor. Correlations can also spring from automatic code-generation tools like Microsoft’s Visual Studio and Adobe Dreamweaver, which use standard identifiers for many variables, classes, methods, and properties. The structure of the code generated by these tools also tends to fit into a certain template with an identifiable pattern. So it’s common for two programs developed using the same code-generation tool to have a correlation.
A correlation may happen simply because the programmers who created the software studied at the same school or work in the same industry and therefore rely on the same identifier names. For example, many programmers like to use the identifier “result” to designate the result of an operation, so that identifier appears in lots of unrelated programs. A search on the Internet can determine whether an identifier is widely used or relatively rare.
Similarly, the same algorithm may show up in unrelated programs. An algorithm is a set of instructions for accomplishing a given task—say, calculating the square root of a number. In one programming language there may be an easy or well-understood way of writing that algorithm. If it’s taught in programming classes at universities or appears in a popular programming textbook, then it’s likely to show up in many programs, too.
What about two blocks of code written by the same person? Just as books by the same author may have a similar style, even when the subject matter is completely different, software written by the same programmer may also have a telltale style. He might repeatedly use a unique identifier name, for instance.
For these reasons, a software forensics expert must examine each instance where CodeMatch finds a strong correlation. If all of the benign reasons can be ruled out, the correlation must be due to unauthorized copying. In that case, the owner of the copyrighted software has a strong legal case.
Here’s how CodeMatch worked in one real-world case. A small start-up had developed software for viewing e-mail attachments on handheld devices; the software was good, but the company went out of business. The founders then started a new software company, which was quickly bought by a large competitor for a hefty sum of money. The large company, unsurprisingly, incorporated the acquired company’s code into its own program. I was hired in late 2003 by the bankrupt start-up’s investors, because they believed their start-up’s code had been illegally copied and that they were entitled to some money. The large company, again unsurprisingly, insisted that the software it had acquired did not infringe on the bankrupt start-up’s work.
To begin, I used CodeMatch to examine and compare the source code of the bankrupt company’s program and the large company’s program. I found a strong correlation between the two. I then checked whether each instance of correlation might have a legitimate explanation. Ultimately, I found matching statements, comments, and identifier names that seemed to be unique to the two programs; searching online, I could find no use of them anywhere else. I concluded that the code had been copied.
The next step was the deposition, the pretrial process during which the opposing side’s lawyers question witnesses and experts in an attempt to uncover new information. I presented a report comparing sample snippets of copied code, and I included CD-ROMs containing the results of the complete CodeMatch analysis of the code.
The defendant’s lawyer showed me a slide with three snippets of code. Two of the snippets were ones I’d included in my report as evidence of copying. But it was the third snippet that nearly tripped me up. It was identical to the other two snippets. “Did you know that this third snippet of code is open-source code that is freely available on the Web, accessible to anyone?” the lawyer asked.
I didn’t in fact know that, and I began to feel really nervous. Had I overlooked something? Had I not done a complete search? If both programs contained code from a third party that allows its code to be copied, then there was no copyright infringement. I told the lawyer that I would need to know more about the third-party code. He assured me that the expert hired by his client would provide the necessary information in his own report.
A few days later I received the opposing expert’s report. Sure enough, it contained snippets of open-source code that were identical to snippets from both parties’ code. But I noticed something else. Whereas in my report I had included dozens of lines of code in each snippet, the other expert gave only a few lines.
I then searched the Internet and found the open-source code that both programs had obviously relied on. Comparing the snippets again, I discovered that while maybe 10 to 20 percent of the lines matched exactly in all three sets of code, the vast majority of the lines were different. I concluded that the original company’s programmers had taken open-source code and made significant functional changes to it. The defendant had tried to make it appear that the open-source code had simply been copied without alteration, but in actuality proprietary changes had been made by my clients’ company and then copied by the defendant’s company. After my discovery, the two sides reached a settlement, and there was no courtroom trial. My clients didn’t divulge the terms, but they told me they were pleased with the result.
Since CodeMatch came out seven years ago, it’s become an accepted tool for sorting out cases of software copyright infringement. CodeMatch has evolved into CodeSuite, a set of tools for comparing, measuring, and filtering the results of a source-code comparison, not only for copyright infringement but also trade-secret theft and even tax cases. CodeSuite can run on a standalone computer, a multiprocessor machine, or a network of computers; my company trains lawyers and other consultants to run and interpret the results so that they can effectively use the software.
As the importance of software in our daily lives grows, intellectual property disputes over that software are also likely to escalate. You may consider all that litigation a good thing, righting a wrong, or a bad thing, draining valuable resources. But software forensics tools that automate, quantify, and standardize the analysis behind such disputes can only be beneficial: They leave less room for misunderstanding, get to the important results, and help resolve disputes much faster than ever before.
This article originally appeared in print as “Software V. Software.”
About the Author
Bob Zeidman is the president of Software Analysis and Forensic Engineering Corp., the leading provider of intellectual-property analysis software. He holds seven patents, two bachelor’s degrees—in physics and electrical engineering—from Cornell, and a master’s in EE from Stanford. He is also the inventor of the Silicon Valley Napkin, a cocktail napkin printed with a simple form for creating a business plan, which when completed can be presented to a venture capitalist as a funding pitch.