Friday, October 10, 2008

What is it that you do here?

One of the problems of being a theoretical computer scientist is that it is difficult to justify to the lay person just what it is that we do. In a flippant moment, I just say, "I stare off into space and try to think of theorems to prove." It is easier than the following: "I examine the relations between classes defined by reductions to sparse sets, because the ramifications of those sets having measure zero have implications for the relation between P and NP." The reason it is easier is because in this second explanation, I have to explain classes and reductions and what it means for a a set (i.e. a language) to be sparse, and then I need to explain P and NP why the question as to whether they are equal is important. And then I need to explain why sparse sets are important in the relation between P and NP, which involves not only those classes themselves, but also Boolean circuit classes like P/poly, and the ramifications into derandomization... It is a mess to explain, and most people don't have the hours it takes just to understand the basic principles at play.

Still, there's an answer I wish I could give, but I can't since it isn't my research. I could simply say: "Theoretical computer science tells us that it is impossible to have a 100% reliable virus detector." The explanation of that is probably a little too complex for the lay person, but here it goes.

There's a problem in computer science, a language (a set of strings) defined as follows:

HALT = { < m,x > | Program M halts on input x}

This is the notorious halting problem, which no computer program can solve. Why? Well, suppose such a program exists, and call it N. We can encode N as a string, as we can do with all programs (how do you think they're stored on your hard drive, after all?). Then we can make a new program P:

P: Input x. If N halts on x, run forever. Else, halt.

Now, in similar fashion, we can encode P as a string (which, of course, depends on being able to encode N as a string). Furthermore, we can feed P itself as an input! But then, what do we have? Does P on input P halt? Well, if it does, then P should run forever (i.e. doesn't halt), and if it doesn't halt, it does. Obviously this is a contradiction. Since this was based on assuming the halting problem has a program that decides it. So there must be no program for the halting problem.

Now let's turn to the problem of our foolproof virus checker. Current day virus checking programs look for signatures, i.e. particular patterns in the binary code of a suspicious program. But we know that isn't successful, because that requires knowing the signature ahead of time, and besides, suppose we had the signature somewhere in the code, but the program is built to never execute the signature. Then it might not be a virus at all. So we want our foolproof virus checker to be capable of examining a program and telling us decisively if it is a virus (i.e. the program is capable of replicating itself, and perhaps does damage in the process).

But here's the thing. Having a foolproof virus checker would give us a program to solve the halting problem, which we have already shown cannot have a program. Here's the proof:

Suppose we know that Program A is a virus, i.e. if it runs, it will replicate itself and perhaps do some damage to the computer. We can pad A with useless instructions that do not affect how A functions (save maybe that A will run a little slower). This is one method virus-writers use to defeat signature checking, by the way. So let's take an instance of the halting problem, some pair < m,x > . We want to know if M halts on input x. Well, we create Program B by "padding" A with the code of running M on x. In other words,

B: Run M on x. Then run Program A.

In other words, if M halts on x, B has the same properties of A. Since A is a virus, B must also be a virus. However, if M doesn't halt on x, B never executes A, and thus can't be a virus, since it doesn't replicate itself or damage the computer (besides running forever, of course).

Here's the catch: our foolproof virus-checker can then decide the halting problem. It examines B, and if B is a virus, then M must halt on x. If B isn't a virus, then M must run forever on x. But, as we said before, no program can decide the halting problem, so our foolproof virus-checker is a sham. It can't exist.

I think this statement in justification of theoretical computer science is one that would help explain our field's importance in once sense. But then, it might also turn people off to the field, if all we can do is assure the public that no computer is 100% safe from viruses!

No comments: