The following document is copyrighted, 1989, by Tim Sankary - all rights reserved. It may be copied and distributed freely as long as no changes are made and as long as this copyright notice remains with the document I want to preface this document with a personal statement. I am aware that Jim Goodwin has published a partial list of his virus disassemblies and I can imagine the controversy that will result. I do not have an inside track to the "truth" of this Distribute/Don't Distribute issue, and I can frankly see both sides of the argument. I find it hard, however, to censure a colleague who has performed such excellent and dedicated work as Jim has, and I have to admire his courage in taking such a controversial step. For those of you who anticipate writing or designing Identification and Removal programs (CVIA Class III programs) for viruses, I hope you will find something of value in the following study that will be useful. If you have access to disassemblies, this document may provide some insights into designing your own disinfectant. I would like to thank "Doc" John McAfee for his guidance and help in developing this paper, and the Computer Virus Industry Association for the outstanding visual aids that they contributed. These figures have been referenced in the paper but I have been unable to create ASCII representations of them for BBS distribution. If you obtained this document from an electronic source and would like a copy of the figures, they can be obtained by sending a stamped, self addressed envelope to the CVIA, 4423 Cheeney Street, Santa Clara, CA. 95054. - Tim Sankary From the Homebase BBS 408 988 4004 DEVELOPING VIRUS IDENTIFICATION PRODUCTS In January of 1986, the world's first computer virus was unleashed upon an unsuspecting and largely defenseless population of global IBM personal computers users. The virus originated in Lahore, Pakistan, and spread rapidly from country to country through Europe and across to the North American Continent. In less than twelve months it had infected nearly a half-million computers and was causing minor havoc in hundreds of universities, corporations and government agencies. This virus, later dubbed the "Pakistani Brain", caught the user community unawares and the problems resulting from its many infections demonstrated how unprepared we were for this phenomenon. The computer systems targeted by the virus contained no specific hardware or software elements that could prevent or even slow its spread, and few utilities could even detect its presence after an infection occurrence. Fortunately, the virus was not destructive, and it limited its infections to floppy diskettes; avoiding hard disks entirely. The first defensive procedure developed to counteract this virus involved a simple visual inspection of a suspected diskette's volume serial label. The virus erased every infected diskette's volume label and replaced it with the character string - "@BRAIN". Thus, any inspection of the volume label, such as performing a simple DIRECTORY command, would indicate the presence or absence of the virus. An infected diskette could then be reformatted, or the virus could be removed by replacing the boot sector. This manual procedure is a typical, if somewhat rudimentary, example of the type of functions performed by a class of antiviral utilities commonly called Infection Identification products. Infection identification products generally employ "passive" techniques for virus detection. That is; they work by examining the virus in its inert state. This contrasts with active detection products which look for specific actions employed by a virus. For example, looking for a Format instruction within a segment of code on a disk would be a passive method of detecting a potentially destructive program. If we detected the Format attempt during program execution, however, we would be performing an active detection. Passive methods concern themselves with the static attributes of viruses, active methods concern themselves with the results of virus execution. Example active indicators are: the attempted erasure of critical files, destruction of the FAT table, re-direction of system interrupt vectors, general slowdown of the system, or an attempt to modify an executable program. These indicators are generic; that is, they are common to a large class of viruses. Because so many viruses perform these common activities, however, they are of little use in identifying individual virus strains. It is the passive virus indicators that prove most useful to a positive identification: The characteristic text imbedded within the virus, specific flags, singular filenames or a distinctive sequence of instructions that are unique to the virus. These and other similar indicators can best be ascertained by scanning system storage and examining the program files and other inert data. History Virus identification products have their genesis in the utility programs first developed in 1982 and 1983 to check public domain software for bombs or trojans before they were executed. These utility programs initially checked for questionable instructions in the suspect program's object code. Direct input/output instructions, interrupt calls, format sequences and like instructions, if found, were flagged and the user was notified. Later versions included tests for imbedded data strings that were typically used by trojan designers. Suspect programs were scanned for profanity, for keywords like "gotcha" or "sucker", and for data strings that had been found in specific trojan programs. Some programs looked also for specific names of files that were frequently used by trojans and bombs. These products, however, were seldom able to identify a specific bomb or trojan. Rather, they indicated that the suspect program contained instructions or messages of a questionable nature - implying that the program might be a generic trojan. This, however, is not sufficient for dealing with viruses. Viruses create entirely different problems than bombs or trojans. Viruses replicate, and can infect hundreds or even thousands of programs within an installation. They remain invisible for long periods of time before they activate and cause damage. And, they are difficult to remove because they imbed themselves within critical segments of the system. It is not sufficient to know that a virus is present, it is necessary to know which virus is present. We must know how it infects, what actions it takes, and, most importantly, what must be done to de-activate and remove the virus. Thus, when the first virus identification products emerged in 1986 they didn't just look for generic code or messages, they looked for specific indications that could identify the individual virus strain. This allowed the user to verify a specific infection occurrence and take appropriate action. Later versions of these products went a step further. They actually removed the virus when an infection was identified. Techniques Before we discuss the techniques used by identification products, we need to look briefly at how viruses insert themselves into programs. As shown in Figure 1, viruses actually modify the structure of the programs that they infect. Generally, the virus replaces the program's start-up segment with a routine that passes control to the main body of the virus. This main body code may be inserted within the program in a buffer area, or it may be added to the beginning or the end of the program. After execution of the virus, the program's original start-up sequence is replaced and control is passed to the program. When removing a virus from an infected program, it is crucial to determine exactly how the virus modified the program. Each virus differs from other viruses in size, segmentation and technique. Each virus chooses a different area for infection, stores the start-up sequence in a different location. and return control in a different manner. We must know exactly what the virus did during the infection process in order to reverse the steps for removal. Thus, it should be clear that in order to develop an antidote for a specific virus, we must first obtain a copy of the virus for analysis. A thorough analysis of the structure and design of the virus will provide the answers to all of the above questions. When a virus has been disassembled and analyzed, we in theory know all there is to know about the virus. We are then able to create an "attribute file" for the virus. This file contains all of characteristics of the virus that can be uniquely assigned to the virus. For example, we may find imbedded data within the virus that we would not reasonably expect to find in any other program or data file. Or we may find an instruction sequence that is sufficiently unusual that we would not expect any other program to use the exact same sequence. Figure 2 shows two virus examples that contain unique imbedded data. In the Pakistani Brain example, it is clear that we would not expect to find the exact same name, address and telephone number in any other program. In addition to "identification" attributes, the attribute file contains all information necessary to reverse the virus infection process. Common elements of an attribute file might be: - Executable code signatures - Volume label flags - Hidden file names - Absolute sector address contents - Key data at specific file offsets - Specific interrupt vector modifications - ASCII data content - Specific increases in bad sector counts When the attribute file has been created, it is inputted into a program that scans all of the appropriate areas of system storage looking for combinations of the attributes. As more attributes are discovered, the degree of assurance that the virus is present increases. For example, the character string "sUMsDOS" is common to all versions of the Israeli virus. It is conceivable, however, that the same string could appear randomly in any text file. Therefore, the identification program will look for verification attributes, such as the file offset where the character string was located, or a sequence of instructions following the data. When the virus has been identified, the removal phase begins. Since the infection attributes of the virus are known, the removal process is fairly straightforward. Usually it involves locating the main body of the virus and all segments of the original program that had been re-located by the virus. The virus is erased, and the program is then re-constructed. Clearly, multiple attribute files can be used by a single program. Thus, single identification products are able to identify multiple strains of viruses (see Figure 3). Product Advantages Infection identification products have a major advantage over other types of virus protection products: They are able to determine whether or not a system is already infected. This is a serious concern in many organizations. Other classes of virus protection products must assume that a given system is uninfected at the time the products are installed. They log the state of the system at the time they are installed and periodically compare the current state to the original state. If a virus has infected the system in the interim, the change will be detected. If a virus has already infected the system before such products are installed, however, the virus will be logged as part of the original system, and no change will be detected. Infection identification products, on the other hand, are specifically designed to look for and identify pre-existing infections. This ability to identify an existing infection is in many cases crucial to the success of implementing antiviral measures. Since a virus may remain dormant for months or even years before it activates and damages the system, pre-existing infections could cause widespread destruction in spite of our best efforts at implementing protection programs. Automatic removal is the second advantage of identification products. Virus infections can sometimes involve hundreds or thousands of programs within an organization. When the virus is discovered, the task of tracking down and disinfecting all of the infected programs can become monumental. In many cases, multiple versions of a single program may be infected, or the original source diskettes may have been lost or misplaced. In some cases, infected programs may be overlooked or incorrectly replaced, so that re- infection becomes a problem. These and other issues invariably cause problems. The identification products, however, automatically find, identify and remove the infection, normally at a rate of a few seconds per infected program. The time savings alone can be enormous. A third advantage to identification programs is that they cannot be circumvented by a known virus. Other types of products that use active methods for infection prevention or detection can be specifically targeted by viruses. The virus can seek out and destroy or disable the active element of such products. For example, if the product is a filter type product that monitors all system I/O, the virus can steal the interrupts from the monitor and thus bypass the program's checking function. Likewise, if a protection program uses a checksum or other method to look for change within a program, the virus can modify the program's checksum routine so that the change caused by an infection will not be detected. These and other techniques have been used by many viruses to avoid interference by antiviral programs that use active detection methods. Identification products, on the other hand, cannot be so easily circumvented. Since these products use passive techniques, the virus has no control over the products' functions. Keep in mind that the virus and its resultant system modifications are merely a sequence of inert bits as far as the identification product is concerned. Also the virus is not active at the time the product is being used (all such products come with their own boot diskettes, and they run stand-alone). Thus, the virus can in no way affect the product's operation, or even be aware of its presence. Problem areas There are some drawbacks to identification products however. The first problem is that these products only work for known viruses. That is, a virus that has been around long enough to be noticed, isolated, sampled, disassembled and analyzed. This may take a considerable time if the virus is unobtrusive and slow to activate. When the virus has been discovered and analyzed, the identification product must be designed, implemented, packaged, marketed and distributed - a process that could take considerably more time. Thus identification utilities will lag new virus developments by months, or in some cases, even years. This time lag implies that there will always be new viruses, and thus new dangers, against which no identification utility will be effective. The second problem with these products is more thorny, and requires a high level of product sophistication in order to resolve. At issue is a phenomenon that might be called the Uncertainty Factor, and it is caused by the increasing tendency of hackers to collect existing viruses, modify them and return them to the public domain. These modifications sometimes cause viruses to react differently from the ways in which they were originally designed, yet they may leave key identification attributes unchanged. For example, the Jerusalem virus was originally designed to slow down the infected machine's processor one-half hour after an infected program was executed. This slowdown was a nuisance to the user of the infected machine, but it severely limited the spread of the virus, because the virus made itself known early in the infection process and had limited time to replicate before being removed. In the summer of 1988, an unknown hacker modified the virus by changing just one instruction (see Figure 4). This modification disabled the routine that caused the system to slowdown, and as a result, the virus became many times more infectious. Modifications like this, and other more substantial modifications, are made almost daily to existing viruses. The danger that these modifications pose to identification products is substantial. If an identification product is attempting to remove a virus that has infected a program differently than the way in which the product expects, then the results of the disinfection will be unpredictable. Damage to the system may result, the program may be destroyed or, in the worst case, the virus will still be active even though the product thinks it has removed it. In order to minimize the risks posed by this problem, identification products must be designed to cross reference as many virus attributes as possible prior to attempting removal. If any one of the expected attributes has been changed, or is missing, the product should notify the user of the potential problem and manual intervention will be required. Future Prospects Identification products clearly must play a major role in the battle against computer viruses. As viruses become more widespread and as infections become more common, the need for utilities able to identify and help remove viruses will become apparent. It is probable that these products will become the dominant form of virus protection in the future. A few technical advances, however would greatly aid their general acceptance. One of the problems facing identification products is the time required to fully scan attached storage devices when searching for a virus. For example, as many as ten or more minutes can be required to fully scan a 40 megabyte drive while looking for just one virus. Multiple virus checks require more time. Because of this, it is impractical to perform frequent scans of the system. This is unfortunate because it would be advantageous to perform a complete identification check of a system each time the system was booted. This would provide a high degree of system security, assuming that the identification product was kept up to date. More sophisticated algorithms for searching attached storage and creative techniques for multiple virus scans could alleviate the time scan problem. A second desirable advance in the technology of these products would be the development of techniques that could identify variations of known viruses and still provide the capability to remove the modified virus. This advance would remove a major limitation of the current products and would greatly increase their reliability. Techniques for removing variations have already been developed for a few root viruses, but there currently exists no generic technique that is effective for a large class of viruses. I anticipate that this hurdle will be overcome within a year or two. A final enhancement would be the ability to fully or partially re- structure data that has been corrupted by a virus after it has activated. Currently, infection identification products are only useful if they are used before a virus begins its destructive phase. When the destructive phase begins, the virus may destroy critical control tables, data files, programs or even itself. At this point all current virus products have limited usefulness. It is possible in some cases, however, to reverse much of the destruction caused by a virus provided: 1) We know the details of the destruction process, and 2) The destructive phase has not gone on too long. For example, one of the common PC viruses scrambles the File Allocation Table by reversing a number of the entries. Since we know the exact way in which the virus scrambles the information, we can easily unscramble it. However, after a few days of data scrambling, the virus initiates a low level format of the hard disk. At this point, no recovery is possible. I anticipate that future products will incorporate recovery capabilities for a large number of virus destructive acts. This capability, and others described above, should provide the best virus protection that we can hope to achieve.