Creating a custom known file type for R-Studio. Methods for detecting "glued" files Bad coding practice

The function code (FC) in the telegram header identifies the telegram type: a Request telegram (Request or Send/Request) or an Acknowledgment or Response telegram (Acknowledgment frame, Response frame). In addition, the function code contains the actual transmission function and control information that prevents loss and duplication of messages, or the station type with FDL status.

FC: Function Code, Request telegram (bit positions 7 6 5 4 3 2 1 0)

b6 = 1   Request telegram
b5 = X   FCB = Alternating bit (from frame count)
b4 = X   FCV = Alternating bit switched on

b7  b3..b0     Function
1   0  (0x0)   CV = Clock Value
1   other      Reserved
0   0  (0x0)   TE = Time Event (clock synchronization)
0   3  (0x3)   SDA_LOW = Send Data Acknowledged - low priority
0   4  (0x4)   SDN_LOW = Send Data Not acknowledged - low priority
0   5  (0x5)   SDA_HIGH = Send Data Acknowledged - high priority
0   6  (0x6)   SDN_HIGH = Send Data Not acknowledged - high priority
0   7  (0x7)   MSRD = Send Request Data with Multicast Reply
0   9  (0x9)   Request FDL Status
0   12 (0xC)   SRD low = Send and Request Data
0   13 (0xD)   SRD high = Send and Request Data
0   14 (0xE)   Request Ident with reply
0   15 (0xF)   Request LSAP Status with reply 1)
0   other      Reserved

1) In the latest version of the standard this value is no longer defined, only reserved.

FC: Function Code, Response telegram (bit positions 7 6 5 4 3 2 1 0)

b7 = 0   Reserved
b6 = 0   Response telegram

b5  b4   Station type
0   0    Slave
0   1    Master not ready
1   0    Master ready, without token
1   1    Master ready, in token ring

b3..b0     Function
0  (0x0)   OK
1  (0x1)   UE = User Error
2  (0x2)   RR = No resources
3  (0x3)   RS = SAP not enabled
8  (0x8)   DL = Data Low (normal case with DP)
9  (0x9)   NR = No response data ready
10 (0xA)   DH = Data High (DP diagnosis pending)
12 (0xC)   RDL = Data not received and Data Low
13 (0xD)   RDH = Data not received and Data High
other      Reserved
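
The two tables above can be turned into a small decoder. The following is only a sketch: the function name `decode_fc` and the dictionary layout are made up for illustration, and it assumes that the leading 0/1 column of the request table is bit 7 of the FC byte.

```python
# Function codes (bits 3..0) taken from the request and response tables.
REQUEST_FUNCTIONS = {
    0x0: "TE", 0x3: "SDA_LOW", 0x4: "SDN_LOW", 0x5: "SDA_HIGH",
    0x6: "SDN_HIGH", 0x7: "MSRD", 0x9: "Request FDL Status",
    0xC: "SRD low", 0xD: "SRD high",
    0xE: "Request Ident with reply",
    0xF: "Request LSAP Status with reply",
}
RESPONSE_FUNCTIONS = {
    0x0: "OK", 0x1: "UE", 0x2: "RR", 0x3: "RS", 0x8: "DL",
    0x9: "NR", 0xA: "DH", 0xC: "RDL", 0xD: "RDH",
}
# Station type, encoded in bits 5 and 4 of a response telegram.
STATION_TYPES = {
    0: "slave",
    1: "master not ready",
    2: "master ready, without token",
    3: "master ready, in token ring",
}


def decode_fc(fc):
    """Split an FC byte into the fields described in the tables."""
    code = fc & 0x0F
    if fc & 0x40:  # b6 = 1: request telegram
        if fc & 0x80:  # b7 = 1 (assumed meaning of the leading column)
            function = "CV" if code == 0 else "reserved"
        else:
            function = REQUEST_FUNCTIONS.get(code, "reserved")
        return {
            "type": "request",
            "fcb": (fc >> 5) & 1,
            "fcv": (fc >> 4) & 1,
            "function": function,
        }
    return {
        "type": "response",
        "station": STATION_TYPES[(fc >> 4) & 0x3],
        "function": RESPONSE_FUNCTIONS.get(code, "reserved"),
    }
```

For example, `decode_fc(0x6D)` describes a request with FCB=1, FCV=0 and the function SRD high, while `decode_fc(0x08)` describes a slave response with the code DL.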

Frame Count Bit

The frame count bit FCB (b5) prevents message duplication at the acknowledging or responding station (responder) and message loss at the calling station (initiator). Excluded from this are requests without acknowledgment (SDN) and FDL Status, Ident and LSAP Status requests.

For the security sequence, the initiator must carry an FCB for each responder. When a Request telegram (Request or Send/Request) is sent to a responder for the first time, or is re-sent to a responder currently marked as non-operational, the FCB must be set to a defined state in the responder. The initiator achieves this with a Request telegram with FCV=0 and FCB=1. The responder must treat a telegram of this kind as the first message cycle and store FCB=1 together with the initiator's address (SA) (see the following table). The initiator will not repeat this message cycle.

In subsequent Request telegrams to the same responder, the initiator must set FCV=1 and change the FCB with each new Request telegram. Any responder that receives a Request telegram addressed to it with FCV=1 must evaluate the FCB. If the FCB has changed compared with the last Request telegram from the same initiator (same SA), this confirms that the preceding message cycle was concluded properly. If the Request telegram originates from a different initiator (different SA), evaluation of the FCB is not necessary. In both cases, the responder must save the FCB together with the source address SA until it receives a new telegram addressed to it.

If an acknowledgment or response telegram is lost or corrupted, the initiator must not change the FCB in the request retry; this indicates that the previous message cycle was faulty. If the responder receives a Request telegram with FCV=1 and the same FCB as the last Request telegram from the same initiator (same SA), this indicates a request retry. The responder must then retransmit the acknowledgment or response telegram held in readiness. The responder must hold the last acknowledgment or response telegram in readiness for a possible request retry until it receives the above-mentioned confirmation, a telegram with a different address (SA or DA), or a telegram that is not acknowledged (Send Data with No Acknowledge, SDN).

For Request telegrams that are not acknowledged, and for Request FDL Status, Ident, and LSAP Status, FCV=0 and FCB=0; evaluation by the responder is not necessary.

FCB (b5) | FCV (b4) | Condition | Meaning | Action
0 | 0 | DA=TS/127 | Request without acknowledgment; Request FDL Status / Ident / LSAP Status | Delete last acknowledgment
0/1 | 0/1 | DA#TS | Request to another responder | (none)
1 | 0 | DA=TS | First request | FCBM:=1; SAM:=SA; delete last acknowledgment / response
0/1 | 1 | DA=TS, SA=SAM, FCB#FCBM | New request | Delete last acknowledgment / response; FCBM:=FCB; hold acknowledgment / response in readiness for retry
0/1 | 1 | DA=TS, SA=SAM, FCB=FCBM | Retry request | FCBM:=FCB; repeat acknowledgment / response and continue to hold in readiness
0/1 | 1 | DA=TS, SA#SAM | New initiator | FCBM:=FCB; SAM:=SA; hold acknowledgment / response in readiness for retry

FCBM = stored FCB in memory; SAM = stored SA in memory; # = not equal
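
The responder side of this table can be sketched as a small state machine. This is a simplified illustration rather than an implementation of the standard: the class `Responder`, the `make_reply` callback and the returned labels are hypothetical, and the broadcast address is handled the same way in every row.

```python
BROADCAST = 127  # DA=127 addresses all stations


class Responder:
    def __init__(self, ts):
        self.ts = ts       # TS: this station's own address
        self.fcbm = None   # FCBM: stored FCB
        self.sam = None    # SAM: stored source address
        self.held = None   # last acknowledgment/response held for a retry

    def on_request(self, da, sa, fcb, fcv, make_reply):
        """Return (meaning, reply) for one received request telegram."""
        if da not in (self.ts, BROADCAST):
            return "request to another responder", None
        if fcv == 0 and fcb == 0:
            # SDN or status request: nothing is held for a retry.
            self.held = None
            return "request without FCB", make_reply()
        if fcv == 0:
            # FCB=1, FCV=0: first message cycle, never repeated.
            self.fcbm, self.sam = 1, sa
            self.held = None
            return "first request", make_reply()
        if sa != self.sam:
            self.fcbm, self.sam = fcb, sa
            self.held = make_reply()
            return "new initiator", self.held
        if fcb != self.fcbm:
            # FCB toggled: previous cycle confirmed, build a new reply.
            self.fcbm = fcb
            self.held = make_reply()
            return "new request", self.held
        # Same FCB from the same initiator: repeat the held reply.
        return "retry request", self.held
```

A retry therefore returns exactly the reply object that was produced for the original request, which is the behavior the "Retry request" row describes.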

Scanning for files of known types (often called searching for files by signature) is one of the most effective recovery methods used in the R-Studio data recovery utility. A known signature lets you restore files of a certain type even when information about the directory structure and file names is partially or completely missing or corrupted.

Usually, the disk partition table is used to determine the location of files. If we compare a disk with a book, the partition table is similar to its table of contents. When scanning for files of known types, R-Studio instead searches the disk contents themselves for certain predefined signatures. This is possible because virtually every file type has a unique signature or data pattern. File signatures are located at a specific position at the beginning of the file, and in many cases also at the end of the file. While scanning, R-Studio matches the data it finds against the signatures of known file types, which allows the files to be identified and their data restored.
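
The idea of a signature scan can be illustrated with a few lines of Python. This is a minimal sketch and not R-Studio's actual algorithm: the function name is hypothetical, and it assumes that files begin on 512-byte sector boundaries, which is why only those offsets are tested by default.

```python
def find_signature_offsets(image, signature, step=512):
    """Return every offset (a multiple of `step`) where `signature` starts."""
    return [
        offset
        for offset in range(0, len(image), step)
        if image.startswith(signature, offset)
    ]
```

Passing `step=1` checks every byte offset, which is slower but also finds data that is not sector-aligned.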

By scanning for files of known types, R-Studio can recover data from disks that have been reformatted or whose partition tables have been overwritten. Moreover, if a disk partition has been overwritten, corrupted, or deleted, scanning for known file types is the only possible option.

But almost everything has its drawbacks, and the known file types used in R-Studio are no exception. When scanning for files of known types, R-Studio can recover only unfragmented files; however, as already mentioned, in most cases this is the last method available.

R-Studio already includes signatures for the most common file types (see the R-Studio Online Help section for a complete list of known file types.)

If necessary, the user can add new types of files to R-Studio. For example, if you need to find files of a unique type, or those developed after the date of the last release of R-Studio, you can add your own signatures to files of known types. This process will be discussed next.

User Files of Known Types
User signatures for files of known types are stored in the XML file specified in the Options dialog box. Adding a signature consists of two parts:

  1. Determination of the file signature located at the beginning of the file and, if present, at the end of the file.
  2. Creation of an XML file containing the file signature and other information about the file type.

All this can be done with R-Studio. You do not need to be an expert in writing or editing XML documents or in hex editing: this guide is aimed at beginner-level users and covers every stage of the process in detail.

Example: Adding a signature for an MP4 file (XDCam-EX Codec)
Let's look at adding a file signature using the example of an .MP4 file created with Sony XDCAM-EX. Such a signature can be useful, for example, if an SD card is damaged before you have managed to save its files to the computer's hard drive.

First Step: Determining the File Signature
To determine the file signature, consider examples of files of the same format.

Let these be four video files from Sony XDCAM-EX:
ZRV-3364_01.MP4
ZRV-3365_01.MP4
ZRV-3366_01.MP4
ZRV-3367_01.MP4

For ease of consideration, let these be small files. Larger files are harder to view in hexadecimal.

1. Open the files in R-Studio. To do this, right-click each file and select View/Edit from the context menu.

2. Compare the files. We will look for the same pattern in all four files. This pattern will be the file signature. As a rule, file signatures are located at the beginning of the file, but sometimes at the end.

3. Define the file signature at the beginning of the file. In our example, it is located at the very beginning of the file. Note that this is not always the case - often the file signature is at the beginning of the file, but not at the first line (offset).

It follows from the images below that the contents of all four files are different, but they all start with the same file signature.


The highlighted area on the images is the file signature of the given file type. It is presented both in text and in hexadecimal form.

In text form, the file signature looks like this:
....ftypmp42....mp42........free

Dots (“.”) indicate characters that cannot be represented as text. Therefore, it is also necessary to give the hexadecimal form of the file signature:
00 00 00 18 66 74 79 70 6D 70 34 32 00 00 00 00 6D 70 34 32 00 00 00 00 00 00 00 08 66 72 65 65
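
The manual comparison of sample files can also be approximated in code. The sketch below uses hypothetical helper names and simply computes the bytes that all samples share at the start and at the end. As the text notes for the end of the file, a shared run may include technical data that is not part of the signature, so the result still needs a human look.

```python
def common_prefix(blobs):
    """Return the leading bytes shared by every blob in the list."""
    prefix = bytearray()
    for column in zip(*blobs):  # the n-th byte of every blob
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return bytes(prefix)


def common_suffix(blobs):
    """Return the trailing bytes shared by every blob in the list."""
    reversed_blobs = [blob[::-1] for blob in blobs]
    return common_prefix(reversed_blobs)[::-1]


def to_hex(data):
    """Format bytes the way a hex editor shows them: '00 00 00 18 ...'."""
    return " ".join(f"{byte:02X}" for byte in data)
```

In practice each blob would be the first (or last) few kilobytes of a sample file read in binary mode.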

4. In the same way, we determine the file signature at the very end of the file. It may be a different signature with a different length.

The images below highlight the file signature at the end of the file:


Note that the data before the selection (the file signature) is the same in all four files. This is technical information, which is not a file signature but indicates that all four files were recorded with the same camera and the same parameters. It is usually possible to distinguish such matching technical information from the file signature. In our example, in the last line before the beginning of the file signature, we see the text 'RecordingMode type="normal"', which clearly indicates that this is some kind of file parameter and not a signature. Always pay close attention to this line so that you don't mistakenly include technical information in the file signature.

In our case, the file signature is the following text:
</NonRealTimeMeta>...
Recall that dots denote characters that cannot be represented in text form.

In hexadecimal, the file signature is:
3C 2F 4E 6F 6E 52 65 61 6C 54 69 6D 65 4D 65 74 61 3E 0D 0A 00
Please note: the signature will not always be at the end of the file.

Second Step: Create an XML file describing the Known File Type
Now, having defined the file signature, you can create an XML file and include the corresponding file type in R-Studio. This can be done in two ways:

2.1 Using the built-in graphical file signature editor:
Select Settings from the Tools menu. In the Settings dialog box that opens, click the Known File Types tab, and then click the Edit User's File Types button.

Click the Create File Type button on the Edit User's File Types dialog box.
Set the following options:

  • Id - a unique numeric identifier. The number can be chosen arbitrarily, as long as it does not match the numeric ID of any other file type.
  • Group Description - the group in which the found files will be placed in R-Studio. You can specify a new group or select an existing one. In our case it will be the "Multimedia Video (Multimedia: Video)" group.
  • Description - a short description of the file type. In our example you can use "Sony cam video, XDCam-EX".
  • Extension - file extension of this type. In our case - mp4.

The Features parameter is optional, in our case we don't need to use it.

Next, you need to enter the initial and final file signature. To do this, select Begin and then the Add Signature command in the context menu.

Then double-click the <empty signature> field and enter the appropriate text.

Then create the final file signature. Don't forget to enter 21 in the From column field.

You have successfully created your own file signature of a known type.

Now we need to save it. There are two ways: click the Save button to save it to the default file specified on the Main tab of the Settings dialog box, or click the Save As... button to save the signature to another file.

2.2 Creating an XML file describing a Known File Type manually:
To create this file we will use XML version 1.0 and UTF-8 encoding. Don't despair if you don't know what that means: just open any text editor (for example, Notepad.exe) and enter the following text in the first line:

<?xml version="1.0" encoding="utf-8"?>

Next, we will create an XML tag that defines the file type (FileType). Given the previously described XML attributes, the tag will look like this:

Insert it right after the opening FileTypeList tag.

Next, we define the file signature tag. The initial signature (at the beginning of the file) goes inside the tag without any attributes. We use the text form of the signature, but replace the characters that cannot be represented as text with their hexadecimal values. Before each hexadecimal value, insert "\x". Thus, the tag with a file signature would look like this:

If present, the final signature (at the end of the file) must also be defined. The same tag is used for this, but with the "from" attribute set to "end". It will look like this:

Recall that there were no non-text characters in the final file signature, but there were slashes and angle brackets. To avoid confusion and errors in the XML syntax, we will replace the characters "/", "<" and ">" with their hexadecimal values.
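
The replacement described above can be automated. The following sketch converts raw signature bytes into the escaped text form; the function name is hypothetical, and the rule it applies (keep only ASCII letters and digits, escape everything else) is an assumption chosen because it reproduces the two signature strings used in this example.

```python
def to_signature_text(data):
    """Escape signature bytes as text with \\xNN for non-alphanumerics."""
    parts = []
    for byte in data:
        char = chr(byte)
        if byte < 0x80 and char.isalnum():
            parts.append(char)  # plain ASCII letter or digit
        else:
            parts.append(f"\\x{byte:02X}")  # e.g. '<' becomes \x3C
    return "".join(parts)
```

Escaping "/", "<" and ">" this way also keeps the signature from being mistaken for XML markup.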

At the end, after the file signatures, there must be the closing FileType and FileTypeList tags:

So the whole file should look like this:


\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42\x00\x00\x00\x00\x00\x00\x00\x08free
\x3C\x2FNonRealTimeMeta\x3E\x0D\x0A\x00

Remember: the XML syntax is case sensitive, so tag names must be written exactly as shown (for example, FileType, not filetype).

Let's save the file in text format with the .xml extension. For example: SonyCam.xml.

We have successfully created our own file signature of a known type. This example is sufficient to understand the basic principles of creating a custom file type. More advanced users can use XML version 2.0. You can read more about this in the R-Studio online Help section.

Third Step: Checking and Adding a File Describing a Known File Type
The next step is to add (load) your XML file into R-Studio. It will then be checked automatically.

1. Let's load the XML file created at the previous stage into R-Studio. To do this, select the Settings item from the Tools menu. In the User's file types area of the Main tab of the Settings dialog box, add the XML file we created (SonyCam.xml). Click the Apply button.

2. Answer Yes to the request to load the new file type.

3. To verify that the file type was loaded successfully, click on the Known File Types tab of the Settings dialog box. Recall that we added the file type to the Multimedia Video group (Multimedia: Video). Expanding this group (folder), we should see an element with the description we specified when creating the XML file: Sony cam video, XDCam-EX (.mp4).

If there are any errors in the file syntax, you will see the corresponding message:

In this case, check your XML file again for errors. Remember, the XML syntax is case sensitive and every tag must have a closing tag at the end.

Fourth Step: Testing the File Describing the Known File Type
To check if the custom file type we created is correct, let's try to find our .mp4 files on a removable USB flash drive.

1. Under Windows Vista or Windows 7, perform a full (not quick) format of the disk, or use a disk wiping utility (for example, R-Wipe & Clean) to completely delete all data on the disk. Let the USB disk be formatted as FAT32 (the size of the files we are looking for does not exceed 2 GB).

2. Copy the test files to the disk and restart the computer so that the contents of the cache memory are saved on the disk. You can also disconnect the external drive and then connect it again.

3. In the OS, the drive will be defined as, for example, the logical drive F:\.

4. Run R-Studio. Select our drive (F:\) and click the Scan button

5. In the Scan dialog box, in the File System area, click the Change... button and clear all checkboxes. This way we will disable the search for file systems and files using the partition table.

6. Check the box Extra Search for Known File Types. This will allow R-Studio to scan for known file types.

7. To start scanning, click the Scan button.

8. Wait while R-Studio scans the disk. The Scan Information tab will display the progress of the scan.


9. After scanning is complete, select the Extra Found Files item and double-click on it.


10. Our test files will be located in the Sony cam video, XDCam-EX folder (or in a folder with another name that matches the file type description specified in the Second Step).


You can see that the filenames, dates and locations (folders) were not restored because this information is stored in the file system. Therefore, R-Studio will automatically display each file with a new name.

However, it can be seen that the contents of the files are not damaged. To verify this, let's open them in the appropriate program, for example, VLC media player.


Conclusion
The ability of R-Studio to scan for files of known types allows you to recover data even from a disk whose file systems have been reformatted or overwritten. Searching for files by their signatures is quite efficient, and it is especially useful if you know exactly which type of file is being recovered, as in our example. The ability to create custom file types allows you to add any file that has a specific file signature to the list of known file types.

The concept of a "magic number" in programming has three meanings:

  • Data signature
  • Distinguished unique values that should not be the same as other values (such as a UUID)
  • Bad programming practice

Data Signature

A magic number, or signature, is an integer or text constant used to uniquely identify a resource or data. Such a number carries no meaning in itself and can cause bewilderment when encountered in program code without the appropriate context or comment, while an attempt to change it to another value, even a close one, can lead to completely unpredictable consequences. For this reason, such numbers have been ironically called magic numbers, and the name is now firmly established as a term. For example, every compiled Java class file starts with the hexadecimal "magic number" 0xCAFEBABE. A second widely known example: every Microsoft Windows executable file with the .exe extension begins with the byte sequence 0x4D5A (which corresponds to the ASCII characters MZ, the initials of Mark Zbikowski, one of the creators of MS-DOS). A lesser-known example is the uninitialized pointer in Microsoft Visual C++ (since the 2005 version of Microsoft Visual Studio), which has the value 0xDEADBEEF in debug mode.

In UNIX-like operating systems, the type of a file is usually determined by its signature, regardless of the extension of its name. These systems provide the standard file utility to interpret the signature of a file.
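
Identification by signature can be sketched in a few lines. This is a toy illustration of the principle, not how the file utility is implemented; the table contains only the two signatures mentioned above, and the names `MAGIC_TABLE` and `identify` are made up for the example.

```python
# (signature bytes, description) pairs; only the examples from the text.
MAGIC_TABLE = [
    (b"\xCA\xFE\xBA\xBE", "compiled Java class"),
    (b"MZ", "DOS/Windows executable"),
]


def identify(header):
    """Return a description for the first matching signature, else None."""
    for magic, description in MAGIC_TABLE:
        if header.startswith(magic):
            return description
    return None
```

The file name and extension never enter the decision; only the first bytes of the content do.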

Bad programming practice

"Magic numbers" is also the name of a bad programming practice in which a numeric value occurs in the source text and it is not obvious what it means. For example, a snippet like this, written in Java, would be bad:

drawSprite(53, 320, 240);

It is hard to tell what the numbers 53, 320 and 240 mean here. The same call reads much better with named constants:

final int SCREEN_WIDTH = 640;
final int SCREEN_HEIGHT = 480;
final int SCREEN_X_CENTER = SCREEN_WIDTH / 2;
final int SCREEN_Y_CENTER = SCREEN_HEIGHT / 2;
final int SPRITE_CROSSHAIR = 53;
...
drawSprite(SPRITE_CROSSHAIR, SCREEN_X_CENTER, SCREEN_Y_CENTER);

Now it's clear: this line displays a sprite, the crosshair of the sight, in the center of the screen. In most programming languages, all values used for such constants are calculated at compile time and substituted where they are used. Therefore, such a change in the source text does not make the program any slower.

In addition, magic numbers are a potential source of errors in the program:

  • If the same magic number is used more than once in a program (or could potentially be used), then changing it will require editing each occurrence (instead of just editing the value of the named constant). If not all occurrences are corrected, at least one error will occur.
  • In at least one of the occurrences, the magic number may be misspelled initially, and this is quite difficult to detect.
  • The magic number can depend on an implicit parameter or another magic number. If these dependencies, not explicitly identified, are not satisfied, at least one error will occur.
  • When modifying occurrences of one magic number, it is possible to erroneously change another magic number that is independent but has the same numeric value.

Magic numbers and cross-platform code

Sometimes magic numbers hurt cross-platform code. The point is that in C on 32-bit and 64-bit operating systems the sizes of the char, short and long long types are guaranteed, while the sizes of int, long, size_t and ptrdiff_t can change (for the first two, depending on the preferences of the compiler developers; for the last two, depending on the bitness of the target system). In old or poorly written code there may be "magic numbers" denoting the size of a type; when moving to machines with a different bitness, they can lead to subtle errors.

For instance:

const size_t NUMBER_OF_ELEMENTS = 10;
long a[NUMBER_OF_ELEMENTS];

memset(a, 0, 10 * 4);                             // wrong - long is assumed to be 4 bytes, magic number of elements is used
memset(a, 0, NUMBER_OF_ELEMENTS * 4);             // wrong - long is assumed to be 4 bytes
memset(a, 0, NUMBER_OF_ELEMENTS * sizeof(long));  // not entirely correct - duplication of the type name (if the type changes, you will have to change it here as well)
memset(a, 0, NUMBER_OF_ELEMENTS * sizeof(a[0]));  // correct, optimal for dynamic arrays of non-zero size
memset(a, 0, sizeof(a));                          // correct, optimal for static arrays

Numbers that are not magical

Not all numbers need to be converted to constants. For example, the code for

My boss set me a rather interesting task: in a short time, write an executable file analyzer that could find virus bodies by signatures and determine the packer/cryptor used. The finished prototype appeared in a couple of hours.

Author's word

Signature analysis

Searching for a malicious object by signatures is something any antivirus can do. In general, a signature is a formalized description of features by which it can be determined that the file being scanned is a virus, and a quite specific one.

There are various methods here. One option is to use a signature composed of N bytes of a malicious object. In this case you can do more than a naive byte-by-byte comparison and compare against a mask instead (such as looking for the bytes EB ?? ?? CD 13). Or you can set additional conditions such as "these bytes must be at the program's entry point", and so on. Malware signatures are just one particular case.
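
A mask comparison of this kind is easy to sketch. The code below is an illustrative, unoptimized version with hypothetical function names; it treats "??" as "any byte" and returns the first offset at which the whole mask matches.

```python
def parse_mask(pattern):
    """Turn 'EB ?? ?? CD 13' into [0xEB, None, None, 0xCD, 0x13]."""
    return [None if token == "??" else int(token, 16)
            for token in pattern.split()]


def find_masked(data, pattern):
    """Return the first offset where the mask matches, or -1 if none."""
    mask = parse_mask(pattern)
    for start in range(len(data) - len(mask) + 1):
        if all(expected is None or data[start + i] == expected
               for i, expected in enumerate(mask)):
            return start
    return -1
```

An "entry point" condition could be added on top of this by comparing the returned offset with the entry point offset taken from the executable's header.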

In the same way, features are described by which it can be determined that an executable file is packed with a particular cryptor or packer (for example, the commonplace ASPack). If you read our magazine carefully, you have definitely heard of a tool called PEiD, which can identify the most commonly used packers, cryptors and compilers (its database contains a large number of signatures) for the PE file passed to it. Alas, new versions of the program have not been released for a long time, and recently a message appeared on the official website saying that the project will not be developed further. It's a pity, because PEiD's features (especially given the plugin system) could have been very useful to me. After a short analysis, it became clear that this was not an option. But after digging around in English-language blogs, I quickly found what suited me: the YARA project (code.google.com/p/yara-project).

What is YARA?

From the very beginning I was convinced that somewhere on the Web there are already open developments that would take on the task of determining the correspondence between some signature and the file under investigation. If I could find such a project, then it could easily be put on the rails of a web application, add different signatures there and get what was required of me. The plan began to seem even more real when I read the description of the YARA project.

The developers themselves position it as a tool to help malware researchers identify and classify malicious samples. The researcher can create descriptions for different types of malware using text or binary patterns that describe its formalized features. This is how signatures are obtained. In fact, each description consists of a set of strings and a logical expression that determines how the analyzer behaves.

If the conditions of one of the rules are met for the file under investigation, it is determined accordingly (for example, a worm such and such). A simple example of a rule to understand what is at stake:

rule silent_banker : banker
{
    meta:
        description = "This is just an example"
        threat_level = 3
        in_the_wild = true
    strings:
        $a = {6A 40 68 00 30 00 00 6A 14 8D 91}
        $b = {8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9}
        $c = "UVODFRYSIHLNWPEJXQZAKCBGMT"
    condition:
        $a or $b or $c
}

In this rule, we tell YARA that any file containing at least one of the sample strings described in the $a, $b, $c variables should be classified as a silent_banker trojan. And this is a very simple rule. In fact, rules can be much more complicated (we will talk about this below).
The authority of the YARA project is evident even from the list of projects that use it:

  • VirusTotal Malware Intelligence Services (vt-mis.com);
  • jsunpack-n (jsunpack.jeek.org);
  • We Watch Your Website (wewatchyourwebsite.com).

All code is written in Python, and the user is offered both the module itself for use in their developments, and just an executable file to use YARA as a standalone application. As part of my work, I chose the first option, but for simplicity, in the article we will use the analyzer simply as a console application.

After some digging, I quickly figured out how to write rules for YARA, as well as how to attach virus signatures from a free antivirus and packer signatures from PEiD to it. But let's start with the installation.

Installation

As I said, the project is written in Python, so it can be easily installed on Linux, Windows, and Mac. To begin with, you can simply take the binary. If we call the application in the console without arguments, we get the usage summary.

$ yara
usage: yara ... ... FILE | PID

That is, the format of the program call is as follows: first comes the name of the program, then the list of options, after which the file with the rules is indicated, and at the very end - the name of the file under investigation (or the directory containing the files), or the process ID. Now it would be nice to explain how these very rules are drawn up, but I don’t want to immediately load you with a dry theory. Therefore, we will act differently and borrow other people's signatures so that YARA can perform one of the tasks we have set - a full-fledged detection of viruses by signatures.

Your antivirus

The most important question: where do we get a signature database of known viruses? Anti-virus companies actively share such databases among themselves (some more generously, some less). To be honest, at first I even doubted that anyone on the Web was openly posting such things. But, as it turned out, there are good people. A suitable database from the popular ClamAV antivirus is available to everyone (clamav.net/lang/en). In the "Latest Stable Release" section, you can find a link to the latest version of the antivirus product, as well as links for downloading the ClamAV virus databases. We are primarily interested in the files main.cvd (db.local.clamav.net/main.cvd) and daily.cvd (db.local.clamav.net/daily.cvd).

The first contains the main signature database, the second the most complete database at the moment, with various additions. For our purpose daily.cvd is enough; it contains more than 100,000 malware signatures. However, the ClamAV database is not a YARA database, so we need to convert it to the correct format. But how? After all, we still know nothing about either the ClamAV format or the YARA format. This problem has already been taken care of for us: there is a small script that converts the ClamAV virus signature database into a YARA rule set. The script is called clamav_to_yara.py and was written by Matthew Richard (bit.ly/ij5HVs). Download the script and convert the databases:

$ python clamav_to_yara.py -f daily.cvd -o clamav.yara

As a result, we will get a signature base in the clamav.yara file, which will be immediately ready for use. Now let's try the combination of YARA and ClamAV base in action. Scanning a folder using a signature is performed with a single command:

$ yara -r clamav.yara /pentest/msf3/data

The -r option specifies that the scan should be performed recursively through all subfolders of the current folder. If there were any virus bodies in the /pentest/msf3/data folder (at least those in the ClamAV database), then YARA will immediately report this. In principle, this is a ready-made signature scanner. For more convenience, I wrote a simple script that checked for ClamAV database updates, uploaded new signatures and converted them to YARA format. But these are details. One part of the task is done, now we can start writing rules for defining packers/cryptors. But for this it was necessary to deal with them a little.

Playing by the rules

So, the rule is the main mechanism of the program, which allows you to assign a given file to a category. Rules are described in a separate file (or files) and look very similar to the struct {} construct from C/C++.

rule BadBoy
{
strings:
$a = "win.exe"
$b = "http://foo.com/badfile1.exe"
$c = "http://bar.com/badfile2.exe"
condition:
$a and ($b or $c)
}

In principle, there is nothing complicated in writing rules. Within the framework of this article, I touched only on the main points, and you will find the details in the manual. For now, the ten most important points are:

1. Each rule starts with the rule keyword followed by the rule identifier. Identifiers follow the same naming rules as variables in C/C++: they can consist of letters, digits and underscores, and the first character cannot be a digit. The maximum length of an identifier is 128 characters.

2. Typically, rules consist of two sections: a definitions section (strings) and a condition section. The strings section contains data, on the basis of which a decision will be made in the condition section whether the specified file satisfies certain conditions.

3. Each string in the strings section has its own identifier, which begins with a $ sign, much like a variable declaration in PHP. YARA supports regular strings enclosed in double quotes (" ") and hexadecimal strings enclosed in curly braces ({ }), as well as regular expressions:

$my_text_string = "text here"
$my_hex_string = { E2 34 A1 C8 23 FB }

4. The condition section contains all the logic of the rule. This section must contain a boolean expression that determines when a file or process matches the rule. Typically, this section refers to previously declared strings. A string identifier is treated as a boolean variable that is true if the string was found in the file or process memory, and false otherwise. The rule above specifies that files and processes containing the string win.exe and one of the two URLs must be categorized as BadBoy (after the name of the rule).

5. Hexadecimal strings allow three constructs that make them more flexible: wildcards, jumps, and alternatives. Wildcards are places in a string whose content is unknown and can be any value. They are denoted by the symbol "?":

$hex_string = { E2 34 ?? C8 A? FB }

This approach is very useful when specifying strings whose length is known but whose content can change. If part of the string can be of different lengths, it is convenient to use ranges:

$hex_string = { F4 23 [4-6] 62 B4 }

This entry means that in the middle of the string there can be from 4 to 6 arbitrary bytes. You can also specify alternatives:

$hex_string = { F4 23 ( 62 B4 | 56 ) 45 }

This means that the third position can hold either 62 B4 or 56, so the entry matches the byte sequences F42362B445 and F4235645.

6. To check that a given string is at a certain offset in a file or process address space, use the at statement:

$a at 100 and $b at 200

If the string can be within a certain range of addresses, the in operator is used:

$a in (0..100) and $b in (100..filesize)

Sometimes you need to state that the file must contain a certain number of strings from a given set. This is done using the of operator:

rule OfExample1
{
strings:
$foo1 = "dummy1"
$foo2 = "dummy2"
$foo3 = "dummy3"
condition:
2 of ($foo1,$foo2,$foo3)
}

The above rule requires that the file contain any two strings from the set ($foo1,$foo2,$foo3). Instead of specifying a specific number of strings, you can use the keywords any (at least one string from the given set) and all (all strings from the given set).

7. The last interesting feature to consider is applying one condition to many strings. This feature is very similar to the of operator, but the for..of operator is more powerful:

for expression of string_set: (boolean_expression)

This entry should be read like this: of the strings specified in string_set, at least expression of them must satisfy boolean_expression. In other words, boolean_expression is evaluated for each string from string_set, and it must return true for at least expression of them. Below, we will look at this construct in a specific example.

Making PEiD

So, now that the rules are more or less clear, we can start implementing the detector of packers and cryptors in our project. As a starting point, I borrowed the signatures of well-known packers from the same PEiD. Its plugins folder contains the userdb.txt file, which has what we need. There were 1850 signatures in my database.

That is a lot, so to import them all I advise you to write some kind of script. The format of this database is simple: it is a plain text file that stores records of the form:


[Name of the Packer v1.0]
signature = 50 E8 ?? ?? ?? ?? 58 25 ?? F0 FF FF 8B C8 83 C1 60 51 83 C0 40 83 EA 06 52 FF 20 9D C3
ep_only = true

The first line specifies the name of the packer that will be displayed in PEiD; for us it will be the rule identifier. The second is the signature itself. The third is the ep_only flag, which specifies whether to look for the given string only at the entry point address or throughout the entire file.
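The import script mentioned above could look roughly like the following Python sketch. It is not the script used for this article: the function name peid_to_yara, the peid_ prefix with a running index (added so that identifiers stay unique and never start with a digit) and the one-rule-per-entry layout are all assumptions made for illustration.

```python
import re

# One PEiD entry: [name], "signature = <hex bytes>", "ep_only = true|false".
ENTRY = re.compile(
    r"\[(?P<name>[^\]]+)\]\s*"
    r"signature\s*=\s*(?P<sig>[0-9A-Fa-f? ]+?)\s*"
    r"ep_only\s*=\s*(?P<ep>true|false)",
    re.IGNORECASE,
)

def peid_to_yara(text):
    """Turn PEiD userdb.txt entries into YARA rule text (illustrative sketch)."""
    rules = []
    for index, match in enumerate(ENTRY.finditer(text)):
        # Keep only characters that are legal in a YARA identifier.
        name = re.sub(r"[^0-9A-Za-z_]", "_", match.group("name").strip())
        ident = f"peid_{index}_{name}"[:128]
        sig = " ".join(match.group("sig").split())
        if match.group("ep").lower() == "true":
            condition = "$a at entrypoint"   # signature must sit at the entry point
        else:
            condition = "$a"                 # signature may be anywhere in the file
        rules.append(
            f"rule {ident}\n{{\nstrings:\n$a = {{ {sig} }}\ncondition:\n{condition}\n}}\n"
        )
    return "\n".join(rules)
```

The output can be written to a file such as packers.yara. Signatures that YARA itself rejects (for example, because of how their wildcards are placed) are not filtered out by this sketch and would need to be handled separately.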

Well, let's try to create a rule, say, for ASPack? As it turned out, there is nothing complicated in this. First, let's create a file to store the rules and name it, for example, packers.yara. Then we search the PEiD database for all signatures with ASPack in their names and transfer them to the rule:

rule ASPack
{
strings:
$ = { 60 E8 ?? ?? ?? ?? 5D 81 ED ?? ?? (43 | 44) ?? B8 ?? ?? (43 | 44) ?? 03 C5 }
$ = { 60 EB ?? 5D EB ?? FF ?? ?? ?? ?? ?? E9 }
[.. cut..]
$ = { 60 E8 03 00 00 00 E9 EB 04 5D 45 55 C3 E8 01 }
condition:
for any of them : ($ at entrypoint)
}

All found entries have the ep_only flag set to true, that is, these strings must be located at the entry point address. So we write the following condition: "for any of them : ($ at entrypoint)".

Thus, the presence of at least one of the given strings at the entry point address will mean that the file is packed with ASPack. Also note that in this rule, all strings are specified simply with the $ sign, without an identifier. This is possible, since in the condition section we do not refer to any specific of them, but use the entire set.

To check that the resulting system works, run this command in the console:

$ yara -r packers.yara somefile.exe

After feeding a couple of applications packed with ASPack there, I made sure that everything works!

Finished prototype

YARA turned out to be an extremely clear and transparent tool. It was not difficult for me to write a web admin panel for it and set it up as a web service. With a little creativity, the dry results of the analyzer are now highlighted in different colors according to the degree of danger of the detected malware. After a small update of the database, a brief description is available for many of the cryptors, and sometimes even instructions for unpacking. The prototype has been created and works perfectly, and the bosses are dancing with delight!

Many may have heard of rarjpeg files. This is a special kind of file: a JPEG image and a RAR archive glued tightly together. It is an excellent container for hiding the fact of information transfer. You can create a rarjpeg using the following commands:

UNIX: cat image1.jpg archive.rar > image2.jpg
WINDOWS: copy /b image1.jpg+archive.rar image2.jpg

Or you can do the same in a hex editor, if you have one.

Of course, to hide the fact of information transfer, you can use not only the JPEG format, but many others. Each format has its own characteristics, due to which it may or may not be suitable for the container role. I will describe how you can find glued files in the most popular formats or point out the fact of gluing.

Methods for detecting glued files can be divided into three groups:

  1. Checking the area after the EOF marker. Many popular file formats have a so-called end-of-file marker, which tells the program where the meaningful data ends. For example, photo viewers read all bytes up to this marker and ignore the area after it. This method is ideal for the formats JPEG, PNG, GIF, ZIP, RAR, PDF.
  2. Checking the file size. The structure of some formats (audio and video containers) allows you to calculate the expected file size and compare it with the actual size. Formats: AVI, WAV, MP4, MOV.
  3. Method for checking CFB files. CFB or Compound File Binary Format is a document format developed by Microsoft, which is a container with its own file system. This method is based on detecting anomalies in a file.

Is there life after the end of the file?

JPEG

To find the answer to this question, it is necessary to delve into the specification of the format that serves as the "ancestor" of the glued files and understand its structure. Any JPEG starts with the signature 0xFF 0xD8.

After this signature come service information, optionally a thumbnail of the image, and, finally, the compressed image itself. In this format, the end of the image is marked with the two-byte signature 0xFF 0xD9.

PNG

The first eight bytes of a PNG file are occupied by the following signature: 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A. The end signature that ends the data stream: 0x49, 0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82.

RAR

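The common signature for all RAR archives is 0x52 0x61 0x72 0x21 (Rar!). After it comes information about the version of the archive and other related data. Empirically, it was found that the archive ends with the signature 0xC4, 0x3D, 0x7B, 0x00, 0x40, 0x07, 0x00. (The sequence 0x0A, 0x25, 0x25, 0x45, 0x4F, 0x46 is a line feed followed by the text %%EOF, which is the end marker of a PDF document rather than of a RAR archive.)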

[Table: formats and their signatures]

The algorithm for checking for gluing in these formats is extremely simple:

  1. Find the initial signature;
  2. Find the final signature;
  3. If there is no data after the final signature - your file is clean and contains no attachments! Otherwise, it is necessary to look for other formats after the final signature.
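The three steps above can be sketched in Python as follows. The function name trailing_data and the two signature tuples are illustrative; the byte values are the ones given in the text. Note one simplification: the sketch stops at the first end signature, so a JPEG with an embedded thumbnail (which carries its own 0xFF 0xD9) would need a more careful parser.

```python
# (start signature, end signature) pairs taken from the text above.
JPEG = (b"\xFF\xD8", b"\xFF\xD9")
PNG = (b"\x89PNG\r\n\x1a\n", b"IEND\xAE\x42\x60\x82")

def trailing_data(data, start_sig, end_sig):
    """Return the bytes that follow the first end signature.

    Returns None when the start or end signature is missing,
    and b"" when nothing follows the end signature (a clean file).
    """
    start = data.find(start_sig)                      # step 1
    if start < 0:
        return None
    end = data.find(end_sig, start + len(start_sig))  # step 2
    if end < 0:
        return None
    return data[end + len(end_sig):]                  # step 3
```

If the returned tail is not empty, it can be searched for the start signatures of other formats, for example tail.startswith(b"Rar!").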

GIF and PDF

A PDF document may have more than one EOF marker, for example due to incorrect document generation. The number of end signatures in a GIF file is equal to the number of frames in it. Based on these features, the algorithm for checking for attached files can be improved:
  1. Find the initial signature, as in the previous algorithm;
  2. Find the final signature, as in the previous algorithm;
  3. When a final signature is found, remember its location and keep searching;
  4. If the last final signature found in this way sits at the very end of the file, the file is clean;
  5. If the file does not end with the final signature, go to the location of the last final signature found.
A large difference between the file size and the position just after the last final signature indicates a glued attachment. A reasonable threshold is ten bytes, although other values can be set.
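A minimal Python sketch of this improved check is shown below. The function names are hypothetical, the ten-byte threshold comes from the text, and the PDF marker %%EOF used in the usage note is the standard one rather than something defined in this article. Searching from the end with rfind is a shortcut for "remember the location and keep searching".

```python
def bytes_after_last_marker(data, end_sig):
    """Return how many bytes follow the last occurrence of end_sig,
    or None when the marker does not occur at all."""
    pos = data.rfind(end_sig)
    if pos < 0:
        return None
    return len(data) - (pos + len(end_sig))

def looks_glued(data, end_sig, threshold=10):
    """True when more than `threshold` bytes follow the last end marker."""
    tail = bytes_after_last_marker(data, end_sig)
    return tail is not None and tail > threshold
```

For a PDF this would be called as looks_glued(data, b"%%EOF"). The check has an obvious blind spot: if the glued payload itself contains the same end marker (for example, one PDF appended to another), the last marker belongs to the payload and the tail looks clean.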

ZIP

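A feature of ZIP archives is the presence of three different signatures: 0x50 0x4b 0x03 0x04 (local file header), 0x50 0x4b 0x01 0x02 (central directory) and 0x50 0x4b 0x05 0x06 (end of central directory). The structure of the archive is as follows: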
Local File Header 1
File Data 1
Data Descriptor 1
Local File Header 2
File Data 2
Data Descriptor 2
...
Local File Header n
File Data n
Data Descriptor n
Archive decryption header
Archive extra data record
Central directory
Most interesting is the central directory, which contains metadata about the files in the archive. The central directory always starts with the signature 0x50 0x4b 0x01 0x02 and ends with the signature 0x50 0x4b 0x05 0x06, followed by 18 bytes of metadata. Interestingly, empty archives consist only of the final signature and 18 null bytes. After 18 bytes comes the archive comment area, which is the perfect container for hiding a file.

To check a ZIP archive, find the final signature of the central directory, skip 18 bytes and look for signatures of known formats in the comment area. An unusually large comment is also a sign of gluing.
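The ZIP check can be sketched in Python like this. The names zip_comment_area, formats_in_comment and KNOWN_SIGNATURES are made up for the example, and the short signature list is only a placeholder for a real table of formats. The sketch relies on the fact that the last two of the 18 metadata bytes hold the declared comment length.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x50 0x4b 0x05 0x06
# Placeholder list; a real checker would use a fuller table of formats.
KNOWN_SIGNATURES = {b"Rar!": "RAR", b"\xFF\xD8": "JPEG", b"\x89PNG": "PNG"}

def zip_comment_area(data):
    """Return (declared_comment_length, bytes_after_the_18_metadata_bytes),
    or None when no complete final signature block is found."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + 22 > len(data):   # 4 signature bytes + 18 metadata bytes
        return None
    (declared_length,) = struct.unpack_from("<H", data, pos + 20)
    return declared_length, data[pos + 22:]

def formats_in_comment(data):
    """List the known formats whose signatures occur in the comment area."""
    found = zip_comment_area(data)
    if found is None:
        return []
    _, area = found
    return [name for sig, name in KNOWN_SIGNATURES.items() if sig in area]
```

A comment area that is longer than the declared comment length is suspicious in its own right, even when none of the known signatures is found in it.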

Size matters

AVI

The structure of an AVI file is as follows: each file starts with the RIFF signature (0x52 0x49 0x46 0x46). At offset 8 there is the AVI signature that specifies the format (0x41 0x56 0x49 0x20). The 4-byte field at offset 4 contains the size of the first data block (little-endian byte order). To find the offset of the next block, which holds the next size, add the header size (8 bytes) to the size read from bytes 4-8. Repeating this gives the full file size. The calculated size may be smaller than the actual file size, but in that case everything after the calculated size must be zero bytes (padding needed to align the file to a 1 KB boundary).

Size calculation example:
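A minimal Python sketch of this calculation, assuming the file consists of one or more consecutive RIFF blocks as described above. The function names riff_total_size and only_zero_padding are hypothetical.

```python
import struct

def riff_total_size(data):
    """Walk consecutive RIFF blocks and return the calculated file size.

    Each block is an 8-byte header (signature + little-endian size)
    followed by `size` bytes of data. Returns 0 if data is not RIFF.
    """
    offset = 0
    while offset + 8 <= len(data) and data[offset:offset + 4] == b"RIFF":
        (block_size,) = struct.unpack_from("<I", data, offset + 4)
        offset += 8 + block_size
    return offset

def only_zero_padding(data):
    """True when everything after the calculated size is zero bytes."""
    size = riff_total_size(data)
    return 0 < size <= len(data) and not any(data[size:])
```

For example, a file whose size field at offset 4 holds 24 has a calculated size of 8 + 24 = 32 bytes; anything after byte 32 must then be zero padding for the file to count as clean.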


WAV

Like AVI, a WAV file starts with the RIFF signature; however, at offset 8 it has the signature WAVE (0x57 0x41 0x56 0x45). The file size is calculated in the same way as for AVI. The actual size must exactly match the calculated size.

MP4

MP4, or MPEG-4, is a media container format used for storing video and audio streams; it can also store subtitles and images.
At offset 4 there is the file type signature ftyp (66 74 79 70) (QuickTime Container File Type), followed by the file subtype mmp4 (6D 6D 70 34). To recognize hidden files, we are interested in the ability to calculate the file size.

Consider an example. The size of the first block is stored at offset zero, and it is 28 (00 00 00 1C, big-endian byte order); it also gives the offset where the size of the second data block is located. At offset 28, we find the next block size, equal to 8 (00 00 00 08). To find the offset of the next block size, add up the sizes of the blocks found so far. Continuing in this way until the blocks run out gives the calculated file size.
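The walk over block sizes can be sketched in Python as below. The function name mp4_total_size is hypothetical, and the sketch handles only plain 32-bit sizes: the special size values 0 (block runs to the end of the file) and 1 (a 64-bit size follows) simply stop the walk.

```python
import struct

def mp4_total_size(data):
    """Sum the sizes of consecutive top-level blocks, starting at offset 0.

    Each block begins with a 4-byte big-endian size that covers the
    whole block, including the size field itself.
    """
    offset = 0
    while offset + 8 <= len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        if size < 8:
            break                     # special or invalid size, not handled here
        if offset + size > len(data):
            break                     # "block" runs past the file: not a real block
        offset += size
    return offset
```

With the values from the example, the first block ends at offset 28 and the second at 28 + 8 = 36, so a file consisting of just these two blocks has a calculated size of 36 bytes.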

MOV

This widely used format is also an MPEG-4 container. MOV uses a proprietary data compression algorithm, has a structure similar to MP4 and is used for the same purpose: to store audio and video data, as well as related materials.
Like MP4, any MOV file has the 4-byte ftyp signature at offset 4; however, the next signature is qt followed by two spaces (71 74 20 20). The rule for calculating the file size is the same: starting from the beginning of the file, read the size of each block in turn and add them up.

The method of checking this group of formats for "attached" files is to calculate the size according to the above rules and compare it with the actual size of the file being checked. If the calculated size is much smaller than the actual file size, this indicates gluing. For AVI files, the calculated size is allowed to be smaller than the file size because zeros may be added for boundary alignment. In that case, check that only zeros follow the calculated size.

Check Compound File Binary Format

This file format developed by Microsoft is also known as OLE (Object Linking and Embedding) or COM (Component Object Model). DOC, XLS, PPT files belong to the group of CFB formats.

A CFB file consists of a 512-byte header and sectors of equal length that store data streams or service information. Each sector has its own non-negative number, with the exception of special numbers: "-1" denotes a free sector, "-2" denotes the sector that closes a chain. All sector chains are defined in the FAT table.

Let's assume that an attacker modified a certain doc file and appended another file to its end. There are several ways to detect this or to point out an anomaly in the document.

Abnormal file size

As mentioned above, any CFB file consists of a header and sectors of equal length. To find the sector size, read the two-byte number at offset 30 from the beginning of the file and raise 2 to the power of this number. This number must be either 9 (0x0009) or 12 (0x000C), so the sector size is 512 or 4096 bytes respectively. Once the sector size is known, check the following equality:

(FileSize - 512) mod SectorSize = 0

If this equality does not hold, you can report that files have been glued. However, this method has a significant drawback: an attacker who knows the sector size only needs to append his file plus n more bytes so that the size of the glued data is a multiple of the sector size.
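The size check is small enough to sketch in a few lines of Python. The function names are hypothetical; the offset 30, the allowed exponents 9 and 12, and the 512-byte header all come from the description above.

```python
import struct

def cfb_sector_size(data):
    """Read the sector size exponent at offset 30 and return 2**exponent,
    or None if the exponent is not one of the two allowed values."""
    if len(data) < 512:
        return None
    (exponent,) = struct.unpack_from("<H", data, 30)
    if exponent not in (9, 12):
        return None
    return 1 << exponent

def cfb_size_is_aligned(data):
    """Check the equality (FileSize - 512) mod SectorSize = 0."""
    sector_size = cfb_sector_size(data)
    if sector_size is None:
        return False
    return (len(data) - 512) % sector_size == 0
```

The drawback described above is easy to reproduce: appending a payload and then padding it up to the next multiple of the sector size makes cfb_size_is_aligned return True again.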

Unknown sector type

If the attacker knows how to bypass the previous check, this method can still detect the glued data by finding sectors of undefined type.

Let's define equality:

FileSize = 512 + CountReal * SectorSize, where FileSize is the file size, SectorSize is the sector size, CountReal is the number of sectors.

We also define the following variables:

  1. CountFAT - the number of FAT sectors. Located at offset 44 from the beginning of the file (4 bytes);
  2. CountMiniFAT - the number of MiniFAT sectors. Located at offset 64 from the beginning of the file (4 bytes);
  3. CountDIFAT - the number of DIFAT sectors. Located at offset 72 from the beginning of the file (4 bytes);
  4. CountDE - the number of Directory Entry sectors. To find this variable, take the number of the first DE sector, which is stored at offset 48. Then follow the full DE chain in the FAT and count the number of DE sectors;
  5. CountStreams - the number of sectors with data streams;
  6. CountFree - the number of free sectors;
  7. CountClassified - the number of sectors with a definite type:
CountClassified = CountFAT + CountMiniFAT + CountDIFAT + CountDE + CountStreams + CountFree

It is obvious that if CountClassified and CountReal are not equal, we can conclude that the files may be merged.
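The header part of this check can be sketched in Python as follows. This is only a partial illustration: it reads the values stored directly in the header and evaluates the final comparison, while CountDE, CountStreams and CountFree have to be obtained by walking the FAT chains, which is not shown. The function names and the dictionary keys are hypothetical.

```python
import struct

def cfb_header_counts(data):
    """Read the counters stored in the CFB header and derive CountReal."""
    (exponent,) = struct.unpack_from("<H", data, 30)      # sector size exponent
    sector_size = 1 << exponent
    (count_fat,) = struct.unpack_from("<I", data, 44)     # number of FAT sectors
    (first_de,) = struct.unpack_from("<I", data, 48)      # first DE sector number
    (count_minifat,) = struct.unpack_from("<I", data, 64) # number of MiniFAT sectors
    (count_difat,) = struct.unpack_from("<I", data, 72)   # number of DIFAT sectors
    return {
        "SectorSize": sector_size,
        "CountReal": (len(data) - 512) // sector_size,
        "CountFAT": count_fat,
        "FirstDE": first_de,
        "CountMiniFAT": count_minifat,
        "CountDIFAT": count_difat,
    }

def may_be_glued(counts, count_de, count_streams, count_free):
    """Compare CountClassified with CountReal; a mismatch is an anomaly."""
    classified = (counts["CountFAT"] + counts["CountMiniFAT"]
                  + counts["CountDIFAT"] + count_de
                  + count_streams + count_free)
    return classified != counts["CountReal"]
```

The three extra arguments of may_be_glued are assumed to come from a separate pass over the FAT that follows each chain and counts its sectors.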