What are YARA rule best practices?

Learn about ways to create efficient YARA rules for your organization

Table of Contents

  1. Prerequisites
  2. YARA rule best practices


YARA rule best practices

Understand YARA best practices to create efficient rules.

  • Build and use templates to help stay organized and consistent
    • Use a consistent template when developing rules for particular file types
    • Use consistent metadata that helps you remember and share the analytical intent and notes behind how a rule works
    • Use // to add inline comments after strings and conditions that are not self-explanatory or human readable
      • For example: uint32be(0) != 0x52617221 // not Rar! header
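The template guidance above can be sketched as a skeleton rule. This is a minimal illustration, not a required layout: YARA metadata keys are free-form, so the field names, the rule name, and every placeholder value here are assumptions to adapt to your own conventions.

```yara
rule Methodology_Example_Template_PE
{
    meta:
        // All keys and values below are hypothetical placeholders
        author = "analyst name"
        date_created = "YYYY-MM-DD"
        description = "What the rule is meant to detect"
        notes = "Analytical intent, known false positives, hunting context"
        reference = "source report or ticket"
    strings:
        $s1 = "placeholder_indicator_string" ascii wide // replace with a real indicator
        $h1 = { 4D 5A 90 00 }                           // example hex pattern, explained inline
    condition:
        uint16(0) == 0x5A4D // PE "MZ" header
        and filesize < 5MB
        and any of them
}
```

Keeping the same section order and metadata keys in every rule makes a rule set easier to review and share.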
  • Use a naming convention that helps analysts understand the intent of the rule based only on the name. This will also help keep similar rules next to each other when sorted by name.
    • Many intelligence and security companies use a convention that details the malware family, associated threat actor, and things such as file type
    • Example naming conventions:
        • APT41_DEADEYE_Backdoor_PE_Strings
        • NOBELLIUM_Dropper_ISO_1
        • DARKSIDE_Ransomware_PE_Imphash
        • Methodology_XOREncoding_DOSStrings_PE
        • Methodology_Toolmark_Pastebin_ELF
        • Methodology_RemoteTemplates_URIRegex_RTF
  • Use file magic liberally to focus your matching on the right file types
    • Apply file magic conditions often for PE, ELF, Mach-O, and other file types
    • Rather than doing one rule for all major file types, break your idea into multiple rules, with one for each file type as appropriate
    • Common file magic:
      • PE: uint16(0) == 0x5A4D //PE “MZ” header
      • ELF: uint32(0) == 0x464c457f //ELF header
      • Mach-O: uint32(0) == 0xfeedface or uint32(0) == 0xcefaedfe or uint32(0) == 0xfeedfacf or uint32(0) == 0xcffaedfe or uint32(0) == 0xcafebabe or uint32(0) == 0xbebafeca
    • Apply exclusions to file magic, too. For example, if you know you’re not looking for a PE, you could specify the condition uint16(0) != 0x5A4D
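As a sketch of splitting one idea into per-file-type rules, the following uses the file magic values listed above. The rule names and the string value are hypothetical placeholders.

```yara
rule Methodology_Example_Strings_PE
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        uint16(0) == 0x5A4D // PE "MZ" header
        and $a
}

rule Methodology_Example_Strings_ELF
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        uint32(0) == 0x464c457f // ELF header
        and $a
}

rule Methodology_Example_Strings_MachO
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        (
            uint32(0) == 0xfeedface or uint32(0) == 0xcefaedfe or
            uint32(0) == 0xfeedfacf or uint32(0) == 0xcffaedfe or
            uint32(0) == 0xcafebabe or uint32(0) == 0xbebafeca
        )
        and $a
}

rule Methodology_Example_Strings_NotPE
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        uint16(0) != 0x5A4D // exclusion: anything that is not a PE
        and $a
}
```

Each rule can then be tuned, or disabled, independently when one file type turns out to be noisy.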
  • Conditions in YARA rules are evaluated from left to right, so use static feature evaluations first before string matching conditions
    • Example: filesize < 5MB and uint16(0) == 0x5A4D and $a and $b
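Written out as a complete rule, the ordering above might look like the following sketch. The rule name and string values are hypothetical.

```yara
rule Methodology_Example_ConditionOrder_PE
{
    strings:
        $a = "hypothetical_first_string"
        $b = "hypothetical_second_string"
    condition:
        filesize < 5MB           // static check first
        and uint16(0) == 0x5A4D  // then file magic
        and $a and $b            // string conditions last
}
```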
  • In the event that a rule matches too many legitimate software files or too many files to be useful:
    • Consider excluding signed software, using the condition pe.number_of_signatures == 0
    • Consider adding filesize limitations or breaking a rule into several rules by filesize band: one for <1MB, another for >=1MB and <2MB, another for >=2MB and <3MB, and so forth. This helps you manage the “size of the haystack” for match count.
    • Consider adding file type exclusions, such as uint16(0) != 0x5A4D or not pe.is_pe
    • Consider breaking a rule into several file type bands, where you do one for PE (using pe.is_pe) and one for not PE (not pe.is_pe)
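The narrowing options above can be combined. The following sketch assumes the PE module is available and uses hypothetical rule names and string values; it shows two filesize bands for unsigned PE files plus a separate rule for everything that is not a PE.

```yara
import "pe"

rule Methodology_Example_Unsigned_PE_Under1MB
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        filesize < 1MB
        and pe.is_pe
        and pe.number_of_signatures == 0 // exclude signed software
        and $a
}

rule Methodology_Example_Unsigned_PE_1MBto2MB
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        filesize >= 1MB and filesize < 2MB
        and pe.is_pe
        and pe.number_of_signatures == 0
        and $a
}

rule Methodology_Example_NotPE
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        not pe.is_pe // file type band for everything that is not a PE
        and $a
}
```

Splitting this way lets you see which band produces the noise before deciding what to exclude.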
  • Use nocase selectively. It consumes additional memory and can cause false positives on shorter strings, where casing variations match random or otherwise benign data
    • For example, if you are searching for the mixed-case $a = "KeRnEl32.dLl", it will only match on that exact casing, which is highly distinctive. However, if you add nocase to the end of that, YARA will match every possible casing variant, including the typical kernel32.dll that appears in almost all Windows executables
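A short sketch of the difference, using the string from the example above. The rule name is hypothetical, and the nocase variant is left commented out to show what to avoid.

```yara
rule Methodology_Example_OddCasing_PE
{
    strings:
        // Case-sensitive by default: matches only this exact, unusual casing
        $exact = "KeRnEl32.dLl"
        // With nocase, the same string would also match the ordinary "kernel32.dll"
        // $loose = "KeRnEl32.dLl" nocase
    condition:
        uint16(0) == 0x5A4D // PE "MZ" header
        and $exact
}
```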
  • Use regular expressions (regex) carefully:
    • YARA extracts short fixed byte sequences, called atoms, from the strings in a rule (such as $a = "thisstring") and uses them to quickly find candidate match locations in the scanned data
    • Strings shorter than 6 bytes are likely to produce false-positive, incidental, or random matches on arbitrary data. Use longer, more distinctive strings to ensure high-quality matching
      • An example of a bad string is "FNoC". While this seems unique, it may match accidentally on large chunks of base64-encoded data.
      • An example of a good string is "FNoC3haB". This is long enough to minimize arbitrary matches on other data.
    • Regexes use a lot of memory and can slow down rules or make them ineffective if they are not anchored to a literal string
      • A regex like $a = /[a-zA-Z0-9\s]{5}/ has no literal string to form an atom and will thus run over every byte in every file, consuming a lot of memory and producing a lot of matches
      • Each regex should contain a literal string to help form an “atom”; otherwise the regex will be unnecessarily matched against all bytes in a file
      • With a regex such as $a = /somethinghere[0-9]{5,43}/, the somethinghere portion will be extracted as an atom as if it were any other string condition. Then, the regex will only be evaluated over the offset where that atom matched. This means the regex will not run unless the atom matches in the file.
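The two regex cases above can be put side by side in a sketch. The rule name is hypothetical, and the unanchored pattern is left commented out because it is the form to avoid.

```yara
rule Methodology_Example_AnchoredRegex
{
    strings:
        // Anchored: the literal "somethinghere" gives YARA text to build an atom from,
        // so the regex is only evaluated where that text is found
        $anchored = /somethinghere[0-9]{5,43}/
        // Unanchored: no literal text, so the regex would be tried at every byte
        // $unanchored = /[a-zA-Z0-9\s]{5}/
    condition:
        filesize < 5MB
        and $anchored
}
```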
  • Remember that because YARA is so flexible, there are multiple ways to accomplish the same task, and it will be difficult to know which one is more “optimal” in any given situation.
    • For example, to check for the ELF header, one could use any one of the following:
      • $elf = "\x7fELF"
      • $hex = { 7f 45 4c 46 }
      • condition (each line below is an alternative that can be appended to a condition):
        • and $elf at 0 //vintage header style in string
        • and $hex at 0 //vintage header style in hex
        • and uint16(0) == 17791 //this is the integer of hex 45 7F; checks only the first two bytes
        • and uint32(0) == 1179403647 //this is the integer of hex 464C457F
        • and uint32(0) == 0x464c457f //magic in little-endian hex
        • and uint32be(0) == 0x7f454c46 //magic in big-endian hex
        • and elf.type == elf.ET_EXEC //uses the ELF module (requires import "elf"); matches only ELF files of type EXEC
      • None of these condition specifications is meaningfully more performant than the others. It is very difficult to identify performance differences, and those that are identified are not statistically significant because they are within the computational noise range.
      • Choose a convention for things that makes the most sense to you, makes the most sense to analysts, and is the easiest to understand and remember. We prefer to use the hex notation in big-endian because it is most similar to how analysts typically view file bytes.
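As a sketch of the big-endian hex convention in a complete rule, with a hypothetical rule name and string value:

```yara
rule Methodology_Example_Header_ELF
{
    strings:
        $a = "hypothetical_indicator_string"
    condition:
        uint32be(0) == 0x7f454c46 // "\x7fELF", in the same byte order a hex viewer shows
        and $a
}
```

The same style carries over to other file types. For example, uint32be(0) == 0x52617221 reads directly as the bytes of "Rar!".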