Parsing binary files with regular expressions

License: Freeware
Date Added: 02 February, 2013
Category: Scripts / File Manipulation
Author: activestate.com




This script allows you to use the regular expression engine to parse binary files, especially those for which the struct module alone is inadequate.

The typical way to parse binary data in Python is to use the unpack function of the struct module. This works well for fixed-width fields, but becomes more complicated when you need to parse variable-width fields. Perl's implementation of unpack accepts "*" as the field length, and even allows grouping with parentheses, which mitigates this problem. Python does not currently offer these features. Although you can dynamically generate a format string for unpack with a lot of slicing and calls to calcsize, the resulting code will likely be hard to read and error-prone.

Fortunately, in some cases there is a simpler way to do it: use the regular expression engine to grab each field, and use struct.unpack on the results.

First, you construct a regular expression (RE) describing the entire record structure, grouping each field you'd like to extract with parentheses, and compile it. To create the regular expression, you just have to remember that one character in the RE equals one byte in the record. So, the expression ".." would match any short (2 bytes). To match a variable-width field, the RE engine will have to be able to recognize where the field ends. In a null-terminated string, for example, the field ends with a zero byte. You'd therefore look for any number of characters followed by a null byte: "(.*?)\0". Note the use of the non-greedy qualifier "?" -- this way, we only match up to the first null, rather than the last null in the buffer.

When compiling, make sure to pass the re.DOTALL flag to the compiler. Without it, "." will not match any byte that happens to equal ASCII '\n' (0x0A), because the engine treats that byte as a newline. Then, you use the findall method of the compiled expression object on your buffer. findall finds all non-overlapping matches, one match for each record.
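As a rough illustration of the pattern-building step, here is a minimal sketch, not the recipe's own code. It assumes Python 3, where both the pattern and the buffer must be bytes objects, and it assumes a record layout of one little-endian short, a null-terminated string, and two more shorts. The names RECORD_RE and make_record and the sample data are hypothetical.

```python
import re
import struct

# One RE character corresponds to one byte of the record.
# Assumed layout: short, null-terminated string, short, short.
# re.DOTALL lets "." match the byte 0x0A as well.
RECORD_RE = re.compile(rb"(..)(.*?)\x00(..)(..)", re.DOTALL)


def make_record(ident, name, x, y):
    """Pack one sample record (hypothetical helper used only for test data)."""
    return struct.pack("<h", ident) + name + b"\x00" + struct.pack("<hh", x, y)


# The value 10 packs to b"\n\x00", so this buffer contains newline bytes
# inside fixed-width fields as well as inside a string field.
buffer = make_record(1, b"alpha", 10, 20) + make_record(2, b"line\nbreak", 30, 40)

# One tuple per record, one raw bytes element per parenthesized group.
raw_records = RECORD_RE.findall(buffer)
```

At this point every element of raw_records is still a bytes object; the shorts have not been converted to integers yet.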
findall returns a list of tuples, one for each match; each tuple will contain one element for each field you grouped in the RE.

You still need to unpack the fields in the tuples before using them, since they're still raw byte strings rather than usable values. Generally, you'll call unpack once for each field, with only one format character. (You can also group multiple consecutive fixed fields in one set of parentheses in the RE, and then unpack them in one call. But that may get confusing.)

The script demonstrates how to unpack a binary file that has an indeterminate number of variable-width records, each consisting of a little-endian short, a null-terminated string, and two more shorts. It drops the resulting values into a list and also into a dictionary.

This technique is useful when your variable-width fields are terminated with a sentinel, such as the zero-terminated strings described above. If your field length is embedded in the data, and you can't use the "p" (Pascal string) format character, you'll probably have to resort to slicing the buffer up manually.

This technique is also applicable even if your fields are all fixed-width. The findall method will operate on the entire buffer at once with a single regular expression, which saves you from having to dynamically create a long format string encapsulating all your data, or alternatively iterating over slices of the buffer.
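The whole approach can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the downloadable script itself. It assumes Python 3 (bytes patterns and buffers), ASCII-encoded names, and records stored back to back with no padding. The function parse_records and the sample data are hypothetical.

```python
import re
import struct

# Assumed record layout: little-endian short, null-terminated string,
# then two more little-endian shorts.
RECORD_RE = re.compile(rb"(..)(.*?)\x00(..)(..)", re.DOTALL)


def parse_records(buffer):
    """Decode every record in buffer into an (ident, name, x, y) tuple."""
    records = []
    for raw_id, raw_name, raw_x, raw_y in RECORD_RE.findall(buffer):
        # One unpack call per field, one format character each.
        (ident,) = struct.unpack("<h", raw_id)
        (x,) = struct.unpack("<h", raw_x)
        (y,) = struct.unpack("<h", raw_y)
        # Assumption: names are ASCII text.
        records.append((ident, raw_name.decode("ascii"), x, y))
    return records


# Two sample records built in memory instead of being read from a file.
sample = (
    struct.pack("<h", 1) + b"alpha\x00" + struct.pack("<hh", 10, 20)
    + struct.pack("<h", 2) + b"be\x00" + struct.pack("<hh", -3, 400)
)

as_list = parse_records(sample)
# Index the same values by name, mirroring the list-plus-dictionary idea.
by_name = {name: (ident, x, y) for ident, name, x, y in as_list}
```

For a real file, the buffer would come from reading the file in binary mode, for example open(path, "rb").read(). The pattern does no validation, so a malformed or truncated record may be skipped silently or cause later records to be misaligned.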


Operating Systems:  Python, Windows, Linux, BSD, Solaris, Mac OS


