File Analysis (Windows Forensic Analysis) Part 6

Word Documents

Metadata contained within Word documents has long been an issue. Word documents are compound documents, based on the object linking and embedding (OLE) technology that defines a "file structure within a file." Besides formatting information, Word documents can contain quite a bit of additional information that is not visible to the user, depending on the user’s view of the document. For example, Word documents can maintain not only past revisions but also a list of up to the last 10 authors to edit a file. This has posed an information disclosure risk to individuals and organizations. Perhaps one of the most visible examples was made public in mid-2003 by Richard M. Smith, in relation to a document released by the government of British Prime Minister Tony Blair (www.computerbytesman.com/privacy/blair.htm). The Blair government had released a dossier on Iraq’s security and intelligence organizations as a Word document on the Web in February 2003. A lecturer in politics at Cambridge University recognized portions of the content of this document as having originally been written by a U.S. researcher on Iraq. This caused quite a number of people to look much more closely at the document. In his discussion of the information disclosure issue, Smith illustrated the information he was able to extract from the Word document, which consists of a list of the last 10 authors to modify the document. This information proved quite embarrassing to Prime Minister Blair’s staff.


On his Web site, Smith mentions a utility that he wrote to extract this information from Word documents, but he does not provide the utility for others to use. I wrote a Perl script called wmd.pl, included on the accompanying DVD, which parses the binary header of the Word document to extract some information and then uses Perl modules to retrieve additional information. The script does not use the Microsoft Word API, so you can run it on any system that supports Perl and has the necessary modules (listed in the use pragmas for the script) installed. The output of the script run against the Blair document appears as follows:

[Image: output of wmd.pl run against the Blair document]

As you can see, some of the information "hidden" in Word documents can be quite revealing and potentially quite embarrassing. In addition to the last 10 authors, the script will reveal the platform (Windows or Mac) that the document was created on, as well as which version of Word was used to create and later revise the document. The script also extracts summary information from the document (discussed further in the "NTFS Alternate Data Streams" section of this topic).

I have also included another small utility on the accompanying DVD, called oledmp.pl. This utility uses the same Perl modules as wmd.pl but performs a slightly different function. Oledmp.pl will list the OLE streams and trash bins embedded in a Word document as well as the same summary information that wmd.pl extracts, as illustrated in the following sample output:

[Image: sample output of oledmp.pl]

The ListStreams information displays the names of the various OLE streams that make up the Word document. Microsoft refers to OLE as "a file system within a file," and these stream names refer to the "files" in the document.
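To make the "file system within a file" idea concrete, the following is a minimal Python sketch that lists the directory entries of an OLE compound file using only the standard library. It is not how oledmp.pl works (that script relies on Perl modules); the function name list_ole_entries is hypothetical, and the sketch assumes an intact file whose sector allocation table is fully described by the 109 entries in the header, which holds for small documents but not for large ones.

```python
import struct

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")   # compound file signature
MAXREGSECT = 0xFFFFFFFA                         # largest value that is a real sector number
ENTRY_TYPES = {1: "storage", 2: "stream", 5: "root"}


def list_ole_entries(data):
    """Return (name, type) pairs from the directory of an OLE compound file.

    Assumption: the file is intact and small enough that all FAT sectors
    are listed in the header (no additional DIFAT sectors are followed).
    """
    if data[:8] != OLE_MAGIC:
        raise ValueError("missing OLE compound file signature")
    sector_size = 1 << struct.unpack_from("<H", data, 30)[0]
    first_dir_sector = struct.unpack_from("<I", data, 48)[0]
    per_sector = sector_size // 4

    def sector(number):
        # Sector 0 starts immediately after the header-sized first block.
        start = (number + 1) * sector_size
        return data[start:start + sector_size]

    # Build the sector allocation table (FAT) from the header's sector list.
    fat = []
    for fat_sector in struct.unpack_from("<109I", data, 76):
        if fat_sector <= MAXREGSECT:
            fat.extend(struct.unpack("<%dI" % per_sector, sector(fat_sector)))

    # Follow the directory chain; each directory entry is 128 bytes.
    entries, seen, current = [], set(), first_dir_sector
    while current <= MAXREGSECT and current not in seen:
        seen.add(current)
        chunk = sector(current)
        for off in range(0, len(chunk) - 127, 128):
            name_len = struct.unpack_from("<H", chunk, off + 64)[0]
            obj_type = chunk[off + 66]
            if obj_type in ENTRY_TYPES and 2 <= name_len <= 64:
                # name_len counts bytes, including the UTF-16 terminator.
                name = chunk[off:off + name_len - 2].decode("utf-16-le", "replace")
                entries.append((name, ENTRY_TYPES[obj_type]))
        current = fat[current] if current < len(fat) else MAXREGSECT + 1
    return entries
```

To try it, read a document in binary mode and pass the bytes to list_ole_entries. The result is a flat list, so it does not show which storage each stream belongs to.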

Warning:

Sometimes it can be pretty shocking how much information is revealed in Word document metadata. Try a little experiment: Look around a file server at work (with permission, of course) and find some Word documents, such as something that might have been sent to clients, and see what the hidden metadata says about the documents. I tried something similar, only I used Google instead of a corporate file server. Due to the number of responses I received, I restricted my searches to .mil and .gov domains, but I still found more documents than I really knew what to do with.

Taking things a step further, this reviewer would complete the review forms in Word documents but save the content as a straight ASCII text document, removing all metadata. I guess he really didn’t want me to know who he was!

Not only can this metadata pose an information disclosure risk to an individual or organization, but it can also be useful to an investigator who is looking for specific information regarding documents. This can be particularly important during e-discovery cases, especially if searches for keywords or phrases are confined to the visible text of the documents.
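One way to avoid confining a keyword search to the visible text is to search the raw bytes of the file. The Python sketch below is an illustration only, not a tool from the accompanying DVD; the function name find_keyword is hypothetical. It assumes the keyword may be stored either as single-byte text (Windows code page 1252) or as UTF-16LE, and it is case-sensitive.

```python
def find_keyword(data, keyword):
    """Return sorted (offset, encoding) hits for keyword in raw file bytes.

    Assumption: the keyword is stored uncompressed as either cp1252 or
    UTF-16LE text. The search is case-sensitive.
    """
    hits = []
    for encoding in ("cp1252", "utf-16-le"):
        needle = keyword.encode(encoding)
        start = data.find(needle)
        while start != -1:
            hits.append((start, encoding))
            start = data.find(needle, start + 1)
    return sorted(hits)
```

A hit at an offset that does not correspond to visible document text (for example, in a summary information stream) is a candidate for closer examination.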

For the sake of completeness on this topic, I need to add a couple of things before moving on to the next topic. First, Microsoft provides information to users regarding metadata in Word documents and ways to minimize the available metadata. Second, Word documents are not the only Office files that have an issue with metadata. To address both of these items, Microsoft provides the following Knowledge Base articles:

■ 223790: WD97: "How to Minimize Metadata in Microsoft Word Documents"

■ 223396: OFF: "How to Minimize Metadata in Microsoft Office Documents"

■ 223789: XL: "How to Minimize Metadata in Microsoft Excel Workbooks"

■ 223793: PPT97: "How to Minimize Metadata in Microsoft PowerPoint Presentations"

■ 290945: "How to Minimize Metadata in Word 2002"

■ 825576: "How to Minimize Metadata in Word 2003"

In addition to these Knowledge Base articles, Microsoft also provides the Remove Hidden Data tool (http://support.microsoft.com/kb/834427) as a plug-in to Office 2003 and XP. Authors can use this tool to remove a great deal of metadata from documents.

This is an excellent tool to ensure that the amount of available metadata is minimized, even if your authoring process includes saving the file in a different format, such as PDF.

Notes from the Underground…

The Merge Streams Utility

A utility called Merge Streams (www.ntkernel.com/w&p.php?id=23), available from NT Kernel Resources, implements an interesting aspect of Office OLE documents. In a nutshell, it allows you to "merge" an Excel spreadsheet into a Word document. The utility has a simple GUI that allows you to select a Word document and an Excel spreadsheet and merge the two together. Say you have one of each document in a directory. If you run the utility and merge the two documents, you will be left with a Word document that is larger than the original Word document as well as being larger than the original Excel spreadsheet. However, if you were to delete the Excel spreadsheet, change the file extension of the Word document to .xls, and then double-click the file, you would see the Excel spreadsheet opened on the desktop, with no evidence of the original Word document or its contents. Changing the file extension back to .doc allows you to open the Word document with no apparent evidence of the Excel spreadsheet.

When presenting on this subject at conferences, I generally include a demonstration of the tool. Most often I demonstrate it from the aspect of a corporate user trying to smuggle a spreadsheet of financial forecasts or contract information pertinent to an important bid out of an organization. All the "user" has to do is merge the Excel spreadsheet into the Word document (something harmless, such as a letter) and then copy the Word document to a thumb drive. If anyone stops the user on the way out the front door and inspects the contents of the thumb drive, all he will see is the Word document.

When talking to law enforcement officers, however, I take a slightly different approach. Suppose a corporate employee has some illicit images that he’d like to share with his buddies. He copies the images into a Word document, then locates an Excel spreadsheet on the file server that all of them have access to (as well as a legitimate need to access) and merges them. He then renames the Word document to the original name and extension of the spreadsheet and lets his buddies know what he’s done. This way, he can distribute the images without leaving any traces.

Detecting the use of a utility such as Merge Streams isn’t necessarily an overly difficult task. Using scripts that include functionality similar to oledmp.pl, as mentioned previously in this topic, you can list the OLE streams that make up the Word document. If you see any stream names (Workbook, Worksheet, or the like) that would indicate the presence of an Excel spreadsheet, the Word document is definitely worth examining.
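The check described above can be sketched in a few lines of Python. This is a hedged illustration, not part of oledmp.pl: the function names are hypothetical, and the input is assumed to be a list of top-level stream names already produced by some stream-listing tool. The stream names in the table are the well-known main streams for each application ("Book" is used by older Excel versions).

```python
# Main stream names that identify the application behind an OLE document.
APP_STREAMS = {
    "WordDocument": "Word",
    "Workbook": "Excel",
    "Book": "Excel",
    "PowerPoint Document": "PowerPoint",
}


def applications_present(stream_names):
    """Return the set of Office applications implied by the stream names."""
    return {APP_STREAMS[name] for name in stream_names if name in APP_STREAMS}


def looks_merged(stream_names):
    """True when main streams from more than one application are present."""
    return len(applications_present(stream_names)) > 1
```

A document with legitimately embedded objects may also contain streams from another application inside nested storages, so a flat listing that mixes nesting levels could produce a false positive. Treat a positive result as a reason to look closer, not as proof.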

Tip:

The oledmp.pl Perl script has been extremely useful in examinations involving Excel spreadsheets and PowerPoint presentations, as well. In one instance, I was examining a system on which the customer suspected that an employee had committed fraud, using account numbers that the employee had access to as part of his day-to-day responsibilities. Using a keyword list created with the help of the customer, I located an Excel spreadsheet on the system, extracted it from the image, and provided it to the customer for review. As part of my report, I was able to include information about where the file had come from (according to the location of the file, it had been an Outlook attachment), when the user had accessed the file (based on data found in the Registry), as well as the fact that the user had edited and then printed the spreadsheet. These last two bits of information were retrieved from the spreadsheet metadata using oledmp.pl.

Cory Altheide recently pointed out to me yet another means of extracting potentially useful information from Microsoft Word (and other OLE) documents. When writing the manuscript for this topic, I would highlight/select text from a file, copy it to the Clipboard, paste that text into the document I was working on, and then ensure that it was properly formatted for its purpose. However, when someone drags and drops text into a Microsoft Word document, it becomes an attachment. If you ever have a need to extract those OLE document attachments, Cory pointed out an excellent tool to use, called b2xtranslator (http://b2xtranslator.sourceforge.net/). According to the About section of the Web site, the purpose of this tool is to allow users to transition from the binary document format to the new XML/zip format used in later versions of Microsoft Office (e.g., move from the .doc format to the .docx format). The Documentation page linked from the main Web page provides some very good illustrations of how the tool works conceptually, and shows how various OLE objects embedded within a Word document or Excel spreadsheet can be accessed. If you need to do more than just review the last author or the date that an OLE document was printed, you might consider taking a look at this tool.

PDF Documents

Portable document format (PDF) files can also contain metadata such as the name of the author, the date the file was created, and the application used to create the PDF file. Often the metadata can show that the PDF file was created on a Mac or that the PDF file was created by converting a Word document to PDF format. As with Word documents, this metadata can pose a risk of information disclosure. However, depending on the situation, this information can also be useful to an investigator, either to assist in e-discovery or to show that a particular application had been installed on the user’s system.

On the accompanying DVD, I’ve included two Perl scripts (pdfmeta.pl and pdfdmp.pl) that I have used to extract metadata from PDF files. The only difference between the two scripts is that they use different Perl modules to interact with PDF files. To be honest, I’ve had varying amounts of success with the scripts; in some instances, both scripts will successfully retrieve metadata from a PDF file, whereas in other cases, one or the other will fail for some reason. As a test, I used Google to search for some sample PDF files and found two, one from the FTC and another from the IRS. The PDF file from the FTC was called idtheft.pdf, and pdfmeta.pl returned the following information:

Author: FTC

CreationDate: D:20050513135557Z

Creator: Adobe InDesign CS (3.0)

Keywords: identity theft, id theft, idtheft, credit

ModDate: D:20050513151619-04'00'

Producer: Adobe PDF Library 6.0

Subject: Identity Theft

Title: Take Charge: Fighting Back Against Identity Theft

The PDF file downloaded from the IRS site was a copy of the 2006 Form W-4, called fw4.pdf. Pdfmeta.pl returned the following information:

Author: SE:W:CAR:MP

CreationDate: D:20051208083254-05'00'

Creator: OneForm Designer Plus

Keywords: Fillable

ModDate: D:20060721144654-04'00'

Producer: APJavaScript 2.2.1 Windows SPDF 1112 Oct 3 2005

Subject: Employee’s Withholding Allowance Certificate

Title: 2006 Form W-4
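The CreationDate and ModDate values shown in these listings use the PDF date string format: "D:" followed by year, month, day, hour, minute, and second, then either "Z" for UTC or a signed offset in hours and minutes. For timeline work it can help to convert these strings into timezone-aware values. The following Python sketch shows one way to do that; the function name parse_pdf_date is hypothetical and is not part of pdfmeta.pl or pdfdmp.pl.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by Z or a signed offset such as -05'00'.
# Every field after the year is optional in the PDF date format.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)(?:\d{2}'\d{2}'?)?|([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime (timezone-aware if possible)."""
    # Normalize typographic quotes that sometimes replace the apostrophes.
    cleaned = value.strip().replace("\u2019", "'").replace("\u2032", "'")
    match = PDF_DATE.fullmatch(cleaned)
    if not match:
        raise ValueError("not a PDF date string: %r" % value)
    groups = match.groups()
    year, month, day, hour, minute, second = (
        int(text) if text else default
        for text, default in zip(groups[:6], (0, 1, 1, 0, 0, 0))
    )
    utc_flag, sign, off_hours, off_minutes = groups[6:]
    tzinfo = None
    if utc_flag:
        tzinfo = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_hours), minutes=int(off_minutes or 0))
        tzinfo = timezone(-delta if sign == "-" else delta)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)
```

For example, parse_pdf_date("D:20050513135557Z").isoformat() gives "2005-05-13T13:55:57+00:00", and the W-4 creation date converts to "2005-12-08T08:32:54-05:00".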

Both of these examples are fairly innocuous, but it should be easy to see how the metadata in PDF files can be used in e-discovery or should at least be considered in keyword searches. If you have trouble retrieving metadata with either of the two Perl scripts provided with this topic, the old standby is to open the file in Adobe Reader (freely available from Adobe.com) and click File | Document Properties. The Description tab of the Document Properties dialog box contains all the available metadata. Figure 5.9 illustrates the document properties for idtheft.pdf.

Figure 5.9 Idtheft.pdf Document Properties

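When neither script nor Adobe Reader is at hand, a rough look at the same fields is possible by searching the raw bytes of the file for the document information keys. The Python sketch below is a simplification and not the method used by pdfmeta.pl or pdfdmp.pl; the function name extract_info is hypothetical. It assumes the information dictionary is stored uncompressed and unencrypted with literal (parenthesized) string values, it does not handle nested unescaped parentheses or hexadecimal strings, and because it does not parse the file structure it may pick up a key such as Title from some other dictionary.

```python
import re

INFO_KEYS = (b"Author", b"CreationDate", b"Creator", b"Keywords",
             b"ModDate", b"Producer", b"Subject", b"Title")

# Matches "/Key (literal string)" where the string may contain
# backslash escapes but no nested unescaped parentheses.
LITERAL = re.compile(
    rb"/(" + b"|".join(INFO_KEYS) + rb")\s*\(((?:\\.|[^\\()])*)\)",
    re.DOTALL,
)


def extract_info(data):
    """Return a dict of document information values found in raw PDF bytes."""
    info = {}
    for key, raw in LITERAL.findall(data):
        # Undo the escapes for parentheses and backslashes only.
        text = re.sub(rb"\\([()\\])", rb"\1", raw)
        if text.startswith(b"\xfe\xff"):
            value = text[2:].decode("utf-16-be", "replace")
        else:
            value = text.decode("latin-1")
        # Keep the first occurrence of each key.
        info.setdefault(key.decode("ascii"), value)
    return info
```

An empty result does not mean the file has no metadata; it may simply be stored in a form this sketch does not read.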

In fall 2008, Didier Stevens developed a Python-based tool called pdf-parser.py (available from http://blog.didierstevens.com/programs/pdf-tools/#pdf-parser; the site also includes a link to a screencast showing the tool in action). According to Didier, this Python script "will parse a PDF document to identify the fundamental elements used in the analyzed file. It will not render a PDF document."

Tip:

You can download a free Python interpreter from ActiveState.com, the same site that makes a Perl interpreter freely available.

Pdf-parser.py extracts various metadata and contents from a PDF document, including objects and JavaScript code embedded in the document. For example, Didier posted a blog entry (http://blog.didierstevens.com/2008/11/10/shoulder-surfing-a-malicious-pdf-author/) in which he described parsing information from a malicious PDF document that contained code exploiting a vulnerability in the util.printf JavaScript function (http://cve.mitre.org/cgi-bin/cvename.cgi?name=2008-2992).
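As a much cruder first pass than pdf-parser.py, you can count how often names associated with embedded scripts and automatic actions appear in the raw bytes of a PDF file. The Python sketch below is an illustration under stated assumptions and does not reproduce Didier's tool: the function name count_pdf_names and the list of names are my own choices, and the count misses names hidden inside compressed streams or obfuscated with hexadecimal escapes.

```python
import re

# PDF names commonly associated with scripts and automatic actions.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def count_pdf_names(data, names=SUSPICIOUS_NAMES):
    """Count occurrences of each PDF name in raw file bytes."""
    counts = {}
    for name in names:
        # The lookahead keeps /JS from matching a longer name such as /JSX.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(data))
    return counts
```

A nonzero count for /JavaScript or /JS is not proof of malicious content, since many legitimate forms use scripts, but it tells you which files deserve a closer look with a real parser.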

Didier also makes available his Python script, ExtractScripts (http://blog.didierstevens.com/programs/extractscripts/), which extracts potentially malicious scripts embedded within HTML files into separate files.
