Monday 1 December 2008

How can I view those Microsoft Office 2003 Scanned documents in Unix?

SOURCE: http://nomoa.com/index.php?module=articles&func=display&aid=1864&theme=print
-------------
Ubuntu - the straw that broke the camel's back
Posted by: Samiuela LV Taufa on Wed, 30 May 2007 14:57:24

Otherwise known as: How can I view those Microsoft Office 2003 Scanned documents in Unix?

[Update 2007.06.14 to include gnome2/nautilus-script and hopefully clarified some text]

The desktop replacement Ubuntu box I've been putting together for my father-in-law's office has ground to a halt because of a very simple problem:

I can't get a graphics viewer for Microsoft Office 2003's TIFF format created by the MS Office 2003 tool for managing scanners, Microsoft Office Document Imaging.

Technically, Ubuntu/Linux can view the multiple images embedded in the TIFF file, but it is a song and dance affair at the moment that is doable for a techno-dweeb, but not yet accessible to mere humans.

Scanning: YES we can scan documents under Ubuntu by using XSane Image Scanner, but I'm interested in viewing TIFF documents created by business partners.

Background

Microsoft Office Document Imaging (MDI) is part of the Microsoft Office 2003 suite and provides a generic tool for scanning images into your machine (most notably for attaching printed documents to email). This tool is far easier to use than the driver-based tool provided by the scanner manufacturer, and it gives users a single application to be trained on for scanning documents and pictures.

We currently use it for scanning contracts, forms and sent faxes to forward between business partners. Likewise, our major business partner uses it extensively when sending us printed forms and faxes.

MDI's scanning tool saves multi-page scanned documents as a single TIFF file. Within this TIFF file are:

1. JPEG images of the scanned pages
2. JPEG thumbnails of the scanned pages
3. OCR'd versions of the above pages (OCR - Optical Character Recognition - an attempt to recognise the text in your document)

The TIFF/TIF file format has been extensively documented, including in Microsoft's own promotional blurb about Microsoft Office Document Imaging:

TIFF is a commonly used format for various imaging applications, including those that scan and fax. Microsoft Office Document Imaging uses the TIFF format, utilizing the format's capability to contain text recognized by optical character recognition (OCR) (OCR: Translates images of text, such as scanned documents, into actual text characters. Also known as text recognition.) When you scan new documents, they are saved in TIFF format (with a .tif extension), and any OCR text is stored in the TIFF file along with the image.

You can open and edit TIFF files created with Office Document Imaging by using many other graphics applications. When you do so, any OCR text that the file contains is lost. You will have to rerun OCR if you want to access the text in the TIFF file again in Office Document Imaging.

It seems that Microsoft is using legitimate extensions to TIFF 6.0, but not nearly enough programmers out there have access to the documentation on these extensions or can cut the code for them.

Some further notes on the TIFF format are in the Unix section below, but there are problems viewing these scanned multi-page documents even within Microsoft Windows.


Viewing:

On an MS Windows XP desktop, you can view these multiple pages using:

* Microsoft's MDI application's viewer, or
* the free application IrfanView, or
* the free application XnView (my preferred tool at the moment)

On Unix and Linux there is a convoluted way to get at the images, shown later in this post.

Viewing Limited:

Microsoft's Office Picture Manager (12.0.4518) can view the first image in the TIFF file, but I can't see any way of viewing the rest of the images, and there is no notable indication that the file contains multiple images (leading you to conclude that the single image you see is the only relevant one).

OT: Weird limitation considering the product is shipped by the same team, only further highlighting how big Office development/programming has become.

Viewing NOT:

You cannot, however, view the images using the following tools, which include Microsoft's own current viewer and other popular applications:

* Windows Picture and Fax Viewer, or
* ImageMagick 6.2.3 Q16 IMDisplay 1.0, or
* Paint.NET v3.07

Unfortunately I don't have a copy of Adobe Photoshop on my machine to give people more information.

Similarly, I have tried to view multipage TIFFs on Linux with the following applications, all of which fail, most with errors complaining about the TIFF format:

* F-Spot Photo Manager 0.3.5 (crashes on import, and fails to display image/s)
* GIMP Image Editor 2.2.13 (multiple "unknown field tag" error messages on loading file)
* GNU Paint gpaint-2 0.3.0-pre5 (error: cannot open file)
* gThumb Image Viewer 2.10.2 (no errors, but no image view)
* Gwenview 1.4.1 (multiple "unknown field tag" errors and "Invalid YCBCr subsampling")
* xloadimage 4.1 (same error message as tiffinfo shown below)

Viewing GNU Linux:

Thanks to a post by Michael R. Head, there is a way to view the multipage TIFF files, but there is some command-line magic you have to walk through.

Let's first take a look at an indicator that we have a TIFF file created by Microsoft's MDI by using LibTIFF's tiffinfo tool. We first transport 2 multipage TIFF files (multipage.tif and multipage2.tif) from our Windows box to Ubuntu Linux.

$ ls
multipage2.tif multipage.tif

$ file multipage.tif multipage2.tif
multipage.tif: TIFF image data, little-endian
multipage2.tif: TIFF image data, little-endian

The Unix file utility tells us that both images used in this example have the format "TIFF image data, little-endian".

$ tiffinfo multipage.tif
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 513 (0x201) encountered.
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 514 (0x202) encountered.
TIFFReadDirectory: Warning, multipage.tif: unknown field with tag 37680 (0x9330) encountered.
multipage.tif: Invalid YCbCr subsampling.
TIFFReadDirectory: multipage.tif: cannot handle zero strip size.

Using tiffinfo we now know that libtiff does not recognise portions of multipage.tif, and, as the next run shows, it trips over what seem to be the equivalent portions of multipage2.tif.

$ tiffinfo multipage2.tif
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 513 (0x201) encountered.
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 514 (0x202) encountered.
TIFFReadDirectory: Warning, multipage2.tif: unknown field with tag 37680 (0x9330) encountered.
multipage2.tif: Invalid YCbCr subsampling.
TIFFReadDirectory: multipage2.tif: cannot handle zero strip size.
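
The warnings above can serve as a rough fingerprint for these files. Tags 513 and 514 are the JPEGInterchangeFormat and JPEGInterchangeFormatLength fields from the original TIFF 6.0 JPEG scheme, while 37680 sits in the private tag range (32768 and above). The following is a minimal sketch, assuming that the tag 37680 warning is a usable hint that a file came from MDI; that is my guess from the two samples here, not a documented rule, and other libtiff versions may word the warning differently. The function name looks_like_mdi_tiff is made up for this example.

```shell
#!/bin/bash
# Reads tiffinfo diagnostics on stdin and succeeds if they contain the
# private-tag warning seen in the output above (tag 37680).
# Treating this as an MDI fingerprint is an assumption, not a documented rule.
looks_like_mdi_tiff() {
    grep -q 'unknown field with tag 37680 (0x9330)'
}

# Usage (requires libtiff's tiffinfo; warnings go to stderr, hence 2>&1):
#   tiffinfo multipage.tif 2>&1 | looks_like_mdi_tiff && echo "probably MDI"
```

Reading from stdin keeps the check separate from tiffinfo itself, so it can be tried against saved output as well.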

Seeing the error messages displayed by tiffinfo helps us to understand some of the error messages displayed by the above image viewers. The errors imply that these viewers use the libtiff library and share its limitations. It should be pointed out here that libtiff.org documents:

TIFF 6.0 Specification Coverage

The library is capable of dealing with images that are written to follow the 5.0 or 6.0 TIFF spec. There is also considerable support for some of the more esoteric portions of the 6.0 TIFF spec.
...
Note that there is no support for the JPEG-related tags defined in the 6.0 specification; the JPEG support is based on the post-6.0 proposal given in TIFF Technical Note #2.
...
The JPEG-related tag is specified in TIFF Technical Note #2 which defines a revised JPEG-in-TIFF scheme (revised over that appendix that was part of the TIFF 6.0 specification).

I am not so sure how relevant the above is to the Microsoft MDI problem, but suffice it to say I don't know enough to blame anyone for why so much open source software lacks support for viewing MDI multi-page TIFF files.

Unix: Extracting the Images

We now know that the TIFF file could be a legitimate TIFF file, but we can't view the images without resorting to a Windows box. Thanks again to Michael R. Head's article, the solution is a forensics tool called Foremost.

Foremost is a console program to recover files based on their headers, footers, and internal data structures. This process is commonly referred to as data carving. Foremost can work on image files, such as those generated by dd, Safeback, Encase, etc, or directly on a drive. The headers and footers can be specified by a configuration file or you can use command line switches to specify built-in file types. These built-in types look at the data structures of a given file format allowing for a more reliable and faster recovery.
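
To give a feel for what header-based carving means, here is a minimal sketch that only locates JPEG start-of-image markers (the bytes FF D8 FF) in a file and prints their byte offsets. It is not how foremost is implemented, it does not extract anything, and the function name jpeg_offsets is made up for this example. A marker can also appear inside a JPEG (an embedded preview, for instance), so treat the offsets as candidates only.

```shell
#!/bin/bash
# Print the 0-based byte offset of every FF D8 FF sequence in a file.
jpeg_offsets() {
    # Dump the file as one two-digit hex byte per line, then slide a
    # three-byte window over the stream with awk.
    od -An -v -tx1 "$1" | tr -s ' \t' '\n' | grep . |
    awk '{
        if (p2 == "ff" && p1 == "d8" && $1 == "ff") print NR - 3
        p2 = p1; p1 = $1
    }'
}

# Usage:
#   jpeg_offsets multipage.tif
```

A real carver also has to find where each image ends, which is where knowledge of the file format's internal structure comes in.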

Foremost seems to understand the TIFF data structure presented by Microsoft's MDI, so it can extract the separate streams/images and store them on disk for later processing. Using foremost is rather simple, as shown below on our two multipage files.

$ ls
multipage2.tif multipage.tif

$ foremost -i multipage.tif -o multipage
Processing: multipage.tif
|*|

$ foremost -i multipage2.tif -o multipage2
Processing: multipage2.tif
|*|

foremost creates the subdirectories jpg and ole inside the output directory given with -o. The jpg directory contains the images (both full images and thumbnails), and ole contains the OCR'd versions of the pages.

$ ls -R
.:
multipage multipage2 multipage2.tif multipage.tif

./multipage:
audit.txt jpg ole

./multipage/jpg:
00000000.jpg 00000545.jpg 00000937.jpg 00001543.jpg 00002127.jpg
00000538.jpg 00000931.jpg 00001535.jpg 00002120.jpg 00002682.jpg

./multipage/ole:
00002692.ole

./multipage2:
audit.txt jpg ole

./multipage2/jpg:
00000000.jpg 00002941.jpg 00006432.jpg 00009274.jpg 00011870.jpg 00014243.jpg 00016827.jpg
00001609.jpg 00004364.jpg 00006444.jpg 00009284.jpg 00011879.jpg 00014252.jpg 00016836.jpg
00001622.jpg 00004375.jpg 00007880.jpg 00010598.jpg 00012939.jpg 00015470.jpg 00018163.jpg
00002931.jpg 00004954.jpg 00007891.jpg 00010608.jpg 00012948.jpg 00015481.jpg

./multipage2/ole:
00018172.ole

The jpg files, being thumbnails and full images, should have distinctive sizes, as the long listing of the same directories shows below.

$ ls -lR

./multipage:
total 12
-rw-r--r-- 1 samt samt 1178 2007-05-30 14:47 audit.txt
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 jpg
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 ole

./multipage/jpg:
total 1380
-rw-r--r-- 1 samt samt 275019 2007-05-30 14:47 00000000.jpg
-rw-r--r-- 1 samt samt 3709 2007-05-30 14:47 00000538.jpg
-rw-r--r-- 1 samt samt 197089 2007-05-30 14:47 00000545.jpg
-rw-r--r-- 1 samt samt 3011 2007-05-30 14:47 00000931.jpg
-rw-r--r-- 1 samt samt 305575 2007-05-30 14:47 00000937.jpg
-rw-r--r-- 1 samt samt 4002 2007-05-30 14:47 00001535.jpg
-rw-r--r-- 1 samt samt 294723 2007-05-30 14:47 00001543.jpg
-rw-r--r-- 1 samt samt 3442 2007-05-30 14:47 00002120.jpg
-rw-r--r-- 1 samt samt 284052 2007-05-30 14:47 00002127.jpg
-rw-r--r-- 1 samt samt 4793 2007-05-30 14:47 00002682.jpg

./multipage/ole:
total 8
-rw-r--r-- 1 samt samt 5632 2007-05-30 14:47 00002692.ole

./multipage2:
total 12
-rw-r--r-- 1 samt samt 1998 2007-05-30 14:47 audit.txt
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 jpg
drwxr-xr-- 2 samt samt 4096 2007-05-30 14:47 ole

./multipage2/jpg:
total 9200
-rw-r--r-- 1 samt samt 823649 2007-05-30 14:47 00000000.jpg
-rw-r--r-- 1 samt samt 6345 2007-05-30 14:47 00001609.jpg
-rw-r--r-- 1 samt samt 669597 2007-05-30 14:47 00001622.jpg
-rw-r--r-- 1 samt samt 5344 2007-05-30 14:47 00002931.jpg
-rw-r--r-- 1 samt samt 728014 2007-05-30 14:47 00002941.jpg
-rw-r--r-- 1 samt samt 5365 2007-05-30 14:47 00004364.jpg
-rw-r--r-- 1 samt samt 296251 2007-05-30 14:47 00004375.jpg
-rw-r--r-- 1 samt samt 756384 2007-05-30 14:47 00004954.jpg
-rw-r--r-- 1 samt samt 6134 2007-05-30 14:47 00006432.jpg
-rw-r--r-- 1 samt samt 734716 2007-05-30 14:47 00006444.jpg
-rw-r--r-- 1 samt samt 5064 2007-05-30 14:47 00007880.jpg
-rw-r--r-- 1 samt samt 707892 2007-05-30 14:47 00007891.jpg
-rw-r--r-- 1 samt samt 4973 2007-05-30 14:47 00009274.jpg
-rw-r--r-- 1 samt samt 672318 2007-05-30 14:47 00009284.jpg
-rw-r--r-- 1 samt samt 4854 2007-05-30 14:47 00010598.jpg
-rw-r--r-- 1 samt samt 645537 2007-05-30 14:47 00010608.jpg
-rw-r--r-- 1 samt samt 4784 2007-05-30 14:47 00011870.jpg
-rw-r--r-- 1 samt samt 542300 2007-05-30 14:47 00011879.jpg
-rw-r--r-- 1 samt samt 4081 2007-05-30 14:47 00012939.jpg
-rw-r--r-- 1 samt samt 662687 2007-05-30 14:47 00012948.jpg
-rw-r--r-- 1 samt samt 4416 2007-05-30 14:47 00014243.jpg
-rw-r--r-- 1 samt samt 623235 2007-05-30 14:47 00014252.jpg
-rw-r--r-- 1 samt samt 5299 2007-05-30 14:47 00015470.jpg
-rw-r--r-- 1 samt samt 688888 2007-05-30 14:47 00015481.jpg
-rw-r--r-- 1 samt samt 4436 2007-05-30 14:47 00016827.jpg
-rw-r--r-- 1 samt samt 678824 2007-05-30 14:47 00016836.jpg
-rw-r--r-- 1 samt samt 4619 2007-05-30 14:47 00018163.jpg

./multipage2/ole:
total 8
-rw-r--r-- 1 samt samt 5632 2007-05-30 14:47 00018172.ole

I don't know what the sequencing issues are with the file names, but it seems obvious that the larger files are the full images, with one of the smaller files being a thumbnail of each (presumably the one with the next higher number).
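
Given the sizes in the listing, one way to pick out the full pages is by file size rather than by position. The file names look like 512-byte block offsets into the original TIFF (275019 bytes / 512 is about 537.1, and the next file is 00000538.jpg), so name order should follow page order, but that is an inference from this listing. The sketch below makes two assumptions: thumbnails stay well under 50000 bytes, a threshold guessed from these two samples, and the function name full_size_pages is made up for this example.

```shell
#!/bin/bash
# Print, in name order, the .jpg files in a directory that are at least
# MIN_BYTES long. The 50000-byte default is a guess from the listing above.
full_size_pages() {
    local dir=$1 min_bytes=${2:-50000} f size
    for f in "$dir"/*.jpg; do
        [ -f "$f" ] || continue            # no matches: the glob stays literal
        size=$(( $(wc -c < "$f") ))        # arithmetic strips any padding
        if [ "$size" -ge "$min_bytes" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage:
#   full_size_pages multipage2/jpg
```

In the multipage2 listing, 00004375.jpg (296251 bytes) and 00004954.jpg (756384 bytes) are both large and sit next to each other, so a rule that keeps every other file would drift onto the thumbnails from that point on. Selecting by size avoids that, at the cost of a threshold that may need adjusting for small or low-resolution scans.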

Unix: Automating extraction and viewability

In a comment to Michael R. Head's article, typhoncore writes a nice bash script that uses ImageMagick's 'convert' utility and pdftk to create a multipage PDF file from the larger images. It is listed here with a few minor modifications I have inserted (for better or worse.)

#!/bin/bash
# TIFFtoPDF.sh: convert an MS Office Document Imaging multipage TIFF to a PDF.
# Run it from the directory that contains the TIFF file.
if [ -z "$1" ]; then
    echo "Usage: $0 file.tif" >&2
    exit 1
fi
DOC_COUNT=0
arg1=$1
arg_out=$arg1.out
echo "Extracting Images from $arg1 using foremost to $arg_out"
foremost -i "$arg1" -o "$arg_out"
echo "Done"
cd "$arg_out/jpg" || exit 1
echo "Converting Single Images to PDF"
# Files are assumed to alternate full image, thumbnail: keep every other one.
for i in *.jpg; do
    if [ $((DOC_COUNT % 2)) -eq 0 ]; then
        echo -n " > $i to $i.pdf"
        convert "$i" "$i.pdf"
        echo " - done"
    fi
    DOC_COUNT=$((DOC_COUNT + 1))
done
echo -n "Merging separate single page PDF's to a multipage PDF"
pdftk *.pdf cat output merged.pdf
mv merged.pdf "../../$arg1.pdf"
echo " - done"
cd ../..
echo -n "Removing temporary directory $arg_out"
rm -Rf "$arg_out"
echo " - done"

My bastardisation of typhoncore's script adds console progress indicators (which double as documentation within the script) for us noobs.

Output of the script will look something like the below.

$ sh TIFFtoPDF.sh multipage.tif
Extracting Images from multipage.tif using foremost to multipage.tif.out
Processing: multipage.tif
|*|
Done
Converting Single Images to PDF
> 00000000.jpg to 00000000.jpg.pdf - done
> 00000545.jpg to 00000545.jpg.pdf - done
> 00000937.jpg to 00000937.jpg.pdf - done
> 00001543.jpg to 00001543.jpg.pdf - done
> 00002127.jpg to 00002127.jpg.pdf - done
Merging separate single page PDF's to a multipage PDF - done
Removing temporary directory multipage.tif.out - done
$
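
The script leans on three external programs (foremost, ImageMagick's convert and pdftk), and it fails in confusing ways if one of them is not installed. A minimal sketch of an up-front check follows; the function name missing_tools is made up for this example, and how you install anything that is missing depends on your distribution.

```shell
#!/bin/bash
# Print the name of each command given as an argument that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: stop early if anything the script needs is absent.
#   missing=$(missing_tools foremost convert pdftk)
#   [ -z "$missing" ] || { echo "Missing:" $missing >&2; exit 1; }
```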

Unix: GNOME GUIfying extraction and viewability

I was wondering whether a registry hack (Windows hat on) or some other means could let the file explorer in X Windows (which I later discovered is called GNOME Nautilus) send TIFF files to the above script, when I came across a solution to a separate but related problem: Mount and UnMount ISO images without burning them.

That led me to a rehacked whack of the above TIFFtoPDF.sh that can be placed in your ~/.gnome2/nautilus-scripts/ folder.

Read Nautilus File Manager Scripts : Questions and Answers for more details on how to get the below script working properly with Nautilus.


#!/bin/bash
# Nautilus script: convert the selected MS Office multipage TIFF to a PDF.

# Nautilus passes the selected paths one per line; only the first is used.
SELECTED=$(echo "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS" | head -n 1)
BASENAME=$(basename "$SELECTED")

DOC_COUNT=0
INFILE=$BASENAME
OUTPUT=$INFILE.odir

if ! zenity --question --title "Convert MS TIFF file to Multipage PDF" \
    --text "Do you wish to Convert the MS TIFF $BASENAME to a Multipage PDF?"
then
    exit 0
fi

foremost -i "$INFILE" -o "$OUTPUT"
cd "$OUTPUT/jpg" || exit 1

# Files are assumed to alternate full image, thumbnail: keep every other one.
for i in *.jpg; do
    if [ $((DOC_COUNT % 2)) -eq 0 ]; then
        convert "$i" "$i.pdf"
    fi
    DOC_COUNT=$((DOC_COUNT + 1))
done
pdftk *.pdf cat output merged.pdf
mv merged.pdf "../../$INFILE.pdf"
cd ../..
rm -Rf "$OUTPUT"

The bare essentials for getting the above script working in GNOME Nautilus are:

1. Put the script in ~/.gnome2/nautilus-scripts/
2. Make the script executable
3. Visit the directory using GNOME Nautilus
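
The first two steps can be sketched as a small helper. This assumes the GNOME 2 per-user scripts folder ~/.gnome2/nautilus-scripts/ as the default destination; the function name install_nautilus_script and the file name in the usage line are made up for this example.

```shell
#!/bin/bash
# Copy a script into a Nautilus scripts folder and make it executable.
# The default destination is the GNOME 2 per-user scripts folder.
install_nautilus_script() {
    local script=$1 dest=${2:-"$HOME/.gnome2/nautilus-scripts"}
    mkdir -p "$dest" &&
        cp "$script" "$dest/" &&
        chmod +x "$dest/$(basename "$script")"
}

# Usage (TIFFtoPDF-nautilus.sh is a hypothetical file name):
#   install_nautilus_script TIFFtoPDF-nautilus.sh
```

After that, right-clicking a TIFF file in Nautilus should offer the script under the Scripts menu.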

Conclusion

There is no moving to Ubuntu/Linux or any other variant of Unix/BSD until there is a simpler solution to this image viewing problem for these guys.

Funny how for the big ticket items we were eventually able to find good alternate solutions, but things fell over with this simple yet insurmountable problem.

Microsoft Outlook 2003 --> now using Thunderbird 2.0.x
Microsoft Word 2003 --> we have been testing Open Office 2.2 Write
Microsoft Excel 2003 --> we have been testing Open Office 2.2 Calc
Microsoft Access 2003 --> not currently using, no need for an alternative
Microsoft Publisher 2003 --> infrequent use, although testing scribus
Printing --> CUPS with Vendor Linux Drivers
Scanning --> XSane with Vendor Linux Drivers

Accounting Software --> Not currently using one, but looking around

For my own desktop needs, I'm still an XP man and will probably go to Vista with my next machine, as that will definitely be a TabletPC. There are plenty of cheap Pentium IVs on www.ebay.com.au, though, so I'm getting an X Windows (Gnome/KDE) box up for some of the kids' fun and gaming (defining anything they enjoy as play).

The sledge-hammer solution would be to run a mail server that parses incoming emails for TIFF files and automatically detects and converts multipage files from TIFF to PDF. If this were a do-or-die situation I would probably work on it; as it is, it will have to wait for another day/solution.

References

Michael R. Head's Handling Microsoft Office Document Scanning TNEF and TIFFs in Linux
typhoncore's Multipage TIFF to Multipage PDF script
DRAFT TIFF Technical Note #2
Adobe Photoshop TIFF Technical Notes (PDF)
