
SensusAccess Document Conversion

A guide to the SensusAccess document conversion tool.

Top tips

  • The quality of a conversion depends on the quality of the original document. Smudges, imperfections, and other objects lying on top of a document will all affect the output. Clean documents work best!
  • It can also help to 'clean' a document of extra characters. For instance, an image-file conversion works best if the file is converted to Tagged PDF, the text is copied and pasted into Word, and certain characters are then removed.
  • When creating your own documents, keep in mind that other people may want to convert them later on. Using headings in Word and HTML, for example, will help accurate conversion to other formats.
  • More detail on these tips is given in the format-specific best practices sections below.

 

Best practices for...

 

...PDF and Image-based files

PDF and image-based files will be processed using optical character recognition (OCR) to create a text-based version of the document.

  • If scanning the document, ensure the scanned image is free from smudges, dark marks, highlighted text, and other artifacts. These will reduce the accuracy of the OCR process.
  • Minimize skewing. If the image is presented at an "off-angle", the OCR process will be less accurate, resulting in a lower-quality text version.
  • While you can convert directly from an image file to a text file with SensusAccess, some image documents give better results if you first convert to Tagged PDF and then copy and paste the text into an MS Word document (see the "Converting to MS Word and Text Files" section).

 

...Converting to MS Word and Text Files

SensusAccess will convert image-based documents into MS Word, RTF, and text files. You may also find it useful with some image-based documents to convert initially to Tagged PDF and then copy and paste the text from the Tagged PDF into MS Word. This may result in a better reading experience and may remove non-essential content.

With the MS Word version of the document, you can more accurately "clean" the content for conversion into MP3 audio or for use with assistive technologies. Most clean-ups take just a few seconds within MS Word using the Find and Replace tools. For more information, see "Using the Find and Replace in MS Word" for removing special characters in a document.

Please note: in the Find and Replace examples below, replace the <space> value with a single press of the space bar, and do not include the quotation marks. In MS Word's Find and Replace, ^p stands for a paragraph mark.

 

...Image-File to Tagged PDF to MS Word Document

  • Submit the image-based document to SensusAccess and select Tagged PDF as the output option.
  • Open the Tagged PDF and select all the text. Copy and paste this into an MS Word document (OpenOffice may also be used).
  • Using Find and Replace, apply the following steps in order:
      1. Search for ".<space>^p" and replace with ".^p^p" .
      2. Search for "<space>^p" and replace with "<space>" .
      3. Search for "<space>•<space>" and replace with "^p•<space>" .
      4. Search for "-<space>" and replace with no value (leave the replacement field empty).
  • Save the document in your preferred text format.
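
For plain-text output, the four Find and Replace steps above can also be sketched as a short script. This is only an illustration, not part of SensusAccess or MS Word: the function name clean_ocr_text is hypothetical, and it assumes that each paragraph mark (^p) in the pasted text corresponds to a newline character. The order of the steps matters, because step 1 protects line breaks that follow the end of a sentence before step 2 joins the remaining wrapped lines.

```python
def clean_ocr_text(text: str) -> str:
    """Apply the four Find and Replace steps, in order, to plain text.

    Assumption: "\n" plays the role of Word's ^p paragraph mark.
    """
    # Step 1: a sentence end followed by a line break becomes a paragraph break.
    text = text.replace(". \n", ".\n\n")
    # Step 2: any remaining "space + line break" is a wrapped line; join it.
    text = text.replace(" \n", " ")
    # Step 3: put each inline bullet on its own line.
    text = text.replace(" \u2022 ", "\n\u2022 ")
    # Step 4: remove "hyphen + space", rejoining words split across lines.
    text = text.replace("- ", "")
    return text


sample = (
    "This is a sen- \ntence that wraps \nacross lines. \n"
    "Next paragraph. \u2022 first item \u2022 second item"
)
print(clean_ocr_text(sample))
```

As in Word, step 4 removes every hyphen that is followed by a space, so check the result for places where a hyphen and space were intended.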

 

...Authoring MS Word, RTF, Text Files

  • Use Word styles to specify document headings. For example, the style "Heading 1" could be used to identify the title of the document and the style "Heading 2" could be used to identify chapter information. It is best to use only one "Heading 1" to facilitate accurate conversions into other document formats (e.g., DAISY, ePub, Braille, etc.).
  • Provide short descriptions for content-related images in your MS Word document.
  • Avoid using text-boxes in your document. If you want to customize the layout, use a Column Tool or a Section Break.
  • If converting to DAISY, page numbers will be identified based on the MS Word pagination. To obtain custom pagination, apply the PageNumber style from the Save As DAISY plug-in for Microsoft Office to your custom page numbers.

 

...Authoring HTML Files

  • Use HTML heading markup (e.g., <h1>, <h2>, etc.) to designate headings in the document. For example, <h1> could be used to identify the title of the document and <h2> could be used to identify chapter information.
  • Provide short descriptions for content-related images in the HTML document.
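
Before submitting an HTML file, the two points above can be checked with a few lines of Python. The sketch below is an illustration only and is not part of SensusAccess: the names StructureChecker and check_html are hypothetical, and it treats the alt attribute as the place where an image's short description is stored. It uses only the standard library's html.parser module.

```python
from html.parser import HTMLParser


class StructureChecker(HTMLParser):
    """Collect heading levels and count images that lack a description."""

    def __init__(self):
        super().__init__()
        self.headings = []           # heading levels in document order
        self.images_missing_alt = 0  # <img> tags with no or empty alt text

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; match h1 through h6 only.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if not alt or not alt.strip():
                self.images_missing_alt += 1


def check_html(html: str):
    """Return (heading levels, number of images without alt text)."""
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    return checker.headings, checker.images_missing_alt


page = (
    "<h1>Title</h1><h2>Chapter 1</h2>"
    '<img src="a.png" alt="A chart"><img src="b.png">'
    "<h2>Chapter 2</h2>"
)
headings, missing = check_html(page)
print(headings, missing)
```

A result with no heading levels at all, or with a non-zero count of images lacking descriptions, suggests the file is worth revising before conversion.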