
Getting PDFs Ready for Accessibility Requirements


The compliance of scanned and OCRed files from Adobe Acrobat Pro with accessibility standards depends on several factors, especially when dealing with complex layouts like columns and tables. Here’s how these elements fare:

1. Text Recognition (OCR) Accuracy

  • Adobe Acrobat Pro’s OCR is generally reliable for converting scanned images into editable and searchable text.
  • Challenges with Columns: OCR might misinterpret multi-column layouts, reading them linearly rather than by column.
  • Challenges with Tables: OCR may struggle to preserve the structure of tables, often interpreting them as unstructured text.

2. Tagging and Accessibility

Acrobat Pro can automatically tag OCRed documents, but the tags may not always be accurate, especially for complex layouts:

  • Columns: Acrobat might not detect column order correctly, causing screen readers to read content in the wrong sequence.
  • Tables: The software often fails to generate proper table tags, leading to a loss of row and column relationships crucial for screen reader users.
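To make the column problem concrete, here is a minimal Python sketch. It is not how Acrobat works internally: the `Block` type, the coordinates, and the `column_split` value are all made up for illustration. It only shows why reading text blocks strictly top to bottom interleaves two columns, while grouping by column first keeps each column together.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """A hypothetical text block with the position of its top-left corner."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def naive_order(blocks):
    """Read top to bottom, then left to right, ignoring columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_split):
    """Read everything left of column_split first, then everything right of it."""
    return [
        b.text
        for b in sorted(blocks, key=lambda b: (b.x >= column_split, b.y, b.x))
    ]


# Two columns with two lines each (illustrative coordinates).
page = [
    Block("Left 1", 50, 100),
    Block("Right 1", 320, 100),
    Block("Left 2", 50, 120),
    Block("Right 2", 320, 120),
]

print(naive_order(page))                      # ['Left 1', 'Right 1', 'Left 2', 'Right 2']
print(column_order(page, column_split=300))   # ['Left 1', 'Left 2', 'Right 1', 'Right 2']
```

The first result is what a screen reader user hears when the column order is wrong; the second is the sequence you are aiming for when you correct the tags by hand.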

3. Alt Text for Images

  • Scanned documents often include graphical elements, which Acrobat cannot automatically assign alt text to. You must manually add descriptive alt text for meaningful images.

4. Reading Order

  • Acrobat’s “Reading Order” tool is essential to correct the logical reading sequence, especially in multi-column and table-heavy documents.
  • Default reading order for OCRed files may require significant manual adjustments to ensure compliance.

5. Compliance with Accessibility Standards

To meet accessibility standards like WCAG 2.1 or Section 508, additional steps are often necessary:

  • Manually Adjust Tags: Verify and edit tags to accurately reflect document structure, including headings, lists, tables, and columns.
  • Use Acrobat’s Accessibility Checker: This tool helps identify and fix accessibility issues but may not catch all problems in complex layouts.
  • Supplement with Manual Efforts: Complex documents may require manual remediation with tools like Adobe Acrobat or third-party software specialized in accessibility.
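Before opening a batch of files in Acrobat, it can help to triage which ones have any tags at all. The sketch below is a rough heuristic, not a compliance check and not a replacement for Acrobat's Accessibility Checker. It assumes that a tagged PDF stores the names `/StructTreeRoot` and `/MarkInfo` in its document catalog and simply looks for those names in the raw bytes of the file. The function name `find_tag_markers` is made up for this example. A hit suggests the file has been tagged; a miss is inconclusive, because the catalog can sit inside a compressed object stream where a plain byte search will not see it.

```python
import sys
from pathlib import Path

# Names that a tagged PDF is expected to carry in its document catalog.
MARKERS = (b"/StructTreeRoot", b"/MarkInfo")


def find_tag_markers(pdf_path):
    """Report which tagging-related names appear in the raw bytes of the file.

    Heuristic only: it does not parse the PDF, so compressed catalogs
    produce false negatives, and it says nothing about tag quality.
    """
    data = Path(pdf_path).read_bytes()
    return {marker.decode("ascii"): marker in data for marker in MARKERS}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_tags.py scanned.pdf
    for name, found in find_tag_markers(sys.argv[1]).items():
        print(f"{name}: {'found' if found else 'not found'}")
```

Files where both names are missing are good candidates for running auto-tagging first; files where they are present still need the manual tag review described above.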

Best Practices for Improving Compliance

  1. Pre-OCR Processing: Clean up scanned files to enhance OCR accuracy (e.g., ensuring straight scans, good contrast, and minimal noise).
  2. Use Proper OCR Settings: Select the correct language and enable the “Recognize as Table” option where applicable.
  3. Manually Review Tags: After OCR, manually inspect and adjust tags for accurate representation of document structure.
  4. Simplify Layouts: If possible, avoid overly complex layouts in scanned documents to minimize accessibility challenges.

By taking these additional steps, you can significantly improve the compliance of scanned and OCRed documents, even with complex layouts.

 

Using Google Drive for OCR

OCR (Optical Character Recognition) is when a program looks at an image containing text, recognises the shapes as text, and leaves you with a document that you can edit as text (or at least matches the recognised text against the original file, making it keyword searchable).

It used to be that you needed an expensive, specialized program to do this. These days, you can do it with Google Drive and Google Docs.

 

For example, here is a picture of a page in an old dictionary. You may have taken this picture with your phone.

The first thing we want to do is convert it to a PDF. You can do this in a number of ways, but I do it by choosing to print the file, not to paper, but to a PDF. This is what that looks like on a Mac (see the PDF pulldown in the lower left?).

Now I have a PDF.

Note that this page does not have columns. This freebie method doesn’t handle columns well. You could still use it, but you would want to slice the image up so that each picture contains just one column, and then put the pieces back together in the final document.
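If you do end up slicing a page into columns, working out where to cut is simple arithmetic. Here is a small Python sketch of it. It assumes the columns are equal in width, which real scans often are not, and the function name `column_boxes` and the optional `overlap` margin are my own inventions for this example. Each box it returns is a (left, top, right, bottom) tuple in pixels, which is the same shape that image libraries such as Pillow accept for cropping.

```python
def column_boxes(width, height, n_columns, overlap=0):
    """Return one (left, top, right, bottom) pixel box per column, left to right.

    Assumes equal-width columns. `overlap` widens each box by that many
    pixels on both sides (clamped to the page) so that letters near a cut
    are not chopped in half.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    boxes = []
    for i in range(n_columns):
        left = max(0, i * width // n_columns - overlap)
        right = min(width, (i + 1) * width // n_columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


# A 1200 x 1600 pixel photo of a two-column page.
print(column_boxes(1200, 1600, 2))
# [(0, 0, 600, 1600), (600, 0, 1200, 1600)]
```

Crop the photo once per box, run each piece through the steps below, and paste the resulting text back together in column order.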

Next, take your PDF and upload it to your Google Drive:

Select File upload.

Browse to your file and select it.

When it is done uploading, select “Recent” so the newest files are at the top and easy to find.

Now right-click the PDF and choose Open with -> Google Docs.

When you open an image-based PDF in Google Docs, it automatically runs OCR, giving you a file that looks like this:

And now you have an editable document.