On PDF preservation risks


This started out as a comment on @dsalo's excellent answer above but rather quickly expanded beyond 500 characters:

PDF is a container format: a single PDF file has metadata and one or more content streams, conceptually similar to a ZIP archive containing multiple files. The core PDF format is based on a subset of PostScript (a programming language designed to produce graphics) plus a number of common graphics formats, but over time the format was expanded to allow streams to contain any type of data.
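
To make the container analogy concrete, here is a minimal sketch of walking that structure with the pypdf Python library; the filename is hypothetical and error handling is omitted:

    from pypdf import PdfReader

    reader = PdfReader("example.pdf")          # hypothetical filename

    # Document-level metadata lives in its own dictionary...
    print(reader.metadata)

    # ...while each page references content streams plus a /Resources
    # dictionary naming the fonts, images and other objects it depends on.
    for number, page in enumerate(reader.pages, start=1):
        try:
            resources = page["/Resources"]     # resolves indirect references
        except KeyError:
            resources = {}
        print(f"page {number}: resources = {list(resources.keys())}")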

  1. The PDF format is very complicated and pulls in several other complex specifications. In practice, the vast majority of PDF files have only been validated by testing whether Adobe Acrobat displays them as intended, and it is quite common for PDF encoders to generate output which breaks the standard in ways which Acrobat tolerates, leaving the problem to be detected only when the file is first used with other tools.

  2. While the subset of PostScript supported in PDF is not as capable as full PostScript (fortunately, as the latter is Turing-complete), what you actually have is still executable program code, and thus the only way to display PDF content is to execute each command in order:

        /Times findfont 100 scalefont setfont
        10 10 moveto
        .5 .5 .5 setrgbcolor
        (Hello World) true charpath fill
        showpage

    This fragment uses only a small subset of the language but already exposes the key areas of concern for simple PDF display:

    1. Since this is program code, implementation details can affect the output. As a simple, hopefully purely hypothetical example, consider how processor- or compiler-specific differences in floating-point rounding could, after many operations on a complex document, cause display problems such as lines which are supposed to appear joined showing a visible gap between them.

      As the full language is far more complicated than the subset above, there are many variations on this theme. Fortunately, the mainstream implementations have generally converged on reliable interoperability, but you are still likely to need a copy of Acrobat if you receive content from a wide range of sources.

      This was particularly a problem in the past with older “print to PDF” drivers which simply took the raw PostScript they would have sent directly to a printer and wrapped it in a PDF container.

    2. Font choices are specified by name. The corresponding font file may be embedded within the PDF file, but since system fonts are also supported, it's quite easy for authors to use special fonts and forget to embed them, a mistake which goes unnoticed until the first time the PDF is opened on a system which does not have those fonts installed (a rough sketch of detecting this programmatically appears after this list).

      We've seen this somewhat frequently with academic journal articles which were created using LaTeX and use its fonts to display mathematical symbols. A Google search confirms that this is not an uncommon mistake, as it only becomes a problem when documents circulate outside the significant portion of scientific users who have the LaTeX fonts installed: https://www.google.com/search?q=%2B%22Cannot+find+or+create+the+font%22

      Additionally, the TrueType and particularly OpenType font formats are by necessity quite complex in order to deal with the range of human writing systems. Again, this is an area of potentially significant difference between implementations and, particularly for complex scripts like Arabic or Devanagari, the failures can lead to the text being rendered incorrectly. Fonts are also versioned, so it's possible to have text which would be displayed correctly if the operating system's version of a font is used instead of the embedded version, or vice versa. The more obscure the languages you work with, the more you need some sort of system to check for correctness.

  3. For simple images, PDF writers are allowed to use a number of encodings, and over the years various image formats have been added, all of which require full software support:

    http://en.wikipedia.org/wiki/PDF#Adobe.27s_versions

  4. Over the years, Adobe has also added many other types of rich content: audio, video, 3D imagery, etc. All of these bring with them the full set of challenges involved in preserving their respective formats.

  5. Primarily for business users, Adobe has added several types of interactive forms, which rely on several complicated specifications and have, in my experience, been far less well supported by third-party implementations, particularly in the open-source community.

  6. In PDF 1.2, support was added for JavaScript as part of the forms specification. Since JavaScript is a full programming language, the only way to process those actions is to execute the code in a manner consistent with the original implementation. Fortunately, this is likely to be uncommon in most preservation scenarios.

  7. The specification includes varying levels of encryption. It may be possible to brute-force weak passwords and the older encryption algorithms, but success is not guaranteed and the software to do so might be difficult or even illegal to obtain.
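
Several of these risks can be checked for mechanically. As an illustration of the unembedded-font problem from point 2 above, here is a rough sketch using the pypdf Python library; the filename is hypothetical, and the handling of composite (Type0) fonts and the 14 standard base fonts is deliberately naive, so treat it as a starting point rather than a validator:

    from pypdf import PdfReader

    def unembedded_fonts(path):
        """Return the names of fonts referenced by name but not embedded."""
        reader = PdfReader(path)
        missing = set()
        for page in reader.pages:
            try:
                fonts = page["/Resources"]["/Font"]
            except KeyError:
                continue                       # this page declares no fonts
            for font in fonts.values():
                font = font.get_object()       # resolve indirect references
                descriptor = font.get("/FontDescriptor")
                if descriptor is None:
                    # Either one of the 14 standard fonts or a composite font
                    # whose descriptor sits on a descendant font: flag it for
                    # manual review rather than trying to resolve it here.
                    missing.add(str(font.get("/BaseFont")))
                    continue
                descriptor = descriptor.get_object()
                if not any(key in descriptor
                           for key in ("/FontFile", "/FontFile2", "/FontFile3")):
                    missing.add(str(font.get("/BaseFont")))
        return missing

    print(unembedded_fonts("article.pdf"))     # hypothetical filename

In practice you would probably pair a check like this with visual spot checks, since an embedded-but-wrong-version font is much harder to detect automatically.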

In practice, many of these concerns are manageable with a few precautions. If your content is not supposed to include the various rich-media features, the best place to start is by requiring the restricted subsets of PDF which have been developed to avoid many of these issues: PDF/A, intended for preservation, and PDF/X, intended for reliable graphics exchange, both of which disallow the more complex features and dramatically simplify the problem. If your goal is to archive general PDFs, however, you'll need to develop a more nuanced approach: audit the various complex features to check that a document does not include content which you are unprepared for (e.g. if your content includes embedded video, your auditing script could verify that the video stream uses a long-term viable codec).

Here are some features which you might want to audit (a rough sketch of automating some of these checks follows the list):

  1. All fonts in the file are the standard core PDF fonts or embedded within the file.
  2. All images are in the subset of formats which you are prepared to support and decode without errors.
  3. All content streams are checked against a whitelist of supported types.
  4. The PDF is unencrypted, or at least the password is known and the file decrypts successfully.
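
Here is a rough sketch, again assuming the pypdf Python library, of the image-filter and encryption checks above (the font check was sketched earlier). The filename, password handling, and filter whitelist are placeholders to adapt to your own policy, and a production audit would also need to walk annotations, embedded files, and JavaScript actions:

    from pypdf import PdfReader

    # Hypothetical policy: only these image compression filters are acceptable.
    ALLOWED_IMAGE_FILTERS = {"/FlateDecode", "/DCTDecode", "/CCITTFaxDecode"}

    def audit(path, password=None):
        reader = PdfReader(path)
        problems = []

        # Check 4: encrypted files are only usable if the password is known.
        # (Assumes decrypt() returns a falsy result when the password fails.)
        if reader.is_encrypted:
            if password is None or not reader.decrypt(password):
                return ["encrypted and could not be decrypted"]

        # Checks 2 and 3: every image XObject must use a whitelisted filter.
        for number, page in enumerate(reader.pages, start=1):
            try:
                xobjects = page["/Resources"]["/XObject"]
            except KeyError:
                continue                       # no images or other XObjects
            for name, xobject in xobjects.items():
                xobject = xobject.get_object()
                if xobject.get("/Subtype") != "/Image":
                    continue
                filters = xobject.get("/Filter")
                if not isinstance(filters, list):
                    filters = [filters]
                for entry in filters:
                    if entry is not None and str(entry) not in ALLOWED_IMAGE_FILTERS:
                        problems.append(f"page {number}: image {name} uses {entry}")
        return problems

    print(audit("incoming.pdf"))               # hypothetical filename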