The solutions to most problems are out there awaiting discovery; you just need to dig a little deeper and do some research. When you combine PDFs using tools like Pdftk or Apache PDFBox, fonts accumulate, and embedded fonts in particular have a direct impact on the merged PDF file’s size.
Below is an 85-page, 4,153,344-byte PDF file with duplicate fonts. It could be much smaller without the duplicates. And 85 pages is nothing: in production, PDF files can have as many as 20,000 pages. Imagine the size of the merged PDF!
To remove these duplicates, we use iText’s PdfSmartCopy class – com.itextpdf.text.pdf.PdfSmartCopy.
```java
package com.karlsangabriel.pdf;

import java.io.File;
import java.io.FileOutputStream;
import java.util.UUID;

import org.apache.commons.lang.StringUtils;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;

public class PDFDuplicateFontsRemover {

    public static void main(String[] args) throws Exception {
        PDFDuplicateFontsRemover dupFontsRemover = new PDFDuplicateFontsRemover();
        File f = new File("<SOMEWHERE>/2013-07-0912_29_13_034_EN.pdf");
        dupFontsRemover.normalizeFile(f);
    }

    public File normalizeFile(File srcPdfFile) throws Exception {
        Document pdfDocument = new Document();
        PdfReader pdfReader = null;
        FileOutputStream fileOutputStream = null;
        String parentDir = StringUtils.isNotEmpty(
                srcPdfFile.getParent()) ? srcPdfFile.getParent() + "/" : "";
        String newPdfFile = srcPdfFile.getName() + "_" + UUID.randomUUID().toString();
        try {
            File tmpPdfFile = new File(parentDir + newPdfFile);
            fileOutputStream = new FileOutputStream(tmpPdfFile);
            PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(pdfDocument, fileOutputStream);
            pdfDocument.open();
            pdfReader = new PdfReader(srcPdfFile.getCanonicalPath());
            // Where the magic happens
            for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
                pdfSmartCopy.addPage(pdfSmartCopy.getImportedPage(pdfReader, i));
            }
            pdfDocument.close();
            return tmpPdfFile;
        } finally {
            if (pdfReader != null) {
                pdfReader.close();
            }
            if (fileOutputStream != null) {
                fileOutputStream.close();
            }
            pdfReader = null;
            fileOutputStream = null;
        }
    }
}
```
Below is my Eclipse workspace showing the Java file, reference libraries, and JDK version.
When you run the application, it generates a smaller PDF file in the same directory as the source file, named after the original with a random UUID appended.
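To check how much smaller the new file actually is, you can compare the two file sizes. Below is a minimal sketch that uses only the JDK; the class name PdfSizeReport and its methods are made up for this illustration and are not part of iText.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper (not part of iText): reports how much smaller
// the normalized PDF is compared with the original file.
final class PdfSizeReport {

    // Percentage of the original size that was saved.
    static double percentSaved(long originalBytes, long normalizedBytes) {
        if (originalBytes <= 0) {
            throw new IllegalArgumentException("originalBytes must be positive");
        }
        return 100.0 * (originalBytes - normalizedBytes) / originalBytes;
    }

    // Reads both file sizes from disk and formats a one-line summary.
    static String describe(Path original, Path normalized) throws IOException {
        long before = Files.size(original);
        long after = Files.size(normalized);
        return String.format(Locale.ROOT, "%d bytes -> %d bytes (%.1f%% saved)",
                before, after, percentSaved(before, after));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PdfSizeReport <original.pdf> <normalized.pdf>");
            return;
        }
        System.out.println(describe(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

Pass the source PDF and the file returned by normalizeFile as the two arguments.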
You may still see the fonts listed under Document Properties -> Fonts in your PDF viewer, but they now point to the same set of references (or items) within the PDF document instead of each carrying its own copy. Other duplicates, such as shared images, are removed as well.
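The idea of many pages pointing to one shared item can be illustrated without iText. The following is only a conceptual sketch, not PdfSmartCopy's actual implementation: the class StreamDeduplicator and its methods are hypothetical, and it assumes two streams count as duplicates when their bytes are identical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch (hypothetical class, not iText code): identical byte
// streams are stored once, and every later copy gets the same reference number.
final class StreamDeduplicator {
    private final Map<String, Integer> refByDigest = new HashMap<>();
    private final List<byte[]> storedObjects = new ArrayList<>();

    // Returns the reference for this stream, storing it only if it is new.
    int add(byte[] stream) {
        String key = digest(stream);
        Integer existing = refByDigest.get(key);
        if (existing != null) {
            return existing; // duplicate: reuse the earlier object
        }
        storedObjects.add(stream.clone());
        int ref = storedObjects.size(); // 1-based reference number
        refByDigest.put(key, ref);
        return ref;
    }

    // Number of distinct objects actually kept.
    int storedCount() {
        return storedObjects.size();
    }

    // Hex-encoded SHA-256 digest, used as the lookup key for a stream.
    private static String digest(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        StreamDeduplicator dedup = new StreamDeduplicator();
        // Stand-ins for embedded font data found on three pages.
        byte[] page1Font = "font-A".getBytes(StandardCharsets.UTF_8);
        byte[] page2Font = "font-A".getBytes(StandardCharsets.UTF_8);
        byte[] page3Font = "font-B".getBytes(StandardCharsets.UTF_8);
        System.out.println("page 1 -> ref " + dedup.add(page1Font));
        System.out.println("page 2 -> ref " + dedup.add(page2Font));
        System.out.println("page 3 -> ref " + dedup.add(page3Font));
        System.out.println("objects stored: " + dedup.storedCount());
    }
}
```

In this sketch, pages 1 and 2 end up with the same reference number, so only two objects are stored for three pages. A font can still be listed for every page in the same way, while its data exists only once in the file.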