企业级-生成PDF移除异常空白页

作者:fyupeng

技术专栏:☞ https://github.com/fyupeng

项目地址:☞ https://github.com/fyupeng/distributed-blog-system-api


留给读者

咱们又见面了,本期带给大家什么,请往下看,绝对是干货!

一、介绍

提供 PDF文件二进制参数,返回删除空白页的PDF文件二进制。

二、代码

引入依赖:

xml 复制代码
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.21</version>
</dependency>

代码:

java 复制代码
public static void main(String[] args) throws IOException {
        File file = new File("d:/hztzs.pdf");
        byte[] bytes = new byte[(int) file.length()];
        FileInputStream fis = new FileInputStream(file);
        fis.read(bytes);

        bytes = new ArchivElecFileService().removeEmptyPages(bytes);

        File newfile = new File("d:/out.pdf");
        FileOutputStream fos = new FileOutputStream(newfile);
        fos.write(bytes);
    }

public byte[] removeEmptyPages(byte[] fileBytes) throws IOException {
        // Load the PDF document
        PDDocument document = PDDocument.load(fileBytes);

        // Iterate through each page
        PDPageTree pages = document.getPages();
        int pageCount = document.getNumberOfPages();
        for (int i = pageCount - 1; i >= 0; i--) {
            // Extract text from the page
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(i + 1); // Page indexes are 1-based in PDFTextStripper
            stripper.setEndPage(i + 1);
            String text = stripper.getText(document);

            PDPage page = pages.get(i);

            // Check if the page is empty
            if (text.trim().isEmpty()) {
                // Remove the page
                if (isPageImageOnly(page, document)) {
                    document.removePage(i);
                }
            }
        }
        // 保存结果文件
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        document.save(outputStream);

        return outputStream.toByteArray();
    }

    private static boolean isPageImageOnly(PDPage page, PDDocument document) throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        BufferedImage image = renderer.renderImageWithDPI(document.getPages().indexOf(page), 300); // Adjust DPI as needed
        return isImageOnly(image);
    }

    private static boolean isImageOnly(BufferedImage image) {
        // Check if the image contains significant content (e.g., not just white)
        // You can implement custom logic based on your requirements
        // For simplicity, here's a basic check
        int width = image.getWidth();
        int height = image.getHeight();
        long whitePixelCount = ImageUtils.countWhitePixels(image);

        // If more than 90% of the image is white, consider it empty
        double whiteRatio = (double) whitePixelCount / (width * height);
        return whiteRatio > 0.95; // Adjust threshold as needed
    }

    // Utility class to count white pixels in an image
    static class ImageUtils {
        public static long countWhitePixels(BufferedImage image) {
            long count = 0;
            int width = image.getWidth();
            int height = image.getHeight();
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    int pixel = image.getRGB(x, y);
                    if (isWhite(pixel)) {
                        count++;
                    }
                }
            }
            return count;
        }
        private static boolean isWhite(int pixel) {
            // Define your white color threshold based on RGB values
            // Adjust as per your image characteristics
            int red = (pixel >> 16) & 0xff;
            int green = (pixel >> 8) & 0xff;
            int blue = (pixel) & 0xff;
            return red > 250 && green > 250 && blue > 250;
        }
    }

三、总结

易用、高效、轻便!

相关推荐
Highcharts.js4 小时前
Highcharts常见问题解析(5):如何将多个图表导出到同一张图片或 PDF?
pdf·highcharts
麦烤楽鸡翅8 小时前
pdf(攻防世界)
网络安全·pdf·ctf·misc·杂项·攻防世界·信息竞赛
Less is moree8 小时前
PDF无法打印怎么解决?
pdf
lijun_xiao200913 小时前
Python-将身份证正反面图片-生成PDF
pdf
A尘埃14 小时前
项目七:PDF智能公式与计算(金融机构信贷报告自动解析与风险评估)
pdf
百事牛科技1 天前
PDF如何设置密码?3种方法保护文件安全
windows·pdf
mysusheng1 天前
2025 批量下载微博内容/图片/视频,导出word和pdf,微博点赞/评论/转发等数据导出excel
pdf·word·excel
lijun_xiao20091 天前
Python-PDF文件生成水印
pdf
zongxingfengyun1 天前
图片转pdf接口
pdf
cpych1 天前
如何从 PDF 中删除页面
pdf