Enterprise-Grade: Removing Unexpected Blank Pages from Generated PDFs

Author: fyupeng

Tech column: ☞ https://github.com/fyupeng

Project: ☞ https://github.com/fyupeng/distributed-blog-system-api


A note to readers

Here we are again. Read on to see what this installment covers; it is all practical material!

1. Introduction

Given the bytes of a PDF file, return the bytes of the same PDF with its blank pages removed.

2. Code

Add the dependency:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.21</version>
</dependency>

Code:

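// Imports needed by the methods below (all of them live in the ArchivElecFileService class)
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;

public static void main(String[] args) throws IOException {
        // Read the whole source PDF; a single InputStream.read() call is not guaranteed to fill the buffer
        byte[] bytes = Files.readAllBytes(Paths.get("d:/hztzs.pdf"));

        bytes = new ArchivElecFileService().removeEmptyPages(bytes);

        // Write the cleaned PDF; Files.write opens and closes the file for us
        Files.write(Paths.get("d:/out.pdf"), bytes);
    }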

public byte[] removeEmptyPages(byte[] fileBytes) throws IOException {
        // Load the PDF; try-with-resources closes the document when we are done
        try (PDDocument document = PDDocument.load(fileBytes)) {
            PDPageTree pages = document.getPages();
            PDFTextStripper stripper = new PDFTextStripper();

            // Walk the pages backwards so that removing one does not shift the indexes still to visit
            for (int i = document.getNumberOfPages() - 1; i >= 0; i--) {
                // Extract the text of this page only
                stripper.setStartPage(i + 1); // Page numbers are 1-based in PDFTextStripper
                stripper.setEndPage(i + 1);
                String text = stripper.getText(document);

                PDPage page = pages.get(i);

                // No text: render the page and remove it only if it is also almost entirely white
                if (text.trim().isEmpty() && isPageImageOnly(page, document)) {
                    document.removePage(i);
                }
            }

            // Save the result
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
            document.save(outputStream);
            return outputStream.toByteArray();
        }
    }

    // Despite the name, this reports whether the rendered page is almost entirely white
    private static boolean isPageImageOnly(PDPage page, PDDocument document) throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        // 300 DPI is thorough but slow and memory-hungry; lower it if speed matters
        BufferedImage image = renderer.renderImageWithDPI(document.getPages().indexOf(page), 300);
        return isImageOnly(image);
    }

    private static boolean isImageOnly(BufferedImage image) {
        // Despite the name, returns true when the image has no significant content (almost all white).
        // This is a basic check; implement custom logic if your requirements differ.
        int width = image.getWidth();
        int height = image.getHeight();
        long whitePixelCount = ImageUtils.countWhitePixels(image);

        // If more than 95% of the image is white, consider it empty.
        // Multiply as long so very large renderings cannot overflow int.
        double whiteRatio = (double) whitePixelCount / ((long) width * height);
        return whiteRatio > 0.95; // Adjust threshold as needed
    }

    // Utility class to count white pixels in an image
    static class ImageUtils {
        public static long countWhitePixels(BufferedImage image) {
            long count = 0;
            int width = image.getWidth();
            int height = image.getHeight();
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    int pixel = image.getRGB(x, y);
                    if (isWhite(pixel)) {
                        count++;
                    }
                }
            }
            return count;
        }
        private static boolean isWhite(int pixel) {
            // Define your white color threshold based on RGB values
            // Adjust as per your image characteristics
            int red = (pixel >> 16) & 0xff;
            int green = (pixel >> 8) & 0xff;
            int blue = (pixel) & 0xff;
            return red > 250 && green > 250 && blue > 250;
        }
    }
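
The whiteness check above always scans every pixel of the rendered page. As a minimal sketch of a variation (not the author's code), the check below stops as soon as the number of non-white pixels exceeds a budget, which can save time on pages that clearly have content. The class name `BlankPageCheck`, the helper `fill`, and the parameters `whiteLevel` and `maxInkRatio` are hypothetical names introduced for illustration. It uses only the JDK, so it can be tried without PDFBox. Note that a page sitting exactly on the budget counts as blank here, which differs slightly from the strict `> 0.95` comparison in the listing.

```java
import java.awt.image.BufferedImage;

class BlankPageCheck {

    /**
     * Returns true if no more than maxInkRatio of the pixels are non-white.
     * A pixel is white when red, green and blue are all above whiteLevel.
     */
    public static boolean isMostlyWhite(BufferedImage image, int whiteLevel, double maxInkRatio) {
        int width = image.getWidth();
        int height = image.getHeight();
        // Number of non-white pixels tolerated before the page counts as non-blank
        long inkBudget = (long) (maxInkRatio * width * height);
        long ink = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xff;
                int g = (rgb >> 8) & 0xff;
                int b = rgb & 0xff;
                boolean white = r > whiteLevel && g > whiteLevel && b > whiteLevel;
                if (!white && ++ink > inkBudget) {
                    return false; // early exit: already too much content
                }
            }
        }
        return true;
    }

    /** Hypothetical test helper: paints a rectangle with a single RGB color. */
    public static void fill(BufferedImage image, int x0, int y0, int w, int h, int rgb) {
        for (int y = y0; y < y0 + h; y++) {
            for (int x = x0; x < x0 + w; x++) {
                image.setRGB(x, y, rgb);
            }
        }
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        fill(page, 0, 0, 200, 100, 0xFFFFFF);               // an all-white "page"
        System.out.println(isMostlyWhite(page, 250, 0.05)); // true

        fill(page, 0, 0, 200, 20, 0x000000);                // top 20% is now black
        System.out.println(isMostlyWhite(page, 250, 0.05)); // false
    }
}
```

With `whiteLevel = 250` and `maxInkRatio = 0.05` this mirrors the thresholds used in the listing; both are worth tuning against real scans, since scanner noise or faint watermarks can push a visually blank page past the budget.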

3. Summary

Easy to use, efficient, and lightweight!
