Electronic Invoice Parsing Tool: A C# Desktop Application Case Study

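1. Project Structure

The electronic invoice parsing tool is a C# desktop application for batch processing and parsing electronic invoices. The project uses a layered architecture with four main layers: UI interaction, business logic, data processing, and external service interaction.

Main file structure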

InvoiceClient/
├── Program.cs                    // Application entry point
├── InvoiceAnalysis.cs            // Main form and core business logic
├── InvoiceAnalysis_ConvertJpg.cs // PDF/OFD to image conversion
├── InvoiceAnalysis_GoApiServer.cs // Communication with the Go API service
├── InvoiceAnalysis_ForamtData.cs // Invoice data formatting
├── InvoiceAnalysis_DatabaseHelper.cs // SQLite database operations
├── InvoiceAnalysis_ExportData.cs // Data export
├── Utils.cs                      // General utility class
└── App.config                    // Application configuration

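Core classes and their responsibilities

Class	Responsibility	File
InvoiceAnalysis	Main form and core invoice-processing logic	InvoiceAnalysis.cs
ConvertJpg	Converts PDF/OFD files to images	InvoiceAnalysis_ConvertJpg.cs
InvoiceAnalysis_GoApiServer	Communicates with the Go API service	InvoiceAnalysis_GoApiServer.cs
FormatData	Formats and parses invoice data	InvoiceAnalysis_ForamtData.cs
InvoiceAnalysis_DatabaseHelper	SQLite database operations	InvoiceAnalysis_DatabaseHelper.cs
Utils	Collection of utility methods	Utils.cs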

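2. Electronic Invoice Parsing Workflow

Processing a single electronic invoice

Select a single PDF/OFD file
- Choose the electronic invoice file through OpenFileDialog
- Load the file information
Convert the PDF/OFD file to an image
- Call the conversion method that matches the file type
- Save the converted image
Upload the image to the Go API service
- Encode the image as Base64
- Send it to the Go API with an HTTP POST request
- Receive and parse the recognition result
Format the data and extract key fields
- Identify the invoice type
- Apply the formatting rules for that type
- Extract and validate the key fields
Display and store the result
- Show the parsed result in the UI
- Save it to a text file or the database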

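Processing electronic invoices in batch

Select a folder and load its PDF/OFD files
- Choose the folder that contains the invoices through FolderBrowserDialog
- Scan the folder for PDF and OFD files and add them to the file list
Convert the PDF/OFD files to images in batch
- Call the external program pdftojpg.exe for PDF files
- Call the Python script ofdToJpgCsRun.py for OFD files
- Save the converted images to the img subdirectory
Upload the images to the Go API service in batch
- Encode each image as Base64
- Send the Base64 data to the Go API "/ocr" endpoint with an HTTP POST request
- Receive and parse the JSON recognition result returned by the API
Format the data and extract key fields
- Apply different formatting rules depending on the invoice type
- Extract key fields such as invoice code, invoice number, amount, and tax
- Validate the data and format the output
Export and store the data
- Export the parsed data to an Excel file
- Validate and deduplicate the data
- Write to the SQLite database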
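
The article does not show the "validate and deduplicate" step. The following is a minimal sketch of one way it could work, assuming that two records are duplicates when they share the same invoice code and invoice number. `InvoiceRecord`, `InvoiceDedup`, and `Deduplicate` are hypothetical names introduced for illustration; they are not part of the project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type for this sketch; the article's ExportData uses Chinese property names.
public class InvoiceRecord
{
    public string Code { get; set; }
    public string Number { get; set; }
    public string FileName { get; set; }
}

public static class InvoiceDedup
{
    // Drops records without an invoice number, then keeps the first record
    // seen for each (Code, Number) pair.
    public static List<InvoiceRecord> Deduplicate(IEnumerable<InvoiceRecord> records)
    {
        return records
            .Where(r => !string.IsNullOrWhiteSpace(r.Number))
            .GroupBy(r => (r.Code ?? "") + "|" + r.Number)
            .Select(g => g.First())
            .ToList();
    }

    public static void Main()
    {
        var input = new List<InvoiceRecord>
        {
            new InvoiceRecord { Code = "A", Number = "001", FileName = "a.jpg" },
            new InvoiceRecord { Code = "A", Number = "001", FileName = "a_copy.jpg" },
            new InvoiceRecord { Code = "A", Number = "002", FileName = "b.jpg" },
            new InvoiceRecord { Code = "B", Number = "", FileName = "blank.jpg" }
        };

        // Prints "A 001 a.jpg" and "A 002 b.jpg"
        foreach (var r in Deduplicate(input))
        {
            Console.WriteLine(r.Code + " " + r.Number + " " + r.FileName);
        }
    }
}
```

Whether a record without an invoice number should be dropped or flagged for manual review is a design decision the article leaves open.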

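Flowchart

flowchart TD
    subgraph Single invoice processing
        A[Select a single PDF/OFD file] --> B[Convert to image]
        B --> C[Encode image as Base64]
        C --> D[Upload to the Go API service]
        D --> E[Receive OCR result]
        E --> F[Identify invoice type]
        F --> G[Apply formatting rules]
        G --> H[Extract key fields]
        H --> I[Validate data]
        I --> J[Display and store result]
    end
    subgraph Batch invoice processing
        K[Select folder] --> L[Load PDF/OFD files in batch]
        L --> M[Multithreaded batch conversion to images]
        M --> N[Image queue management]
        N --> O[Multithreaded batch upload to the Go API]
        O --> P[Receive recognition results in batch]
        P --> Q[Batch data formatting]
        Q --> R[Batch key-field extraction]
        R --> S[Validate and deduplicate data]
        S --> T[Batch export to Excel]
        T --> U[Write to the SQLite database]
    end
    %% Relationship between the two flows
    J -.-> K

[Figure: screenshot of the tool]

Detailed implementation walkthrough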

2.1 Loading invoice files in batch

// Button click handler that opens the folder picker
private void button_ReadInvoiceFileNameList_Click(object sender, EventArgs e)
{
    // Let the user pick a folder; the dialog is disposed when the block ends
    using (FolderBrowserDialog folderBrowserDialog = new FolderBrowserDialog())
    {
        folderBrowserDialog.Description = "请选择包含电子发票的文件夹";
        folderBrowserDialog.ShowNewFolderButton = true;

        // Continue only if the user confirmed the selection
        if (folderBrowserDialog.ShowDialog() == DialogResult.OK)
        {
            // Remember the selected folder path
            string selectedPath = folderBrowserDialog.SelectedPath;
            textBox_FilePath.Text = selectedPath;

            // Load the PDF and OFD files found in the folder
            loadInvoiceFileToList(selectedPath);
        }
    }
}

// Load the invoice files into the list box
private void loadInvoiceFileToList(string path)
{
    // Clear the existing lists
    listBox_FileList.Items.Clear();
    listBox_JpgList.Items.Clear();
    listFileNames = new List<string>();

    // Find all PDF and OFD files (top-level folder only)
    string[] pdfFiles = Directory.GetFiles(path, "*.pdf");
    string[] ofdFiles = Directory.GetFiles(path, "*.ofd");

    // Merge the two file lists
    List<string> allFiles = new List<string>();
    allFiles.AddRange(pdfFiles);
    allFiles.AddRange(ofdFiles);

    // Add each file to the list box
    foreach (string file in allFiles)
    {
        listBox_FileList.Items.Add(file);
    }
}

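2.2 Converting PDF/OFD files to images in batch

// Convert a PDF file to images
public List<string> PdfToJpg()
{
    // Show the conversion status
    label_Msg.Text = $"正在转换{fileName}文件到图片，请稍后。。。";
    label_Msg.Refresh();

    // Run the external program pdftojpg.exe on the PDF.
    // Note: the paths are passed unquoted, so a path containing spaces
    // is split into several arguments.
    using (Process process = new Process())
    {
        process.StartInfo.FileName = "pdftojpg.exe";
        process.StartInfo.Arguments = $"{fileName} {outputDir}";
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.RedirectStandardError = true;

        // Start the process and drain both redirected streams before waiting;
        // a redirected stream that is never read can fill up and block the child process
        process.Start();
        Task<string> stderrTask = process.StandardError.ReadToEndAsync();
        process.StandardOutput.ReadToEnd();
        stderrTask.Wait();
        process.WaitForExit();
    }

    // Collect the converted image files
    List<string> imageFiles = new List<string>();
    if (Directory.Exists(outputDir))
    {
        imageFiles.AddRange(Directory.GetFiles(outputDir, "*.jpg"));
    }

    return imageFiles;
}

// Convert an OFD file to images
public List<string> OfdToJpg()
{
    // Locate the Python interpreter
    Utils utils = new Utils();
    string pythonPath = utils.GetPythonPath();

    // Run the Python script on the OFD file
    using (Process process = new Process())
    {
        process.StartInfo.FileName = pythonPath;
        process.StartInfo.Arguments = $"ofdToJpgCsRun.py {fileName} {outputDir}";
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;

        // Start the process and read its output before waiting for it to exit
        process.Start();
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();

        // Parse the JSON file list printed by the Python script;
        // fall back to an empty list if the script printed nothing
        ResponseData responseData = JsonConvert.DeserializeObject<ResponseData>(output);
        return responseData?.Data ?? new List<string>();
    }
}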
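
`ResponseData` is used above but never defined in the article. The sketch below shows one plausible shape, assuming only what the code reveals: it has a `Data` property holding a list of file paths. The JSON payload in `Main` and the helper `ParseImageList` are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Assumed shape; the article only shows that ResponseData exposes a Data list.
public class ResponseData
{
    public List<string> Data { get; set; }
}

public static class ResponseDataDemo
{
    // Returns the file list, or an empty list when the script output is not usable JSON.
    public static List<string> ParseImageList(string scriptOutput)
    {
        if (string.IsNullOrWhiteSpace(scriptOutput))
        {
            return new List<string>();
        }
        try
        {
            ResponseData parsed = JsonConvert.DeserializeObject<ResponseData>(scriptOutput);
            return parsed?.Data ?? new List<string>();
        }
        catch (JsonException)
        {
            // The script wrote something that is not JSON (for example a Python traceback).
            return new List<string>();
        }
    }

    public static void Main()
    {
        // Invented sample payload.
        string sample = "{\"Data\":[\"img\\\\a_1.jpg\",\"img\\\\a_2.jpg\"]}";
        foreach (string path in ParseImageList(sample))
        {
            Console.WriteLine(path);
        }
        Console.WriteLine(ParseImageList("Traceback ...").Count);
    }
}
```

Catching `JsonException` here keeps a failed conversion of one file from aborting the whole batch.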

2.3 Uploading images to the Go API service in batch

// Click handler for the batch-parse button
private async void button_BatchParsingData_Click(object sender, EventArgs e)
{
    // Disable the button to prevent repeated clicks
    button_BatchParsingData.Enabled = false;
    try
    {
        // Number of image files to process
        int intCount = listBox_JpgList.Items.Count;
        if (intCount == 0)
        {
            MessageBox.Show("请先批量转换图片", "提示", MessageBoxButtons.OK, MessageBoxIcon.Error);
            return;
        }

        // Start timing and reset the progress bar
        Stopwatch stopwatch = Stopwatch.StartNew();
        progressBar1.Value = 0;

        // Copy the items on the UI thread so the background task never touches the ListBox
        List<string> items = listBox_JpgList.Items.Cast<object>().Select(o => o.ToString()).ToList();

        // Process the images one after another on a background thread
        await Task.Run(async () =>
        {
            for (int index = 0; index < items.Count; index++)
            {
                string item = items[index];
                int int_P = (int)Math.Round((double)(index + 1) / intCount * 100);

                // Update the status label and the progress bar on the UI thread
                Invoke((Action)(() =>
                {
                    label_Msg.Text = $"正在OCR解析【{item}】数据，请稍后......";
                    label_Msg.Refresh();
                    updateProgressBar(int_P);
                }));

                // Load the image, then call the API for OCR
                var imagebyte = File.ReadAllBytes(item);
                using (var stream = new MemoryStream(imagebyte))
                using (Bitmap bitmap = new Bitmap(stream))
                {
                    await allFileOcrAnalysisDataApi(item, item, bitmap);
                }
            }
        });

        // All images processed
        label_Msg.Text = "图片列表中的所有图片数据全部解析完毕！";
        label_Msg.Refresh();

        // Show the elapsed time
        stopwatch.Stop();
        label_Msg.Text += utils.GetRunTime(stopwatch, "OCR解析图片时间");

        // Export the data to Excel
        new InvoiceAnalysis_ExportData(dataGridView_Data, richTextBox3, label_ExportDataMsg, formatDatas, eds).FormatDataToExecl();
    }
    finally
    {
        // Re-enable the button even if an exception was thrown
        button_BatchParsingData.Enabled = true;
    }
}

// Call the API for OCR. The image parameter is not used in this method:
// the file is read from disk again and sent as Base64.
private async Task allFileOcrAnalysisDataApi(string fileName, string jpgName, System.Drawing.Image image)
{
    // Encode the image file as Base64
    string base64 = utils.ConvertImageToBase64(fileName);

    // Send it to the Go API service for OCR
    string response = await GoApiServer.SendImageBase64ToServer(base64);

    // Parse the JSON returned by the API
    OcrResultText result = JsonConvert.DeserializeObject<OcrResultText>(response);

    // Save the raw recognition result under the txt subfolder
    string directoryPath = System.IO.Path.GetDirectoryName(fileName) + "\\txt\\";
    System.IO.Directory.CreateDirectory(directoryPath); // FileMode.Create does not create missing folders
    string fileNameWithoutExtension = System.IO.Path.GetFileNameWithoutExtension(fileName);
    string txtPath = System.IO.Path.Combine(directoryPath, fileNameWithoutExtension);

    using (FileStream fs = new FileStream(txtPath + "_格式化前.txt", FileMode.Create, FileAccess.Write))
    using (StreamWriter sw = new StreamWriter(fs))
    {
        sw.WriteLine(result.Text);
    }

    // Detect the invoice type and format the recognition result
    string invoiceType = FData.GetInvoiceType(result.Text);
    string formatData = FData.RunAnalysisDataFormat(directoryPath, fileName, jpgName, invoiceType, result.Text, pds, label_Msg);

    // Keep the formatted result for export
    formatDatas.Add(formatData);
}
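
`utils.ConvertImageToBase64` is called above but its body is not shown. A minimal sketch of what such a helper might do, assuming it simply encodes the raw file bytes (the class name `ImageBase64` is hypothetical):

```csharp
using System;
using System.IO;

public static class ImageBase64
{
    // Reads the file as-is and Base64-encodes the bytes; the image is not decoded.
    public static string ConvertImageToBase64(string imagePath)
    {
        byte[] bytes = File.ReadAllBytes(imagePath);
        return Convert.ToBase64String(bytes);
    }

    public static void Main()
    {
        // Write the three-byte JPEG signature to a temporary file and encode it.
        string tmp = Path.GetTempFileName();
        File.WriteAllBytes(tmp, new byte[] { 0xFF, 0xD8, 0xFF });
        Console.WriteLine(ConvertImageToBase64(tmp)); // prints "/9j/"
        File.Delete(tmp);
    }
}
```

Base64 inflates the payload by roughly one third, which is worth keeping in mind when large scanned invoices are posted as JSON.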

2.4 Communicating with the Go API service

internal class InvoiceAnalysis_GoApiServer
{
    private string apiServerPath;
    private string apiServerPort;
    // A single static HttpClient instance so that connections are reused
    private static readonly HttpClient client = new HttpClient();

    // Constructor: stores the API server address and port
    public InvoiceAnalysis_GoApiServer(string ApiServerPath, string ApiServerPort)
    {
        this.apiServerPath = ApiServerPath;
        this.apiServerPort = ApiServerPort;
    }

    // Send a Base64-encoded image to the Go API service
    public async Task<string> SendImageBase64ToServer(string base64)
    {
        // Build the request object and serialize it to JSON
        var data = new { Base64 = base64 };
        string jsonData = JsonConvert.SerializeObject(data);
        var content = new StringContent(jsonData, Encoding.UTF8, "application/json");

        // POST to the "/ocr" endpoint; throws on a non-success status code
        HttpResponseMessage response = await client.PostAsync(apiServerPath + ":" + apiServerPort + "/ocr", content);
        response.EnsureSuccessStatusCode();

        // Return the response body
        return await response.Content.ReadAsStringAsync();
    }

    // Send a file to the Go API service
    public async Task<string> SendFileToServer(string filePath)
    {
        // Make sure the file exists
        if (!File.Exists(filePath))
        {
            Console.WriteLine("文件不存在！");
            return "";
        }

        // Build a multipart/form-data body
        using (MultipartFormDataContent form = new MultipartFormDataContent())
        {
            // Read the file content
            byte[] fileBytes = File.ReadAllBytes(filePath);
            ByteArrayContent fileContent = new ByteArrayContent(fileBytes);

            // Set the Content-Disposition header; only the extension of the real name is sent
            string extension = Path.GetExtension(filePath).TrimStart('.');
            fileContent.Headers.ContentDisposition = new ContentDispositionHeaderValue("form-data")
            {
                Name = "\"file\"",
                FileName = $"dummy.{extension}"
            };

            // Add the file to the form
            form.Add(fileContent, "file", $"dummy.{extension}");

            // POST to the "/uploadbrowerfile" endpoint
            string apiUrl = apiServerPath + ":" + apiServerPort + "/uploadbrowerfile";
            HttpResponseMessage response = await client.PostAsync(apiUrl, form);

            // Read and return the response body (the status code is not checked here)
            string responseBody = await response.Content.ReadAsStringAsync();

            return responseBody;
        }
    }

    // Get the current year from the server
    public async Task<string> GetCurrentYearString()
    {
        // GET the "/api/current-year" endpoint
        string apiUrl = apiServerPath + ":" + apiServerPort + "/api/current-year";
        HttpResponseMessage response = await client.GetAsync(apiUrl);
        response.EnsureSuccessStatusCode();

        // Return the response body
        return await response.Content.ReadAsStringAsync();
    }
}
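
The class above lets `HttpRequestException` propagate when the service is down, and uses the default `HttpClient` timeout. The following sketch shows one way a caller could handle those cases. The host, port, 30-second timeout, and the names `OcrClientDemo`, `BuildUrl`, and `TryPostJsonAsync` are assumptions for illustration, not values from the project.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class OcrClientDemo
{
    // Assumed timeout; the article does not configure one.
    private static readonly HttpClient client = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };

    // Builds "<host>:<port>/<path>" the same way the class above concatenates it.
    public static string BuildUrl(string host, string port, string path)
    {
        return host.TrimEnd('/') + ":" + port + "/" + path.TrimStart('/');
    }

    // Returns the response body, or null when the request fails or times out.
    public static async Task<string> TryPostJsonAsync(string url, string json)
    {
        try
        {
            using (var content = new StringContent(json, Encoding.UTF8, "application/json"))
            using (HttpResponseMessage response = await client.PostAsync(url, content))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException) { return null; }   // connection refused or non-2xx status
        catch (TaskCanceledException) { return null; }  // timeout
    }

    public static void Main()
    {
        // Placeholder address, not taken from the article.
        string url = BuildUrl("http://127.0.0.1", "8080", "/ocr");
        Console.WriteLine(url); // prints "http://127.0.0.1:8080/ocr"

        string body = TryPostJsonAsync(url, "{\"Base64\":\"\"}").GetAwaiter().GetResult();
        Console.WriteLine(body == null ? "request failed" : body);
    }
}
```

In a batch run, returning null for one failed image lets the loop record the failure and continue with the next file.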

3. Data Structures and Data Classes

ExportData class

Stores the basic information of an invoice; it is the core model for exported data.

public class ExportData
{
    public string 发票代码 { get; set; }      // Invoice code
    public string 发票号码 { get; set; }      // Invoice number
    public string 开票日期 { get; set; }      // Issue date
    public string 购买方名称 { get; set; }    // Buyer name
    public string 购买方税号 { get; set; }    // Buyer tax ID
    public string 销售方名称 { get; set; }    // Seller name
    public string 销售方税号 { get; set; }    // Seller tax ID
    public string 金额 { get; set; }          // Amount (before tax)
    public string 税额 { get; set; }          // Tax amount
    public string 价税合计 { get; set; }      // Total including tax
    public string 项目名称 { get; set; }      // Item name
    public string 税收类别 { get; set; }      // Tax category
    public string 税率 { get; set; }          // Tax rate
    public string 检测标识 { get; set; }      // Check flag
    public string 文件名 { get; set; }        // File name
}

ProvinceData class

Stores and manages province data, which is used to identify the region an invoice belongs to.

public class ProvinceData
{
    public string ProvinceName;        // Full province name
    public string Logogram;            // Short province name
    public string PinyinAbbreviation;  // Pinyin abbreviation

    // Set the fields of this province entry
    public void provinceDataAdd(string provinceName, string logogram, string pinyinAbbreviation)
    {
        ProvinceName = provinceName;
        Logogram = logogram;
        PinyinAbbreviation = pinyinAbbreviation;
    }

    // Build the list of provinces
    public List<ProvinceData> ProvinceDataInit()
    {
        List<Tuple<string, string, string>> data = new List<Tuple<string, string, string>>();

        // Add the 34 provincial-level regions
        data.Add(Tuple.Create("北京市", "北京", "BJ"));
        data.Add(Tuple.Create("上海市", "上海", "SH"));
        data.Add(Tuple.Create("天津市", "天津", "TJ"));
        data.Add(Tuple.Create("重庆市", "重庆", "CQ"));
        data.Add(Tuple.Create("江苏省", "江苏", "JS"));
        // ... remaining provinces

        List<ProvinceData> pds = new List<ProvinceData>();
        // Create one ProvinceData object per entry
        foreach (var item in data)
        {
            ProvinceData pd = new ProvinceData();
            pd.provinceDataAdd(item.Item1, item.Item2, item.Item3);
            pds.Add(pd);
        }
        return pds;
    }

    // Find the province an invoice belongs to
    public string GetHomeProvince(string strLines, List<ProvinceData> pds)
    {
        // Look for a short province name immediately followed by "增" (as in "江苏增值税...")
        var province = pds.FirstOrDefault(pd => strLines.Contains(pd.Logogram.Trim() + "增"));
        return province?.Logogram ?? string.Empty;
    }
}
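
The matching rule in `GetHomeProvince` can be illustrated in isolation. The sketch below uses a three-entry subset of the province list and an invented OCR string; `ProvinceMatchDemo` is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class ProvinceMatchDemo
{
    public static void Main()
    {
        // Short names only; a subset of the list built by ProvinceDataInit.
        string[] logograms = { "北京", "上海", "江苏" };

        // Invented sample text standing in for an OCR result.
        string ocrText = "江苏增值税电子普通发票;发票号码:12345678";

        // Same rule as GetHomeProvince: the short name must be directly followed by "增".
        string match = logograms.FirstOrDefault(name => ocrText.Contains(name + "增"));
        Console.WriteLine(match ?? "(none)"); // prints "江苏"
    }
}
```

Requiring the trailing "增" avoids false hits when a province name appears elsewhere on the invoice, for example inside a company address.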

OcrResultText class

Used to parse the OCR result returned by the Go API.

public class OcrResultText
{
    public string Text { get; set; }  // Full recognized text
    public List<TextBlock> TextBlocks { get; set; }  // List of text blocks
}

public class TextBlock
{
    public string Text { get; set; }  // Text content of the block
    public List<Point> Points { get; set; }  // Corner coordinates of the text box
}

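4. SQLite Database Usage and CRUD Operations

The project uses a SQLite database to store invoice formatting rules, so that parsing rules can be managed separately for each invoice type.

Table structure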

CREATE TABLE IF NOT EXISTS FormatRules (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    InvoiceType TEXT NOT NULL,
    SearchPattern TEXT NOT NULL,
    ReplacePattern TEXT NOT NULL
);

Database helper class

public class InvoiceAnalysis_DatabaseHelper : IDisposable
{
    private SQLiteConnection connection;
    private string connectionString;

    // Constructor: sets up the connection and makes sure the table exists
    public InvoiceAnalysis_DatabaseHelper(string dbFilePath)
    {
        connectionString = $"Data Source={dbFilePath};Version=3;";
        connection = new SQLiteConnection(connectionString);
        CreateDatabase();
    }

    // Create the table if it does not exist
    private void CreateDatabase()
    {
        // Open the connection if needed
        if (connection.State != ConnectionState.Open)
        {
            connection.Open();
        }

        // Create the FormatRules table
        string createTableQuery = @"
            CREATE TABLE IF NOT EXISTS FormatRules (
                Id INTEGER PRIMARY KEY AUTOINCREMENT,
                InvoiceType TEXT NOT NULL,
                SearchPattern TEXT NOT NULL,
                ReplacePattern TEXT NOT NULL
            );";
        using (var command = new SQLiteCommand(createTableQuery, connection))
        {
            command.ExecuteNonQuery();
        }

        connection.Close();
    }

    // Insert a rule
    public void AddRule(string invoiceType, string searchPattern, string replacePattern)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            string insertQuery = "INSERT INTO FormatRules (InvoiceType, SearchPattern, ReplacePattern) VALUES (@InvoiceType, @SearchPattern, @ReplacePattern)";
            using (var command = new SQLiteCommand(insertQuery, connection))
            {
                command.Parameters.AddWithValue("@InvoiceType", invoiceType);
                command.Parameters.AddWithValue("@SearchPattern", searchPattern);
                command.Parameters.AddWithValue("@ReplacePattern", replacePattern);
                command.ExecuteNonQuery();
            }
        }
    }

    // Update a rule
    public void UpdateRule(int id, string searchPattern, string replacePattern)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            string updateQuery = "UPDATE FormatRules SET SearchPattern = @SearchPattern, ReplacePattern = @ReplacePattern WHERE Id = @Id";
            using (var command = new SQLiteCommand(updateQuery, connection))
            {
                command.Parameters.AddWithValue("@Id", id);
                command.Parameters.AddWithValue("@SearchPattern", searchPattern);
                command.Parameters.AddWithValue("@ReplacePattern", replacePattern);
                command.ExecuteNonQuery();
            }
        }
    }

    // Delete a rule
    public void DeleteRule(int id)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            string deleteQuery = "DELETE FROM FormatRules WHERE Id = @Id";
            using (var command = new SQLiteCommand(deleteQuery, connection))
            {
                command.Parameters.AddWithValue("@Id", id);
                command.ExecuteNonQuery();
            }
        }
    }

    // Query the rules for one invoice type.
    // Unlike the other methods, this one uses the shared connection field;
    // the caller must dispose the returned reader, which also closes that connection.
    public SQLiteDataReader GetRules(string invoiceType)
    {
        // Open the connection if needed
        if (connection.State != ConnectionState.Open)
        {
            connection.Open();
        }

        // Prepare the parameterized query
        string selectQuery = "SELECT * FROM FormatRules WHERE InvoiceType = @InvoiceType";
        var command = new SQLiteCommand(selectQuery, connection);
        command.Parameters.AddWithValue("@InvoiceType", invoiceType);

        // Return the data reader
        return command.ExecuteReader(System.Data.CommandBehavior.CloseConnection);
    }

    // Insert rules in bulk from a text file with one "search|replace" pair per line
    public void BulkInsertRules(string invoiceType, string filePath)
    {
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            // A single transaction makes the bulk insert much faster and all-or-nothing
            using (var transaction = connection.BeginTransaction())
            {
                string insertQuery = "INSERT INTO FormatRules (InvoiceType, SearchPattern, ReplacePattern) VALUES (@InvoiceType, @SearchPattern, @ReplacePattern)";
                using (var command = new SQLiteCommand(insertQuery, connection))
                {
                    foreach (var line in File.ReadLines(filePath))
                    {
                        // Lines that do not have exactly two parts are skipped silently
                        var parts = line.Split('|');
                        if (parts.Length == 2)
                        {
                            command.Parameters.Clear();
                            command.Parameters.AddWithValue("@InvoiceType", invoiceType);
                            command.Parameters.AddWithValue("@SearchPattern", parts[0]);
                            command.Parameters.AddWithValue("@ReplacePattern", parts[1]);
                            command.ExecuteNonQuery();
                        }
                    }
                }
                transaction.Commit();
            }
        }
    }

    // IDisposable implementation
    public void Dispose()
    {
        if (connection != null)
        {
            connection.Dispose();
        }
    }
}

Applying the database rules

// Load the replacement rules from the database
private Dictionary<string, string> FromDbLoadReplaceRules(string invoiceType)
{
    Dictionary<string, string> rules = new Dictionary<string, string>();

    // Open the rules database next to the executable
    string dbFilePath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "rules.db");
    using (var dbHelper = new InvoiceAnalysis_DatabaseHelper(dbFilePath))
    {
        // Query the rules for the given invoice type
        using (var reader = dbHelper.GetRules(invoiceType))
        {
            // Read each rule into the dictionary; a repeated SearchPattern overwrites the earlier entry
            while (reader.Read())
            {
                string searchPattern = reader["SearchPattern"].ToString();
                string replacePattern = reader["ReplacePattern"].ToString();
                rules[searchPattern] = replacePattern;
            }
        }
    }

    return rules;
}

// Apply the replacement rules
private string ApplyReplaceRules_Db(string text, string invoiceType)
{
    // Load the rules from the database
    var rules = FromDbLoadReplaceRules(invoiceType);

    // Apply each rule as a literal (not regex) replacement
    foreach (var rule in rules)
    {
        text = text.Replace(rule.Key, rule.Value);
    }

    return text;
}
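
`FromDbLoadReplaceRules` keeps the rules in a `Dictionary`, whose enumeration order is not guaranteed, yet literal replacements can depend on order (replacing "a" with "b" and then "b" with "c" gives a different result than the reverse). The sketch below is one possible alternative, not the project's code: it reads the rules into an ordered list sorted by `Id`. `RuleStoreDemo`, `LoadRules`, and the temporary database file are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SQLite;
using System.IO;

public static class RuleStoreDemo
{
    // Reads all rules for one invoice type into memory, in insertion (Id) order.
    public static List<KeyValuePair<string, string>> LoadRules(string connectionString, string invoiceType)
    {
        var rules = new List<KeyValuePair<string, string>>();
        using (var connection = new SQLiteConnection(connectionString))
        {
            connection.Open();
            const string sql = "SELECT SearchPattern, ReplacePattern FROM FormatRules " +
                               "WHERE InvoiceType = @InvoiceType ORDER BY Id";
            using (var command = new SQLiteCommand(sql, connection))
            {
                command.Parameters.AddWithValue("@InvoiceType", invoiceType);
                using (SQLiteDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        rules.Add(new KeyValuePair<string, string>(reader.GetString(0), reader.GetString(1)));
                    }
                }
            }
        }
        return rules;
    }

    public static void Main()
    {
        // Throwaway database file so the sketch does not touch rules.db.
        string dbPath = Path.Combine(Path.GetTempPath(), "rules_demo.db");
        string cs = "Data Source=" + dbPath + ";Version=3;";
        using (var connection = new SQLiteConnection(cs))
        {
            connection.Open();
            string setup =
                "DROP TABLE IF EXISTS FormatRules;" +
                "CREATE TABLE FormatRules (Id INTEGER PRIMARY KEY AUTOINCREMENT, InvoiceType TEXT NOT NULL, " +
                "SearchPattern TEXT NOT NULL, ReplacePattern TEXT NOT NULL);" +
                "INSERT INTO FormatRules (InvoiceType, SearchPattern, ReplacePattern) " +
                "VALUES ('demo', 'a', 'b'), ('demo', 'b', 'c');";
            using (var command = new SQLiteCommand(setup, connection))
            {
                command.ExecuteNonQuery();
            }
        }

        string text = "a";
        foreach (var rule in LoadRules(cs, "demo"))
        {
            text = text.Replace(rule.Key, rule.Value);
        }
        Console.WriteLine(text); // prints "c": rule 1 turns "a" into "b", rule 2 turns "b" into "c"
    }
}
```

Because the reader is consumed and disposed inside `LoadRules`, callers never hold an open connection, which sidesteps the shared-connection handling in `GetRules`.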

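5. Third-Party Libraries and Usage Examples

1. PaddleOCR

Purpose: Baidu PaddlePaddle's open-source OCR engine, used to recognize text in images.

PaddleOCRSharp is an OCR library for .NET, developed by Guangzhou Yingtian Information Technology Co., Ltd. (广州英田信息科技有限公司) on top of Baidu PaddlePaddle. Its main features and technical characteristics are described below.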

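Main features

Multilingual text recognition:
- Recognizes text in several languages, including Chinese, English, Japanese, and Korean.
Vertical text detection and recognition:
- Recognizes vertically arranged text, which appears in documents and images such as posters and signs.
Handwriting recognition:
- Improves recognition accuracy for handwritten text such as notes, signatures, and handwritten forms.
Table recognition:
- Detects table structure in an image and converts it to structured data for further processing and analysis.
Scene text detection:
- Detects text in complex scenes such as street views, billboards, and natural scenes.
End-to-end recognition:
- Provides a complete pipeline from image input to text output, which simplifies the OCR workflow.
Accuracy and performance:
- Uses deep learning to achieve accurate recognition while keeping processing fast enough for real-time use.
Custom training and extension:
- Models can be trained and tuned on the user's own data to fit specific needs such as industry terms and proper nouns.
Batch processing:
- Supports recognizing text in many images at once, which suits large volumes of documents.
Easy integration:
- As a .NET library, it integrates with other .NET applications such as ASP.NET, WPF, and WinForms.

Technical characteristics

Based on deep learning:
- Uses deep learning algorithms, in particular convolutional neural networks (CNNs), for text feature extraction and recognition.
Cross-platform support:
- Building on the cross-platform nature of Baidu PaddlePaddle, PaddleOCRSharp can run on Windows, Linux, and macOS.
Pretrained models:
- Ships with several pretrained models for different scenarios, so a suitable model can be chosen for each application.
Flexible model export and deployment:
- Trained models can be exported to different formats for deployment in different environments.
Easy-to-use API:
- Offers a small, simple API so that OCR can be integrated quickly.
Documentation and community:
- Comes with development documentation and sample code, and benefits from the PaddlePaddle community for support and discussion.
Ongoing updates:
- Models and features are updated as deep learning techniques improve.

Application scenarios

Document digitization: converting paper documents to electronic text for storage, search, and editing.
Receipt and invoice recognition: extracting key information from invoices, receipts, and contracts to automate finance work.
License plate recognition: reading vehicle plates for traffic monitoring and parking management.
Information extraction: pulling key information such as addresses, contact details, and product data out of images for analysis.
Natural scene text recognition: recognizing text in natural environments for augmented reality (AR), image search, and navigation.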

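Sample code

The following is a simple example of text recognition with PaddleOCRSharp. The type and method names (PaddleOCREngine, OCRModelConfig, OCRParameter, DetectText, TextBlocks) are the ones used by the application code in this article; check them against the PaddleOCRSharp version you install.

using PaddleOCRSharp;
using System;
using System.Drawing;

class Program
{
    static void Main()
    {
        // A null model config asks the engine to use its default models;
        // set the paths explicitly (as in initOcrV4Config) to choose other models
        OCRModelConfig config = null;
        OCRParameter oCRParameter = new OCRParameter();

        // Create the OCR engine; the models are loaded at this point
        PaddleOCREngine engine = new PaddleOCREngine(config, oCRParameter);

        // Load the image to recognize (replace the path with a real file)
        using (Image image = Image.FromFile("path/to/image.jpg"))
        {
            // Run text detection and recognition
            OCRResult ocrResult = engine.DetectText(image);

            // Print the text of each recognized block
            foreach (var block in ocrResult.TextBlocks)
            {
                Console.WriteLine($"文本: {block.Text}");
            }
        }
    }
}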

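Notes

Model files: make sure the required pretrained model files are downloaded and configured correctly. They can be obtained from github.com/PaddlePaddl...
Performance: for large-scale use, consider optimizing the model, for example with quantization or pruning, to speed up recognition and reduce resource usage.
Custom training: the model can be fine-tuned on your own dataset to improve accuracy in a specific scenario.
Error handling: real applications should handle recognition failures and other exceptions.

With PaddleOCRSharp, .NET developers can add OCR to their own applications and improve the speed and accuracy of text processing.

Usage in this project: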

// Initialize the OCR engine configuration
public void initOcrV4Config()
{
    // Use the PP-OCRv4 recognition and detection models
    config.rec_infer = "inference\\ch_PP-OCRv4_rec_infer";
    config.det_infer = "inference\\ch_PP-OCRv4_det_infer";
    config.cls_infer = "inference\\ch_ppocr_mobile_v2.0_cls_infer";
    config.keys = "inference\\ppocr_keys.txt";
    engine = new PaddleOCREngine(config, oCRParameter);
}

// Recognize text locally with PaddleOCR
private void oneFileOcrAnalysisData(string fileName, string jpgName, System.Drawing.Image image)
{
    // Show the status
    label_Msg.Text = "正在OCR解析数据，请稍后。。。";
    label_Msg.Refresh();

    // Run the OCR engine on the image
    ocrResult = engine.DetectText(image);

    if (ocrResult != null)
    {
        // Show the raw recognition result
        richTextBox1.Text = ocrResult.Text + "\n";

        // Identify the invoice type
        string invoiceType = FData.GetInvoiceType(ocrResult.Text);

        // Join the text blocks with ";" as the separator
        string strLines = "";
        foreach (var item in ocrResult.TextBlocks)
        {
            strLines += item.Text + ";";
        }

        // Format the joined text
        string formatData = FData.RunAnalysisDataFormat(strPath, strFileName, jpgName, invoiceType, strLines, pds, label_Msg);
    }
}

2. Newtonsoft.Json

Purpose: JSON serialization and deserialization for API communication and data exchange.

Usage in this project:

// JSON serialization (sending data to the API)
public async Task<string> SendImageBase64ToServer(string base64)
{
    // Build the data object
    var data = new { Base64 = base64 };

    // Serialize it to a JSON string
    string jsonData = JsonConvert.SerializeObject(data);

    // Wrap it as HTTP content
    var content = new StringContent(jsonData, Encoding.UTF8, "application/json");

    // Send the request
    HttpResponseMessage response = await client.PostAsync(apiServerPath + ":" + apiServerPort + "/ocr", content);

    // Return the response body
    return await response.Content.ReadAsStringAsync();
}

// JSON deserialization (parsing the API response)
private async Task allFileOcrAnalysisDataApi(string fileName, string jpgName, System.Drawing.Image image)
{
    // ... request code ...

    // Receive the API response
    string response = await GoApiServer.SendImageBase64ToServer(base64);

    // Deserialize it into an object
    OcrResultText result = JsonConvert.DeserializeObject<OcrResultText>(response);

    // Process the result
    string invoiceType = FData.GetInvoiceType(result.Text);
    // ...
}

// Parse the JSON printed by the Python script
ResponseData responseData = JsonConvert.DeserializeObject<ResponseData>(output);
List<string> imageFiles = responseData.Data;

3. System.Data.SQLite

Purpose: the .NET interface to SQLite, used for database operations.

Usage in this project:

// Database query
public SQLiteDataReader GetRules(string invoiceType)
{
    // Open the connection if needed
    if (connection.State != ConnectionState.Open)
    {
        connection.Open();
    }

    // Prepare the parameterized query
    string selectQuery = "SELECT * FROM FormatRules WHERE InvoiceType = @InvoiceType";
    var command = new SQLiteCommand(selectQuery, connection);
    command.Parameters.AddWithValue("@InvoiceType", invoiceType);

    // Execute the query and return the data reader.
    // CommandBehavior.CloseConnection closes the connection when the reader is closed
    return command.ExecuteReader(System.Data.CommandBehavior.CloseConnection);
}

4. MiniExcel

Purpose: a lightweight Excel library, used to export data to Excel files.

Usage in this project:

// Export the data to Excel
public void FormatDataToExecl()
{
    // Convert each formatted string into an ExportData object
    foreach (var formatData in formatDatas)
    {
        ExportData ed = new ExportData();
        // Parse the formatted data and fill the ed object
        parseFormatData(formatData, ed);
        eds.Add(ed);
    }

    // Ask the user where to save the file
    using (SaveFileDialog saveFileDialog = new SaveFileDialog())
    {
        saveFileDialog.Filter = "Excel文件|*.xlsx";
        saveFileDialog.Title = "导出Excel文件";
        saveFileDialog.FileName = "invoice.xlsx";

        // Continue only if the user confirmed
        if (saveFileDialog.ShowDialog() == DialogResult.OK)
        {
            string filePath = saveFileDialog.FileName;
            // Write the list to the Excel file; one row per ExportData object
            MiniExcel.SaveAs(filePath, eds);
            label_ExportDataMsg.Text = $"导出成功：{filePath}";
        }
    }
}

5. External programs

The project calls several external tools:

pdftojpg.exe

Purpose: converts PDF files to image files.

Usage in this project:

// Run the external program pdftojpg.exe on a PDF file
public List<string> PdfToJpgUsingExe()
{
    // Create the output directory
    string outputDir = Path.GetDirectoryName(fileName) + "\\img\\";
    Directory.CreateDirectory(outputDir);

    using (Process process = new Process())
    {
        // Configure the process
        process.StartInfo.FileName = "pdftojpg.exe";
        process.StartInfo.Arguments = $"{fileName} {outputDir}";
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.RedirectStandardError = true;

        // Start the process and read both streams before waiting.
        // stderr is read asynchronously so that a full stderr pipe cannot block
        // the child while this thread is still reading stdout
        process.Start();
        Task<string> errorTask = process.StandardError.ReadToEndAsync();
        string output = process.StandardOutput.ReadToEnd();
        string error = errorTask.Result;
        process.WaitForExit();

        // Log whatever the program printed
        if (!string.IsNullOrEmpty(output))
        {
            richTextBox.AppendText(output + Environment.NewLine);
        }
        if (!string.IsNullOrEmpty(error))
        {
            richTextBox.AppendText(error + Environment.NewLine);
        }
    }

    // Return the generated image files
    return Directory.GetFiles(outputDir, "*.jpg").ToList();
}

Python script ofdToJpgCsRun.py

Purpose: converts OFD files to image files.

Usage in this project:

// Run the Python script on an OFD file
public List<string> OfdToJpgUsingPython()
{
    // Locate the Python interpreter
    Utils utils = new Utils();
    string pythonPath = utils.GetPythonPath();

    if (string.IsNullOrEmpty(pythonPath))
    {
        throw new Exception("未找到Python解释器，请确保已安装Python并添加到环境变量");
    }

    // Create the output directory
    string outputDir = Path.GetDirectoryName(fileName) + "\\img\\";
    Directory.CreateDirectory(outputDir);

    using (Process process = new Process())
    {
        // Configure the process
        process.StartInfo.FileName = pythonPath;
        process.StartInfo.Arguments = $"ofdToJpgCsRun.py {fileName} {outputDir}";
        process.StartInfo.UseShellExecute = false;
        process.StartInfo.RedirectStandardOutput = true;
        process.StartInfo.RedirectStandardError = true;

        // Start the process and read both streams before waiting;
        // stderr is read asynchronously to avoid blocking on a full pipe
        process.Start();
        Task<string> errorTask = process.StandardError.ReadToEndAsync();
        string output = process.StandardOutput.ReadToEnd();
        string error = errorTask.Result;
        process.WaitForExit();

        // Log any error output
        if (!string.IsNullOrEmpty(error))
        {
            richTextBox.AppendText(error + Environment.NewLine);
        }

        // Parse the JSON file list printed by the Python script
        ResponseData responseData = JsonConvert.DeserializeObject<ResponseData>(output);
        return responseData?.Data ?? new List<string>();
    }
}

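6. Key Implementation Details

6.1 Invoice type identification

The system identifies the invoice type by keyword matching on the OCR text. The checks run in order and the first match wins:

// Determine the invoice type
public static string GetInvoiceType(string strLines)
{
    // VAT special invoice (the second literal covers OCR output with spaces between characters)
    if (strLines.Contains("增值税专用发票") || strLines.Contains("增值 税 专 用 发 票"))
    {
        return "增值税专用发票";
    }

    // VAT ordinary invoice
    if (strLines.Contains("增值税普通发票") || strLines.Contains("增值 税 普 通 发 票"))
    {
        return "增值税普通发票";
    }

    // Shenzhen ordinary invoice
    if (strLines.Contains("深圳增值税电子普通发票"))
    {
        return "深圳普通发票";
    }

    // Railway e-ticket
    if (strLines.Contains("铁路旅客运输"))
    {
        return "铁路电子客票";
    }

    // Air transport e-ticket itinerary
    if (strLines.Contains("航空运输电子客票行程单"))
    {
        return "航空运输电客票";
    }

    // Social organization receipt
    if (strLines.Contains("社会团体会费统一收据"))
    {
        return "社会团体发票";
    }

    // Fall back to the generic ordinary invoice
    return "普通发票";
}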
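
`GetInvoiceType` handles spaced-out OCR text by listing one specific spaced literal per keyword. A more general option, shown here as a sketch rather than the project's approach, is to strip all whitespace from the OCR text before matching. `InvoiceTypeMatcher`, `StripWhitespace`, and `ContainsIgnoringSpaces` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class InvoiceTypeMatcher
{
    // Removes every whitespace character, including full-width spaces.
    public static string StripWhitespace(string text)
    {
        return new string(text.Where(c => !char.IsWhiteSpace(c)).ToArray());
    }

    // True when the keyword occurs in the text once whitespace is ignored.
    public static bool ContainsIgnoringSpaces(string ocrText, string keyword)
    {
        return StripWhitespace(ocrText).Contains(keyword);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsIgnoringSpaces("增 值 税 专 用 发 票", "增值税专用发票")); // True
        Console.WriteLine(ContainsIgnoringSpaces("铁路电子客票", "增值税专用发票"));         // False
    }
}
```

With this approach any spacing variant produced by the OCR engine matches, at the cost of also joining words that were genuinely separate in the source image.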

6.2 Data formatting and key-field extraction

A different formatting method is applied for each invoice type to extract the key fields:

// Run the formatting method that matches the invoice type
public string RunAnalysisDataFormat(string strPath, string strFileName, string strJpgName, string invoiceType, string strLines, List<ProvinceData> pds, Label label_Msg)
{
    string formatData = "";

    // Dispatch on the type string returned by GetInvoiceType
    switch (invoiceType)
    {
        case "普通发票":          // Ordinary invoice
            formatData = Pt(strPath, strFileName, strJpgName, strLines);
            break;

        case "增值税普通发票":    // VAT ordinary invoice
            formatData = ZzsPt(strPath, strFileName, strJpgName, strLines, pds);
            break;

        case "增值税专用发票":    // VAT special invoice
            formatData = ZzsZy(strPath, strFileName, strJpgName, strLines, pds);
            break;

        case "深圳普通发票":      // Shenzhen ordinary invoice
            formatData = SzPt(strPath, strFileName, strJpgName, strLines, pds);
            break;

        case "铁路电子客票":      // Railway e-ticket
            formatData = Tl(strPath, strFileName, strJpgName, strLines);
            break;

        case "航空运输电客票":    // Air transport e-ticket
            formatData = Hk(strPath, strFileName, strJpgName, strLines);
            break;

        case "社会团体发票":      // Social organization receipt
            formatData = Shtt(strPath, strFileName, strJpgName, strLines);
            break;

        default:
            // Unknown types use the ordinary invoice formatter
            formatData = Pt(strPath, strFileName, strJpgName, strLines);
            break;
    }

    return formatData;
}

6.3 Tax rate identification

// Extract the tax rate
private string getSl(string strLines)
{
    // Common tax rates, checked in this order; the first one found wins
    string[] rates = { "1%", "3%", "5%", "6%", "9%", "13%" };

    foreach (string rate in rates)
    {
        // A rate counts only when it starts a text block, that is, directly after ";"
        if (strLines.Contains(";" + rate))
        {
            // Tagged output, e.g. ";▲税率■13%;"
            return ";▲税率■" + rate + ";";
        }
    }

    // No known rate found
    return "";
}
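
The tagged output of `getSl` (`;▲税率■13%;`) suggests that formatted data is a sequence of `▲name■value` segments separated by `;`. The article does not show `parseFormatData`, so the following is only a sketch of how such a string could be parsed, assuming exactly that layout. `FormattedDataParser` and `Parse` are hypothetical names, and the sample string is invented.

```csharp
using System;
using System.Collections.Generic;

public static class FormattedDataParser
{
    // Splits on ';', keeps segments shaped like "▲name■value", and returns name -> value.
    public static Dictionary<string, string> Parse(string formatted)
    {
        var fields = new Dictionary<string, string>();
        string[] segments = formatted.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string segment in segments)
        {
            int marker = segment.IndexOf('■');
            if (segment[0] != '▲' || marker < 0)
            {
                continue; // not a tagged field
            }
            string name = segment.Substring(1, marker - 1);
            string value = segment.Substring(marker + 1);
            fields[name] = value; // a later duplicate overwrites an earlier one
        }
        return fields;
    }

    public static void Main()
    {
        var fields = Parse(";▲税率■13%;▲金额■100.00;untagged text;");
        Console.WriteLine(fields["税率"]); // prints "13%"
        Console.WriteLine(fields["金额"]); // prints "100.00"
        Console.WriteLine(fields.Count);   // prints "2"
    }
}
```

A dictionary keyed by field name maps naturally onto the Chinese property names of `ExportData`.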

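6.4 Asynchronous batch processing

The batch runs on a background thread so that the UI stays responsive. Within the batch, the files are still processed one after another, because each OCR call is awaited before the next one starts:

// Process the images on a background thread, one after another
await Task.Run(async () =>
{
    foreach (var item in items)
    {
        // Update the UI status on the UI thread
        label_Msg.Invoke((Action)(() =>
        {
            label_Msg.Text = $"正在OCR解析【{item.ToString()}】数据，请稍后......";
            label_Msg.Refresh();
        }));

        // Load the image and release it once the call completes
        var imagebyte = File.ReadAllBytes(item.ToString());
        using (var stream = new MemoryStream(imagebyte))
        using (Bitmap bitmap = new Bitmap(stream))
        {
            // Call the API for OCR
            await allFileOcrAnalysisDataApi(item.ToString(), item.ToString(), bitmap);
        }
    }
});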
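
If the Go API service can handle several requests at once, the batch could be parallelized with a bounded number of concurrent calls. The sketch below uses `SemaphoreSlim` for this; it is an alternative to the loop above, not what the project does. `BoundedBatch`, `RunAsync`, the limit of 3, and the `Task.Delay` stand-in for the OCR request are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedBatch
{
    // Runs work(item) for every item with at most maxParallel calls in flight.
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxParallel, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            List<Task> tasks = items.Select(async item =>
            {
                await gate.WaitAsync();
                try { await work(item); }
                finally { gate.Release(); }
            }).ToList();

            await Task.WhenAll(tasks);
        }
    }

    public static void Main()
    {
        int done = 0;
        List<string> files = Enumerable.Range(1, 10).Select(i => "invoice_" + i + ".jpg").ToList();

        RunAsync(files, 3, async file =>
        {
            await Task.Delay(50);             // stands in for the OCR request
            Interlocked.Increment(ref done);  // thread-safe counter update
        }).GetAwaiter().GetResult();

        Console.WriteLine(done); // prints "10"
    }
}
```

With parallel calls, shared state such as the `formatDatas` list would need a lock or a concurrent collection, and results may complete out of order.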

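7. Summary

The electronic invoice parsing tool is a full-featured desktop application that combines several technologies to process, parse, and manage electronic invoices in batch. It supports multiple file formats (PDF, OFD) and invoice types (VAT special invoices, ordinary invoices, train tickets, and others), and relies on a Go API service for OCR.

Main technical highlights:

Multiple formats: supports electronic invoices in both PDF and OFD format
Asynchronous batch processing: large batches run on a background thread, so the UI stays responsive while files are processed
External service integration: OCR is delegated to a Go API service
Flexible data processing: different parsing rules are applied for each invoice type
Data persistence: parsing rules and processing results are stored in a SQLite database
Usable interface: an intuitive UI with live progress display

The tool helps companies and individuals process large numbers of electronic invoices, extract the key information, and reduce manual work.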