Scraping the IdentityServer4 documentation with HtmlAgilityPack + PuppeteerSharp + iText7

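I need to learn how to use IdentityServer4, but the IdentityServer4 documentation site (reference [1]) does not offer a downloadable offline version. So I decided to scrape the site with HtmlAgilityPack + PuppeteerSharp + iText7 and generate an offline PDF that is easy to read and search locally.
The first step is to analyze the page structure. The left-hand navigation menu in the HTML of the documentation home page is organized as follows:
1) The whole navigation menu sits inside a div element with the class wy-menu wy-menu-vertical;
2) The name of each first-level menu entry sits in a p element with the class caption;
3) The second-level entries of a first-level entry follow immediately after its p element, inside a ul element, one per li element with the class toctree-l1. The li elements with the class toctree-l2 hold the in-page navigation one level further down and can be ignored.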
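
The three observations above can be tried out without touching the network. The following is only a minimal sketch: the sample markup is a hypothetical, well-formed snippet that I made up to mimic the structure described above, and it uses System.Xml.Linq from the base class library instead of HtmlAgilityPack so that it runs without extra packages. The names MenuSketch, SampleHtml and Parse are illustrative, not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MenuSketch
{
    // Hypothetical sample that mimics the menu structure (not copied from the real site).
    public const string SampleHtml = @"
<div class='wy-menu wy-menu-vertical'>
  <p class='caption'>Introduction</p>
  <ul>
    <li class='toctree-l1'><a href='intro/big_picture.html'>The Big Picture</a>
      <ul><li class='toctree-l2'><a href='intro/big_picture.html#x'>In-page link</a></li></ul>
    </li>
    <li class='toctree-l1'><a href='intro/terminology.html'>Terminology</a></li>
  </ul>
  <p class='caption'>Quickstarts</p>
  <ul>
    <li class='toctree-l1'><a href='quickstarts/0_overview.html'>Overview</a></li>
  </ul>
</div>";

    // Returns (module, title, href) triples in document order.
    public static List<(string Module, string Title, string Href)> Parse(string html)
    {
        var result = new List<(string Module, string Title, string Href)>();
        string module = string.Empty;

        foreach (XElement child in XElement.Parse(html.Trim()).Elements())
        {
            string cls = (string)child.Attribute("class") ?? string.Empty;

            if (child.Name.LocalName == "p" && cls == "caption")
            {
                // A caption starts a new first-level group.
                module = child.Value.Trim();
            }
            else if (child.Name.LocalName == "ul")
            {
                // Only the direct toctree-l1 items; nested toctree-l2 items are skipped.
                foreach (XElement li in child.Elements("li")
                             .Where(e => (string)e.Attribute("class") == "toctree-l1"))
                {
                    XElement a = li.Element("a");
                    if (a != null)
                    {
                        result.Add((module, a.Value.Trim(), (string)a.Attribute("href")));
                    }
                }
            }
        }

        return result;
    }
}
```

Calling MenuSketch.Parse(MenuSketch.SampleHtml) yields three entries, two under Introduction and one under Quickstarts, and the toctree-l2 link is left out. Real pages are rarely well-formed XML, which is why the actual program relies on HtmlAgilityPack.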

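Based on this structure, I adapted the program I wrote earlier for scraping the SqlSugar documentation. The main code and the result of running it are shown below: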

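```csharp
// Load the home page and locate the navigation menu container.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument docu = web.Load(txtUrl.Text);
HtmlNode node = docu.DocumentNode.SelectSingleNode(@"//div[@class='wy-menu wy-menu-vertical']");

if (node == null)
{
    throw new InvalidOperationException("Navigation menu not found.");
}

// Relative hrefs are resolved against this base address.
Uri baseUri = new Uri(@"https://identityserver4.readthedocs.io/en/latest/");
HtmlNodeCollection tmpNode;
string curClass = string.Empty;

foreach (HtmlNode subNode in node.ChildNodes)
{
    string className = subNode.GetAttributeValue("class", string.Empty);

    // A p.caption element starts a new first-level menu entry.
    if ((subNode.Name == "p") && (className == "caption"))
    {
        curClass = subNode.InnerText.Trim();
    }

    // The ul that follows holds the second-level entries of that menu.
    if (subNode.Name == "ul")
    {
        tmpNode = subNode.SelectNodes(".//li[@class='toctree-l1']/a[1]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (tmpNode == null)
        {
            continue;
        }

        foreach (HtmlNode n in tmpNode)
        {
            m_urls.Add(new LinkInfo { Module = curClass, Name = n.InnerText.Trim(), Url = new Uri(baseUri, n.Attributes["href"].Value).AbsoluteUri });
            ...
            ...
        }
    }
}
```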
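
Building the absolute address from the base URL and each href is the step most likely to go wrong when an href starts with "./" or is given relative to index.html. As a small sketch of one way to handle this, System.Uri can resolve a relative href against the address of the page it was found on. The class and method names here (LinkSketch, Resolve) are hypothetical, and the paths are only examples.

```csharp
using System;

public static class LinkSketch
{
    // Resolves an href found on pageUrl into an absolute URL,
    // following the usual relative-reference rules (RFC 3986).
    public static string Resolve(string pageUrl, string href)
    {
        Uri page = new Uri(pageUrl);
        return new Uri(page, href).AbsoluteUri;
    }
}
```

For example, resolving either intro/big_picture.html or ./intro/big_picture.html against https://identityserver4.readthedocs.io/en/latest/index.html gives the same address under /en/latest/intro/, so no manual trimming of leading dots is needed.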

Next comes the code that renders each page to its own PDF file, together with the result:

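```csharp
var options = new LaunchOptions { Headless = true };

// Download a compatible browser on first run. Depending on the PuppeteerSharp
// version, BrowserFetcher may or may not be IDisposable; a plain "var" compiles either way.
var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(options);

// Make sure the output folder exists before writing into it.
string outputDir = Path.Combine(Directory.GetCurrentDirectory(), "papers");
Directory.CreateDirectory(outputDir);

// The same page settings are used for every document: A4, landscape.
PdfOptions option = new PdfOptions
{
    Format = PuppeteerSharp.Media.PaperFormat.A4,
    Landscape = true
};

foreach (LinkInfo url in m_urls)
{
    // "await using" disposes the page at the end of each iteration,
    // so no explicit DisposeAsync call is needed.
    await using var page = await browser.NewPageAsync();
    await page.GoToAsync(url.Url);

    // File name pattern: <Module>_<Name>.pdf, with '/' replaced by '_'.
    string fileName = ($"{url.Module}_{url.Name}.pdf").Replace('/', '_');
    await page.PdfAsync(Path.Combine(outputDir, fileName), option);
}

MessageBox.Show("Finished generating the PDF files!");
```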
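
The file name is built from the module and page titles, and the code above only replaces '/'. Titles may contain other characters that the file system rejects. As a hedged sketch of a more general clean-up (ToSafeFileName and FileNameSketch are hypothetical names of mine, not part of the original program), every character reported by Path.GetInvalidFileNameChars can be replaced in one pass:

```csharp
using System;
using System.IO;
using System.Linq;

public static class FileNameSketch
{
    // Builds "<module>_<name>.pdf" and replaces every character that the
    // current platform does not allow in file names with '_'.
    public static string ToSafeFileName(string module, string name)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        string raw = $"{module}_{name}.pdf";
        return new string(raw.Select(c => invalid.Contains(c) ? '_' : c).ToArray());
    }
}
```

The set of invalid characters differs between platforms (Windows rejects more characters than Linux), so the result can vary by operating system. Note also that the merge step later splits file names at the first '_', so a module title that itself contains '_' would still need separate handling.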

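The last step uses iText7 to merge all the PDF files into a single IdentityServer4 document with bookmarks; the code and the result follow. The generated document has been uploaded to my CSDN blog resources, so feel free to download it if you need it.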

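```csharp
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(txtFileName.Text));
PdfMerger merger = new PdfMerger(pdfDoc);
// Keep the source documents open; their page counts are needed again below.
merger.SetCloseSourceDocuments(false);

// The files must be ordered so that all pages of one module are adjacent.
List<PdfFileInfo> pdfFiles = GetSourceDocuments();

// Pass 1: append every page of every source document.
foreach (PdfFileInfo doc in pdfFiles)
{
    merger.Merge(doc.docu, 1, doc.docu.GetNumberOfPages());
}

// Pass 2: build a two-level outline (module -> page title).
PdfOutline rootOutline = pdfDoc.GetOutlines(false);
PdfOutline tmpOutline = null;
PdfOutline tmpSubOutline = null;
int curPageIndex = 1;   // first page of the current file in the merged document
string tmpModule = null;

foreach (PdfFileInfo doc in pdfFiles)
{
    // File names follow the pattern <Module>_<Name>; split at the first '_'.
    string fileName = doc.FileName;
    int underlineIndex = fileName.IndexOf('_');
    string module = underlineIndex < 0 ? fileName : fileName.Substring(0, underlineIndex);

    // Compare whole module names so that one module being a prefix of
    // another cannot put a file into the wrong group.
    if (module != tmpModule)
    {
        tmpModule = module;
        tmpOutline = rootOutline.AddOutline(tmpModule);
        tmpOutline.AddDestination(PdfExplicitDestination.CreateFit(pdfDoc.GetPage(curPageIndex)));
    }

    tmpSubOutline = tmpOutline.AddOutline(fileName.Substring(underlineIndex + 1));
    tmpSubOutline.AddDestination(PdfExplicitDestination.CreateFit(pdfDoc.GetPage(curPageIndex)));
    curPageIndex += doc.docu.GetNumberOfPages();
}

pdfDoc.Close();

// Close the source documents only after the merged document has been written.
foreach (PdfFileInfo doc in pdfFiles)
{
    doc.docu.Close();
}
```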
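
The bookmark logic comes down to two pieces of bookkeeping: splitting each file name into module and title at the first underscore, and keeping a running page index so that every bookmark points at the first page of its file in the merged document. The sketch below isolates just that arithmetic, without iText7. OutlineSketch and Plan are hypothetical names, and the file names and page counts in the usage note are made-up inputs.

```csharp
using System;
using System.Collections.Generic;

public static class OutlineSketch
{
    // For each (file name, page count) pair, returns the module, the bookmark
    // title, and the 1-based page in the merged document where the file starts.
    public static List<(string Module, string Title, int StartPage)> Plan(
        IEnumerable<(string FileName, int PageCount)> files)
    {
        var plan = new List<(string Module, string Title, int StartPage)>();
        int page = 1;

        foreach (var (fileName, pageCount) in files)
        {
            int idx = fileName.IndexOf('_');
            // Without an underscore the whole name serves as both module and title.
            string module = idx < 0 ? fileName : fileName.Substring(0, idx);
            string title = fileName.Substring(idx + 1);

            plan.Add((module, title, page));
            page += pageCount;   // the next file starts right after this one
        }

        return plan;
    }
}
```

With the made-up inputs ("Introduction_The Big Picture", 3), ("Introduction_Terminology", 2) and ("Quickstarts_Overview", 4), the start pages come out as 1, 4 and 6. A new top-level bookmark is needed whenever the module differs from the previous entry, which is why the files have to be processed in module order.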

References:

[1] https://identityserver4.readthedocs.io/en/latest/index.html
[2] https://blog.csdn.net/Gltu_java/article/details/142656171