Firecrawl Source Code: A Walkthrough of the WebScraperDataProvider Class and Its Practical Use

"深入解析WebScraperDataProvider:打造高效网页抓取工具" ------ 本文将带你深入探讨WebScraperDataProvider类的源码,解析其如何通过不同的抓取模式(单URL、站点地图、爬取模式)实现高效网页数据抓取。我们将逐一分析关键功能,助你掌握网页抓取的核心技术,轻松应对各类数据采集需求。

typescript
import {
  Document,
  ExtractorOptions,
  PageOptions,
  WebScraperOptions,
} from "../../lib/entities";
import { Progress } from "../../lib/entities";
import { scrapSingleUrl } from "./single_url";
import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
import { WebCrawler } from "./crawler";
import { getValue, setValue } from "../../services/redis";
import { getImageDescription } from "./utils/imageDescription";
import { fetchAndProcessPdf } from "./utils/pdfProcessor";
import {
  replaceImgPathsWithAbsolutePaths,
  replacePathsWithAbsolutePaths,
} from "./utils/replacePaths";
import { generateCompletions } from "../../lib/LLM-extraction";
import { getWebScraperQueue } from "../../../src/services/queue-service";
import { fetchAndProcessDocx } from "./utils/docxProcessor";
import { getAdjustedMaxDepth, getURLDepth } from "./utils/maxDepthUtils";
import { Logger } from "../../lib/logger";
import { ScrapeEvents } from "../../lib/scrape-events";

export class WebScraperDataProvider {
  private jobId: string;
  private bullJobId: string;
  private urls: string[] = [""];
  private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
  private includes: string | string[];
  private excludes: string | string[];
  private maxCrawledLinks: number;
  private maxCrawledDepth: number = 10;
  private returnOnlyUrls: boolean;
  private limit: number = 10000;
  private concurrentRequests: number = 20;
  private generateImgAltText: boolean = false;
  private ignoreSitemap: boolean = false;
  private pageOptions?: PageOptions;
  private extractorOptions?: ExtractorOptions;
  private replaceAllPathsWithAbsolutePaths?: boolean = false;
  private generateImgAltTextModel: "gpt-4-turbo" | "claude-3-opus" =
    "gpt-4-turbo";
  private crawlerMode: string = "default";
  private allowBackwardCrawling: boolean = false;
  private allowExternalContentLinks: boolean = false;

  authorize(): void {
    throw new Error("Method not implemented.");
  }

  authorizeNango(): Promise<void> {
    throw new Error("Method not implemented.");
  }

  /**
   * Converts a list of URLs into Documents in batches of `concurrentRequests`,
   * reporting progress through the optional callback and aborting early if the
   * associated Bull job has failed or been cancelled (crawl mode only).
   */
  private async convertUrlsToDocuments(
    urls: string[],
    inProgress?: (progress: Progress) => void,
    allHtmls?: string[]
  ): Promise<Document[]> {
    const totalUrls = urls.length;
    let processedUrls = 0;

    const results: (Document | null)[] = new Array(urls.length).fill(null);
    for (let i = 0; i < urls.length; i += this.concurrentRequests) {
      const batchUrls = urls.slice(i, i + this.concurrentRequests);
      await Promise.all(
        batchUrls.map(async (url, index) => {
          const existingHTML = allHtmls ? allHtmls[i + index] : "";
          const result = await scrapSingleUrl(
            this.jobId,
            url,
            this.pageOptions,
            this.extractorOptions,
            existingHTML
          );
          processedUrls++;
          if (inProgress) {
            inProgress({
              current: processedUrls,
              total: totalUrls,
              status: "SCRAPING",
              currentDocumentUrl: url,
              currentDocument: { ...result, index: processedUrls },
            });
          }

          results[i + index] = result;
        })
      );
      try {
        if (this.mode === "crawl" && this.bullJobId) {
          const job = await getWebScraperQueue().getJob(this.bullJobId);
          const jobStatus = await job.getState();
          if (jobStatus === "failed") {
            Logger.info(
              "Job has failed or has been cancelled by the user. Stopping the job..."
            );
            return [] as Document[];
          }
        }
      } catch (error) {
        Logger.error(error.message);
        return [] as Document[];
      }
    }
    return results.filter((result) => result !== null) as Document[];
  }

  /**
   * Entry point: validates the initial URL, then either reuses cached
   * documents or scrapes fresh ones depending on `useCaching`.
   */
  async getDocuments(
    useCaching: boolean = false,
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    this.validateInitialUrl();
    if (!useCaching) {
      return this.processDocumentsWithoutCache(inProgress);
    }

    return this.processDocumentsWithCache(inProgress);
  }

  private validateInitialUrl(): void {
    if (this.urls[0].trim() === "") {
      throw new Error("Url is required");
    }
  }

  /**
   * Process documents without cache handling each mode
   * @param inProgress inProgress
   * @returns documents
   */
  private async processDocumentsWithoutCache(
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    switch (this.mode) {
      case "crawl":
        return this.handleCrawlMode(inProgress);
      case "single_urls":
        return this.handleSingleUrlsMode(inProgress);
      case "sitemap":
        return this.handleSitemapMode(inProgress);
      default:
        return [];
    }
  }

  private async cleanIrrelevantPath(links: string[]) {
    return links.filter((link) => {
      const normalizedInitialUrl = new URL(this.urls[0]);
      const normalizedLink = new URL(link);

      // Normalize the hostname to account for www and non-www versions
      const initialHostname = normalizedInitialUrl.hostname.replace(
        /^www\./,
        ""
      );
      const linkHostname = normalizedLink.hostname.replace(/^www\./, "");

      // Ensure the protocol and hostname match, and the path starts with the initial URL's path
      return (
        linkHostname === initialHostname &&
        normalizedLink.pathname.startsWith(normalizedInitialUrl.pathname)
      );
    });
  }

  /**
   * Crawl mode: normalizes include/exclude patterns, runs the WebCrawler from
   * the initial URL, then either returns the discovered URLs only or scrapes
   * them (reusing pre-fetched HTML when crawlerMode is "fast").
   */
  private async handleCrawlMode(
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    let includes: string[];
    if (Array.isArray(this.includes)) {
      if (this.includes[0] != "") {
        includes = this.includes;
      }
    } else {
      includes = this.includes.split(',');
    }

    let excludes: string[];
    if (Array.isArray(this.excludes)) {
      if (this.excludes[0] != "") {
        excludes = this.excludes;
      }
    } else {
      excludes = this.excludes.split(',');
    }

    const crawler = new WebCrawler({
      jobId: this.jobId,
      initialUrl: this.urls[0],
      includes,
      excludes,
      maxCrawledLinks: this.maxCrawledLinks,
      maxCrawledDepth: getAdjustedMaxDepth(this.urls[0], this.maxCrawledDepth),
      limit: this.limit,
      generateImgAltText: this.generateImgAltText,
      allowBackwardCrawling: this.allowBackwardCrawling,
      allowExternalContentLinks: this.allowExternalContentLinks,
    });

    let links = await crawler.start(
      inProgress,
      this.pageOptions,
      {
        ignoreSitemap: this.ignoreSitemap,
      },
      5,
      this.limit,
      this.maxCrawledDepth
    );

    let allLinks = links.map((e) => e.url);
    const allHtmls = links.map((e) => e.html);

    if (this.returnOnlyUrls) {
      return this.returnOnlyUrlsResponse(allLinks, inProgress);
    }

    let documents = [];
    // check if fast mode is enabled and there is html inside the links
    if (this.crawlerMode === "fast" && links.some((link) => link.html)) {
      documents = await this.processLinks(allLinks, inProgress, allHtmls);
    } else {
      documents = await this.processLinks(allLinks, inProgress);
    }

    return this.cacheAndFinalizeDocuments(documents, allLinks);
  }

  private async handleSingleUrlsMode(
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    const links = this.urls;

    let documents = await this.processLinks(links, inProgress);
    return documents;
  }

  private async handleSitemapMode(
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    let links = await getLinksFromSitemap({ sitemapUrl: this.urls[0] });
    links = await this.cleanIrrelevantPath(links);

    if (this.returnOnlyUrls) {
      return this.returnOnlyUrlsResponse(links, inProgress);
    }

    let documents = await this.processLinks(links, inProgress);
    return this.cacheAndFinalizeDocuments(documents, links);
  }

  private async returnOnlyUrlsResponse(
    links: string[],
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    inProgress?.({
      current: links.length,
      total: links.length,
      status: "COMPLETED",
      currentDocumentUrl: this.urls[0],
    });
    return links.map((url) => ({
      content: "",
      html: this.pageOptions?.includeHtml ? "" : undefined,
      markdown: "",
      metadata: { sourceURL: url, pageStatusCode: 200 },
    }));
  }

  // Process the collected links: regular pages, PDFs, and DOCX files
  private async processLinks(
    links: string[],
    inProgress?: (progress: Progress) => void,
    allHtmls?: string[]
  ): Promise<Document[]> {
    // Collect links that end in .pdf
    const pdfLinks = links.filter((link) => link.endsWith(".pdf"));
    // Collect links that end in .doc or .docx
    const docLinks = links.filter(
      (link) => link.endsWith(".doc") || link.endsWith(".docx")
    );

    // Fetch PDF and DOCX documents in parallel
    const [pdfDocuments, docxDocuments] = await Promise.all([
      this.fetchPdfDocuments(pdfLinks),
      this.fetchDocxDocuments(docLinks),
    ]);

    // Drop the PDF and DOCX links that have already been fetched
    links = links.filter(
      (link) => !pdfLinks.includes(link) && !docLinks.includes(link)
    );

    // Fetch documents for the remaining links and the sitemap data in parallel
    let [documents, sitemapData] = await Promise.all([
      this.convertUrlsToDocuments(links, inProgress, allHtmls),
      this.mode === "single_urls" && links.length > 0
        ? this.getSitemapDataForSingleUrl(this.urls[0], links[0], 1500).catch(
            (error) => {
              Logger.debug(`Failed to fetch sitemap data: ${error}`);
              return null;
            }
          )
        : Promise.resolve(null),
    ]);

    // In single_urls mode, attach the sitemap data to the first document (if any)
    if (this.mode === "single_urls" && documents.length > 0) {
      documents[0].metadata.sitemap = sitemapData ?? undefined;
    } else {
      // Otherwise, enrich all documents with sitemap data
      documents = await this.getSitemapData(this.urls[0], documents);
    }

    // Apply path replacements (relative paths -> absolute paths)
    documents = this.applyPathReplacements(documents);
    // documents = await this.applyImgAltText(documents);
    // In llm-extraction or llm-extraction-from-markdown mode (single_urls only), generate LLM completions from the markdown
    if (
      (this.extractorOptions.mode === "llm-extraction" ||
        this.extractorOptions.mode === "llm-extraction-from-markdown") &&
      this.mode === "single_urls"
    ) {
      documents = await generateCompletions(
        documents,
        this.extractorOptions,
        "markdown"
      );
    }
    // In llm-extraction-from-raw-html mode (single_urls only), generate LLM completions from the raw HTML
    if (
      this.extractorOptions.mode === "llm-extraction-from-raw-html" &&
      this.mode === "single_urls"
    ) {
      documents = await generateCompletions(
        documents,
        this.extractorOptions,
        "raw-html"
      );
    }
    // Return the page documents merged with the PDF and DOCX documents
    return documents.concat(pdfDocuments).concat(docxDocuments);
  }

  private async fetchPdfDocuments(pdfLinks: string[]): Promise<Document[]> {
    return Promise.all(
      pdfLinks.map(async (pdfLink) => {
        const timer = Date.now();
        const logInsertPromise = ScrapeEvents.insert(this.jobId, {
          type: "scrape",
          url: pdfLink,
          worker: process.env.FLY_MACHINE_ID,
          method: "pdf-scrape",
          result: null,
        });

        const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(
          pdfLink,
          this.pageOptions.parsePDF
        );

        const insertedLogId = await logInsertPromise;
        ScrapeEvents.updateScrapeResult(insertedLogId, {
          response_size: content.length,
          success: !(pageStatusCode && pageStatusCode >= 400) && !!content && (content.trim().length >= 100),
          error: pageError,
          response_code: pageStatusCode,
          time_taken: Date.now() - timer,
        });
        return {
          content: content,
          metadata: { sourceURL: pdfLink, pageStatusCode, pageError },
          provider: "web-scraper",
        };
      })
    );
  }
  private async fetchDocxDocuments(docxLinks: string[]): Promise<Document[]> {
    return Promise.all(
      docxLinks.map(async (docxLink) => {
        const timer = Date.now();
        const logInsertPromise = ScrapeEvents.insert(this.jobId, {
          type: "scrape",
          url: docxLink,
          worker: process.env.FLY_MACHINE_ID,
          method: "docx-scrape",
          result: null,
        });

        const { content, pageStatusCode, pageError } = await fetchAndProcessDocx(
          docxLink
        );

        const insertedLogId = await logInsertPromise;
        ScrapeEvents.updateScrapeResult(insertedLogId, {
          response_size: content.length,
          success: !(pageStatusCode && pageStatusCode >= 400) && !!content && (content.trim().length >= 100),
          error: pageError,
          response_code: pageStatusCode,
          time_taken: Date.now() - timer,
        });

        return {
          content,
          metadata: { sourceURL: docxLink, pageStatusCode, pageError },
          provider: "web-scraper",
        };
      })
    );
  }

  private applyPathReplacements(documents: Document[]): Document[] {
    if (this.replaceAllPathsWithAbsolutePaths) {
      documents = replacePathsWithAbsolutePaths(documents);
    }
    return replaceImgPathsWithAbsolutePaths(documents);
  }

  private async applyImgAltText(documents: Document[]): Promise<Document[]> {
    return this.generateImgAltText
      ? this.generatesImgAltText(documents)
      : documents;
  }

  private async cacheAndFinalizeDocuments(
    documents: Document[],
    links: string[]
  ): Promise<Document[]> {
    // await this.setCachedDocuments(documents, links);
    documents = this.removeChildLinks(documents);
    return documents.splice(0, this.limit);
  }

  private async processDocumentsWithCache(
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    let documents = await this.getCachedDocuments(
      this.urls.slice(0, this.limit)
    );
    if (documents.length < this.limit) {
      const newDocuments: Document[] = await this.getDocuments(
        false,
        inProgress
      );
      documents = this.mergeNewDocuments(documents, newDocuments);
    }
    documents = this.filterDocsExcludeInclude(documents);
    documents = this.filterDepth(documents);
    documents = this.removeChildLinks(documents);
    return documents.splice(0, this.limit);
  }

  private mergeNewDocuments(
    existingDocuments: Document[],
    newDocuments: Document[]
  ): Document[] {
    newDocuments.forEach((doc) => {
      if (
        !existingDocuments.some(
          (d) =>
            this.normalizeUrl(d.metadata.sourceURL) ===
            this.normalizeUrl(doc.metadata?.sourceURL)
        )
      ) {
        existingDocuments.push(doc);
      }
    });
    return existingDocuments;
  }

  private filterDocsExcludeInclude(documents: Document[]): Document[] {
    return documents.filter((document) => {
      const url = new URL(document.metadata.sourceURL);
      const path = url.pathname;

      if (!Array.isArray(this.excludes)) {
        this.excludes = this.excludes.split(',');
      }

      if (this.excludes.length > 0 && this.excludes[0] !== "") {
        // Check if the link should be excluded
        if (
          this.excludes.some((excludePattern) =>
            new RegExp(excludePattern).test(path)
          )
        ) {
          return false;
        }
      }

      if (!Array.isArray(this.includes)) {
        this.includes = this.includes.split(',');
      }

      if (this.includes.length > 0 && this.includes[0] !== "") {
        // Check if the link matches the include patterns, if any are specified
        if (this.includes.length > 0) {
          return this.includes.some((includePattern) =>
            new RegExp(includePattern).test(path)
          );
        }
      }
      return true;
    });
  }

  private normalizeUrl(url: string): string {
    if (url.includes("//www.")) {
      return url.replace("//www.", "//");
    }
    return url;
  }

  private removeChildLinks(documents: Document[]): Document[] {
    for (let document of documents) {
      if (document?.childrenLinks) delete document.childrenLinks;
    }
    return documents;
  }

  async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
    for (const document of documents) {
      if (document.content.trim().length === 0) {
        continue;
      }
      const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
      await setValue(
        "web-scraper-cache:" + normalizedUrl,
        JSON.stringify({
          ...document,
          childrenLinks: childrenLinks || [],
        }),
        60 * 60
      ); // TTL in seconds (1 hour as written)
    }
  }

  async getCachedDocuments(urls: string[]): Promise<Document[]> {
    let documents: Document[] = [];
    for (const url of urls) {
      const normalizedUrl = this.normalizeUrl(url);
      Logger.debug(
        "Getting cached document for web-scraper-cache:" + normalizedUrl
      );
      const cachedDocumentString = await getValue(
        "web-scraper-cache:" + normalizedUrl
      );
      if (cachedDocumentString) {
        const cachedDocument = JSON.parse(cachedDocumentString);
        documents.push(cachedDocument);

        // get children documents
        for (const childUrl of cachedDocument.childrenLinks || []) {
          const normalizedChildUrl = this.normalizeUrl(childUrl);
          const childCachedDocumentString = await getValue(
            "web-scraper-cache:" + normalizedChildUrl
          );
          if (childCachedDocumentString) {
            const childCachedDocument = JSON.parse(childCachedDocumentString);
            if (
              !documents.find(
                (doc) =>
                  doc.metadata.sourceURL ===
                  childCachedDocument.metadata.sourceURL
              )
            ) {
              documents.push(childCachedDocument);
            }
          }
        }
      }
    }
    return documents;
  }

  /**
   * Copies WebScraperOptions onto the instance, applying defaults, splitting
   * comma-separated include/exclude strings, and prefixing bare URLs with https://.
   */
  setOptions(options: WebScraperOptions): void {
    if (!options.urls) {
      throw new Error("Urls are required");
    }

    this.jobId = options.jobId;
    this.bullJobId = options.bullJobId;
    this.urls = options.urls;
    this.mode = options.mode;
    this.concurrentRequests = options.concurrentRequests ?? 20;
    this.includes = options.crawlerOptions?.includes ?? [];
    this.excludes = options.crawlerOptions?.excludes ?? [];
    this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
    this.maxCrawledDepth = options.crawlerOptions?.maxDepth ?? 10;
    this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
    this.limit = options.crawlerOptions?.limit ?? 10000;
    this.generateImgAltText =
      options.crawlerOptions?.generateImgAltText ?? false;
    this.pageOptions = options.pageOptions ?? {
      onlyMainContent: false,
      includeHtml: false,
      replaceAllPathsWithAbsolutePaths: false,
      parsePDF: true,
      removeTags: [],
    };
    this.extractorOptions = options.extractorOptions ?? { mode: "markdown" };
    this.replaceAllPathsWithAbsolutePaths =
      options.crawlerOptions?.replaceAllPathsWithAbsolutePaths ??
      options.pageOptions?.replaceAllPathsWithAbsolutePaths ??
      false;

    if (typeof options.crawlerOptions?.excludes === 'string') {
      this.excludes = options.crawlerOptions?.excludes.split(',').filter((item) => item.trim() !== "");
    }

    if (typeof options.crawlerOptions?.includes === 'string') {
      this.includes = options.crawlerOptions?.includes.split(',').filter((item) => item.trim() !== "");
    }

    this.crawlerMode = options.crawlerOptions?.mode ?? "default";
    this.ignoreSitemap = options.crawlerOptions?.ignoreSitemap ?? false;
    this.allowBackwardCrawling =
      options.crawlerOptions?.allowBackwardCrawling ?? false;
    this.allowExternalContentLinks =
      options.crawlerOptions?.allowExternalContentLinks ?? false;

    // make sure all urls start with https://
    this.urls = this.urls.map((url) => {
      if (!url.trim().startsWith("http")) {
        return `https://${url}`;
      }
      return url;
    });
  }

  private async getSitemapData(baseUrl: string, documents: Document[]) {
    const sitemapData = await fetchSitemapData(baseUrl);
    if (sitemapData) {
      for (let i = 0; i < documents.length; i++) {
        const docInSitemapData = sitemapData.find(
          (data) =>
            this.normalizeUrl(data.loc) ===
            this.normalizeUrl(documents[i].metadata.sourceURL)
        );
        if (docInSitemapData) {
          let sitemapDocData: Partial<SitemapEntry> = {};
          if (docInSitemapData.changefreq) {
            sitemapDocData.changefreq = docInSitemapData.changefreq;
          }
          if (docInSitemapData.priority) {
            sitemapDocData.priority = Number(docInSitemapData.priority);
          }
          if (docInSitemapData.lastmod) {
            sitemapDocData.lastmod = docInSitemapData.lastmod;
          }
          if (Object.keys(sitemapDocData).length !== 0) {
            documents[i].metadata.sitemap = sitemapDocData;
          }
        }
      }
    }
    return documents;
  }
  private async getSitemapDataForSingleUrl(
    baseUrl: string,
    url: string,
    timeout?: number
  ) {
    const sitemapData = await fetchSitemapData(baseUrl, timeout);
    if (sitemapData) {
      const docInSitemapData = sitemapData.find(
        (data) => this.normalizeUrl(data.loc) === this.normalizeUrl(url)
      );
      if (docInSitemapData) {
        let sitemapDocData: Partial<SitemapEntry> = {};
        if (docInSitemapData.changefreq) {
          sitemapDocData.changefreq = docInSitemapData.changefreq;
        }
        if (docInSitemapData.priority) {
          sitemapDocData.priority = Number(docInSitemapData.priority);
        }
        if (docInSitemapData.lastmod) {
          sitemapDocData.lastmod = docInSitemapData.lastmod;
        }
        if (Object.keys(sitemapDocData).length !== 0) {
          return sitemapDocData;
        }
      }
    }
    return null;
  }
  generatesImgAltText = async (documents: Document[]): Promise<Document[]> => {
    await Promise.all(
      documents.map(async (document) => {
        const images = document.content.match(/!\[.*?\]\((.*?)\)/g) || [];

        await Promise.all(
          images.map(async (image: string) => {
            let imageUrl = image.match(/\(([^)]+)\)/)[1];
            let altText = image.match(/\[(.*?)\]/)[1];

            if (
              !altText &&
              !imageUrl.startsWith("data:image") &&
              /\.(png|jpeg|gif|webp)$/.test(imageUrl)
            ) {
              const imageIndex = document.content.indexOf(image);
              const contentLength = document.content.length;
              let backText = document.content.substring(
                imageIndex + image.length,
                Math.min(imageIndex + image.length + 1000, contentLength)
              );
              let frontTextStartIndex = Math.max(imageIndex - 1000, 0);
              let frontText = document.content.substring(
                frontTextStartIndex,
                imageIndex
              );
              altText = await getImageDescription(
                imageUrl,
                backText,
                frontText,
                this.generateImgAltTextModel
              );
            }

            document.content = document.content.replace(
              image,
              `![${altText}](${imageUrl})`
            );
          })
        );
      })
    );

    return documents;
  };

  filterDepth(documents: Document[]): Document[] {
    return documents.filter((document) => {
      const url = new URL(document.metadata.sourceURL);
      return getURLDepth(url.toString()) <= this.maxCrawledDepth;
    });
  }
}

Class properties

  • jobId: the job ID.
  • bullJobId: the Bull queue job ID.
  • urls: the list of URLs to scrape.
  • mode: the scraping mode; one of single_urls (individual URLs), sitemap (site map), or crawl (crawl an entire site).
  • includes / excludes: regular-expression patterns used to filter URLs (see the sketch after this list).
  • maxCrawledLinks / maxCrawledDepth: the maximum number of links and the maximum depth to crawl.
  • returnOnlyUrls: whether to return only the URLs.
  • limit: the maximum number of documents to return.
  • concurrentRequests: the number of concurrent requests.
  • generateImgAltText: whether to generate alt text for images.
  • ignoreSitemap: whether to ignore the sitemap.
  • pageOptions / extractorOptions: page-level and extractor options.
  • replaceAllPathsWithAbsolutePaths: whether to replace all paths with absolute paths.
  • generateImgAltTextModel: the model used to generate image alt text.
  • crawlerMode: the crawler mode ("default" or "fast").
  • allowBackwardCrawling / allowExternalContentLinks: whether backward crawling and external content links are allowed.
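
To make the includes / excludes semantics concrete, here is a minimal standalone sketch of the filtering rule used by filterDocsExcludeInclude (the helper name keepPath is illustrative and not part of Firecrawl): each pattern is treated as a regular expression tested against the URL path, and excludes take precedence over includes.

typescript
// Illustrative helper mirroring filterDocsExcludeInclude's logic.
function keepPath(path: string, includes: string[], excludes: string[]): boolean {
  // Excluded paths are dropped first.
  if (
    excludes.length > 0 &&
    excludes[0] !== "" &&
    excludes.some((pattern) => new RegExp(pattern).test(path))
  ) {
    return false;
  }
  // If include patterns exist, the path must match at least one of them.
  if (includes.length > 0 && includes[0] !== "") {
    return includes.some((pattern) => new RegExp(pattern).test(path));
  }
  // No include patterns: keep everything that was not excluded.
  return true;
}

// keepPath("/docs/api", ["^/docs/.*"], ["^/blog/.*"])  -> true
// keepPath("/blog/post", ["^/docs/.*"], ["^/blog/.*"]) -> false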

Methods

  • authorize / authorizeNango: not yet implemented; presumably reserved for authorization.
  • convertUrlsToDocuments: converts URLs into documents using batched concurrent requests (see the batching sketch after this list).
  • getDocuments: fetches documents, with optional caching.
  • validateInitialUrl: validates the initial URL.
  • processDocumentsWithoutCache: processes documents without using the cache.
  • cleanIrrelevantPath: drops links that fall outside the initial URL's host and path.
  • handleCrawlMode / handleSingleUrlsMode / handleSitemapMode: handle the three scraping modes.
  • returnOnlyUrlsResponse: builds the response when only URLs are requested.
  • processLinks: processes the links, including PDF and DOCX documents.
  • fetchPdfDocuments / fetchDocxDocuments: fetch PDF and DOCX documents.
  • applyPathReplacements: applies path replacements.
  • applyImgAltText: applies image alt text.
  • cacheAndFinalizeDocuments: caches and finalizes the documents.
  • processDocumentsWithCache: processes documents using the cache.
  • mergeNewDocuments: merges new documents into the existing set.
  • filterDocsExcludeInclude: filters documents according to includes / excludes.
  • normalizeUrl: normalizes a URL.
  • removeChildLinks: removes child links from documents.
  • setCachedDocuments: writes documents to the cache.
  • getCachedDocuments: reads documents from the cache.
  • setOptions: sets the options.
  • getSitemapData / getSitemapDataForSingleUrl: fetch sitemap data.
  • generatesImgAltText: generates image alt text.
  • filterDepth: filters documents by URL depth.
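
The concurrency model in convertUrlsToDocuments is simply "slice the URL list into chunks, then await each chunk with Promise.all": requests inside a batch run concurrently, and the next batch starts only after the current one has resolved. A minimal generic sketch of that pattern (mapInBatches is an illustrative name, not a Firecrawl export):

typescript
// Generic chunked-concurrency helper mirroring convertUrlsToDocuments' batching.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Items within a batch run concurrently; batches run sequentially.
    await Promise.all(
      batch.map(async (item, index) => {
        results[i + index] = await worker(item);
      })
    );
  }
  return results;
}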

Notes

  • The class relies on several external modules and services, such as redis, bull, and axios, for caching, job queues, and HTTP requests (a small cache-key sketch follows these notes).
  • WebScraperDataProvider is designed for web scraping jobs; it supports multiple modes and options and can flexibly scrape a single URL, a sitemap, or an entire site.
  • When using the class, make sure all dependencies are installed and configured correctly, and that you have the appropriate permission to access the target site and data stores.
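
For reference, the cache layer stores plain string values keyed by "web-scraper-cache:" plus the normalized URL (with the www. prefix stripped), with each document serialized as JSON. Below is a hedged sketch of reading such an entry directly with ioredis; Firecrawl itself goes through its own getValue/setValue wrappers in services/redis, and the REDIS_URL variable and fallback address here are assumptions.

typescript
import Redis from "ioredis"; // assumed client; Firecrawl wraps Redis access in services/redis

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Mirrors normalizeUrl + the single-document read in getCachedDocuments.
async function readCachedDocument(url: string) {
  const normalized = url.includes("//www.") ? url.replace("//www.", "//") : url;
  const raw = await redis.get("web-scraper-cache:" + normalized);
  return raw ? JSON.parse(raw) : null;
}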