Building a Short-Video E-commerce App from Scratch: What a Crawler Should Know About robots.txt and sitemap.xml

Contents

- 1. robots.txt
- 2. sitemap.xml

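When crawling a website, it is usually worth checking two files in the site's root directory first: robots.txt and sitemap.xml. Between them they describe which parts of the site crawlers may visit and how the site is structured.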

1. robots.txt:

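robots.txt is a plain-text file located in the root directory of a website. It tells crawlers which pages they may fetch and which they should not. Compliance is voluntary: the file states the site's wishes and does not enforce them.

Example robots.txt file: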

User-agent: *
Disallow: /private/
Disallow: /restricted/
Allow: /public/

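User-agent: * applies the rules that follow to all crawlers.
Disallow: /private/ tells crawlers not to access anything under the /private/ directory.
Disallow: /restricted/ tells crawlers not to access anything under the /restricted/ directory.
Allow: /public/ explicitly permits access to the /public/ directory. Neither Disallow rule above covers that path, so the line is redundant here; Allow matters when it carves an exception out of a broader Disallow, such as Disallow: / followed by Allow: /public/.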
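
The rules above can be checked programmatically. The following is a minimal sketch using Python's standard-library urllib.robotparser; the crawler name "MyCrawler" and the test paths are placeholders chosen for illustration, and the rules are supplied as text so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, supplied as text (no download required).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /restricted/
Allow: /public/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "MyCrawler" is a hypothetical user-agent name; it falls under "User-agent: *".
for path in ("/private/report.html", "/public/index.html", "/about.html"):
    url = "https://www.example.com" + path
    print(path, parser.can_fetch("MyCrawler", url))
```

A path that matches no rule at all, such as /about.html here, is treated as allowed.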

A robots.txt file sometimes also points to the site's sitemap.xml, so that search engines can discover the site's content structure more efficiently.

User-agent: *
Disallow: /private/
Disallow: /restricted/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml

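The added Sitemap line gives the location of the site's sitemap.xml as a full URL. It is independent of any User-agent group, so search engines and other crawlers can use it to find and index the site's content.

Note that not every robots.txt contains a Sitemap directive, so search engines also rely on other ways of discovering a site's sitemap.xml, such as trying the conventional /sitemap.xml path or accepting sitemaps submitted by the site owner.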
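
A crawler can apply the same discovery logic. The sketch below reads Sitemap lines with RobotFileParser.site_maps() (available from Python 3.8) and, when none is declared, falls back to guessing /sitemap.xml at the site root. The fallback path is only a common convention, not something every site follows, and the helper name sitemap_urls is made up for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/maps/index.xml
"""


def sitemap_urls(robots_text: str, site_root: str) -> list:
    """Return declared sitemap URLs, or a conventional guess if none are declared."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    declared = parser.site_maps()  # None when there is no Sitemap line
    if declared:
        return declared
    # Assumption: many sites place the sitemap at /sitemap.xml, but not all do.
    return [site_root.rstrip("/") + "/sitemap.xml"]


print(sitemap_urls(ROBOTS_TXT, "https://www.example.com"))
print(sitemap_urls("User-agent: *\nDisallow: /tmp/\n", "https://www.example.org/"))
```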

Take CSDN as an example (https://www.csdn.net/robots.txt):

User-agent: * 
Disallow: /scripts 
Disallow: /public 
Disallow: /css/ 
Disallow: /images/ 
Disallow: /content/ 
Disallow: /ui/ 
Disallow: /js/ 
Disallow: /scripts/ 
Disallow: /article_preview.html* 
Disallow: /tag/
Disallow: /*?*
Disallow: /link/
Disallow: /tags/
Disallow: /news/
Disallow: /xuexi/
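
Several of the CSDN rules above use * as a wildcard, for example /*?* (any URL with a query string) and /article_preview.html*. RFC 9309 defines * as "any sequence of characters" and a trailing $ as "end of the URL", whereas Python's urllib.robotparser has historically done plain prefix matching and may treat * literally. The following is a minimal sketch of wildcard matching under those RFC semantics; rule_matches is a hypothetical helper, and a full implementation would also need longest-match precedence between Allow and Disallow.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (query included)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, which gives prefix semantics.
    return re.match(regex, path) is not None


print(rule_matches("/*?*", "/article/details?id=1"))   # query string present
print(rule_matches("/*?*", "/article/details"))        # no query string
print(rule_matches("/tag/", "/tag/python"))            # plain prefix rule
```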

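Take Xiaohongshu as an example (https://www.xiaohongshu.com/robots.txt), which disallows the entire site for the named search-engine crawlers and for every other crawler as well: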

User-agent:Googlebot
Disallow:/

User-agent:Baiduspider
Disallow:/

User-agent:bingbot
Disallow:/

User-agent:Sogou web spider
Disallow:/

User-agent:Sogou wap spider
Disallow:/

User-agent:YisouSpider
Disallow:/

User-agent:BaiduSpider-ads
Disallow:/

User-agent:*
Disallow:/

Take Alibaba Cloud as an example (https://www.aliyun.com/robots.txt):

User-agent: * 
Disallow: /*?spm=* 
Disallow: /*?tracelog=* 
Disallow: /*?page=* 
Disallow: /template 
Disallow: /admin 
Disallow: /config 
Disallow: /classes 
Disallow: /log 
Disallow: /language 
Disallow: /script 
Disallow: /static 
Disallow: /alilog 
Allow: /s/*


Sitemap: https://www.alibabacloud.com/sitemap.xml

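2. sitemap.xml:

sitemap.xml is an XML file that describes the structure of a site. It lists the site's URLs and can tell search engines which pages the site considers most important and how often they change.

Example sitemap.xml file: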

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://www.example.com/page1</loc>
      <lastmod>2023-01-01</lastmod>
      <changefreq>daily</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>https://www.example.com/page2</loc>
      <lastmod>2023-01-02</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.6</priority>
   </url>
   <!-- more URLs... -->
</urlset>

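<loc>: the URL of the page.
<lastmod>: the date the page was last modified.
<changefreq>: how often the page content is expected to change; valid values are always, hourly, daily, weekly, monthly, yearly, and never.
<priority>: the priority of the page relative to other pages on the same site, from 0.0 to 1.0.

Reading sitemap.xml lets a crawler discover a site's content and structure directly, rather than finding every page by following links, which improves crawl efficiency.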
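
As an illustration, the sketch below parses a sitemap like the example above with Python's standard-library xml.etree.ElementTree and orders the URLs by priority, which is one plausible way to decide what to crawl first. The function name parse_sitemap and the sort order are choices made for this example; the 0.5 fallback follows the default priority defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace, so lookups must be namespace-aware.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://www.example.com/page2</loc>
      <lastmod>2023-01-02</lastmod>
      <changefreq>weekly</changefreq>
      <priority>0.6</priority>
   </url>
   <url>
      <loc>https://www.example.com/page1</loc>
      <lastmod>2023-01-01</lastmod>
      <changefreq>daily</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
"""


def parse_sitemap(xml_bytes: bytes) -> list:
    """Return one dict per <url> entry, highest priority first."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            # Entries without <priority> get the protocol default of 0.5.
            "priority": float(url.findtext("sm:priority", default="0.5", namespaces=NS)),
        })
    return sorted(entries, key=lambda entry: entry["priority"], reverse=True)


for entry in parse_sitemap(SITEMAP_XML):
    print(entry["priority"], entry["loc"], entry["lastmod"])
```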

The following is an example of a multilingual sitemap XML file, using the extension format defined by Google:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  
  <!-- Primary language version - Chinese -->
  <url>
    <loc>https://www.example.com/zh-cn/page1</loc>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://www.example.com/en-us/page1"/>
    <xhtml:link rel="alternate" hreflang="es-es" href="https://www.example.com/es-es/page1"/>
  </url>

  <!-- Primary language version - English -->
  <url>
    <loc>https://www.example.com/en-us/page2</loc>
    <xhtml:link rel="alternate" hreflang="zh-cn" href="https://www.example.com/zh-cn/page2"/>
    <xhtml:link rel="alternate" hreflang="es-es" href="https://www.example.com/es-es/page2"/>
  </url>

  <!-- Primary language version - Spanish -->
  <url>
    <loc>https://www.example.com/es-es/page3</loc>
    <xhtml:link rel="alternate" hreflang="zh-cn" href="https://www.example.com/zh-cn/page3"/>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://www.example.com/en-us/page3"/>
  </url>
  
  <!-- more URLs... -->
</urlset>

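In this example:

<loc>: the URL of the primary language version of the page.
<xhtml:link>: another language version related to the primary one. Each <xhtml:link> element carries the following attributes:
- rel="alternate": marks the link as an alternate version.
- hreflang: the language and region of the alternate version.
- href: the URL of the alternate version.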

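This sitemap format tells search engines precisely which language versions exist for each page, helping them understand the site's multilingual structure. Google's guidance is that each <url> entry list every language version, including the page itself, so a production sitemap would normally also carry a self-referencing <xhtml:link>, which this simplified example omits.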
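
For a crawler, the practical use of these entries is mapping each page to its language variants. The sketch below does that with xml.etree.ElementTree, using the two namespaces declared in the example; language_versions is a hypothetical helper name and the input is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/zh-cn/page1</loc>
    <xhtml:link rel="alternate" hreflang="en-us" href="https://www.example.com/en-us/page1"/>
    <xhtml:link rel="alternate" hreflang="es-es" href="https://www.example.com/es-es/page1"/>
  </url>
</urlset>
"""


def language_versions(xml_bytes: bytes) -> dict:
    """Map each <loc> URL to a {hreflang: href} dict of its alternate versions."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # The attributes carry no namespace prefix, so plain names work with get().
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result


for loc, alternates in language_versions(SITEMAP_XML).items():
    print(loc, alternates)
```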

Take Alibaba Cloud as an example (https://www.aliyun.com/sitemap.xml):

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://www.alibabacloud.com</loc>
        <lastmod>2023-12-28</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>https://www.alibabacloud.com/affiliate</loc>
        <lastmod>2023-12-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>https://www.alibabacloud.com/blog/a-brief-history-of-development-of-alibaba-cloud-polardb_594254</loc>
        <lastmod>2023-12-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>https://www.alibabacloud.com/blog/bootstrapping-function-compute-for-web-using-nodejs-sdk-part-1_594257</loc>
        <lastmod>2023-12-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>https://www.alibabacloud.com/blog/shoucheng-zhang-the-association-of-quantum-computing-artificial-intelligence-and-blockchain_594249</loc>
        <lastmod>2023-12-24</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
    </url>
    <!-- more URLs... -->
</urlset>

Note that although these two files matter a great deal to search engines, not every site provides them. Some sites have no robots.txt or sitemap.xml at all, or serve one that is empty.
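
A crawler should therefore cope with both files being absent or empty. The sketch below shows one defensive approach, assuming the download step (not shown) hands over None when a file is missing: an absent or empty robots.txt is treated as declaring no restrictions, and an absent, empty, or malformed sitemap yields no URLs. Both helper names are made up for this example, and treating a missing robots.txt as permissive is a common convention rather than a guarantee that the site welcomes crawling.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parser_from_optional_text(robots_text):
    """Build a parser; None or empty text means no rules were declared."""
    parser = RobotFileParser()
    parser.parse((robots_text or "").splitlines())
    return parser


def urls_from_optional_sitemap(xml_bytes):
    """Return the <loc> URLs, or an empty list if the sitemap is unusable."""
    if not xml_bytes or not xml_bytes.strip():
        return []
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # For example, an HTML error page served in place of the sitemap.
        return []
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


print(parser_from_optional_text(None).can_fetch("MyCrawler", "https://www.example.com/x"))
print(urls_from_optional_sitemap(None))
print(urls_from_optional_sitemap(b"<html><body>Not Found"))
```

With no sitemap to draw on, the crawler has to fall back to discovering pages by following links from the home page.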
