[Linux] The Ext Family of File Systems (Part 2)

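1. The ext2 File System

1.1 The Big Picture

With the groundwork done, it is time to look at the file system itself. Before files can be stored on a disk, the disk has to be formatted with some file system; the file system's job is to organize and manage the files on it. On Linux, the most common choice is the ext family. Its early version was ext2, followed later by ext3 and ext4. ext3 and ext4 extend ext2, but the core design is unchanged, so ext2 is used here for illustration.

ext2 divides a whole partition into a number of equally sized block groups (Block Group), as shown in the figure below: the partition starts with a boot block, followed by Block Group 0, Block Group 1, and so on. If we can manage one partition, we can manage every partition, and therefore all the files on the disk.

The boot block (Boot Sector) in the figure above has a fixed size of 1 KB, as dictated by the PC standard. It stores partition information and boot information, and no file system is allowed to modify it. The ext2 file system proper begins right after the boot block.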

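1.2 Block Group

ext2 splits a partition into several block groups according to the partition's size, and every block group has the same internal structure.

1.3 Inside a Block Group

1.3.1 Super Block

The Super Block holds the structural information of the file system itself and describes the file system of the whole partition. It records, among other things: the total number of blocks and inodes, the number of unused blocks and inodes, the size of one block and of one inode, the time of the last mount, the time of the last write, and the time of the last disk check. If the Super Block is corrupted, the structure of the entire file system is effectively destroyed.

A copy of the Super Block is kept at the start of block groups: the first block group must have one, while later groups may or may not. To keep the file system usable even when some sectors of the disk fail physically, the Super Block must remain readable in that situation. That is why a file system backs up its Super Block in several block groups, and all of these copies hold the same data.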

/*
 * Structure of the super block
 */
struct ext2_super_block
{
    __le32 s_inodes_count;      /* Inodes count */
    __le32 s_blocks_count;      /* Blocks count */
    __le32 s_r_blocks_count;    /* Reserved blocks count */
    __le32 s_free_blocks_count; /* Free blocks count */
    __le32 s_free_inodes_count; /* Free inodes count */
    __le32 s_first_data_block;  /* First Data Block */
    __le32 s_log_block_size;    /* Block size */
    __le32 s_log_frag_size;     /* Fragment size */
    __le32 s_blocks_per_group;  /* # Blocks per group */
    __le32 s_frags_per_group;   /* # Fragments per group */
    __le32 s_inodes_per_group;  /* # Inodes per group */
    __le32 s_mtime;             /* Mount time */
    __le32 s_wtime;             /* Write time */
    __le16 s_mnt_count;         /* Mount count */
    __le16 s_max_mnt_count;     /* Maximal mount count */
    __le16 s_magic;             /* Magic signature */
    __le16 s_state;             /* File system state */
    __le16 s_errors;            /* Behaviour when detecting errors */
    __le16 s_minor_rev_level;   /* minor revision level */
    __le32 s_lastcheck;         /* time of last check */
    __le32 s_checkinterval;     /* max. time between checks */
    __le32 s_creator_os;        /* OS */
    __le32 s_rev_level;         /* Revision level */
    __le16 s_def_resuid;        /* Default uid for reserved blocks */
    __le16 s_def_resgid;        /* Default gid for reserved blocks */
    /*
     * These fields are for EXT2_DYNAMIC_REV superblocks only.
     *
     * Note: the difference between the compatible feature set and
     * the incompatible feature set is that if there is a bit set
     * in the incompatible feature set that the kernel doesn't
     * know about, it should refuse to mount the filesystem.
     *
     * e2fsck's requirements are more strict; if it doesn't know
     * about a feature in either the compatible or incompatible
     * feature set, it must abort and not try to meddle with
     * things it doesn't understand...
     */
    __le32 s_first_ino;              /* First non-reserved inode */
    __le16 s_inode_size;             /* size of inode structure */
    __le16 s_block_group_nr;         /* block group # of this superblock */
    __le32 s_feature_compat;         /* compatible feature set */
    __le32 s_feature_incompat;       /* incompatible feature set */
    __le32 s_feature_ro_compat;      /* readonly-compatible feature set */
    __u8 s_uuid[16];                 /* 128-bit uuid for volume */
    char s_volume_name[16];          /* volume name */
    char s_last_mounted[64];         /* directory where last mounted */
    __le32 s_algorithm_usage_bitmap; /* For compression */
    /*
     * Performance hints. Directory preallocation should only
     * happen if the EXT2_COMPAT_PREALLOC flag is on.
     */
    __u8 s_prealloc_blocks;     /* Nr of blocks to try to preallocate*/
    __u8 s_prealloc_dir_blocks; /* Nr to preallocate for dirs */
    __u16 s_padding1;
    /*
     * Journaling support valid if EXT3_FEATURE_COMPAT_HAS_JOURNAL set.
     */
    __u8 s_journal_uuid[16]; /* uuid of journal superblock */
    __u32 s_journal_inum;    /* inode number of journal file */
    __u32 s_journal_dev;     /* device number of journal file */
    __u32 s_last_orphan;     /* start of list of inodes to delete */
    __u32 s_hash_seed[4];    /* HTREE hash seed */
    __u8 s_def_hash_version; /* Default hash version to use */
    __u8 s_reserved_char_pad;
    __u16 s_reserved_word_pad;
    __le32 s_default_mount_opts;
    __le32 s_first_meta_bg; /* First metablock block group */
    __u32 s_reserved[190];  /* Padding to the end of the block */
};
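
Two of the fields above are easy to misread, so here is a minimal sketch of how they are usually interpreted. It assumes the fields have already been converted from little-endian to host byte order; the helper names `ext2_block_size` and `ext2_group_count` are made up for this illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: s_log_block_size stores log2(block size) - 10,
 * so 0 means 1 KiB blocks and 2 means 4 KiB blocks. */
uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    assert(s_log_block_size < 22);   /* keep the shift inside 32 bits */
    return UINT32_C(1024) << s_log_block_size;
}

/* Hypothetical helper: number of block groups in the partition.
 * The division rounds up so that a partial last group is counted. */
uint32_t ext2_group_count(uint32_t s_blocks_count,
                          uint32_t s_first_data_block,
                          uint32_t s_blocks_per_group)
{
    assert(s_blocks_per_group != 0);
    assert(s_blocks_count >= s_first_data_block);
    uint64_t usable = (uint64_t)s_blocks_count - s_first_data_block;
    return (uint32_t)((usable + s_blocks_per_group - 1) / s_blocks_per_group);
}
```

For example, an image with 1280 blocks of 4 KiB and an assumed 32768 blocks per group gives `ext2_group_count(1280, 0, 32768) == 1`, that is, a single block group.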

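1.3.2 GDT (Group Descriptor Table)

The block group descriptor table describes the attributes of the block groups; a partition has exactly as many group descriptors as it has block groups. Each descriptor stores the description of one group, for example where its inode Table starts, where its Data Blocks start, and how many free inodes and data blocks remain. A copy of the group descriptor table is kept near the start of block groups, following the same backup pattern as the Super Block.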

// On-disk data structure for a block group
/*
* Structure of a blocks group descriptor
*/
struct ext2_group_desc
{
    __le32 bg_block_bitmap; /* Blocks bitmap block */
    __le32 bg_inode_bitmap; /* Inodes bitmap */
    __le32 bg_inode_table; /* Inodes table block*/
    __le16 bg_free_blocks_count; /* Free blocks count */
    __le16 bg_free_inodes_count; /* Free inodes count */
    __le16 bg_used_dirs_count; /* Directories count */
    __le16 bg_pad;
    __le32 bg_reserved[3];
};
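
To make "one descriptor per group" concrete, the sketch below computes where the descriptor of group `g` sits in the primary table. It assumes the classic ext2 layout in which the table starts in the block right after the one holding the Super Block (no `meta_bg` feature), and a 32-byte descriptor as in the structure above. `ext2_locate_group_desc` and `struct gd_location` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* sizeof(struct ext2_group_desc): 3*4 + 4*2 + 3*4 = 32 bytes */
#define GROUP_DESC_SIZE 32u

struct gd_location {
    uint32_t block;   /* block number that holds the descriptor */
    uint32_t offset;  /* byte offset of the descriptor inside that block */
};

/* Assumption: the primary descriptor table begins at s_first_data_block + 1. */
struct gd_location ext2_locate_group_desc(uint32_t group,
                                          uint32_t s_first_data_block,
                                          uint32_t block_size)
{
    assert(block_size >= 1024);
    struct gd_location loc;
    uint64_t byte = (uint64_t)group * GROUP_DESC_SIZE;
    loc.block = s_first_data_block + 1 + (uint32_t)(byte / block_size);
    loc.offset = (uint32_t)(byte % block_size);
    return loc;
}
```

With 1 KiB blocks the Super Block lives in block 1, so group 0's descriptor is at the start of block 2; with 4 KiB blocks it is at the start of block 1.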

1.3.3 Block Bitmap

The Block Bitmap records which data blocks in the Data Block area are in use and which are free.

1.3.4 inode Bitmap

Each bit indicates whether one inode is free.
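
Both bitmaps work the same way, so one small sketch covers them. It assumes bit 0 of byte 0 stands for the first block or inode of the group, and that a set bit means "in use"; the `bitmap_*` helpers are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if bit idx is set (in use), 0 if it is clear (free). */
int bitmap_test(const uint8_t *map, size_t idx)
{
    return (map[idx / 8] >> (idx % 8)) & 1;
}

/* Marks entry idx as in use. */
void bitmap_set(uint8_t *map, size_t idx)
{
    map[idx / 8] |= (uint8_t)(1u << (idx % 8));
}

/* Marks entry idx as free again. */
void bitmap_clear(uint8_t *map, size_t idx)
{
    map[idx / 8] &= (uint8_t)~(1u << (idx % 8));
}

/* Returns the index of the first free entry among the first nbits,
 * or nbits if every entry is in use. */
size_t bitmap_find_free(const uint8_t *map, size_t nbits)
{
    assert(map != NULL);
    for (size_t i = 0; i < nbits; i++)
        if (!bitmap_test(map, i))
            return i;
    return nbits;
}
```

Allocating a block or an inode is then "find a clear bit, set it"; freeing one is just clearing the bit, which is why deleting a file does not require erasing its data.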

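1.3.5 inode Table

Stores file attributes such as file size, owner, and last modification time.
It is the collection of the inodes (attributes) of all files in the current group.
inode numbers are assigned across the partition as a whole; they are not valid across partitions.

1.3.6 Data Block

The data area stores file content, one block after another. What a block holds depends on the file type:

For a regular file, the file's data is stored in its data blocks.
For a directory, the names of all files and subdirectories in it are stored in the directory's own data blocks; apart from the file name, everything else that ls -l shows is stored in each file's inode.
Block numbers are likewise assigned per partition and are not valid across partitions.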

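1.4 Mapping inodes to Data Blocks (Optional)

Each inode contains the array __le32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */, with EXT2_N_BLOCKS = 15, which maps the inode to its data blocks.
With that, both halves of file = content + attributes can be found.

Some files are very large. Multi-level (indirect) pointers let a fixed-size inode reference far more data blocks: the first 12 entries point directly at data blocks, and the last three are single, double, and triple indirect pointers.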

/*
* Structure of an inode on the disk
*/
struct ext2_inode {
    __le16 i_mode; /* File mode */
    __le16 i_uid; /* Low 16 bits of Owner Uid */
    __le32 i_size; /* Size in bytes */
    __le32 i_atime; /* Access time */
    __le32 i_ctime; /* Creation time */
    __le32 i_mtime; /* Modification time */
    __le32 i_dtime; /* Deletion Time */
    __le16 i_gid; /* Low 16 bits of Group Id */
    __le16 i_links_count; /* Links count */
    __le32 i_blocks; /* Blocks count */
    __le32 i_flags; /* File flags */
    union {
        struct {
            __le32 l_i_reserved1;
        } linux1;
        struct {
            __le32 h_i_translator;
        } hurd1;
        struct {
            __le32 m_i_reserved1;
        } masix1;
    } osd1; /* OS dependent 1 */
    __le32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
    __le32 i_generation; /* File version (for NFS) */
    __le32 i_file_acl; /* File ACL */
    __le32 i_dir_acl; /* Directory ACL */
    __le32 i_faddr; /* Fragment address */
    union {
        struct {
            __u8 l_i_frag; /* Fragment number */
            __u8 l_i_fsize; /* Fragment size */
            __u16 i_pad1;
            __le16 l_i_uid_high; /* these 2 fields */
            __le16 l_i_gid_high; /* were reserved2[0] */
            __u32 l_i_reserved2;
        } linux2;
        struct {
            __u8 h_i_frag; /* Fragment number */
            __u8 h_i_fsize; /* Fragment size */
            __le16 h_i_mode_high;
            __le16 h_i_uid_high;
            __le16 h_i_gid_high;
            __le32 h_i_author;
        } hurd2;
        struct {
            __u8 m_i_frag; /* Fragment number */
            __u8 m_i_fsize; /* Fragment size */
            __u16 m_pad1;
            __u32 m_i_reserved2[2];
        } masix2;
    } osd2; /* OS dependent 2 */
};

#define EXT2_NDIR_BLOCKS 12
#define EXT2_IND_BLOCK EXT2_NDIR_BLOCKS
#define EXT2_DIND_BLOCK (EXT2_IND_BLOCK + 1)
#define EXT2_TIND_BLOCK (EXT2_DIND_BLOCK + 1)
#define EXT2_N_BLOCKS (EXT2_TIND_BLOCK + 1)
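
The payoff of the three indirect levels can be worked out directly. The sketch below counts how many data blocks the 15 entries of `i_block[]` can address for a given block size, assuming 4-byte block numbers. `ext2_max_mapped_blocks` is a hypothetical helper, and the result is only the addressing limit: other fields of the inode can cap the real maximum file size lower.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12   /* mirrors EXT2_NDIR_BLOCKS */

/* Data blocks reachable through i_block[]:
 * 12 direct + p (single) + p^2 (double) + p^3 (triple indirect),
 * where p is the number of 4-byte block numbers in one block. */
uint64_t ext2_max_mapped_blocks(uint32_t block_size)
{
    assert(block_size >= 1024 && block_size % 4 == 0);
    uint64_t p = block_size / 4;
    return NDIR_BLOCKS + p + p * p + p * p * p;
}
```

With 1 KiB blocks this gives 16843020 addressable blocks, roughly 16 GiB of data, even though the inode itself only spends 60 bytes on the mapping.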

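Think about it:

What are we actually doing when we create, delete, look up, or modify a file?
Conclusions:

Formatting a partition means dividing it into groups and writing management information (SB, GDT, Block Bitmap, inode Bitmap, and so on) into each group. This management information, taken together, is what we call the file system.

Given a file's inode number, we can determine which group within the partition it belongs to, and then which inode inside that group it is.

Once we have the inode, we have both the file's attributes and its content.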
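
The "inode number to group to inode" step is plain integer arithmetic. Here is a minimal sketch, assuming inode numbers start at 1 and that `inodes_per_group`, `inode_size`, and the block size have been read from the Super Block; `ext2_locate_inode` and `struct inode_location` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;         /* which block group holds the inode */
    uint32_t index;         /* index inside that group's inode table */
    uint32_t block_offset;  /* blocks past the start of the inode table (bg_inode_table) */
    uint32_t byte_offset;   /* byte offset inside that block */
};

struct inode_location ext2_locate_inode(uint32_t ino,
                                        uint32_t inodes_per_group,
                                        uint32_t inode_size,
                                        uint32_t block_size)
{
    assert(ino >= 1);   /* inode numbers are 1-based */
    assert(inodes_per_group != 0 && inode_size != 0 && block_size >= inode_size);
    struct inode_location loc;
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    uint64_t byte = (uint64_t)loc.index * inode_size;
    loc.block_offset = (uint32_t)(byte / block_size);
    loc.byte_offset = (uint32_t)(byte % block_size);
    return loc;
}
```

For instance, with an assumed 8192 inodes per group, 128-byte inodes, and 4 KiB blocks, inode 263466 falls in group 32 at index 1321, which is 41 blocks plus 1152 bytes into that group's inode table.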

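Let's touch a new file and see how this works:

[root@localhost linux]# touch abc
[root@localhost linux]# ls -i abc
263466 abc

To make the point, we simplify the figure above:

Creating a new file mainly involves the following 4 operations:

Store the attributes: the kernel first finds a free inode (here, 263466) and records the file's information in it.
Store the data: the file needs three disk blocks, and the kernel finds three free ones: 300, 500, and 800. It copies the first block of data from the kernel buffer to block 300, the next to block 500, and so on.
Record the allocation: the file content is stored, in order, in blocks 300, 500, and 800. The kernel records this block list in the disk-layout area of the inode.
Add the file name to the directory: the new file is named abc. How does Linux record it in the current directory? The kernel adds the entry (263466, abc) to the directory file. This correspondence between file name and inode number is what ties the name to the file's attributes and content.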
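
The four operations can be mimicked with a toy in-memory model. This is only a sketch under heavy assumptions: plain byte arrays stand in for the two bitmaps, a single array stands in for one directory's content, only direct blocks exist, and all `toy_*` names are hypothetical. It is not how the kernel implements file creation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_INODES  16
#define TOY_BLOCKS  64
#define TOY_NDIRECT 12
#define TOY_DIRENTS 16

struct toy_inode  { uint32_t nblocks; uint32_t block[TOY_NDIRECT]; };
struct toy_dirent { uint32_t ino; char name[32]; };

struct toy_fs {
    uint8_t inode_used[TOY_INODES];      /* stand-in for the inode bitmap */
    uint8_t block_used[TOY_BLOCKS];      /* stand-in for the block bitmap */
    struct toy_inode inodes[TOY_INODES]; /* stand-in for the inode table */
    struct toy_dirent dir[TOY_DIRENTS];  /* content of one directory */
    uint32_t ndirents;
};

/* Returns the first index whose flag is 0, or -1 if none is free. */
static int toy_first_free(const uint8_t *used, int n)
{
    for (int i = 0; i < n; i++)
        if (!used[i])
            return i;
    return -1;
}

/* Creates a file occupying nblocks blocks; returns its inode number, 0 on failure. */
uint32_t toy_create(struct toy_fs *fs, const char *name, uint32_t nblocks)
{
    assert(fs != NULL && name != NULL);
    if (nblocks > TOY_NDIRECT || fs->ndirents >= TOY_DIRENTS ||
        strlen(name) >= sizeof fs->dir[0].name)
        return 0;

    /* Check capacity first so a failure leaves nothing half-allocated. */
    uint32_t free_blocks = 0;
    for (int b = 0; b < TOY_BLOCKS; b++)
        free_blocks += !fs->block_used[b];
    int slot = toy_first_free(fs->inode_used, TOY_INODES);
    if (slot < 0 || free_blocks < nblocks)
        return 0;

    /* 1. Store the attributes: claim a free inode. */
    fs->inode_used[slot] = 1;
    struct toy_inode *in = &fs->inodes[slot];
    memset(in, 0, sizeof *in);

    /* 2 and 3. Store the data and record the allocation, in order. */
    for (uint32_t i = 0; i < nblocks; i++) {
        int b = toy_first_free(fs->block_used, TOY_BLOCKS);
        fs->block_used[b] = 1;
        in->block[i] = (uint32_t)b;
    }
    in->nblocks = nblocks;

    /* 4. Add the (inode number, name) entry to the directory. */
    struct toy_dirent *de = &fs->dir[fs->ndirents++];
    de->ino = (uint32_t)slot + 1;   /* inode numbers are 1-based */
    strcpy(de->name, name);
    return de->ino;
}
```

If some blocks are already taken, the new file's blocks come out non-contiguous, just like 300, 500, 800 in the walkthrough; the ordered list in the inode is what keeps the content readable in sequence.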

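1.5 Directories and File Names

Questions:

We always access files by name, never by inode number. So how do we find the inode number?
Is a directory a file too? How should we think about it?

Answers:

A directory is also a file. The disk itself has no notion of a directory, only of file attributes + file content.
A directory's attributes need no further explanation; its content is the mapping between file names and inode numbers.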

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <directory>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    DIR *dir = opendir(argv[1]); // library call; see man 3 opendir
    if (!dir)
    {
        perror("opendir");
        exit(EXIT_FAILURE);
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
    { // library call; see man 3 readdir
        // Skip the "." and ".." directory entries
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
        {
            continue;
        }
        printf("Filename: %s, Inode: %lu\n", entry->d_name, (unsigned long)entry->d_ino);
    }
    closedir(dir);
    
    return 0;
}

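So to access a file, we must open the directory that contains it, use the file name to obtain the corresponding inode number, and then access the file.
Accessing a file requires knowing the directory it lives in; in essence, we must be able to open the current working directory file and read its contents.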

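1.6 Path Resolution

Question: we open the current working directory file and read its contents. But isn't the current working directory itself a file? Don't we know it only by name as well? To access it, don't we also need its inode?

Answer 1: so we must also open the parent of the current working directory. But the parent is a file too, and we are back at the same question.

Answer 2: so the process is "recursive": every directory in the path has to be resolved, and the base case is the root directory "/".

Final answer: in practice every file has a path. To access a target such as /root/boke/test, we start at the root directory and open each directory in turn, looking up the next name in each one, until we reach test. This process is called path resolution.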
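
The walk described above can be sketched over a toy table of directory entries. Everything here is an assumption made for illustration: the table replaces real directory blocks, the inode numbers other than the root are made up, "." and ".." are not handled, and `toy_lookup` / `toy_resolve` are hypothetical names. The root is given inode 2, which is the number ext2 reserves for the root directory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ROOT_INO 2   /* ext2 reserves inode 2 for "/" */

/* One directory entry: "in directory parent_ino, name maps to ino". */
struct toy_entry {
    uint32_t parent_ino;
    const char *name;
    uint32_t ino;
};

/* Looks up one name inside one directory; returns 0 if it is absent. */
uint32_t toy_lookup(const struct toy_entry *tab, size_t n,
                    uint32_t dir_ino, const char *name, size_t len)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].parent_ino == dir_ino &&
            strlen(tab[i].name) == len &&
            memcmp(tab[i].name, name, len) == 0)
            return tab[i].ino;
    return 0;
}

/* Resolves an absolute path one component at a time, starting at the root.
 * Returns the inode number of the final component, or 0 if any step fails. */
uint32_t toy_resolve(const struct toy_entry *tab, size_t n, const char *path)
{
    assert(path != NULL && path[0] == '/');
    uint32_t cur = TOY_ROOT_INO;
    const char *p = path;
    while (*p) {
        while (*p == '/')              /* skip separators */
            p++;
        if (*p == '\0')
            break;
        size_t len = strcspn(p, "/");  /* length of this component */
        cur = toy_lookup(tab, n, cur, p, len);
        if (cur == 0)
            return 0;                  /* component not found */
        p += len;
    }
    return cur;
}
```

Resolving /root/boke/test therefore costs three lookups, one per directory on the way down, which is exactly why the kernel wants to cache the results.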

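Notes:

So now we know why accessing a file requires directory + file name = file path.

The root directory has a fixed name and a fixed inode number; no lookup is needed, and the system must know it as soon as it boots.

But who supplies the path?

We access files through commands and tools, which means through processes, and every process has a CWD. The process supplies the path.
When we open a file, we pass in the path.

But where did the very first paths come from?

Why does Linux have a root directory, and why are there so many default directories under it?
Why do we have home directories, and why can we create new directories?
All of these actions boil down to creating directory files in the on-disk file system. Every file we create is created under a directory chosen either by us or by the system, so it has a path from the moment it exists.
The system and its users build the Linux path structure together.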

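1.7 Path Caching

Question 1: do real directories exist in a Linux system?

Answer: no. There are only files, stored as file attributes + file content.

Question 2: does every file access have to resolve the path starting from the root directory?

Answer: in principle yes, but that would be too slow, so Linux caches the path structure it has already resolved.

Question 3: then where does the notion of a directory in Linux come from?

Answer: when the file being opened is a directory, the OS maintains the path structure for it in memory.

In the Linux kernel, the structure that maintains this tree of paths is struct dentry: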

struct dentry
{
    atomic_t d_count;
    unsigned int d_flags;  /* protected by d_lock */
    spinlock_t d_lock;     /* per dentry lock */
    struct inode *d_inode; /* Where the name belongs to - NULL is
                            * negative */
    /*
     * The next three fields are touched by __d_lookup. Place them here
     * so they all fit in a cache line.
     */
    struct hlist_node d_hash; /* lookup hash list */
    struct dentry *d_parent;  /* parent directory */
    struct qstr d_name;
    struct list_head d_lru; /* LRU list */
    /*
     * d_child and d_rcu can share memory
     */
    union
    {
        struct list_head d_child; /* child of parent list */
        struct rcu_head d_rcu;
    } d_u;
    struct list_head d_subdirs; /* our children */
    struct list_head d_alias;   /* inode alias list */
    unsigned long d_time;       /* used by d_revalidate */
    struct dentry_operations *d_op;
    struct super_block *d_sb; /* The root of the dentry tree */
    void *d_fsdata;           /* fs-specific data */
#ifdef CONFIG_PROFILING
    struct dcookie_struct *d_cookie; /* cookie, if any */
#endif
    int d_mounted;
    unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
};

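Notes:

Every file has a corresponding dentry, regular files included. All opened files therefore form one tree in memory.

Each node of the tree also belongs to an LRU (Least Recently Used) list, which is used to evict nodes.

Each node also belongs to a hash table for fast lookup.

Most importantly, the tree as a whole is Linux's path cache. Opening any file first searches this tree along the path. On a hit, the inode (attributes) and the content are returned; on a miss, the path is loaded from disk, dentry structures are created for it, and the new path is cached.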
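
The "check the cache first, fall back to disk, then remember the answer" flow can be sketched with a very small cache. This is a deliberately simplified model, not the kernel's dcache: it uses one slot per hash value instead of hash chains plus an LRU list, it does not cache failed lookups, and `toy_dcache_lookup`, `toy_disk_lookup`, and the inode numbers are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64

struct toy_dentry {
    int used;
    uint32_t parent_ino;   /* directory the name lives in */
    char name[32];
    uint32_t ino;          /* inode the name resolves to */
};

struct toy_dcache {
    struct toy_dentry slot[CACHE_SLOTS];
    unsigned hits, misses;
};

/* Slow path: would read the directory's data blocks from disk. */
typedef uint32_t (*disk_lookup_fn)(uint32_t parent_ino, const char *name);

/* Stand-in for the disk: only knows that "root" in directory 2 is inode 11. */
uint32_t toy_disk_lookup(uint32_t parent_ino, const char *name)
{
    return (parent_ino == 2 && strcmp(name, "root") == 0) ? 11 : 0;
}

static unsigned toy_hash(uint32_t parent_ino, const char *name)
{
    unsigned h = parent_ino * 31u;
    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % CACHE_SLOTS;
}

/* Returns the inode number for (parent, name), consulting the cache first. */
uint32_t toy_dcache_lookup(struct toy_dcache *dc, uint32_t parent_ino,
                           const char *name, disk_lookup_fn disk)
{
    assert(dc != NULL && name != NULL && disk != NULL);
    assert(strlen(name) < sizeof dc->slot[0].name);
    struct toy_dentry *d = &dc->slot[toy_hash(parent_ino, name)];
    if (d->used && d->parent_ino == parent_ino && strcmp(d->name, name) == 0) {
        dc->hits++;                    /* found in memory, no disk access */
        return d->ino;
    }
    dc->misses++;
    uint32_t ino = disk(parent_ino, name);
    if (ino != 0) {                    /* remember it, evicting the old slot content */
        d->used = 1;
        d->parent_ino = parent_ino;
        strcpy(d->name, name);
        d->ino = ino;
    }
    return ino;
}
```

The first lookup of a name pays for the disk access; repeating it is answered from memory, which is the effect the hash and LRU membership described above are designed to deliver at scale.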

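1.8 Mounting a Partition

We can now find a file in a given partition by its inode number, and we can find an inode from a directory file's contents; inside one partition we can do whatever we like. But inode numbers are not valid across partitions, and Linux can have many partitions. How do we know which partition we are in?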

1.8.1 An Experiment

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# dd if=/dev/zero of=./disk.img bs=1M count=5 # create a large file and treat it as a partition
5+0 records in
5+0 records out
5242880 bytes (5.2 MB, 5.0 MiB) copied, 0.00455198 s, 1.2 GB/s

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# mkfs.ext4 disk.img # format it, writing a file system
mke2fs 1.47.0 (5-Feb-2023)
Filesystem too small for a journal
Discarding device blocks: done
Creating filesystem with 1280 4k blocks and 1280 inodes
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# mkdir /mnt/mydisk # create an empty directory

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# df -h # list the partitions currently available
Filesystem Size Used Avail Use% Mounted on
tmpfs 168M 1.3M 167M 1% /run
efivarfs 256K 7.3K 244K 3% /sys/firmware/efi/efivars
/dev/vda3 40G 8.9G 29G 24% /
tmpfs 839M 0 839M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda2 197M 6.2M 191M 4% /boot/efi
/dev/loop0 3.5M 24K 3.0M 1% /data/maxhou/data2mount
tmpfs 839M 220K 839M 1% /run/qemu
overlay 40G 8.9G 29G 24% /var/lib/docker/overlay2/1050e58d56434bb1c599036de61a26007255561b4d841d1b82d59a966a463752/merged
tmpfs 168M 12K 168M 1% /run/user/0

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# sudo mount -t ext4 ./disk.img /mnt/mydisk/ # mount the partition on the chosen directory

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 168M 1.3M 167M 1% /run
efivarfs 256K 7.3K 244K 3% /sys/firmware/efi/efivars
/dev/vda3 40G 8.9G 29G 24% /
tmpfs 839M 0 839M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda2 197M 6.2M 191M 4% /boot/efi
/dev/loop0 3.5M 24K 3.0M 1% /data/maxhou/data2mount
tmpfs 839M 220K 839M 1% /run/qemu
overlay 40G 8.9G 29G 24% /var/lib/docker/overlay2/1050e58d56434bb1c599036de61a26007255561b4d841d1b82d59a966a463752/merged
tmpfs 168M 12K 168M 1% /run/user/0
/dev/loop1 4.7M 24K 4.4M 1% /mnt/mydisk

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# sudo umount /mnt/mydisk # unmount the partition

root@iZ2vcc44fhpy7zao55ypyxZ:~/boke/test# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 168M 1.3M 167M 1% /run
efivarfs 256K 7.3K 244K 3% /sys/firmware/efi/efivars
/dev/vda3 40G 8.9G 29G 24% /
tmpfs 839M 0 839M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda2 197M 6.2M 191M 4% /boot/efi
/dev/loop0 3.5M 24K 3.0M 1% /data/maxhou/data2mount
tmpfs 839M 220K 839M 1% /run/qemu
overlay 40G 8.9G 29G 24% /var/lib/docker/overlay2/1050e58d56434bb1c599036de61a26007255561b4d841d1b82d59a966a463752/merged
tmpfs 168M 12K 168M 1% /run/user/0

Notes:

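On Linux, /dev/loop0 is the first loop device. A loop device (also called a loopback device) is a pseudo-device that lets a regular file be used as a block device. This mechanism makes it possible to mount a file (an ISO image, for example) as a file system, just as if it were a physical disk partition or an external storage device. In the experiment above, /dev/loop0 was already in use, so disk.img was attached to /dev/loop1.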