Multimodal Learning Notes - 2

Reference repo: WatchTower-Liu/VLM-learning; url: vlm-learning

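Rant

Today's source reading nearly finished me off. NTK (neural tangent kernel) and rotary_position_embedding were things I had never come across in my earlier studies, so I was completely lost while reading. All I can say is that this is Qwen for you: every SOTA technique going is in there. It was just a bit of a slog to read TAT~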

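What I learned

This is another source-reading session, picking up where the previous note (Multimodal Learning Notes - 1) left off. Last time we covered the first pass over the input sequence: the image_token entries (the placeholders that should be replaced by image data) are removed from input_ids, and device is set to whichever device (GPU or CPU) input_ids or inputs_embeds lives on. Note that the forward pass must not receive both input_ids and inputs_embeds; pass exactly one of them.

Now for the next part of the source, still in the forward pass (note that this forward code is not Qwen's original code; it was rewritten to adapt the model to multimodal input).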

    output_attentions = (
        output_attentions
        if output_attentions is not None
        else self.config.output_attentions
    )
    output_hidden_states = (
        output_hidden_states
        if output_hidden_states is not None
        else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = (
        return_dict if return_dict is not None else self.config.use_return_dict
    )

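This code resolves four flags: output_attentions, output_hidden_states, use_cache and return_dict. Each one falls back to the value in self.config when the caller passes None. Their meanings are:

output_attentions: whether to return the attention weights.

output_hidden_states: whether to return the hidden states of every layer, rather than only the hidden states of the last layer.

use_cache: whether to use the caching mechanism, which speeds up inference by caching key/value pairs that have already been computed. During inference the model generates tokens one at a time and caches the K and V computed for each token. When it generates the next token, it reuses the cached K and V and only runs attention for the new token instead of recomputing attention over the whole sequence, which cuts the amount of computation considerably.

return_dict: easy enough, whether the result is returned as a dict-like output object rather than a plain tuple.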
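The fallback pattern above can be tried in isolation. The following is only a minimal sketch, not the repo's code: `config` is a hypothetical stand-in for `self.config` with made-up values, and `resolve_flag` is a helper name invented for this illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for self.config; the values are made up for illustration.
config = SimpleNamespace(
    output_attentions=False,
    output_hidden_states=False,
    use_cache=True,
    use_return_dict=True,
)


def resolve_flag(explicit, default):
    """Return the caller's value unless it is None, in which case use the default."""
    return explicit if explicit is not None else default


# The caller sets output_attentions explicitly and leaves use_cache unspecified.
output_attentions = resolve_flag(True, config.output_attentions)
use_cache = resolve_flag(None, config.use_cache)
# An explicit False has to win over a True default, which is why the check is
# "is not None" and not a plain truthiness test.
return_dict = resolve_flag(False, config.use_return_dict)

print(output_attentions, use_cache, return_dict)  # True True False
```

The explicit `is not None` test is the important detail: with `explicit or default`, a caller could never switch off a flag whose config default is True.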

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time"
        )
    elif input_ids is not None:
        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1]).contiguous()
        batch_size = input_ids.shape[0]
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.size()[:-1]
        batch_size = inputs_embeds.shape[0]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

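If both input_ids and inputs_embeds are passed, an error is raised saying that the two cannot be specified at the same time. If input_ids is not None (and inputs_embeds is None), size() gives its shape, normally (batch_size, seq_len), and view(-1, input_shape[-1]) turns it into a 2-D tensor. contiguous() then makes sure the result is stored contiguously in memory. Note that view itself only works when the tensor's memory layout is compatible with the requested shape (reshape falls back to a copy when it is not, while view is the older API and common in existing code); since contiguous() is called after view here, it protects later operations rather than this view call.

If inputs_embeds is passed, its shape is normally (batch_size, seq_len, embed_size). We do not need the embed_size dimension, so the last dimension is dropped when taking input_shape, and batch_size is the first dimension.

If neither input_ids nor inputs_embeds is passed, an error is raised: one of the two sequence inputs is required.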
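The three branches can be replayed on bare shape tuples, without any tensors. This is a sketch under assumptions: `resolve_input_shape` is a hypothetical helper written for this note, it only handles the usual 2-D input_ids case, and the embedding size 4096 is an arbitrary example value.

```python
def resolve_input_shape(input_ids_shape=None, inputs_embeds_shape=None):
    """Mirror the branch logic on shape tuples and return (input_shape, batch_size).

    Hypothetical helper for illustration; assumes input_ids is already 2-D.
    """
    if input_ids_shape is not None and inputs_embeds_shape is not None:
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time"
        )
    if input_ids_shape is not None:
        # (batch_size, seq_len): used as is
        input_shape = tuple(input_ids_shape)
    elif inputs_embeds_shape is not None:
        # (batch_size, seq_len, embed_size): drop the trailing embed_size
        input_shape = tuple(inputs_embeds_shape)[:-1]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")
    return input_shape, input_shape[0]


print(resolve_input_shape(input_ids_shape=(2, 10)))            # ((2, 10), 2)
print(resolve_input_shape(inputs_embeds_shape=(2, 10, 4096)))  # ((2, 10), 2)
```

Either input ends up with the same (batch_size, seq_len) shape, which is what lets the rest of the forward pass ignore which of the two was supplied.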

    if images is not None and first_step:
        # widen the sequence dimension to leave room for the image features
        input_shape = input_shape[0], input_shape[-1] + self.otherConfig["image_context_length"]

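On the first step of inference or training, the shape of the input has to be adjusted. Normally input_shape[0] is batch_size and input_shape[-1] is seq_len; we add the image context length to the seq_len dimension so that the image features and the text input can be fused later.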
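The shape arithmetic is small enough to check by hand. In the sketch below, `extend_for_image` is a hypothetical helper and 256 is an assumed image context length chosen only for the example; the real value comes from `self.otherConfig["image_context_length"]`.

```python
def extend_for_image(input_shape, image_context_length, has_images, first_step):
    """Widen the sequence dimension on the first step when images are present."""
    if has_images and first_step:
        return input_shape[0], input_shape[-1] + image_context_length
    return input_shape


# First step: batch of 2 prompts, 10 text tokens each, 256 assumed image positions.
print(extend_for_image((2, 10), 256, has_images=True, first_step=True))   # (2, 266)
# Later step: one new token per sequence, so the shape is left alone.
print(extend_for_image((2, 1), 256, has_images=True, first_step=False))   # (2, 1)
```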

    if token_type_ids is not None:
        token_type_ids = token_type_ids.view(-1, input_shape[-1])
    if position_ids is not None:
        position_ids = position_ids.view(-1, input_shape[-1])

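These lines reshape token_type_ids and position_ids. token_type_ids distinguishes tokens of different modalities, while position_ids is fundamental to a transformer: self-attention processes all elements of the sequence in parallel and has no built-in notion of order, whereas an RNN processes the data recurrently and naturally carries information from earlier time steps. That is why position_ids is needed. The view calls make sure that the last dimension of both token_type_ids and position_ids is seq_len + image_context_len, ready for the later steps.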
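The claim that self-attention cannot see token order by itself can be checked with a tiny pure-Python experiment. This is a toy sketch under a simplifying assumption: the projections are identities (Q = K = V = x), whereas a real model uses learned projections. The point it illustrates is that shuffling the input tokens merely shuffles the outputs in the same way when no position information is added.

```python
import math


def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x)."""
    dim = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(dim)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each shuffled position receives exactly the output its token had before.
order_is_invisible = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for j, i in enumerate(perm)
    for a, b in zip(out_shuffled[j], out[i])
)
print(order_is_invisible)  # True
```

Because the per-token outputs do not change with the ordering, the model needs an explicit signal such as position_ids to tell "first token" from "last token".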

    if past_key_values is None:
        past_length = 0
        past_key_values = tuple([None] * len(self.h))
    else:
        if self.use_cache_quantization:
            past_length = past_key_values[0][0][0].size(2)
        else:
            past_length = past_key_values[0][0].size(-2)
    if position_ids is None:
        position_ids = torch.arange(
            past_length,
            input_shape[-1] + past_length,
            dtype=torch.long,
            device=device,
        )
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

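If there are no cached key/value pairs, past_length is set to 0, meaning that no part of the sequence has been processed yet (assuming use_cache), and past_key_values is initialized to a tuple of None with len(self.h) entries. Note that self.h is the list of transformer blocks, so there is one slot per layer (not per attention head), and each slot will hold that layer's key/value cache.

If cached key/value pairs exist and cache quantization is enabled, the expression past_key_values[0][0][0] looks cryptic; it is enough to know that it reaches into the first layer's cached key and reads the third dimension, size(2), of that tensor. Without cache quantization, the code reads the second-to-last dimension of the first layer's cached key instead. All we need to know is that the code treats both of these as past_len; digging into the underlying layout gets messy.

If position_ids was not passed in (as I understand it, this is the case when we are no longer on the first inference or training step), we build one ourselves: it starts at past_len and ends, exclusively, at past_len + input_shape[-1], which on the first step is past_len + (seq_len + image_context_len). Adding past_len to both ends shifts the range past the cached tokens while keeping its length unchanged. The dtype is torch.long.

Finally position_ids is reshaped. Its original size is (seq_len + image_context_len,); unsqueeze(0) adds a leading dimension of size 1, and view(-1, input_shape[-1]) then fixes the last dimension at seq_len + image_context_len. The view is arguably redundant here, but it does no harm and makes the code more robust.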
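How past_length grows as the cache fills up can be shown with a toy cache made of plain lists. This is a sketch under heavy assumptions, not the real tensor layout: each slot of the tuple corresponds to one entry of self.h, each slot holds a (keys, values) pair of Python lists, the same placeholder strings are stored in every layer, and `read_past_length`, `append_step` and `NUM_LAYERS` are names invented for the illustration.

```python
def read_past_length(past_key_values, num_layers):
    """Return (past_length, past_key_values), loosely mirroring the non-quantized branch."""
    if past_key_values is None:
        # Nothing cached yet: one empty slot per entry of self.h.
        return 0, tuple([None] * num_layers)
    first_layer_keys, _ = past_key_values[0]
    return len(first_layer_keys), past_key_values


def append_step(past_key_values, new_keys, new_values):
    """Append this step's keys and values to every slot and return the new cache."""
    updated = []
    for layer_past in past_key_values:
        keys, values = layer_past if layer_past is not None else ([], [])
        updated.append((keys + new_keys, values + new_values))
    return tuple(updated)


NUM_LAYERS = 2  # assumed for the example; the real code uses len(self.h)

past_length, cache = read_past_length(None, NUM_LAYERS)            # first step
cache = append_step(cache, ["k0", "k1", "k2"], ["v0", "v1", "v2"])  # 3-token prompt
length_after_prompt, cache = read_past_length(cache, NUM_LAYERS)
cache = append_step(cache, ["k3"], ["v3"])                          # one generated token
length_after_one_token, cache = read_past_length(cache, NUM_LAYERS)

print(past_length, length_after_prompt, length_after_one_token)  # 0 3 4
```

Only the first slot is inspected when reading the length, which matches the `past_key_values[0]` indexing in the source: every slot has seen the same number of tokens.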
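The range arithmetic can be reproduced with plain Python lists standing in for the tensor. In this sketch `make_position_ids` is a hypothetical helper, and the lengths (4 text tokens, 3 image positions) are made-up small numbers; the nested list plays the role of the leading dimension added by unsqueeze(0).

```python
def make_position_ids(past_length, seq_len):
    """Positions past_length .. past_length + seq_len - 1, wrapped in a batch dim of 1."""
    return [list(range(past_length, past_length + seq_len))]


# First step: 4 text tokens plus 3 assumed image positions, nothing cached yet.
first = make_position_ids(past_length=0, seq_len=4 + 3)
# Next step: 7 positions are cached and one new token arrives.
following = make_position_ids(past_length=7, seq_len=1)

print(first)      # [[0, 1, 2, 3, 4, 5, 6]]
print(following)  # [[7]]
```

The length of the inner list is always the current chunk length; past_length only shifts where the numbering starts.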

I'm out of time for today, so I'll stop here and carry on tomorrow. Fighting~