C1W4.Assignment.Naive Machine Translation and LSH

Lecture: C1W4.Machine Translation and Document Search

Table of contents

- [1. The word embeddings data for English and French words](#1. The word embeddings data for English and French words)
  - [1.1 The data](#1.1The data)
    - [The subset of data](#The subset of data)
    - [Load two dictionaries](#Load two dictionaries)
  - [1.2 Generate embedding and transform matrices](#1.2 Generate embedding and transform matrices)
    - [Exercise 1: Translating English dictionary to French by using embeddings](#Exercise 1: Translating English dictionary to French by using embeddings)
- [2. Translations](#2. Translations)
  - [2.1 Translation as linear transformation of embeddings](#2.1 Translation as linear transformation of embeddings)
    - [Actual loss function](#Actual loss function)
    - [Exercise 2: Implementing translation mechanism described in this section](#Exercise 2: Implementing translation mechanism described in this section)
    - [Exercise 3](#Exercise 3)
      - [Computing the gradient of loss in respect to transform matrix R](#Computing the gradient of loss in respect to transform matrix R)
      - [Finding the optimal R with gradient descent algorithm](#Finding the optimal R with gradient descent algorithm)
    - [Exercise 4](#Exercise 4)
    - [Calculate transformation matrix R](#Calculate transformation matrix R)
  - [2.2 Testing the translation](#2.2 Testing the translation)
    - [k-Nearest neighbors algorithm](#k-Nearest neighbors algorithm)
    - [Cosine similarity](#Cosine similarity)
    - [Exercise 5](#Exercise 5)
    - [Exercise 6: Test your translation and compute its accuracy](#Exercise 6: Test your translation and compute its accuracy)
- [3. LSH and document search](#3. LSH and document search)
  - [3.1 Getting the document embeddings](#3.1 Getting the document embeddings)
    - [Bag-of-words (BOW) document models](#Bag-of-words (BOW) document models)
    - [Document embeddings](#Document embeddings)
    - [Exercise 7](#Exercise 7)
    - [Exercise 8: Store all document vectors into a dictionary](#Exercise 8: Store all document vectors into a dictionary)
  - [3.2 Looking up the tweets](#3.2 Looking up the tweets)
  - [3.3 Finding the most similar tweets with LSH](#3.3 Finding the most similar tweets with LSH)
  - [3.4 Getting the hash number for a vector](#3.4 Getting the hash number for a vector)
    - [Hyperplanes in vector spaces](#Hyperplanes in vector spaces)
    - [Using Hyperplanes to split the vector space](#Using Hyperplanes to split the vector space)
    - [Encoding hash buckets](#Encoding hash buckets)
    - [Exercise 9: Implementing hash buckets](#Exercise 9: Implementing hash buckets)
      - [Create the sets of planes](#Create the sets of planes)
  - [3.5 Creating a hash table](#3.5 Creating a hash table)
    - [Exercise 10](#Exercise 10)
  - [3.6 Creating all hash tables](#3.6 Creating all hash tables)
    - [Exercise 11: Approximate K-NN](#Exercise 11: Approximate K-NN)

This assignment requires the NLTK Twitter dataset and the NLTK stopword list.

First, import the packages:

python

import pdb
import pickle
import string

import time

import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples

from utils import (cosine_similarity, get_dict, process_tweet)
from os import getcwd

import w4_unittest

1. The word embeddings data for English and French words

Write a program that translates English into French.

1.1 The data

The subset of data

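The full English word-embedding dataset is about 3.64 GB, and the French one is about 629 MB. To reduce the workload, we load only the subset of embeddings for the words we actually need: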

en_embeddings.p

fr_embeddings.p

python

en_embeddings_subset = pickle.load(open("./data/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("./data/fr_embeddings.p", "rb"))

en_embeddings_subset: each key is an English word and each value is a 300-dimensional array holding that word's embedding, for example:

'the': array([ 0.08007812, 0.10498047, 0.04980469, 0.0534668 , -0.06738281, ...

fr_embeddings_subset: each key is a French word and each value is a 300-dimensional array holding that word's embedding, for example:

'la': array([-6.18250e-03, -9.43867e-04, -8.82648e-03, 3.24623e-02,...

Load two dictionaries

Load two dictionaries that map English words to French words:

A training dictionary
A test dictionary

python

# loading the english to french dictionaries
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_test))

Output:

The length of the English to French training dictionary is 5000

The length of the English to French test dictionary is 1500

en_fr_train is a dictionary whose keys are English words and whose values are the French translations of those words. For example:

{'the': 'la',
'and': 'et',
'was': 'était',
'for': 'pour',
en_fr_test is similar to en_fr_train, but it is the test set.

1.2 Generate embedding and transform matrices

Exercise 1: Translating English dictionary to French by using embeddings

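Implement the function get_matrices, which takes the loaded data and returns the matrices X and Y.

Input:

en_fr: the English-to-French dictionary
en_embeddings: the dictionary of English word embeddings
fr_embeddings: the dictionary of French word embeddings

Output:

Matrices X and Y, where each row of X is the embedding of an English word and the same row of Y is the embedding of the French translation of that word.

Use the en_fr dictionary to make sure that row i of X corresponds to row i of Y.

python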

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output:
        X: a matrix where each row is an English embedding.
        Y: a matrix where each row is the French embedding of the word in the same row of X.
    """

    ### START CODE HERE ###

    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the dictionary) and store in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the dictionary) and store in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y

Use get_matrices() to build the training matrices X_train and Y_train, which hold the English and French embeddings of the training word pairs.

python

# getting the training set:
X_train, Y_train = get_matrices(
    en_fr_train, fr_embeddings_subset, en_embeddings_subset)

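In plain terms:

1. Every English and French word has an embedding, but the full sets are too large and we do not need all of them, so we first take the subsets we need: en_embeddings_subset for English and fr_embeddings_subset for French.

2. Load the English-French word pairs: the training set en_fr_train and the test set en_fr_test.

3. Words cannot be used for training directly, so get_matrices goes through the English-French pairs in en_fr_train one by one, looks up each word's embedding in en_embeddings_subset and fr_embeddings_subset, and stores the results in X and Y respectively.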
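The row alignment described in step 3 can be sketched on toy data. The 2-dimensional embeddings, the three-word dictionary and the helper name `toy_get_matrices` are all made up for illustration; they are not part of the assignment data.

```python
import numpy as np

# Hypothetical toy data: 2-dimensional embeddings instead of 300-dimensional ones
toy_en_vecs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
toy_fr_vecs = {"chat": np.array([0.9, 0.1]), "chien": np.array([0.1, 0.9])}
# "bird" and "oiseau" have no embeddings, so this pair must be skipped
toy_en_fr = {"cat": "chat", "dog": "chien", "bird": "oiseau"}


def toy_get_matrices(en_fr, fr_vecs, en_vecs):
    """Stack the embeddings of word pairs that exist in both vocabularies."""
    pairs = [(en_vecs[en], fr_vecs[fr]) for en, fr in en_fr.items()
             if en in en_vecs and fr in fr_vecs]
    X = np.vstack([en_vec for en_vec, _ in pairs])
    Y = np.vstack([fr_vec for _, fr_vec in pairs])
    return X, Y


X_toy, Y_toy = toy_get_matrices(toy_en_fr, toy_fr_vecs, toy_en_vecs)
print(X_toy.shape, Y_toy.shape)
```

Row 0 of `X_toy` holds "cat" and row 0 of `Y_toy` holds "chat", so the two matrices stay aligned row by row even though one pair was dropped.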

2. Translations

2.1 Translation as linear transformation of embeddings

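Given the English and French word-embedding dictionaries, create a transformation matrix $\mathbf{R}$ that does the following:

Take an English word embedding $\mathbf{e}$ and multiply it by the transformation matrix to get $\mathbf{eR}$, which is a new embedding $\mathbf{f}$.
Both $\mathbf{e}$ and $\mathbf{f}$ are row vectors.

Then find the nearest neighbors of $\mathbf{f}$ among the French embeddings and take the word most similar to the transformed embedding.

We want the transformation matrix $\mathbf{R}$ to minimize:
$$\arg\min_{\mathbf{R}} \|\mathbf{XR}-\mathbf{Y}\|_{F} \tag{1}$$

The subscript F denotes the Frobenius norm (see the lab that accompanies this lecture for the full definition). In short, for a matrix $\mathbf{A}$ of dimension $m \times n$, the Frobenius norm is:
$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|a_{ij}\right|^{2}} \tag{2}$$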
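As a quick numeric check of definition (2), the sketch below computes the Frobenius norm of a small made-up matrix by hand and compares it with NumPy's built-in `np.linalg.norm`. The matrix `A` is an arbitrary example.

```python
import numpy as np

# Arbitrary example matrix: four entries, each equal to 2
A = np.array([[2.0, 2.0],
              [2.0, 2.0]])

# Definition (2): square root of the sum of the squared entries
frob_manual = np.sqrt(np.sum(np.square(A)))

# NumPy's built-in Frobenius norm, for comparison
frob_numpy = np.linalg.norm(A, ord="fro")

print(frob_manual, frob_numpy)
```

Here the sum of squares is 16, so both values are 4.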

Actual loss function

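In practice the loss is simplified in two steps. We start from the Frobenius norm itself:
$$L_F=\|\mathbf{XR}-\mathbf{Y}\|_{F}$$

The squared Frobenius norm is often used instead, because it removes the square root and makes the partial derivatives needed for gradient descent simpler:
$$L_{F^2}=\|\mathbf{XR}-\mathbf{Y}\|_{F}^2$$

This value is then divided by the number of samples $m$ (the number of rows of $\mathbf{X}$):
$$L_{F^2}/m=\frac{\|\mathbf{XR}-\mathbf{Y}\|_{F}^2}{m}$$

Dividing by $m$ normalizes the loss so that its value does not depend on the number of samples, which makes comparisons across datasets of different sizes fairer.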


Exercise 2: Implementing translation mechanism described in this section

Compute the loss, using the loss function:
$$L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(a_{ij}\right)^{2}$$

where $a_{ij}$ is the entry in row $i$ and column $j$ of the matrix $\mathbf{XR}-\mathbf{Y}$.
The main steps of compute_loss() are:

Compute the approximation of Y by matrix-multiplying X and R
Compute the difference XR - Y
Compute the squared Frobenius norm of the difference and divide it by $m$.

python

# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a scalar - the value of the loss function for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]
        
    # diff is XR - Y    
    diff = np.dot(X,R)-Y

    # diff_squared is the element-wise square of the difference    
    diff_squared = np.square(diff)

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss is sum_diff_squared divided by the number of examples (m)
    loss = sum_diff_squared/m
    ### END CODE HERE ###
    return loss

Test:

python

# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
print(f"Expected loss for an experiment with random matrices: {compute_loss(X, Y, R):.4f}" )

Exercise 3

Computing the gradient of loss in respect to transform matrix R

Key points of the gradient computation:

Compute the gradient of the loss with respect to the transformation matrix R.
The gradient is a matrix that describes how much a small change in R changes the loss.
The gradient gives the direction in which to update R in order to reduce the loss.
$m$ is the number of training examples (the number of rows of $X$).
The gradient of the loss $L(X, Y, R)$ with respect to $R$ is:
$$\frac{d}{dR}L(X, Y, R)=\frac{d}{dR}\Big(\frac{1}{m}\|XR-Y\|_{F}^{2}\Big)=\frac{2}{m}X^{T}(XR-Y)$$
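One way to gain confidence in the formula $\frac{2}{m}X^{T}(XR-Y)$ is to compare it with a finite-difference estimate. The sketch below is only a sanity check on random matrices; the helper names `loss` and `grad` and the matrix sizes are chosen for illustration.

```python
import numpy as np


def loss(X, Y, R):
    """Squared Frobenius norm of XR - Y, divided by the number of rows m."""
    return np.sum(np.square(X @ R - Y)) / X.shape[0]


def grad(X, Y, R):
    """Analytic gradient of the loss with respect to R: (2/m) X^T (XR - Y)."""
    return (2 / X.shape[0]) * X.T @ (X @ R - Y)


rng = np.random.default_rng(0)
X = rng.random((6, 3))
Y = rng.random((6, 3))
R = rng.random((3, 3))

# Central finite differences: perturb one entry of R at a time
eps = 1e-6
numeric = np.zeros_like(R)
for i in range(R.shape[0]):
    for j in range(R.shape[1]):
        R_plus = R.copy()
        R_plus[i, j] += eps
        R_minus = R.copy()
        R_minus[i, j] -= eps
        numeric[i, j] = (loss(X, Y, R_plus) - loss(X, Y, R_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - grad(X, Y, R)))
print(max_abs_diff)
```

Because the loss is quadratic in R, the central difference matches the analytic gradient up to floating-point rounding, so `max_abs_diff` should be very small.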

python

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        g: a matrix of dimension (n,n) - gradient of the loss function L for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # gradient is X^T(XR - Y) * 2/m    
    gradient = np.dot(np.transpose(X),(np.dot(X,R)-Y))* 2/m    
    
    ### END CODE HERE ###
    return gradient

Test:

python

# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
gradient = compute_gradient(X, Y, R)
print(f"First row of the gradient matrix: {gradient[0]}")

Finding the optimal R with gradient descent algorithm

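Gradient descent is an iterative algorithm for finding the optimum of a function.

As noted earlier, the gradient of the loss with respect to the matrix describes how much a small change in each entry of the matrix changes the loss.
Gradient descent uses this information to update the matrix R iteratively until the loss reaches a minimum.

Gradient descent needs a stopping rule. Beginners can train for a fixed number of iterations instead of iterating until the loss falls below a threshold, for the following reasons:

1. Lowering the training loss is not the real goal. What we want is a low validation loss, or a high validation accuracy. An "early stopping" criterion is commonly used to stop training before overfitting sets in. However, see point 2.

2. On larger datasets, the training loss of a well-regularized model may never stop decreasing. In NLP in particular, training can run for months while the training loss keeps going down slowly, which is why it is hard to stop at a particular threshold.

3. A fixed number of iterations bounds the training time. It also makes it easy to experiment with hyperparameters, for example trying different learning rates to see whether performance improves.

Pseudocode:

1. Compute the gradient $g$ of the loss with respect to the matrix $R$.

2. Update the matrix $R$ with learning rate $\alpha$:
$$R_{\text{new}}=R_{\text{old}}-\alpha g$$

The effect of the learning rate on gradient descent is not discussed here; the default used below is learning_rate $=0.0003$.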
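The two pseudocode steps can be sketched on a small synthetic problem. Everything here is an illustrative assumption: the matrices are random, `Y` is generated from a known `R_true`, and the learning rate of 0.1 is picked for this toy data only, not for the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 3))        # 20 toy "English" vectors of dimension 3
R_true = rng.random((3, 3))    # hypothetical ground-truth transformation
Y = X @ R_true                 # toy "French" vectors generated from X

R = np.zeros((3, 3))           # start from an all-zero matrix
learning_rate = 0.1            # chosen for this toy problem only
losses = []

for step in range(200):
    diff = X @ R - Y
    losses.append(np.sum(diff ** 2) / X.shape[0])   # loss before the update
    gradient = (2 / X.shape[0]) * X.T @ diff        # step 1: compute g
    R -= learning_rate * gradient                   # step 2: R_new = R_old - alpha * g

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

With a step size this small relative to the scale of `X`, the recorded loss decreases at every iteration; a learning rate that is too large would make it oscillate or diverge instead.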

Exercise 4

Implement align_embeddings() using the compute_gradient() function defined above.

python

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003, verbose=True, compute_loss=compute_loss, compute_gradient=compute_gradient):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        train_steps: positive int - the number of gradient descent steps to run.
        learning_rate: positive float - the step size used by gradient descent.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R -Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in th  word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if verbose and i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R -= learning_rate * gradient
        ### END CODE HERE ###
    return R

Test:

python

# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)

Calculate transformation matrix R

python

R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)

Output:

loss at iteration 0 is: 963.0146

loss at iteration 25 is: 97.8292

loss at iteration 50 is: 26.8329

loss at iteration 75 is: 9.7893

loss at iteration 100 is: 4.3776

loss at iteration 125 is: 2.3281

loss at iteration 150 is: 1.4480

loss at iteration 175 is: 1.0338

loss at iteration 200 is: 0.8251

loss at iteration 225 is: 0.7145

loss at iteration 250 is: 0.6534

loss at iteration 275 is: 0.6185

loss at iteration 300 is: 0.5981

loss at iteration 325 is: 0.5858

loss at iteration 350 is: 0.5782

loss at iteration 375 is: 0.5735

2.2 Testing the translation

k-Nearest neighbors algorithm

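Introduction to k-NN

k-NN takes a vector as input and finds the vectors in the dataset that are closest to it.
"k" is the number of nearest neighbors to find (for example, k=2 finds the two nearest neighbors).

Because the linear transformation matrix $\mathbf{R}$ only approximates the mapping from English embeddings to French embeddings, transforming the vector $\mathbf{e}$ of an English word into the French vector space usually does not give the exact embedding of a French word. We therefore use 1-NN with $\mathbf{eR}$ as input to find the closest embedding $\mathbf{f}$ among the rows of $\mathbf{Y}$.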

Cosine similarity

Closeness between vectors is measured with cosine similarity, which was covered earlier:
$$\cos(u,v)=\frac{u\cdot v}{\|u\|\,\|v\|}$$
where $u$ and $v$ are two vectors.

When $u$ and $v$ lie on the same line and point in the same direction, $\cos(u,v)=1$.
When $u$ and $v$ lie on the same line and point in opposite directions, $\cos(u,v)=-1$.
When the vectors are orthogonal (perpendicular), $\cos(u,v)=0$.

Note: distance and similarity move in opposite directions.

A distance metric can be derived from cosine similarity, but cosine similarity cannot be used directly as a distance.
As cosine similarity increases (approaches 1), the "distance" between the two vectors decreases (approaches 0).
The cosine distance between $u$ and $v$ can be defined as:
$$d_{\text{cos}}(u,v)=1-\cos(u,v)$$
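The three reference cases and the cosine distance can be checked numerically. This is a minimal sketch with a locally defined `cos_sim` helper; it is not the `cosine_similarity` function imported from `utils`, whose implementation is not shown here.

```python
import numpy as np


def cos_sim(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


u = np.array([1.0, 2.0])

same = cos_sim(u, 3 * u)                         # same line, same direction
opposite = cos_sim(u, -u)                        # same line, opposite direction
orthogonal = cos_sim(u, np.array([-2.0, 1.0]))   # perpendicular vectors

# Cosine distance: 0 for identical directions, 2 for opposite directions
dist_same = 1 - same
dist_opposite = 1 - opposite

print(same, opposite, orthogonal, dist_same, dist_opposite)
```

The distance grows from 0 to 2 as the similarity falls from 1 to -1, which is the inverse relationship described above.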

Exercise 5

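Complete the function nearest_neighbor().

Input:

A vector v
A set of candidate vectors in which to look for neighbors
k, the number of nearest neighbors to find (default 1)
The distance measure, which by default is based on cosine similarity

Output:

The indices of the k most similar candidate vectors

The cosine_similarity function is already implemented and imported. It takes two vectors and returns the cosine of the angle between them.
Iterate over the rows of candidates and store the similarity between each row and the vector v in a Python list. Note that the similarities are in the same order as the rows of candidates.

python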

# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1, cosine_similarity=cosine_similarity):
    """
    Input:
      - v, the vector you are going find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE ###
    similarity_l = []

    # for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(v,row)

        # append the similarity to the list
        similarity_l.append(cos_similarity)

    # sort the similarity list and get the indices of the sorted list    
    sorted_ids = np.argsort(similarity_l)
    
    # Reverse the order of the sorted_ids array
    sorted_ids = np.flip(sorted_ids)
    
    # get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[0:k]
    ### END CODE HERE ###
    return k_idx

Test:

python

# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Test your implementation:
v = np.array([1, 0, 1])
candidates = np.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
print(candidates[nearest_neighbor(v, candidates, 3)])

Output:

[[2 0 1]
 [1 0 5]
 [9 9 9]]

Exercise 6: Test your translation and compute its accuracy

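Complete the function test_vocabulary, which takes the English embedding matrix $X$, the French embedding matrix $Y$ and the matrix $R$, and computes the accuracy of the translation from $X$ to $Y$ through $R$.

Iterate over the transformed English embeddings and check whether the closest French embedding is the one that belongs to the actual translation.
Use nearest_neighbor (with k=1) to get the index of the closest French embedding and compare it with the index of the English embedding that was just transformed.
Count the number of correct translations.

$$\text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})}$$

python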

# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R, nearest_neighbor=nearest_neighbor):
    '''
    Input:
        X: a matrix where each row is an English embedding.
        Y: a matrix where each row is the corresponding French embedding.
        R: the transform matrix which translates word embeddings from
        English to French word vector space.
    Output:
        accuracy: fraction of English words whose nearest French neighbor is the correct translation
    '''

    ### START CODE HERE ###
    # The prediction is X times R
    pred = np.dot(X,R)

    # initialize the number correct to zero
    num_correct = 0

    # loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i], Y)

        # if the index of the nearest neighbor equals the row of i... \
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1

    # accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct/len(pred)

    ### END CODE HERE ###

    return accuracy

Compute the accuracy on the test set:

python

X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)
acc = test_vocabulary(X_val, Y_val, R_train)  # this might take a minute or two
print(f"accuracy on test set is {acc:.3f}")

Output:

accuracy on test set is 0.557

3. LSH and document search

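This section uses locality-sensitive hashing (LSH) to implement a more efficient version of k-nearest neighbors and applies it to document search.

Process the tweets and represent each tweet as a vector (a document embedding).
Use locality-sensitive hashing and k-nearest neighbors to find tweets that are similar to a given tweet.

First, load the Twitter dataset:

python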

# get the positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_tweets = all_positive_tweets + all_negative_tweets

3.1 Getting the document embeddings

Bag-of-words (BOW) document models

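A text document is a sequence of words.

Word order matters. For example, "Apple pie is better than pepperoni pizza" and "Pepperoni pizza is better than apple pie" have opposite meanings because the words are ordered differently.
In some applications, however, ignoring word order still gives a model that is both efficient and effective. This approach is called the bag-of-words document model.

Document embeddings

A document vector is created by summing the embeddings of all the words in the document.
If the embedding of a word is unknown, that word is ignored.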
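The summation rule can be illustrated with a tiny made-up vocabulary. The 3-dimensional embeddings and the token list below are hypothetical; real tweets would first go through `process_tweet` and use 300-dimensional vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for two known words
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "morning": np.array([0.5, 1.0, 0.0]),
}

# "zzz" has no embedding and should not contribute to the sum
tokens = ["good", "morning", "zzz"]

doc_vec = np.zeros(3)
for token in tokens:
    if token in toy_embeddings:
        doc_vec += toy_embeddings[token]

print(doc_vec)
```

Because the vectors are only summed, any reordering of `tokens` gives the same document vector, which is exactly the bag-of-words assumption.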

Exercise 7

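Complete the get_document_embedding() function.

get_document_embedding() encodes an entire document as a "document vector".
It takes a document (a string) and a dictionary, en_embeddings.
It processes the document and looks up the embedding of each word.
It then sums these embeddings and returns the sum for all the words in the processed tweet.

python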

# UNQ_C12 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_embedding(tweet, en_embeddings, process_tweet=process_tweet):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = np.zeros(300)

    ### START CODE HERE ###
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding +=en_embeddings.get(word,0)
    ### END CODE HERE ###
    return doc_embedding

The code uses the dictionary's get method to skip unknown words: a missing word contributes 0 to the sum.

Test:

python

# testing your function
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
tweet_embedding = get_doc

Output:

array([-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184])

Exercise 8: Store all document vectors into a dictionary

Store the vector representations of all documents (tweets) in a dictionary:

python

# UNQ_C14 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_vecs(all_docs, en_embeddings, get_document_embedding=get_document_embedding):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''

    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):

        ### START CODE HERE ###
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)

        ### END CODE HERE ###

    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict

Test:

python

document_vecs, ind2Tweet = get_document_vecs(all_tweets, en_embeddings_subset)

print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")

Output:

length of dictionary 10000

shape of document_vecs (10000, 300)

3.2 Looking up the tweets

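We now have a matrix of dimension $(m, d)$, where m is the number of tweets (10,000) and d is the embedding dimension (300). You can now enter a tweet and use cosine similarity to see which tweet in the corpus is most similar to it.

python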

my_tweet = 'i am sad'
process_tweet(my_tweet)
tweet_embedding = get_document_embedding(my_tweet, en_embeddings_subset)

# this gives you a similar tweet as your input.
# this implementation is vectorized...
idx = np.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(all_tweets[idx])

Output:

@zoeeylim sad sad sad kid 😦 it's ok I help you watch the match HAHAHAHAHA
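The lookup above passes a whole matrix to `cosine_similarity`, so it relies on a vectorized implementation. The `utils` version is not shown in this write-up; the sketch below is one possible way to write such a function, under the assumption that rows with zero norm (for example tweets with no known words) should get similarity 0. The name `cosine_similarity_rows` and the toy data are hypothetical.

```python
import numpy as np


def cosine_similarity_rows(M, v):
    """Cosine similarity between every row of M and the vector v."""
    M = np.asarray(M, dtype=float)
    v = np.asarray(v, dtype=float)
    dots = M @ v
    norms = np.linalg.norm(M, axis=1) * np.linalg.norm(v)
    # Rows with zero norm get similarity 0 instead of a division by zero
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


# Toy "document vectors": the third row is all zeros on purpose
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0],
                 [2.0, 2.0]])
query = np.array([1.0, 1.0])

sims = cosine_similarity_rows(docs, query)
best = int(np.argmax(sims))
print(sims, best)
```

The last row points in the same direction as the query, so it is selected even though its length differs.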

3.3 Finding the most similar tweets with LSH

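Implement locality-sensitive hashing (LSH) to identify the most similar tweets. Instead of looking at all 10,000 vectors, we search only a subset to find a vector's nearest neighbors.

The vector space can be divided into regions, and the neighbors of a given vector are then searched for within a single region.

Choosing the number of planes:

Each plane divides the space into two parts, so $n$ planes divide the space into $2^n$ parts (buckets).

To distribute 10,000 document vectors over buckets of about 16 vectors each, the number of buckets needed is $\frac{10000}{16}=625$.

We therefore need $2^n=625$, that is, $n=\log_2 625\approx 9.29$, which is rounded up to $10$.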
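The same arithmetic can be written out in a few lines. The variable names are illustrative; the numbers are the ones used above.

```python
import math

n_docs = 10_000
target_per_bucket = 16

n_buckets = n_docs / target_per_bucket    # 625.0 buckets wanted
n_planes_exact = math.log2(n_buckets)     # about 9.29
n_planes = math.ceil(n_planes_exact)      # rounded up to a whole number of planes

# Rounding up gives more buckets than requested, so the average bucket is smaller
avg_per_bucket = n_docs / 2 ** n_planes

print(n_buckets, round(n_planes_exact, 2), n_planes, round(avg_per_bucket, 2))
```

Note that rounding up to 10 planes gives 1024 buckets, so the average bucket holds fewer than the 16 vectors originally targeted.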

python

# The number of planes. We use log2(625) ~ 9.29, rounded up to 10 (target: ~16 vectors/bucket).
N_PLANES = 10
# Number of times to repeat the hashing to improve the search.
N_UNIVERSES = 25

3.4 Getting the hash number for a vector

Each vector needs a number (its hash value) so that it can be assigned to a "hash bucket".

Hyperplanes in vector spaces

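In a 3-dimensional vector space, a hyperplane is an ordinary plane. In a 2-dimensional vector space, a hyperplane is a line.
In general, a hyperplane is a subspace whose dimension is one less than that of the original vector space.
A hyperplane is uniquely defined by its normal vector.
The normal vector $n$ of a plane $\pi$ is the vector that is orthogonal to every vector in $\pi$ (perpendicular, in the 3-dimensional case).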

Using Hyperplanes to split the vector space

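A hyperplane can be used to split the vector space into two parts.

All vectors whose dot product with the plane's normal vector is positive lie on one side of the plane.

All vectors whose dot product with the plane's normal vector is negative lie on the other side of the plane.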
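A small 2-dimensional sketch of this rule, where the "hyperplane" is a line through the origin. The normal vector and the three points are made up for illustration.

```python
import numpy as np

# Normal vector of a hypothetical hyperplane (a line through the origin in 2-D)
normal = np.array([1.0, 1.0])

points = np.array([[2.0, 1.0],     # same side as the normal vector
                   [-1.0, -3.0],   # opposite side
                   [1.0, -1.0]])   # lies exactly on the hyperplane

dots = points @ normal             # one dot product per point
sides = np.sign(dots)              # +1, -1, or 0 for a point on the plane

# One-bit encoding: 1 when the dot product is non-negative, else 0
bits = (dots >= 0).astype(int)

print(dots, sides, bits)
```

Treating a dot product of exactly 0 as positive is a convention; it matches the `>= 0` comparison used in hash_value_of_vector later in this section.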

Encoding hash buckets

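For a vector, we can compute its dot product with every plane and then encode this information to assign the vector to a hash bucket.
When the vector points to the side of the hyperplane opposite to the normal vector, encode it as 0.
Otherwise, if the vector is on the same side as the normal vector, encode it as 1.
If the dot products with the planes are always computed in the same order, each vector's hash ID is encoded as a binary number, such as [0, 1, 1, ... 0].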

Exercise 9: Implementing hash buckets

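The sets of planes behind all the hash tables are already initialized (as the list planes_l in the code below). The list consists of N_UNIVERSES matrices, and each matrix describes its own hash table. Each matrix has N_DIMS rows and N_PLANES columns. Every column is an N_DIMS-dimensional normal vector for one of the N_PLANES hyperplanes used to create the buckets of that particular hash table.

Task: complete the function hash_value_of_vector, which places the vector v into the correct hash bucket.

First, multiply the vector v by the corresponding planes. This gives a vector of dimension $(1, \text{N\_planes})$.
Then convert each element of that vector to 0 or 1 (0 if the element is negative, 1 otherwise).
Then compute the hash value of the vector by iterating over the N_PLANES bits:
multiply $2^i$ by the corresponding bit (0 or 1),
and store the sum in the variable hash_value.

$$hash=\sum_{i=0}^{N-1}\left(2^{i}\times h_{i}\right)$$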
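The formula can be tried on a handful of made-up dot products. This vectorized helper, `hash_from_dots`, is a hypothetical illustration of the same sum; it is not the graded hash_value_of_vector function.

```python
import numpy as np


def hash_from_dots(dot_products):
    """Turn signed dot products into bits h_i and sum 2^i * h_i."""
    bits = (np.asarray(dot_products) >= 0).astype(int)   # h_i is 1 for non-negative values
    powers = 2 ** np.arange(bits.size)                   # 1, 2, 4, ...
    return int(np.sum(bits * powers))


# Bits are [1, 0, 1], so the hash is 1*1 + 0*2 + 1*4
h = hash_from_dots([0.5, -1.2, 3.0])
print(h)
```

With N planes the result always falls in the range 0 to $2^N-1$, one value per bucket.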

Create the sets of planes

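Create 25 sets of planes with 10 planes each. The 25 sets act like 25 parallel universes: 25 different ways of partitioning the vector space.

Each element of the list planes_l is a matrix with 300 rows (the word embeddings have 300 dimensions) and 10 columns (each "universe" has 10 planes).

python

N_DIMS = 300  # embedding dimension (assumed from the 300-dimensional vectors above; not set earlier in this write-up)
np.random.seed(0)
planes_l = [np.random.normal(size=(N_DIMS, N_PLANES))
            for _ in range(N_UNIVERSES)]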

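Example

Suppose:

N_DIMS = 3 (three-dimensional space)

N_PLANES = 2 (2 planes per universe)

N_UNIVERSES = 2 (2 universes in total)

The code generates a list with the following schematic structure, one entry per universe. In the actual arrays each universe is an (N_DIMS, N_PLANES) matrix, so every normal vector is stored as a column:

python

[
  [ [normal_vector_1_universe_1], [normal_vector_2_universe_1] ],
  [ [normal_vector_1_universe_2], [normal_vector_2_universe_2] ]
]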
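To make the shapes in this small example concrete, the sketch below builds the toy list and pulls out one normal vector. It uses `np.random.default_rng` and the name `toy_planes_l` for illustration, so the random values differ from those produced by the assignment's `np.random.seed(0)` code.

```python
import numpy as np

N_DIMS, N_PLANES, N_UNIVERSES = 3, 2, 2   # toy sizes from the example above

rng = np.random.default_rng(0)
toy_planes_l = [rng.normal(size=(N_DIMS, N_PLANES)) for _ in range(N_UNIVERSES)]

# Each universe is one (N_DIMS, N_PLANES) matrix
universe_0 = toy_planes_l[0]

# Column 0 of that matrix is the normal vector of the first plane
first_normal = universe_0[:, 0]

print(len(toy_planes_l), universe_0.shape, first_normal.shape)
```

Storing the normals as columns is what lets a (1, N_DIMS) vector be multiplied by the matrix to get all N_PLANES dot products at once.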

python

# UNQ_C17 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def hash_value_of_vector(v, planes):
    """Create a hash for a vector, using the given set of planes.
    Input:
        - v:  vector of tweet. Its dimension is (1, N_DIMS)
        - planes: matrix of dimension (N_DIMS, N_PLANES) - the set of planes that divide up the region
    Output:
        - res: a number which is used as a hash for your vector

    """
    ### START CODE HERE ###
    # for the set of planes,
    # calculate the dot product between the vector and the matrix containing the planes
    # remember that planes has shape (300, 10)
    # The dot product will have the shape (1,10)    
    dot_product = np.dot(v, planes)
        
    # get the sign of the dot product (1,10) shaped vector
    sign_of_dot_product = np.sign(dot_product)

    # set h to be false (eqivalent to 0 when used in operations) if the sign is negative,
    # and true (equivalent to 1) if the sign is positive (1,10) shaped vector
    # if the sign is 0, i.e. the vector is in the plane, consider the sign to be positive
    h = sign_of_dot_product>=0

    # remove extra un-used dimensions (convert this from a 2D to a 1D array)
    h = np.squeeze(h)

    # initialize the hash value to 0
    hash_value = 0

    n_planes = planes.shape[1]
    for i in range(n_planes):
        # increment the hash value by 2^i * h_i        
        hash_value += (2**i) * h[i]
        
    ### END CODE HERE ###

    # cast hash_value as an integer
    hash_value = int(hash_value)

    return hash_value

Test:

python

# UNQ_C18 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

np.random.seed(0)
idx = 0
planes = planes_l[idx]  # get one 'universe' of planes to test the function
vec = np.random.rand(1, 300)
print(f" The hash value for this vector,",
      f"and the set of planes at index {idx},",
      f"is {hash_value_of_vector(vec, planes)}")

Output:

The hash value for this vector, and the set of planes at index 0, is 768

3.5 Creating a hash table

Exercise 10

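Now that every vector (tweet) has a hash value, we need to build a hash table so that, given a hash ID, the corresponding vectors can be looked up quickly. This greatly reduces search time.

Complete the make_hash_table function, which maps each tweet vector to a bucket and stores the vector in that bucket. It returns hash_table and id_table. The id_table tells us which tweet each vector in a bucket corresponds to.

python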

# UNQ_C19 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# This is the code used to create a hash table: feel free to read over it
def make_hash_table(vecs, planes, hash_value_of_vector=hash_value_of_vector):
    """
    Input:
        - vecs: list of vectors to be hashed.
        - planes: the matrix of planes in a single "universe", with shape (embedding dimensions, number of planes).
    Output:
        - hash_table: dictionary - keys are hashes, values are lists of vectors (hash buckets)
        - id_table: dictionary - keys are hashes, values are list of vectors id's
                            (it's used to know which tweet corresponds to the hashed vector)
    """
    ### START CODE HERE ###

    # number of planes is the number of columns in the planes matrix
    num_of_planes = planes.shape[1]

    # number of buckets is 2^(number of planes)    
    num_buckets = 2**num_of_planes

    # create the hash table as a dictionary.
    # Keys are integers (0,1,2.. number of buckets)
    # Values are empty lists
    hash_table = {i:[] for i in range(num_buckets)}

    # create the id table as a dictionary.
    # Keys are integers (0,1,2... number of buckets)
    # Values are empty lists
    id_table = {i:[] for i in range(num_buckets)}

    # for each vector in 'vecs'
    for i, v in enumerate(vecs):
        # calculate the hash value for the vector
        h = hash_value_of_vector(v, planes)

        # store the vector into hash_table at key h,
        # by appending the vector v to the list at key h
        hash_table[h].append(v)

        # store the vector's index 'i' (each document is given a unique integer 0,1,2...)
        # the key is the h, and the 'i' is appended to the list at key h
        id_table[h].append(i)

    ### END CODE HERE ###

    return hash_table, id_table

Test:

python

planes = planes_l[0]  # get one 'universe' of planes to test the function
tmp_hash_table, tmp_id_table = make_hash_table(document_vecs, planes)

print(f"The hash table at key 0 has {len(tmp_hash_table[0])} document vectors")
print(f"The id table at key 0 has {len(tmp_id_table[0])}")
print(f"The first 5 document indices stored at key 0 of are {tmp_id_table[0][0:5]}")

Output:

The hash table at key 0 has 3 document vectors

The id table at key 0 has 3

The first 5 document indices stored at key 0 of are [3276, 3281, 3282]

3.6 Creating all hash tables

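Create the hash tables for all parallel universes. The function below returns two lists, hash_tables and id_tables, which contain the hash tables and ID tables of all universes.

python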

# Creating the hashtables
def create_hash_id_tables(n_universes):
    hash_tables = []
    id_tables = []
    for universe_id in range(n_universes):  # loop over the universes (25 here)
        print('working on hash universe #:', universe_id)
        planes = planes_l[universe_id]  # the matrix of 10 plane normal vectors for this universe
        hash_table, id_table = make_hash_table(document_vecs, planes)  # build the hash table and ID table for this universe
        hash_tables.append(hash_table)
        id_tables.append(id_table)
    
    return hash_tables, id_tables

hash_tables, id_tables = create_hash_id_tables(N_UNIVERSES)

Exercise 11: Approximate K-NN

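Use locality-sensitive hashing to implement approximate k-nearest neighbors and search for documents similar to the document at index doc_id.

Input

doc_id is an index into the document list all_tweets.
v is the document vector of the tweet at index doc_id in all_tweets.
planes_l is the list of planes (the global variable created earlier).
k is the number of nearest neighbors to search for.
num_universes_to_use: to save time, we can use fewer than the total number of available universes (25 here).
hash_tables: the list of hash tables, one per universe.
id_tables: the list of id tables, one per universe.

The approximate_knn function finds the subset of candidate vectors that fall in the same "hash bucket" as the input vector v. It then runs the regular k-nearest-neighbors search on that subset instead of on all 10,000 tweets.

python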

# UNQ_C21 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# This is the code used to do the fast nearest neighbor search. Feel free to go over it
def approximate_knn(doc_id, v, planes_l, hash_tables, id_tables, k=1, num_universes_to_use=25, hash_value_of_vector=hash_value_of_vector):
    """Search for k-NN using hashes."""
    #assert num_universes_to_use <= N_UNIVERSES

    # Vectors that will be checked as possible nearest neighbor
    vecs_to_consider_l = list()

    # list of document IDs
    ids_to_consider_l = list()

    # create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()

    # loop through the universes of planes
    for universe_id in range(num_universes_to_use):

        # get the set of planes from the planes_l list, for this particular universe_id
        planes = planes_l[universe_id]

        # get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)

        # get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]

        # get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors_l = hash_table[hash_value]

        # get the id_table for this particular universe_id
        id_table = id_tables[universe_id]

        # get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]

        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

        # loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):
            
            if doc_id == new_id:
                continue

            # if the document ID is not yet in the set ids_to_consider...
            if new_id not in ids_to_consider_set:
                # access document_vectors_l list at index i to get the embedding
                # then append it to the list of vectors to consider as possible nearest neighbors
                document_vector_at_i = document_vectors_l[i]
                vecs_to_consider_l.append(document_vector_at_i)

                # append the new_id (the index for the document) to the list of ids to consider
                ids_to_consider_l.append(new_id)

                # also add the new_id to the set of ids to consider
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)

        ### END CODE HERE ###

    # Now run k-NN on the smaller set of vecs-to-consider.
    print("Fast considering %d vecs" % len(vecs_to_consider_l))

    # convert the vecs to consider set to a list, then to a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)

    # Use the nearest neighbor index list as indices into the ids to consider
    # create a list of nearest neighbors by the document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx]
                            for idx in nearest_neighbor_idx_l]

    return nearest_neighbor_ids

Test:

python

#document_vecs, ind2Tweet
doc_id = 0
doc_to_search = all_tweets[doc_id]
vec_to_search = document_vecs[doc_id]

# Sample
nearest_neighbor_ids = approximate_knn(
    doc_id, vec_to_search, planes_l, hash_tables, id_tables, k=3, num_universes_to_use=5)

Output:

Fast considering 77 vecs

Print the matching tweets:

python

print(f"Nearest neighbors for document {doc_id}")
print(f"Document contents: {doc_to_search}")
print("")

for neighbor_id in nearest_neighbor_ids:
    print(f"Nearest neighbor at document id {neighbor_id}")
    print(f"document contents: {all_tweets[neighbor_id]}")