国科大——数据挖掘（0812课程）——考试真题

前沿： 此文章记录了国科大数据挖掘（0812）课程的考试真题。
注：考试可以携带计算器，毕竟某些题需要计算log，比如：决策树等。

2016年

1. Suppose a hospital tested the age and body fat for 18 randomly selected adults with the following result:

1）Use smoothing by bin means to smooth the age data, using a bin depth of 6. Illustrate your steps.(5 points)

2）Partition the age into 3 bins by equal-width partitioning, and use bin boundary to smooth each bin. (5 points)

3）Use min-max normalization to transform the value 49 for age onto the range [ 0.0, 1.0 ]. (5 points)

4）Use z-score normalization to transform the value 41.2 for body fat , where the standard deviation of body fat is 9.25. (5 points)

2. Given a transaction database below, let min_support = 50% and min_confidence = 75%.

3. Given a data set below with three attributes {A, B, C} and two classes {C1, C2}. Build a decision tree, using information gain to select and split attribute. (15 points)

4. Consider the following data set. Use Naïve Bayesian Classifier to predict the class label for a test sample (A=0, B=1, C=0). (10 points)

5. Given a data set of 8 sample points. Perform K-means to generate 3 clusters. Suppose initially we assign point 1,2,3 as the center of each cluster. Note: list the clusters at each iteration. (15 points)

6. Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely T_j: (i₁, ..., i_m}, where T_j, is a transaction identifier, and i_k (1 <=k <= m） is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules (without considering multilevel associations). You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead. (15 points)

2017年

1. Please briefly describe the major type of data mining techniques and their corresponding application.(10 points)

2. What is Normalization? Please describe the major Normalization methods and their corresponding pro and cons.(6 points)

3. How to overcome overfitting in decision tree? (5 points)

4. An e-mail database is a database that scores a large number of electronic mails messages. It can be viewed as a semi-structured database consisting mainly of text data.

a. (8 points) How can such an e-mail database be structured so as to facilitate

multidimensional search, such as by sender, by receiver, by subject, and by

time?

b.(10 points) Suppose you have roughly classified a set of your previous e-mail

messages as junk, unimportant, normal, or important. Describe how a data

mining system may take this as the training set to automatically classify new

e-mail messages or unclassified ones.

5. Given a transaction database below, let min_support = 30% and min_confidence = 70%:

Find all frequent itemsets using FP-growth method. Write up the conditional pattern base for each item, and the conditional FP-tree for each item.(15 points)

6. Figure 1 is a BP ( Backpropagation ) Neural Network. The learning rate l = 0.9, the Bias at every unit is initialized as 0, and the activation function at every unit is

a. Given a training record( x₁, x₂, z ) where the input x₁ = 1, x₂ = 0, and the class label z = 1, and the weights of the connections in the k^th iteration are w₁₁(k) = 0, w₁₂(k) = 2, w₂₁(k) = 2, w₂₂(k) = 1, T₁(k) = 1, T₂(k) = 1, please give z(k) (Please give the calculation formulas and the relevant values). (10 points)

b. Please give the updated weights, w₁₁, w₁₂, w₂₁, w₂₂, following the weight updating formulas. (10 points)

7. Table 1 gives a User-Product rating matrix.

(1) List the top 2 most similar users of user 2 based on Euclidian Distance. (5 points)

(2) Predict User 2's rating for Product 2.(5 points)

8. (16 points) Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely T_j: (i₁, ..., i_m}, where T_j, is a transaction identifier, and i_k (1 <=k <= m） is the identifier of an item purchased in the transaction. Propose an efficient algorithm to mine global association rules (without considering multilevel associations). You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead.

2020年

1. 航空公司希望能够分析在其服务中的常客的旅行趋势，这样可以为公司正确定位航空市场中的常客市场。公司希望能够跟踪不同航线上旅客的季节变化情况和增长，并跟踪在不同航班上所消费的食品和饮料情况，这样可以帮助航空公司安排不同航线上的航班和食品供应。请面向航空公司的任务，设计一个数据仓库的模型。

参考：

2. Suppose that the data for analysis includes the attribute age: 20, 13, 15, 16, 25, 35, 36, 40, 45, 46, 52, 25, 25, 30, 21, 22, 22, 33, 33, 35, 35, 70，19，20.

(1) Use smoothing by bin means to smooth the above data, using a bin depth of 6.

(6 points)

(2) Determine outliers in the data by five-number summary? (4 points)

(3) Use min-max normalization to normalize 33. (4 points)

3. 简答

(1) What is overfitting? (4 points)

(2) How to overcome overfitting in decision tree? (4 points)

(3) Please present an attribute selection method in decision tree. (4 points)

(4) A neural network classifier may consist of multiple hidden layers. How to overcome overfitting in a neural network classifier? (6 points)

4. Given a transaction database below, let min_ support = 50% and min_ confidence = 70%：

(1) Find all the frequent itemsets using FP-growth method. Write up the conditional pattern base for each item, and the conditional FP-tree for each item. (10 points)

(2) Using the resulting frequent itemsets, find all the strong associations in the following rule form:

For any transaction x, buys(x, item1) ^ buys(x, item2) → buys(x, item3) [s=?%, c=?%]. (5 points)

5. Suppose you are given the following ratings by students on four different items, where ? indicates that no rating was given:

(1) List the top 2 most similar students of student 1 based on Euclidian Distance. (6 points)

(2) Assuming we only recommend the top 1 item to student 1, which item would

you recommend, item 2 or item 5? (8 points). (Assume similarity = 1/distance)

6. Use the following similarity matrix to perform AGNES clustering. Show your results by drawing a dendrogram (树状图). The dendrogram should clearly show the order in which the points are merged. (12 points) (The number indicates the similarity between the two points)

7. Data stream is continuous, ordered, fast changing data, possibly infinite. Since the volume is so huge and the pattern changes fast, it requires a "single scan algorithm (can only have one look). Please present a K-means based efficient data stream clustering algorithm, which can not only discover the clusters, but also can provide the evolution of the clusters. Please present your algorithm in pseudo code. (15 points)

2024年（回忆）

1. 航空公司希望能够分析在其服务中的常客的旅行趋势，这样可以为公司正确定位航空市场中的常客市场。公司希望能够跟踪不同航线上旅客的季节变化情况和增长，并跟踪在不同航班上所消费的食品和饮料情况，这样可以帮助航空公司安排不同航线上的航班和食品供应。请面向航空公司的任务，设计一个数据仓库的模型。

注：考的是此题的变形，但大差不差。

参考：

国科大——数据挖掘（0812课程）——考试真题

2016年

1. Suppose a hospital tested the age and body fat for 18 randomly selected adults with the following result:

2. Given a transaction database below, let min_support = 50% and min_confidence = 75%.

3. Given a data set below with three attributes {A, B, C} and two classes {C1, C2}. Build a decision tree, using information gain to select and split attribute. (15 points)

4. Consider the following data set. Use Naïve Bayesian Classifier to predict the class label for a test sample (A=0, B=1, C=0). (10 points)

5. Given a data set of 8 sample points. Perform K-means to generate 3 clusters. Suppose initially we assign point 1,2,3 as the center of each cluster. Note: list the clusters at each iteration. (15 points)

2017年

1. Please briefly describe the major type of data mining techniques and their corresponding application.(10 points)

2. What is Normalization? Please describe the major Normalization methods and their corresponding pro and cons.(6 points)

3. How to overcome overfitting in decision tree? (5 points)

4. An e-mail database is a database that scores a large number of electronic mails messages. It can be viewed as a semi-structured database consisting mainly of text data.

5. Given a transaction database below, let min_support = 30% and min_confidence = 70%:

6. Figure 1 is a BP ( Backpropagation ) Neural Network. The learning rate l = 0.9, the Bias at every unit is initialized as 0, and the activation function at every unit is

7. Table 1 gives a User-Product rating matrix.

2020年

2. Suppose that the data for analysis includes the attribute age: 20, 13, 15, 16, 25, 35, 36, 40, 45, 46, 52, 25, 25, 30, 21, 22, 22, 33, 33, 35, 35, 70，19，20.

3. 简答

4. Given a transaction database below, let min_ support = 50% and min_ confidence = 70%：

5. Suppose you are given the following ratings by students on four different items, where ? indicates that no rating was given:

6. Use the following similarity matrix to perform AGNES clustering. Show your results by drawing a dendrogram (树状图). The dendrogram should clearly show the order in which the points are merged. (12 points) (The number indicates the similarity between the two points)

2024年（回忆）

2. 此题考查的是数据处理，印象是给了五六个数据，然后用分箱的等深和等宽计算平滑，以及某个数的归一化，包括归一化到 [0,1] 和 min-max 归一化。

3. 此题考查的是关联规则，即Apriori 算法或者 FP-growth 算法，作业题的变形。

4. 此题考查的是决策树构建，即计算信息增益，选择最优属性，只让构建了一层，以及朴素贝叶斯对某一样本进行分类，作业题的变形。

5. 此题考查的是推荐系统，即计算与user 1 最相似的2个users，然后计算user 1中确实的rating，也是作业题的变形。

6. 此题考查的是AGNES聚类算法，不过给的是样本在二维坐标系的坐标，需要自行计算相似矩阵，随后进行聚类。

7. 设计算法。利用神经网络模型设计一个论文分类模型。

国科大——数据挖掘（0812课程）——考试真题

2016年

1. Suppose a hospital tested the age and body fat for 18 randomly selected adults with the following result:

2. Given a transaction database below, let min_support = 50% and min_confidence = 75%.

3. Given a data set below with three attributes {A, B, C} and two classes {C1, C2}. Build a decision tree, using information gain to select and split attribute. (15 points)

4. Consider the following data set. Use Naïve Bayesian Classifier to predict the class label for a test sample (A=0, B=1, C=0). (10 points)

5. Given a data set of 8 sample points. Perform K-means to generate 3 clusters. Suppose initially we assign point 1,2,3 as the center of each cluster. Note: list the clusters at each iteration. (15 points)

2017年

1. Please briefly describe the major type of data mining techniques and their corresponding application.(10 points)

2. What is Normalization? Please describe the major Normalization methods and their corresponding pro and cons.(6 points)

3. How to overcome overfitting in decision tree? (5 points)

4. An e-mail database is a database that scores a large number of electronic mails messages. It can be viewed as a semi-structured database consisting mainly of text data.

5. Given a transaction database below, let min_support = 30% and min_confidence = 70%:

6. Figure 1 is a BP ( Backpropagation ) Neural Network. The learning rate l = 0.9, the Bias at every unit is initialized as 0, and the activation function at every unit is

7. Table 1 gives a User-Product rating matrix.

2020年

2. Suppose that the data for analysis includes the attribute age: 20, 13, 15, 16, 25, 35, 36, 40, 45, 46, 52, 25, 25, 30, 21, 22, 22, 33, 33, 35, 35, 70，19，20.

3. 简答

4. Given a transaction database below, let min_ support = 50% and min_ confidence = 70%：

5. Suppose you are given the following ratings by students on four different items, where ? indicates that no rating was given:

6. Use the following similarity matrix to perform AGNES clustering. Show your results by drawing a dendrogram (树状图). The dendrogram should clearly show the order in which the points are merged. (12 points) (The number indicates the similarity between the two points)

2024年（回忆）

2. 此题考查的是数据处理，印象是给了五六个数据，然后用分箱的等深和等宽计算平滑，以及某个数的归一化，包括归一化到 [0,1] 和 min-max 归一化。

3. 此题考查的是关联规则，即Apriori 算法 或者 FP-growth 算法，作业题的变形。

4. 此题考查的是决策树构建，即计算信息增益，选择最优属性，只让构建了一层，以及朴素贝叶斯对某一样本进行分类，作业题的变形。

5. 此题考查的是推荐系统，即计算与user 1 最相似的2个users，然后计算user 1中确实的rating，也是作业题的变形。

6. 此题考查的是AGNES聚类算法，不过给的是样本在二维坐标系的坐标，需要自行计算相似矩阵，随后进行聚类。

7. 设计算法。利用神经网络模型设计一个论文分类模型。

3. 此题考查的是关联规则，即Apriori 算法或者 FP-growth 算法，作业题的变形。