Machine Learning Master Class, Lesson 8: An End-to-End Project (Titanic Survival Prediction)

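Course promise: 1 core concept (the full machine learning workflow) + 1 core idea (features set the ceiling, models approach it) + 1 complete, submittable Kaggle project. After this lesson you should be able to carry a structured-data project from start to finish on your own, which is among the most valuable things to show in a job interview.

Goal of this lesson: work through a real machine learning project from start to finish, following the standard industry workflow. You will practice data exploration, data cleaning, feature engineering, model selection, tuning, and ensembling, then submit your results to Kaggle and get your own global ranking.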


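🧩 First, the answer to last lesson's question

Question: gradient-boosted trees are very strong, so are there scenarios where they do worse than a random forest?

Answer: there are two main ones.

  1. Very small datasets: a random forest is more stable and less prone to overfitting, while gradient boosting tends to fit the noise in a handful of samples.
  2. Very tight training-time budgets: the trees of a random forest are independent and can be trained fully in parallel, while gradient boosting builds its trees one after another and is slower.

Outside these cases, gradient-boosted trees usually outperform random forests on structured data.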
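
A quick way to test the first point on your own data is to cross-validate both model families on a deliberately small sample. The sketch below is only an illustration: it uses a synthetic dataset from scikit-learn's `make_classification` (not Titanic data), and the sample size and hyperparameters are arbitrary assumptions, so which model comes out ahead will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small, noisy synthetic dataset (illustrative assumption: 80 rows, 10 features).
X_small, y_small = make_classification(
    n_samples=80, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

candidates = {
    # n_jobs=-1 trains the independent trees of the forest in parallel.
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    # Boosting adds trees one after another, so it has no equivalent option.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_small, y_small, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

With so few rows the standard deviation across folds is usually large, which is itself a reminder not to read too much into a single comparison.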


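🧠 The first core concept: the full machine learning workflow

Every machine learning project, simple or complex, follows the same 7-step workflow:

1. Problem definition and goal setting
2. Data acquisition and exploratory data analysis (EDA)
3. Data cleaning and feature engineering ✅ (the most important step, roughly 80% of the work)
4. Dataset splitting
5. Building a baseline model
6. Model tuning and ensembling
7. Evaluation and deployment

Core idea: data and features determine the upper limit of what machine learning can achieve; models and algorithms only approach that limit.

  • Good features + a simple model > poor features + a complex model
  • A strong machine learning engineer spends about 80% of the time on data cleaning and feature engineering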

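💡 Project introduction: Titanic survival prediction

This is Kaggle's classic beginner competition and a rite of passage for most machine learning engineers.

  • Problem: given personal information about Titanic passengers, predict which of them survived the sinking
  • Data: 12 columns in the training file: PassengerId, the Survived label, and 10 raw attributes such as age, sex, passenger class, fare, and port of embarkation
  • Goal: push prediction accuracy as high as possible to climb the Kaggle leaderboard

Downloading the data

You can download the dataset from the Kaggle website: https://www.kaggle.com/c/titanic/data

The download contains three files:

  • train.csv: the training set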
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S
13,0,3,"Saundercock, Mr. William Henry",male,20,0,0,A/5. 2151,8.05,,S
14,0,3,"Andersson, Mr. Anders Johan",male,39,1,5,347082,31.275,,S
15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14,0,0,350406,7.8542,,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome) ",female,55,0,0,248706,16,,S
17,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18.0,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
21,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
22,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,E45,S
23,1,3,"McGowan, Miss. Anna",female,15,0,0,330923,8.0292,,Q
24,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
25,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
26,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38,1,5,347082,31.3875,,S
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
28,0,1,"Fortune, Mr. Charles Alexander",male,19,3,2,19950,263,,S
29,1,3,"O'Dwyer, Miss. Ellen",female,,0,0,330959,7.8792,,Q
30,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
31,0,1,"Uruchurtu, Don. Manuel E",male,40,0,0,PC 17601,27.7208,C148,C
32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
33,1,1,"Glynn, Miss. Mary Agatha",female,,0,0,335677,80.0,,Q
34,0,2,"Wheadon, Mr. Edward",male,66,0,0,C.A. 24579,10.5,,S
35,0,2,"Meyer, Mr. Edgar Joseph",male,28,1,0,248698,24,,S
36,0,3,"Holverson, Mr. Christian Mathias",male,42,1,0,345764,13.0,,S
37,1,3,"Sagesser, Mlle. Emma",female,25,0,0,345763,9.475,,C
38,0,3,"Willson, Mr. William Henry",male,,0,0,364850,14.4542,,S
39,1,3,"Andersson, Miss. Ebba Iris Alfrida",female,6,4,2,347082,31.275,,S
40,1,2,"Vestrom, Miss. Hulda Myrene",female,17,0,0,236852,12,,S
41,0,3,"Hewlett, Mrs. (Mary D Kingcome) ",female,55,0,0,248706,16,,S
42,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
43,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
44,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18.0,,S
45,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
46,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
47,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,E45,S
48,1,3,"McGowan, Miss. Anna",female,15,0,0,330923,8.0292,,Q
49,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
50,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
51,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38,1,5,347082,31.3875,,S
52,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
53,0,1,"Fortune, Mr. Charles Alexander",male,19,3,2,19950,263,,S
54,1,3,"O'Dwyer, Miss. Ellen",female,,0,0,330959,7.8792,,Q
55,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
56,0,1,"Uruchurtu, Don. Manuel E",male,40,0,0,PC 17601,27.7208,C148,C
57,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
58,1,1,"Glynn, Miss. Mary Agatha",female,,0,0,335677,80.0,,Q
59,0,2,"Wheadon, Mr. Edward",male,66,0,0,C.A. 24579,10.5,,S
60,0,2,"Meyer, Mr. Edgar Joseph",male,28,1,0,248698,24,,S
61,0,3,"Holverson, Mr. Christian Mathias",male,42,1,0,345764,13.0,,S
62,1,3,"Sagesser, Mlle. Emma",female,25,0,0,345763,9.475,,C
63,0,3,"Willson, Mr. William Henry",male,,0,0,364850,14.4542,,S
64,1,3,"Andersson, Miss. Ebba Iris Alfrida",female,6,4,2,347082,31.275,,S
65,1,2,"Vestrom, Miss. Hulda Myrene",female,17,0,0,236852,12,,S
66,0,3,"Hewlett, Mrs. (Mary D Kingcome) ",female,55,0,0,248706,16,,S
67,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
68,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
69,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18.0,,S
70,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
71,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
72,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,E45,S
73,1,3,"McGowan, Miss. Anna",female,15,0,0,330923,8.0292,,Q
74,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
75,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
76,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38,1,5,347082,31.3875,,S
77,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
78,0,1,"Fortune, Mr. Charles Alexander",male,19,3,2,19950,263,,S
79,1,3,"O'Dwyer, Miss. Ellen",female,,0,0,330959,7.8792,,Q
80,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
81,0,1,"Uruchurtu, Don. Manuel E",male,40,0,0,PC 17601,27.7208,C148,C
82,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
83,1,1,"Glynn, Miss. Mary Agatha",female,,0,0,335677,80.0,,Q
84,0,2,"Wheadon, Mr. Edward",male,66,0,0,C.A. 24579,10.5,,S
85,0,2,"Meyer, Mr. Edgar Joseph",male,28,1,0,248698,24,,S
86,0,3,"Holverson, Mr. Christian Mathias",male,42,1,0,345764,13.0,,S
87,1,3,"Sagesser, Mlle. Emma",female,25,0,0,345763,9.475,,C
88,0,3,"Willson, Mr. William Henry",male,,0,0,364850,14.4542,,S
89,1,3,"Andersson, Miss. Ebba Iris Alfrida",female,6,4,2,347082,31.275,,S
90,1,2,"Vestrom, Miss. Hulda Myrene",female,17,0,0,236852,12,,S
91,0,3,"Hewlett, Mrs. (Mary D Kingcome) ",female,55,0,0,248706,16,,S
92,0,3,"Rice, Master. Eugene",male,2,4,1,382652,29.125,,Q
93,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
94,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31,1,0,345763,18.0,,S
95,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
96,0,2,"Fynney, Mr. Joseph J",male,35,0,0,239865,26,,S
97,1,2,"Beesley, Mr. Lawrence",male,34,0,0,248698,13,E45,S
98,1,3,"McGowan, Miss. Anna",female,15,0,0,330923,8.0292,,Q
99,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
100,0,3,"Palsson, Miss. Torborg Danira",female,8,3,1,349909,21.075,,S
  • test.csv: the test set
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.05,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.275,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
898,3,"Connolly, Miss. Kate",female,30,0,0,330972,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26,1,1,24875,29,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut E)",female,,0,0,2697,7.2292,,C
901,3,"Davies, Mr. John Samuel",male,21,2,0,A/4 48871,24.15,,S
902,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S
903,1,"Jones, Mr. Charles Cresson",male,46,0,0,694,26,,S
904,3,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23,1,0,239853,7.925,,S
905,1,"Howard, Mr. Benjamin",male,63,1,0,11767,26.55,E33,S
906,3,"Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)",female,47,1,0,PC 17590,30.5,A6,C
907,2,"del Carlo, Mrs. Sebastiano (Argenia Palmi)",female,24,0,2,SC/PARIS 2167,15.0458,,C
908,3,"Keane, Mr. Daniel",male,,0,0,368659,7.75,,Q
909,3,"Assaf, Mr. Gerios",male,21,0,0,2692,7.2292,,C
910,2,"Ilmakangas, Miss. Ida Livija",female,27,1,0,250649,14.5,,S
911,3,"Assaf Khalil, Mrs. Mariana (Miriam"")",female,45,0,0,2692,7.2292,,C
912,3,"Rothschild, Mr. Martin",male,55,0,0,AQ/4 38485,7.75,,S
913,3,"Olsen, Master. Artur Karl",male,9,0,2,C 17368,31.3875,,S
914,3,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,365237,7.75,,S
915,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
916,3,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48,1,3,PC 17608,79.65,D17,C
917,2,"Robins, Mr. Alexander A",male,37,0,0,29103,10.5,,S
918,3,"Ostby, Miss. Helene Ragnhild",female,22,0,1,113509,22.3583,,C
919,3,"Davidson, Mrs. Thornton (Orian)",female,27,1,0,W./C. 6607,10.5,,S
920,3,"Coutts, Mrs. William (Winnie Minnie)",female,36,0,2,C.A. 37671,10.5,,S
921,3,"Dimic, Mr. Jovan",male,42,1,0,315096,8.6625,,S
922,3,"Odahl, Mr. Nils Martin",male,23,0,0,7267,9.325,,S
923,1,"Williams-Lambert, Mr. Fletcher Fellows",male,,0,0,113784,22.525,,S
924,3,"Elias, Mr. Joseph",male,16,0,1,2691,14.4583,,C
925,3,"Arnold, Mr. Josef",male,25,0,0,367228,7.2292,,C
926,3,"Yousif, Mr. Wazli",male,,0,0,2647,7.2292,,C
927,3,"Vartu, Mr. Asdren",male,18,0,0,2689,7.2292,,C
928,3,"Youssef, Mr. Gerios",male,,0,0,2627,7.2292,,C
929,3,"Zakarian, Mr. Mapriededer",male,26,0,0,2656,7.2292,,C
930,3,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.2292,,C
931,3,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,83.1583,C50,C
932,2,"Shelby, Mrs. William (Mildred Hazel)",female,25,0,2,238880,26,,S
933,3,"Zakarian, Mr. Mapriededer",male,26,0,0,2656,7.2292,,C
934,3,"Hariri, Mr. Malik",male,,0,0,2691,7.2292,,C
935,3,"Samaan, Mr. Elias",male,,0,0,2665,7.2292,,C
936,3,"Lamb, Mr. John Joseph",male,,0,0,AQ/3. 30668,7.75,,S
937,3,"Elias, Mr. Daniel",male,27,0,0,2691,14.4583,,C
938,3,"Roger, Mrs. William John (Nella Krikorian)",female,,0,0,2649,7.225,,C
939,3,"Lennon, Miss. Mary",female,,0,0,370371,15.5,,Q
940,2,"O'Donoghue, Miss. Bridget",female,,0,0,29750,10.5,,S
941,1,"Turpin, Mrs. William John (Dorothy Ann Wonnacott)",female,27,1,1,110469,210,C49,S
942,3,"Neal, Mr. Alfred",male,,0,0,368703,7.75,,Q
943,1,"Woolner, Mr. Hugh",male,,0,0,19947,30.5,A23,S
944,3,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.55,,S
945,3,"Gill, Mr. John William",male,24,0,0,370166,7.8292,,Q
946,3,"Bystrom, Mrs. (Karolina)",female,42,0,0,236853,7.75,,S
947,3,"Duran y More, Miss. Asuncion",female,27,1,0,SC/PARIS 2168,12.475,,C
948,3,"Roebling, Mr. Washington Augustus II",male,31,0,0,3101298,14.5,,S
949,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5,,S
950,3,"Johnson, Master. Harold Theodor",male,4,1,1,347742,11.1333,,S
951,3,"Balkic, Mr. Cerin",male,26,0,0,349248,7.8958,,S
952,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47,1,1,117678,52.5542,D35,C
953,1,"Carlsson, Mr. Frans Olof",male,33,0,0,695,5.0,,S
954,3,"Vander Cruyssen, Mr. Victor",male,47,0,0,345765,9.0,,S
955,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28,1,0,29749,24,,S
956,3,"Najib, Miss. Adele Kiamie",female,15,0,0,2667,7.225,,C
957,3,"Gustafsson, Mr. Alfred Ossian",male,20,0,0,7534,9.8458,,S
958,3,"Petroff, Mr. Nedelio",male,19,0,0,349212,7.8958,,S
959,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
960,1,"Potter, Mr. Thomas Jr",male,65,0,1,11767,83.1583,C50,C
961,2,"Shelby, Mr. William",male,32,0,2,238880,26,,S
962,3,"McGowan, Miss. Katherine",female,15,0,0,334915,8.0292,,Q
963,3,"Davies, Mr. John Morgan",male,23,1,0,A/4 48873,7.125,,S
964,3,"Ilmakangas, Miss. Ida Livija",female,27,1,0,250649,14.5,,S
965,3,"Howard, Mr. Benjamin",male,63,1,0,11767,26.55,E33,S
966,3,"Florence, Mr. William",male,,0,0,364516,7.75,,Q
967,3,"Panula, Mrs. Juha (Maria Emilia Ojala)",female,41,1,4,3101295,39.6875,,S
968,3,"Mallet, Mr. Albert",male,31,1,1,349909,21.075,,S
969,3,"Widener, Mr. George Dunton",male,50,1,0,113503,110.8833,C80,C
970,3,"Richard, Mr. Emile",male,21,0,0,364512,7.7958,,S
971,3,"Saad, Mr. Amin",male,,0,0,2690,7.2292,,C
972,3,"Augustsson, Mr. Albert",male,23,0,0,347466,7.8542,,S
973,3,"Widener, Mr. Marquette",male,27,0,0,113503,110.8833,C80,C
974,3,"Riordan, Miss. Johanna Hannah",female,,0,0,330932,7.75,,Q
975,3,"Peacock, Miss. Treasteall",female,3,1,1,SOTON/OQ 392076,13.775,,S
976,3,"Naughton, Miss. Hannah",female,,0,0,330959,7.8792,,Q
977,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37,1,0,19928,90,,Q
978,3,"Henriksson, Miss. Jenny",female,28,0,0,347086,7.775,,S
979,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,20.575,,S
980,3,"Olsson, Miss. Elina",female,28,0,0,350407,7.8542,,S
981,3,"Carr, Miss. Jeannie",female,,0,0,367231,7.75,,Q
982,3,"Saad, Mr. Khalil",male,,0,0,2690,7.2292,,C
983,3,"Woolley, Mr. Hugh",male,47,0,0,35281,8.05,,S
984,3,"Ryerson, Miss. Susan Parker",female,21,2,2,PC 17608,82.2667,B57 B59 B63 B66,C
985,3,"Yasbeck, Mr. Antoni",male,27,1,0,2659,14.4583,,C
986,3,"Rice, Master. George",male,8,4,1,382652,29.125,,Q
987,3,"Richards, Master. George S",male,8,1,1,2696,16.1,,S
988,3,"Newsom, Miss. Helen Monypeny",female,19,0,1,13502,93,,S
989,3,"Asplund, Miss. Lillian Gertrud",female,5,4,2,347082,31.3875,,S
990,3,"Kink-Heilmann, Miss. Stina Viktoria",female,4,1,2,315154,22.025,,S
991,2,"Jenkin, Mr. Stephen Curnow",male,24,0,0,233734,10.5,,S
992,3,"Hart, Mr. Benjamin",male,49,1,1,F.C.C. 13528,26.2833,,S
993,3,"Hampe, Mr. Leon",male,20,0,0,345765,9.5,,S
994,3,"Petterson, Mr. Johan Emil",male,25,0,0,347082,7.775,,S
995,3,"Reynaldo, Ms. Encarnacion",female,28,0,0,336439,14.4,,C
996,3,"Johansson, Mr. Nils",male,30,0,0,350043,7.8542,,S
997,3,"Watt, Miss. Constance",female,12,0,0,365235,7.75,,Q
998,3,"Sanders, Mr. James",male,,0,0,347085,7.775,,S
999,3,"Samaan, Mr. Elias",male,,0,0,2665,7.2292,,C
1000,3,"Lulic, Mr. Nikola",male,27,0,0,315097,8.6625,,S
  • gender_submission.csv: a sample submission file
PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,0
898,1
899,0
900,1
901,0
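
gender_submission.csv shows the layout Kaggle expects (one `PassengerId` and one `Survived` value per test row) and doubles as a very simple baseline: it predicts that women survived and men did not. A minimal sketch of that rule follows; the function name `gender_baseline` is made up for illustration, and the five rows are copied from the test excerpt above.

```python
import pandas as pd

# A few test rows from the excerpt above (only the columns needed here).
sample_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Sex": ["male", "female", "male", "male", "female"],
})

def gender_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Predict 1 (survived) for women and 0 for men, in the submission layout."""
    return pd.DataFrame({
        "PassengerId": df["PassengerId"],
        "Survived": (df["Sex"] == "female").astype(int),
    })

baseline = gender_baseline(sample_test)
print(baseline)
```

Any model you build later should beat this rule in cross-validation; if it does not, the extra complexity is not paying off yet.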

💻 Complete project code (update the paths to train.csv and test.csv, then run it to produce a submission file)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# --- Optional font settings: only needed when plot labels contain Chinese characters ---
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# ----------------------

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

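# ===================== Step 1: Load the data =====================
# Raw strings keep the backslashes in Windows paths from being read as escapes
train = pd.read_csv(r"C:\Users\lin\Desktop\train.csv")
test = pd.read_csv(r"C:\Users\lin\Desktop\test.csv")

print(f"Training set shape: {train.shape} (expected (891, 12))")
print(f"Test set shape: {test.shape} (expected (418, 11))")

# Safety check: the official test set has no Survived column; drop it if present
if "Survived" in test.columns:
    test = test.drop("Survived", axis=1)
    print("⚠️ Note: the test set contained a Survived column; it was removed")

# Keep the original test IDs for the final submission file
test_ids = test["PassengerId"].copy()

# Merge both sets so feature engineering is applied uniformly
all_data = pd.concat([train, test], ignore_index=True)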

# ===================== Step 2: Exploratory data analysis (EDA) =====================
# Missing values per column (Survived is missing only for the test rows)
print("\nMissing values:")
missing = all_data.isnull().sum()
print(missing[missing > 0])

# Survival rate by sex
plt.figure(figsize=(8, 5))
sns.barplot(x="Sex", y="Survived", data=train)
plt.title("Survival rate by sex")
plt.show()

# Survival rate by passenger class
plt.figure(figsize=(8, 5))
sns.barplot(x="Pclass", y="Survived", data=train)
plt.title("Survival rate by passenger class")
plt.show()

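# ===================== Step 3: Data cleaning and feature engineering =====================
# 3.1 Handle missing values
# Coerce to numeric first; entries that cannot be parsed become NaN
all_data["Age"] = pd.to_numeric(all_data["Age"], errors='coerce')
all_data["Fare"] = pd.to_numeric(all_data["Fare"], errors='coerce')

# Fill numeric gaps with the median and Embarked with the most frequent port
all_data["Age"] = all_data["Age"].fillna(all_data["Age"].median())
all_data["Fare"] = all_data["Fare"].fillna(all_data["Fare"].median())
all_data["Embarked"] = all_data["Embarked"].fillna(all_data["Embarked"].mode()[0])
# Cabin is mostly empty, so it is dropped in this baseline
all_data = all_data.drop("Cabin", axis=1)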

# 3.2 Encode categorical features as integers
all_data["Sex"] = all_data["Sex"].map({"male": 0, "female": 1})
all_data["Embarked"] = all_data["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# 3.3 Extract new features
# Title: the word between a space and a period in Name, e.g. "Mr" in "Braund, Mr. Owen Harris"
all_data["Title"] = all_data["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Merge rare titles ("Dona" occurs only in the test set and would otherwise stay unmapped)
rare_titles = ["Lady", "Countess", "Capt", "Col", "Don", "Dona", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
all_data["Title"] = all_data["Title"].replace(rare_titles, "Rare")
all_data["Title"] = all_data["Title"].replace(["Mlle", "Ms"], "Miss")
all_data["Title"] = all_data["Title"].replace("Mme", "Mrs")

# Encode titles; anything still unmapped falls back to "Rare" so no NaN reaches the models
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Rare": 4}
all_data["Title"] = all_data["Title"].map(title_mapping).fillna(title_mapping["Rare"]).astype(int)

# Family size = siblings/spouses + parents/children + the passenger
all_data["FamilySize"] = all_data["SibSp"] + all_data["Parch"] + 1
all_data["IsAlone"] = (all_data["FamilySize"] == 1).astype(int)

# 3.4 Drop columns that are no longer needed (PassengerId is kept for the submission)
all_data = all_data.drop(["Name", "Ticket"], axis=1)

# ===================== Step 3.5: Split back by original row count =====================
# The first len(train) rows are the training set; slicing by position avoids relying on Survived
X = all_data.iloc[:len(train)].drop(columns=["Survived", "PassengerId"])
y = train["Survived"].astype(int)  # labels come straight from the original train frame
X_test = all_data.iloc[len(train):].drop(columns=["Survived", "PassengerId"])

print(f"\nTraining features after feature engineering: {X.shape} (expected (891, 10))")
print(f"Test features after feature engineering: {X_test.shape} (expected (418, 10))")
print(f"Final features: {list(X.columns)}")

# ===================== Step 4: Model training and validation =====================
if len(X) > 50:
    print("\n" + "="*50)
    print("Baseline performance per model (5-fold cross-validation accuracy):")
    print("="*50)

    # use_label_encoder is deprecated in XGBoost and is simply left out here
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
        "XGBoost": XGBClassifier(
            n_estimators=100, 
            max_depth=5, 
            learning_rate=0.1, 
            random_state=42,
            eval_metric='logloss',  # set the evaluation metric explicitly
            verbosity=0
        ),
        "LightGBM": LGBMClassifier(
            n_estimators=100, 
            max_depth=5, 
            learning_rate=0.1, 
            random_state=42,
            verbose=-1
        )
    }

    # Mean and standard deviation of accuracy across the 5 folds
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")

    # ===================== Step 5: Ensemble model and prediction =====================
    # Soft voting averages the predicted probabilities of the three models
    voting_model = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)),
            ("xgb", XGBClassifier(
                n_estimators=200, 
                max_depth=4, 
                learning_rate=0.05, 
                random_state=42,
                eval_metric='logloss',
                verbosity=0
            )),
            ("lgb", LGBMClassifier(
                n_estimators=200, 
                max_depth=4, 
                learning_rate=0.05, 
                random_state=42,
                verbose=-1
            ))
        ],
        voting="soft"
    )

    voting_model.fit(X, y)
    y_pred = voting_model.predict(X_test)

    # Build the submission file from the original test IDs
    submission = pd.DataFrame({
        "PassengerId": test_ids,
        "Survived": y_pred.astype(int)
    })
    submission.to_csv("titanic_submission.csv", index=False)
    
    print("\n✅ Submission file created: titanic_submission.csv")
    print(f"Submission shape: {submission.shape} (expected (418, 2))")
    print("First 5 rows:")
    print(submission.head())
else:
    print("\n⚠️ Warning: too little training data (50 rows or fewer) for cross-validation and model training.")
    print("Download the full train.csv (891 rows) and test.csv (418 rows), place them at the paths above, and run again.")
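
Before uploading, it can help to check the submission file against the layout Kaggle expects. The helper below is a sketch, not part of the original script: the name `check_submission` and its messages are made up for illustration, and the demo frames contain invented predictions.

```python
from typing import List

import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> List[str]:
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems
    if len(submission) != len(expected_ids):
        problems.append(f"expected {len(expected_ids)} rows, got {len(submission)}")
    elif not (submission["PassengerId"].to_numpy() == expected_ids.to_numpy()).all():
        problems.append("PassengerId values do not match the test set")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Tiny demonstration with made-up predictions.
ids = pd.Series([892, 893, 894])
good = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": ids, "Survived": [0.0, 0.7, 1.0]})

print(check_submission(good, ids))
print(check_submission(bad, ids))
```

In the script above, the equivalent call would be `check_submission(submission, test_ids)`.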

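🔍 Key points in the code, explained

1. Why merge the training and test sets for feature engineering?

Merging guarantees that both sets pass through exactly the same imputation, encoding, and feature-extraction steps, so their columns and fill values stay consistent. Note that the medians and modes are then computed over the test rows as well. That is common practice in Kaggle competitions, but in a production project the statistics should come from the training data alone and then be applied to new data; otherwise information leaks from the test set into training.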
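
An alternative that still gives both sets identical fill values is to learn the statistics from the training rows and reuse them on the test rows. A minimal sketch, assuming scikit-learn's `SimpleImputer` and made-up toy values in place of the real files:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train/test (values are invented for illustration).
train_toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                          "Fare": [7.25, 71.28, 7.93, np.nan]})
test_toy = pd.DataFrame({"Age": [np.nan, 47.0],
                         "Fare": [7.83, np.nan]})

# Learn the medians from the training rows only ...
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)
# ... and reuse exactly the same medians on the test rows.
test_filled = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)

print(dict(zip(train_toy.columns, imputer.statistics_)))  # Age: 35.0, Fare: 7.93
print(test_filled)
```

The missing test values are filled with the training medians (35.0 for Age, 7.93 for Fare), so no test-set information is used when fitting.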

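2. How does feature engineering improve results?

Only 7 of the raw columns are used directly as features (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked). Feature engineering adds three more:

  • Title: encodes social status along with hints about sex, age, and marital status, and is one of the strongest predictors of survival
  • FamilySize: passengers travelling with a small family tend to survive more often than those travelling alone, while very large families fare worse
  • IsAlone: passengers travelling alone have a lower survival rate

These three new features alone can raise accuracy by 5 percentage points or more; compare cross-validation scores with and without them to check the gain on your own run. That is the power of feature engineering.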
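
To make the three features concrete, here is a small sketch that applies the same expressions as the project code to four rows from the train.csv excerpt. Only the frame name `df` and the row selection are choices made for this illustration.

```python
import pandas as pd

# Four rows taken from the train.csv excerpt shown earlier.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 1],
})

# Title: the word that sits between a space and a period, e.g. " Mr."
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
# FamilySize: siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
# IsAlone: 1 when the passenger has no relatives aboard.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df[["Title", "FamilySize", "IsAlone"]])
```

The titles come out as Mr, Mrs, Miss, and Master, the family sizes as 2, 2, 1, and 5, and only the third passenger is flagged as alone.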

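3. Why use cross-validation?

A single train/validation split depends on which rows happen to land on each side, so its score is noisy. Cross-validation repeats the split several times and averages the scores, which gives a more reliable estimate of the model's real performance.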
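
The difference is easy to see on a small experiment. The sketch below uses a synthetic dataset from `make_classification` (an assumption for illustration, not Titanic data) and compares five single hold-out splits with one 5-fold cross-validation run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# Single hold-out splits: the score moves with the random seed.
holdout_scores = []
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_va, y_va))

# 5-fold CV: every row is used for validation exactly once, then scores are averaged.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("hold-out scores:", [round(s, 3) for s in holdout_scores])
print(f"5-fold CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The spread of the hold-out scores shows how much a single split can mislead; the cross-validated mean and standard deviation summarize that variation in one number pair.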

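4. Why do ensembles perform better?

Different models have different strengths and weaknesses and tend to make different mistakes. An ensemble combines their predictions so that the errors of one model are partly offset by the others, which usually yields better results than any single model.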
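
The project code uses `voting="soft"`, which averages predicted probabilities instead of counting 0/1 votes. The sketch below reproduces both rules by hand for the binary case; the probabilities are made up for illustration.

```python
import numpy as np

# Made-up P(survived) from three models for four passengers.
proba = np.array([
    [0.90, 0.40, 0.45, 0.20],   # model A
    [0.80, 0.45, 0.40, 0.30],   # model B
    [0.60, 0.95, 0.70, 0.10],   # model C
])

# Hard voting: each model casts a 0/1 vote and the majority wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold once.
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard)  # [1 0 0 0]
print("soft:", soft)  # [1 1 1 0]
```

For the second and third passengers, one very confident model outweighs two mildly doubtful ones under soft voting, while hard voting ignores how confident each model is.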


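✨ Advanced optimization tips (aiming for the Kaggle top 10%)

To push accuracy further, try the following:

  1. Finer-grained feature engineering
    • Extract the deck from the cabin number
    • Bin age and fare
    • Interaction features, for example Pclass * Sex
  2. More careful hyperparameter tuning
    • Use grid search or random search to find good hyperparameters
    • Tune the learning rate together with the number of trees
  3. More advanced ensembling
    • Stacking
    • Averaging models trained on multiple cross-validation folds

With these optimizations, accuracy can reach 85% or more, which is enough for the top 10% on Kaggle.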
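
The three feature ideas from item 1 can be sketched on a few rows of the train.csv excerpt. The bin edges, labels, and new column names (`Deck`, `AgeBin`, `FareBin`, `Pclass_Sex`) are assumptions chosen for illustration; tune them on the full data.

```python
import pandas as pd

# Rows 1, 2, 4, and 8 of the train.csv excerpt (only the columns needed here).
df = pd.DataFrame({
    "Pclass": [3, 1, 1, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 35.0, 2.0],
    "Fare": [7.25, 71.2833, 53.1, 21.075],
    "Cabin": [None, "C85", "C123", None],
})

# Deck: first letter of the cabin number, "U" (unknown) when Cabin is missing.
df["Deck"] = df["Cabin"].str[0].fillna("U")

# Age bins with hand-picked edges (an assumption, not a recommendation).
df["AgeBin"] = pd.cut(df["Age"], bins=[0, 12, 18, 60, 120],
                      labels=["child", "teen", "adult", "senior"])

# Fare bins by quantile: two equally sized groups in this tiny sample.
df["FareBin"] = pd.qcut(df["Fare"], q=2, labels=["low", "high"])

# Interaction feature: one category per (class, sex) combination.
df["Pclass_Sex"] = df["Pclass"].astype(str) + "_" + df["Sex"]

print(df[["Deck", "AgeBin", "FareBin", "Pclass_Sex"]])
```

Note that the project code drops `Cabin` before this point, so the deck would have to be extracted before that drop. The new columns are categorical and need encoding (for example with `pd.get_dummies`) before they go into the models.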


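📝 Lesson summary

  1. Core workflow: every machine learning project follows the standard path "problem definition → data exploration → feature engineering → model training → tuning → deployment"
  2. Core idea: feature engineering is the most important part of a machine learning project, and good features bring a qualitative improvement
  3. Best practices
    • Build a simple baseline model first, then improve it step by step
    • Evaluate model performance with cross-validation
    • Ensembling is a powerful tool for improving results
  4. What you have achieved: you completed a full machine learning project on your own and produced a result that can be submitted to Kaggle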

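🎯 Homework (required)

  1. Run the code above, generate the submission file, upload it to Kaggle, and check your ranking
  2. Add more features and see whether accuracy improves
  3. Tune LightGBM with grid search and find the best parameter combination
  4. Implement stacking and compare it with the voting ensemble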
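
As a starting point for tasks 3 and 4, the sketch below shows the general shape of a grid search and a stacking ensemble in scikit-learn. To stay free of extra dependencies it swaps in `GradientBoostingClassifier` for LightGBM and uses synthetic data; `LGBMClassifier` follows the scikit-learn estimator interface, so it can be passed to `GridSearchCV` in the same way. The grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the engineered Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search: try every combination and keep the best cross-validated one.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)
print("best params:", search.best_params_, "score:", round(search.best_score_, 3))

# Stacking: base models produce out-of-fold predictions,
# and a simple final model learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0, **search.best_params_)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack_scores = cross_val_score(stack, X_demo, y_demo, cv=3, scoring="accuracy")
print(f"stacking: {stack_scores.mean():.3f} ± {stack_scores.std():.3f}")
```

For the homework, replace the synthetic data with `X` and `y` from the project code and compare the stacking score with the cross-validated score of the voting ensemble.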

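📢 Course wrap-up and next steps

Congratulations! You have completed all the core lessons of this beginner-to-advanced machine learning course. You have now covered:

  • The classic machine learning algorithms (linear regression, logistic regression, decision trees, random forests, XGBoost, LightGBM)
  • The core ideas and mathematical principles of machine learning
  • The end-to-end project workflow
  • The tools and techniques most commonly used in industry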