[Data Analysis (2)] A First Look at Pandas

Contents

  • Introduction
  • 1. Basic Data Structures
    • 1.1. Series: Initialization and Basic Operations
    • 1.2. DataFrame: Initialization and Basic Operations
      • 1.2.1. Initialization and Persistence
      • 1.2.2. Reading and Inspecting
      • 1.2.3. Row Operations
      • 1.2.4. Column Operations
      • 1.2.5. Selection and Filtering
  • 2. Data Preprocessing
    • 2.0. Generating a Sample Table
    • 2.1. Handling Missing Values
    • 2.2. Type Conversion and Sorting
    • 2.3. Statistical Analysis
  • 3. Pivoting Data
    • 3.0. Generating a Sample Table
    • 3.1. Building Pivot Tables
  • 4. Reshaping Data
    • 4.1. Hierarchical Indexing
      • 4.1.1. Series with a Two-Level Index
      • 4.1.2. DataFrame with a Two-Level Index
    • 4.2. Discretization
      • 4.2.1. Grouped Computation
      • 4.2.2. Binned Labels
    • 4.3. Combining Datasets

Introduction

Pandas (Python Data Analysis Library) is a data-analysis toolkit built on top of NumPy. It bundles a large number of utilities and some standard data models, and provides the tools needed to work with large datasets efficiently.

In the function descriptions below, the parameters shown being passed are set to their default values, and functions that return nothing are not demonstrated in assignment form.

1. Basic Data Structures

1.1. Series: Initialization and Basic Operations

A Series in Pandas is similar to a NumPy array and to Python's built-in list: it is a one-dimensional array that can hold values of different data types.

python
import pandas as pd
import numpy as np

# Build a Series with the default integer index, then inspect its index and values
s1 = pd.Series([-1, 0.7, False, np.nan])
'''
0 -1
1 0.7
2 False
3 NaN
dtype: object
'''
print(s1.index) # RangeIndex(start=0, stop=4, step=1)
print(s1.values) # [-1 0.7 False nan]


# Set the row labels, the Series name, and the index name
s2 = pd.Series([-1, 0.7, False, np.nan], index=list('abcd'), name='demo')
s2.index.name = 'index'
'''
index
a -1
b 0.7
c False
d NaN
Name: demo, dtype: object
'''
print(s2.index) # Index(['a', 'b', 'c', 'd'], dtype='object')
print(s2.values) # [-1 0.7 False nan]


# Look up values
print(s1[3]) # nan
print(s2['d'])


# Slicing: label slices include both endpoints
print(s1[::2])
'''
0 -1
2 False
dtype: object
'''

print(s2['a':'c'])
'''
a -1
b 0.7
c False
Name: demo, dtype: object
'''

1.2. DataFrame: Initialization and Basic Operations

A DataFrame in Pandas, much like R's data.frame, is a two-dimensional tabular data structure; it can be thought of as a container of Series.
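
To make "container of Series" concrete, here is a minimal sketch (the frame and its values are made up for illustration): selecting one column of a DataFrame returns a Series that shares the DataFrame's index.

python
import pandas as pd

# A toy frame; each column is itself a Series
df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.1, 0.2, 0.3]})

col = df['B']
print(type(col))                    # <class 'pandas.core.series.Series'>
print(col.index.equals(df.index))   # True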

1.2.1. Initialization and Persistence

python
import pandas as pd
import numpy as np

# 1. Build a DataFrame from a dict
df = pd.DataFrame({'A': pd.Timestamp('20250110'), 'B': pd.Series(0, index=list(range(4))), 'C': np.array([1]*4), 'D': 2,
                   'E': pd.Categorical(['test', 'train', 'test', 'train'])})
'''
           A  B  C  D      E
0 2025-01-10  0  1  2   test
1 2025-01-10  0  1  2  train
2 2025-01-10  0  1  2   test
3 2025-01-10  0  1  2  train
'''


# Generate a date range
date = pd.date_range('20250110', periods=5)
'''
DatetimeIndex(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13',
               '2025-01-14'],
              dtype='datetime64[ns]', freq='D')

'''


# 2. Build a DataFrame indexed by the date range
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))
'''
                   x         y         z
2025-01-10 -0.274766 -0.593336  0.724735
2025-01-11  1.552149 -0.300292  0.061253
2025-01-12  0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14  0.683950 -0.483677 -2.019955
'''

# Save the table; the date index is written as the first column
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
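
The comment above relies on the fact that to_csv writes the index as the first column by default, and read_csv(index_col=0) restores it. A small sketch of the two common choices (the file names are just placeholders):

python
# Default: the date index becomes the first CSV column; index_col=0 reads it back
df.to_csv('demo.csv')
df_back = pd.read_csv('demo.csv', index_col=0)

# If the index carries no information, drop it when saving
df.to_csv('demo_noindex.csv', index=False)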

1.2.2. Reading and Inspecting

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# View head and tail (5 rows by default)
print(df.head(2))
'''
                   x         y         z
2025-01-10 -0.274766 -0.593336  0.724735
2025-01-11  1.552149 -0.300292  0.061253
'''
print(df.tail(1))
'''
                  x         y         z
2025-01-14  0.68395 -0.483677 -2.019955
'''

# View dtypes
print(df.dtypes)
'''
x    float64
y    float64
z    float64
dtype: object
'''

# View the row labels
print(df.index)
'''Index(['2025-01-10', '2025-01-11', '2025-01-12', '2025-01-13', '2025-01-14'], dtype='object')'''

# View the column labels
print(df.columns)
'''Index(['x', 'y', 'z'], dtype='object')'''

# View the underlying values
print(df.values)
'''
[[-0.27476573 -0.59333579  0.72473541]
 [ 1.55214904 -0.30029235  0.06125304]
 [ 0.41190756 -0.47019098 -0.89324306]
 [-1.32816905 -0.99938983 -0.08141872]
 [ 0.6839496  -0.48367661 -2.01995517]]
'''

1.2.3. Row Operations

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single row
print(df.iloc[1])
'''
x    1.552149
y   -0.300292
z    0.061253
Name: 2025-01-11, dtype: float64
'''

# Select multiple rows
# print(df.loc['2025-01-13':'2025-01-14'])
print(df.iloc[3:5])
'''
                   x         y         z
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14  0.683950 -0.483677 -2.019955
'''

# Add a single row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_row = pd.Series(dict(zip('xyz', np.random.randn(3))), name='2025-01-15')
df = pd.concat([df, new_row.to_frame().T])
'''
                   x         y         z
2025-01-10 -0.274766 -0.593336  0.724735
2025-01-11  1.552149 -0.300292  0.061253
2025-01-12  0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14  0.683950 -0.483677 -2.019955
2025-01-15  1.205855  0.841471  0.843053
'''

# Delete a single row
df = df.drop(['2025-01-15'])
'''
                   x         y         z
2025-01-10 -0.274766 -0.593336  0.724735
2025-01-11  1.552149 -0.300292  0.061253
2025-01-12  0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14  0.683950 -0.483677 -2.019955
'''

1.2.4. Column Operations

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single column
print(df['x'])
'''
2025-01-10   -0.274766
2025-01-11    1.552149
2025-01-12    0.411908
2025-01-13   -1.328169
2025-01-14    0.683950
Name: x, dtype: float64
'''

# Select multiple columns
print(df[['x', 'z']])
'''
                   x         z
2025-01-10 -0.274766  0.724735
2025-01-11  1.552149  0.061253
2025-01-12  0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
2025-01-14  0.683950 -2.019955
'''

# Add a single column
df['p'] = np.random.rand(5)
'''
                   x         y         z         p
2025-01-10 -0.274766 -0.593336  0.724735  0.070785
2025-01-11  1.552149 -0.300292  0.061253  0.034027
2025-01-12  0.411908 -0.470191 -0.893243  0.446612
2025-01-13 -1.328169 -0.999390 -0.081419  0.545531
2025-01-14  0.683950 -0.483677 -2.019955  0.261958
'''

# Delete a single column
df = df.drop('p', axis=1)
print(df)
'''
                   x         y         z
2025-01-10 -0.274766 -0.593336  0.724735
2025-01-11  1.552149 -0.300292  0.061253
2025-01-12  0.411908 -0.470191 -0.893243
2025-01-13 -1.328169 -0.999390 -0.081419
2025-01-14  0.683950 -0.483677 -2.019955
'''

1.2.5. Selection and Filtering

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Select a single value
print(df.loc['2025-01-12', 'y'])
'''-0.47019098075848'''

# Select a block
# print(df.loc[['2025-01-12', '2025-01-13'], ['x', 'z']])
print(df[['x', 'z']][2:4])
'''
                   x         z
2025-01-12  0.411908 -0.893243
2025-01-13 -1.328169 -0.081419
'''

# Boolean condition
print(df['x'] > 0)
'''
2025-01-10    False
2025-01-11     True
2025-01-12     True
2025-01-13    False
2025-01-14     True
Name: x, dtype: bool
'''

# Filter by condition
print(df[(df['x'] > 0)&(df['z'] > 0)])
'''
                   x         y         z
2025-01-11  1.552149 -0.300292  0.061253
'''

# Filter by condition, then slice
print(df[df['x'] > 0][1:4])
'''
                   x         y         z
2025-01-12  0.411908 -0.470191 -0.893243
2025-01-14  0.683950 -0.483677 -2.019955
'''
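
The chained selections above (e.g. df[['x', 'z']][2:4] and df[df['x'] > 0][1:4]) are fine for reading, but a single .loc call expresses the row and column choice at once and avoids chained-indexing pitfalls when assigning. A small sketch on the same table:

python
# Rows by position and columns by label in one step
print(df.loc[df.index[2:4], ['x', 'z']])

# Boolean row filter and column selection in a single .loc call
print(df.loc[df['x'] > 0, ['x', 'z']])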

2. Data Preprocessing

2.0. Generating a Sample Table

python
import pandas as pd
import numpy as np

# Generate a date range
date = pd.date_range('20250110', periods=5)

# Build the table
df = pd.DataFrame(np.random.randn(5, 3), index=date, columns=list('xyz'))

# Generate a random boolean mask
mask = np.random.randint(0, 2, df.shape, dtype='bool')

# Inject missing values at random positions
df[pd.DataFrame(mask, index=df.index, columns=df.columns)] = np.nan

# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo_nan.csv')
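
An equivalent, non-mutating way to inject the NaNs is DataFrame.mask, which returns a copy with NaN wherever the condition is True. A sketch reusing the same mask (the names mask_df and df_masked are only for illustration):

python
# Same effect as the boolean assignment above, but returns a new table
mask_df = pd.DataFrame(mask, index=df.index, columns=df.columns)
df_masked = df.mask(mask_df)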

2.1. Handling Missing Values

  • isnull(): returns a boolean table of the same shape as the original; positions holding missing values are True, all others are False
  • fillna(value, inplace=False): returns the table with missing values filled with value; inplace=True overwrites the original table instead of returning a new one (same below)
  • replace(to_replace, value, inplace=False): returns the table with every occurrence of to_replace replaced by value; passing lists as the first two arguments performs a batch replacement (see the sketch after the code block below)
  • dropna(axis=0, inplace=False): returns the table with the records along the given axis that contain missing values removed
python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo_nan.csv', index_col=0)
'''
                   x         y         z
2025-01-10  0.606495       NaN  0.456811
2025-01-11       NaN  0.743876       NaN
2025-01-12  0.024458  0.733735       NaN
2025-01-13  0.306332       NaN -0.586894
2025-01-14       NaN       NaN       NaN
'''

# Detect missing values
print(df.isnull())
'''
                x      y      z
2025-01-10  False   True  False
2025-01-11   True  False   True
2025-01-12  False  False   True
2025-01-13  False   True  False
2025-01-14   True   True   True
'''

# Fill missing values
print(df['x'].fillna(df['x'].mean()))
'''
2025-01-10    0.606495
2025-01-11    0.312428
2025-01-12    0.024458
2025-01-13    0.306332
2025-01-14    0.312428
Name: x, dtype: float64
'''

df.fillna(0, inplace=True)    # with inplace=True the method returns None, so print df itself
print(df)
'''
                   x         y         z
2025-01-10  0.606495  0.000000  0.456811
2025-01-11  0.000000  0.743876  0.000000
2025-01-12  0.024458  0.733735  0.000000
2025-01-13  0.306332  0.000000 -0.586894
2025-01-14  0.000000  0.000000  0.000000
'''

# Replace missing values (same effect as fillna; the NaNs here were already filled above)
print(df.replace(np.nan, 0))

# Drop rows containing missing values
print(df.dropna())

# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
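
A short sketch of the two variants mentioned in the bullet list above: batch replacement with lists, and dropping along the column axis (the replacement values are only illustrative):

python
# Batch replace: every 0 becomes -1 and every 1 becomes 10 (illustrative values)
print(df.replace([0, 1], [-1, 10]))

# Drop columns that contain missing values
# (df was already filled above, so re-read the NaN version to see an effect)
df_nan = pd.read_csv(r'E:\Pycharm_Python\course\demo_nan.csv', index_col=0)
print(df_nan.dropna(axis=1))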

2.2. Type Conversion and Sorting

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Type conversion
print(df.astype(bool))
'''
                x      y      z
2025-01-10   True  False   True
2025-01-11  False   True  False
2025-01-12   True   True  False
2025-01-13   True  False   True
2025-01-14  False  False  False
'''

# Sort in descending order
print(df.sort_values(by='x', ascending=False))
'''
                   x         y         z
2025-01-10  0.606495  0.000000  0.456811
2025-01-13  0.306332  0.000000 -0.586894
2025-01-12  0.024458  0.733735  0.000000
2025-01-11  0.000000  0.743876  0.000000
2025-01-14  0.000000  0.000000  0.000000
'''

# Ascending sort with column priority (z first, then y)
print(df.sort_values(by=['z', 'y'], ascending=True))
'''
                   x         y         z
2025-01-13  0.306332  0.000000 -0.586894
2025-01-14  0.000000  0.000000  0.000000
2025-01-12  0.024458  0.733735  0.000000
2025-01-11  0.000000  0.743876  0.000000
2025-01-10  0.606495  0.000000  0.456811
'''

2.3. Statistical Analysis

python
# Descriptive statistics
print(df.describe())
'''
              x         y         z
count  5.000000  5.000000  5.000000
mean   0.187457  0.295522 -0.026017
std    0.267662  0.404676  0.370721
min    0.000000  0.000000 -0.586894
25%    0.000000  0.000000  0.000000
50%    0.024458  0.000000  0.000000
75%    0.306332  0.733735  0.000000
max    0.606495  0.743876  0.456811
'''

# Maximum
print(df['x'].max())        # 0.606494812593188

# Minimum
print(df['x'].min())        # 0.0

# Mean
print(df['x'].mean())       # 0.18745690178195

# Median
print(df['x'].median())     # 0.0244579357029649

# Variance
print(df['x'].var())        # 0.07164321143837143

# Standard deviation
print(df['x'].std())        # 0.26766249538994336

# Count
print(df['z'].count())      # 5

# Unique values
print(df['z'].unique())     # [ 0.45681124  0.         -0.58689448]

# Value counts
print(df['z'].value_counts())
'''
z
 0.000000    3
 0.456811    1
-0.586894    1
'''

# Column sums
print(df.sum())
'''
x    0.937285
y    1.477611
z   -0.130083
dtype: float64
'''

# Correlation matrix
print(df.corr())
'''
          x         y         z
x  1.000000 -0.597883  0.306501
y -0.597883  1.000000  0.064061
z  0.306501  0.064061  1.000000
'''

# Covariance matrix
print(df.cov())
'''
          x         y         z
x  0.071643 -0.064761  0.030414
y -0.064761  0.163763  0.009611
z  0.030414  0.009611  0.137434
'''

3. Pivoting Data

3.0. Generating a Sample Table

python
import pandas as pd
import numpy as np

# Generate data
hour = np.random.randint(0, 24, (1000, 1))
area = np.random.randint(0, 10, 1000)
displacement = np.random.randn(1000, 3)

# Concatenate the arrays and build the table
a = np.concatenate((hour, displacement), axis=1)
df = pd.DataFrame(a, index=area, columns=['hour', 'x', 'y', 'z'])
df.index.name = 'area'

# Type conversion
df['hour'] = df['hour'].astype('int64')

# Save the table
df.to_csv(r'E:\Pycharm_Python\course\demo.csv')
'''
      hour         x         y         z
area                                    
9       18  1.453873 -0.452853  0.126672
5       20 -0.541874 -0.798552  0.209252
9       12  0.848762 -0.734806  0.124415
1       13  0.794053  1.838139 -0.268814
8        2 -0.115496  2.054565  0.860301
...    ...       ...       ...       ...
9       21 -0.212381  0.355993 -1.124492
1       20 -0.010173  0.408953 -0.275197
2       15  0.334253  0.231890  3.557654
0        3 -0.383228 -0.562431  2.418784
8       12 -1.004758 -0.539583  1.589166

[1000 rows x 4 columns]
'''

3.1. Building Pivot Tables

python
# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)

# Set display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 10)

# Build a pivot table: fill empty cells with 0 and append the totals (margins)
pt = pd.pivot_table(df, index=['area', 'hour'], values=['x', 'y'], aggfunc=['sum', 'mean'], fill_value=0, margins=True)
'''
                 sum                 mean          
                   x          y         x         y
area hour                                          
0    0      4.372290   0.019988  1.093072  0.004997
     1     -5.463018   3.510755 -1.092604  0.702151
     2      0.429444  -2.022444  0.429444 -2.022444
     3      1.954055  -1.683926  0.488514 -0.420981
     4     -2.226930  -3.827011 -1.113465 -1.913506
...              ...        ...       ...       ...
9    20    -1.422801   0.262971 -0.237134  0.043829
     21    -2.720063   1.411410 -0.544013  0.282282
     22    -0.342656  -0.502878 -0.171328 -0.251439
     23    -1.024892  -0.385198 -0.170815 -0.064200
All       -24.847056  19.244009 -0.024847  0.019244

[239 rows x 4 columns]
'''

# Use a dict to assign an aggregation function to each column
pt = pd.pivot_table(df, index=['area', 'hour'], aggfunc={'x': 'sum', 'y': 'mean'})
'''
                  x         y
area hour                    
0    0     4.372290  0.004997
     1    -5.463018  0.702151
     2     0.429444 -2.022444
     3     1.954055 -0.420981
     4    -2.226930 -1.913506
...             ...       ...
9    19    1.744798 -0.405821
     20   -1.422801  0.043829
     21   -2.720063  0.282282
     22   -0.342656 -0.251439
     23   -1.024892 -0.064200

[238 rows x 2 columns]
'''
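
For reference, the second pivot table above is essentially a groupby aggregation; a sketch of what should be the equivalent call on the same data (no margins row):

python
# Equivalent to pivot_table(index=['area', 'hour'], aggfunc={'x': 'sum', 'y': 'mean'})
gb = df.groupby(['area', 'hour']).agg({'x': 'sum', 'y': 'mean'})
print(gb)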

4. Reshaping Data

4.1. Hierarchical Indexing

4.1.1. Series with a Two-Level Index

python
import pandas as pd
import numpy as np

# Two-level index
index = pd.MultiIndex.from_arrays([list('aaabbccdd'), list(map(int, '123121212'))], names=('area', 'numbers'))
'''MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2),
            ('d', 1),
            ('d', 2)],
           names=['area', 'numbers'])
'''

# Initialize the Series
s = pd.Series(np.random.randn(9), index=index)
print(s)
'''
area  numbers
a     1          0.417328
      2          0.168057
      3          1.252186
b     1         -1.835490
      2          0.951358
c     1         -1.903762
      2         -0.075067
d     1          0.782123
      2          0.355078
dtype: float64
'''

# Select a single value
print(s['a', 1])    # 0.417328381875337

# Select everything under one outer label
print(s['a'])
'''
numbers
1    0.417328
2    0.168057
3    1.252186
dtype: float64
'''

# Slice on one level
print(s['a':'b'])
'''
area  numbers
a     1          0.417328
      2          0.168057
      3          1.252186
b     1         -1.835490
      2          0.951358
dtype: float64
'''

print(s[:, 2])
'''
area
a    0.168057
b    0.951358
c   -0.075067
d    0.355078
dtype: float64
'''
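
Since this chapter is about reshaping, it is worth noting that a two-level Series can be pivoted into a DataFrame and back with unstack/stack; a sketch on the Series s built above:

python
# Move the inner level to the columns: rows become 'area', columns become 'numbers'
wide = s.unstack()
print(wide)

# stack() reverses the operation (cells that unstack filled with NaN are dropped)
print(wide.stack())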

4.1.2. DataFrame with a Two-Level Index

python
# Set display limits
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 8)

# Two-level row and column labels
index = pd.MultiIndex.from_arrays([list('aabbccdd'), list(map(int, '12121212'))], names=('area', 'numbers'))
columns = pd.MultiIndex.from_tuples([('t1', 'x'), ('t1', 'y'), ('t2', 'x'), ('t2', 'y')])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
'''
                    t1                  t2          
                     x         y         x         y
area numbers                                        
a    1       -0.125867 -1.722040 -0.266579  0.910084
     2        0.060483  0.750894 -0.479338  0.608312
b    1        0.345995  1.470237  1.763323 -0.336475
     2       -1.977062  0.071204  0.000797 -0.323753
c    1        0.963804  0.186688  0.443276  0.615650
     2        1.729371 -0.775489  1.663172 -0.657688
d    1        0.376276  0.693671  0.982811 -0.393840
     2       -0.632945 -2.046240  0.865305  1.150940
'''

# Select a single value
print(df.loc[('a', 1), ('t1', 'x')])    # -0.12586716795606423

# Select one outer row level
print(df.loc['a'])
'''
               t1                  t2          
                x         y         x         y
numbers                                        
1       -0.125867 -1.722040 -0.266579  0.910084
2        0.060483  0.750894 -0.479338  0.608312
'''

# Select one outer column level
print(df['t1'])
'''
                     x         y
area numbers                    
a    1       -0.125867 -1.722040
     2        0.060483  0.750894
b    1        0.345995  1.470237
     2       -1.977062  0.071204
c    1        0.963804  0.186688
     2        1.729371 -0.775489
d    1        0.376276  0.693671
     2       -0.632945 -2.046240
'''

# Swap index levels
df = df.swaplevel('area', 'numbers')
'''
                    t1                  t2          
                     x         y         x         y
numbers area                                        
1       a    -0.125867 -1.722040 -0.266579  0.910084
2       a     0.060483  0.750894 -0.479338  0.608312
1       b     0.345995  1.470237  1.763323 -0.336475
2       b    -1.977062  0.071204  0.000797 -0.323753
1       c     0.963804  0.186688  0.443276  0.615650
2       c     1.729371 -0.775489  1.663172 -0.657688
1       d     0.376276  0.693671  0.982811 -0.393840
2       d    -0.632945 -2.046240  0.865305  1.150940
'''
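
After swaplevel the rows keep their original order; a sort_index call regroups them by the new outer level, which is usually the next step. A small sketch:

python
# Sort rows so that all numbers==1 rows come before numbers==2 rows
print(df.sort_index(level='numbers'))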

4.2. Discretization

4.2.1. Grouped Computation

python
# Set display limits
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
      hour         x         y         z
area                                    
9       18  1.453873 -0.452853  0.126672
5       20 -0.541874 -0.798552  0.209252
9       12  0.848762 -0.734806  0.124415
...    ...       ...       ...       ...
2       15  0.334253  0.231890  3.557654
0        3 -0.383228 -0.562431  2.418784
8       12 -1.004758 -0.539583  1.589166

[1000 rows x 4 columns]
'''

# Group the whole table by the index and a column, then aggregate
print(df.groupby([df.index, df['hour']]).mean())
'''
                  x         y         z
area hour                              
0    0     1.093072  0.004997 -0.183101
     1    -1.092604  0.702151 -0.556797
     2     0.429444 -2.022444 -0.346545
...             ...       ...       ...
9    21   -0.544013  0.282282 -0.191250
     22   -0.171328 -0.251439 -0.022121
     23   -0.170815 -0.064200 -0.244528

[238 rows x 3 columns]
'''

# Group a single column by the index, then aggregate
print(df['x'].groupby(df.index).sum())
'''
area
0   -10.897082
1    -4.915652
2   -13.841750
       ...    
7    -4.954806
8    -6.258694
9     2.828964
Name: x, Length: 10, dtype: float64
'''
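
groupby results are not limited to a single statistic: agg accepts a list of functions. A sketch on the same grouping:

python
# Several statistics for one column, grouped by area
print(df['x'].groupby(df.index).agg(['mean', 'std', 'count']))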

4.2.2. Binned Labels

python
# Set display limits
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
      hour         x         y         z
area                                    
9       18  1.453873 -0.452853  0.126672
5       20 -0.541874 -0.798552  0.209252
9       12  0.848762 -0.734806  0.124415
...    ...       ...       ...       ...
2       15  0.334253  0.231890  3.557654
0        3 -0.383228 -0.562431  2.418784
8       12 -1.004758 -0.539583  1.589166

[1000 rows x 4 columns]
'''

# Bin by fixed value intervals
bins = [0, 0.1, 0.4, 0.8, 1.6, 3.2, 4.0]
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)
'''
      hour         x         y         z rank_x
area                                           
9       18  1.453873 -0.452853  0.126672      B
5       20 -0.541874 -0.798552  0.209252      C
9       12  0.848762 -0.734806  0.124415      B
...    ...       ...       ...       ...    ...
2       15  0.334253  0.231890  3.557654      D
0        3 -0.383228 -0.562431  2.418784      D
8       12 -1.004758 -0.539583  1.589166      B

[1000 rows x 5 columns]
'''

# Bin by quantile intervals
bins = np.percentile(df['x'], [0, 25, 50, 70, 85, 95, 100])
labels = ['E', 'D', 'C', 'B', 'A', 'S']
df['rank_x'] = pd.cut(df['x'].abs(), bins, labels=labels)
print(df)
'''
      hour         x         y         z rank_x
area                                           
9       18  1.453873 -0.452853  0.126672      A
5       20 -0.541874 -0.798552  0.209252      B
9       12  0.848762 -0.734806  0.124415      B
...    ...       ...       ...       ...    ...
2       15  0.334253  0.231890  3.557654      C
0        3 -0.383228 -0.562431  2.418784      C
8       12 -1.004758 -0.539583  1.589166      A

[1000 rows x 5 columns]
'''
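
For quantile-based binning, pandas also provides qcut, which derives the bin edges from quantiles directly instead of going through np.percentile. A sketch reusing the labels above (the new column name rank_x_q is only for illustration):

python
# Split |x| into six groups of roughly equal size using its own quantiles
df['rank_x_q'] = pd.qcut(df['x'].abs(), q=6, labels=labels)
print(df['rank_x_q'].value_counts())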

4.3. Combining Datasets

python
# Set display limits
pd.set_option('display.max_rows', 6)

# Read the table
df = pd.read_csv(r'E:\Pycharm_Python\course\demo.csv', index_col=0)
'''
      hour         x         y         z
area                                    
9       18  1.453873 -0.452853  0.126672
5       20 -0.541874 -0.798552  0.209252
9       12  0.848762 -0.734806  0.124415
...    ...       ...       ...       ...
2       15  0.334253  0.231890  3.557654
0        3 -0.383228 -0.562431  2.418784
8       12 -1.004758 -0.539583  1.589166

[1000 rows x 4 columns]
'''

# Append one table to another and concatenate several tables
# (DataFrame.append was removed in pandas 2.0, so both use pd.concat)
print(pd.concat([df.iloc[:3], df.iloc[3:6]]))
print(pd.concat([df.iloc[:2], df.iloc[2:4], df.iloc[4:6]], axis=0))
'''
      hour         x         y         z
area                                    
9       18  1.453873 -0.452853  0.126672
5       20 -0.541874 -0.798552  0.209252
9       12  0.848762 -0.734806  0.124415
1       13  0.794053  1.838139 -0.268814
8        2 -0.115496  2.054565  0.860301
9        9  1.235167  1.030952 -0.517618
'''

# Merge
'''Sample tables'''
df1 = df.iloc[2:5][['x', 'y']]
'''
             x         y
area                    
9     0.848762 -0.734806
1     0.794053  1.838139
8    -0.115496  2.054565
'''

df2 = df.iloc[2:5][['x', 'z']].sample(frac=1)
'''
             x         z
area                    
8    -0.115496  0.860301
1     0.794053 -0.268814
9     0.848762  0.124415
'''

print(pd.merge(df1, df2, on='area'))
'''
           x_x         y       x_y         z
area                                        
9     0.848762 -0.734806  0.848762  0.124415
1     0.794053  1.838139  0.794053 -0.268814
8    -0.115496  2.054565 -0.115496  0.860301
'''
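
merge defaults to an inner join on the shared key; the how parameter switches to left/right/outer joins, and suffixes controls how overlapping column names (like x above) are renamed. A small sketch:

python
# Keep every area from either table and tag the overlapping 'x' columns explicitly
print(pd.merge(df1, df2, on='area', how='outer', suffixes=('_left', '_right')))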