《Python数据分析技术栈》第06章使用 Pandas 准备数据 01 Pandas概览(Pandas at a glance)

01 Pandas概览(Pandas at a glance)

《Python数据分析技术栈》第06章使用 Pandas 准备数据 01 Pandas概览(Pandas at a glance)

Pandas概述

Wes McKinney developed the Pandas library in 2008. The name (Pandas) comes from the term "Panel Data" used in econometrics for analyzing time-series data. Pandas has many features, listed in the following, that make it a popular tool for data wrangling and analysis.

Wes McKinney 于 2008 年开发了 Pandas 库。Pandas 这个名字来源于计量经济学中用于分析时间序列数据的术语 "面板数据"。Pandas 有许多功能,这些功能使其成为数据处理和分析的常用工具。

Pandas provides features for labeling of data or indexing, which speeds up the retrieval of data.

Pandas 提供数据标签或索引功能,可加快数据检索速度。

Input and output support: Pandas provides options to read data from different file formats like JSON (JavaScript Object Notation), CSV (Comma-Separated Values), Excel, and HDF5 (Hierarchical Data Format Version 5). It can also be used to write data into databases, web services, and so on.

输入和输出支持: Pandas 提供从不同文件格式读取数据的选项,如 JSON(JavaScript Object Notation)、CSV(Comma-Separated Values)、Excel 和 HDF5(Hierarchical Data Format Version 5)。它还可用于将数据写入数据库、网络服务等。

Most of the data that is needed for analysis is not contained in a single source, and we often need to combine datasets to consolidate the data that we need for analysis. Again, Pandas comes to the rescue with tailor-made functions to combine data.

分析所需的大部分数据并不包含在单一来源中,因此我们经常需要合并数据集,以整合分析所需的数据。Pandas 又一次提供了量身定制的合并数据函数。

Speed and enhanced performance: The Pandas library is based on Cython, which combines the convenience and ease of use of Python with the speed of the C language. Cython helps to optimize performance and reduce overheads.

速度和增强的性能 Pandas 库基于 Cython,它将 Python 的方便易用与 C 语言的速度相结合。Cython 有助于优化性能和减少开销。

Data visualization: To derive insights from the data and make it presentable to the audience, viewing data using visual means is crucial, and Pandas provides a lot of built-in visualization tools using Matplotlib as the base library.

数据可视化: 要从数据中获得洞察力并将其呈现给受众,使用可视化手段查看数据至关重要,而 Pandas 使用 Matplotlib 作为基础库,提供了大量内置可视化工具。

Support for other libraries: Pandas integrates smoothly with other libraries like Numpy, Matplotlib, Scipy, and Scikit-learn. Thus we can perform other tasks like numerical computations, visualizations, statistical analysis, and machine learning in conjunction with data manipulation.

支持其他库 Pandas 可与 Numpy、Matplotlib、Scipy 和 Scikit-learn 等其他库顺利集成。因此,我们可以结合数据处理执行其他任务,如数值计算、可视化、统计分析和机器学习。

Grouping: Pandas provides support for the split-apply-combine methodology, whereby we can group our data into categories, apply separate functions on them, and combine the results.

分组: Pandas 支持 "拆分-应用-合并 "方法,我们可以将数据分组,分别应用不同的函数,然后合并结果。

Handling missing data, duplicates, and filler characters: Data often has missing values, duplicates, blank spaces, special characters (like $, &), and so on that may need to be removed or replaced. With the functions provided in Pandas, you can handle such anomalies with ease.

处理缺失数据、重复数据和填充字符: 数据中经常会有需要删除或替换的缺失值、重复数据、空白、特殊字符(如 $、&)等。利用 Pandas 提供的函数,您可以轻松处理此类异常情况。

Mathematical operations: Many numerical operations and computations can be performed in Pandas, with NumPy being used at the back end for this purpose.

数学运算 在 Pandas 中可以执行许多数值运算和计算,NumPy 在后端用于此目的。

环境准备

If you have not already installed Pandas, go to the Anaconda Prompt and enter the following command.

如果尚未安装 Pandas,请转到 Anaconda 提示符并输入以下命令。

bash 复制代码
pip install pandas

Once the Pandas library is installed, you need to import it before using its functions. In your Jupyter notebook, type the following to import this library.

安装好 Pandas 库后,在使用其功能之前需要将其导入。在 Jupyter 笔记本中,键入以下内容导入该库。

python 复制代码
import pandas as pd

Here, pd is a shorthand name or alias that is a standard for Pandas.

这里,pd 是 Pandas 标准的速记名称或别名。

For some of the examples, we also use functions from the NumPy library. Ensure that both the Pandas and NumPy libraries are installed and imported.

在部分示例中,我们还使用了 NumPy 库中的函数。确保已安装并导入 Pandas 和 NumPy 库。

You need to download a dataset, "subset-covid-data.csv", that contains data about the number of cases and deaths related to the COVID-19 pandemic for various countries on a particular date. Please use the following link for downloading the dataset: https://github.com/DataRepo2019/Data-files/blob/master/subset-covid-data.csv

您需要下载一个名为 "subset-covid-data.csv "的数据集,其中包含特定日期不同国家与 COVID-19 大流行相关的病例数和死亡数的数据。请使用以下链接下载数据集: https://github.com/DataRepo2019/Data-files/blob/master/subset-covid-data.csv

相关推荐
明月说数据13 分钟前
Smartbi 10 月版本亮点:AIChat对话能力提升,国产化部署更安全
ai·数据分析·版本更新
AI视觉网奇16 分钟前
coco json 分类标注工具源代码
开发语言·python
要加油GW1 小时前
python使用vscode 需要配置全局的环境变量。
开发语言·vscode·python
B站计算机毕业设计之家1 小时前
python图像识别系统 AI多功能图像识别检测系统(11种识别功能)银行卡、植物、动物、通用票据、营业执照、身份证、车牌号、驾驶证、行驶证、车型、Logo✅
大数据·开发语言·人工智能·python·图像识别·1024程序员节·识别
@小红花1 小时前
Tableau 从零到精通:系统教学文档(自学版)
信息可视化·数据挖掘·数据分析
wudl55661 小时前
Pandas-DataFrame 数据结构详解
数据结构·pandas
快乐的钢镚子1 小时前
思腾合力云服务器远程连接
运维·服务器·python
苏打水com1 小时前
爬虫进阶实战:突破动态反爬,高效采集CSDN博客详情页数据
爬虫·python
wudl55661 小时前
Pandas 简介与安装
pandas
夫唯不争,故无尤也2 小时前
三大AI部署框架对比:本地权重与多模型协作实战
人工智能·python·深度学习