pandas notes 30

1. What is pandas?

pandas main page

pandas installation instructions

Anaconda distribution of Python (includes pandas)

How to use the IPython/Jupyter notebook

2. How do I read a tabular data file into pandas?

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)

3. How do I select a pandas Series from a DataFrame?

select the 'City' Series using bracket notation

ufo['City']

or equivalently, use dot notation

ufo.City

ufo['Location'] = ufo.City + ', ' + ufo.State

4. Why do some pandas commands end with parentheses

movies.head()

movies.columns

movies.describe()

movies.shape

movies.dtypes

use an optional parameter to the describe method to summarize only 'object' columns

movies.describe(include=['object'])

5. How do I rename columns in a pandas DataFrame?

ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']

ufo.columns = ufo_cols

replace the column names during the file reading process by using the 'names' parameter

ufo = pd.read_csv('http://bit.ly/uforeports', header=0, names=ufo_cols)

ufo.columns = ufo.columns.str.replace(' ', '_')

6. How do I remove columns from a pandas DataFrame?

remove a single column (axis=1 refers to columns)

ufo.drop('Colors Reported', axis=1, inplace=True)

remove multiple columns at once

ufo.drop(['City', 'State'], axis=1, inplace=True)

remove multiple rows at once (axis=0 refers to rows)

ufo.drop([0, 1], axis=0, inplace=True)

7. How do I sort a pandas DataFrame or a Series?

movies.title.sort_values().head()

movies.title.sort_values(ascending=False).head()

movies.sort_values('title').head()

movies.sort_values('title', ascending=False).head()

movies.sort_values(['content_rating', 'duration']).head()

8. How do I filter rows of a pandas DataFrame by column value?

select the 'genre' Series from the filtered DataFrame

movies[movies.duration >= 200].genre

or equivalently, use the 'loc' method

movies.loc[movies.duration >= 200, 'genre']

9. How do I apply multiple filter criteria to a pandas DataFrame?

movies[(movies.duration >=200) & (movies.genre == 'Drama')]

movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')]

movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]

10. Your pandas questions answered!

specify which columns to include by name

ufo = pd.read_csv('http://bit.ly/uforeports', usecols=['City', 'State'])

or equivalently, specify columns by position

ufo = pd.read_csv('http://bit.ly/uforeports', usecols=[0, 4])

ufo = pd.read_csv('http://bit.ly/uforeports', nrows=3)

various methods are available to iterate through a DataFrame

for index, row in ufo.iterrows():

print(index, row.City, row.State)

only include numeric columns in the DataFrame

import numpy as np

drinks.select_dtypes(include=[np.number]).dtypes

drinks.describe(include='all')

drinks.describe(include=['object', 'float64'])

11. How do I use the "axis" parameter in pandas?

When performing a mathematical operation with the axis parameter:

axis 0 means the operation should "move down" the row axis

axis 1 means the operation should "move across" the column axis

'index' is an alias for axis 0

drinks.mean(axis='index')

'columns' is an alias for axis 1

drinks.mean(axis='columns')

12. How do I use string methods in pandas?

string methods for pandas Series are accessed via 'str'

orders.item_name.str.upper()

orders[orders.item_name.str.contains('Chicken')]

orders.choice_description.str.replace('[', '').str.replace(']', '')

many pandas string methods support regular expressions (regex)

orders.choice_description.str.replace('[\[\]]', '')

13. How do I change the data type of a pandas Series?

change the data type of an existing Series

drinks['beer_servings'] = drinks.beer_servings.astype(float)

drinks = pd.read_csv('http://bit.ly/drinksbycountry', dtype={'beer_servings':float})

14. When should I use a "groupby" in pandas?

drinks.groupby('continent').beer_servings.mean()

drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])

drinks.groupby('continent').mean().plot(kind='bar')

15. How do I explore a pandas Series?

movies.genre.value_counts()

display percentages instead of raw counts

movies.genre.value_counts(normalize=True)

movies.genre.unique()

count the number of unique values in the Series

movies.genre.nunique()

compute a cross-tabulation of two Series

pd.crosstab(movies.genre, movies.content_rating)

movies.genre.value_counts().plot(kind='bar')

16. How do I handle missing values in pandas?

What does "NaN" mean?

"NaN" is not a string, rather it's a special value: numpy.nan.

It stands for "Not a Number" and indicates a missing value.

read_csv detects missing values (by default) when reading the file, and replaces them with this special value.

ufo.isnull()

ufo.notnull()

count the number of missing values in each Series

ufo.isnull().sum()

How to handle missing values depends on the dataset as well as the nature of your analysis. Here are some options:

ufo.shape

if 'any' values are missing in a row, then drop that row

ufo.dropna(how='any').shape

if 'all' values are missing in a row, then drop that row (none are dropped in this case)

ufo.dropna(how='all').shape

if 'any' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row

ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape

if 'all' values are missing in a row (considering only 'City' and 'Shape Reported'), then drop that row

ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape

'value_counts' does not include missing values by default

ufo['Shape Reported'].value_counts().head()

explicitly include missing values

ufo['Shape Reported'].value_counts(dropna=False).head()

fill in missing values with a specified value

ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)

17. What do I need to know about the pandas index?

set an existing column as the index

drinks.set_index('country', inplace=True)

country name can now be used for selection

drinks.loc['Brazil', 'beer_servings']

index name is optional

drinks.index.name = None

restore the index name, and move the index back to a column

drinks.index.name = 'country'

drinks.reset_index(inplace=True)

18. What do I need to know about the pandas index?

drinks.continent.value_counts().sort_index()

drinks.continent.value_counts()['Africa']

people = pd.Series([3000000, 85000], index=['Albania', 'Andorra'], name='population')

calculate the total annual beer servings for each country

(drinks.beer_servings * people).head()

The two Series were aligned by their indexes.

If a value is missing in either Series, the result is marked as NaN.

Alignment enables us to easily work with incomplete data.

concatenate the 'drinks' DataFrame with the 'population' Series (aligns by the index)

pd.concat([drinks, people], axis=1).head()

19. How do I select multiple rows and columns from a pandas DataFrame?

The loc method is used to select rows and columns by label. You can pass it:

A single label

A list of labels

A slice of labels

A boolean Series

A colon (which indicates "all labels")

The iloc method is used to select rows and columns by integer position. You can pass it:

A single integer position

A list of integer positions

A slice of integer positions

A colon (which indicates "all integer positions")

drinks = pd.read_csv('http://bit.ly/drinksbycountry', index_col='country')

20. When should I use the "inplace" parameter in pandas?

ufo.drop('City', axis=1, inplace=True)

21. How do I make my pandas DataFrame smaller and faster?

drinks.info()

drinks.info(memory_usage='deep')

drinks.memory_usage(deep=True)

use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers

drinks['continent'] = drinks.continent.astype('category')

df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)

comparison operators work with ordered categories

df.loc[df.quality > 'good', :]

22. How do I use pandas with scikit-learn to create Kaggle submissions?

23. More of your pandas questions answered!

ufo.sample(n=3)

ufo.sample(n=3, random_state=42)

sample 75% of the DataFrame's rows without replacement

train = ufo.sample(frac=0.75, random_state=99)

store the remaining 25% of the rows in another DataFrame

test = ufo.loc[~ufo.index.isin(train.index), :]

24. How do I create dummy variables in pandas?

create the 'Sex_male' dummy variable using the 'map' method

train['Sex_male'] = train.Sex.map({'female':0, 'male':1})

alternative: use 'get_dummies' to create one column for every possible value

pd.get_dummies(train.Sex)

Generally speaking:

If you have "K" possible values for a categorical feature, you only need "K-1" dummy variables to capture all of the information about that feature.

One convention is to drop the first dummy variable, which defines that level as the "baseline".

add a prefix to identify the source of the dummy variables

pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:].head()

use 'get_dummies' with a feature that has 3 possible values

pd.get_dummies(train.Embarked, prefix='Embarked').head(10)

drop the first dummy variable ('C')

pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:].head(10)

25.How do I work with dates and times in pandas?

ufo['Time'] = pd.to_datetime(ufo.Time)

ufo.Time.dt.hour

ufo.Time.dt.weekday_name

ufo.Time.dt.dayofyear

convert a single string to datetime format (outputs a timestamp object)

ts = pd.to_datetime('1/1/1999')

perform mathematical operations with timestamps (outputs a timedelta object)

ufo.Time.max() - ufo.Time.min()

26. How do I find and remove duplicate rows in pandas?

read a dataset of movie reviewers into a DataFrame

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')

detect duplicate zip codes: True if an item is identical to a previous item

users.zip_code.duplicated()

count the duplicate rows

users.duplicated().sum()

27. How do I avoid a SettingWithCopyWarning in pandas?

Solution: Any time you are attempting to create a DataFrame copy, use the copy method.

28. How do I change display options in pandas?

check the current setting for the 'max_rows' option

pd.get_option('display.max_rows')

overwrite the current setting so that all rows will be displayed

pd.set_option('display.max_rows', None)

reset the 'max_rows' option to its default

pd.reset_option('display.max_rows')

the 'max_columns' option is similar to 'max_rows'

pd.get_option('display.max_columns')

overwrite the current setting so that more characters will be displayed

pd.set_option('display.max_colwidth', 1000)

overwrite the 'precision' setting to display 2 digits after the decimal point of 'Fare'

pd.set_option('display.precision', 2)

use a Python format string to specify a comma as the thousands separator

pd.set_option('display.float_format', '{:,}'.format)

29. How do I create a pandas DataFrame from another object?

optionally specify the order of columns and define the index

df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'red']}, columns=['id', 'color'], index=['a', 'b', 'c'])

create a DataFrame from a list of lists (each inner list becomes a row)

pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'red']], columns=['id', 'color'])

arr = np.random.rand(4, 2)

pd.DataFrame(arr, columns=['one', 'two'])

create a DataFrame of student IDs (100 through 109) and test scores (random integers between 60 and 100)

pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)})

'set_index' can be chained with the DataFrame constructor to select an index

pd.DataFrame({'student':np.arange(100, 110, 1), 'test':np.random.randint(60, 101, 10)}).set_index('student')

create a new Series using the Series constructor

s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')

30. How do I apply a function to a pandas Series or DataFrame?

map 'female' to 0 and 'male' to 1

train['Sex_num'] = train.Sex.map({'female':0, 'male':1})

calculate the length of each string in the 'Name' Series

train['Name_length'] = train.Name.apply(len)

round up each element in the 'Fare' Series to the next integer

import numpy as np

train['Fare_ceil'] = train.Fare.apply(np.ceil)

alternatively, use a lambda function

train.Name.str.split(',').apply(lambda x: x[0]).head()

convert every DataFrame element into a float

drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)

相关推荐
橙露9 小时前
从零基础到实战:Python 数据分析三剑客(Pandas+NumPy+Matplotlib)核心应用指南
python·数据分析·pandas
清水白石0081 天前
手写超速 CSV 解析器:利用 multiprocessing 与 mmap 实现 10 倍 Pandas 加速
python·pandas
Hello.Reader3 天前
PyFlink 向量化 UDF(Vectorized UDF)Arrow 批传输原理、pandas 标量/聚合函数、配置与内存陷阱、五种写法一网打尽
python·flink·pandas
Hello.Reader3 天前
PyFlink Table API Data Types DataType 是什么、UDF 类型声明怎么写、Python / Pandas 类型映射一文搞懂
python·php·pandas
Hello.Reader3 天前
PyFlink Table API 用户自定义函数(UDF)通用 UDF vs Pandas UDF、打包部署、open 预加载资源、读取作业参数、单元测试
log4j·pandas
海棠AI实验室4 天前
第十六章:小项目 2 CSV → 清洗 → 统计 → 图表 → 报告输出
pandas
逻极4 天前
数据分析项目:Pandas + SQLAlchemy,从数据库到DataFrame的丝滑实战
python·mysql·数据分析·pandas·sqlalchemy
海棠AI实验室4 天前
第十七章 调试与排错:读懂 Traceback 的方法论
python·pandas·调试
kong79069284 天前
Pandas简介
信息可视化·数据分析·pandas
爱喝可乐的老王4 天前
数据分析实践--数据解析购房关键
信息可视化·数据分析·pandas·matplotlib