Basic dataset operations

(1) Read a dataset in CSV format

import pandas as pd

pd.read_csv('csv_file')
pd.read_excel('excel_file')  # the Excel counterpart

(2) Write a DataFrame directly to a CSV file

df.to_csv("data.csv", sep=",", index=False)
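As a quick check of how reading and writing fit together, the sketch below round-trips a small frame through CSV text. It writes to an in-memory `io.StringIO` buffer rather than a file so that it runs anywhere; the sample data and the names `buffer` and `restored` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)

# Rewind the buffer and read the same CSV text back.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

Because `index=False` was passed, the row index is not written, so the restored frame has only the two data columns.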

(3) Basic dataset feature information

df.info()

(4) Basic dataset statistics

df.describe()

(5) Print a data frame as a table

from tabulate import tabulate

# `print_table` is a list whose elements are themselves lists (one per row);
# `headers` is a list of header strings.
print(tabulate(print_table, headers=headers))

(6) List all column names

df.columns

Basic data handling

(1) Drop missing data

# Returns a DataFrame with every row (axis=0) that contains any NaN removed.
# With how='all', only rows whose elements are all NaN are removed.
new_df = df.dropna(axis=0, how='any')
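The difference between `how='any'` and `how='all'` is easiest to see on a small frame. The following is a minimal sketch; the data and the variable names `any_dropped` and `all_dropped` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing.
df = pd.DataFrame({
    "height": [1.5, np.nan, np.nan],
    "size": [5.0, 3.0, np.nan],
})

# how="any": drop a row as soon as one of its values is NaN.
any_dropped = df.dropna(axis=0, how="any")

# how="all": drop a row only when every value in it is NaN.
all_dropped = df.dropna(axis=0, how="all")
```

Neither call modifies `df`; each returns a new DataFrame.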

(2) Replace missing data

# Replaces every occurrence of `to_replace` in the DataFrame with `value`.
# The None arguments here are placeholders: supply your own values for both.
df.replace(to_replace=None, value=None)
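A concrete call may help here. The sketch below assumes a hypothetical sentinel value of -1 that stands for "missing": `replace` turns it into NaN, and `fillna` (a separate pandas method aimed specifically at NaN values) then fills the gaps. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which -1.0 marks a missing measurement.
df = pd.DataFrame({"size": [5.0, -1.0, np.nan]})

# Turn the sentinel into a proper NaN.
cleaned = df.replace(to_replace=-1.0, value=np.nan)

# Fill every NaN with 0.0.
filled = cleaned.fillna(0.0)
```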

(3) Check for NaN values

# Detects missing values: NaN in numeric arrays, None/NaN in object arrays.
pd.isnull(object)
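Applied to a whole DataFrame, `pd.isnull` returns a boolean frame of the same shape, which can then be summed to count missing values per column. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({"name": ["a", None], "size": [1.0, np.nan]})

# Boolean DataFrame: True wherever a value is missing.
mask = pd.isnull(df)

# True counts as 1, so the column sums are the missing-value counts.
missing_per_column = mask.sum()
```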

(4) Drop a feature

# axis=0 refers to rows, axis=1 refers to columns.
new_df = df.drop('feature_variable_name', axis=1)

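(5) Convert an object-typed column to a numeric type

# Converts an object-typed column (strings in this case) to numeric so that
# calculations can be run on it; values that cannot be parsed become NaN.
pd.to_numeric(df["feature_name"], errors='coerce')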
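To show what `errors='coerce'` does in practice, here is a minimal sketch on a column of strings, one of which is not a number. The sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column read in as strings; "n/a" cannot be parsed.
df = pd.DataFrame({"feature_name": ["1.5", "2", "n/a"]})

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(df["feature_name"], errors="coerce")
```

The call returns a new Series; assign it back (for example `df["feature_name"] = converted`) to keep the converted values.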

(6) Convert a DataFrame to a NumPy array

# `as_matrix()` was removed in pandas 1.0; `to_numpy()` is its replacement.
arr = df.to_numpy()
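One detail worth knowing when converting is that a NumPy array has a single dtype, so mixed numeric columns are upcast to a common type. The sketch below, on made-up data, uses `to_numpy()`, the method available in current pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one float column and one integer column.
df = pd.DataFrame({"height": [1.5, 2.0], "size": [5, 3]})

# The result is a 2-D array; the integers are upcast to float64
# so that all elements share one dtype.
arr = df.to_numpy()
```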

(7) Get the first "n" rows of a DataFrame

df.head(n)

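(8) Get data by feature name

# `df.loc[label]` on its own selects a row by index label;
# to select a feature (column) by name, pass it as the second axis.
df.loc[:, feature_name]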
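Since `loc` is label-based on both axes, it helps to see row and column selection side by side. This is a small sketch; the index labels `r1` and `r2` and the data are invented for the example.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"name": ["a", "b"], "size": [5, 3]},
    index=["r1", "r2"],
)

# A single label selects a row, returned as a Series.
row = df.loc["r1"]

# A full row slice plus a column name selects a column.
column = df.loc[:, "size"]
```

`df.loc[:, "size"]` returns the same data as the shorter `df["size"]`.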

DataFrame operations

(1) Apply a function to a DataFrame

# Multiply every value in the "height" column by 2
df["height"].apply(lambda height: 2 * height)

# OR
def multiply(x):
    return x * 2

df["height"].apply(multiply)
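Note that `apply` returns a new Series and leaves the original column untouched, so the result has to be assigned somewhere to be kept. A minimal sketch, with an illustrative column name `height_doubled` that is not part of the original:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"height": [1.0, 2.5]})

# apply() returns a new Series; df["height"] itself is unchanged.
doubled = df["height"].apply(lambda height: 2 * height)

# Store the result in a new column to keep it.
df["height_doubled"] = doubled
```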

(2) Rename a column

# Rename the third column of the DataFrame to "size"
df.rename(columns={df.columns[2]: 'size'}, inplace=True)
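As an alternative to `inplace=True`, `rename` can return a renamed copy and leave the original frame as it was. The sketch below assumes a hypothetical three-column frame.

```python
import pandas as pd

# Hypothetical frame; the third column is called "width" to begin with.
df = pd.DataFrame({"name": ["a"], "height": [1.0], "width": [2.0]})

# Without inplace=True, rename() returns a new DataFrame.
renamed = df.rename(columns={df.columns[2]: "size"})
```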

(3) Get the unique values of a column

df["name"].unique()

(3) Access a sub-DataFrame

new_df = df[["name", "size"]]

(4) Summarize the data

# Sum of values in a data frame
df.sum()
# Lowest value of a data frame
df.min()
# Highest value
df.max()
# Index of the lowest value
df.idxmin()
# Index of the highest value
df.idxmax()
# Statistical summary of the data frame, with quartiles, median, etc.
df.describe()
# Average values
df.mean()
# Median values
df.median()
# Correlation between columns
df.corr()
# To get these values for only one column, just select it like this
df["size"].median()
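For a concrete sense of what these calls return, the following sketch runs a few of them on a single column of a small frame. The data and the index labels `x`, `y`, `z` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with labelled rows.
df = pd.DataFrame({"size": [3, 5, 1]}, index=["x", "y", "z"])

total = df["size"].sum()          # sum of the column
largest_at = df["size"].idxmax()  # index label of the largest value
smallest_at = df["size"].idxmin() # index label of the smallest value
middle = df["size"].median()      # median of the column
```

`idxmax` and `idxmin` return index labels rather than positions, which is why the results here are strings.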

(5) Sort the data

# A DataFrame needs `by` to know which column to sort on; "size" is used here.
df.sort_values(by="size", ascending=False)
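On a DataFrame, `sort_values` needs a `by` argument naming the column (or columns) to sort on. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [3, 5, 1]})

# Sort rows by "size", largest first; the original index labels are kept.
sorted_df = df.sort_values(by="size", ascending=False)
```

Pass `ignore_index=True` if a fresh 0..n-1 index is preferred after sorting.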

(6) Boolean indexing

df[df["size"] == 5]

(7) Select values

# `loc` is indexed with square brackets, not called like a function.
df.loc[[0], ['size']]
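The shape of what `loc` returns depends on whether it is given lists or plain labels, which is easy to overlook. The sketch below contrasts the two forms and also shows a boolean filter; the data is made up for the example.

```python
import pandas as pd

# Hypothetical data with the default integer index.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [5, 3, 5]})

# Boolean mask: keep the rows where "size" equals 5.
matches = df[df["size"] == 5]

# Lists of labels give back a DataFrame (here 1 row by 1 column).
cell_frame = df.loc[[0], ["size"]]

# Plain labels give back the scalar value itself.
cell_value = df.loc[0, "size"]
```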

Tags: pandas


Pandas Cheatsheet