
Little Rabbit's Data Analysis Ramblings (not written by me)

2019-04-29 14:39:32

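Python versions

Python 2 or Python 3

Python 2.x is the legacy line; Python 3.x is the current one.
Python 2.7, the final 2.x release, came out in 2010 and has received few major updates since.
Python 2.x historically had the larger set of third-party libraries, although most major libraries now support Python 3.x.
At the time of writing, most Linux systems still install Python 2.x by default.
Which version to choose depends on the problem being solved.

Cases where Python 2.x is the better choice:

The deployment environment is outside your control and you cannot choose the Python version.
A library you need does not yet support Python 3.x.
If you choose Python 3.x, first confirm that the libraries you need support it.

Note: this course uses Python 3.x.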

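Python environment and IDEs

Python environment

Anaconda: a scientific-computing distribution that bundles a large set of commonly used packages, including conda, Python, and more than 180 scientific packages with their dependencies. It is available for Windows, macOS, and Linux. Download: https://www.continuum.io/downloads

Install a package: pip install xxx, conda install xxx

Uninstall a package: pip uninstall xxx, conda uninstall xxx

Upgrade a package: pip install --upgrade xxx, conda update xxx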

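IDE

Jupyter Notebook:

Command: jupyter notebook

Bundled with Anaconda; no separate installation needed
Shows results as each cell runs
A basic web-based editor that runs locally
Notebooks are shared as .ipynb files
Interactive
Keeps the output of previous runs

IPython:

Command: ipython

Bundled with Anaconda; no separate installation needed
An interactive command-line shell for Python
Interactive
Keeps the output of previous runs
Good for checking an idea quickly

Spyder:

Command: spyder

Bundled with Anaconda; no separate installation needed
Completely free; well suited to users familiar with Matlab
A capable, easy-to-use graphical development environment

PyCharm:

Must be installed separately: https://www.jetbrains.com/pycharm/download
PyCharm is JetBrains' Python IDE and runs on all major platforms.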

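Commonly used new features in Python 3.x

print() is a function, not a statement.
The raw_input() function is replaced by input().
Python 3 separates text from binary data more clearly.
1. Text is Unicode and has type str.
2. Binary data has type bytes.
The bytes type represents binary data and encoded text; a bytes literal is written with a b prefix, as in b'abc'.
Converting between bytes and str in Python 3:
1. A str can be encoded (encode) into bytes.
2. A bytes object can be decoded (decode) into a str.
String formatting: the format() method is available alongside the % operator.
Changes to dict:

The Python 2 methods iterkeys(), itervalues(), and iteritems() are gone.

Use keys(), values(), and items() instead; in Python 3 they return view objects rather than lists.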
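
The features listed above can be tried in a few lines. This is only an illustrative sketch; the sample string and dictionary are made up for the example.

```python
# Text (str) versus binary data (bytes) in Python 3.
text = "数据 data"                 # str: a sequence of Unicode code points
raw = text.encode("utf-8")         # str -> bytes
round_trip = raw.decode("utf-8")   # bytes -> str

# A bytes literal carries the b prefix.
literal = b"abc"

# print() is a function, and str.format() builds the output string.
message = "{} has {} characters and {} UTF-8 bytes".format(text, len(text), len(raw))
print(message)

# dict.keys(), values(), and items() return view objects, not lists.
scores = {"a": 1, "b": 2}
keys_view = scores.keys()
scores["c"] = 3                    # the view reflects later changes
print(sorted(keys_view))
```

Because keys_view is a view rather than a copy, it sees the key added after it was created.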

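String encodings in review:

ASCII: the encoding early computers used to store English characters.
GB2312: an extension of ASCII that adds Chinese characters.
GBK/GB18030: contains everything in GB2312 and adds nearly 20,000 further Chinese characters and symbols.
Unicode: a character set covering the world's scripts and symbols. Fixed-width encodings of it spend 2 to 4 bytes on every character, which wastes space for text that is mostly ASCII.
UTF-8: a variable-length encoding and the most widely used implementation of Unicode on the Internet. The number of bytes depends on the character: an ASCII letter takes 1 byte and a common Chinese character takes 3. It is also the default encoding on most Linux systems.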
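
To make the byte counts concrete, the sketch below encodes one letter and one Chinese character with several codecs from Python's standard library. The helper name encoded_length is hypothetical and exists only for this example.

```python
samples = ["a", "中"]
encodings = ["ascii", "gbk", "utf-8", "utf-32-le"]

def encoded_length(char, encoding):
    """Return the number of bytes needed, or None if the encoding cannot represent char."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

table = {(c, e): encoded_length(c, e) for c in samples for e in encodings}
for (c, e), n in table.items():
    print(repr(c), e, n)
```

ASCII cannot represent the Chinese character at all, GBK needs 2 bytes for it, UTF-8 needs 3, and the fixed-width UTF-32 spends 4 bytes on every character, including the plain letter.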

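The DIKW hierarchy

The DIKW hierarchy relates data, information, knowledge, and wisdom. It can be traced back to T. S. Eliot's poem "The Rock", whose opening stanza asks: "Where is the wisdom we have lost in knowledge? / Where is the knowledge we have lost in information?"

In December 1982 the American educator Harlan Cleveland quoted these lines in an article in The Futurist and put forward the idea of "Information as a Resource".

The educator Milan Zeleny and the management thinker Russell Ackoff later developed the theory further. Zeleny wrote "Management Support Systems: Towards Integrated Knowledge Management" (Human Systems Management) in 1987, and Ackoff wrote "From Data to Wisdom" (Journal of Applied Systems Analysis) in 1989.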

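DIKW in data engineering

D: Data. The lowest-level material in the DIKW hierarchy. It usually means raw data, which may or may not contain useful information.

I: Information. As a concept, information has many meanings. In data engineering it means the higher-level, specific data that data engineers (using tools) or data scientists (using mathematical methods) obtain by consolidating and extracting from raw data according to defined rules.

K: Knowledge. A settled understanding of a subject that can potentially be used for a particular purpose. In data engineering it means putting information to targeted practical use, so that the extracted information serves business applications or academic research.

W: Wisdom. Conclusions reached by thinking about and analyzing knowledge independently. In data engineering, engineers and scientists do a great deal of work to extract as much value (I/K) as possible with programs, but gaining deeper insight from the data, or even predicting what will happen next, takes a data analyst.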

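Roles in data engineering:

Data engineering is the whole process of collecting and processing data (D) and extracting value from it (turning it into I or K).

Three roles are involved: Data Engineer, Data Scientist, and Data Analyst. Their tasks overlap heavily and they must work closely together, but each covers a slightly different area. In most companies one person fills several of these roles depending on individual skills, so the roles can be hard to tell apart:

Data Engineer: analyzing data requires computers and tools that automate data processing, including format conversion, storage, updates, and queries. The data engineer builds the tools that automate this work and belongs to the infrastructure/tools layer.

This role does not appear often, however. Mature database technology such as MySQL and Oracle means many large companies need only a DBA, and the open-sourcing of Hadoop, MongoDB, and other NoSQL technology has left little dedicated data engineering work even in big-data settings; it is usually handed to data scientists.

Data Scientist: the data scientist is the intermediate role that brings in mathematics. Mathematical methods are applied to raw data to find higher-level data that cannot be seen by eye, typically with statistical machine learning or deep learning.

Some people call a data scientist a programming statistician: a strong statistics background is required, but so is taking part in program development on top of the infrastructure. Many data scientist positions now also expect the work of a data engineer. Data scientists are the main force in turning D into I or K.

Data Analyst: data engineers and data scientists do a great deal of work to extract as much value (I/K) as possible with programs, but gaining deeper insight from the data depends on industry experience and judgment, which requires human involvement.

A data analyst needs a deep understanding of the business and fluency with the tools at hand, whether Excel, SPSS, Python/R, or tools built by the engineers, and must be able to act as engineer or scientist when necessary to obtain the tools the work requires. The analyst analyzes the data with a specific purpose, presents the findings to other departments, and turns them into action. This is how data finally becomes Wisdom.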

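What is data analysis:

Baidu Baike: data analysis is the process of analyzing large amounts of collected data with suitable statistical methods, extracting useful information, forming conclusions, and studying and summarizing the data in detail. The process also supports quality management systems. In practice, data analysis helps people make judgments so that they can take appropriate action.

The data analysis process:

1. Data collection: acquiring and handling local or network data.

2. Data processing: cleaning the data and consolidating it into a defined storage format.

3. Data analysis: scientific computation on the data, using data tools.

4. Data presentation: visualization, using tools to display the results of the analysis.

Data analysis tools:

SAS: statistical analysis software developed by SAS (Statistical Analysis System) and a powerful data integration platform. It is expensive, affordable mainly for banks and large enterprises, and is used for offline analysis and modeling.
SPSS: SPSS (Statistical Product and Service Solutions) is a family of IBM products for statistical analysis, data mining, predictive analysis, and decision support, with a history of more than 40 years. It is expensive.
R/MATLAB: suited to academic data analysis; in production the work usually has to be reimplemented in Python or Scala. MATLAB, a commercial mathematics package from MathWorks, requires a paid license.
Scala: a functional programming language that runs on the JVM. Development is efficient once you are fluent, and together with Spark it suits large-scale data analysis and processing.
Python: Python has many mature frameworks and algorithm libraries for data engineering and machine learning, enough to build a data-centric application in Python alone. It is very widely used in both fields.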

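传智播客 Python学院: Data Analysis
1. Part 1: Environment setup and theoretical foundations of data analysis and modeling
1.1. Python 3.x new features and encoding review
1.2. The DIKW model and data engineering
1.3. Theoretical foundations of data analysis and modeling
2. Part 2: NumPy, the scientific computing tool
3. Part 3: Pandas, the data analysis tool
4. Part 4: Data visualization tools
5. Part 5: Natural language processing with NLTK

Python Data Analysis Course Notes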

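Data modeling basics

Big-data analysis scenarios and model applications

Data analysis modeling starts from a clear business requirement; the next step is choosing between descriptive analysis and predictive analysis.

If the goal is to describe patterns in the target's behavior, use descriptive analysis, which draws on models such as association rules, sequence rules, and clustering.
If the goal is predictive, the task is to quantify the probability that an event occurs within some future period. The two main families of predictive models are classification and regression.

Common categories of data models

Classification and regression

Classification: train a model on existing labeled samples, then use it to map an input to an output and apply a simple decision to that output to assign a class. The trained model can then classify data it has not seen.
Regression: build a suitable dependency between variables from observed data in order to analyze the regularities in the data and draw the corresponding conclusions. It can be used for forecasting, control, and similar problems.

Applications:

Risk assessment of credit card applicants, forecasting business growth, predicting house prices, forecasting the weather, and so on.

How they work:

Regression: predicts a future trend from the historical data of an attribute. The algorithm assumes that some known family of functions can fit the target data, measures the error of each fit, and picks the function that matches the data best. Regression approximates the true value.
Classification: maps data onto predefined groups or classes. The classes are defined in terms of feature values, and each item is mapped to one of the given classes according to its features. There is no notion of approximation: exactly one class is correct. In machine learning, classification is a supervised learning method.

Difference:

A classification model predicts discrete values; a regression model predicts continuous values.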
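
The contrast between a continuous and a discrete prediction can be sketched with NumPy alone. The data points, the straight-line model, and the nearest-centroid rule below are assumptions chosen for brevity, not methods prescribed by the course.

```python
import numpy as np

# Regression: fit y = a*x + b to observed points by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])       # hypothetical data lying on y = 2x + 1
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * 5.0 + intercept       # continuous prediction for x = 5
print(round(float(predicted), 6))

# Classification: assign a sample to the class with the nearest centroid.
train_x = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
train_y = np.array([0, 0, 1, 1])          # labels known in advance (supervised)
centroids = np.array([train_x[train_y == k].mean(axis=0) for k in (0, 1)])

def classify(sample):
    """Return the label of the closest class centroid."""
    distances = np.linalg.norm(centroids - sample, axis=1)
    return int(np.argmin(distances))

print(classify(np.array([2.0, 1.0])), classify(np.array([7.0, 9.0])))
```

The regression output can be any real number, while classify() can only return one of the predefined labels 0 and 1.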

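Clustering

Clustering: the process of gathering similar things together and placing dissimilar things in different groups.
Cluster analysis: a statistical method for studying how samples or indicators should be grouped, and an important algorithm in data mining.

Applications:

Identifying specific diseases from symptoms, finding premium credit card customers, segmenting customers by browsing behavior for targeted marketing, and so on.

How it works:

With no classes given in advance, items are grouped by how similar they are.

The input to clustering is a set of unlabeled data, which is partitioned by the distance or similarity between sample features. The principle is to maximize similarity within a group and minimize similarity between groups.

Unlike classification, clustering has no training samples and models the data directly. The aim of cluster analysis is to group data on the basis of similarity. In machine learning, clustering is an unsupervised learning method.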
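
As an illustration of grouping unlabeled points by distance, here is a minimal k-means sketch in NumPy. The function name kmeans, the six sample points, and the choice of the first and fourth points as starting centers are all assumptions made for the example; a real project would normally use a library implementation.

```python
import numpy as np

def kmeans(points, init_centers, n_iter=10):
    """Very small k-means: alternate assignment and centroid update."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:          # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centers = kmeans(points, init_centers=points[[0, 3]])
print(labels)
print(centers)
```

No labels are supplied: the two groups emerge only from the distances between the points.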

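Time series models

In any field (finance, economics, ecology, neuroscience, physics, and others), time series data is an important form of structured data. Anything observed or measured at multiple points in time forms a time series. Most time series have a fixed frequency: data points appear at regular intervals according to some rule.

Applications:

What will sales or inventory be next quarter? How much electricity will be used tomorrow? What is passenger flow on Beijing Subway Line 13 today?

How it works:

A time series model describes and models patterns or trends that recur over time or along another ordered sequence. Like regression, it predicts future values from known data; the difference is that the observations are distinguished by the time at which they were taken. The focus is on how the data are related along the time dimension.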
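
One of the simplest ways to use relations along the time dimension is a moving average. The sketch below uses made-up monthly values and a window of 3; both are assumptions for illustration.

```python
import numpy as np

sales = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])  # hypothetical monthly values
window = 3

# Moving average over each run of `window` consecutive observations.
moving_avg = np.convolve(sales, np.ones(window) / window, mode="valid")

# Naive forecast for the next period: the mean of the last `window` values.
forecast = sales[-window:].mean()
print(moving_avg)
print(forecast)
```

The forecast depends only on the order of the observations in time, which is what separates this from an ordinary regression on unordered samples.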

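Common application scenarios for data analysis:

Marketing

Marketing response modeling (logistic regression, decision trees)
Net lift (uplift) modeling (association rules)
Customer retention modeling (Kaplan-Meier analysis, neural networks)
Market basket analysis (association analysis, Apriori)
Recommender systems (collaborative filtering, content-based, demographic-based, knowledge-based, hybrid, association rules)
Customer segmentation (clustering)
Churn prediction (logistic regression)

Risk management

Customer credit risk scoring (SVM, decision trees, neural networks)
Market risk scoring (logistic regression and decision trees)
Operational risk scoring (SVM)
Fraud detection (decision trees, clustering, social network analysis)

....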


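NumPy (Numerical Python)

NumPy: the foundational library for scientific computing in Python. It focuses on numerical computation and is mainly used to work with multidimensional arrays (matrices). It stores and processes large matrices far more efficiently than Python's nested lists. It is implemented largely in C and is a very basic extension; most of Python's other scientific packages are built on it.

The foundation package for high-performance scientific computing and data analysis
ndarray, a multidimensional array (matrix) with vectorized operations that is fast and memory-efficient
Matrix operations without loops, similar to vectorized operations in Matlab
Linear algebra and random number generation
import numpy as np

SciPy

SciPy: a toolkit for scientific computing in Python, built on NumPy and designed for science and engineering. It covers statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ordinary differential equation solvers, sparse matrices, and more. It is used more in mathematics and engineering departments and is less central to everyday data processing, so this course only mentions it.

Adds many common mathematical, scientific, and engineering functions on top of NumPy
Linear algebra, ordinary differential equation solvers, signal processing, image processing
For general data processing, NumPy is enough
import scipy as sp

Further reading:

Introduction to Python, NumPy, and SciPy: http://cs231n.github.io/python-numpy-tutorial

NumPy and SciPy quickstart: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html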

ndarray: the N-dimensional array

A NumPy array is a multidimensional array object called ndarray. It supports vectorized arithmetic and flexible broadcasting, runs fast, and uses memory economically.

Note: ndarray indices start at 0, and all elements of an array must have the same type.

ndarray attributes

ndim: number of dimensions
shape: size of each dimension
dtype: element data type

Creating an ndarray from random values

The numpy.random module generates random sample data.

Sample code:

# Import numpy under the alias np
import numpy as np

# 3x4 array of random floats; rand() always draws from [0.0, 1.0)
arr = np.random.rand(3, 4)
print(arr)
print(type(arr))

# 3x4 array of random integers; randint() draws from [-1, 5), upper bound excluded
arr = np.random.randint(-1, 5, size = (3, 4)) # 'size=' may be omitted
print(arr)
print(type(arr))

# 3x4 array of random floats; uniform() draws from [-1, 5)
arr = np.random.uniform(-1, 5, size = (3, 4)) # 'size=' may be omitted
print(arr)
print(type(arr))

print('Number of dimensions: ', arr.ndim)
print('Shape: ', arr.shape)
print('Data type: ', arr.dtype)

Output:

[[ 0.09371338 0.06273976 0.22748452 0.49557778]
 [ 0.30840042 0.35659161 0.54995724 0.018144  ]
 [ 0.94551493 0.70916088 0.58877255 0.90435672]]
<class 'numpy.ndarray'>

[[ 1 3 0 1]
 [ 1 4 4 3]
 [ 2 0 -1 -1]]
<class 'numpy.ndarray'>

[[ 2.25275308 1.67484038 -0.03161878 -0.44635706]
 [ 1.35459097 1.66294159 2.47419548 -0.51144655]
 [ 1.43987571 4.71505054 4.33634358 2.48202309]]
<class 'numpy.ndarray'>

Number of dimensions:  2
Shape:  (3, 4)
Data type:  float64

Creating an ndarray from a sequence

1. np.array(collection)

collection is a sequence (a list) or a nested sequence (a list of lists).

Sample code:

# Convert a list to an ndarray
lis = list(range(10))
arr = np.array(lis)

print(arr)      # the ndarray
print(arr.ndim)    # number of dimensions
print(arr.shape)  # shape

# Convert a list of lists to an ndarray
lis_lis = [list(range(10)), list(range(10))]
arr = np.array(lis_lis)

print(arr)      # the ndarray
print(arr.ndim)    # number of dimensions
print(arr.shape)  # shape

Output:

# list converted to an ndarray
[0 1 2 3 4 5 6 7 8 9]
1
(10,)

# list of lists converted to an ndarray
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]
2
(2, 10)

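2. np.zeros()

An array of the given shape filled with 0. Note: the first argument is a tuple giving the shape, such as (3, 4).

3. np.ones()

An array of the given shape filled with 1. Note: the first argument is a tuple giving the shape, such as (3, 4).

4. np.empty()

Allocates an array without initializing it. The contents are whatever was already in memory, so they are not guaranteed to be zeros.

Sample code (2, 3, 4):

# np.zeros
zeros_arr = np.zeros((3, 4))

# np.ones
ones_arr = np.ones((2, 3))

# np.empty
empty_arr = np.empty((3, 3))

# np.empty with an explicit data type
empty_int_arr = np.empty((3, 3), int)

print('------zeros_arr-------')
print(zeros_arr)

print('\n------ones_arr-------')
print(ones_arr)

print('\n------empty_arr-------')
print(empty_arr)

print('\n------empty_int_arr-------')
print(empty_int_arr)

Output (the empty arrays happen to contain zeros in this run):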

------zeros_arr-------
[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]

------ones_arr-------
[[ 1. 1. 1.]
 [ 1. 1. 1.]]

------empty_arr-------
[[ 0. 0. 0.]
 [ 0. 0. 0.]
 [ 0. 0. 0.]]

------empty_int_arr-------
[[0 0 0]
 [0 0 0]
 [0 0 0]]

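5. np.arange() and reshape()

arange() works like Python's range() and creates a one-dimensional ndarray.

reshape() returns the same data arranged in a new shape.

Sample code (5):

# np.arange()
arr = np.arange(15) # one-dimensional array of 15 elements
print(arr)
print(arr.reshape(3, 5)) # two-dimensional array, 3x5
print(arr.reshape(1, 3, 5)) # three-dimensional array, 1x3x5

Output: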

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]

[[ 0 1 2 3 4]
 [ 5 6 7 8 9]
 [10 11 12 13 14]]

[[[ 0 1 2 3 4]
  [ 5 6 7 8 9]
  [10 11 12 13 14]]]

6. np.arange() and random.shuffle()

random.shuffle() shuffles the array in place, like shuffling a deck of cards.

Sample code (6):

arr = np.arange(15)
print(arr)

np.random.shuffle(arr)
print(arr)
print(arr.reshape(3,5))

Output:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]

[ 5 8 1 7 4 0 12 9 11 2 13 14 10 3 6]

[[ 5 8 1 7 4]
 [ 0 12 9 11 2]
 [13 14 10 3 6]]

ndarray data types

1. The dtype parameter

Sets the data type of the array, written as type name plus bit width, such as float64 or int32.

2. The astype method

Returns a copy of the array converted to another data type.

Sample code (1, 2):

# Create a 3x4 array of float64 zeros
zeros_float_arr = np.zeros((3, 4), dtype=np.float64)
print(zeros_float_arr)
print(zeros_float_arr.dtype)

# astype converts the existing array to int32
zeros_int_arr = zeros_float_arr.astype(np.int32)
print(zeros_int_arr)
print(zeros_int_arr.dtype)

Output:

[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]
float64

[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
int32

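ndarray arithmetic

Array is a programming concept; matrix and vector are mathematical concepts.

In a program, both matrices and vectors can be represented as arrays.

1. Vectorized operations: arithmetic between arrays of the same shape is applied element by element

Sample code (1):

# Array with array
arr = np.array([[1, 2, 3],
         [4, 5, 6]])

print("Element-wise product:")
print(arr * arr)

print("Element-wise sum:")
print(arr + arr)

Output:

Element-wise product:
[[ 1 4 9]
 [16 25 36]]

Element-wise sum:
[[ 2 4 6]
 [ 8 10 12]]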

2. Array with scalar: broadcasting applies the scalar to every element

Sample code (2):

# Array with scalar
print(1. / arr)
print(2. * arr)

Output:

[[ 1.     0.5     0.33333333]
 [ 0.25    0.2     0.16666667]]

[[ 2.  4.  6.]
 [ 8. 10. 12.]]
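
Broadcasting is not limited to scalars. As a small additional sketch (the row and col arrays are made up for the example), NumPy also stretches a one-dimensional row or a single column across a two-dimensional array when the shapes are compatible:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
row = np.array([10, 20, 30])       # shape (3,): stretched across both rows
col = np.array([[100], [200]])     # shape (2, 1): stretched across all columns

row_sum = arr + row
col_sum = arr + col
print(row_sum)
print(col_sum)
```

Shapes are compared from the last dimension backwards; each pair of sizes must be equal or one of them must be 1.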

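Indexing and slicing an ndarray

1. Indexing and slicing a one-dimensional array

Works much like indexing a Python list.

Sample code (1):

# One-dimensional array
arr1 = np.arange(10)
print(arr1)
print(arr1[2:5])

Output: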

[0 1 2 3 4 5 6 7 8 9]
[2 3 4]

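2. Indexing and slicing a multidimensional array:

arr[r1:r2, c1:c2]

arr[1,1] is equivalent to arr[1][1]

[:] selects everything along a dimension

Sample code (2):

# Multidimensional array
arr2 = np.arange(12).reshape(3,4)
print(arr2)

print(arr2[1])

print(arr2[0:2, 2:])

print(arr2[:, 1:3])

Output: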

[[ 0 1 2 3]
 [ 4 5 6 7]
 [ 8 9 10 11]]

[4 5 6 7]

[[2 3]
 [6 7]]

[[ 1 2]
 [ 5 6]
 [ 9 10]]

3. Conditional (boolean) indexing

Index with a boolean array: arr[condition]. The condition can combine several tests.

Note: combine conditions with & and |, not with Python's and and or, and wrap each condition in parentheses.

Sample code (3):

# Conditional indexing

# Select the entries of data_arr for 2005 and later
data_arr = np.random.rand(3,3)
print(data_arr)

year_arr = np.array([[2000, 2001, 2000],
           [2005, 2002, 2009],
           [2001, 2003, 2010]])

is_year_after_2005 = year_arr >= 2005
print(is_year_after_2005, is_year_after_2005.dtype)

filtered_arr = data_arr[is_year_after_2005]
print(filtered_arr)

#filtered_arr = data_arr[year_arr >= 2005]
#print(filtered_arr)

# Several conditions: 2005 or earlier, and an even year
filtered_arr = data_arr[(year_arr <= 2005) & (year_arr % 2 == 0)]
print(filtered_arr)

Output:

[[ 0.53514038 0.93893429 0.1087513 ]
 [ 0.32076215 0.39820313 0.89765765]
 [ 0.6572177  0.71284822 0.15108756]]

[[False False False]
 [ True False True]
 [False False True]] bool

[ 0.32076215 0.89765765 0.15108756]

#[ 0.32076215  0.89765765  0.15108756]

[ 0.53514038 0.1087513  0.39820313]

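Transposing an ndarray

A two-dimensional array is transposed directly with transpose().

For a higher-dimensional array, pass the new order of the axis numbers (0, 1, 2, ...) as a tuple.

Sample code:

arr = np.random.rand(2,3)  # 2x3 array
print(arr)  
print(arr.transpose()) # becomes a 3x2 array


arr3d = np.random.rand(2,3,4) # 2x3x4 array: size 2 is axis 0, size 3 is axis 1, size 4 is axis 2
print(arr3d)
print(arr3d.transpose((1,0,2))) # reorder the axes to get a 3x2x4 array

Output:

# Two-dimensional array
# Before: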
[[ 0.50020075 0.88897914 0.18656499]
 [ 0.32765696 0.94564495 0.16549632]]

# After:
[[ 0.50020075 0.32765696]
 [ 0.88897914 0.94564495]
 [ 0.18656499 0.16549632]]


# Higher-dimensional array
# Before:
[[[ 0.91281153 0.61213743 0.16214062 0.73380458]
  [ 0.45539155 0.04232412 0.82857746 0.35097793]
  [ 0.70418988 0.78075814 0.70963972 0.63774692]]

 [[ 0.17772347 0.64875514 0.48422954 0.86919646]
  [ 0.92771033 0.51518773 0.82679073 0.18469917]
  [ 0.37260457 0.49041953 0.96221477 0.16300198]]]

# After:
[[[ 0.91281153 0.61213743 0.16214062 0.73380458]
  [ 0.17772347 0.64875514 0.48422954 0.86919646]]

 [[ 0.45539155 0.04232412 0.82857746 0.35097793]
  [ 0.92771033 0.51518773 0.82679073 0.18469917]]

 [[ 0.70418988 0.78075814 0.70963972 0.63774692]
  [ 0.37260457 0.49041953 0.96221477 0.16300198]]]

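Element-wise functions

ceil(): the nearest integer at or above each value; takes a number or an array
floor(): the nearest integer at or below each value; takes a number or an array
rint(): rounds to the nearest integer, with exact halves going to the nearest even integer; takes a number or an array
isnan(): tests whether each element is NaN (Not a Number); takes a number or an array
multiply(): element-wise multiplication; takes numbers or arrays
divide(): element-wise division; takes numbers or arrays
abs(): absolute value of each element; takes a number or an array
where(condition, x, y): element-wise ternary, x if condition else y

Sample code:

# randn() returns samples from the standard normal distribution.
arr = np.random.randn(2,3)

print(arr)

print(np.ceil(arr))

print(np.floor(arr))

print(np.rint(arr))

print(np.isnan(arr))

print(np.multiply(arr, arr))

print(np.divide(arr, arr))

print(np.where(arr > 0, 1, -1))

Output: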

# print(arr)
[[-0.75803752 0.0314314  1.15323032]
 [ 1.17567832 0.43641395 0.26288021]]

# print(np.ceil(arr))
[[-0. 1. 2.]
 [ 2. 1. 1.]]

# print(np.floor(arr))
[[-1. 0. 1.]
 [ 1. 0. 0.]]

# print(np.rint(arr))
[[-1. 0. 1.]
 [ 1. 0. 0.]]

# print(np.isnan(arr))
[[False False False]
 [False False False]]

# print(np.multiply(arr, arr))
# (these values come from a different random draw than the arr printed above)
[[ 5.16284053e+00  1.77170104e+00  3.04027254e-02]
 [ 5.11465231e-03  3.46109263e+00  1.37512421e-02]]

# print(np.divide(arr, arr))
[[ 1. 1. 1.]
 [ 1. 1. 1.]]

# print(np.where(arr > 0, 1, -1))
[[-1 1 1]
 [ 1 1 1]]
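
Two details of these functions are easy to miss with random input, so here is a short deterministic sketch (the halves array is made up for the example): np.rint sends exact halves to the nearest even integer, and np.where can pick between array values rather than constants.

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, -0.5])

rounded = np.rint(halves)                      # ties go to the nearest even integer
magnitudes = np.abs(np.array([-1.5, 2.0]))     # absolute values
kept = np.where(halves > 1, halves, 0.0)       # keep values above 1, else 0.0

print(rounded)
print(magnitudes)
print(kept)
```

So 0.5 and 2.5 round to 0 and 2 rather than 1 and 3, which differs from the schoolbook round-half-up rule.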

Statistical functions

np.mean(), np.sum(): mean and sum of all elements
np.max(), np.min(): maximum and minimum of all elements
np.std(), np.var(): standard deviation and variance of all elements
np.argmax(), np.argmin(): index of the maximum and of the minimum
np.cumsum(), np.cumprod(): by default return a one-dimensional array in which each element is the running sum, or running product, of all elements up to that position
For a multidimensional array these functions cover all elements by default. The axis parameter restricts them to one axis: axis=0 gives one result per column, axis=1 gives one result per row.

Sample code:

arr = np.arange(12).reshape(3,4)
print(arr)

print(np.cumsum(arr)) # one-dimensional array of running sums

print(np.sum(arr)) # sum of all elements

print(np.sum(arr, axis=0)) # sum of each column

print(np.sum(arr, axis=1)) # sum of each row

Output:

# print(arr)
[[ 0 1 2 3]
 [ 4 5 6 7]
 [ 8 9 10 11]]

# print(np.cumsum(arr)) 
[ 0 1 3 6 10 15 21 28 36 45 55 66]

# print(np.sum(arr)) # sum of all elements
66

# print(np.sum(arr, axis=0)) # axis=0: sum of each column
[12 15 18 21]

# print(np.sum(arr, axis=1)) # axis=1: sum of each row
[ 6 22 38]
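
The axis parameter works the same way for the other functions in the list. A short sketch on the same 3x4 array (variable names are chosen for the example):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

col_means = np.mean(arr, axis=0)       # mean of each column
row_max = np.max(arr, axis=1)          # maximum of each row
flat_argmax = np.argmax(arr)           # index into the flattened array
row_argmax = np.argmax(arr, axis=1)    # column index of the maximum in each row
row_cumsum = np.cumsum(arr, axis=1)    # running total within each row

print(col_means, row_max, flat_argmax, row_argmax)
print(row_cumsum)
```

Without axis, argmax returns a position in the flattened array; np.unravel_index converts it back to row and column if needed.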

Truth-test functions

np.any(): returns True if at least one element satisfies the condition
np.all(): returns True if every element satisfies the condition

Sample code:

arr = np.random.randn(2,3)
print(arr)

print(np.any(arr > 0))
print(np.all(arr > 0))

Output:

[[ 0.05075769 -1.31919688 -1.80636984]
 [-1.29317016 -1.3336612 -0.19316432]]

True
False

Deduplication and sorting

np.unique(): returns the unique values in sorted order, similar to a Python set

Sample code:

arr = np.array([[1, 2, 1], [2, 3, 4]])
print(arr)

print(np.unique(arr))

Output:

[[1 2 1]
 [2 3 4]]

[1 2 3 4]

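Case study: polling data for the 2016 US presidential election

Dataset: https://www.kaggle.com/fivethirtyeight/2016-election-polls
The dataset contains polling data for the 2016 US election collected between November 2015 and November 2016, in 27 columns.

Sample code 1:

# loadtxt
import numpy as np

# csv: a comma-separated values file
filename = './presidential_polls.csv'

# Read the local csv file with loadtxt()
data_array = np.loadtxt(filename,   # file name
            delimiter=',', # separator
            dtype=str,   # read every field as a string
            usecols=(0,2,3)) # column numbers to read

# Print the ndarray; the first row holds the column names
print(data_array, data_array.shape)

Output: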

[["b'cycle'" "b'type'" "b'matchup'"]
 ["b'2016'" 'b\'"polls-plus"\'' 'b\'"Clinton vs. Trump vs. Johnson"\'']
 ["b'2016'" 'b\'"polls-plus"\'' 'b\'"Clinton vs. Trump vs. Johnson"\'']
 ..., 
 ["b'2016'" 'b\'"polls-only"\'' 'b\'"Clinton vs. Trump vs. Johnson"\'']
 ["b'2016'" 'b\'"polls-only"\'' 'b\'"Clinton vs. Trump vs. Johnson"\'']
 ["b'2016'" 'b\'"polls-only"\'' 'b\'"Clinton vs. Trump vs. Johnson"\'']] (10237, 3)

Sample code 2:

import numpy as np

filename = './presidential_polls.csv'

# Read the column names from the first line
with open(filename, 'r') as f:
  col_names_str = f.readline()[:-1] # [:-1] drops the trailing newline '\n'

# Split the string into a list of names
col_name_lst = col_names_str.split(',')

# Columns used: end date, raw Clinton poll, raw Trump poll, adjusted Clinton poll, adjusted Trump poll
use_col_name_lst = ['enddate', 'rawpoll_clinton', 'rawpoll_trump','adjpoll_clinton', 'adjpoll_trump']

# Look up the index of each column name
use_col_index_lst = [col_name_lst.index(use_col_name) for use_col_name in use_col_name_lst]

# Read the local csv file with genfromtxt()
data_array = np.genfromtxt(filename,   # file name
            delimiter=',', # separator
            #skip_header=1,  # would skip the first line (the column names)
            dtype=str,   # read every field as a string
            usecols=use_col_index_lst)# column indices to read


# genfromtxt() has no skiprows parameter; its equivalent is skip_header.
# Without it, the first row holds the column names:
# ['enddate' 'rawpoll_clinton' 'rawpoll_trump' 'adjpoll_clinton' 'adjpoll_trump']

# Drop the first row
data_array = data_array[1:]

# Print the ndarray
print(data_array, data_array.shape)

Output:

[['10/30/2016' '45' '46' '43.29659' '44.72984']
 ['10/30/2016' '48' '42' '46.29779' '40.72604']
 ['10/24/2016' '48' '45' '46.35931' '45.30585']
 ..., 
 ['9/22/2016' '46.54' '40.04' '45.9713' '39.97518']
 ['6/21/2016' '43' '43' '45.2939' '46.66175']
 ['8/18/2016' '32.54' '43.61' '31.62721' '44.65947']] (10236, 5)
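
As an alternative to slicing off the header row afterwards, genfromtxt() can skip it with skip_header. The sketch below runs without the real file: the three data rows in csv_text are a hypothetical sample with only three of the 27 columns, copied in shape from the output above.

```python
import io
import numpy as np

# Hypothetical three-row sample; the real file has 27 columns.
csv_text = (
    "enddate,rawpoll_clinton,rawpoll_trump\n"
    "10/30/2016,45,46\n"
    "10/24/2016,48,45\n"
    "9/22/2016,46.54,40.04\n"
)

# skip_header=1 drops the column-name row; usecols picks the numeric columns.
polls = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      skip_header=1, usecols=(1, 2))
print(polls.shape)
print(polls.mean(axis=0))
```

Reading only the numeric columns lets genfromtxt() return floats directly, so statistics such as the mean can be computed without converting strings first.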


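What is Pandas

The name Pandas comes from panel data and from Python data analysis.

Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy and provides high-level data structures and data manipulation tools, and it is one of the main reasons Python is a powerful and efficient environment for data analysis.

A toolkit for analyzing and manipulating large structured datasets
Built on NumPy, which provides high-performance array operations
Offers many functions and methods for handling data quickly and conveniently
Used in data mining and data analysis
Provides data cleaning features

http://pandas.pydata.org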

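Pandas data structures

import pandas as pd

Pandas has two principal data structures: Series and DataFrame.

Series

A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of index labels.

A one-dimensional array-like object
Made up of data and an index
- The index is on the left and the values are on the right
- The index is created automatically when none is given

1. Building a Series from a list

ser_obj = pd.Series(range(10))

Sample code:

# Build a Series from a sequence
ser_obj = pd.Series(range(10, 20))
print(ser_obj.head(3))

print(ser_obj)

print(type(ser_obj))

Output: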

0  10
1  11
2  12
dtype: int64

0  10
1  11
2  12
3  13
4  14
5  15
6  16
7  17
8  18
9  19
dtype: int64

<class 'pandas.core.series.Series'>

2. Getting the values and the index

ser_obj.index and ser_obj.values

Sample code:

# Get the values
print(ser_obj.values)

# Get the index
print(ser_obj.index)

Output:

[10 11 12 13 14 15 16 17 18 19]
RangeIndex(start=0, stop=10, step=1)

3. Getting a value by index

ser_obj[idx]

Sample code:

# Get values by index label
print(ser_obj[0])
print(ser_obj[8])

Output:

10
18

4. Operations keep each value attached to its index label

Sample code:

# The index-to-value pairing is preserved in the result
print(ser_obj * 2)
print(ser_obj > 15)

Output:

0  20
1  22
2  24
3  26
4  28
5  30
6  32
7  34
8  36
9  38
dtype: int64

0  False
1  False
2  False
3  False
4  False
5  False
6   True
7   True
8   True
9   True
dtype: bool

5. Building a Series from a dict

Sample code:

# Build a Series from a dict; the keys become the index
year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)
print(ser_obj2.head())
print(ser_obj2.index)

Output:

2001  17.8
2002  20.1
2003  16.5
dtype: float64
Int64Index([2001, 2002, 2003], dtype='int64')

The name attribute

Name of the object: ser_obj.name

Name of the index: ser_obj.index.name

Sample code:

# name attributes
ser_obj2.name = 'temp'
ser_obj2.index.name = 'year'
print(ser_obj2.head())

Output:

year
2001  17.8
2002  20.1
2003  16.5
Name: temp, dtype: float64

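DataFrame

A DataFrame is a tabular data structure with an ordered set of columns, each of which can hold a different type. It has both a row index and a column index and can be thought of as a dict of Series that share one index. The data is stored as a two-dimensional structure.

Similar to a two-dimensional array or a table (such as an Excel sheet or R's data.frame)
Each column can have a different type
Has both a column index and a row index

1. Building a DataFrame from an ndarray

Sample code:

import numpy as np

# Build a DataFrame from an ndarray
array = np.random.randn(5,4)
print(array)

df_obj = pd.DataFrame(array)
print(df_obj.head())

Output: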

[[ 0.83500594 -1.49290138 -0.53120106 -0.11313932]
 [ 0.64629762 -0.36779941 0.08011084 0.60080495]
 [-1.23458522 0.33409674 -0.58778195 -0.73610573]
 [-1.47651414 0.99400187 0.21001995 -0.90515656]
 [ 0.56669419 1.38238348 -0.49099007 1.94484598]]

     0     1     2     3
0 0.835006 -1.492901 -0.531201 -0.113139
1 0.646298 -0.367799 0.080111 0.600805
2 -1.234585 0.334097 -0.587782 -0.736106
3 -1.476514 0.994002 0.210020 -0.905157
4 0.566694 1.382383 -0.490990 1.944846

2. Building a DataFrame from a dict

Sample code:

# Build a DataFrame from a dict; scalars are repeated to fill the column
dict_data = {'A': 1, 
       'B': pd.Timestamp('20170426'),
       'C': pd.Series(1, index=list(range(4)),dtype='float32'),
       'D': np.array([3] * 4,dtype='int32'),
       'E': ["Python","Java","C++","C"],
       'F': 'ITCast' }
# print(dict_data)
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2)

Output:

  A     B  C D    E    F
0 1 2017-04-26 1.0 3 Python ITCast
1 1 2017-04-26 1.0 3  Java ITCast
2 1 2017-04-26 1.0 3   C++ ITCast
3 1 2017-04-26 1.0 3    C ITCast

3. Getting a column by its label (the result is a Series)

df_obj[col_idx] or df_obj.col_idx

Sample code:

# Get a column by label
print(df_obj2['A'])
print(type(df_obj2['A']))

print(df_obj2.A)

Output:

0  1
1  1
2  1
3  1
Name: A, dtype: int64
<class 'pandas.core.series.Series'>
0  1
1  1
2  1
3  1
Name: A, dtype: int64

4. Adding a column

df_obj[new_col_idx] = data

Works like adding a key-value pair to a Python dict.

Sample code:

# Add a column
df_obj2['G'] = df_obj2['D'] + 4
print(df_obj2.head())

Output:

  A     B  C D    E    F G
0 1 2017-04-26 1.0 3 Python ITCast 7
1 1 2017-04-26 1.0 3  Java ITCast 7
2 1 2017-04-26 1.0 3   C++ ITCast 7
3 1 2017-04-26 1.0 3    C ITCast 7

5. Deleting a column

del df_obj[col_idx]

Sample code:

# Delete a column
del df_obj2['G']
print(df_obj2.head())

Output:

  A     B  C D    E    F
0 1 2017-04-26 1.0 3 Python ITCast
1 1 2017-04-26 1.0 3  Java ITCast
2 1 2017-04-26 1.0 3   C++ ITCast
3 1 2017-04-26 1.0 3    C ITCast

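Pandas index operations

The Index object

1. The index of a Series and of a DataFrame is an Index object

Sample code:

print(type(ser_obj.index))
print(type(df_obj2.index))

print(df_obj2.index)

Output (class paths and index types vary between pandas versions):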

<class 'pandas.indexes.range.RangeIndex'>
<class 'pandas.indexes.numeric.Int64Index'>
Int64Index([0, 1, 2, 3], dtype='int64')

2. Index objects are immutable, which protects the data

Sample code:

# An Index cannot be modified in place
df_obj2.index[0] = 2

Output:

---------------------------------------------------------------------------
TypeError                 Traceback (most recent call last)
<ipython-input-23-7f40a356d7d1> in <module>()
   1 # 索引对象不可变
----> 2 df_obj2.index[0] = 2

/Users/Power/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in __setitem__(self, key, value)
  1402 
  1403   def __setitem__(self, key, value):
-> 1404     raise TypeError("Index does not support mutable operations")
  1405 
  1406   def __getitem__(self, key):

TypeError: Index does not support mutable operations
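
Immutability only rules out changing an Index element in place. A short sketch with a made-up three-row DataFrame shows the usual alternatives: assign a whole new index, or use rename() to get a relabelled copy.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

try:
    df.index[0] = 10               # in-place modification is rejected
    modified = True
except TypeError as exc:
    modified = False
    print("TypeError:", exc)

df.index = ["a", "b", "c"]         # assigning a whole new index is allowed
renamed = df.rename(index={"a": "first"})
print(renamed.index.tolist())
```

rename() returns a new DataFrame and leaves df itself unchanged.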

Common kinds of Index

Index: the general index
Int64Index: integer index
MultiIndex: hierarchical index
DatetimeIndex: timestamp index

Series indexing

1. index sets the row labels

Sample code:

ser_obj = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())

Output:

a  0
b  1
c  2
d  3
e  4
dtype: int64

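2. Row indexing

ser_obj['label'], ser_obj[pos]

Recent pandas releases deprecate positional access through [] on a labelled index; ser_obj.iloc[pos] is the explicit form.

Sample code:

# Row indexing
print(ser_obj['b'])
print(ser_obj[2])

Output: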

1
2

3. Slice indexing

ser_obj[2:4], ser_obj['label1':'label3']

Note: a slice by label includes the end label.

Sample code:

# Slice indexing
print(ser_obj[1:3])
print(ser_obj['b':'d'])

Output:

b  1
c  2
dtype: int64
b  1
c  2
d  3
dtype: int64

4. Non-contiguous indexing

ser_obj[['label1', 'label2', 'label3']]

Sample code:

# Non-contiguous indexing
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])

Output:

a  0
c  2
e  4
dtype: int64
a  0
e  4
dtype: int64

5. Boolean indexing

Sample code:

# Boolean indexing
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])

print(ser_obj[ser_obj > 2])

Output:

a  False
b  False
c  False
d   True
e   True
dtype: bool
d  3
e  4
dtype: int64
d  3
e  4
dtype: int64

DataFrame indexing

1. columns sets the column labels

Sample code:

import numpy as np

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj.head())

Output:

     a     b     c     d
0 -0.241678 0.621589 0.843546 -0.383105
1 -0.526918 -0.485325 1.124420 -0.653144
2 -1.074163 0.939324 -0.309822 -0.209149
3 -0.716816 1.844654 -2.123637 -1.323484
4 0.368212 -0.910324 0.064703 0.486016

2. Column indexing

df_obj['label'] returns a Series; df_obj[['label']] returns a DataFrame

Sample code:

# Column indexing
print(df_obj['a']) # returns a Series
print(type(df_obj[['a']])) # a list of labels returns a DataFrame

Output:

0  -0.241678
1  -0.526918
2  -1.074163
3  -0.716816
4  0.368212
Name: a, dtype: float64
<class 'pandas.core.frame.DataFrame'>

3. Non-contiguous indexing

df_obj[['label1', 'label2']]

Sample code:

# Non-contiguous indexing
print(df_obj[['a','c']])
print(df_obj.iloc[:, [1, 3]]) # by position: the second and fourth columns

Output:

     a     c
0 -0.241678 0.843546
1 -0.526918 1.124420
2 -1.074163 -0.309822
3 -0.716816 -2.123637
4 0.368212 0.064703
     b     d
0 0.621589 -0.383105
1 -0.485325 -0.653144
2 0.939324 -0.209149
3 1.844654 -1.323484
4 -0.910324 0.486016

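Advanced indexing: label, position, and mixed

Pandas has three advanced indexers.

1. loc: label-based indexing

Slicing a DataFrame with [] only selects rows; to slice rows and columns together, use loc.

loc indexes by label, that is, by the index names we defined.

Sample code:

# Label indexing with loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])

# DataFrame
print(df_obj['a'])

# The first argument selects rows, the second selects columns
print(df_obj.loc[0:2, 'a'])

Output: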

b  1
c  2
d  3
dtype: int64
b  1
c  2
d  3
dtype: int64

0  -0.241678
1  -0.526918
2  -1.074163
3  -0.716816
4  0.368212
Name: a, dtype: float64
0  -0.241678
1  -0.526918
2  -1.074163
Name: a, dtype: float64

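2. iloc: position-based indexing

Works like loc, but indexes by integer position.

Sample code:

# Integer position indexing with iloc
# Series
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])

# DataFrame
print(df_obj.iloc[0:2, 0]) # compare with df_obj.loc[0:2, 'a'], which includes row 2

Output: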

b  1
c  2
dtype: int64
b  1
c  2
dtype: int64

0  -0.241678
1  -0.526918
Name: a, dtype: float64

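3. ix: mixed label and position indexing

ix combines the two indexers above and accepts both integer positions and custom labels, depending on the situation.

If the index contains both numbers and strings, ix is not recommended because lookups become ambiguous. For this reason ix was deprecated in pandas 0.20 and removed in pandas 1.0; the sample below runs only on older versions.

Sample code:

# Mixed indexing with ix
# Series
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])

# DataFrame
print(df_obj.loc[0:2, 'a'])
print(df_obj.ix[0:2, 0])

Output: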

b  1
c  2
dtype: int64
b  1
c  2
dtype: int64

0  -0.241678
1  -0.526918
2  -1.074163
Name: a, dtype: float64
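
On pandas versions without ix, the same selections can be written with loc and iloc. The sketch below uses a small made-up DataFrame; get_loc() is one way to turn a column label into a position when rows are chosen by position and columns by label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["a", "b", "c", "d"])

by_label = df.loc[0:2, "a"]        # label slice: rows 0, 1 and 2 (end included)
by_position = df.iloc[0:2, 0]      # position slice: rows 0 and 1 (end excluded)

# Mixed case: positional rows with a labelled column.
mixed = df.iloc[0:2, df.columns.get_loc("a")]
print(len(by_label), len(by_position), mixed.tolist())
```

Keeping labels and positions in separate indexers avoids the ambiguity that made ix hard to reason about.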

Notes

DataFrame indexing can be thought of as ndarray indexing.

A slice by label includes the end position.

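Alignment in Pandas arithmetic

Alignment is an important part of data cleaning. Operations are carried out between entries with matching index labels; positions without a match are filled with NaN, and the NaN values can be filled in afterwards.

Series alignment

1. A Series aligns on its row index

Sample code:

s1 = pd.Series(range(10, 20), index = range(10))
s2 = pd.Series(range(20, 25), index = range(5))

print('s1: ' )
print(s1)

print('') 

print('s2: ')
print(s2)

Output: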

s1: 
0  10
1  11
2  12
3  13
4  14
5  15
6  16
7  17
8  18
9  19
dtype: int64

s2: 
0  20
1  21
2  22
3  23
4  24
dtype: int64

2. Arithmetic on aligned Series

Sample code:

# Series alignment: labels 5 to 9 exist only in s1, so the result there is NaN
s1 + s2

Output:

0  30.0
1  32.0
2  34.0
3  36.0
4  38.0
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
dtype: float64

DataFrame alignment

1. A DataFrame aligns on both its row index and its column index

Sample code:

df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])

print('df1: ')
print(df1)

print('') 
print('df2: ')
print(df2)

Output:

df1: 
   a  b
0 1.0 1.0
1 1.0 1.0

df2: 
   a  b  c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0

2. Arithmetic on aligned DataFrames

Sample code:

# DataFrame alignment
df1 + df2

Output:

   a  b  c
0 2.0 2.0 NaN
1 2.0 2.0 NaN
2 NaN NaN NaN

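Filling unaligned data during arithmetic

1. fill_value

When using add, sub, div, or mul,

fill_value sets a substitute value: wherever only one operand has a value, the missing operand is replaced by fill_value before the operation.

Sample code:

print(s1)
print(s2)
s1.add(s2, fill_value = -1)

print(df1)
print(df2)
df1.sub(df2, fill_value = 2.)

Output: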

# print(s1)
0  10
1  11
2  12
3  13
4  14
5  15
6  16
7  17
8  18
9  19
dtype: int64

# print(s2)
0  20
1  21
2  22
3  23
4  24
dtype: int64

# s1.add(s2, fill_value = -1)
0  30.0
1  32.0
2  34.0
3  36.0
4  38.0
5  14.0
6  15.0
7  16.0
8  17.0
9  18.0
dtype: float64


# print(df1)
   a  b
0 1.0 1.0
1 1.0 1.0

# print(df2)
   a  b  c
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0


# df1.sub(df2, fill_value = 2.)
   a  b  c
0 0.0 0.0 1.0
1 0.0 0.0 1.0
2 1.0 1.0 1.0

Applying functions in Pandas

apply and applymap

1. NumPy functions can be used directly

Sample code:

# NumPy ufunc
df = pd.DataFrame(np.random.randn(5,4) - 1)
print(df)

print(np.abs(df))

Output:

     0     1     2     3
0 -0.062413 0.844813 -1.853721 -1.980717
1 -0.539628 -1.975173 -0.856597 -2.612406
2 -1.277081 -1.088457 -0.152189 0.530325
3 -1.356578 -1.996441 0.368822 -2.211478
4 -0.562777 0.518648 -2.007223 0.059411

     0     1     2     3
0 0.062413 0.844813 1.853721 1.980717
1 0.539628 1.975173 0.856597 2.612406
2 1.277081 1.088457 0.152189 0.530325
3 1.356578 1.996441 0.368822 2.211478
4 0.562777 0.518648 2.007223 0.059411

2. apply runs a function over each column or each row

Sample code:

# Apply a function to each column
#f = lambda x : x.max()
print(df.apply(lambda x : x.max()))

Output:

0  -0.062413
1  0.844813
2  0.368822
3  0.530325
dtype: float64

Mind the axis: the default is axis=0, which passes each column to the function.

Sample code:

# axis=1 passes each row to the function
print(df.apply(lambda x : x.max(), axis=1))

Output:

0  0.844813
1  -0.539628
2  0.530325
3  0.368822
4  0.518648
dtype: float64

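3. applymap runs a function on every element

Recent pandas releases (2.1 and later) deprecate applymap in favor of DataFrame.map, which does the same thing.

Sample code:

# Apply a function to every element
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))

Output: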

    0   1   2   3
0 -0.06  0.84 -1.85 -1.98
1 -0.54 -1.98 -0.86 -2.61
2 -1.28 -1.09 -0.15  0.53
3 -1.36 -2.00  0.37 -2.21
4 -0.56  0.52 -2.01  0.06
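
Element-wise formatting can also be written in a way that does not depend on which pandas version is installed, by combining apply (over columns) with Series.map (over elements). The DataFrame and the helper two_decimals below are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.234, -5.678], "y": [0.5, 2.0]})

def two_decimals(value):
    """Format one number with two digits after the decimal point."""
    return "%.2f" % value

# Series.map works element by element; apply runs it over every column.
formatted = df.apply(lambda col: col.map(two_decimals))
print(formatted)
```

The result holds strings, not numbers, so this kind of formatting belongs at the presentation step, after all calculations are done.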

Sorting

1. Sorting by index

sort_index()

Sorting is ascending by default; pass ascending=False for descending order.

Sample code:

# Series
s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)

# Sort by index
s4.sort_index() # 0 0 1 3 3

Output:

0  10
3  11
1  12
3  13
0  14
dtype: int64

0  10
0  14
1  12
3  11
3  13
dtype: int64

With a DataFrame, mind the axis.

Sample code:

# DataFrame
df4 = pd.DataFrame(np.random.randn(3, 5), 
          index=np.random.randint(3, size=3),
          columns=np.random.randint(5, size=5))
print(df4)

# axis=1 sorts the column labels
df4_isort = df4.sort_index(axis=1, ascending=False)
print(df4_isort) # 4 2 1 1 0

Output:

     1     4     0     1     2
2 -0.416686 -0.161256 0.088802 -0.004294 1.164138
1 -0.671914 0.531256 0.303222 -0.509493 -0.342573
1 1.988321 -0.466987 2.787891 -1.105912 0.889082

     4     2     1     1     0
2 -0.161256 1.164138 -0.416686 -0.004294 0.088802
1 0.531256 -0.342573 -0.671914 -0.509493 0.303222
1 -0.466987 0.889082 1.988321 -1.105912 2.787891

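2. Sorting by value

sort_values(by='column name')

Sorts the rows by the values in the named column. The name must identify exactly one column; if several columns share the name, pandas raises an error.

Sample code:

# Sort by value
df4_vsort = df4.sort_values(by=0, ascending=False)
print(df4_vsort)

Output: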

     1     4     0     1     2
1 1.988321 -0.466987 2.787891 -1.105912 0.889082
1 -0.671914 0.531256 0.303222 -0.509493 -0.342573
2 -0.416686 -0.161256 0.088802 -0.004294 1.164138
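
sort_values also accepts a list of columns, with a separate direction for each. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"],
                   "score": [3, 1, 4, 2]})

# Sort by group ascending, then by score descending within each group.
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of ordered shows where each row came from.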

Handling missing data

Sample code:

df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
            [np.nan, 4., np.nan], [1., 2., 3.]])
print(df_data.head())

Output:

     0     1     2
0 -0.281885 -0.786572 0.487126
1 1.000000 2.000000    NaN
2    NaN 4.000000    NaN
3 1.000000 2.000000 3.000000

1. Detecting missing values: isnull()

Sample code:

# isnull
print(df_data.isnull())

Output:

    0   1   2
0 False False False
1 False False  True
2  True False  True
3 False False False

2. Dropping missing data: dropna()

Drops the rows (axis=0, the default) or columns (axis=1) that contain NaN.

Sample code:

# dropna
print(df_data.dropna())

print(df_data.dropna(axis=1))

Output:

     0     1     2
0 -0.281885 -0.786572 0.487126
3 1.000000 2.000000 3.000000

     1
0 -0.786572
1 2.000000
2 4.000000
3 2.000000

3. Filling missing data: fillna()

Sample code:

# fillna
print(df_data.fillna(-100.))

Output:

      0     1      2
0  -0.281885 -0.786572  0.487126
1  1.000000 2.000000 -100.000000
2 -100.000000 4.000000 -100.000000
3  1.000000 2.000000  3.000000
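
A constant such as -100 is rarely the best replacement. As a hedged sketch on a small made-up DataFrame, fillna() can take per-column values such as the column means, and dropna() can keep rows that still have enough data through its thresh parameter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [np.nan, 4.0, np.nan],
                   [1.0, 2.0, 3.0]])

filled = df.fillna(df.mean())          # each NaN becomes its column mean
at_least_two = df.dropna(thresh=2)     # keep rows with 2 or more non-NaN values
print(filled)
print(at_least_two)
```

Which strategy is appropriate depends on the data; filling with the mean keeps every row but understates the spread of the column.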

层级索引（hierarchical indexing）

下面创建一个Series，在输入索引Index时，输入了由两个子list组成的list，第一个子list是外层索引，第二个list是内层索引。

示例代码：

import pandas as pd
import numpy as np

ser_obj = pd.Series(np.random.randn(12),index=[
         ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
         [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
       ])
print(ser_obj)

Output:

a 0  0.099174
  1  -0.310414
  2  -0.558047
b 0  1.742445
  1  1.152924
  2  -0.725332
c 0  -0.150638
  1  0.251660
  2  0.063387
d 0  1.080605
  1  0.567547
  2  -0.154148
dtype: float64

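The MultiIndex object

Printing the type of this Series' index shows that it is a MultiIndex.
Printing the index itself shows two pieces of information, levels and labels. levels lists the labels available at each of the two levels; labels records, for every position, which of those labels is used, as an integer position into levels. Recent pandas versions call this second attribute codes.

Example:

print(type(ser_obj.index))
print(ser_obj.index)

Output: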

<class 'pandas.indexes.multi.MultiIndex'>
MultiIndex(levels=[['a', 'b', 'c', 'd'], [0, 1, 2]],
      labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
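
As a small sketch of the same structure, assuming a shorter made-up index, a MultiIndex can also be built from the product of the outer and inner labels, and its levels can be inspected by name. The names ser_demo, 'outer' and 'inner' are illustrative only.

```python
import numpy as np
import pandas as pd

# Build the index from every (outer, inner) combination
index = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]],
                                   names=['outer', 'inner'])
ser_demo = pd.Series(np.arange(6), index=index)

# Number of index levels
print(ser_demo.index.nlevels)
# Outer label used at each position
print(list(ser_demo.index.get_level_values('outer')))
# Distinct labels of the inner level
print(list(ser_demo.index.levels[1]))
```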

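Selecting subsets

Data is selected by index. Because there are now two index levels, selecting by the outer level only needs the outer label.
To select by the inner level, pass two elements inside the brackets: the first selects the outer labels and the second selects the inner labels.

1. Selecting on the outer level:

ser_obj['outer_label']

Example:

# Select on the outer level
print(ser_obj['c'])

Output: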

0  -1.362096
1  1.558091
2  -0.452313
dtype: float64

2. Selecting on the inner level:

ser_obj[:, 'inner_label']

Example:

# Select on the inner level
print(ser_obj[:, 2])

Output:

a  0.826662
b  0.015426
c  -0.452313
d  -0.051063
dtype: float64

Hierarchical indexes are commonly used for grouping operations, building pivot tables, and similar tasks.
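
The bracket selections above can also be written with .loc and .xs, which state the intent more explicitly. This is a minimal sketch on a made-up Series (ser_demo is an illustrative name, not the ser_obj from the notes).

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series(np.arange(6),
                     index=pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]]))

# Outer-level selection: all rows under outer label 'b'
outer = ser_demo.loc['b']

# Inner-level selection: cross-section where the inner label equals 2
inner = ser_demo.xs(2, level=1)

# A single element addressed by (outer, inner)
single = ser_demo.loc[('a', 1)]

print(outer)
print(inner)
print(single)
```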

Swapping the level order

1. swaplevel()

.swaplevel() swaps the inner and outer index levels.

Example:

print(ser_obj.swaplevel())

Output:

0 a  0.099174
1 a  -0.310414
2 a  -0.558047
0 b  1.742445
1 b  1.152924
2 b  -0.725332
0 c  -0.150638
1 c  0.251660
2 c  0.063387
0 d  1.080605
1 d  0.567547
2 d  -0.154148
dtype: float64

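Swapping and sorting levels

sort_index()

.sort_index() sorts by the outer level first and then by the inner level, in ascending order by default. Older pandas versions offered .sortlevel() for this; it has since been removed in favour of .sort_index().

Example:

# Swap the levels, then sort
print(ser_obj.swaplevel().sort_index())

Output: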

0 a  0.099174
  b  1.742445
  c  -0.150638
  d  1.080605
1 a  -0.310414
  b  1.152924
  c  0.251660
  d  0.567547
2 a  -0.558047
  b  -0.725332
  c  0.063387
  d  -0.154148
dtype: float64

Statistical computation and description in Pandas

Example:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj)

Output:

     a     b     c     d
0 1.469682 1.948965 1.373124 -0.564129
1 -1.466670 -0.494591 0.467787 -2.007771
2 1.368750 0.532142 0.487862 -1.130825
3 -0.758540 -0.479684 1.239135 1.073077
4 -0.007470 0.997034 2.669219 0.742070

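Common statistical computations

sum, mean, max, min, ...

axis=0 computes the statistic down each column; axis=1 computes it across each row.

skipna excludes missing values and defaults to True.

Example:

print(df_obj.sum())

print(df_obj.max())

print(df_obj.min(axis=1, skipna=False))

Output: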

a  0.605751
b  2.503866
c  6.237127
d  -1.887578
dtype: float64

a  1.469682
b  1.948965
c  2.669219
d  1.073077
dtype: float64

0  -0.564129
1  -2.007771
2  -1.130825
3  -0.758540
4  -0.007470
dtype: float64
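
Because df_obj above contains no missing values, the effect of skipna is not visible in that output. The sketch below, on a small made-up frame (df_demo is an illustrative name), shows how axis and skipna interact.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, 6.0]})

# axis=0 (default): one result per column, NaN skipped
col_sum = df_demo.sum()

# axis=1: one result per row, NaN skipped
row_sum = df_demo.sum(axis=1)

# skipna=False: any row containing NaN yields NaN
strict_row_sum = df_demo.sum(axis=1, skipna=False)

print(col_sum)
print(row_sum)
print(strict_row_sum)
```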

Common descriptive statistics

describe produces several summary statistics at once.

Example:

print(df_obj.describe())

Output:

       a     b     c     d
count 5.000000 5.000000 5.000000 5.000000
mean  0.180305 0.106488 0.244978 0.178046
std  0.641945 0.454340 1.064356 1.144416
min  -0.677175 -0.490278 -1.164928 -1.574556
25%  -0.064069 -0.182920 -0.464013 -0.089962
50%  0.231722 0.127846 0.355859 0.190482
75%  0.318854 0.463377 1.169750 0.983663
max  1.092195 0.614413 1.328220 1.380601

Common descriptive statistics methods:

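Grouping and aggregation in Pandas

Grouping (groupby)

Split a dataset into groups, then run statistical analysis on each group.
SQL can filter data and perform grouped aggregation.
pandas can use groupby to perform more complex grouped computations.
A grouped computation proceeds as split -> apply -> combine:
1. Split: the criterion by which rows are assigned to groups
2. Apply: the computation that is run on each group
3. Combine: merging the per-group results into one object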
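
To make the three steps concrete, the following sketch performs split, apply and combine by hand and compares the result with a single groupby call. The frame df_demo and its column names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                        'value': [1, 2, 3, 4]})

# Split: collect the values that belong to each key
pieces = {key: df_demo.loc[df_demo['key'] == key, 'value']
          for key in df_demo['key'].unique()}

# Apply: compute one number per piece
means = {key: piece.mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means).sort_index()

# groupby performs the same three steps in one call
grouped = df_demo.groupby('key')['value'].mean()

print(manual)
print(grouped)
```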

Example:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
           'a', 'b', 'a', 'a'],
      'key2' : ['one', 'one', 'two', 'three',
           'two', 'two', 'one', 'three'],
      'data1': np.random.randn(8),
      'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Output:

   data1   data2 key1  key2
0 0.974685 -0.672494  a  one
1 -0.214324 0.758372  b  one
2 1.508838 0.392787  a  two
3 0.522911 0.630814  b three
4 1.347359 -0.177858  a  two
5 -0.264616 1.017155  b  two
6 -0.624708 0.450885  a  one
7 -1.019229 -1.143825  a three

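I. GroupBy objects: DataFrameGroupBy, SeriesGroupBy

1. Grouping

groupby() performs the grouping. The GroupBy object has not computed anything yet; it only holds the intermediate grouping information.

Group by column name: obj.groupby('label')

Example:

# Group the DataFrame by key1
print(type(df_obj.groupby('key1')))

# Group the data1 column of the DataFrame by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))

Output: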

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

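2. Computing on groups

Run a computation such as mean() on a GroupBy object, grouped by one key or by several.

Non-numeric columns take no part in numeric computations. Recent pandas versions no longer drop them silently, so numeric_only=True is passed explicitly.

Example:

# Computation on groups
grouped1 = df_obj.groupby('key1')
print(grouped1.mean(numeric_only=True))

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())

Output: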

     data1   data2
key1          
a   0.437389 -0.230101
b   0.014657 0.802114
key1
a  0.437389
b  0.014657
Name: data1, dtype: float64

size() returns the number of elements in each group.

Example:

# size
print(grouped1.size())
print(grouped2.size())

Output:

key1
a  5
b  3
dtype: int64
key1
a  5
b  3
dtype: int64

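3. Grouping by a custom key

obj.groupby(self_def_key)

The custom key can be a list, or a list of several keys for multi-level grouping.

obj.groupby(['label1', 'label2']) -> result with a hierarchical index

Example:

# Group by a custom key given as a list
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# Group by a custom key given as a list of Series
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# Multi-level grouping by several columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# Multi-level grouping follows the order of the keys
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack converts the hierarchically indexed result into a DataFrame with a single-level row index
print(grouped3.mean().unstack())

Output: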

0  1
1  1
2  1
3  2
4  1
5  1
7  1
dtype: int64

key1 key2 
a   one   2
   three  1
   two   2
b   one   1
   three  1
   two   1
dtype: int64


key1 key2 
a   one   2
   three  1
   two   2
b   one   1
   three  1
   two   1
dtype: int64


        data1   data2
key2 key1          
one  a   0.174988 -0.110804
   b  -0.214324 0.758372
three a  -1.019229 -1.143825
   b   0.522911 0.630814
two  a   1.428099 0.107465
   b  -0.264616 1.017155

     data1        data2     
key1     a     b     a     b
key2                     
one  0.174988 -0.214324 -0.110804 0.758372
three -1.019229 0.522911 -1.143825 0.630814
two  1.428099 -0.264616 0.107465 1.017155

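II. GroupBy objects support iteration

Each iteration yields a tuple (group_name, group_data).

This is useful for running specific computations on the data of each group.

1. Single-level grouping

Example:

# Single-level grouping by key1
for group_name, group_data in grouped1:
  print(group_name)
  print(group_data)

Output: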

a
   data1   data2 key1  key2
0 0.974685 -0.672494  a  one
2 1.508838 0.392787  a  two
4 1.347359 -0.177858  a  two
6 -0.624708 0.450885  a  one
7 -1.019229 -1.143825  a three

b
   data1   data2 key1  key2
1 -0.214324 0.758372  b  one
3 0.522911 0.630814  b three
5 -0.264616 1.017155  b  two

2. Multi-level grouping

Example:

# Multi-level grouping by key1 and key2
for group_name, group_data in grouped2:
  print(group_name)
  print(group_data)

Output:

('a', 'one')
   data1   data2 key1 key2
0 0.974685 -0.672494  a one
6 -0.624708 0.450885  a one

('a', 'three')
   data1   data2 key1  key2
7 -1.019229 -1.143825  a three

('a', 'two')
   data1   data2 key1 key2
2 1.508838 0.392787  a two
4 1.347359 -0.177858  a two

('b', 'one')
   data1   data2 key1 key2
1 -0.214324 0.758372  b one

('b', 'three')
   data1   data2 key1  key2
3 0.522911 0.630814  b three

('b', 'two')
   data1   data2 key1 key2
5 -0.264616 1.017155  b two

III. GroupBy objects can be converted to a list or a dict

Example:

# Convert a GroupBy object to a list
print(list(grouped1))

# Convert a GroupBy object to a dict
print(dict(list(grouped1)))

Output:

[('a',    data1   data2 key1  key2
0 0.974685 -0.672494  a  one
2 1.508838 0.392787  a  two
4 1.347359 -0.177858  a  two
6 -0.624708 0.450885  a  one
7 -1.019229 -1.143825  a three), 
('b',    data1   data2 key1  key2
1 -0.214324 0.758372  b  one
3 0.522911 0.630814  b three
5 -0.264616 1.017155  b  two)]

{'a':    data1   data2 key1  key2
0 0.974685 -0.672494  a  one
2 1.508838 0.392787  a  two
4 1.347359 -0.177858  a  two
6 -0.624708 0.450885  a  one
7 -1.019229 -1.143825  a three, 
'b':    data1   data2 key1  key2
1 -0.214324 0.758372  b  one
3 0.522911 0.630814  b three
5 -0.264616 1.017155  b  two}

1. Grouping columns by data type

Example:

# Data type of each column
print(df_obj.dtypes)

# Group the columns by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

Output:

data1  float64
data2  float64
key1   object
key2   object
dtype: object

float64  2
object   2
dtype: int64

  float64 object
0 0.302191  a one
1 0.544048  b one
2 1.901626  a two
3 1.153725 b three
4 1.169501  a two
5 0.752539  b two
6 -0.173823  a one
7 -2.163054 a three

2. Other ways of grouping

Example:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
            columns=['a', 'b', 'c', 'd', 'e'],
            index=['A', 'B', 'C', 'D', 'E'])
# .ix has been removed from pandas; .iloc selects by integer position
df_obj2.iloc[1, 1:4] = np.nan
print(df_obj2)

Output:

  a  b  c  d e
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1

3. Grouping with a dict

Example:

# Group with a dict that maps column names to group names
mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # number of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())

Output:

C     1
Java   2
Python  2
dtype: int64

  C Java Python
A 1   2    2
B 0   1    1
C 1   2    2
D 1   2    2
E 1   2    2

   C Java Python
A 5.0 12.0   9.0
B NaN  1.0   4.0
C 4.0 11.0   5.0
D 7.0 12.0   4.0
E 8.0  7.0   7.0

4. Grouping with a function; the function receives each row label or column label

Example:

# Group with a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
            columns=['a', 'b', 'c', 'd', 'e'],
            index=['AA', 'BBB', 'CC', 'D', 'EE'])
#df_obj3

def group_key(idx):
  """
  idx is a row label or a column label
  """
  #return idx
  return len(idx)

print(df_obj3.groupby(group_key).size())

# The custom function above is equivalent to
#df_obj3.groupby(len).size()

Output:

1  1
2  3
3  1
dtype: int64

5. Grouping by index level

Example:

# Group by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                   ['A', 'A', 'B', 'C', 'B']], names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# Group by the language level
print(df_obj4.groupby(level='language', axis=1).sum())
# Group by the index level
print(df_obj4.groupby(level='index', axis=1).sum())

Output:

language Python Java Python Java Python
index     A  A   B  C   B
0       2  7   8  4   3
1       5  2   6  1   2
2       6  4   4  5   2
3       4  7   4  3   1
4       7  4   3  4   8

language Java Python
0      11   13
1      3   13
2      9   12
3      10    9
4      8   18

index  A  B C
0    9 11 4
1    7  8 1
2   10  6 5
3   11  5 3
4   11 11 4

Aggregation

Aggregation is any computation that reduces an array to a scalar, such as mean() or count().
It is commonly applied to data after grouping.

Example:

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
           'a', 'b', 'a', 'a'],
      'key2' : ['one', 'one', 'two', 'three',
           'two', 'two', 'one', 'three'],
      'data1': np.random.randint(1,10, 8),
      'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

Output:

  data1 data2 key1  key2
0   3   7  a  one
1   1   5  b  one
2   7   4  a  two
3   2   4  b three
4   6   4  a  two
5   9   9  b  two
6   3   5  a  one
7   8   4  a three

1. Built-in aggregation functions

sum(), mean(), max(), min(), count(), size(), describe()

Example:

print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
# mean is undefined for the string column key2, so restrict it to numeric columns
print(df_obj5.groupby('key1').mean(numeric_only=True))
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

Output:

   data1 data2
key1       
a    27   24
b    12   18

   data1 data2 key2
key1          
a     8   7 two
b     9   9 two

   data1 data2 key2
key1          
a     3   4 one
b     1   4 one

   data1 data2
key1       
a    5.4  4.8
b    4.0  6.0

key1
a  5
b  3
dtype: int64

   data1 data2 key2
key1          
a     5   5   5
b     3   3   3

        data1   data2
key1             
a  count 5.000000 5.000000
   mean  5.400000 4.800000
   std  2.302173 1.303840
   min  3.000000 4.000000
   25%  3.000000 4.000000
   50%  6.000000 4.000000
   75%  7.000000 5.000000
   max  8.000000 7.000000
b  count 3.000000 3.000000
   mean  4.000000 6.000000
   std  4.358899 2.645751
   min  1.000000 4.000000
   25%  1.500000 4.500000
   50%  2.000000 5.000000
   75%  5.500000 7.000000
   max  9.000000 9.000000

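2. Custom functions can be passed to the agg method

grouped.agg(func)

func receives the records that belong to each group key, one column of one group at a time.

Example:

# Custom aggregation function
def peak_range(df):
  """
  Return the range of the values (max - min)
  """
  # print(type(df))  # the argument is the data belonging to one group key
  return df.max() - df.min()

# Only the numeric columns are selected, because subtraction is undefined for the string column key2
print(df_obj5.groupby('key1')[['data1', 'data2']].agg(peak_range))
print(df_obj.groupby('key1')[['data1', 'data2']].agg(lambda df : df.max() - df.min()))

Output: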

   data1 data2
key1       
a     5   3
b     8   5

     data1   data2
key1          
a   2.528067 1.594711
b   0.787527 0.386341

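3. Applying several aggregation functions

Pass a list of functions to apply several aggregations at once.

Example:

# Apply several aggregation functions at once
# (only the numeric columns are selected, since mean and std are undefined for strings)
num_grouped = df_obj.groupby('key1')[['data1', 'data2']]
print(num_grouped.agg(['mean', 'std', 'count', peak_range])) # column names default to the function names

print(num_grouped.agg(['mean', 'std', 'count', ('range', peak_range)])) # a (name, function) tuple supplies a new column name

Output: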

     data1                data2              
     mean    std count peak_range   mean    std count peak_range
key1                                     
a   0.437389 1.174151   5  2.528067 -0.230101 0.686488   5  1.594711
b   0.014657 0.440878   3  0.787527 0.802114 0.196850   3  0.386341

     data1                data2             
     mean    std count   range   mean    std count   range
key1                                    
a   0.437389 1.174151   5 2.528067 -0.230101 0.686488   5 1.594711
b   0.014657 0.440878   3 0.787527 0.802114 0.196850   3 0.386341

4. Applying different aggregation functions to different columns, using a dict

Example:

# A different aggregation function for each column
dict_mapping = {'data1':'mean',
        'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1':['mean','max'],
        'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

Output:

     data1   data2
key1          
a   0.437389 -1.150505
b   0.014657 2.406341

     data1        data2
     mean    max    sum
key1               
a   0.437389 1.508838 -1.150505
b   0.014657 0.522911 2.406341

5. Common built-in aggregation functions

Group-wise operations on data

Example:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
           'a', 'b', 'a', 'a'],
      'key2' : ['one', 'one', 'two', 'three',
           'two', 'two', 'one', 'three'],
      'data1': np.random.randint(1, 10, 8),
      'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

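# Group by key1, sum data1 and data2, and add a prefix to the column names;
# the result is attached to the original table in the steps that follow
k1_sum = df_obj.groupby('key1')[['data1', 'data2']].sum().add_prefix('sum_')
print(k1_sum)

Output: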

  data1 data2 key1  key2
0   5   1  a  one
1   7   8  b  one
2   1   9  a  two
3   2   6  b three
4   9   8  a  two
5   8   3  b  two
6   3   5  a  one
7   8   3  a three

   sum_data1 sum_data2
key1           
a      26     26
b      17     17

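Aggregation changes the shape of the original data.

How can the original shape be kept?

1. merge

Join the aggregated result back onto the original rows with merge; this works but is comparatively cumbersome.

Example:

# Method 1: use merge
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)

Output: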

  data1 data2 key1  key2 sum_data1 sum_data2
0   5   1  a  one     26     26
2   1   9  a  two     26     26
4   9   8  a  two     26     26
6   3   5  a  one     26     26
7   8   3  a three     26     26
1   7   8  b  one     17     17
3   2   6  b three     17     17
5   8   3  b  two     17     17

2. transform

The result of transform has the same shape as the original data,

for example: grouped.transform(np.sum)

Example:

# Method 2: use transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)

Output:

  data1 data2 key1  key2 sum_data1 sum_data2      sum_key2
0   5   1  a  one    26    26 onetwotwoonethree
1   7   8  b  one    17    17    onethreetwo
2   1   9  a  two    26    26 onetwotwoonethree
3   2   6  b three    17    17    onethreetwo
4   9   8  a  two    26    26 onetwotwoonethree
5   8   3  b  two    17    17    onethreetwo
6   3   5  a  one    26    26 onetwotwoonethree
7   8   3  a three    26    26 onetwotwoonethree

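A custom function can be passed as well.

Example:

# Pass a custom function to transform
def diff_mean(s):
  """
  Return the difference between each value and the group mean
  """
  return s - s.mean()

# Restrict to the numeric columns, since the subtraction is undefined for strings
numeric_cols = ['data1', 'data2', 'sum_data1', 'sum_data2']
print(df_obj.groupby('key1')[numeric_cols].transform(diff_mean))

Output: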

   data1   data2 sum_data1 sum_data2
0 -0.200000 -4.200000     0     0
1 1.333333 2.333333     0     0
2 -4.200000 3.800000     0     0
3 -3.666667 0.333333     0     0
4 3.800000 2.800000     0     0
5 2.333333 -2.666667     0     0
6 -2.200000 -0.200000     0     0
7 2.800000 -2.200000     0     0
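
Because transform returns a result aligned with the original rows, it can be combined directly with the original columns. The sketch below, on made-up data (df_demo and the column name share are illustrative), computes each row's share of its group total.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                        'value': [1.0, 3.0, 2.0, 6.0]})

# Group total, repeated on every row of the group
group_total = df_demo.groupby('key')['value'].transform('sum')

# Share of each row within its own group
df_demo['share'] = df_demo['value'] / group_total

print(df_demo)
```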

groupby.apply(func)

func can also be called separately on each group; the results are then assembled with pd.concat (data concatenation).

Example:

import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 
                       'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
  """
  Return the top n rows of each group, ranked by column
  """
  return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))

Output:

         LeagueIndex  Age HoursPerWeek TotalHours    APM
LeagueIndex                              
1      2214      1 20.0     12.0    730.0 172.9530
      2246      1 27.0      8.0    250.0 141.6282
      1753      1 20.0     28.0    100.0 139.6362
2      3062      2 20.0      6.0    100.0 179.6250
      3229      2 16.0     24.0    110.0 156.7380
      1520      2 29.0      6.0    250.0 151.6470
3      1557      3 22.0      6.0    200.0 226.6554
      484       3 19.0     42.0    450.0 220.0692
      2883      3 16.0      8.0    800.0 208.9500
4      2688      4 26.0     24.0    990.0 249.0210
      1759      4 16.0      6.0    75.0 229.9122
      2637      4 23.0     24.0    650.0 227.2272
5      3277      5 18.0     16.0    950.0 372.6426
      93       5 17.0     36.0    720.0 335.4990
      202       5 37.0     14.0    800.0 327.7218
6      734       6 16.0     28.0    730.0 389.8314
      2746      6 16.0     28.0   4000.0 350.4114
      1810      6 21.0     14.0    730.0 323.2506
7      3127      7 23.0     42.0   2000.0 298.7952
      104       7 21.0     24.0   1000.0 286.4538
      1654      7 18.0     98.0    700.0 236.0316
8      3393      8  NaN      NaN     NaN 375.8664
      3373      8  NaN      NaN     NaN 364.8504
      3372      8  NaN      NaN     NaN 355.3518

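1. A hierarchical index is produced: the outer level is the group name and the inner level is the row index of df_data

Example:

# Extra arguments given to apply are passed on to the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))

Output: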

         LeagueIndex  Age HoursPerWeek TotalHours    APM
LeagueIndex                              
1      3146      1 40.0     12.0    150.0  38.5590
      3040      1 39.0     10.0    500.0  29.8764
2      920       2 43.0     10.0    730.0  86.0586
      2437      2 41.0      4.0    200.0  54.2166
3      1258      3 41.0     14.0    800.0  77.6472
      2972      3 40.0     10.0    500.0  60.5970
4      1696      4 44.0      6.0    500.0  89.5266
      1729      4 39.0      8.0    500.0  86.7246
5      202       5 37.0     14.0    800.0 327.7218
      2745      5 37.0     18.0   1000.0 123.4098
6      3069      6 31.0      8.0    800.0 133.1790
      2706      6 31.0      8.0    700.0  66.9918
7      2813      7 26.0     36.0   1300.0 188.5512
      1992      7 26.0     24.0   1000.0 219.6690
8      3340      8  NaN      NaN     NaN 189.7404
      3341      8  NaN      NaN     NaN 287.8128

2. Suppressing the hierarchical index: group_keys=False

Example:

print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

Output:

   LeagueIndex  Age HoursPerWeek TotalHours    APM
2214      1 20.0     12.0    730.0 172.9530
2246      1 27.0      8.0    250.0 141.6282
1753      1 20.0     28.0    100.0 139.6362
3062      2 20.0      6.0    100.0 179.6250
3229      2 16.0     24.0    110.0 156.7380
1520      2 29.0      6.0    250.0 151.6470
1557      3 22.0      6.0    200.0 226.6554
484       3 19.0     42.0    450.0 220.0692
2883      3 16.0      8.0    800.0 208.9500
2688      4 26.0     24.0    990.0 249.0210
1759      4 16.0      6.0    75.0 229.9122
2637      4 23.0     24.0    650.0 227.2272
3277      5 18.0     16.0    950.0 372.6426
93       5 17.0     36.0    720.0 335.4990
202       5 37.0     14.0    800.0 327.7218
734       6 16.0     28.0    730.0 389.8314
2746      6 16.0     28.0   4000.0 350.4114
1810      6 21.0     14.0    730.0 323.2506
3127      7 23.0     42.0   2000.0 298.7952
104       7 21.0     24.0   1000.0 286.4538
1654      7 18.0     98.0    700.0 236.0316
3393      8  NaN      NaN     NaN 375.8664
3373      8  NaN      NaN     NaN 364.8504
3372      8  NaN      NaN     NaN 355.3518

apply can also be used to fill missing data within each group, for example with the mean of that group.
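
A minimal sketch of that idea, on made-up data rather than the StarCraft dataset: each NaN is replaced by the mean of its own group, once with apply and once with transform. The names df_demo, filled_apply and filled_tf are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                        'value': [1.0, np.nan, 4.0, np.nan]})

# Variant 1: apply a fill function to the value column of each group
filled_apply = (df_demo.groupby('key', group_keys=False)['value']
                .apply(lambda s: s.fillna(s.mean()))
                .sort_index())

# Variant 2: broadcast the group mean with transform, then fill
group_mean = df_demo.groupby('key')['value'].transform('mean')
filled_tf = df_demo['value'].fillna(group_mean)

print(filled_apply)
print(filled_tf)
```

The transform variant avoids calling a Python function per group and is usually the simpler choice when the fill value is a plain statistic.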

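Data cleaning

Data cleaning is a key step in data analysis and directly affects all later processing.
Does the data need to be modified? What exactly needs changing? How should the data be adjusted so that it suits the analysis and mining that follow?
It is an iterative process; real projects may need to run these cleaning operations more than once.
Handling missing data: DataFrame.fillna(), DataFrame.dropna()

Joining data (pd.merge)

pd.merge
Joins the rows of different DataFrames on one or more keys.
Similar to a database join.

Example: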

import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
            'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
            'data2' : np.random.randint(0,10,3)})

print(df_obj1)
print(df_obj2)

Output:

  data1 key
0   8  b
1   8  b
2   3  a
3   5  c
4   4  a
5   9  a
6   6  b

  data2 key
0   9  a
1   0  b
2   3  d

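1. By default, the overlapping column names are used as the join key ("foreign key")

Example:

# By default the overlapping column names are used as the join key
print(pd.merge(df_obj1, df_obj2))

Output: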

  data1 key data2
0   8  b   0
1   8  b   0
2   6  b   0
3   3  a   9
4   4  a   9
5   9  a   9

2. on specifies the join key explicitly

Example:

# on specifies the join key explicitly
print(pd.merge(df_obj1, df_obj2, on='key'))

Output:

  data1 key data2
0   8  b   0
1   8  b   0
2   6  b   0
3   3  a   9
4   4  a   9
5   9  a   9

3. left_on gives the join key of the left data; right_on gives the join key of the right data

Example:

# left_on and right_on specify the join keys of the left and right data respectively

# Rename the columns
df_obj1 = df_obj1.rename(columns={'key':'key1'})
df_obj2 = df_obj2.rename(columns={'key':'key2'})

print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))

Output:

  data1 key1 data2 key2
0   8  b   0  b
1   8  b   0  b
2   6  b   0  b
3   3  a   9  a
4   4  a   9  a
5   9  a   9  a

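The default is an inner join (inner): the keys in the result are the intersection of both sides.

how specifies the join type.

4. Outer join (outer): the keys in the result are the union of both sides

Example:

# Outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))

Output: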

  data1 key1 data2 key2
0  8.0  b  0.0  b
1  8.0  b  0.0  b
2  6.0  b  0.0  b
3  3.0  a  9.0  a
4  4.0  a  9.0  a
5  9.0  a  9.0  a
6  5.0  c  NaN NaN
7  NaN NaN  3.0  d

5. Left join (left)

Example:

# Left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))

Output:

  data1 key1 data2 key2
0   8  b  0.0  b
1   8  b  0.0  b
2   3  a  9.0  a
3   5  c  NaN NaN
4   4  a  9.0  a
5   9  a  9.0  a
6   6  b  0.0  b

6. Right join (right)

Example:

# Right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))

Output:

  data1 key1 data2 key2
0  8.0  b   0  b
1  8.0  b   0  b
2  6.0  b   0  b
3  3.0  a   9  a
4  4.0  a   9  a
5  9.0  a   9  a
6  NaN NaN   3  d
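
To see which side each row of a join came from, pd.merge accepts indicator=True, which adds a _merge column. The sketch below uses small made-up frames (left and right are illustrative names) to compare the outer and inner results.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [10, 20, 30]})

# Outer join: union of keys; _merge records the origin of each row
merged = pd.merge(left, right, on='key', how='outer', indicator=True)
counts = merged['_merge'].value_counts()

# Inner join: intersection of keys only
inner = pd.merge(left, right, on='key')

print(merged)
print(counts)
print(inner)
```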

7. Handling duplicate column names

suffixes, default _x and _y

Example:

# Handle duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
            'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
            'data' : np.random.randint(0,10,3)})

print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))

Output:

  data_left key data_right
0     9  b      1
1     5  b      1
2     1  b      1
3     2  a      8
4     2  a      8
5     5  a      8

8. Joining on the index

left_index=True or right_index=True

Example:

# Join on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
            'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])

print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))

Output:

  data1 key data2
0   3  b   6
1   4  b   6
6   8  b   6
2   6  a   0
4   3  a   0
5   0  a   0

Concatenating data (pd.concat)

Concatenates several objects along an axis.

1. Concatenation in NumPy

np.concatenate

Example:

import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)

print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))

Output:

# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]

# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2]))
 [[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2], axis=1)) 
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]

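2. pd.concat

Mind the axis; the default is axis=0.
join specifies how the other axis is combined; the default is outer.
When concatenating Series, check whether the row indexes contain duplicates.

1) index without duplicates

Example:

# index without duplicates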
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))

Output:

# print(ser_obj1)
0  1
1  8
2  4
3  9
4  4
dtype: int64

# print(ser_obj2)
5  2
6  6
7  4
8  2
dtype: int64

# print(ser_obj3)
9   6
10  2
11  7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0   1
1   8
2   4
3   9
4   4
5   2
6   6
7   4
8   2
9   6
10  2
11  7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
   0  1  2
0  1.0 NaN NaN
1  5.0 NaN NaN
2  3.0 NaN NaN
3  2.0 NaN NaN
4  4.0 NaN NaN
5  NaN 9.0 NaN
6  NaN 8.0 NaN
7  NaN 3.0 NaN
8  NaN 6.0 NaN
9  NaN NaN 2.0
10 NaN NaN 3.0
11 NaN NaN 3.0

2) index with duplicates

Example:

# index with duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))

Output:

# print(ser_obj1)
0  0
1  3
2  7
3  2
4  5
dtype: int64

# print(ser_obj2)
0  5
1  1
2  9
3  9
dtype: int64

# print(ser_obj3)
0  8
1  7
2  9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0  0
1  3
2  7
3  2
4  5
0  5
1  1
2  9
3  9
0  8
1  7
2  9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')) 
# join='inner' keeps only the index labels present in every input, so rows or columns that would contain NaN are dropped
  0 1 2
0 0 5 8
1 3 1 7
2 7 9 9

3) When concatenating DataFrames, check both the row index and the column index for duplicates

Example:

df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
            columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
            columns=['C', 'D'])
print(df_obj1)
print(df_obj2)

print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))

Output:

# print(df_obj1)
  A B
a 3 3
b 5 4
c 8 6

# print(df_obj2)
  C D
a 1 9
b 6 8

# print(pd.concat([df_obj1, df_obj2]))
   A  B  C  D
a 3.0 3.0 NaN NaN
b 5.0 4.0 NaN NaN
c 8.0 6.0 NaN NaN
a NaN NaN 1.0 9.0
b NaN NaN 6.0 8.0

# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
  A B C D
a 3 3 1 9
b 5 4 6 8
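
When the inputs share index labels, two further options of pd.concat help keep the result unambiguous: ignore_index renumbers the rows, and keys records which input each row came from. A minimal sketch with made-up Series s1 and s2:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[0, 1])
s2 = pd.Series([3, 4], index=[0, 1])

# Discard the original labels and renumber from 0
renumbered = pd.concat([s1, s2], ignore_index=True)

# Keep the original labels, adding an outer level that names the source
labelled = pd.concat([s1, s2], keys=['first', 'second'])

print(renumbered)
print(labelled)
```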

Reshaping data

1. stack

Rotates the column index into the row index, producing a hierarchical index.
DataFrame -> Series

Example:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
print(df_obj)

stacked = df_obj.stack()
print(stacked)

Output:

# print(df_obj)
  data1 data2
0   7   9
1   7   8
2   8   9
3   4   1
4   1   2

# print(stacked)
0 data1  7
  data2  9
1 data1  7
  data2  8
2 data1  8
  data2  9
3 data1  4
  data2  1
4 data1  1
  data2  2
dtype: int64

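2. unstack

Expands a hierarchical index.
Series -> DataFrame
By default it operates on the innermost level, i.e. level=-1.

Example:

# Operates on the innermost level by default
print(stacked.unstack())

# Use level to choose which index level to unstack
print(stacked.unstack(level=0))

Output: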

# print(stacked.unstack())
  data1 data2
0   7   9
1   7   8
2   8   9
3   4   1
4   1   2

# print(stacked.unstack(level=0))
    0 1 2 3 4
data1 7 7 8 4 1
data2 9 8 9 1 2
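
stack and unstack are inverse operations when no values are missing. The following sketch checks the round trip on a small made-up frame (df_demo is an illustrative name).

```python
import pandas as pd

df_demo = pd.DataFrame({'data1': [7, 8], 'data2': [9, 1]})

# Columns become the inner level of the row index
stacked = df_demo.stack()

# Unstacking the inner level restores the original layout
restored = stacked.unstack()

print(stacked)
print(restored)
```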

Data transformation

I. Handling duplicate data

1. duplicated() returns a boolean Series indicating whether each row duplicates an earlier row

Example:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
            'data2' : np.random.randint(0, 4, 8)})
print(df_obj)

print(df_obj.duplicated())

Output:

# print(df_obj)
 data1 data2
0   a   3
1   a   2
2   a   3
3   a   3
4   b   1
5   b   0
6   b   3
7   b   0

# print(df_obj.duplicated())
0  False
1  False
2   True
3   True
4  False
5  False
6  False
7   True
dtype: bool

2. drop_duplicates() removes duplicate rows

All columns are compared by default.

A subset of columns can be specified for the comparison.

Example:

print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))

Output:

# print(df_obj.drop_duplicates())
 data1 data2
0   a   3
1   a   2
4   b   1
5   b   0
6   b   3

# print(df_obj.drop_duplicates('data2'))
 data1 data2
0   a   3
1   a   2
4   b   1
5   b   0
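
Both methods accept a keep parameter that controls which occurrence counts as the original. A short sketch on made-up data (df_demo is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({'data1': ['a', 'a', 'b', 'b'],
                        'data2': [3, 3, 1, 0]})

# Default: keep the first occurrence of each fully identical row
first = df_demo.drop_duplicates()

# Compare only data1 and keep the last occurrence of each value
last = df_demo.drop_duplicates(subset=['data1'], keep='last')

# keep=False flags every member of a duplicated set
mask = df_demo.duplicated(keep=False)

print(first)
print(last)
print(mask)
```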

3. Transforming values with the function passed to map

Series.map applies the given function to each element of the Series.

Example:

ser_obj = pd.Series(np.random.randint(0,10,10))
print(ser_obj)

print(ser_obj.map(lambda x : x ** 2))

Output:

# print(ser_obj)
0  1
1  4
2  8
3  6
4  8
5  6
6  6
7  4
8  7
9  3
dtype: int64

# print(ser_obj.map(lambda x : x ** 2))
0   1
1  16
2  64
3  36
4  64
5  36
6  36
7  16
8  49
9   9
dtype: int64

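II. Replacing values

replace substitutes values according to their content.

Example:

# Replace a single value with a single value
print(ser_obj.replace(1, -100))

# Replace several values with one value
print(ser_obj.replace([6, 8], -100))

# Replace several values with several values
print(ser_obj.replace([4, 7], [-100, -200]))

Output: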

# print(ser_obj.replace(1, -100))
0  -100
1   4
2   8
3   6
4   8
5   6
6   6
7   4
8   7
9   3
dtype: int64

# print(ser_obj.replace([6, 8], -100))
0   1
1   4
2  -100
3  -100
4  -100
5  -100
6  -100
7   4
8   7
9   3
dtype: int64

# print(ser_obj.replace([4, 7], [-100, -200]))
0   1
1  -100
2   8
3   6
4   8
5   6
6   6
7  -100
8  -200
9   3
dtype: int64

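Clustering model: K-Means

Clustering is a form of unsupervised learning.
The samples carry no class labels.
Online demo: http://syskall.com/kmeans.js

The K-Means algorithm

One of the ten classic data mining algorithms.
The algorithm takes a parameter k and partitions the samples into k clusters, so that samples within a cluster are highly similar and samples in different clusters are less similar.

Idea:

Cluster around k points in the sample space, assigning each sample to the center nearest to it. Update the cluster centers iteratively until they stop changing.

Algorithm:

1. Choose initial centers for the k clusters.
2. In iteration n, compute the distance from every sample to each of the k centers and assign the sample to the cluster whose center is nearest.
3. Update the center of each cluster, for example as the mean of its samples.
4. If none of the k centers changes after the updates of steps 2 and 3, the iteration ends; otherwise repeat from step 2.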
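
The four steps can be sketched in NumPy as follows. This is an illustrative implementation under simple assumptions (Euclidean distance, initial centers drawn at random from the samples, a fixed iteration cap), not the code used in the course; the function name kmeans and the test data are made up.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (n_samples x n_features) into k groups."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Two well-separated blobs as illustrative input
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

With data that is less cleanly separated, different seeds can give different results, which is the sensitivity to initial centers that K-Means is known for.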

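Strengths and weaknesses:

Strengths: fast and simple
Weaknesses: the result depends on the choice of initial centers, the algorithm can get stuck in a local optimum, and k must be given in advance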

Global food data analysis

Project reference: https://www.kaggle.com/bhouwens/d/openfoodfacts/world-food-facts/how-much-sugar-do-we-eat/discussion

# -*- coding: utf-8 -*-

# Handle zip archives
import zipfile
import os
import pandas as pd
import matplotlib.pyplot as plt


def unzip(zip_filepath, dest_path):
  """
  Extract a zip file
  """
  with zipfile.ZipFile(zip_filepath) as zf:
    zf.extractall(path=dest_path)


def get_dataset_filename(zip_filepath):
  """
  Get the name of the dataset file (the first entry in the zip)
  """
  with zipfile.ZipFile(zip_filepath) as zf:
    return zf.namelist()[0]


def run_main():
  """
  Main function
  """
  # Variables
  dataset_path = './data' # dataset directory
  zip_filename = 'open-food-facts.zip' # zip file name
  zip_filepath = os.path.join(dataset_path, zip_filename) # path of the zip file
  dataset_filename = get_dataset_filename(zip_filepath) # dataset file name (inside the zip)
  dataset_filepath = os.path.join(dataset_path, dataset_filename) # path of the dataset file

  print('Extracting zip...', end='')
  unzip(zip_filepath, dataset_path)
  print('done.')

  # Read the data
  data = pd.read_csv(dataset_filepath, usecols=['countries_en', 'additives_n'])

  # Analyse the number of food additives per country
  # 1. Data cleaning
  # Drop missing data
  data = data.dropna()  # or data.dropna(inplace=True)

  # Convert the country names to lower case
  # Exercise: on inspection, the values in 'countries_en' are not always a single country name;
  # some hold several names separated by commas, e.g. Albania,Belgium,France,Germany,Italy,Netherlands,Spain
  # A correct count would split such values into one row per country before grouping
  data['countries_en'] = data['countries_en'].str.lower()

  # 2. Group and aggregate
  country_additives = data['additives_n'].groupby(data['countries_en']).mean()

  # 3. Sort by value in descending order
  result = country_additives.sort_values(ascending=False)

  # 4. Plot the top 10 with pandas
  result.iloc[:10].plot.bar()
  plt.show()

  # 5. Save the result
  result.to_csv('./country_additives.csv')

  # Delete the extracted data to free space
  if os.path.exists(dataset_filepath):
    os.remove(dataset_filepath)

if __name__ == '__main__':
  run_main()
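
The exercise mentioned in the comments, splitting comma-separated country lists into one row per country, can be sketched as follows. This is one possible approach using str.split and DataFrame.explode on made-up rows, not the course's reference solution.

```python
import pandas as pd

# Illustrative rows; the real data comes from the Open Food Facts CSV
data = pd.DataFrame({
    'countries_en': ['France', 'Belgium,France', 'Spain'],
    'additives_n': [2.0, 4.0, 1.0],
})

# Lower-case the names and turn each value into a list of countries
data['countries_en'] = data['countries_en'].str.lower().str.split(',')

# One row per (product, country) pair
exploded = data.explode('countries_en')

# Mean number of additives per individual country
country_additives = exploded.groupby('countries_en')['additives_n'].mean()

print(exploded)
print(country_additives)
```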

Matplotlib

Seaborn

Interactive data visualization: Bokeh

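Matplotlib is a 2D plotting library for Python. With only a few lines of code, developers can produce line plots, histograms, power spectra, bar charts, error bar charts, scatter plots, and more.

http://matplotlib.org

A plotting library for producing publication-quality figures.
Its goal is to give Python a Matlab-style plotting interface.
import matplotlib.pyplot as plt
The pyplot module contains the commonly used matplotlib API functions.

figure

All Matplotlib plots live inside a figure object.
Create a figure: fig = plt.figure()

Example: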

# Import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# This magic command is needed in Jupyter Notebook; keep it on a line of its own
%matplotlib inline

# Create a figure object
fig = plt.figure()

Output:

<matplotlib.figure.Figure at 0x11a2dd7b8>

subplot

fig.add_subplot(a, b, c)

a, b split the figure into an a*b grid
c selects the cell to operate on
Note: cells are numbered from 1, not from 0
plt.plot draws on the most recently created subplot (this may not display correctly in Jupyter Notebook)

Example:

# Choose the positions in the grid
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

# Plot on a subplot
random_arr = np.random.randn(100)
# print(random_arr)

# By default the plot goes to the most recently created subplot, although Jupyter Notebook may display it incorrectly
plt.plot(random_arr)

# A specific subplot can be targeted by calling plot on its axes object
# ax1.plot(random_arr)
# ax2.plot(random_arr)
# ax3.plot(random_arr)

# Show the figure
plt.show()

Output:

Histogram: hist

Example code:

import matplotlib.pyplot as plt
import numpy as np

plt.hist(np.random.randn(100), bins=10, color='b', alpha=0.3)
plt.show()

Output:

Scatter plot: scatter

Example code:

import matplotlib.pyplot as plt
import numpy as np

# Draw a scatter plot
x = np.arange(50)
y = x + 5 * np.random.rand(50)
plt.scatter(x, y)
plt.show()

Output:

Bar chart: bar

Example code:

import matplotlib.pyplot as plt
import numpy as np

# Bar chart with two series side by side
x = np.arange(5)
y1, y2 = np.random.randint(1, 25, size=(2, 5))
width = 0.25
ax = plt.subplot(1,1,1)
ax.bar(x, y1, width, color='r')
ax.bar(x+width, y2, width, color='g')
# Bars are centered on their x positions, so each pair is centered at x + width/2
ax.set_xticks(x + width / 2)
ax.set_xticklabels(['a', 'b', 'c', 'd', 'e'])
plt.show()

Output:

Matrix plot: plt.imshow()

Useful for confusion matrices and for showing how three dimensions relate (row, column, and cell value)

Example code:

import matplotlib.pyplot as plt
import numpy as np

# Matrix plot
m = np.random.rand(10,10)
print(m)
plt.imshow(m, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()
plt.show()

Output:

[[ 0.92859942  0.84162134  0.37814667  0.46401549  0.93935737  0.0344159
   0.56358375  0.75977745  0.87983192  0.22818774]
 [ 0.88216959  0.43369207  0.1303902   0.98446182  0.59474031  0.04414217
   0.86534444  0.34919228  0.53950028  0.89165269]
 [ 0.52919761  0.87408715  0.097871    0.78348534  0.09354791  0.3186
   0.25978432  0.48340623  0.1107699   0.14065592]
 [ 0.90834516  0.42377475  0.73042695  0.51596826  0.14154431  0.22165693
   0.64705882  0.78062873  0.55036304  0.40874584]
 [ 0.98853697  0.46762114  0.69973423  0.7910757   0.63700306  0.68793919
   0.28685306  0.3473426   0.17011744  0.18812329]
 [ 0.73688943  0.58004874  0.03146167  0.08875797  0.32930191  0.87314734
   0.50757536  0.8667078   0.8423364   0.99079049]
 [ 0.37660356  0.63667774  0.78111565  0.25598593  0.38437628  0.95771051
   0.01922366  0.37020219  0.51020305  0.05365718]
 [ 0.87588452  0.56494761  0.67320078  0.46870376  0.66139913  0.55072149
   0.51328222  0.64817353  0.198525    0.18105368]
 [ 0.86038137  0.55914088  0.55240021  0.15260395  0.4681218   0.28863395
   0.6614597   0.69015592  0.46583629  0.15086562]
 [ 0.01373772  0.30514083  0.69804049  0.5014782   0.56855904  0.14889117
   0.87596848  0.29757133  0.76062891  0.03678431]]

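plt.subplots()

Returns a newly created figure together with an array of subplot objects
Create a 2x2 grid of subplots: fig, subplot_arr = plt.subplots(2,2)
This displays correctly in Jupyter and is the recommended way to create multiple plots

Example code:

import matplotlib.pyplot as plt
import numpy as np

fig, subplot_arr = plt.subplots(2,2)
# bins is the number of bars in the histogram, usually no more than the number of values
subplot_arr[1,0].hist(np.random.randn(100), bins=10, color='b', alpha=0.3)
plt.show()

Output: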

Colors, markers, and line styles

ax.plot(x, y, 'r--')

is equivalent to ax.plot(x, y, linestyle='--', color='r')

Example code:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2)
axes[0].plot(np.random.randint(0, 100, 50), 'ro--')
# Equivalent
axes[1].plot(np.random.randint(0, 100, 50), color='r', linestyle='dashed', marker='o')

Output:

[<matplotlib.lines.Line2D at 0x11a901e80>]

Common colors, markers, and line styles

Ticks, labels, and legends

Set the axis range
plt.xlim(), plt.ylim()
ax.set_xlim(), ax.set_ylim()

Set which ticks are shown
plt.xticks(), plt.yticks()
ax.set_xticks(), ax.set_yticks()

Set tick labels
ax.set_xticklabels(), ax.set_yticklabels()

Set axis labels
ax.set_xlabel(), ax.set_ylabel()

Set the title
ax.set_title()

Legend
ax.plot(label='legend')
ax.legend(), plt.legend()
loc='best': automatically choose the best position for the legend

Example code:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1)
ax.plot(np.random.randn(1000).cumsum(), label='line0')

# Set the axis range
#plt.xlim([0,500])
ax.set_xlim([0, 800])

# Set which ticks are shown
#plt.xticks([0,500])
ax.set_xticks(range(0,500,100))

# Set tick labels (fix three tick positions first so the three labels line up with them)
ax.set_yticks([-20, 0, 20])
ax.set_yticklabels(['Jan', 'Feb', 'Mar'])

# Set axis labels
ax.set_xlabel('Number')
ax.set_ylabel('Month')

# Set the title
ax.set_title('Example')

# Legend
ax.plot(np.random.randn(1000).cumsum(), label='line1')
ax.plot(np.random.randn(1000).cumsum(), label='line2')
ax.legend(loc='best')
#plt.legend()

Output: <matplotlib.legend.Legend at 0x11a4061d0>

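http://seaborn.pydata.org/index.html

Seaborn wraps matplotlib in a higher-level API that makes plotting easier. In most cases seaborn alone produces attractive figures, while matplotlib allows more customized ones. Seaborn should be seen as a complement to matplotlib, not a replacement.

A Python plotting library for producing attractive, informative statistical graphics
Built on top of Matplotlib; supports visualizing numpy and pandas data structures
Several built-in themes and color palettes
Visualizes univariate and bivariate data to compare the distributions of variables in a dataset
Visualizes the independent and dependent variables of linear regression models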

import numpy as np
import pandas as pd
# from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline

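Visualizing the distribution of a dataset

Univariate distribution: sns.distplot()

Example code:

# Univariate distribution
# Note: distplot is deprecated since seaborn 0.11 in favor of histplot, kdeplot, and displot
x1 = np.random.normal(size=1000)
sns.distplot(x1);

x2 = np.random.randint(0, 100, 500)
sns.distplot(x2);

Output:

Histogram: sns.distplot(kde=False)

Example code:

# Histogram
sns.distplot(x1, bins=20, kde=False, rug=True)

Output:

Kernel density estimate: sns.distplot(hist=False) or sns.kdeplot()

Example code:

# Kernel density estimate
sns.distplot(x2, hist=False, rug=True)

Output: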

Bivariate distribution

Example code:

# Bivariate distribution
df_obj1 = pd.DataFrame({"x": np.random.randn(500),
                   "y": np.random.randn(500)})

df_obj2 = pd.DataFrame({"x": np.random.randn(500),
                   "y": np.random.randint(0, 100, 500)})

Scatter plot: sns.jointplot()

Example code:

# Scatter plot
sns.jointplot(x="x", y="y", data=df_obj1)

Output:

2D histogram (hexbin): sns.jointplot(kind='hex')

Example code:

# 2D histogram
sns.jointplot(x="x", y="y", data=df_obj1, kind="hex");

Output:

Kernel density estimate: sns.jointplot(kind='kde')

Example code:

# Kernel density estimate
sns.jointplot(x="x", y="y", data=df_obj1, kind="kde");

Output:

Visualizing relationships between variables in a dataset: sns.pairplot()

Example code:

# Relationships between variables in a dataset
dataset = sns.load_dataset("tips")
#dataset = sns.load_dataset("iris")
sns.pairplot(dataset);

Output:

Visualizing categorical data

#titanic = sns.load_dataset('titanic')
#planets = sns.load_dataset('planets')
#flights = sns.load_dataset('flights')
#iris = sns.load_dataset('iris')
exercise = sns.load_dataset('exercise')

Categorical scatter plots

sns.stripplot(): data points may overlap

Example code:

sns.stripplot(x="diet", y="pulse", data=exercise)

Output:

sns.swarmplot(): data points are kept from overlapping; hue selects a subcategory

Example code:

sns.swarmplot(x="diet", y="pulse", data=exercise, hue='kind')

Output:

Distributions within categories

Box plot: sns.boxplot(); hue selects a subcategory

Example code:

# Box plot
sns.boxplot(x="diet", y="pulse", data=exercise)
#sns.boxplot(x="diet", y="pulse", data=exercise, hue='kind')

Output:

Violin plot: sns.violinplot(); hue selects a subcategory

Example code:

# Violin plot
#sns.violinplot(x="diet", y="pulse", data=exercise)
sns.violinplot(x="diet", y="pulse", data=exercise, hue='kind')

Output:

Statistical estimates within categories

Bar plot: sns.barplot()

Example code:

# Bar plot
sns.barplot(x="diet", y="pulse", data=exercise, hue='kind')

Output:

Point plot: sns.pointplot()

Example code:

# Point plot
sns.pointplot(x="diet", y="pulse", data=exercise, hue='kind');

Output:

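http://bokeh.pydata.org/en/latest

Bokeh

Bokeh is an interactive visualization library for Python that renders in web browsers. This is the core difference between Bokeh and other visualization libraries.

An interactive Python plotting library that targets web browsers
Produces clean, attractive interactive visualizations in the style of D3.js, but is easier to use than D3.js
Output as standalone HTML documents or server-side applications
Handles large, dynamic, or streaming data
Supports Python (as well as Scala, R, Julia…)
No need to write JavaScript

Bokeh interfaces

Charts: high-level interface for drawing complex statistical charts in a simple way
Plotting: mid-level interface for assembling graphical elements
Models: low-level interface that gives developers the most flexibility

Imports

from bokeh.io import output_notebook, output_file, show
# Note: bokeh.charts ships only with older Bokeh releases; it was later moved to the separate bkcharts package
from bokeh.charts import Scatter, Bar, BoxPlot, Chord
from bokeh.layouts import row
import seaborn as sns

# Load the data
exercise = sns.load_dataset('exercise')

output_notebook()
#output_file('test.html')

from bokeh.io import output_file: write the output to an .html document
from bokeh.io import output_notebook: display the output in Jupyter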

bokeh.charts

http://bokeh.pydata.org/en/latest/docs/reference/charts.html

Scatter plot: Scatter

Example code:

# Scatter plot
p = Scatter(data=exercise, x='id', y='pulse', title='exercise dataset')
show(p)

Output:

Bar chart: Bar

Example code:

# Bar chart
p = Bar(data=exercise, values='pulse', label='diet', stack='kind', title='exercise dataset')
show(p)

Output:

Box plot: BoxPlot

Example code:

# Box plot
box1 = BoxPlot(data=exercise, values='pulse', label='diet', color='diet', title='exercise dataset')
box2 = BoxPlot(data=exercise, values='pulse', label='diet', stack='kind', color='kind', title='exercise dataset')
show(row(box1, box2))

Output:

Chord diagram: Chord

Shows the connections between multiple nodes

The thickness of each link represents its weight

Example code:

# Chord diagram
chord1 = Chord(data=exercise, source="id", target="kind")
chord2 = Chord(data=exercise, source="id", target="kind", value="pulse")

show(row(chord1, chord2))

Output:

bokeh.plotting

Squares: square; circles: circle

Example code:

from bokeh.plotting import figure
import numpy as np

p = figure(plot_width=400, plot_height=400)
# Squares
p.square(np.random.randint(1,10,5), np.random.randint(1,10,5), size=20, color="navy")

# Circles
p.circle(np.random.randint(1,10,5), np.random.randint(1,10,5), size=10, color="green")
show(p)

Output:

More glyph types: http://bokeh.pydata.org/en/latest/docs/reference/plotting.html

Case Study: Visualizing the World's Highest Mountains

Reference: https://www.kaggle.com/alex64/d/abcsds/highest-mountains/let-s-climb

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')   # set the plot theme

# Make matplotlib display Chinese text correctly
plt.rcParams['font.sans-serif'] = ['SimHei'] # set the default font
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from showing as a box in saved figures

dataset_path = './dataset/Mountains.csv'


def preview_data(data):
  """
     Preview the data
   """
  # First rows
  print(data.head())

  # Column summary (info() prints its report itself)
  data.info()


def proc_success(val):
  """
     Clean the values in the 'Ascents bef. 2004' column
   """
  # Values containing '>' are treated as 200 and 'Many' as 160
  if '>' in str(val):
    return 200
  elif 'Many' in str(val):
    return 160
  else:
    return val


def run_main():
  """
     Main function
   """
  data = pd.read_csv(dataset_path)

  preview_data(data)

  # Restructure the data
  # Rename columns
  data.rename(columns={'Height (m)': 'Height', 'Ascents bef. 2004': 'Success',
             'Failed attempts bef. 2004': 'Failed'}, inplace=True)

  # Clean the data
  data['Failed'] = data['Failed'].fillna(0).astype(int)
  data['Success'] = data['Success'].apply(proc_success)
  data['Success'] = data['Success'].fillna(0).astype(int)
  # Copy the filtered rows so the assignment below does not trigger SettingWithCopyWarning
  data = data[data['First ascent'] != 'unclimbed'].copy()
  data['First ascent'] = data['First ascent'].astype(int)

  # Visualize the data
  # 1. First ascents by year

  plt.hist(data['First ascent'].astype(int), bins=20)
  plt.ylabel('高峰数量')
  plt.xlabel('年份')
  plt.title('登顶次数')
  plt.savefig('./first_ascent_vs_year.png')
  plt.show()

  # 2. Number of mountains by height
  data['Height'].plot.hist(color='steelblue', bins=20)
  plt.bar(data['Height'],
       (data['Height'] - data['Height'].min()) / (data['Height'].max() - data['Height'].min()) * 23,  # scaled proportionally
      color='red',
      width=30, alpha=0.2)
  plt.ylabel('高峰数量')
  plt.xlabel('海拔')
  plt.text(8750, 20, "海拔", color='red')
  plt.title('高峰vs海拔')
  plt.savefig('./mountain_vs_height.png')
  plt.show()

  # 3. First ascent year vs height and rank, colored by number of attempts
  data['Attempts'] = data['Failed'] + data['Success'] # total number of climbing attempts
  fig = plt.figure(figsize=(13, 7))
  fig.add_subplot(211)
  plt.scatter(data['First ascent'], data['Height'], c=data['Attempts'], alpha=0.8, s=50)
  plt.ylabel('海拔')
  plt.xlabel('登顶')

  fig.add_subplot(212)
  plt.scatter(data['First ascent'], data['Rank'].max() - data['Rank'], c=data['Attempts'], alpha=0.8, s=50)
  plt.ylabel('排名')
  plt.xlabel('登顶')
  plt.savefig('./mountain_vs_attempts.png')
  plt.show()

  # Exercise: try to reproduce the plots above with seaborn or bokeh

if __name__ == '__main__':
  run_main()


Python text analysis with NLTK

Sentiment analysis

Text similarity

Text classification

Classification model: Naive Bayes

Case study: Weibo sentiment analysis

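NLTK (Natural Language Toolkit)

NLTK is a well-known Python toolkit for natural language processing, aimed mainly at English text. It comes with documentation, corpora, and a companion book.

The most commonly used Python library in NLP
Open source
Built-in classification, tokenization, and other functions
Strong community support
Corpora: collections of language material that actually occurred in real use
http://www.nltk.org/py-modindex.html

The NLTK home page explains how to install NLTK on Mac, Linux, and Windows: http://nltk.org/install.html . Installing Anaconda first is recommended, since it saves installing most packages by hand. Once NLTK is installed, run import nltk to check that it works; if there are no errors, you still need to download the corpora that NLTK provides.

Installation steps:

Install the NLTK package: pip install nltk
Start Python and enter the following commands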
```
 import nltk
 nltk.download()
```
A downloader window opens; installing every package (the "all" collection) is recommended
Try it out:

Corpora

nltk.corpus

import nltk
from nltk.corpus import brown # requires the brown corpus to be downloaded
# Use the Brown University corpus

# List the categories in the corpus
print(brown.categories())

# Inspect the size of the brown corpus
print('Number of sentences: {}'.format(len(brown.sents())))
print('Number of words: {}'.format(len(brown.words())))

Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

Number of sentences: 57340
Number of words: 1161192

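Tokenization

Split a sentence into linguistically meaningful words
Differences between Chinese and English tokenization:
- In English, words are naturally separated by spaces
- Chinese has no formal delimiter, so segmentation is much harder than in English
Chinese segmentation tools include jieba: pip install jieba
Once the text is segmented, later processing is much the same for Chinese and English

# Import jieba
import jieba

seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=False)
print("Accurate mode: " + "/ ".join(seg_list))  # accurate mode

Output:

Full mode: 欢迎/ 迎来/ 来到/ 黑马/ 程序/ 程序员/ Python/ 学科
Accurate mode: 欢迎/ 来到/ 黑马/ 程序员/ Python/ 学科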

The word-form problem

look, looked, looking
Different forms of the same word reduce the accuracy of learning from a corpus
Solution: word-form normalization

1. Stemming

Example:

# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

# Output:
# look
# look

Example:

# SnowballStemmer
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))

# Output:
# look
# look

Example:

# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))

# Output:
# look
# look

2. Lemmatization

Stemming: strips endings such as -ing and -ed, keeping only the stem of the word
Lemmatization: maps the different forms of a word to a single form, e.g. am, is, are -> be; went -> go

Stemmers in NLTK
PorterStemmer, SnowballStemmer, LancasterStemmer

Lemmatizers in NLTK
WordNetLemmatizer

A caveat
went (verb) -> go; Went (noun, a proper name) -> Went
Specifying the part of speech makes lemmatization more accurate

Example:

from nltk.stem import WordNetLemmatizer 
# requires the wordnet corpus to be downloaded

wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))

# Output:
# cat
# box
# are
# went

Example:

# Specifying the part of speech makes lemmatization more accurate
# lemmatize treats words as nouns by default
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))

# Output:
# be
# go

3. Part-of-speech (POS) tagging

POS tagging in NLTK

nltk.pos_tag(), applied to the tokens returned by nltk.word_tokenize()

Example:

import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words)) # requires averaged_perceptron_tagger to be downloaded

# Output:
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]

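4. Removing stop words

To save storage space and improve search efficiency, NLP systems automatically filter out certain characters or words
Stop word lists are compiled by hand rather than generated automatically

Categories
Function words of the language, e.g. the, is…
Lexical words that are used very widely, e.g. want

Chinese stop word lists
中文停用词库 (Chinese stop word list)
哈工大停用词表 (Harbin Institute of Technology stop word list)
四川大学机器智能实验室停用词库 (Sichuan University Machine Intelligence Laboratory stop word list)
百度停用词列表 (Baidu stop word list)

Stop word lists for other languages
http://www.ranks.nl/stopwords

Removing stop words with NLTK
stopwords.words()

Example:

from nltk.corpus import stopwords # requires the stopwords corpus to be downloaded

# `words` is the token list from the POS tagging example above
# Build the set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in english_stopwords]
print('Original words:', words)
print('After removing stop words:', filtered_words)

# Output:
# Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
# After removing stop words: ['Python', 'widely', 'used', 'programming', 'language', '.']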

5. A typical text preprocessing pipeline

Example:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# Tokenize
raw_words = nltk.word_tokenize(raw_text)

# Normalize word forms
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stop words
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in english_stopwords]

print('Raw text:', raw_text)
print('Preprocessed result:', filtered_words)

Output:

Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessed result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']

Usage example:

import nltk
from nltk.tokenize import WordPunctTokenizer

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 
paragraph = "The first time I heard that song was in Hawaii on radio.  I was just a kid, and loved it very much! What a fantastic song!" 

# Split into sentences
sentences = sent_tokenizer.tokenize(paragraph) 
print(sentences)

sentence = "Are you old enough to remember Michael Jackson attending. the Grammys with Brooke Shields and Webster sat on his lap during the show" 

# Split into words
words = WordPunctTokenizer().tokenize(sentence.lower()) 
print(words)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

['are', 'you', 'old', 'enough', 'to', 'remember', 'michael', 'jackson', 'attending', '.', 'the', 'grammys', 'with', 'brooke', 'shields', 'and', 'webster', 'sat', 'on', 'his', 'lap', 'during', 'the', 'show']

jieba word segmentation

jieba is an open-source Chinese word segmentation library written in Python and used in industry. Its GitHub repository is https://github.com/fxsjy/jieba . Install it with: pip install jieba

A simple example:

import jieba as jb

seg_list = jb.cut("我来到北京清华大学", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jb.cut("我来到北京清华大学", cut_all=False)
print("Accurate mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jb.cut("他来到了网易杭研大厦")  
print("Default mode: " + "/ ".join(seg_list)) # the default is accurate mode

seg_list = jb.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  
print("Search engine mode: " + "/ ".join(seg_list)) # search engine mode

Output:

Full mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Accurate mode: 我/ 来到/ 北京/ 清华大学
Default mode: 他/ 来到/ 了/ 网易/ 杭研/ 大厦
Search engine mode: 小明/ 硕士/ 毕业/ 于/ 中国/ 科学/ 学院/ 科学院/ 中国科学院/ 计算/ 计算所/ ，/ 后/ 在/ 日本/ 京都/ 大学/ 日本京都大学/ 深造

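How jieba segments text

jieba has algorithms for both words that are in its dictionary and words that are not. The overall approach is simple:

Load the dictionary dict.txt

Build a DAG (directed acyclic graph) for the sentence from the in-memory dictionary

For words not in the dictionary, try to segment them with an HMM model using the Viterbi algorithm

Once both in-dictionary and out-of-dictionary words have been handled, use dynamic programming (DP) to find the maximum-probability path through the DAG and output it as the segmentation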
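
To make the DAG and dynamic programming steps concrete, here is a minimal sketch on a tiny made-up dictionary. The FREQ table and its counts are hypothetical, the function names are chosen for illustration, and the HMM step for unknown words is left out, so this is not jieba's actual implementation.

```python
import math

# Hypothetical toy dictionary: word -> frequency (jieba's dict.txt is far larger).
FREQ = {"我": 5, "来到": 3, "北京": 4, "清华": 2, "华大": 1, "大学": 4, "清华大学": 3}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index i, list every end index j where sentence[i:j+1] is a dictionary word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # an unknown character stands alone
    return dag

def best_route(sentence, dag):
    """Right-to-left DP: route[i] = (best log-probability from i to the end, end index of the first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route

def cut(sentence):
    """Follow the best route from left to right and collect the words."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

print("/ ".join(cut("我来到北京清华大学")))
```

At position 5 the DAG offers both 清华 and 清华大学; the DP keeps the longer word because one word with frequency 3 is more probable than 清华 followed by 大学.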

Example:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from collections import Counter

import jieba
import requests
from bs4 import BeautifulSoup

def extract_text(url):
    # Send the request and get the response body
    page_source = requests.get(url).content
    bs_source = BeautifulSoup(page_source, "lxml")

    # Find all p tags
    report_text = bs_source.find_all('p')

    text = ''
    # Collect the content of every p tag into one string
    for p in report_text:
        text += p.get_text()
        text += '\n'

    return text

def word_frequency(text):
    # Keep only the segmented words of length 2 or more
    words = [word for word in jieba.cut(text, cut_all=True) if len(word) >= 2]

    # Counter is a simple tally that counts how often each word occurs;
    # the word list becomes a dict-like mapping of word -> count
    c = Counter(words)

    for word_freq in c.most_common(10):
        word, freq = word_freq
        print(word, freq)

if __name__ == "__main__":
    url = 'http://www.gov.cn/premier/2017-03/16/content_5177940.htm'
    text = extract_text(url)
    word_frequency(text)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/dp/wxmmld_s7k9gk_5fbhdcr2y00000gn/T/jieba.cache
Loading model cost 0.843 seconds.
Prefix dict has been built succesfully.
发展 134
改革 85
经济 71
推进 66
建设 59
社会 49
人民 47
企业 46
加强 46
政策 46

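Walkthrough

First, we fetch the full text of the government work report from the web. I wrap this step in a simple function named extract_text that takes a url as its argument. Because the report text on the target page sits in p elements, we only need to select all p elements with BeautifulSoup and return a string containing the body of the report.
Next, we segment the text with jieba, using full mode here. Full mode scans out every possible word in a sentence; it is very fast but cannot resolve ambiguity. We do this because, in the default accurate mode, the resulting word frequencies are inaccurate.
When segmenting, we also need to drop punctuation. Since every punctuation mark has length 1, adding the condition len(word) >= 2 is enough (this also drops single-character words).
Finally, the Counter class quickly turns the list of words into a dictionary whose keys are the words and whose values are the number of times each word occurs in the full text.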

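Sentiment Analysis

Natural language processing (NLP)

Convert natural language (text) into a form that computer programs can handle more easily
Preprocessed strings -> vectorization
Classic applications
1. Sentiment analysis
2. Text similarity
3. Text classification

Simple sentiment analysis

Sentiment dictionary
- Build a dictionary by hand, e.g. like -> 1, good -> 2, bad -> -1, terrible -> -2
- Score text by matching keywords
Example: AFINN-111: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010 . Crude, but practical
Problem:

Poor extensibility when new or unusual words appear

Alternative: use a machine learning model, nltk.classify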
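
The dictionary approach described above can be sketched in a few lines. The four-word lexicon reuses the scores from the example list and is purely illustrative (a real list such as AFINN-111 is much larger); lexicon_score is a name made up for this sketch.

```python
import re

# Hypothetical mini-lexicon with the scores from the list above.
SENTIMENT = {"like": 1, "good": 2, "bad": -1, "terrible": -2}

def lexicon_score(text):
    """Sum the scores of known words; unknown words contribute 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(SENTIMENT.get(word, 0) for word in words)

print(lexicon_score("That is a good movie."))          # 2
print(lexicon_score("This is a terrible, bad one."))   # -3
print(lexicon_score("What an awesome film!"))          # 0, "awesome" is not in the lexicon
```

The last call shows the extensibility problem: a clearly positive sentence scores 0 because its key word is missing from the dictionary.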

Example: a machine learning implementation

# A simple example

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier

text1 = 'I like the movie so much!'
text2 = 'That is a good movie.'
text3 = 'This is a great one.'
text4 = 'That is a really bad movie.'
text5 = 'This is a terrible movie.'

def proc_text(text):
  """
     Preprocess the text
   """
  # Tokenize
  raw_words = nltk.word_tokenize(text)

  # Normalize word forms
  wordnet_lematizer = WordNetLemmatizer()  
  words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]

  # Remove stop words
  filtered_words = [word for word in words if word not in stopwords.words('english')]

  # True means the word occurs in the text; this is the feature format NLTK classifiers expect
  return {word: True for word in filtered_words}

# Build the training samples
train_data = [[proc_text(text1), 1],
        [proc_text(text2), 1],
        [proc_text(text3), 1],
        [proc_text(text4), 0],
        [proc_text(text5), 0]]

# Train the model
nb_model = NaiveBayesClassifier.train(train_data)

# Test the model on a sentence it has not seen
text6 = 'That is a bad one.'
print(nb_model.classify(proc_text(text6)))
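
To show roughly what a Naive Bayes classifier does with word-presence features like these, here is a simplified pure-Python sketch. It uses add-one smoothing and hand-written word sets (assumed to be what remains after stop word removal); NLTK's NaiveBayesClassifier uses its own probability estimator, so its numbers will differ, and train_nb / classify_nb are names made up for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (set_of_words, label). Returns label counts, per-label word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify_nb(model, words):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical preprocessed versions of text1..text5 (1 = positive, 0 = negative).
samples = [({"like", "movie", "much"}, 1), ({"good", "movie"}, 1), ({"great", "one"}, 1),
           ({"really", "bad", "movie"}, 0), ({"terrible", "movie"}, 0)]
model = train_nb(samples)
print(classify_nb(model, {"bad", "movie"}))    # 0
print(classify_nb(model, {"great", "movie"}))  # 1
```

With only five training sentences, a single word can tip the result, which is why the real case study later trains on a much larger corpus.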

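Text Similarity

Measures how similar two texts are
Uses word frequency to represent the features of a text
That is, the frequency or count of each word in the text
Word frequency counting with NLTK

Text similarity example: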

import nltk
from nltk import FreqDist

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)
freq_dist = FreqDist(words)
print(freq_dist['is'])
# Output:
# 4


# Take the n=5 most common words
n = 5
# Build the "common word list"
most_common_words = freq_dist.most_common(n)
print(most_common_words)
# Output (on Python 3.7+, words with equal counts appear in order of first occurrence):
# [('movie', 4), ('is', 4), ('a', 4), ('That', 2), ('This', 2)]



def lookup_pos(most_common_words):
  """
     Map each common word to its position in the list
   """
  result = {}
  pos = 0
  for word in most_common_words:
    result[word[0]] = pos
    pos += 1
  return result

# Record the positions
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)
# Output:
# {'movie': 0, 'is': 1, 'a': 2, 'That': 3, 'This': 4}


# New text
new_text = 'That one is a good movie. This is so good!'
# Initialize the vector
freq_vec = [0] * n
# Tokenize
new_words = nltk.word_tokenize(new_text)

# Count word frequencies over the "common word list"
for new_word in new_words:
  if new_word in std_pos_dict:
    freq_vec[std_pos_dict[new_word]] += 1

print(freq_vec)
# Output:
# [1, 2, 1, 1, 1]
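
The example above stops at the frequency vector. One common way to turn two such vectors into a similarity score is cosine similarity; the notes do not name a specific measure, so treat this choice as an assumption. The second vector below is what 'That is a good movie' would give on the same common word list.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares no counted words with anything
    return dot / (norm_u * norm_v)

vec_new = [1, 2, 1, 1, 1]    # new_text over ['movie', 'is', 'a', 'That', 'This']
vec_other = [1, 1, 1, 1, 0]  # 'That is a good movie' over the same list
print(round(cosine_similarity(vec_new, vec_other), 3))  # 0.884
```

A value near 1 means the two texts use the common words in similar proportions; a value of 0 means they share none of them.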

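Text Classification

TF-IDF (term frequency-inverse document frequency)

TF, Term Frequency: how often a term occurs in a given document (in the example below, its count divided by the document length)
IDF, Inverse Document Frequency: measures how important a term is across the whole collection; terms that occur in fewer documents get a higher value
TF-IDF = TF * IDF

Worked example:

In a document of 100 words, the word cat occurs 3 times, so TF = 3/100 = 0.03

The collection holds 10,000,000 documents, of which 1,000 contain cat, so IDF = log10(10,000,000/1,000) = 4

TF-IDF = TF * IDF = 0.03 * 4 = 0.12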
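
The arithmetic of the worked example can be checked with a short script. The helper names tf and idf are chosen here for illustration, and the base-10 logarithm is what makes the IDF come out as 4.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term divided by the document length."""
    return term_count / total_terms

def idf(n_documents, n_documents_with_term):
    """Inverse document frequency with a base-10 logarithm, as in the worked example."""
    return math.log10(n_documents / n_documents_with_term)

tf_cat = tf(3, 100)
idf_cat = idf(10_000_000, 1_000)
print(tf_cat, idf_cat, tf_cat * idf_cat)
```

Libraries differ in the logarithm base and in smoothing, so absolute TF-IDF values from different tools are not directly comparable.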

TF-IDF in NLTK

TextCollection.tf_idf()

Example:

from nltk.text import TextCollection

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

# Build a TextCollection object
tc = TextCollection([text1, text2, text3, 
            text4, text5])
new_text = 'That one is a good movie. This is so good!'
word = 'That'
tf_idf_val = tc.tf_idf(word, new_text)
print('TF-IDF of {}: {}'.format(word, tf_idf_val))

Output:

TF-IDF of That: 0.02181644599700369
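
The value above can be reproduced by hand, which also shows what is being counted. As far as I can tell, when the texts are plain strings rather than token lists, the term frequency is a substring count divided by the number of characters, and the IDF uses the natural logarithm; the sketch below is a reconstruction under that assumption, not NLTK's source code.

```python
import math

texts = ['I like the movie so much ', 'That is a good movie ', 'This is a great one ',
         'That is a really bad movie ', 'This is a terrible movie']
new_text = 'That one is a good movie. This is so good!'
term = 'That'

# With plain strings, count() counts substring matches and len() counts characters.
tf = new_text.count(term) / len(new_text)               # 1 / 42
n_matching = sum(1 for text in texts if term in text)   # 2 of the 5 strings contain 'That'
idf = math.log(len(texts) / n_matching)                 # natural log of 5 / 2
print(tf * idf)
```

Passing lists of tokens (for example from nltk.word_tokenize) instead of raw strings makes both counts word-based, which is usually what is intended.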

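Case Study: Weibo Sentiment Analysis

Data: each text file holds the posts of one class

0: joy; 1: anger; 2: disgust; 3: low spirits

Steps

Read the text
Split into training and test sets
Extract features
Train the model and predict

Code:

tools.py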

# -*- coding: utf-8 -*-

import re
import jieba.posseg as pseg
import pandas as pd
import math
import numpy as np

# Load common stop words
stopwords1 = [line.rstrip() for line in open('./中文停用词库.txt', 'r', encoding='utf-8')]
# stopwords2 = [line.rstrip() for line in open('./哈工大停用词表.txt', 'r', encoding='utf-8')]
# stopwords3 = [line.rstrip() for line in open('./四川大学机器智能实验室停用词库.txt', 'r', encoding='utf-8')]
# stopwords = stopwords1 + stopwords2 + stopwords3
stopwords = stopwords1


def proc_text(raw_line):
  """
     Process one line of text
     and return the segmented words joined by spaces
   """
  # 1. Remove non-Chinese characters with a regular expression
  filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
  chinese_only = filter_pattern.sub('', raw_line)

  # 2. jieba segmentation with part-of-speech tagging
  words_lst = pseg.cut(chinese_only)

  # 3. Remove stop words
  meaningful_words = []
  for word, flag in words_lst:
    # if (word not in stopwords) and (flag == 'v'):
      # Words can also be filtered by part of speech, e.g. keeping only verbs
    if word not in stopwords:
      meaningful_words.append(word)

  return ' '.join(meaningful_words)


def split_train_test(text_df, size=0.8):
  """
     Split the data into a training set and a test set
   """
  # Process each class separately so that every class has the same
  # proportion in the training set and in the test set
  train_parts = []
  test_parts = []

  labels = [0, 1, 2, 3]
  for label in labels:
    # Select the rows with this label
    text_df_w_label = text_df[text_df['label'] == label]
    # Reset the index so each class is indexed from 0, which simplifies the split
    text_df_w_label = text_df_w_label.reset_index(drop=True)

    # By default, 80% goes to the training set and 20% to the test set
    # For simplicity, the first 80% of rows go to training and the last 20% to testing
    # A random 80/20 split is also possible (exercise: implement a random split of a DataFrame)

    # Number of rows in this class
    n_lines = text_df_w_label.shape[0]
    split_line_no = math.floor(n_lines * size)
    train_parts.append(text_df_w_label.iloc[:split_line_no, :])
    test_parts.append(text_df_w_label.iloc[split_line_no:, :])

  # DataFrame.append was removed in pandas 2.0, so combine the parts with pd.concat
  train_text_df = pd.concat(train_parts).reset_index(drop=True)
  test_text_df = pd.concat(test_parts).reset_index(drop=True)
  return train_text_df, test_text_df


def get_word_list_from_data(text_df):
  """
     Collect all words in the dataset into one list
   """
  word_list = []
  for _, r_data in text_df.iterrows():
    word_list += r_data['text'].split(' ')
  return word_list


def extract_feat_from_data(text_df, text_collection, common_words_freqs):
  """
     Feature extraction
   """
  # Only TF-IDF features are used here as an example
  # Word frequency or other text features could be added as extra features

  n_sample = text_df.shape[0]
  n_feat = len(common_words_freqs)
  common_words = [word for word, _ in common_words_freqs]

  # Initialize
  X = np.zeros([n_sample, n_feat])
  y = np.zeros(n_sample)

  print('Extracting features...')
  for i, r_data in text_df.iterrows():
    if (i + 1) % 5000 == 0:
      print('Features extracted for {} samples'.format(i + 1))

    text = r_data['text']

    feat_vec = []
    for word in common_words:
      if word in text:
        # The common word occurs in this text, so compute its TF-IDF value
        tf_idf_val = text_collection.tf_idf(word, text)
      else:
        tf_idf_val = 0

      feat_vec.append(tf_idf_val)

    # Assign
    X[i, :] = np.array(feat_vec)
    y[i] = int(r_data['label'])

  return X, y


def cal_acc(true_labels, pred_labels):
  """
     Compute the accuracy
   """
  n_total = len(true_labels)
  correct_list = [true_labels[i] == pred_labels[i] for i in range(n_total)]

  acc = sum(correct_list) / n_total
  return acc
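
The random split left as an exercise in split_train_test could look like the sketch below. It is one possible approach rather than the course's solution: split_train_test_random is a hypothetical name, it assumes pandas 1.1 or later (for GroupBy.sample) and a DataFrame with a unique index and a 'label' column.

```python
import pandas as pd

def split_train_test_random(text_df, size=0.8, seed=0):
    """Randomly place a fraction `size` of each label's rows in the training set."""
    train_text_df = text_df.groupby('label').sample(frac=size, random_state=seed)
    # Everything not sampled for training becomes the test set (needs a unique index).
    test_text_df = text_df.drop(train_text_df.index)
    return train_text_df.reset_index(drop=True), test_text_df.reset_index(drop=True)

# Made-up demo data: 10 rows of label 0 and 5 rows of label 1.
demo = pd.DataFrame({'label': [0] * 10 + [1] * 5,
                     'text': ['t{}'.format(i) for i in range(15)]})
train_df, test_df = split_train_test_random(demo)
print(train_df.groupby('label').size())
print(test_df.groupby('label').size())
```

Sampling within each label keeps the class proportions the same in both sets, which is the property the original slicing version was designed to preserve.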

main.py

# main.py

# -*- coding: utf-8 -*-


import os
import pandas as pd
import nltk
from tools import proc_text, split_train_test, get_word_list_from_data, \
  extract_feat_from_data, cal_acc
from nltk.text import TextCollection
from sklearn.naive_bayes import GaussianNB

dataset_path = './dataset'
text_filenames = ['0_simplifyweibo.txt', '1_simplifyweibo.txt',
         '2_simplifyweibo.txt', '3_simplifyweibo.txt']

# CSV file holding the raw data
output_text_filename = 'raw_weibo_text.csv'

# File holding the cleaned text data
output_cln_text_filename = 'clean_weibo_text.csv'

# Processing and cleaning the text takes a long time, so it is controlled by is_first_run
# Set it to True on the first run, when the raw text still has to be processed and cleaned
# Set it to False once the cleaned text has been saved by an earlier run
is_first_run = True


def read_and_save_to_csv():
  """
     Read the raw text data and save labels and text as a csv file
   """

  text_w_label_df_lst = []
  for text_filename in text_filenames:
    text_file = os.path.join(dataset_path, text_filename)

    # The label (0, 1, 2, or 3) is the first character of the file name
    label = int(text_filename[0])

    # Read the text file
    with open(text_file, 'r', encoding='utf-8') as f:
      lines = f.read().splitlines()

    labels = [label] * len(lines)

    text_series = pd.Series(lines)
    label_series = pd.Series(labels)

    # Build the dataframe
    text_w_label_df = pd.concat([label_series, text_series], axis=1)
    text_w_label_df_lst.append(text_w_label_df)

  result_df = pd.concat(text_w_label_df_lst, axis=0)

  # Save as a csv file
  result_df.columns = ['label', 'text']
  result_df.to_csv(os.path.join(dataset_path, output_text_filename),
           index=None, encoding='utf-8')


def run_main():
  """
     Main function
   """
  # 1. Read, process, clean, and prepare the data
  if is_first_run:
    print('Processing and cleaning text data...', end=' ')
    # On the first run the raw text has to be processed and cleaned

    # Read the raw text data and save labels and text as a csv file
    read_and_save_to_csv()

    # Read the csv file back and build the dataset
    text_df = pd.read_csv(os.path.join(dataset_path, output_text_filename),
               encoding='utf-8')

    # Process the text
    text_df['text'] = text_df['text'].apply(proc_text)

    # Drop empty strings
    text_df = text_df[text_df['text'] != '']

    # Save the processed text
    text_df.to_csv(os.path.join(dataset_path, output_cln_text_filename),
            index=None, encoding='utf-8')
    print('done, results saved.')

  # 2. Split into training and test sets
  print('Loading the processed text data')
  clean_text_df = pd.read_csv(os.path.join(dataset_path, output_cln_text_filename),
                encoding='utf-8')
  # Split into training and test sets
  train_text_df, test_text_df = split_train_test(clean_text_df)
  # Show basic information about both sets
  print('Samples per class in the training set:', train_text_df.groupby('label').size())
  print('Samples per class in the test set:', test_text_df.groupby('label').size())

  # 3. Feature extraction
  # Count word frequencies
  n_common_words = 200

  # Count the frequencies of the words in the training set
  print('Counting word frequencies...')
  all_words_in_train = get_word_list_from_data(train_text_df)
  fdisk = nltk.FreqDist(all_words_in_train)
  common_words_freqs = fdisk.most_common(n_common_words)
  print('The {} most frequent words are:'.format(n_common_words))
  for word, count in common_words_freqs:
    print('{}: {} times'.format(word, count))
  print()

  # Extract features on the training set
  text_collection = TextCollection(train_text_df['text'].values.tolist())
  print('Extracting features from training samples...', end=' ')
  train_X, train_y = extract_feat_from_data(train_text_df, text_collection, common_words_freqs)
  print('done')
  print()

  print('Extracting features from test samples...', end=' ')
  test_X, test_y = extract_feat_from_data(test_text_df, text_collection, common_words_freqs)
  print('done')

  # 4. Train a Naive Bayes model
  print('Training the model...', end=' ')
  gnb = GaussianNB()
  gnb.fit(train_X, train_y)
  print('done')
  print()

  # 5. Predict
  print('Testing the model...', end=' ')
  test_pred = gnb.predict(test_X)
  print('done')

  # Print the accuracy
  print('Accuracy:', cal_acc(test_y, test_pred))

if __name__ == '__main__':
  run_main()

