The source files for all of the notebooks are available on GitHub.

Scales

  • Nominal data (categorical data)
In [17]:
import pandas as pd
import numpy as np
In [18]:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                 index = ["excellent", "excellent", "excellent", "good", "good", "good", "ok", "ok", "ok", "poor", "poor"])
df.rename(columns = {0 : "Grades"}, inplace = True)
df
Out[18]:
Grades
excellent A+
excellent A
excellent A-
good B+
good B
good B-
ok C+
ok C
ok C-
poor D+
poor D
  • Casting the column with astype("category") makes it categorical; here it has 11 distinct categories.
In [19]:
df["Grades"].astype("category").head()
Out[19]:
excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]
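  • To tell pandas that the categories follow a logical order, pass ordered=True. The output then reflects the ordering in the dtype, with < between the categories.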
In [20]:
from pandas.api.types import CategoricalDtype
grades = df['Grades'].astype(CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],ordered=True))
grades.head()
Out[20]:
excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
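The difference between the two casts can also be checked programmatically instead of by reading the printed dtype. The following is a minimal sketch using the `.cat` accessor; the three-element `sample` series is a made-up stand-in for the full Grades column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
sample = pd.Series(['A+', 'C', 'D'])  # illustrative sample, not the full DataFrame

# Plain cast: categories are inferred and carry no order
unordered = sample.astype('category')
# Explicit cast: categories and their order are stated up front
ordered = sample.astype(CategoricalDtype(categories=order, ordered=True))

print(unordered.cat.ordered)       # False
print(ordered.cat.ordered)         # True
# Each value is stored as its position in the category list
print(ordered.cat.codes.tolist())  # [10, 3, 0]
```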
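  • Ordinal data has an ordering, which makes boolean masking easier.
  • Suppose we have a list of grades and compare each one to C. Compared lexically as plain strings, both C+ and C- turn out to be greater than C.
  • Declaring an explicit order for the data makes the broadcast comparison behave as expected.
  • Ordered data also supports a set of order-based operations such as max and min.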
In [21]:
grades > "C"
Out[21]:
excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
poor         False
poor         False
Name: Grades, dtype: bool
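
To make the contrast with lexical comparison concrete, here is a small sketch that runs the same comparison on plain strings and on the ordered categorical, then takes min and max. It rebuilds the grades as a standalone Series (the variable names `raw`, `lexical`, and `logical` are only illustrative) so it can run on its own.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.Series(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'])

# Plain strings compare lexically, so 'C+' and 'C-' both count as greater than 'C'
lexical = raw > 'C'
print(raw[lexical].tolist())      # ['C+', 'C-', 'D+', 'D']

order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
ordered = raw.astype(CategoricalDtype(categories=order, ordered=True))

# With a declared order, the comparison follows the grading scale
logical = ordered > 'C'
print(ordered[logical].tolist())  # ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+']

# min and max also follow the declared order
print(ordered.min(), ordered.max())  # D A+
```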

Test1

Try casting this series to categorical with the ordering Low < Medium < High.

In [22]:
s = pd.Series(['Low', 'Low', 'High', 'Medium', 'Low', 'High', 'Low'])
s = s.astype(CategoricalDtype(categories=['Low', 'Medium', 'High'],ordered=True))
s
Out[22]:
0       Low
1       Low
2      High
3    Medium
4       Low
5      High
6       Low
dtype: category
Categories (3, object): [Low < Medium < High]
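
As a quick check that the ordering took effect, the sketch below sorts the series and filters it with a comparison. It repeats the cast from the cell above so it is self-contained; `sizes` is just an illustrative name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(['Low', 'Low', 'High', 'Medium', 'Low', 'High', 'Low'])
sizes = sizes.astype(CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True))

# Sorting follows the category order, not alphabetical order
print(sizes.sort_values().tolist())
# ['Low', 'Low', 'Low', 'Low', 'Medium', 'High', 'High']

# Comparisons against a category label follow the same order
at_least_medium = sizes >= 'Medium'
print(int(at_least_medium.sum()))  # 3
```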

Test2

Suppose we have a series that holds height data for jacket wearers. Use pd.cut to bin this data into 3 bins.

In [27]:
s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])
pd.cut(s, 3)
Out[27]:
0     (167.978, 175.333]
1     (175.333, 182.667]
2     (167.978, 175.333]
3       (182.667, 190.0]
4     (167.978, 175.333]
5       (182.667, 190.0]
6     (175.333, 182.667]
7     (175.333, 182.667]
8     (167.978, 175.333]
9     (167.978, 175.333]
10    (175.333, 182.667]
11    (175.333, 182.667]
12    (175.333, 182.667]
13    (167.978, 175.333]
dtype: category
Categories (3, interval[float64]): [(167.978, 175.333] < (175.333, 182.667] < (182.667, 190.0]]
In [28]:
pd.cut(s, 3, labels=['Small', 'Medium', 'Large'])
Out[28]:
0      Small
1     Medium
2      Small
3      Large
4      Small
5      Large
6     Medium
7     Medium
8      Small
9      Small
10    Medium
11    Medium
12    Medium
13     Small
dtype: category
Categories (3, object): [Small < Medium < Large]
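
To see which bin edges pd.cut chose and how many values fell into each bin, one option is the retbins=True argument together with value_counts. This is a sketch on the same height data; `heights`, `binned`, `edges`, and `counts` are illustrative names.

```python
import pandas as pd

heights = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# retbins=True additionally returns the array of computed bin edges
binned, edges = pd.cut(heights, 3, labels=['Small', 'Medium', 'Large'], retbins=True)

# Three equal-width bins need four edges
print(len(edges))  # 4

# Number of observations per labeled bin
counts = binned.value_counts().to_dict()
print(counts)
```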