1. Handling missing values in a DataFrame: pandas.DataFrame.dropna
df2.dropna(axis=0, how='any', subset=[u'ToC'], inplace=True)
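Drops every row that has a missing value in the ToC column. Because inplace=True, df2 itself is modified and the call returns None.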
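A minimal sketch of how subset narrows the check to one column. The sample data and the `other` column are made up for illustration; only the `ToC` column name comes from the note above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name 'ToC' comes from the note.
df2 = pd.DataFrame({
    'ToC': [1.0, np.nan, 3.0, 4.0],
    'other': [np.nan, 'b', 'c', 'd'],
})

# Drop rows (axis=0) whose 'ToC' value is missing.
# A NaN in any column outside `subset` does not cause a row to be dropped.
cleaned = df2.dropna(axis=0, how='any', subset=['ToC'])

print(cleaned)
print(len(df2), '->', len(cleaned))
```

Here the non-inplace form is used so the original frame stays available for comparison.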
2. Counting duplicate rows by a given column: pandas.DataFrame.duplicated, pandas.Series.value_counts
print(df.duplicated(['name']).value_counts())  # if no columns are given, all columns are compared by default
"""
Output:
False    11118
True       664
meaning 664 rows are duplicates
"""
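duplicated() returns a boolean Series showing whether each row is a duplicate: False for non-duplicate rows, True for duplicate rows. With the default keep='first', the first occurrence of a value counts as non-duplicate and only the later repeats are marked True.
value_counts() counts how many times each distinct value occurs in an array or Series; to count a single column you can call df.column_name.value_counts() directly.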
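The two calls above can be tried on a small frame. This is only a sketch; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical data: 'a' appears twice, every other name once.
df = pd.DataFrame({
    'name': ['a', 'b', 'a', 'c', 'd'],
    'score': [1, 2, 3, 4, 5],
})

# Boolean Series: True only for the second 'a' (keep='first' is the default).
dup_mask = df.duplicated(['name'])
print(dup_mask.value_counts())

# keep=False marks every member of a duplicated group, including the first.
all_dups = df.duplicated(['name'], keep=False)
print(int(all_dups.sum()), 'rows belong to a duplicated name')

# Per-value frequency of a single column.
name_counts = df['name'].value_counts()
print(name_counts)
```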
3. Removing duplicates: pandas.DataFrame.drop_duplicates
df.drop_duplicates(['name'], keep='last', inplace=True)
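To see how the keep argument changes which rows survive, here is a small sketch on invented data (the `version` column is hypothetical). It uses the non-inplace form so all three results can be compared.

```python
import pandas as pd

# Hypothetical data: name 'a' occurs at index 0 and index 2.
df = pd.DataFrame({
    'name': ['a', 'b', 'a', 'c'],
    'version': [1, 1, 2, 1],
})

first = df.drop_duplicates(['name'], keep='first')  # keeps the 'a' at index 0
last = df.drop_duplicates(['name'], keep='last')    # keeps the 'a' at index 2
none = df.drop_duplicates(['name'], keep=False)     # drops both 'a' rows

print(list(first.index), list(last.index), list(none.index))
```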
"""
keep : {'first', 'last', False}, default 'first'
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
"""
4. Joining columns from two DataFrames: pandas.DataFrame.merge
result = pd.merge(left, right, on='name', how='inner')
"""
Other parameters:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
Examples
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
"""
See also: Merge, join, and concatenate (pandas user guide)
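The A/B example from the docstring can be rebuilt as runnable code. This sketch also turns on indicator=True, which adds a `_merge` column telling where each row came from. Row order in the result may differ between pandas versions, so the sketch only looks at row counts.

```python
import pandas as pd

# Same data as the A / B tables in the docstring example.
A = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 4]})
B = pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'], 'value': [5, 6, 7, 8]})

# Inner join: only keys present on both sides ('foo' and 'bar').
inner = A.merge(B, left_on='lkey', right_on='rkey', how='inner')

# Outer join: also keeps 'baz' (left only) and 'qux' (right only).
# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = A.merge(B, left_on='lkey', right_on='rkey', how='outer', indicator=True)

print(len(inner), len(outer))
print(outer['_merge'].value_counts())
```

The overlapping `value` column is renamed to `value_x` and `value_y` by the default suffixes.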
5. Finding all rows where a given column is null
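bool_arr = df['name'].notnull()
print(bool_arr.value_counts())
# Walk the mask and print every row whose name is null
for idx, value in bool_arr.items():
    if not value:
        print('\n', idx, value)
        print(df.loc[idx])  # idx is an index label, so use .loc rather than .iloc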
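The same rows can also be selected without a loop by indexing with a boolean mask. A minimal sketch on invented data; the non-default index is there to show that label-based selection does not depend on row position.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index; None and NaN both count as null.
df = pd.DataFrame(
    {'name': ['a', None, 'c', np.nan], 'buss': [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

# Boolean-mask selection: keep only the rows where 'name' is null.
null_rows = df[df['name'].isnull()]
print(null_rows)
```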
6. Specifying the columns of a DataFrame and their order
import codecs
import json
import pandas as pd

# fname: path to a file holding one JSON object per line (not defined in this snippet)
res = {'name': [], 'buss': [], 'label': []}
with codecs.open(fname, encoding='utf8') as fr:
    for line in fr:
        item = json.loads(line)
        res['name'].append(item['name'])
        res['buss'].append(item['buss'])
        res['label'].append(item['label'])
# columns= fixes both which columns appear and the order they appear in
df = pd.DataFrame(res, columns=['name', 'buss', 'label'])
df.to_csv('data/test_12315_industry_business.csv', index=False, encoding='utf-8')
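A small sketch of what the columns argument does, using in-memory data instead of a file. The values are invented; the column names are the ones from the snippet above.

```python
import pandas as pd

# Hypothetical values; the keys match the columns used in the note.
res = {'name': ['x', 'y'], 'buss': ['b1', 'b2'], 'label': [0, 1]}

# The order in `columns` wins, regardless of the key order in the dict.
df = pd.DataFrame(res, columns=['label', 'name', 'buss'])
print(list(df.columns))

# An existing frame can be reordered by selecting with a list of column names.
reordered = df[['name', 'buss', 'label']]
print(list(reordered.columns))

# With no path, to_csv returns the CSV text; the header follows the column order.
csv_text = df.to_csv(index=False)
print(csv_text)
```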