wangxishi

2020-09-08   阅读量: 13071

Python 数据分析师

python对两列数据进行处理

扫码加入数据分析学习群

想用python查找出两列分别相同的值,怎么操作呢,如图:

image.png

添加CDA认证专家【维克多阿涛】,微信号:【cdashijiazhuang】,提供数据分析指导及CDA考试秘籍。已助千人通过CDA数字化人才认证。欢迎交流,共同成长!
30.9968 2 8 关注作者 收藏

评论(8)

gracejpw1117
2020-09-08

In [19]:

import pandas as pddf=pd.read_excel(r'D:\CDA\CDA_PGC\datas.xlsx',sheetname='Sheet1')
df

Out[19]:


编号日期
01232020-09-09
11232020-09-08
23452020-09-09
33452020-09-08
43452020-09-09
51232020-09-09

In [20]:

df=df.drop_duplicates()

In [21]:

df['日期']=df['日期'].dt.date.apply(str)

In [22]:

df

Out[22]:


编号日期
01232020-09-09
11232020-09-08
23452020-09-09
33452020-09-08

In [23]:

df.to_excel(r'D:\CDA\CDA_PGC\datas2.xlsx',index=None)#删除index列


0.0000 0 0 回复
gracejpw1117
2020-09-08
.drop_duplicates()用于所有列的共同去重


0.0000 0 0 回复
gracejpw1117
2020-09-08
import pandas as pd
df=pd.read_excel(r'D:\CDA\CDA_PGC\datas.xlsx',sheetname='Sheet1')
df['日期']=df['日期'].dt.date #去掉‘时分秒’(从excel导入的日期数据默认00:00:00)
df['合并列']=df['编号'].map(str)+','+df['日期'].map(str) #将pandas默认的pandas默认的int64类型转为字符串df

Out[15]:


编号日期合并列
01232020-09-09123,2020-09-09
11232020-09-08123,2020-09-08
23452020-09-09345,2020-09-09
33452020-09-08345,2020-09-08
43452020-09-09345,2020-09-09
51232020-09-09123,2020-09-09

In [16]:

df.drop_duplicates('合并列',inplace=True)df

Out[16]:


编号日期合并列
01232020-09-09123,2020-09-09
11232020-09-08123,2020-09-08
23452020-09-09345,2020-09-09
33452020-09-08345,2020-09-08

In [17]:

df.drop(labels=['编号','日期'],axis=1,inplace=True)df

Out[17]:


合并列
0123,2020-09-09
1123,2020-09-08
2345,2020-09-09
3345,2020-09-08

In [18]:

df['编号'],df['日期']=df['合并列'].str.split(',',1).str df

Out[18]:


合并列编号日期
0123,2020-09-091232020-09-09
1123,2020-09-081232020-09-08
2345,2020-09-093452020-09-09
3345,2020-09-083452020-09-08

In [19]:

df=df.drop(labels=['合并列'],axis=1)df

Out[19]:


编号日期
01232020-09-09
11232020-09-08
23452020-09-09
33452020-09-08

In [20]:

df.to_excel(r'D:\CDA\CDA_PGC\datas2.xlsx',index=None)#删除index列


66.4083 1 0 回复
gracejpw1117
2020-09-08

注:In[16]、[17]、[18]、[19]最后的df均须另起一行

0.0000 0 0 回复
s1223452840
2020-09-08

对 , 应该是没有


0.0000 0 0 回复
wangxishi
2020-09-08

对的 我原始数据可能没有两列分别是相同值的情况

0.0000 0 0 回复
wangxishi
2020-09-08

最后出来数据是空的,是不是就说明我的数据里面没有两列值分别都一样的数据

0.0000 0 0 回复
s1223452840
2020-09-08

data1 = b.drop_duplicates(keep = False)

data2 = b.drop_duplicates(keep = "first")

data2.append(data1).drop_duplicates(keep= False)


这样写 , 不要subset参数了

32.4776 1 0 回复
wangxishi
2020-09-08

是这么写吗

data2 = data1.drop_duplicates(subset=["编号","日期"])


data3 = data1.drop_duplicates(subset=[""编号","日期"],keep = 'first')


data3.append(data2).drop_duplicates(keep= 'first')


0.0000 0 0 回复
s1223452840
2020-09-08

采用采用pandas中的drop_duplicates对数据去两次重,一次将重复数据全部去除(keep=False)记为data1,另一次将重复数据保留一个(keep='first)记为data2;

  1. 求data1和data2的差集即可:data2.append(data1).drop_duplicates(keep=False)

data1 = a.drop_duplicates(subset=["编号","日期"]

data2 = a.drop_duplicates(subset=["编号","日期"],keep = False)

data2.append(data1).drop_duplicates(keep=False)


30.9477 1 0 回复

推荐课程