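Taobao User Behavior Data Analysis Project: Contents¶

Project overview
Analysis goal
Conclusions
Main analysis
Traffic metrics
1. Daily PV and UV trends and their growth rates
Funnel analysis
1. Conversion funnel: browse - add to cart - purchase
2. Conversion funnel: browse - purchase
User behavior path analysis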

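Project overview¶

Domain: e-commerce
Data source: Tianchi public dataset
Data: user behavior data
Time span: November 25, 2017 to December 3, 2017 (9 days)
Tools:
- Language: Python
- Database: MySQL
- Visualization: Pyecharts, Matplotlib, Seaborn

The dataset, provided by Alibaba, contains more than 100 million Taobao user behavior records. It covers every behavior (click, purchase, add to cart, favorite) of about one million randomly selected users between November 25, 2017 and December 3, 2017. Each row is one user behavior record, and every behavior relates to a product detail page.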

Analysis goal¶

Analyze Taobao user behavior data and propose improvements.

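Conclusions¶

Daily overview¶

PV, UV, purchase count, number of buyers, items purchased, add-to-cart count, add-to-cart users, items added to cart, favorite count, favoriting users, and items favorited all reach their 9-day maximum on 2017-12-02.
Page views per user: about 13
Items browsed per user: about 8
Browse-to-purchase conversion rate (buyers / UV): about 20%
Purchases per buyer: about 1.5
Items purchased per buyer: about 1.4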

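Daily PV and UV trends and their growth rates¶

Daily PV and UV trend upward overall, peak on December 2, and fall back somewhat on December 3. On December 2 the day-over-day growth is about 23% for PV and about 25% for UV.

Both metrics rise sharply on December 2-3. December 2-3 is a weekend, but the previous weekend (November 25-26) shows no comparable rise, so the weekend can be ruled out as the cause. Because December 2-3 is close to the Double 12 (December 12) shopping festival, the more likely cause is extra traffic from the warm-up promotions for Double 12.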

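Funnel analysis¶

A typical Taobao session starts with browsing (pv); the user may then favorite an item (fav) or add it to the cart (cart), and finally purchase (buy). Neither fav nor cart is a required step before buying.

Favorites mainly let users save items they like but do not want to buy right away. The cart can serve a similar purpose, but it mainly makes it convenient to buy several items at once. Both features exist to improve the shopping experience. A favorited item still has to be either added to the cart or bought directly, so this funnel analysis leaves favorites out. Because the dataset has a limited set of behavior types, with no order-placement or payment step, only the conversions among browsing, adding to cart, and purchasing are analyzed.

Conversion nodes: browse, add to cart, purchase
Time window: 9 days
Conversion baseline: each step is the baseline for the next step
Conversion paths:
- Path 1: browse - add to cart - purchase
- Path 2: browse - purchase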

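Conversion funnel: browse - add to cart - purchase¶

Over the 9 days, 53.09% of browsing users added an item to the cart after browsing, and 59.22% of those users went on to purchase after adding to the cart. Because the dataset covers only a short time range, it is not yet possible to tell whether these step conversion rates are normal; a final judgment requires comparison with historical data.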

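Conversion funnel: browse - purchase¶

Over the 9 days, 32.5% of browsing users went from browsing to purchase without first adding to the cart. Because the dataset covers only a short time range, it is not yet possible to tell whether this conversion rate is normal; a final judgment requires comparison with historical data.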

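User behavior path distribution¶

Paths start at browsing and are cut off at the purchase. The resulting table has 10 paths, including the browse-only path. Over the 9 days, the three most common paths are browse - purchase, browse - add to cart - purchase, and browse - add to cart, with shares of 25.18%, 23.99%, and 15.38% (64.55% combined). Each of the remaining paths accounts for between 3% and 9% of users.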

Usage of the cart, favorites, and purchase features¶

Over the 9 days, 63.94% of browsing users made a purchase, 53.09% used the cart, and 26.44% used favorites.

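Time from favoriting or adding to cart to purchase¶

The distribution of the time between favoriting or adding to the cart and purchasing lets operations staff decide when to send users a reminder.

Over the 9 days, among users who added an item to the cart and later bought it, close to 60% bought within 24 hours and 88.09% bought within 3 days.
Among users who favorited an item and later bought it, 77.13% bought within 24 hours and 92.24% bought within 3 days.
If a user has still not purchased 3 days after favoriting or adding to the cart, their purchase intent is probably weak. At that point a timely reminder or a promotional offer can nudge them to buy, or similar items can be recommended as alternatives.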

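Main analysis¶

The original dataset has more than 100 million records, which is too much to process here, so this analysis uses only the first 2 million records.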

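Dataset overview¶

Source: Tianchi public dataset
Size: more than 100 million real, anonymized user behavior records (only 2 million are analyzed here)
Time span: November 25, 2017 to December 3, 2017 (9 days)
The dataset has 5 fields:

Field	Description
User ID	User ID (integer)
Item ID	Item ID (integer)
Category ID	Item category ID (integer)
Behavior type	Behavior type (string)
Timestamp	Unix timestamp of the behavior, in seconds (integer)

The behavior type field takes one of four values:

Behavior type	Description
pv	Page view of a product detail page, equivalent to a click
buy	Item purchase
cart	Item added to the cart
fav	Item favorited

Note: all behaviors here relate to product detail pages.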

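Importing libraries¶

# Import the libraries used in this analysis
import numpy as np  # numerical computing
import pandas as pd  # data analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # built on matplotlib; sns.set() below applies seaborn's default theme
import os,pymysql
from sqlalchemy import create_engine  # database connection
from pyecharts import options as opts  # chart options used by the funnel charts
from pyecharts.charts import Funnel  # funnel chart

plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese labels render
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures
sns.set(font='SimHei')  # let seaborn render Chinese labels too

# Show charts inline in Jupyter Notebook
%matplotlib inline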

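Loading the dataset¶

Key point

read_csv()

Parameter names: array-like, default None. The list of column names to use for the result. When names is passed and the file has no header row, the first line is read as data (the same behavior as header=None); if the file does have a header row, pass header=0 so that names replaces it. Duplicate names in the list are not allowed. Older pandas documentation lists mangle_dupe_cols=True as an exception; that parameter is no longer available in pandas 2.0 and later.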
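The behavior of names can be checked on a small in-memory sample. This is a minimal sketch: the two sample lines are taken from the data preview in this document, and `sample_csv` is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical two-line sample in the same layout as UserBehavior.csv (no header row).
sample_csv = io.StringIO(
    "1,2268318,2520377,pv,1511544070\n"
    "1,2333346,2520771,pv,1511561733\n"
)

columns = ["user_id", "item_id", "category_id", "behavior_type", "timestamp"]

# With names= and no header=, pandas treats the first line as data, not as a header.
sample = pd.read_csv(sample_csv, names=columns)

print(sample.shape)          # both lines are kept as rows
print(list(sample.columns))  # the names that were passed in
```

If only the first rows of a large file are needed, `nrows` is another option that `read_csv` provides alongside the iterator approach used in this notebook.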

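# Load the dataset

columns = ["user_id","item_id","category_id","behavior_type","timestamp"]

# Open the file lazily and read only the first 2,000,000 rows
reader = pd.read_csv("/Users/xusanshan/Downloads/UserBehavior.csv", iterator=True,names=columns)
user_behavior = reader.get_chunk(2000000)

Understanding the dataset¶

Preview the data to get familiar with the fields and what they mean.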

user_behavior.head()

Cleaning the dataset¶

Overall data summary¶

user_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   user_id        int64 
 1   item_id        int64 
 2   category_id    int64 
 3   behavior_type  object
 4   timestamp      int64 
dtypes: int64(4), object(1)
memory usage: 76.3+ MB

Handling missing values¶

# Count missing values and their share per column

def missing_info(data,num=5):
    # func: report the number and share of missing values per column
    # data: DataFrame
    # num: int, number of rows of the report to print
    # return: None; prints the missing-value counts and percentages

    null_data = data.isnull().sum().sort_values(ascending=False)
    percent_1 = data.isnull().sum()/data.isnull().count()
    missing_data = pd.concat([null_data,percent_1.apply(lambda x: format(x, '.2%'))],axis=1,keys=['total missing','missing percentage'])
    print(missing_data.head(num))

missing_info(user_behavior)

               total missing missing percentage
timestamp                  0              0.00%
behavior_type              0              0.00%
category_id                0              0.00%
item_id                    0              0.00%
user_id                    0              0.00%

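Missing values:

No field has missing values; the data is clean.

Handling duplicates¶

# Count fully duplicated rows

user_behavior.duplicated().sum()

0

Duplicates:

There are no duplicate rows, so nothing needs to be done.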

Converting data types¶

# Convert Unix timestamp columns to datetime

def timestamp_datetime(data,column_list):
    # func: convert Unix timestamp columns (in seconds) to datetime
    # data: DataFrame
    # column_list: list of the column names to convert
    # return: None; the columns of data are converted in place

    for i in column_list:
        data[i] = pd.to_datetime(data[i], unit='s')
    print('时间戳转换 datetime 格式完成')

data = user_behavior
column_list =['timestamp']
timestamp_datetime(data,column_list)

时间戳转换 datetime 格式完成

user_behavior.head()
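The conversion above produces naive UTC datetimes, which is why the first rows of the preview fall on 2017-11-24. The sketch below shows one way to shift them to local time, assuming the timestamps are meant to be read in China Standard Time (UTC+8). That assumption is not stated in this notebook, and the results in this notebook were computed without the shift.

```python
import pandas as pd

# First two raw timestamps from the data preview.
raw = pd.Series([1511544070, 1511561733], name="timestamp")

# unit="s" interprets the integers as seconds since the Unix epoch, in UTC.
utc_naive = pd.to_datetime(raw, unit="s")

# Assumption: events should be read in China Standard Time (UTC+8).
local_naive = (
    pd.to_datetime(raw, unit="s", utc=True)
    .dt.tz_convert("Asia/Shanghai")
    .dt.tz_localize(None)  # drop the timezone info, keep the local wall-clock time
)

print(utc_naive.iloc[0])    # 2017-11-24 17:21:10
print(local_naive.iloc[0])  # 2017-11-25 01:21:10
```

Under this assumption the first record lands inside the November 25 to December 3 window instead of just before it, so the choice of timezone affects how many records the later range filter removes.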

Adding date and time fields¶

date
year_month
hour

user_behavior['date'] = user_behavior['timestamp'].dt.date
user_behavior['year_month'] = user_behavior['timestamp'].dt.to_period('M')
user_behavior['hour']= user_behavior['timestamp'].dt.hour

user_behavior.head()

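Handling outliers¶

# Inspect the data for anomalies

user_behavior.describe(include="all")

Anomalies:

The earliest timestamp (1970-01-01 12:13:36) and the latest timestamp (2018-08-28 10:27:12) both fall outside the dataset's stated range (November 25, 2017 to December 3, 2017).

Handling

Keep only the user behavior records that fall between November 25, 2017 and December 3, 2017.

# Inspect the out-of-range records
# (compare on the datetime column 'timestamp'; the 'date' column holds datetime.date objects, which newer pandas versions may refuse to compare with a Timestamp)

user_behavior[(user_behavior['timestamp']<pd.to_datetime('2017-11-25'))|(user_behavior['timestamp']>=pd.to_datetime('2017-12-04'))]

# Method 1: keep the records inside the range

user_behavior = user_behavior[(user_behavior['timestamp']>=pd.to_datetime('2017-11-25'))&(user_behavior['timestamp']<pd.to_datetime('2017-12-04'))]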
user_behavior.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1975349 entries, 2 to 1999999
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   user_id        int64         
 1   item_id        int64         
 2   category_id    int64         
 3   behavior_type  object        
 4   timestamp      datetime64[ns]
 5   date           object        
 6   year_month     period[M]     
 7   hour           int64         
dtypes: datetime64[ns](1), int64(4), object(2), period[M](1)
memory usage: 135.6+ MB

# Method 2 (equivalent to method 1): drop the records outside the range

out_of_range = user_behavior[(user_behavior['timestamp']<pd.to_datetime('2017-11-25'))|(user_behavior['timestamp']>=pd.to_datetime('2017-12-04'))]
user_behavior = user_behavior.drop(index=out_of_range.index)
user_behavior.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1975349 entries, 2 to 1999999
Data columns (total 8 columns):
 #   Column         Dtype         
---  ------         -----         
 0   user_id        int64         
 1   item_id        int64         
 2   category_id    int64         
 3   behavior_type  object        
 4   timestamp      datetime64[ns]
 5   date           object        
 6   year_month     period[M]     
 7   hour           int64         
dtypes: datetime64[ns](1), int64(4), object(2), period[M](1)
memory usage: 135.6+ MB

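After cleaning

Sample size: 1,975,349 records
Fields: 8
No missing values and no duplicates
A small number of out-of-range records, which have been removed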

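Exporting to the database¶

# Export a DataFrame to a MySQL table

def export_mysql(data,user,password,host_port,db,table_name):
    # func: export a DataFrame to a MySQL table
    # data: DataFrame
    # user: user name
    # password: password
    # host_port: host and port, e.g. '127.0.0.1:3306'
    # db: database name
    # table_name: table name

    print('将清洗后的数据导出到 mysql')
    engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, password, host_port, db))
    # to_sql obtains its own connection from the engine, so no explicit connect() is needed
    data.to_sql(table_name,engine,if_exists='replace',index=False)
    engine.dispose()  # release the pooled connections
    print('成功导出')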
    
user = 'root'
password = '1234567890'
host_port = '127.0.0.1:3306'
db = 'test'

data = user_behavior[['user_id', 'item_id', 'category_id', 'behavior_type', 'timestamp','date', 'hour']]
table_name = 'user_behavior'

export_mysql(data,user,password,host_port,db,table_name)

将清洗后的数据导出到 mysql
成功导出

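Why exporting the full DataFrame fails

Exporting the whole user_behavior DataFrame to MySQL raises ValueError: cannot infer type for <class 'NoneType'>

The suspected cause is the year_month field: its dtype is period[M], which apparently cannot be mapped to a MySQL column type. Dropping that field before the export made it succeed.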
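If the year_month field is needed in the database, one option is to convert it to plain strings before exporting instead of dropping it. The sketch below follows the suspicion described above and is not a confirmed diagnosis; it uses an in-memory SQLite database as a stand-in for MySQL, and the toy frame `df` is hypothetical.

```python
import sqlite3

import pandas as pd

# Toy frame with a period[M] column like year_month in this notebook.
df = pd.DataFrame({"user_id": [1, 2]})
df["year_month"] = pd.to_datetime(["2017-11-25", "2017-12-03"]).to_period("M")

# Convert the Period column to plain strings such as '2017-11' before exporting.
export_df = df.assign(year_month=df["year_month"].astype(str))

# Assumption: an in-memory SQLite database stands in for MySQL here.
con = sqlite3.connect(":memory:")
export_df.to_sql("user_behavior", con, if_exists="replace", index=False)
restored = pd.read_sql("SELECT * FROM user_behavior", con)
con.close()

print(restored["year_month"].tolist())  # ['2017-11', '2017-12']
```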

Building the base table¶

# Per behavior type and day: number of events, distinct users, distinct items
table = user_behavior.groupby(['behavior_type','date']).agg({"user_id":["count","nunique"],"item_id":"nunique"})

# Split the result into one table per behavior type
uv_pv_data = table.loc['pv',:].copy()
buy_data = table.loc['buy',:].copy()
cart_data = table.loc['cart',:].copy()
fav_data = table.loc['fav',:].copy()

# Rename the columns
uv_pv_data.columns=['pv','uv','item_num']
buy_data.columns=['buy_times','buyers','buy_items']
cart_data.columns=['cart_times','cart_users','cart_items']
fav_data.columns=['fav_times','fav_users','fav_items']

# Combine into one daily summary table
basic=pd.concat([uv_pv_data,buy_data,cart_data,fav_data],axis=1)

# Page views per user
basic['pv/uv'] = basic['pv']/basic['uv']

# Items browsed per user (distinct items viewed that day / UV)
basic['avg_browse_item'] = basic['item_num']/basic['uv']

# Share of visitors who purchased
basic['buyers_percent'] = basic['buyers'] /basic['uv']

# Purchases per buyer
basic['avg_buytimes'] = basic['buy_times'] /basic['buyers']

# Items purchased per buyer
basic['avg_buy_items'] = basic['buy_items'] /basic['buyers']

# Darker cells mean higher values
basic.style.background_gradient()

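The base table gives the daily overview:

PV, UV, purchase count, number of buyers, items purchased, add-to-cart count, add-to-cart users, items added to cart, favorite count, favoriting users, and items favorited all reach their 9-day maximum on 2017-12-02.
Page views per user: about 13
Items browsed per user: about 8
Browse-to-purchase conversion rate (buyers / UV): about 20%
Purchases per buyer: about 1.5
Items purchased per buyer: about 1.4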

Traffic metrics¶

uv_pv_data

# Page views per user (PV / UV)
uv_pv_data['avg_pv'] = uv_pv_data['pv']/uv_pv_data['uv']

# PV day-over-day change, in percent
uv_pv_data['pv_dayonday'] = uv_pv_data["pv"].diff()/uv_pv_data["pv"].shift(1)*100

# UV day-over-day change, in percent
uv_pv_data['uv_dayonday'] = uv_pv_data["uv"].diff()/uv_pv_data["uv"].shift(1)*100

# Fill NaN with 0 (the first day has no previous day to compare with)
uv_pv_data = uv_pv_data.fillna(0)
uv_pv_data.style.background_gradient()
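The day-over-day change computed with diff() and shift(1) is the same quantity that pandas exposes as pct_change(). A minimal sketch, using the PV and UV values for the last days of November and the first days of December from the daily base table in this document:

```python
import pandas as pd

# Daily PV and UV for three consecutive days, copied from the daily base table.
daily = pd.DataFrame(
    {"pv": [188693, 201860, 249080], "uv": [14047, 14490, 18094]},
    index=["2017-11-30", "2017-12-01", "2017-12-02"],
)

# pct_change() computes (x_t - x_{t-1}) / x_{t-1}; multiply by 100 for percent.
growth = (daily.pct_change() * 100).round(2)

print(growth.loc["2017-12-02"])  # pv about 23.39, uv about 24.87
```

This is where the figures of roughly 23% (PV) and 25% (UV) for December 2 come from.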

Daily PV and UV trends and their growth rates¶

fig,axes=plt.subplots(3,1,figsize=(12,18)) # create a figure with three rows and one column of axes

uv_pv_data['pv'].plot(ax=axes[0],marker='o',color='r',label="PV")
uv_pv_data['uv'].plot(ax=axes[1],marker='o',color='b',label="UV")

uv_pv_data['pv_dayonday'].plot(ax=axes[2],marker='o',color='r',label="PV日环比")
uv_pv_data['uv_dayonday'].plot(ax=axes[2],marker='o',color='b',label="UV日环比")

axes[0].set_title('每日 PV 趋势',fontsize=14)
axes[1].set_title('每日 UV 趋势',fontsize=14)
axes[2].set_title('PV、UV日环比',fontsize=14)
axes[0].set_xlabel('日期')
axes[1].set_xlabel('日期')
axes[2].set_xlabel('日期')
axes[0].legend()
axes[1].legend()
axes[2].legend()


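Daily PV and UV trend upward overall, peak on December 2, and fall back somewhat on December 3. On December 2 the day-over-day growth is about 23% for PV and about 25% for UV.
Both metrics rise sharply on December 2-3. December 2-3 is a weekend, but the previous weekend (November 25-26) shows no comparable rise, so the weekend can be ruled out as the cause. Because December 2-3 is close to the Double 12 (December 12) shopping festival, the more likely cause is extra traffic from the warm-up promotions for Double 12.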

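Funnel analysis¶

A typical Taobao session starts with browsing (pv); the user may then favorite an item (fav) or add it to the cart (cart), and finally purchase (buy). Neither fav nor cart is a required step before buying.

Favorites mainly let users save items they like but do not want to buy right away. The cart can serve a similar purpose, but it mainly makes it convenient to buy several items at once. Both features exist to improve the shopping experience. A favorited item still has to be either added to the cart or bought directly, so this funnel analysis leaves favorites out. Because the dataset has a limited set of behavior types, with no order-placement or payment step, only the conversions among browsing, adding to cart, and purchasing are analyzed.

Conversion nodes: browse, add to cart, purchase
Time window: 9 days
Conversion baseline: each step is the baseline for the next step
Conversion paths:
- Path 1: browse - add to cart - purchase
- Path 2: browse - purchase

Note¶

The analyzed data contains 19,543 users, of whom 19,462 have at least one browse (pv) record, so 81 users have none. Because records outside November 25 to December 3, 2017 were removed earlier, the most likely explanation is that these 81 users browsed before November 25, 2017. These 81 users are therefore excluded from the funnel.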

# Total number of users
total_users = user_behavior['user_id'].nunique()

# Number of users with at least one pv record (UV)
pv_users= user_behavior[user_behavior['behavior_type']=='pv']['user_id'].nunique()

print('总用户数:',total_users,'UV数:',pv_users)

总用户数: 19543 UV数: 19462

# All user IDs
total_users_list = user_behavior['user_id'].unique()

# IDs of users with at least one pv record
pv_users_list = user_behavior[user_behavior['behavior_type']=='pv']['user_id'].unique()

# Users with no pv record (set difference of the two)
np.setdiff1d(total_users_list,pv_users_list)

array([   1365,    1679,   10694,   11921,   14888,   16641,   16788,
        104460,  104642,  105479,  107243,  108073,  110487,  110938,
        111233,  111904,  112450,  114283,  115685,  117205,  117360,
        117708,  119999,  120134,  122578,  124506,  124587,  126039,
        126062,  128379,  131618,  134710,  134950,  136021,  137661,
        137972,  138088,  138810,  140376,  142109,  142305,  142573,
        142699,  142816,  143670,  144048,  145459,  147016,  147509,
        148260,  148792,  150764,  151163,  152354,  153975,  154034,
        155381,  157324,  157565,  158229,  161910,  162263,  163229,
        163293,  166860,  170144,  170449,  171007,  171362,  171407,
        172198,  172496,  172676, 1001148, 1003646, 1003767, 1004458,
       1005688, 1005897, 1009964, 1017910])

# Look at the records of one such user
user_behavior[user_behavior['user_id']==1679]

Preparing the subset¶

# Subset for the funnel analysis
funnel_data = user_behavior[["user_id","behavior_type"]]

# Deduplicate: keep each behavior type once per user, because the funnel only needs to know whether a behavior occurred, not how often
funnel_data = funnel_data.drop_duplicates()

# Join each user's behaviors into a path, in the order they first appear in the data
funnel_route = funnel_data.groupby('user_id')['behavior_type'].apply('-'.join).reset_index()

# Count the users on each path
funnel_route = funnel_route.groupby('behavior_type')['user_id'].nunique().reset_index()

# Rename the columns for readability
funnel_route.rename(columns={"behavior_type":"user_route","user_id":"user_amount"},inplace=True)
funnel_route
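The path built above follows the order in which each behavior first appears in the rows, so it reflects time order only if the rows are already sorted by time within each user. The sketch below makes that ordering explicit; the `events` frame is hypothetical toy data, deliberately stored out of order.

```python
import pandas as pd

# Hypothetical events for two users, deliberately not in time order.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 1],
        "behavior_type": ["cart", "pv", "buy", "pv", "pv", "pv"],
        "timestamp": pd.to_datetime(
            [
                "2017-11-25 10:05",
                "2017-11-25 10:00",
                "2017-11-25 10:30",
                "2017-11-26 09:00",
                "2017-11-26 09:10",
                "2017-11-25 10:20",
            ]
        ),
    }
)

paths = (
    events.sort_values(["user_id", "timestamp"])     # make the time order explicit
    .drop_duplicates(["user_id", "behavior_type"])  # keep the first occurrence of each behavior
    .groupby("user_id")["behavior_type"]
    .agg("-".join)                                  # e.g. 'pv-cart-buy'
)

print(paths.to_dict())  # {1: 'pv-cart-buy', 2: 'pv'}
```

Without the sort, user 1 would get the path cart-pv-buy, which is the kind of path that the splitting step in this notebook has to handle.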

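The result has 60 distinct paths. With respect to pv they fall into three cases:

No pv (excluded from the funnel analysis)
Contains pv
- Starts with pv (included in the funnel analysis)
- Does not start with pv (split first, then included)

This confirms the earlier observation that some users have no pv record; they are excluded from the funnel analysis. Some paths also have pv somewhere other than at the start. To analyze complete funnels, each such path is split into the part before pv and the part from pv onward, and only the part from pv onward is analyzed.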
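The split described above relies on a regular-expression lookahead: the hyphen is matched only when pv follows it, so pv itself stays in the second part. A minimal sketch with the standard re module; the helper name `from_first_pv` is hypothetical, and the sample paths are taken from the path table in this document.

```python
import re

def from_first_pv(route):
    """Return the part of a '-'-joined path that starts at 'pv'.

    A path that already starts with 'pv' (or contains no 'pv') is returned unchanged.
    """
    # '-(?=pv)' matches a hyphen only when 'pv' follows it (a lookahead).
    return re.split(r"-(?=pv)", route, maxsplit=1)[-1]

for route in ["pv-cart-buy", "buy-cart-pv-fav", "fav-pv-cart-buy", "buy-pv"]:
    print(route, "->", from_first_pv(route))
# pv-cart-buy -> pv-cart-buy
# buy-cart-pv-fav -> pv-fav
# fav-pv-cart-buy -> pv-cart-buy
# buy-pv -> pv
```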

# Drop the paths that contain no pv

funnel_route = funnel_route[funnel_route['user_route'].str.contains("pv")]
funnel_route

# Split the paths that do not start with pv

# Split on the hyphen that precedes pv (regular expression with a lookahead)
spliteDF = funnel_route['user_route'].str.split(r'-(?=pv)', n=1, expand=True)
spliteDF.columns = ['before-pv', 'after-pv']

spliteDF = pd.concat([funnel_route['user_amount'],spliteDF],axis=1)
spliteDF

def combine(series):
    # Use the part from pv onward when the path was split; otherwise the path already starts with pv
    if pd.isna(series['after-pv']):
        result = series['before-pv']
    else:
        result = series['after-pv']
    return result

spliteDF['route'] = spliteDF.apply(combine,axis=1)
combineDF = spliteDF[['user_amount','route']]
combineDF

# The funnel does not analyze favorites, so remove the fav step from each path

route_data = combineDF.copy()

route_data['route'] = route_data['route'].str.replace('-fav', '', regex=False)
route_data

# Classify each path into one of four types: pv-cart-buy, pv-cart, pv-buy, and only pv for everything else

def categorize(series):
    if "pv-cart-buy" in series['route']:
        result = "pv-cart-buy"
    elif "pv-cart" in series['route']:
        result = 'pv-cart'
    elif "pv-buy" in series['route']:
        result = 'pv-buy'
    else:
        result = 'only pv'
    return result


route_data['route_type'] = route_data.apply(categorize,axis=1)
route_data

# Aggregate by route_type

route_data = route_data.groupby('route_type')['user_amount'].sum().reset_index()
route_data

# Build the two funnel tables

# Users who browsed
pv_users= user_behavior[user_behavior['behavior_type']=='pv']['user_id'].nunique()

# Users who added to cart after browsing
pv_cart_users = route_data[route_data['route_type']=='pv-cart'].user_amount.sum() + route_data[route_data['route_type']=='pv-cart-buy'].user_amount.sum()

# Users who added to cart after browsing and then purchased
pv_cart_buy_users = route_data[route_data['route_type']=='pv-cart-buy'].user_amount.sum()

# Build the browse - add to cart - purchase funnel table
funnel_table_1 = pd.DataFrame({"node":["浏览","加入购物车","购买"],"users":[pv_users,pv_cart_users,pv_cart_buy_users]})

# Users who purchased after browsing without adding to cart first
pv_buy_users = route_data[route_data['route_type']=='pv-buy'].user_amount.sum()

# Build the browse - purchase funnel table
funnel_table_2 = pd.DataFrame({"node":["浏览","购买"],"users":[pv_users,pv_buy_users]})

# Conversion rate at each node of the first funnel

funnel_table_1['conversion_rate'] = round(funnel_table_1["users"]/funnel_table_1["users"].shift(1)*100,2)
# The first node has no previous step (NaN); set it to 100
funnel_table_1 = funnel_table_1.fillna(100)
funnel_table_1

# Conversion rate at each node of the second funnel

funnel_table_2['conversion_rate'] = round(funnel_table_2["users"]/funnel_table_2["users"].shift(1)*100,2)
# The first node has no previous step (NaN); set it to 100
funnel_table_2 = funnel_table_2.fillna(100)
funnel_table_2

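# Build the (node, conversion rate) pairs that the funnel chart expects

def funnel_pairdata(data):
    # func: build the list of (node, conversion_rate) pairs for the funnel chart
    # data: funnel table with 'node' and 'conversion_rate' columns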
    rate_list = []
    for i in range(len(data)):
        x = data.loc[i,'node']
        y = data.loc[i,'conversion_rate']
        rate_list.append((x,y))
    return rate_list

funnel_pairdata(funnel_table_1)

[('浏览', 100.0), ('加入购物车', 53.09), ('购买', 59.22)]

Conversion funnel: browse - add to cart - purchase¶

# Funnel chart: browse - add to cart - purchase
funnel = Funnel(init_opts=opts.InitOpts(width="800px", height="400px"))

funnel.add(
    series_name='环节',
    data_pair=funnel_pairdata(funnel_table_1),  
    sort_= "none",
    tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b} : {c}%"),
    label_opts=opts.LabelOpts(is_show=True, position="inside",formatter="{b} {c}%"))

funnel.set_global_opts(
    title_opts=opts.TitleOpts(title="浏览-加购物车-购买三大环节的转化漏斗", subtitle="转化率计算方式：每一步作为下一步的基线",pos_left="center"),
    legend_opts=opts.LegendOpts(pos_left=0,pos_bottom="50%",orient="vertical",item_height=12))
    
funnel.render_notebook()


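Path 1: browse - add to cart - purchase

Over the 9 days, 53.09% of browsing users added an item to the cart after browsing;
59.22% of those users then purchased after adding to the cart.
Because the dataset covers only a short time range, it is not yet possible to tell whether these step conversion rates are normal; a final judgment requires comparison with historical data.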
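The two percentages follow directly from the user counts in the funnel table, with each step divided by the step before it. The sketch below reproduces that arithmetic in plain Python; the helper name `step_conversion` is hypothetical, and the end-to-end rate at the bottom is an extra figure that this notebook does not report.

```python
def step_conversion(counts):
    """Conversion rate of each funnel step relative to the previous step, in percent."""
    rates = [100.0]  # the first node is its own baseline
    for previous, current in zip(counts, counts[1:]):
        rates.append(round(current / previous * 100, 2))
    return rates

# User counts from the funnel table: browse, add to cart, purchase.
users = [19462, 10333, 6119]

print(step_conversion(users))                # [100.0, 53.09, 59.22]
print(round(users[-1] / users[0] * 100, 2))  # 31.44, browse-to-purchase via the cart
```

Because each step uses the previous step as its baseline, the 59.22% applies to users who added to the cart, not to all browsing users.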

Conversion funnel: browse - purchase¶

# Funnel chart: browse - purchase
funnel = Funnel(init_opts=opts.InitOpts(width="800px", height="400px"))

funnel.add(
    series_name='环节',
    data_pair=funnel_pairdata(funnel_table_2),        
    tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b} : {c}%"),
    label_opts=opts.LabelOpts(is_show=True, position="inside",formatter="{b} {c}%"))

funnel.set_global_opts(
    title_opts=opts.TitleOpts(title="浏览-购买两大环节的转化漏斗", subtitle="转化率计算方式：每一步作为下一步的基线",pos_left="center"),
    legend_opts=opts.LegendOpts(pos_left=0,pos_bottom="50%",orient="vertical",item_height=12))
    
funnel.render_notebook()


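Path 2: browse - purchase

Over the 9 days, 32.5% of browsing users went from browsing to purchase without first adding to the cart;
Because the dataset covers only a short time range, it is not yet possible to tell whether this conversion rate is normal; a final judgment requires comparison with historical data.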

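User behavior path analysis¶

Preparing the subset¶

# Build the subset: total users per path (paths start at pv; favorites are kept here)
route_list = combineDF.copy()
route_list = route_list.groupby('route')['user_amount'].sum().reset_index()
route_list

Some users added items to the cart or favorited items after purchasing. One possible explanation is that pages shown after a purchase, and not only the product page, offer add-to-cart and favorite entry points. Note that paths are built per user across all items, so these later actions may also concern other items.

To see complete user paths, each path is taken from browsing as the start to the purchase as the end.

# Truncate the paths at buy

# Split on the hyphen that follows buy (regular expression with a lookbehind)
splitedata = route_list['route'].str.split(r'(?<=buy)-', n=1, expand=True)
splitedata.columns = ['before-buy', 'after-buy']
splitedata = pd.concat([route_list['user_amount'],splitedata['before-buy']],axis=1)

# Aggregate
splitedata = splitedata.groupby('before-buy').user_amount.sum().reset_index()

# Rename the column for readability
splitedata.rename(columns={"before-buy":"routes"},inplace=True)


splitedata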

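# Sort in descending order
categorized_route = splitedata.sort_values(by ="user_amount",ascending=False)

# Share of each path among all browsing users
categorized_route['percent'] = categorized_route['user_amount']/route_list['user_amount'].sum()

format_dict = {'percent': '{:.2%}'}
(categorized_route.style
              .format(format_dict)
              .bar(color='lightgreen', vmin=0, subset=['percent'], align='zero')
)

User behavior path distribution¶

Paths start at browsing and are cut off at the purchase. The resulting table has 10 paths, including the browse-only path. Over the 9 days, the three most common paths are browse - purchase, browse - add to cart - purchase, and browse - add to cart, with shares of 25.18%, 23.99%, and 15.38% (64.55% combined). Each of the remaining paths accounts for between 3% and 9% of users.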

Usage of the cart, favorites, and purchase features¶

def usage(data):
    fav_usage=0
    cart_usage=0
    buy_usage=0
    for i in range(len(data)):
        if "fav" in data.loc[i,'routes']:
            fav_usage += data.loc[i,'user_amount']
        if "cart" in data.loc[i,'routes']:
            cart_usage +=data.loc[i,'user_amount']
        if "buy" in data.loc[i,'routes']:
            buy_usage +=data.loc[i,'user_amount']
    usage_data= pd.DataFrame({"func":["收藏","加入购物车","购买"],"users":[fav_usage,cart_usage,buy_usage]})
    return usage_data

func_usage = usage(categorized_route)
func_usage

# Share relative to the number of browsing users
func_usage['percent'] = func_usage['users']/pv_users

# Sort in descending order
func_usage = func_usage.sort_values(by ="users",ascending=False)

format_dict = {'percent': '{:.2%}'}
(func_usage.style
              .format(format_dict)
              .bar(color='lightgreen', vmin=0, subset=['percent'], align='zero')
)

Over the 9 days, 63.94% of browsing users made a purchase, 53.09% used the cart, and 26.44% used favorites.
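The same totals can be obtained without an explicit loop by filtering the path names with str.contains. This is a sketch of an alternative, not the notebook's own method; the path table is copied from the truncated-path output in this document.

```python
import pandas as pd

# Path table copied from the truncated-path output in this document.
routes = pd.DataFrame(
    {
        "routes": ["pv", "pv-buy", "pv-cart", "pv-cart-buy", "pv-cart-fav",
                   "pv-cart-fav-buy", "pv-fav", "pv-fav-buy", "pv-fav-cart", "pv-fav-cart-buy"],
        "user_amount": [1754, 4900, 2994, 4669, 610, 679, 1050, 1425, 610, 771],
    }
)

pv_users = int(routes["user_amount"].sum())  # every path starts with pv

# For each feature, sum the users on the paths whose name contains it.
usage_counts = {
    feature: int(routes.loc[routes["routes"].str.contains(feature), "user_amount"].sum())
    for feature in ["fav", "cart", "buy"]
}
share = {feature: round(count / pv_users * 100, 2) for feature, count in usage_counts.items()}

print(usage_counts)  # {'fav': 5145, 'cart': 10333, 'buy': 12444}
print(share)         # {'fav': 26.44, 'cart': 53.09, 'buy': 63.94}
```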

Time from favoriting or adding to cart to purchase¶

The distribution of the time between favoriting or adding to the cart and purchasing lets operations staff decide when to send users a reminder.

# Select the records of each behavior type
df_fav = user_behavior[user_behavior["behavior_type"]=="fav"][["user_id","item_id","timestamp"]]
df_cart = user_behavior[user_behavior["behavior_type"]=="cart"][["user_id","item_id","timestamp"]]
df_buy = user_behavior[user_behavior["behavior_type"]=="buy"][["user_id","item_id","timestamp"]]

# Join cart (or fav) records with buy records of the same user and item
df_cart_buy = pd.merge(df_cart,df_buy,how="inner",on=["user_id","item_id"],suffixes=('_cart', '_buy'))
df_cart_buy["diff_time"] = df_cart_buy["timestamp_buy"]-df_cart_buy["timestamp_cart"]

df_fav_buy = pd.merge(df_fav,df_buy,how="inner",on=["user_id","item_id"],suffixes=('_fav', '_buy'))
df_fav_buy["diff_time"] = df_fav_buy["timestamp_buy"]-df_fav_buy["timestamp_fav"]

# Convert the time difference to hours
df_cart_buy['diff_time'] = df_cart_buy['diff_time'].map(lambda x : x.days * 24 +x.seconds/3600)
df_fav_buy['diff_time'] = df_fav_buy['diff_time'].map(lambda x : x.days * 24 +x.seconds/3600)

# Keep non-negative differences only. A negative difference means the user bought the item first and added it to the cart (or favorited it) afterwards
df_cart_buy = df_cart_buy[df_cart_buy["diff_time"]>=0]
df_fav_buy = df_fav_buy[df_fav_buy["diff_time"]>=0]
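The hour conversion can also be written as one vectorized step with the timedelta accessor. A minimal sketch on hypothetical toy frames (`cart`, `buy`); the first pair of timestamps matches one row of the merged output shown in this document, which gives 21.298333 hours.

```python
import pandas as pd

# Hypothetical cart and buy events for one user and item.
cart = pd.DataFrame({"user_id": [7], "item_id": [42],
                     "timestamp": pd.to_datetime(["2017-11-28 11:58:17"])})
buy = pd.DataFrame({"user_id": [7, 7], "item_id": [42, 42],
                    "timestamp": pd.to_datetime(["2017-11-29 09:16:11", "2017-11-27 08:00:00"])})

pairs = cart.merge(buy, on=["user_id", "item_id"], suffixes=("_cart", "_buy"))

# Timedelta to hours in one vectorized step.
pairs["diff_hours"] = (pairs["timestamp_buy"] - pairs["timestamp_cart"]).dt.total_seconds() / 3600

# Keep purchases that happened at or after the add-to-cart event.
pairs = pairs[pairs["diff_hours"] >= 0]

print(pairs["diff_hours"].round(4).tolist())  # [21.2983]
```

The merge pairs every cart event with every buy event for the same user and item, so a user who buys the same item twice contributes two rows. The counts derived from this table are therefore event pairs rather than distinct users.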

df_cart_buy

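# Bin the time differences (in hours) into 9 bins: eight 24-hour bins and a final 48-hour bin
bins = [0,24,48,72,96,120,144,168,192,240]
df_cart_buy["time_bins"] = pd.cut(df_cart_buy["diff_time"],bins=bins,include_lowest=False) # include_lowest=False: intervals are (left, right], so a difference of exactly 0 is not binned

bins = [0,24,48,72,96,120,144,168,192,240]
df_fav_buy["time_bins"] = pd.cut(df_fav_buy["diff_time"],bins=bins,include_lowest=False) # include_lowest=False: intervals are (left, right], so a difference of exactly 0 is not binned

# Count per bin
cart_df = df_cart_buy.groupby("time_bins").user_id.count()
fav_df = df_fav_buy.groupby("time_bins").user_id.count()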

cart_df

time_bins
(0, 24]       4362
(24, 48]      1401
(48, 72]       682
(72, 96]       381
(96, 120]      211
(120, 144]     134
(144, 168]      87
(168, 192]      44
(192, 240]      14
Name: user_id, dtype: int64

# Plot: distribution of the time from add-to-cart to purchase
plt.figure(figsize=(12,8))

rects = plt.bar(range(0,len(cart_df.index)),cart_df.values)
plt.xticks(range(0,len(cart_df.index)),cart_df.index)
plt.xlabel("单位：h",fontsize=10)
plt.title("加购物车—购买时间间隔分布(9天内)",fontsize=15)

# Data labels: count and share above each bar
total=cart_df.sum()
for rect in rects:
    height = rect.get_height()
    x_center = rect.get_x() + rect.get_width()/2
    plt.text(x_center,height+50,str(height)+'人', ha='center',fontsize=11)
    plt.text(x_center,height+140,'{:.2f}%'.format(height/total*100),ha='center',fontsize=11)

# Plot: distribution of the time from favoriting to purchase
plt.figure(figsize=(12,8))

rects = plt.bar(range(0,len(fav_df.index)),fav_df.values)
plt.xticks(range(0,len(fav_df.index)),fav_df.index)
plt.xlabel("单位：h",fontsize=10)
plt.title("收藏—购买时间间隔分布(9天内)",fontsize=15)

# Data labels: count and share above each bar
total=fav_df.sum()
for rect in rects:
    height = rect.get_height()
    x_center = rect.get_x() + rect.get_width()/2
    plt.text(x_center,height+30,str(height)+'人', ha='center',fontsize=11)
    plt.text(x_center,height+70,'{:.2f}%'.format(height/total*100),ha='center',fontsize=11)

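Over the 9 days, among users who added an item to the cart and later bought it, close to 60% bought within 24 hours and 88.09% bought within 3 days.
Among users who favorited an item and later bought it, 77.13% bought within 24 hours and 92.24% bought within 3 days.
If a user has still not purchased 3 days after favoriting or adding to the cart, their purchase intent is probably weak. At that point a timely reminder or a promotional offer can nudge them to buy, or similar items can be recommended as alternatives.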
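The 24-hour and 3-day figures for the cart are cumulative shares of the per-bin counts. The sketch below recomputes them from the cart_df output shown above; only the add-to-cart counts are included, because the favorites counts are not printed in this document.

```python
import pandas as pd

# Counts per time bin for add-to-cart to purchase, copied from the cart_df output above.
cart_counts = pd.Series(
    [4362, 1401, 682, 381, 211, 134, 87, 44, 14],
    index=["(0, 24]", "(24, 48]", "(48, 72]", "(72, 96]", "(96, 120]",
           "(120, 144]", "(144, 168]", "(168, 192]", "(192, 240]"],
)

# Cumulative share of purchases completed by the end of each bin, in percent.
cumulative_share = (cart_counts.cumsum() / cart_counts.sum() * 100).round(2)

# About 59.62% within 24 hours, 78.77% within 48 hours, 88.09% within 72 hours.
print(cumulative_share.head(3))
```

Note that the counts are rows of the merged cart and buy table, so each share is a share of add-to-cart and purchase pairs.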


	user_id	item_id	category_id	behavior_type	timestamp
0	1	2268318	2520377	pv	1511544070
1	1	2333346	2520771	pv	1511561733
2	1	2576651	149192	pv	1511572885
3	1	3830808	4181361	pv	1511593493
4	1	4365585	2520377	pv	1511596146

	user_id	item_id	category_id	behavior_type	timestamp
0	1	2268318	2520377	pv	2017-11-24 17:21:10
1	1	2333346	2520771	pv	2017-11-24 22:15:33
2	1	2576651	149192	pv	2017-11-25 01:21:25
3	1	3830808	4181361	pv	2017-11-25 07:04:53
4	1	4365585	2520377	pv	2017-11-25 07:49:06

	user_id	item_id	category_id	behavior_type	timestamp	date	year_month	hour
count	2.000000e+06	2.000000e+06	2.000000e+06	2000000	2000000	2000000	2000000	2.000000e+06
unique	NaN	NaN	NaN	4	627028	32	7	NaN
top	NaN	NaN	NaN	pv	2017-11-28 08:02:20	2017-12-02	2017-11	NaN
freq	NaN	NaN	NaN	1791216	46	278200	1253884	NaN
first	NaN	NaN	NaN	NaN	1970-01-01 12:13:36	NaN	NaN	NaN
last	NaN	NaN	NaN	NaN	2018-08-28 10:27:12	NaN	NaN	NaN
mean	2.837831e+05	2.577227e+06	2.712752e+06	NaN	NaN	NaN	NaN	9.631077e+00
std	3.421325e+05	1.487724e+06	1.466843e+06	NaN	NaN	NaN	NaN	5.568526e+00
min	1.000000e+00	2.600000e+01	2.171000e+03	NaN	NaN	NaN	NaN	0.000000e+00
25%	1.165960e+05	1.294040e+06	1.349561e+06	NaN	NaN	NaN	NaN	5.000000e+00
50%	1.418930e+05	2.574775e+06	2.723483e+06	NaN	NaN	NaN	NaN	1.000000e+01
75%	1.672790e+05	3.859124e+06	4.145813e+06	NaN	NaN	NaN	NaN	1.400000e+01
max	1.018011e+06	5.163067e+06	5.161669e+06	NaN	NaN	NaN	NaN	2.300000e+01

	pv	uv	item_num	buy_times	buyers	buy_items	cart_times	cart_users	cart_items	fav_times	fav_users	fav_items	pv/uv	avg_browse_item	buyers_percent	avg_buytimes	avg_buy_items
date
2017-11-25	188220	13500	118798	3957	2627	3720	11433	4710	10736	6035	2318	5844	13.942222	8.799852	0.194593	1.506281	1.416064
2017-11-26	188362	13805	119083	4188	2719	3944	11577	4928	10898	6201	2359	6031	13.644477	8.626078	0.196958	1.540272	1.450533
2017-11-27	177142	13574	114914	4341	2808	4074	10735	4701	10143	5755	2259	5610	13.050096	8.465743	0.206866	1.545940	1.450855
2017-11-28	176869	13628	114719	4202	2730	3898	10872	4657	10299	5757	2318	5602	12.978353	8.417890	0.200323	1.539194	1.427839
2017-11-29	181777	13835	117177	4499	2882	4220	10915	4817	10354	5923	2383	5765	13.138923	8.469606	0.208312	1.561069	1.464261
2017-11-30	188693	14047	119162	4490	2949	4167	11516	4921	10838	5796	2378	5636	13.432975	8.483092	0.209938	1.522550	1.413021
2017-12-01	201860	14490	126604	4375	2841	4085	12988	5354	12175	6148	2445	5978	13.930987	8.737336	0.196066	1.539951	1.437874
2017-12-02	249080	18094	151510	5082	3425	4742	15903	6572	14730	8135	3027	7817	13.765889	8.373494	0.189289	1.483796	1.384526
2017-12-03	216918	17594	135921	4773	3258	4466	13758	5869	12926	7074	2823	6872	12.329089	7.725418	0.185177	1.465009	1.370780

	pv	uv	item_num
date
2017-11-25	188220	13500	118798
2017-11-26	188362	13805	119083
2017-11-27	177142	13574	114914
2017-11-28	176869	13628	114719
2017-11-29	181777	13835	117177
2017-11-30	188693	14047	119162
2017-12-01	201860	14490	126604
2017-12-02	249080	18094	151510
2017-12-03	216918	17594	135921

	user_id	item_id	category_id	behavior_type	timestamp	date	year_month	hour
1854691	1679	1491518	2355072	cart	2017-11-25 01:44:03	2017-11-25	2017-11	1
1854692	1679	1048127	4789432	cart	2017-11-26 14:12:42	2017-11-26	2017-11	14
1854693	1679	3987147	4789432	cart	2017-11-26 14:14:47	2017-11-26	2017-11	14
1854694	1679	5099698	2885642	buy	2017-11-26 14:17:43	2017-11-26	2017-11	14
1854695	1679	3987147	4789432	buy	2017-11-28 00:59:54	2017-11-28	2017-11	0
1854696	1679	1808917	1320293	cart	2017-11-28 01:47:20	2017-11-28	2017-11	1
1854697	1679	4663016	982926	cart	2017-11-28 01:49:37	2017-11-28	2017-11	1
1854698	1679	624903	1080785	cart	2017-11-28 08:09:08	2017-11-28	2017-11	8
1854699	1679	2490581	1817004	cart	2017-11-28 08:49:05	2017-11-28	2017-11	8
1854700	1679	3333944	1817004	cart	2017-11-28 09:55:39	2017-11-28	2017-11	9
1854701	1679	2998650	4357323	cart	2017-11-28 10:46:31	2017-11-28	2017-11	10
1854702	1679	637045	4357323	cart	2017-11-28 10:49:14	2017-11-28	2017-11	10
1854703	1679	2445510	411153	cart	2017-11-28 11:00:37	2017-11-28	2017-11	11
1854704	1679	39498	3673049	cart	2017-11-28 11:50:27	2017-11-28	2017-11	11
1854705	1679	2868665	3673049	cart	2017-11-28 12:11:08	2017-11-28	2017-11	12
1854706	1679	1523499	411153	cart	2017-11-28 12:45:27	2017-11-28	2017-11	12
1854707	1679	2882647	3673049	cart	2017-11-29 03:40:31	2017-11-29	2017-11	3
1854708	1679	2961294	2355072	cart	2017-11-29 03:45:10	2017-11-29	2017-11	3
1854709	1679	3726128	3673049	cart	2017-11-29 03:46:03	2017-11-29	2017-11	3
1854710	1679	4484120	2355072	cart	2017-11-29 04:05:45	2017-11-29	2017-11	4
1854711	1679	2287412	2355072	cart	2017-11-29 12:03:08	2017-11-29	2017-11	12
1854712	1679	3159038	2355072	cart	2017-11-29 12:04:30	2017-11-29	2017-11	12
1854713	1679	1503702	4756105	cart	2017-11-29 12:29:02	2017-11-29	2017-11	12
1854714	1679	2263826	982926	cart	2017-11-30 02:56:56	2017-11-30	2017-11	2
1854715	1679	3159978	4756105	cart	2017-11-30 02:59:42	2017-11-30	2017-11	2
1854716	1679	390841	3673049	cart	2017-11-30 03:44:07	2017-11-30	2017-11	3
1854717	1679	148411	2355072	cart	2017-11-30 10:26:45	2017-11-30	2017-11	10
1854718	1679	3032345	2131531	cart	2017-11-30 10:49:16	2017-11-30	2017-11	10
1854719	1679	1191753	2131531	cart	2017-11-30 12:05:32	2017-11-30	2017-11	12
1854720	1679	5148912	2131531	cart	2017-12-01 09:44:25	2017-12-01	2017-12	9
1854721	1679	3606439	2131531	cart	2017-12-01 12:21:34	2017-12-01	2017-12	12
1854722	1679	2927801	2131531	cart	2017-12-01 12:29:23	2017-12-01	2017-12	12
1854723	1679	1352908	2131531	cart	2017-12-01 12:29:49	2017-12-01	2017-12	12
1854724	1679	3971945	2465336	cart	2017-12-01 12:30:57	2017-12-01	2017-12	12
1854725	1679	1184759	2885642	cart	2017-12-02 10:22:14	2017-12-02	2017-12	10
1854726	1679	1594562	2885642	cart	2017-12-02 10:25:20	2017-12-02	2017-12	10
1854727	1679	699950	2885642	cart	2017-12-02 10:30:02	2017-12-02	2017-12	10
1854728	1679	4183606	4801426	cart	2017-12-03 08:12:33	2017-12-03	2017-12	8
1854729	1679	2141661	2520377	cart	2017-12-03 08:22:45	2017-12-03	2017-12	8

	user_route	user_amount
0	buy	1
1	buy-cart	8
2	buy-cart-fav	2
3	buy-cart-pv	44
4	buy-cart-pv-fav	8
5	buy-fav	1
6	buy-fav-cart	1
7	buy-fav-cart-pv	2
8	buy-fav-pv	9
9	buy-fav-pv-cart	15
10	buy-pv	137
11	buy-pv-cart	316
12	buy-pv-cart-fav	57
13	buy-pv-fav	72
14	buy-pv-fav-cart	63
15	cart	8
16	cart-buy	20
17	cart-buy-fav	6
18	cart-buy-pv	26
19	cart-buy-pv-fav	8
20	cart-fav	2
21	cart-fav-buy	5
22	cart-fav-pv	11
23	cart-fav-pv-buy	16
24	cart-pv	260
25	cart-pv-buy	579
26	cart-pv-buy-fav	118
27	cart-pv-fav	79
28	cart-pv-fav-buy	116
29	fav	1
30	fav-buy	4
31	fav-buy-cart	8
32	fav-buy-pv	4
33	fav-buy-pv-cart	6
34	fav-cart	3
35	fav-cart-buy	11
36	fav-cart-buy-pv	4
37	fav-cart-pv	11
38	fav-cart-pv-buy	15
39	fav-pv	81
40	fav-pv-buy	119
41	fav-pv-buy-cart	84
42	fav-pv-cart	103
43	fav-pv-cart-buy	158
44	pv	1165
45	pv-buy	1220
46	pv-buy-cart	1636
47	pv-buy-cart-fav	374
48	pv-buy-fav	410
49	pv-buy-fav-cart	329
50	pv-cart	2554
51	pv-cart-buy	3754
52	pv-cart-buy-fav	757
53	pv-cart-fav	553
54	pv-cart-fav-buy	679
55	pv-fav	883
56	pv-fav-buy	798
57	pv-fav-buy-cart	511
58	pv-fav-cart	547
59	pv-fav-cart-buy	771

	user_amount	route	route_type
3	44	pv	only pv
4	8	pv	only pv
7	2	pv	only pv
8	9	pv	only pv
9	15	pv-cart	pv-cart
10	137	pv	only pv
11	316	pv-cart	pv-cart
12	57	pv-cart	pv-cart
13	72	pv	only pv
14	63	pv-cart	pv-cart
18	26	pv	only pv
19	8	pv	only pv
22	11	pv	only pv
23	16	pv-buy	pv-buy
24	260	pv	only pv
25	579	pv-buy	pv-buy
26	118	pv-buy	pv-buy
27	79	pv	only pv
28	116	pv-buy	pv-buy
32	4	pv	only pv
33	6	pv-cart	pv-cart
36	4	pv	only pv
37	11	pv	only pv
38	15	pv-buy	pv-buy
39	81	pv	only pv
40	119	pv-buy	pv-buy
41	84	pv-buy-cart	pv-buy
42	103	pv-cart	pv-cart
43	158	pv-cart-buy	pv-cart-buy
44	1165	pv	only pv
45	1220	pv-buy	pv-buy
46	1636	pv-buy-cart	pv-buy
47	374	pv-buy-cart	pv-buy
48	410	pv-buy	pv-buy
49	329	pv-buy-cart	pv-buy
50	2554	pv-cart	pv-cart
51	3754	pv-cart-buy	pv-cart-buy
52	757	pv-cart-buy	pv-cart-buy
53	553	pv-cart	pv-cart
54	679	pv-cart-buy	pv-cart-buy
55	883	pv	only pv
56	798	pv-buy	pv-buy
57	511	pv-buy-cart	pv-buy
58	547	pv-cart	pv-cart
59	771	pv-cart-buy	pv-cart-buy

	node	users	conversion_rate
0	浏览	19462	100.00
1	加入购物车	10333	53.09
2	购买	6119	59.22

	route	user_amount
0	pv	1754
1	pv-buy	1949
2	pv-buy-cart	1720
3	pv-buy-cart-fav	374
4	pv-buy-fav	528
5	pv-buy-fav-cart	329
6	pv-cart	2994
7	pv-cart-buy	3912
8	pv-cart-buy-fav	757
9	pv-cart-fav	610
10	pv-cart-fav-buy	679
11	pv-fav	1050
12	pv-fav-buy	914
13	pv-fav-buy-cart	511
14	pv-fav-cart	610
15	pv-fav-cart-buy	771

	routes	user_amount
0	pv	1754
1	pv-buy	4900
2	pv-cart	2994
3	pv-cart-buy	4669
4	pv-cart-fav	610
5	pv-cart-fav-buy	679
6	pv-fav	1050
7	pv-fav-buy	1425
8	pv-fav-cart	610
9	pv-fav-cart-buy	771

	routes	user_amount	percent
1	pv-buy	4900	25.18%
3	pv-cart-buy	4669	23.99%
2	pv-cart	2994	15.38%
0	pv	1754	9.01%
7	pv-fav-buy	1425	7.32%
6	pv-fav	1050	5.40%
9	pv-fav-cart-buy	771	3.96%
5	pv-cart-fav-buy	679	3.49%
4	pv-cart-fav	610	3.13%
8	pv-fav-cart	610	3.13%

	func	users	percent
2	购买	12444	63.94%
1	加入购物车	10333	53.09%
0	收藏	5145	26.44%

	user_id	item_id	timestamp_cart	timestamp_buy	diff_time
0	1000060	5129067	2017-11-29 06:22:04	2017-12-01 03:27:55	45.097500
1	1000061	3085401	2017-11-30 07:12:00	2017-12-03 14:42:53	79.514722
2	1000061	1388090	2017-12-01 02:24:30	2017-12-03 14:13:00	59.808333
3	1000085	4382786	2017-11-28 11:58:17	2017-11-29 09:16:11	21.298333
4	1000103	808592	2017-11-25 11:13:06	2017-11-25 11:19:13	0.101944
...	...	...	...	...	...
7934	174022	4103382	2017-12-01 14:52:18	2017-12-01 15:04:56	0.210556
7935	17404	4071470	2017-11-29 05:34:05	2017-11-29 05:43:29	0.156667
7937	174044	2642191	2017-11-29 05:43:48	2017-12-02 04:39:57	70.935833
7938	174049	3431150	2017-12-02 09:55:10	2017-12-02 16:51:14	6.934444
7939	174049	3431150	2017-12-02 09:55:10	2017-12-03 07:40:56	21.762778