Beijing Housing Price Prediction

Kaggle Data

Posted by Gavin on March 16, 2019

The sun sets and the road is long; what world is this we live in?

Once the general is gone, the great tree withers.

Overview

Having worked through the California housing price prediction model, I was itching to practice, so I found a Beijing housing price dataset on Kaggle to try my hand at.


Workflow

Experimental data

From Kaggle I chose a dataset of Beijing housing prices, which collects listings from Lianjia (链家网) covering 2011–2017.

Download and preview the data

Download and extract the data

Preview the data

Each row represents one house, and each house has 26 attributes. A few of them deserve notes:
DOM: days on market
followers: number of people following the listing
totalPrice: total price of the house
price: price per square meter
floor: floor information; contains Chinese text, so it needs care during processing
buildingType: building type, including tower, bungalow, duplex and show home
renovationCondition: renovation condition, including other, rough shell, simple and refined decoration
buildingStructure: building structure, including unknown, mixed, brick-and-wood, brick-and-concrete, steel and steel-concrete
ladderRatio: number of elevators per resident
fiveYearsProperty: five-year ownership status
district: district, categorical

Reading and initial analysis of the data

  1. Read the data
    Reading the data raised an error. Suspecting an encoding problem, I checked the file encoding:

     file new.csv  
     new.csv: ISO-8859 text, with CRLF line terminators
    

    The file is ISO-8859-encoded, so I re-saved it as UTF-8, after which it loaded successfully. (Alternatively, the encoding can be passed to pandas directly; see the sketch after this list.)

  2. Inspect the data structure and summary statistics

    Unlike the California dataset, this one contains a lot of non-numeric data. There are 318,851 instances in total; DOM, buildingType, elevator, fiveYearsProperty, subway and communityAverage have missing values. DOM is missing too often, so it can be dropped. url, id and Cid have no influence on the price and can be ignored outright. My prediction target is the total house price, so the per-square-meter price can be dropped as well.
  3. Look at the overall shape of the data

    Plot a frequency histogram for each attribute.

    A large number of the attributes turn out to be discrete; the continuous ones are DOM, Lat, Lng, communityAverage, followers and square.

     import pandas as pd
     import matplotlib.pyplot as plt

     def load_housing_data(file_path):
         return pd.read_csv(file_path, sep=',', low_memory=False)

     def check_attributes(housing):
         # Print the value counts of every column to spot discrete and dirty values
         attributes = list(housing)
         for attr in attributes:
             print(housing[attr].value_counts())

     if __name__ == '__main__':
         housing = load_housing_data('new.csv')
         # url and id do not influence the price; price (per square meter) is dropped
         # because totalPrice is the prediction target
         housing = housing.drop(['url', 'id', 'price'], axis=1)
         check_attributes(housing)
         print(housing.describe())
         housing.hist(bins=50, figsize=(20, 15))
         plt.savefig('housing_distribution.png')
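
    As noted in step 1, the file can also be read without re-saving it, by passing the encoding to pandas directly. A minimal sketch; the candidate encodings are assumptions (the `file` tool only reports ISO-8859, and Chinese CSVs are often actually GBK):

     # Try a few candidate encodings instead of converting the file by hand
     for enc in ('utf-8', 'gbk', 'ISO-8859-1'):
         try:
             housing = pd.read_csv('new.csv', sep=',', low_memory=False, encoding=enc)
             print('loaded with encoding:', enc)
             break
         except UnicodeDecodeError:
             continue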
    

Creating the test set

20% of the dataset is held out as the test set. Since the data has a district attribute, it conveniently serves as the basis for stratified sampling. After splitting, check whether the test-set distribution matches the original data; a small proportion check is sketched right after the code below.

#split the train and test set
from sklearn.model_selection import StratifiedShuffleSplit

# Stratify on the district attribute so the test set mirrors its distribution
spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in spliter.split(housing, housing['district']):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]
test_set.hist(bins=50, figsize=(20,15))
plt.savefig('test.png')
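
Besides the histograms, the district proportions can be compared directly. A minimal sketch, assuming housing and test_set from above:

#compare district proportions between the full data and the test set
district_proportions = pd.DataFrame({
    'overall': housing['district'].value_counts(normalize=True),
    'test_set': test_set['district'].value_counts(normalize=True),
})
print(district_proportions)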

Data exploration and visualization

First, set the test set aside and explore the training set.

  1. Visualize the geographic data

    Adjust the alpha parameter to see the density of the listings.

    Beijing's market really is something: transaction volume is huge in every district.
  2. Visualize district and price together

    In the plot each circle's radius represents the price and its color the district, which gives a basic picture of where the listings cluster.

    Prices are roughly level across districts, mostly concentrated below 25 million CNY, with very few extreme outliers.

     #explore the data
     housing = train_set.copy()

     # Plain scatter of the coordinates
     housing.plot(kind='scatter', x='Lat', y='Lng')
     plt.savefig('gregrophy.png')

     # Lower the alpha to show where listings are densest
     housing.plot(kind='scatter', x='Lat', y='Lng', alpha=0.1)
     plt.savefig('gregrophy_more.png')

     # Circle size encodes the total price, colour encodes the district
     plt.figure()
     fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
         s=housing['totalPrice']/100, label='Price',
         c=housing['district'], cmap=plt.get_cmap('jet'))
     plt.colorbar(fig)
     plt.legend()
     plt.savefig('gregrophy_district_value.png')

     # Colour encodes the total price
     plt.figure()
     fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
         c=housing['totalPrice'], cmap=plt.get_cmap('jet'))
     plt.colorbar(fig)
     plt.savefig('gregrophy_price_value.png')
    
  3. Plot price over time
    Beijing prices started soaring around 2010, and a downward trend appears in 2018.

    The boxplot of prices from 2002 to 2018 shows not too many outliers, and the boxes are compressed quite small, meaning that the houses sold within any given month stay inside a narrow price range (around 5 million CNY).

     # Mean total price per month of trade date
     price_by_trade_time = pd.DataFrame()
     price_by_trade_time['totalPrice'] = housing['totalPrice']
     price_by_trade_time.index = housing['tradeTime'].astype('datetime64[ns]')
     price_by_trade_month = price_by_trade_time.resample('M').mean().to_period('M').fillna(0)
     price_by_trade_month.plot(kind='line')

     # Boxplot of the price distribution within each month
     price_stat_trade_month_index = [x.strftime('%Y-%m') for x in set(price_by_trade_time.to_period('M').index)]
     price_stat_trade_month_index.sort()
     price_stat_trade_month = []
     for month in price_stat_trade_month_index:
         price_stat_trade_month.append(price_by_trade_time.loc[month]['totalPrice'].values)
     price_stat_trade_month = pd.DataFrame(price_stat_trade_month)
     price_stat_trade_month.index = price_stat_trade_month_index
     price_stat_trade_month = price_stat_trade_month.T
     price_stat_trade_month.boxplot(figsize=(15,10))
     plt.xticks(rotation=90, fontsize=7)
     plt.savefig('price_stat_trade_time.png')
    
  4. Explore the relation between building age and price
    Look at the constructionTime data first (未知 means "unknown"):

     未知      15475
     0          14
     1          12
    

    There is some noise here, which I chose to drop before plotting mean price against building age.

    Century-old houses really are in a class of their own!

    Those century-old houses are isolated cases, though; building ages concentrate around 0–65 years, so zoom in for a closer look.

    Most properties sit around 5 million CNY, yet half-century-old houses somehow sell for as much as new ones, which is hard to understand. Still, contrary to the rumor that every Beijing home costs tens of millions, there is hope of staying in Beijing after all!!!

     #price and construction time correlations
     import numpy as np

     price_by_cons_time = pd.DataFrame()
     price_by_cons_time['totalPrice'] = housing['totalPrice']
     price_by_cons_time['constructionTime'] = housing['constructionTime']
     # Drop the noisy construction years ('0', '1' and 未知 / unknown)
     price_by_cons_time = price_by_cons_time[
         (price_by_cons_time.constructionTime != '0')
         & (price_by_cons_time.constructionTime != '1')
         & (price_by_cons_time.constructionTime != '未知')
     ]
     # Convert the construction year into a building age, using 2018 as the baseline
     price_by_cons_time['constructionTime'] = price_by_cons_time['constructionTime'].astype('int64')
     price_by_cons_time['constructionTime'] = 2018 - price_by_cons_time['constructionTime']
     price_by_cons_time_index = list(set(price_by_cons_time['constructionTime']))
     price_by_cons_time_index.sort()
     price_by_cons_time.index = price_by_cons_time['constructionTime']
     price_by_cons_time = price_by_cons_time.drop('constructionTime', axis=1)
     price_by_cons_time_line = []
     price_by_cons_time_stat = []
     for years in price_by_cons_time_index:
         values = price_by_cons_time.loc[years]['totalPrice']
         try:
             price_by_cons_time_stat.append(values.values)
             price_by_cons_time_line.append(values.mean())
         except Exception:
             # A building age with a single record comes back as a scalar
             price_by_cons_time_stat.append(np.array([values]))
             price_by_cons_time_line.append(values)
     # Mean price as a function of building age
     plt.figure()
     plt.plot(list(price_by_cons_time_index), price_by_cons_time_line)
     plt.savefig('price_cons_line.png')
     # Price distribution per building age
     price_by_cons_time_stat = pd.DataFrame(price_by_cons_time_stat)
     price_by_cons_time_stat.index = price_by_cons_time_index
     price_by_cons_time_stat = price_by_cons_time_stat.T
     price_by_cons_time_stat.boxplot(figsize=(20,15))
     plt.ylim(0, 2500)
     plt.savefig('price_stat_cons_time.png')
    
  5. Explore the relation between floor area and price. Luxury homes above 1,000 m² shoot up in price, 600–900 m² is another rising range, 0–400 m² looks like the mass-market segment, and prices are basically flat between 400 and 600 m², though that may just be a sample-size issue, so I decided to look at the overall picture as well.
    The areas turn out to be very concentrated, so narrow the range and look again.

    Most of the properties actually traded in Beijing are 100 m² or smaller.
    Now look at area against price.

    Broadly, the larger the area, the higher the price.
    Zoom in on the axes for a closer look.

     #square and price
     price_by_square = pd.DataFrame()
     price_by_square['totalPrice'] = housing['totalPrice']
     price_by_square['square'] = housing['square']
     # Bucket the floor area into 10 m2 bins
     price_by_square['square'] = np.ceil(price_by_square['square'])
     price_by_square['square'] = price_by_square['square'] - (price_by_square['square'] % 10)
     price_by_square_index = list(set(price_by_square['square']))
     price_by_square_index.sort()
     price_by_square.index = price_by_square['square']
     price_by_square_line = []
     price_by_square_stat = []
     for squares in price_by_square_index:
         values = price_by_square.loc[squares]['totalPrice']
         try:
             price_by_square_stat.append(values.values)
             price_by_square_line.append(values.mean())
         except Exception:
             # An area bin with a single record comes back as a scalar
             price_by_square_stat.append(np.array([values]))
             price_by_square_line.append(values)
     # Mean price per area bin
     plt.figure()
     plt.plot(price_by_square_index, price_by_square_line)
     plt.savefig('price_square_mean.png')
     # Distribution of the binned areas themselves
     price_by_square['square'].hist(bins=50, figsize=(20,15))
     plt.savefig('price_square.png')
     # Price distribution per area bin
     price_by_square_stat = pd.DataFrame(price_by_square_stat).T
     price_by_square_index = [int(x) for x in price_by_square_index]
     price_by_square_stat.columns = price_by_square_index
     price_by_square_stat.boxplot(figsize=(20,15))
     plt.xticks(rotation=90)
     plt.ylim(0, 5000)
     plt.savefig('price_stat_square_time.png')
    
  6. Explore the relation between time, area and price. Most of the properties traded in Beijing fall below roughly 25 million CNY and 500 m².
    Zoom in on the axes.

    Zoom in further.

    Prices in 2017 are far ahead of every other year, while 2011 looks like it was the best time to buy in Beijing.

     #price and time/square correlations
     import itertools
     import matplotlib as mpl

     def get_mean(price_by_square):
         # Mean price per 10 m2 area bin for one year of trades; the except
         # branch handles a year that contains only a single record
         try:
             price_by_square_index = list(set(price_by_square['square']))
             price_by_square_index.sort()
             price_by_square_line = []
             price_by_square.index = price_by_square['square']
             for squares in price_by_square_index:
                 price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
             price_by_square_index = [int(x) for x in price_by_square_index]
         except Exception:
             price_by_square_line = [price_by_square.loc['totalPrice']]
             price_by_square_index = [int(price_by_square['square'])]
         return price_by_square_line, price_by_square_index

     price = pd.DataFrame()
     price['totalPrice'] = housing['totalPrice']
     price['square'] = housing['square']
     price.index = housing['tradeTime'].astype('datetime64[ns]')
     price['square'] = np.ceil(price['square'])
     price['square'] = price['square'] - (price['square'] % 10)
     price = price.to_period('Y')
     price_time_index = [x.strftime('%Y') for x in set(price.index)]
     price_time_index.sort()
     # One line per trade year: mean price as a function of area
     colormap = mpl.cm.Dark2.colors
     m_styles = ['', '.', 'o', '^', '*']
     plt.figure()
     for year, (maker, color) in zip(price_time_index, itertools.product(m_styles, colormap)):
         y, x = get_mean(price.loc[year])
         plt.plot(x, y, color=color, marker=maker, label=year)
     plt.xticks(rotation=90)
     plt.xlim(0, 750)
     plt.ylim(0, 5000)
     plt.legend(price_time_index)
     plt.savefig('price_by_time_square.png')
    	
    
  7. Check for dirty data

    livingRoom contains '#NAME?' entries, which I plan to drop. drawingRoom mixes Chinese text with numbers; the Chinese entries are few, so they can be dropped as well.

    bathRoom has obviously wrong values, so the bad records are dropped.
    floor is very messy and needs special handling.
    buildingType also contains errors.
    After inspection, the attributes that need cleaning are:
    constructionTime
    buildingType
    floor
    bathRoom
    drawingRoom
    livingRoom
    The continuous attributes are:
    communityAverage
    ladderRatio
    constructionTime
    square
    followers
    Lat
    Lng
    The discrete attributes are:
    district
    subway
    fiveYearsProperty
    elevator
    buildingStructure
    renovationCondition
    buildingType
    floor
    bathRoom
    kitchen
    drawingRoom
    livingRoom
    The binary 0/1 attributes among these (subway, fiveYearsProperty, elevator) do not need one-hot encoding; tradeTime is not a property of the house itself and is dropped. (A quick scan for the dirty values is sketched below.)
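
    A quick way to surface these dirty values is to scan the nominally numeric columns for entries that do not parse as numbers. A minimal sketch, assuming the training DataFrame housing from above; the column list simply follows the attributes flagged here:

     # List non-numeric entries in columns that should be numeric
     suspect_columns = ['livingRoom', 'drawingRoom', 'bathRoom', 'constructionTime', 'floor']
     for col in suspect_columns:
         as_text = housing[col].astype(str)
         dirty = as_text[~as_text.str.match(r'^\d+(\.\d+)?$')]
         print(col, dirty.unique()[:10])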

Data preparation

  1. Clean the data
    • The data contains too many dirty records, so cleaning starts from scratch
    • Remove the attributes that are not needed
    • Convert constructionTime into a continuous building-age attribute (using 2018 as the baseline)
    • Remove the dirty records from buildingType
    • Remove the dirty records from livingRoom, drawingRoom and bathRoom and convert them to numeric
    • The floor attribute is too messy, so I decided to drop it
    from sklearn.base import BaseEstimator, TransformerMixin

    class DataNumCleaner(BaseEstimator, TransformerMixin):
        # Drops dirty records and converts the room counts to numeric
        def __init__(self, clean=True):
            self.clean = clean
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            if self.clean:
                X = X.copy()
                # constructionTime: drop noise and convert the year to a building age (baseline 2018)
                X = X[(X.constructionTime != '0') & (X.constructionTime != '1') & (X.constructionTime != '未知')]
                X['constructionTime'] = 2018 - X['constructionTime'].astype('int64')
                # buildingType: keep only the four valid codes
                X = X[(X.buildingType == 1) | (X.buildingType == 2) | (X.buildingType == 3) | (X.buildingType == 4)]
                # livingRoom / drawingRoom / bathRoom: drop dirty entries, then make them numeric
                X = X[X.livingRoom != '#NAME?']
                X = X[X.drawingRoom.isin(['0', '1', '2', '3', '4', '5'])]
                X = X[X.bathRoom.isin(['0', '1', '2', '3', '4', '5', '6', '7'])]
                X.bathRoom = X.bathRoom.astype('float64')
                X.drawingRoom = X.drawingRoom.astype('float64')
                X.livingRoom = X.livingRoom.astype('float64')
                return X
            else:
                return X
    
  2. The cleaning result looks reasonably good
  3. Fill missing values with the mode
  4. One-hot encode buildingType, renovationCondition, buildingStructure and district
  5. Build the data-cleaning pipeline (DataFrameSelector is a small custom column selector; a minimal sketch of it follows the code below)

     from sklearn.pipeline import Pipeline, FeatureUnion
     from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder

     # Numeric attributes: clean, select, impute with the mode, standardize
     num_pipeline = Pipeline([
         ('cleaner', DataNumCleaner()),
         ('selector', DataFrameSelector(num_attributes)),
         ('imputer', Imputer(strategy='most_frequent')),
         ('std_scaler', StandardScaler())
     ])

     # Categorical attributes: clean, select, one-hot encode
     cat_pipeline = Pipeline([
         ('cleaner', DataNumCleaner()),
         ('selector', DataFrameSelector(cat_attributes)),
         ('encoder', OneHotEncoder())
     ])

     # Target: clean and select totalPrice
     label_pipeline = Pipeline([
         ('cleaner', DataNumCleaner()),
         ('selector', DataFrameSelector(['totalPrice']))
     ])

     # Combine the numeric and categorical branches into one feature matrix
     full_pipeline = FeatureUnion([
         ('num_pipeline', num_pipeline),
         ('cat_pipeline', cat_pipeline)
     ])
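
    DataFrameSelector is not part of scikit-learn; it is the same small helper used in the California housing exercise. A minimal sketch, assuming it only has to pick columns out of a pandas DataFrame:

     from sklearn.base import BaseEstimator, TransformerMixin

     class DataFrameSelector(BaseEstimator, TransformerMixin):
         # Select a subset of columns and hand them on as a NumPy array
         def __init__(self, attribute_names):
             self.attribute_names = attribute_names
         def fit(self, X, y=None):
             return self
         def transform(self, X):
             return X[self.attribute_names].values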
    

Model training

  1. Linear regression
    The results are acceptable. (A minimal sketch of these baseline models and the correlation check appears after this list.)
  2. Decision tree

    The results are also acceptable, but training takes too long, so I considered removing some irrelevant features.
    Check the correlations between the features.

    Area and the community average price correlate most strongly with the price, but since the goal is to predict the price of a house from its own attributes, I considered dropping followers and communityAverage.

    With the reduced feature set the linear regression model still performs acceptably.

  3. Linear SVR

  4. Hyperparameter tuning
    Since my machine's compute is really limited, I could only practice with the linear model for now.
    Obtain the best parameters for the linear SVR.
    Look at the RMSE of each run.

    The results are acceptable.

     #improve linear_svr model with a grid search over its hyperparameters
     from sklearn.model_selection import GridSearchCV
     from sklearn.svm import LinearSVR
     import numpy as np

     lin_svm_reg = LinearSVR(random_state=42)
     param_grid = [
         {'C': [0.5, 1, 2], 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive']}
     ]
     grid_search = GridSearchCV(lin_svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
     grid_search.fit(housing_prepared, housing_label)
     print(grid_search.best_params_)
     # RMSE of every parameter combination
     cvres = grid_search.cv_results_
     for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
         print(np.sqrt(-mean_score), params)

     #final model
     final_model = grid_search.best_estimator_
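
    The linear regression, decision tree and correlation steps above are not shown as code in the original notes; a minimal sketch of how they might look, assuming housing_prepared and housing_label come out of full_pipeline and label_pipeline:

     from sklearn.linear_model import LinearRegression
     from sklearn.tree import DecisionTreeRegressor
     from sklearn.model_selection import cross_val_score
     import numpy as np

     # Baseline linear regression, scored by cross-validated RMSE
     lin_reg = LinearRegression()
     lin_scores = cross_val_score(lin_reg, housing_prepared, housing_label,
                                  scoring='neg_mean_squared_error', cv=5)
     print('LinearRegression RMSE:', np.sqrt(-lin_scores).mean())

     # Decision-tree baseline (noticeably slower to train)
     tree_reg = DecisionTreeRegressor(random_state=42)
     tree_scores = cross_val_score(tree_reg, housing_prepared, housing_label,
                                   scoring='neg_mean_squared_error', cv=5)
     print('DecisionTreeRegressor RMSE:', np.sqrt(-tree_scores).mean())

     # Correlation of every numeric attribute with the total price
     print(housing.corr()['totalPrice'].sort_values(ascending=False))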
    

Model validation

  1. Validate the model on the test set

    The performance is about the same as on the training set, which is acceptable. (A sketch of preparing and scoring the test set follows this list.)
  2. Randomly draw 100 records from the test set and check the predictions

    The predictions are almost on top of the labels, so the model is usable.

     from random import randint

     # Draw 100 random test records and compare predictions against the labels
     test_index = [randint(0, len(y_test) - 1) for i in range(100)]
     y_label = [y_test[index] for index in test_index]
     y_predict = [final_model.predict(X_test_prepared[index]) for index in test_index]
     x = [i+1 for i in range(100)]
     plt.plot(x, y_label, c='red', label='label')
     plt.plot(x, y_predict, c='blue', label='predict')
     plt.legend()
     plt.savefig('result.png')
    
  3. Export the model

import joblib  # on older scikit-learn this was: from sklearn.externals import joblib

joblib.dump(final_model, 'BeijingHousingPricePredicter.pkl')
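
As mentioned in step 1, the preparation of the test set is not shown above; a minimal sketch of how it might look, assuming the pipelines and final_model defined earlier (X_test_prepared and y_test are the names used in the snippet above):

     from sklearn.metrics import mean_squared_error
     import numpy as np

     # Run the held-out test set through the same cleaning and feature pipelines
     X_test_prepared = full_pipeline.transform(test_set)
     y_test = label_pipeline.transform(test_set).ravel()

     # RMSE of the final model on the test set
     final_predictions = final_model.predict(X_test_prepared)
     final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
     print('Test RMSE:', final_rmse)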

Summary

  • Beijing housing prices really are high!
  • The properties that actually change hands in Beijing mostly sit around 5 million CNY, roughly 100 m², with a building age of 0–40 years. Larger properties are listed but rarely sell.
  • The best time to buy in Beijing was around 2011.
  • Around 2017 a property actually traded for 175 million CNY; one wonders what kind of immortals the buyer and seller were.
  • Data cleaning matters a lot; you can write your own transformers and put them into a Pipeline.
  • Some features can be removed based on human judgment, but feature engineering is important!!
  • Machine learning really wants a computer with decent compute power, ORZ
  • Full code