Overview
Having worked through the California housing price prediction model, I was itching to practice, so I grabbed a Beijing housing price dataset from Kaggle to try my hand.
Procedure
Experimental data
I picked a Beijing housing price dataset from Kaggle, which collects Beijing listing prices from Lianjia (lianjia.com) between 2011 and 2017.
Download and preview the data
Download and unzip the data
Preview the data
Each row represents one house, and each house has 26 attributes; the following deserve some notes:
DOM: days on market
followers: number of people following the listing
totalPrice: total price of the house
price: price per square meter
floor: floor information, stored as Chinese text; needs care during processing
buildingType: building type, including tower, bungalow, duplex, and show home
renovationCondition: renovation condition, including other, rough shell, basic renovation, and fine renovation
buildingStructure: building structure, including unknown, mixed, brick and wood, brick and concrete, steel, and steel-concrete
ladderRatio: number of staircases per resident
fiveYearsProperty: property ownership (five-year status)
district: district, categorical
Load and take a first look at the data
- Load the data
Loading the data raised an error. Suspecting an encoding problem, I checked the file encoding: `file new.csv` → `new.csv: ISO-8859 text, with CRLF line terminators`
The file is ISO-8859 encoded, so I re-saved it as UTF-8, after which it loaded successfully.
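As an alternative to re-saving the file, the encoding can be passed straight to pandas. A minimal sketch, assuming the raw export decodes as GBK (common for Lianjia data; `file` often reports such files as ISO-8859), with Latin-1 as a fallback:

```python
import pandas as pd

# Assumption: the raw export is GBK-encoded; fall back to Latin-1 if that fails.
try:
    housing = pd.read_csv('new.csv', encoding='gbk', low_memory=False)
except UnicodeDecodeError:
    housing = pd.read_csv('new.csv', encoding='latin-1', low_memory=False)
```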
- Inspect the data structure and summary statistics
Unlike the California data, this dataset contains a lot of non-numeric values. There are 318,851 instances in total; DOM, buildingType, elevator, fiveYearsProperty, subway, and communityAverage have missing values. DOM is missing too often, so the attribute can be dropped. url, id, and Cid have no bearing on the price and can be ignored outright. My prediction target is the total price, so the per-square-meter price can also be dropped.
- Check the overall statistics of the data
- Plot frequency histograms of the attributes
Many of the attributes turn out to be discrete; the continuous ones are DOM, Lat, Lng, communityAverage, followers, and square.

```python
import pandas as pd
import matplotlib.pyplot as plt


def load_housing_data(file_path):
    return pd.read_csv(file_path, sep=',', low_memory=False)


def check_attributes(housing):
    # print the value counts of every attribute
    attributes = list(housing)
    for attr in attributes:
        print(housing[attr].value_counts())


if __name__ == '__main__':
    housing = load_housing_data('new.csv')
    housing = housing.drop(['url', 'id', 'price'], axis=1)
    check_attributes(housing)
    housing.describe()
    housing.hist(bins=50, figsize=(20, 15))
    plt.savefig('housing_distribution.png')
```
Create a test set
I set aside 20% of the dataset as a test set. The district attribute happens to be a natural basis for stratified sampling. After the split, check whether the test set distribution matches the original data (see the check sketched after the code below).
```python
# split the train and test set
from sklearn.model_selection import StratifiedShuffleSplit

spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in spliter.split(housing, housing['district']):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]
test_set.hist(bins=50, figsize=(20, 15))
plt.savefig('test.png')
```
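The post only saves the histogram; one way to confirm that the stratified split preserves the district distribution (a sketch, not code from the original) is to compare the per-district proportions directly:

```python
# compare the proportion of each district in the full data vs. the test set
overall = housing['district'].value_counts(normalize=True).sort_index()
in_test = test_set['district'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'overall': overall, 'test_set': in_test}))
```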
Data exploration and visualization
First, set the test set aside and explore only the training set.
- Visualize the geographic data
Adjust the alpha parameter to see the density of the instances.
I have to say, Beijing's housing market is impressive; transaction volume is huge in every district.
- Visualize district and price information
In the plot, the radius of each circle represents the price and the color represents the district, which gives a basic picture of how the listings cluster by district.
Prices turn out to be roughly on par across districts, mostly below 25 million RMB, with very few outrageously expensive houses.

```python
# explore the data
housing = train_set.copy()
housing.plot(kind='scatter', x='Lat', y='Lng')
plt.savefig('gregrophy.png')
housing.plot(kind='scatter', x='Lat', y='Lng', alpha=0.1)
plt.savefig('gregrophy_more.png')
fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  s=housing['totalPrice'] / 100, label='Price',
                  c=housing['district'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.legend()
plt.savefig('gregrophy_district_value.png')
fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  c=housing['totalPrice'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.savefig('gregrophy_price_value.png')
```
- Plot how prices change over time
Beijing prices started soaring around 2010, while 2018 shows a downward trend.
The boxplots cover Beijing prices from 2002 to 2018. There are not many outliers and the boxes are fairly compressed, which means that within each month the sale prices stay in a narrow range (around 5 million RMB).

```python
price_by_trade_time = pd.DataFrame()
price_by_trade_time['totalPrice'] = housing['totalPrice']
price_by_trade_time.index = housing['tradeTime'].astype('datetime64[ns]')

# monthly mean price over time
price_by_trade_month = price_by_trade_time.resample('M').mean().to_period('M').fillna(0)
price_by_trade_month.plot(kind='line')

# boxplot of prices per month
price_stat_trade_month_index = [x.strftime('%Y-%m')
                                for x in set(price_by_trade_time.to_period('M').index)]
price_stat_trade_month_index.sort()
price_stat_trade_month = []
for month in price_stat_trade_month_index:
    price_stat_trade_month.append(price_by_trade_time[month]['totalPrice'].values)
price_stat_trade_month = pd.DataFrame(price_stat_trade_month)
price_stat_trade_month.index = price_stat_trade_month_index
price_stat_trade_month = price_stat_trade_month.T
price_stat_trade_month.boxplot(figsize=(15, 10))
plt.xticks(rotation=90, fontsize=7)
plt.savefig('price_stat_trade_time.png')
```
- Explore the relationship between building age and price
A quick look at the constructionTime value counts shows 15,475 records of 未知 (unknown), 14 records of '0', and 12 records of '1'.
These are noise, so I chose to drop them, and then plotted the mean price against building age.
Century-old houses really are something special!
Century-old houses turn out to be isolated cases; building ages concentrate in the 0~65 year range, so I zoomed in for a closer look.
Most properties still sit around 5 million RMB, yet half-century-old houses somehow sell for as much as new ones, which is hard to understand. Still, contrary to the rumor that all Beijing homes cost tens of millions, there is hope of staying in Beijing after all!!!

```python
# price and construction-time correlations
import numpy as np

price_by_cons_time = pd.DataFrame()
price_by_cons_time['totalPrice'] = housing['totalPrice']
price_by_cons_time['constructionTime'] = housing['constructionTime']
# drop the noisy records ('0', '1', '未知')
price_by_cons_time = price_by_cons_time[
    (price_by_cons_time.constructionTime != '0') &
    (price_by_cons_time.constructionTime != '1') &
    (price_by_cons_time.constructionTime != '未知')
]
# convert construction year to building age (2018 as the baseline)
price_by_cons_time['constructionTime'] = price_by_cons_time['constructionTime'].astype('int64')
price_by_cons_time['constructionTime'] = 2018 - price_by_cons_time['constructionTime']
price_by_cons_time_index = list(set(price_by_cons_time['constructionTime']))
price_by_cons_time_index.sort()
price_by_cons_time.index = price_by_cons_time['constructionTime']
price_by_cons_time = price_by_cons_time.drop('constructionTime', axis=1)
price_by_cons_time_line = []
price_by_cons_time_stat = []
for years in price_by_cons_time_index:
    price_by_cons_time_line.append(price_by_cons_time.loc[years]['totalPrice'].mean())
    try:
        price_by_cons_time_stat.append(price_by_cons_time.loc[years]['totalPrice'].values)
    except Exception:
        # ages with a single record come back as a scalar
        price_by_cons_time_stat.append(np.array([price_by_cons_time.loc[years]['totalPrice']]))
plt.plot(list(price_by_cons_time_index), price_by_cons_time_line)
plt.savefig('price_cons_line.png')
price_by_cons_time_stat = pd.DataFrame(price_by_cons_time_stat)
price_by_cons_time_stat.index = price_by_cons_time_index
price_by_cons_time_stat = price_by_cons_time_stat.T
price_by_cons_time_stat.boxplot(figsize=(20, 15))
plt.ylim(0, 2500)
plt.savefig('price_stat_cons_time.png')
```
- Explore the relationship between floor area and price. Luxury homes above 1000 m² show a price spike, 600~900 m² is another rising band, 0~400 m² looks like the first-home segment, and prices in the 400~600 m² range are basically flat, though that may just be a sample-size effect, so I decided to look at the overall picture.
The areas turn out to be heavily concentrated, so I narrowed the interval for another look.
Most of the properties actually sold on the Beijing market are 100 m² or smaller.
Now look at area versus price.
Basically, the bigger the area, the higher the price.
Zoom in on the axes for a closer look.
```python
# square and price
price_by_square = pd.DataFrame()
price_by_square['totalPrice'] = housing['totalPrice']
price_by_square['square'] = housing['square']
# bucket areas into 10 m² bins
price_by_square['square'] = np.ceil(price_by_square['square'])
price_by_square['square'] = price_by_square['square'] - (price_by_square['square'] % 10)
price_by_square_index = list(set(price_by_square['square']))
price_by_square_index.sort()
price_by_square.index = price_by_square['square']
price_by_square_line = []
price_by_square_stat = []
for squares in price_by_square_index:
    price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
    try:
        price_by_square_stat.append(price_by_square.loc[squares]['totalPrice'].values)
    except Exception:
        # bins with a single record come back as a scalar
        price_by_square_stat.append(np.array([price_by_square.loc[squares]['totalPrice']]))
plt.plot(price_by_square_index, price_by_square_line)
plt.savefig('price_square_mean.png')
price_by_square['square'].hist(bins=50, figsize=(20, 15))
plt.savefig('price_square.png')
price_by_square_stat = pd.DataFrame(price_by_square_stat).T
price_by_square_index = [int(x) for x in price_by_square_index]
price_by_square_stat.columns = price_by_square_index
price_by_square_stat.boxplot(figsize=(20, 15))
plt.xticks(rotation=90)
plt.ylim(0, 5000)
plt.savefig('price_stat_square_time.png')
```
- Explore the relationship between trade time, area, and price. Most of the Beijing properties traded on the market cluster below roughly 25 million RMB and 500 m².
Zoom in on the axes.
Zoom in further.
2017 prices turn out to be far ahead of everything else, while 2011 looks like it was the best time to buy in Beijing.

```python
# price vs. trade time and square correlations
import itertools
import matplotlib as mpl


def get_mean(price_by_square):
    # mean price per 10 m² bucket for one year of trades
    try:
        price_by_square_index = list(set(price_by_square['square']))
        price_by_square_index.sort()
        price_by_square_line = []
        price_by_square.index = price_by_square['square']
        for squares in price_by_square_index:
            price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
        price_by_square_index = [int(x) for x in price_by_square_index]
    except Exception:
        # a year with a single trade comes back as a Series
        price_by_square_line = [price_by_square.loc['totalPrice']]
        price_by_square_index = [int(price_by_square['square'])]
    return price_by_square_line, price_by_square_index


price = pd.DataFrame()
price['totalPrice'] = housing['totalPrice']
price['square'] = housing['square']
price.index = housing['tradeTime'].astype('datetime64[ns]')
price['square'] = np.ceil(price['square'])
price['square'] = price['square'] - (price['square'] % 10)
price = price.to_period('Y')
price_time_index = [x.strftime('%Y') for x in set(price.index)]
price_time_index.sort()
colormap = mpl.cm.Dark2.colors
m_styles = ['', '.', 'o', '^', '*']
for year, (marker, color) in zip(price_time_index, itertools.product(m_styles, colormap)):
    y, x = get_mean(price.loc[year])
    plt.plot(x, y, color=color, marker=marker, label=year)
plt.xticks(rotation=90)
plt.xlim(0, 750)
plt.ylim(0, 5000)
plt.legend(price_time_index)
plt.savefig('price_by_time_square.png')
```
- Check for dirty data
livingRoom contains the value `#NAME?`; those records should be dropped. drawingRoom mixes Chinese text with numbers, and since the Chinese entries are few, they can be dropped too.
bathRoom contains obviously wrong values; drop those records.
The floor attribute is very messy and needs special handling.
buildingType also contains errors.
After inspection, the attributes that need treatment are:
constructionTime
buildingType
floor
bathRoom
drawingRoom
livingRoom
The continuous attributes are:
communityAverage
ladderRatio
constructionTime
square
followers
Lat
Lng
The discrete attributes are:
district
subway
fiveYearsProperty
elevator
buildingStructure
renovationCondition
buildingType
floor
bathRoom
kitchen
drawingRoom
livingRoom
Among the discrete attributes, subway, fiveYearsProperty, and elevator are already binary 0/1 values and do not need one-hot encoding; tradeTime is not an attribute of the house itself, so it is dropped.
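The pipeline further below refers to num_attributes and cat_attributes, which never appear in the snippets. Based on the lists above they would look roughly like this; the exact membership is my assumption (floor and tradeTime dropped, the binary flags and room counts kept on the numeric side):

```python
# columns routed to the numeric pipeline (assumed)
num_attributes = ['communityAverage', 'ladderRatio', 'constructionTime', 'square',
                  'followers', 'Lat', 'Lng', 'subway', 'fiveYearsProperty',
                  'elevator', 'bathRoom', 'kitchen', 'drawingRoom', 'livingRoom']

# columns that get one-hot encoded (assumed, matching the list in the next section)
cat_attributes = ['district', 'buildingStructure', 'renovationCondition', 'buildingType']
```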
Data preparation
- Clean the data
- The data contains too many dirty records, so clean it from scratch
- Remove unneeded attributes
- Convert constructionTime into a continuous building-age attribute (with 2018 as the baseline)
- Remove dirty records from buildingType
- Remove dirty records from livingRoom, drawingRoom, and bathRoom, and convert them to numeric
- The floor attribute is too messy, so I decided to drop it
```python
from sklearn.base import BaseEstimator, TransformerMixin


class DataNumCleaner(BaseEstimator, TransformerMixin):
    """Drop dirty records and convert string columns to numeric."""

    def __init__(self, clean=True):
        self.clean = clean

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.clean:
            # drop noisy constructionTime values and convert to building age
            X = X[(X.constructionTime != '0') & (X.constructionTime != '1') &
                  (X.constructionTime != '未知')]
            X['constructionTime'] = 2018 - X['constructionTime'].astype('int64')
            # keep only the valid buildingType codes
            X = X[(X.buildingType == 1) | (X.buildingType == 2) |
                  (X.buildingType == 3) | (X.buildingType == 4)]
            # drop dirty room counts, then convert them to numeric
            X = X[X.livingRoom != '#NAME?']
            X = X[(X.drawingRoom == '0') | (X.drawingRoom == '1') | (X.drawingRoom == '2') |
                  (X.drawingRoom == '3') | (X.drawingRoom == '4') | (X.drawingRoom == '5')]
            X = X[(X.bathRoom == '0') | (X.bathRoom == '1') | (X.bathRoom == '2') |
                  (X.bathRoom == '3') | (X.bathRoom == '4') | (X.bathRoom == '5') |
                  (X.bathRoom == '6') | (X.bathRoom == '7')]
            X.bathRoom = X.bathRoom.astype('float64')
            X.drawingRoom = X.drawingRoom.astype('float64')
            X.livingRoom = X.livingRoom.astype('float64')
            return X
        else:
            return X
```
- The cleaning results look fairly good
- Fill missing values with the mode
- One-hot encode buildingType, renovationCondition, buildingStructure, and district
- Build the data-cleaning pipeline
```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder
# note: on newer scikit-learn, Imputer has been replaced by sklearn.impute.SimpleImputer

num_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(num_attributes)),
    ('imputer', Imputer(strategy='most_frequent')),
    ('std_scaler', StandardScaler())
])
cat_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(cat_attributes)),
    ('encoder', OneHotEncoder())
])
label_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(['totalPrice']))
])
full_pipeline = FeatureUnion([
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
```
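These pipelines also rely on a DataFrameSelector transformer that is not defined in the snippets; it is presumably the small column-selecting helper from the Hands-On ML book. A minimal sketch of it, plus how the full pipeline might be applied (the names housing_prepared and housing_label are taken from the later snippets):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


housing_prepared = full_pipeline.fit_transform(train_set)
housing_label = label_pipeline.fit_transform(train_set).ravel()
```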
Model training
- Linear regression model
The results look decent.
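The training code is not shown at this step; a minimal sketch of what it presumably looked like, reusing the housing_prepared and housing_label arrays prepared above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_label)

# RMSE on the training set
lin_rmse = np.sqrt(mean_squared_error(housing_label, lin_reg.predict(housing_prepared)))
print(lin_rmse)
```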
- Decision tree
The results are also acceptable, but training takes too long, so I considered dropping some less relevant features.
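A sketch of the decision-tree experiment under the same assumptions (the post only reports the outcome); cross-validating a tree on a few hundred thousand rows is what makes this step slow:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_label,
                         scoring='neg_mean_squared_error', cv=5)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean(), tree_rmse_scores.std())
```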
Check the correlations between the features.
The attributes most correlated with price are still the floor area and the community average price, but since the goal is to predict the price of an individual house, the chosen features should ideally describe the house itself, so I considered dropping followers and communityAverage.
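The correlation check itself is not shown; one way to do it (a sketch) is Pearson correlation against the target on the raw training frame:

```python
# correlation of every numeric attribute with the total price
corr_matrix = train_set.select_dtypes(include='number').corr()
print(corr_matrix['totalPrice'].sort_values(ascending=False))
```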
The linear regression model still performs acceptably after reducing the features.
- Linear SVR
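The grid search below tunes a lin_svm_reg estimator that is not defined in the shown snippets; presumably something along these lines (a sketch):

```python
from sklearn.svm import LinearSVR

lin_svm_reg = LinearSVR(random_state=42)
lin_svm_reg.fit(housing_prepared, housing_label)
```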
- Hyperparameter tuning
Since my machine's compute power is really limited, I could only practice with linear models for now.
Get the best parameters for the LinearSVR.
Check the RMSE of each run.
The results are acceptable.

```python
# improve the LinearSVR model with a grid search
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'C': [0.5, 1, 2],
     'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive']}
]
grid_search = GridSearchCV(lin_svm_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_label)
print(grid_search.best_params_)

# RMSE of every parameter combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

# final model
final_model = grid_search.best_estimator_
```
Model validation
- Evaluate on the test set
Performance is close to the training set and is acceptable.
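The evaluation code is not shown either; a sketch under the assumption that the already-fitted pipelines are reused (transform, not fit_transform, on the test set):

```python
# prepare the test set with the fitted pipelines, then measure RMSE
X_test_prepared = full_pipeline.transform(test_set)
y_test = label_pipeline.transform(test_set).ravel()

final_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test_prepared)))
print(final_rmse)
```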
- Randomly pick 100 records from the test set, run predictions, and check the results
The predictions almost coincide with the labels, so the model is usable.

```python
from random import randint

# sample 100 random test records and compare predictions with labels
test_index = [randint(0, len(y_test) - 1) for i in range(100)]
y_label = [y_test[index] for index in test_index]
y_predict = [final_model.predict(X_test_prepared[index]) for index in test_index]
x = [i + 1 for i in range(100)]
plt.plot(x, y_label, c='red', label='label')
plt.plot(x, y_predict, c='blue', label='predict')
plt.legend()
plt.savefig('result.png')
```
- Export the model
```python
import joblib  # on older scikit-learn: from sklearn.externals import joblib

joblib.dump(final_model, 'BeijingHousingPricePredicter.pkl')
```
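To reuse the exported model later, loading it back is the mirror call:

```python
final_model = joblib.load('BeijingHousingPricePredicter.pkl')
```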
Summary
- Beijing housing prices really are high!
- The properties that actually change hands on the Beijing market are mostly around 5 million RMB, roughly 100 m², with building ages of 0~40 years. Larger properties are listed but rarely sold.
- The best time to buy in Beijing was around 2011
- Around 2017 a property actually traded at the sky-high price of 175 million RMB; I wonder what kind of immortals the buyer and seller were
- Data cleaning matters a lot; you can write your own transformers and plug them into the Pipeline
- Some features can be dropped based on human judgment, but feature engineering really matters!!
- Machine learning calls for a machine with decent compute power ORZ
- Complete code