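Load the data:

import pandas as pd

reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

# Inspect the dtype of the points column
dtype = reviews.points.dtype
dtype
'''
dtype('int64')
'''

# Convert the points column to strings
point_strings = reviews.points.astype(str)

# Count the reviews that have no price, in three equivalent ways
# Option 1
missing_price_reviews = reviews[reviews.price.isnull()]
len(missing_price_reviews)
# Option 2
n_missing_prices = reviews.price.isnull().sum()
# Option 3
n_missing_prices = pd.isnull(reviews.price).sum()
'''
8996
'''

The result we are working toward is the number of reviews per region, with missing regions counted as Unknown:

'''
Unknown                    21247
Napa Valley                 4480
                           ...
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
'''

First we deal with the missing values. The pandas fillna method lets us replace them. Below, we replace the missing (NaN) values with Unknown.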
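To see what fillna does in isolation, here is a minimal sketch on a small hand-made Series. The variable names and region values are made up for illustration and are not read from the wine reviews file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a column with gaps.
regions = pd.Series(["Etna", np.nan, "Alsace", np.nan, "Alsace"])

# Count the missing entries before filling.
n_missing = regions.isnull().sum()

# fillna returns a new Series; the original is left unchanged.
filled = regions.fillna("Unknown")

print(n_missing)
print(filled.tolist())
```

Because fillna returns a new object, the result has to be assigned to a name, as the code on the review data does.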
reviews_per_region = reviews.region_1.fillna('Unknown')
reviews_per_region
'''
0                    Etna
1                 Unknown
2       Willamette Valley
3     Lake Michigan Shore
4       Willamette Valley
5                 Navarra
6                Vittoria
7                  Alsace
8                 Unknown
9                  Alsace
10            Napa Valley
11                 Alsace
12       Alexander Valley
              ...
'''

Next, count how many times each region occurs, using the value_counts() function.
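As a small sketch of why the fill step comes first: by default value_counts() leaves missing values out of the tally, so without fillna the Unknown group would not appear at all. The toy Series below is hypothetical and only meant to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three missing entries, two regions.
regions = pd.Series(["Alsace", np.nan, "Etna", "Alsace", np.nan, np.nan])

# By default value_counts() drops missing values.
counts_default = regions.value_counts()

# After fillna the missing entries are counted under their own label.
counts_filled = regions.fillna("Unknown").value_counts()

print(counts_default.to_dict())
print(counts_filled.to_dict())
```

An alternative is value_counts(dropna=False), which keeps the missing values in the tally under NaN instead of a readable label.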
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts()
type(reviews.region_1.fillna('Unknown'))
reviews_per_region
'''
pandas.core.series.Series
Unknown                 21247
Napa Valley              4480
Columbia Valley (WA)     4124
Russian River Valley     3091
California               2629
Paso Robles              2350
Mendoza                  2301
Willamette Valley        2301
Alsace                   2163
Champagne                1613
Barolo                   1599
Finger Lakes             1565
                        ...
'''

Finally, sort in descending order:
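A minimal sketch of the sorting step on hypothetical toy data (the values are invented, with no tied counts so the order is unambiguous). It also shows that value_counts() already returns the counts from largest to smallest, so the explicit sort mainly documents the intent.

```python
import pandas as pd

# Hypothetical toy data with distinct counts per label.
regions = pd.Series(["Etna", "Alsace", "Unknown", "Alsace", "Unknown", "Unknown"])

counts = regions.value_counts()

# Explicit descending sort, as in the walkthrough.
descending = counts.sort_values(ascending=False)

# The same call with ascending=True reverses the order.
ascending = counts.sort_values(ascending=True)

print(descending.to_dict())
print(ascending.to_dict())
```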
# value_counts() already sorts by count in descending order by default;
# sort_values(ascending=False) makes that ordering explicit.
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
# type(reviews.region_1.fillna('Unknown'))
reviews_per_region
'''
Unknown                 21247
Napa Valley              4480
Columbia Valley (WA)     4124
Russian River Valley     3091
California               2629
Paso Robles              2350
Mendoza                  2301
Willamette Valley        2301
Alsace                   2163
Champagne                1613
Barolo                   1599
'''

Note:
The data above comes from Kaggle Learn.
