Tabular Data¶
xskillscore can be used on tabular data such as that stored in a pandas.DataFrame. It is most effective when evaluating predictions over different fields.
[1]:
import numpy as np
import pandas as pd
import xskillscore as xs
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
np.random.seed(seed=42)
Boston house prices dataset¶
A small example is to take a dataset and evaluate the model according to a field (column).
Load the Boston house prices dataset (note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2):
[2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['y'] = pd.Series(data.target)
df
[2]:
  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | y
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1.0 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
Create a dummy prediction column by adding noise to y:
[3]:
noise = np.random.uniform(-1, 1, size=len(df['y']))
df['yhat'] = (df['y'] + (df['y'] * noise)).clip(lower=df["y"].min())
Evaluate the model over the field RAD using pandas.groupby.apply with mean_squared_error from scikit-learn:
[4]:
df.groupby('RAD').apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
[4]:
RAD
1.0 161.237554
2.0 313.855750
3.0 307.220760
4.0 162.634430
5.0 221.852969
6.0 155.612978
7.0 214.375240
8.0 278.092560
24.0 148.840507
dtype: float64
You could also do this using xskillscore. First, structure the pandas.DataFrame to keep the core fields when converting to an xarray object:
[5]:
min_df = df.reset_index().set_index(["index", "RAD"])[["y", "yhat"]]
min_df
[5]:
index | RAD | y | yhat
---|---|---|---
0 | 1.0 | 24.0 | 17.977926 |
1 | 2.0 | 21.6 | 41.070858 |
2 | 2.0 | 34.7 | 50.800380 |
3 | 3.0 | 33.4 | 39.990387 |
4 | 3.0 | 36.2 | 11.295750 |
... | ... | ... | ... |
501 | 1.0 | 22.4 | 24.017117 |
502 | 1.0 | 20.6 | 12.752538 |
503 | 1.0 | 23.9 | 38.899402 |
504 | 1.0 | 22.0 | 30.128172 |
505 | 1.0 | 11.9 | 5.000000 |
506 rows × 2 columns
Convert it to an xarray.Dataset using pandas.DataFrame.to_xarray. Note: this will create an array of index by RAD and pad the combinations that do not exist with nan.
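To see this padding concretely, here is a minimal sketch with toy values (not rows from the dataset above) using pandas.DataFrame.unstack, which produces the same dense grid that to_xarray builds from a MultiIndex:

```python
import numpy as np
import pandas as pd

# Toy frame: only some (index, RAD) combinations exist
df = pd.DataFrame(
    {"index": [0, 1, 2], "RAD": [1.0, 2.0, 2.0], "y": [24.0, 21.6, 34.7]}
).set_index(["index", "RAD"])

# Widening to a dense index-by-RAD grid pads missing cells with nan,
# just as to_xarray does
wide = df["y"].unstack("RAD")
print(wide)
```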
[6]:
ds = min_df.to_xarray()
ds
[6]:
<xarray.Dataset>
Dimensions:  (index: 506, RAD: 9)
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 ... 498 499 500 501 502 503 504 505
  * RAD      (RAD) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 24.0
Data variables:
    y        (index, RAD) float64 24.0 nan nan nan nan ... nan nan nan nan nan
    yhat     (index, RAD) float64 17.98 nan nan nan nan ... nan nan nan nan nan
You can now apply any metric from xskillscore using the accessor method. The input for the dim argument is index, as we want to reduce this dimension and apply the metric over RAD. In addition, there are nan’s in the xarray.Dataset, so you should use skipna=True:
[7]:
out = ds.xs.mse('y', 'yhat', dim="index", skipna=True)
out
[7]:
<xarray.DataArray (RAD: 9)>
array([161.23755363, 313.85575025, 307.22076012, 162.63442999, 221.85296903,
       155.6129776 , 214.37524005, 278.09256049, 148.84050691])
Coordinates:
  * RAD      (RAD) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 24.0
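skipna=True simply ignores the padded nan cells when averaging. A minimal numpy sketch of the same reduction, using toy values rather than the dataset above:

```python
import numpy as np

# Two RAD columns; each row carries a value in one column and nan in the other
y    = np.array([[24.0, np.nan], [np.nan, 21.6], [np.nan, 34.7]])
yhat = np.array([[18.0, np.nan], [np.nan, 41.1], [np.nan, 50.8]])

# Mean squared error over the "index" axis, skipping the nan padding
mse_per_rad = np.nanmean((y - yhat) ** 2, axis=0)
print(mse_per_rad)
```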
It makes sense to return the data in tabular form, so call xarray.DataArray.to_series to convert it back to a pandas.Series:
[8]:
out.to_series()
[8]:
RAD
1.0 161.237554
2.0 313.855750
3.0 307.220760
4.0 162.634430
5.0 221.852969
6.0 155.612978
7.0 214.375240
8.0 278.092560
24.0 148.840507
dtype: float64
Evaluating predictions over many columns¶
xskillscore is built upon xarray.apply_ufunc, which offers a speed-up by vectorizing operations. As a result, xskillscore can be faster than pandas.groupby.apply. This is especially true if there are many samples in the dataset and if the predictions have to be evaluated over many fields.
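To illustrate why vectorization helps, here is a minimal numpy sketch (shapes chosen to match the example below): a single array reduction computes every per-(STORE, SKU) score in one call, where groupby.apply would make 10,000 Python-level function calls:

```python
import numpy as np

rng = np.random.default_rng(42)
# Dense cube of shape (DATE, STORE, SKU)
y = rng.integers(1, 10, size=(10, 100, 100)).astype(float)
yhat = y * rng.uniform(0.0, 2.0, size=y.shape)

# One vectorized reduction over the DATE axis yields all 10,000 MSEs at once
mse = ((y - yhat) ** 2).mean(axis=0)
print(mse.shape)  # (100, 100)
```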
For this exercise we will create fake data for which the predictions have to be evaluated over three fields:
[9]:
stores = np.arange(100)
skus = np.arange(100)
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "y": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
noise = np.random.uniform(-1, 1, size=len(df['y']))
df['yhat'] = (df['y'] + (df['y'] * noise)).clip(lower=df["y"].min())
df
[9]:
  | DATE | STORE | SKU | y | yhat
---|---|---|---|---|---
0 | 2020-01-01 | 0 | 0 | 3 | 4.617306 |
1 | 2020-01-01 | 0 | 1 | 6 | 1.000000 |
2 | 2020-01-01 | 0 | 2 | 2 | 3.039347 |
3 | 2020-01-01 | 0 | 3 | 3 | 5.102145 |
4 | 2020-01-01 | 0 | 4 | 5 | 3.563087 |
... | ... | ... | ... | ... | ... |
99995 | 2020-01-10 | 99 | 95 | 9 | 15.836256 |
99996 | 2020-01-10 | 99 | 96 | 5 | 7.515791 |
99997 | 2020-01-10 | 99 | 97 | 1 | 1.000000 |
99998 | 2020-01-10 | 99 | 98 | 6 | 6.676512 |
99999 | 2020-01-10 | 99 | 99 | 5 | 4.600985 |
100000 rows × 5 columns
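As an aside, the nested loops above can be replaced with a cartesian product via pandas.MultiIndex.from_product. This is a hypothetical equivalent construction, not part of the original tutorial (the random draws differ, so the values will not match the table above):

```python
import numpy as np
import pandas as pd

# Build all (DATE, STORE, SKU) combinations in one shot
dates = pd.date_range("1/1/2020", "1/10/2020", freq="D")
idx = pd.MultiIndex.from_product(
    [dates, np.arange(100), np.arange(100)], names=["DATE", "STORE", "SKU"]
)
df = pd.DataFrame(index=idx).reset_index()

# Same y / yhat recipe as above
df["y"] = np.random.randint(1, 10, size=len(df))
noise = np.random.uniform(-1, 1, size=len(df))
df["yhat"] = (df["y"] + df["y"] * noise).clip(lower=df["y"].min())
print(df.shape)  # (100000, 5)
```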
Time the pandas.groupby.apply method:
[10]:
%%time
df.groupby(['STORE', 'SKU']).apply(lambda x: mean_squared_error(x["y"], x["yhat"]))
CPU times: user 3.52 s, sys: 4.37 ms, total: 3.52 s
Wall time: 3.52 s
[10]:
STORE SKU
0 0 8.384360
1 7.071648
2 14.677462
3 13.391239
4 12.131033
...
99 95 18.473114
96 10.154608
97 11.743513
98 8.406069
99 7.098808
Length: 10000, dtype: float64
Time it using xskillscore:
[11]:
%%time
df.set_index(["DATE", "STORE", "SKU"]).to_xarray().xs.mse(
"y", "yhat", dim="DATE"
).to_series()
CPU times: user 23.3 ms, sys: 0 ns, total: 23.3 ms
Wall time: 22.7 ms
[11]:
STORE SKU
0 0 8.384360
1 7.071648
2 14.677462
3 13.391239
4 12.131033
...
99 95 18.473114
96 10.154608
97 11.743513
98 8.406069
99 7.098808
Length: 10000, dtype: float64
See xskillscore-tutorial for further reading.