12 Ways to Test Your Forecasts Like a Pro

How to find the best performance estimation approach for time-series forecasts among 12 strategies proposed in the literature. With Python code.

[Image by Author]

Data scientists are always focused on finding the best model for their dataset. However, they often overlook the importance of choosing the best performance estimation method.

This is unfortunate because estimating a MAE of, say, 5.2 on your test set is useless if you are not reasonably sure that the MAE will actually be close to 5.2 on future data.

The topic of performance estimation is particularly critical when dealing with forecasts. Unlike in other applications, the order of observations matters in a time-series, so we cannot simply apply cross-validation and assume it will work fine.

This is why many performance estimation methods have been proposed in the literature. In this article, we will compare 12 of these methods on 58 real time-series datasets. The results are surprising.

A time-series is the observation of one or more variables at different points in time. This implies that, unlike other kinds of datasets, the observations have an inherent order.

For instance, let’s take the time-series of daily visitors of a shop, from March 1st, 2023 to March 8th, 2023.

A univariate time-series. [Image by Author]

The idea is to use the past behavior of the variable to predict its future. So it is natural to use the variable itself as our target, i.e. y. The matrix of predictors, i.e. X, is made of the values of the same variable on previous days (called lag_1, lag_2, …).

Obtaining features and a target variable from the time-series. [Image by Author]
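The construction above can be sketched with pandas. This is only an illustration: the visitor counts are made-up numbers, and make_lag_features is a hypothetical helper, not something taken from the article's repository.

```python
import pandas as pd

# Hypothetical daily visitor counts for March 1st-8th, 2023 (illustrative numbers only)
visitors = pd.Series(
    [120, 135, 128, 150, 160, 90, 85, 125],
    index=pd.date_range("2023-03-01", periods=8, freq="D"),
    name="y",
)

def make_lag_features(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Return a frame with the target y and the columns lag_1 ... lag_n."""
    frame = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        # shift(lag) moves values forward in time, so row t holds y at t - lag
        frame[f"lag_{lag}"] = series.shift(lag)
    # The first n_lags rows have no complete history, so they are dropped
    return frame.dropna()

dataset = make_lag_features(visitors, n_lags=3)
X, y = dataset.drop(columns="y"), dataset["y"]
print(dataset)
```

Note that using three lags costs the first three observations, since their history is incomplete.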

Of course, we could in principle add as many lags as we want, not only for the variable that we are trying to predict but also for other variables.

Furthermore, we may add other variables such as the day of the week and the month. In general, we can have variables about the past and variables about the future (as long as they are foreseeable at prediction time).

Obtaining features and a target variable from the time-series. [Image by Author]
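Calendar variables are a typical example of features that are known in advance. A small sketch, assuming the same date range as in the shop example (the column names are my own choice):

```python
import pandas as pd

# Calendar attributes depend only on the date, so they can also be
# computed for future dates at prediction time
dates = pd.date_range("2023-03-01", periods=8, freq="D")

calendar = pd.DataFrame(
    {
        "day_of_week": dates.dayofweek,  # Monday = 0, Sunday = 6
        "month": dates.month,
    },
    index=dates,
)
print(calendar)
```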

Now that we have a dataset ready for machine learning, let’s move on to performance estimation.

Usually, a machine learning model is tested on some data that it has not seen during the training phase (test set). The fundamental assumption is that the performance on the test set approximates the performance that will be achieved on new (future) data. So, the question becomes:

How do we make sure that the performance we are estimating now will be similar to the performance on some data that we still don’t have?

Since we are dealing with time-series, the most intuitive way is to “pretend” that we are at some point in the past (let’s call it the cut-off point) and that we still don’t know the data following that point. Then, we can test different strategies on the data preceding the cut-off point and see which one would have yielded the estimate most similar to the performance actually observed after the cut-off point.

Index of a time-series. A cut-off point is set after observation 12. [Image by Author]

For instance, suppose that we want to test a particular strategy: training a model on the first 70% of the data and evaluating its performance (e.g. Mean Absolute Error) on the last 30% of the data.

Once we have estimated the MAE on the test set, we train a new model on all the data prior to the cut-off and calculate the MAE on the subsequent data. The distance between the two MAEs is a measure of how good (or bad) the performance estimation method is.

The 5 phases of validating a performance estimation method. [Image by Author]
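Here is a minimal sketch of this procedure. It assumes a synthetic random-walk series, three lags, a cut-off placed at 90% of the data, and an ordinary least-squares model fitted with NumPy; lagged_xy and fit_predict_mae are hypothetical helper names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series (an assumption for illustration): a random walk
series = np.cumsum(rng.normal(size=300))

def lagged_xy(values, n_lags=3):
    """Build X (lags 1..n_lags) and y from a 1-D array."""
    X = np.column_stack(
        [values[n_lags - k : len(values) - k] for k in range(1, n_lags + 1)]
    )
    y = values[n_lags:]
    return X, y

def fit_predict_mae(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept and return the MAE on the test part."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean(np.abs(A_test @ coef - y_test)))

X, y = lagged_xy(series)
cutoff = int(len(y) * 0.9)  # everything after this index plays the role of "the future"
split = int(cutoff * 0.7)   # holdout inside the known data: 70% train, 30% test

# Estimate the MAE using only the data before the cut-off
mae_test = fit_predict_mae(X[:split], y[:split], X[split:cutoff], y[split:cutoff])
# Retrain on all the data before the cut-off and score on the data after it
mae_prod = fit_predict_mae(X[:cutoff], y[:cutoff], X[cutoff:], y[cutoff:])

print(f"estimated MAE: {mae_test:.3f}, post cut-off MAE: {mae_prod:.3f}")
```

The closer the two printed numbers are, the better the estimation strategy worked on this particular series.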

The performance estimation method depicted in the figure is one of the most common. It is called “holdout” and it consists of using the first part (usually the first 75% or 80%) of the available data for training and the subsequent part for testing.

But many other strategies have been proposed in the literature. In the following sections, we will go through 12 of these validation strategies and compare them on 58 real datasets.

Out of the 12 methods that we will see in this article, 11 are taken from this paper. I have added one — called “inverse holdout” — because it is similar to an approach that I proposed in a previous article and that I found to work extremely well in many cases.

Let’s see a graphical representation of the 12 performance estimation methods. Note that the blue squares represent the observations belonging to the training set, the red squares represent the test set and the white squares are unused data points.

Representation of the 12 performance estimation methods. The blue squares are the training set, the red squares are the test set, the white squares are unused data points. [Image by Author]

Let’s go through each of them:

  • holdout: Holdout. The first part of the data (usually 75% or 80%) is used for training, and the subsequent part is used for testing.
  • inv_holdout: Inverse holdout. It’s the opposite of holdout: the last part of the data (usually 75% or 80%) is used for training, and the preceding part is used for testing.
  • rep_holdout: Repeated holdout. It’s similar to holdout, but it produces n estimates. Moreover, unlike holdout, a single estimate does not use all the observations. A block (e.g. 70%) of the dataset is randomly selected; the first part of that block is used for training and the subsequent part for testing. This procedure is repeated n (e.g. 5) times.
  • cv: Cross-validation. Each data point is randomly assigned to 1 of n (usually 5) folds. Then, n models are trained: each of them is trained on n-1 folds and tested on the remaining one, so that each model is tested on a different fold.
  • cv_mod: Modified cross-validation. It’s like cross-validation, but p points before and q points after any testing point are excluded from the training set. This is done to ensure independence between the training and test sets. As a consequence, some points may never be used for training (see the figure).
  • cv_bl: Blocked cross-validation. It’s like plain cross-validation, but there is no random shuffling of observations: each fold is made only of adjacent points.
  • cv_hvbl: Hv-blocked cross-validation. It’s like blocked cross-validation, but p points before and q points after any testing point are excluded from the training set, as in modified cross-validation.
  • preq_bls: Prequential blocks. Each fold is made only of adjacent points, and only the points preceding the fold are used for training the corresponding model. This implies that the first fold is never used for testing.
  • preq_sld_bls: Prequential sliding blocks. Each fold is made only of adjacent points. For each model, one fold is used for training and the subsequent fold is used for testing.
  • preq_bls_gap: Prequential blocks with a gap block. It’s like prequential blocks, but a gap block is left between the training blocks and the test block.
  • preq_slide: Prequential sliding window. n iterations are carried out. In the first iteration, the test set is made of the last part of the observations (e.g. the last 20%), whereas the first part is used for training. In the following iterations, the test set size is progressively reduced, whereas the training window is shifted ahead while its length is kept constant.
  • preq_grow: Prequential growing window. n iterations are carried out. In the first iteration, the test set is made of the last part of the observations (e.g. the last 20%), whereas the first part is used for training. In the following iterations, the test set size is progressively reduced, and all the remaining observations are used for training.
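To make the descriptions more concrete, here is a rough sketch of four of these schemes as index generators. These are my own simplified versions for illustration, not the code of the repository mentioned in this article; in particular, building folds with np.array_split and rounding the split point with int() are assumptions.

```python
import numpy as np

def holdout(n, train_size=0.8):
    """First part for training, the rest for testing."""
    cut = int(n * train_size)
    yield np.arange(cut), np.arange(cut, n)

def inv_holdout(n, train_size=0.8):
    """Last part for training, the preceding part for testing."""
    cut = n - int(n * train_size)
    yield np.arange(cut, n), np.arange(cut)

def cv_bl(n, n_folds=5):
    """Blocked cross-validation: contiguous folds, no shuffling."""
    folds = np.array_split(np.arange(n), n_folds)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

def preq_bls(n, n_folds=5):
    """Prequential blocks: train on every fold that precedes the test fold."""
    folds = np.array_split(np.arange(n), n_folds)
    for i in range(1, n_folds):  # the first fold is never a test fold
        yield np.concatenate(folds[:i]), folds[i]

for name, splitter in [("holdout", holdout), ("inv_holdout", inv_holdout)]:
    for train, test in splitter(10):
        print(name, train, test)
```

Writing each scheme as a generator of (train, test) index pairs means the same evaluation loop can be reused for all of them.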
I have written a Python function called get_indices that you can find in this GitHub repository.

Docstring of function “get_indices()”. [Image by Author]

As you can see, the function has two positional arguments: the name of the method (it must be one of holdout, inv_holdout, rep_holdout, cv, cv_mod, cv_bl, cv_hvbl, preq_bls, preq_sld_bls, preq_bls_gap, preq_slide, preq_grow) and the length of the time-series.

The function also takes additional arguments that depend on the method that you choose. In detail, the additional arguments are:

  • holdout. train_size: proportion of the dataset used for training.
  • inv_holdout. train_size: proportion of the dataset used for training.
  • rep_holdout. n_reps: number of repetitions. train_size: proportion of the dataset used for training. test_size: proportion of the dataset used for testing.
  • cv. n_folds: number of folds.
  • cv_mod. n_folds: number of folds. gap_before: number of time points to be discarded before the cut point. gap_after: number of time points to be discarded after the cut point.
  • cv_bl. n_folds: number of folds.
  • cv_hvbl. n_folds: number of folds. gap_before: number of time points to be discarded before the cut point. gap_after: number of time points to be discarded after the cut point.
  • preq_bls. n_folds: number of folds.
  • preq_sld_bls. n_folds: number of folds.
  • preq_bls_gap. n_folds: number of folds.
  • preq_slide. train_size: proportion of the dataset used for training. n_reps: number of repetitions.
  • preq_grow. train_size: proportion of the dataset used for training. n_reps: number of repetitions.
The function yields pairs of arrays, where the first array contains the training indices and the second array contains the test indices. For instance, with the following code:

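    # Install the package first (shell command):
    #   pip install git+https://github.com/smazzanti/valicast.git
    from valicast.validation_methods import get_indices

    # Hv-blocked cross-validation on 12 points: 4 folds, with 1 point
    # discarded on each side of every test block
    for id_train, id_test in get_indices(method="cv_hvbl", time_series_length=12, n_folds=4, gap_before=1, gap_after=1):
        print(id_train, id_test)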

This is the outcome we would obtain:

    [ 4  5  6  7  8  9 10 11] [0 1 2]
    [ 0  1  7  8  9 10 11] [3 4 5]
    [ 0  1  2  3  4 10 11] [6 7 8]
    [0 1 2 3 4 5 6 7] [ 9 10 11]
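To see where these index sets come from, here is a small standalone sketch of hv-blocked cross-validation written with NumPy only. It is my own re-implementation for illustration (hv_blocked_cv is a hypothetical name and np.array_split is an assumed way of building the folds), not the code behind get_indices, but for these arguments it yields the same training and test indices.

```python
import numpy as np

def hv_blocked_cv(n, n_folds, gap_before, gap_after):
    """Yield (train, test) index arrays for hv-blocked cross-validation."""
    indices = np.arange(n)
    for test in np.array_split(indices, n_folds):
        # Exclude the test block, plus gap_before points ahead of it
        # and gap_after points behind it
        excluded_from = test[0] - gap_before
        excluded_to = test[-1] + gap_after
        keep = (indices < excluded_from) | (indices > excluded_to)
        yield indices[keep], test

splits = list(hv_blocked_cv(12, n_folds=4, gap_before=1, gap_after=1))
for train, test in splits:
    print(train, test)
```

In the second split, for example, points 2 and 6 are dropped because they are adjacent to the test block 3, 4, 5.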

Now that we have seen how the 12 methods work, let’s apply them to some real data.

I have taken 58 real datasets from the Time Series Data Library (TSDL), which is released under the GNU General Public License. TSDL consists of 648 datasets, but I selected only the univariate time-series with at least 1,000 observations.

I repeated the following process for each dataset and for each performance estimation method:

  • I applied the method to the first 90% of the dataset only, fitting a linear regression on each training split and computing the MAE on the corresponding test split (I call this the test MAE).
  • I then trained the same model on the whole first 90% of the observations, used it to predict the last 10%, and computed the MAE on that last 10% (I call this the production MAE). This is the benchmark, i.e. the “true” MAE that the model would achieve in production.
  • I computed the relative change between the two metrics as MAE test / MAE prod - 1. This is an indicator of how close the performance that we would estimate during the testing phase is to the real performance that we would observe during the production phase. Thus, the smaller this number, the better the method.
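The metric in the last step is a one-liner. In the sketch below, the function name and the two input values are hypothetical and serve only as an illustration.

```python
def relative_change(mae_test: float, mae_prod: float) -> float:
    """Relative change between the estimated MAE and the production MAE."""
    return mae_test / mae_prod - 1

# Hypothetical values: the estimate was 5.2, but production turned out to be 6.5
change = relative_change(5.2, 6.5)
print(round(change, 3))
```

A negative value means that the test phase was optimistic (the estimated error was lower than the one observed in production), whereas a positive value means that it was pessimistic.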
These are the relative changes in MAE that I obtained for each method on the first 10 datasets (I have highlighted the minimum for each row, which shows the best method for that dataset):

Relative change between test MAE and production MAE for each dataset and validation method. This is calculated as MAE test / MAE prod - 1. The minimum (i.e. best) value by row is highlighted. [Image by Author]

Selecting the best method for each dataset, this is what we obtain:

Number of times that a method was the best one for a dataset. The sum is 58 because 58 datasets have been used. [Image by Author]
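Counting the wins takes two pandas calls. The table below contains toy numbers that I made up for this example (three datasets and three methods); they are not the results of the experiment.

```python
import pandas as pd

# Toy relative changes (hypothetical numbers, NOT the article's results):
# rows are datasets, columns are validation methods
rel_change = pd.DataFrame(
    {
        "holdout": [0.30, 0.12, 0.25],
        "inv_holdout": [0.05, 0.20, 0.02],
        "preq_grow": [0.10, 0.08, 0.15],
    },
    index=["dataset_1", "dataset_2", "dataset_3"],
)

best_method = rel_change.idxmin(axis=1)  # column name of the row-wise minimum
wins = best_method.value_counts()        # number of datasets won by each method
print(wins)
```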

Inverse holdout is the best performing with 13 wins, followed by two prequential methods.

It is also interesting to note that two of the methods most used by practitioners, holdout and plain cross-validation, are among the worst performing.

But maybe you are thinking that choosing one method over another cannot make that much difference. To check that, I have taken the best performing and the second-best performing method for each dataset and compared them, then grouped these statistics based on the best method:

Average improvement of the best method compared to the second-best method. [Image by Author]

This means that, when preq_slide is the best method, it is only 2% better than the second-best method (in terms of relative change in MAE). Conversely, when inv_holdout is the best method, it is on average much better than the second-best method: the relative change in MAE of the second-best method is on average 500% higher than that of inv_holdout.
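One possible way to compute this kind of statistic is sketched below. Both the toy table and the exact definition of the improvement (second-best value divided by best value, minus 1) are assumptions on my part, so take it as an illustration of the idea rather than the exact computation behind the chart.

```python
import numpy as np
import pandas as pd

# Toy relative changes (hypothetical numbers, NOT the article's results)
rel_change = pd.DataFrame(
    {
        "holdout": [0.30, 0.12, 0.25],
        "inv_holdout": [0.05, 0.20, 0.02],
        "preq_grow": [0.10, 0.08, 0.15],
    },
    index=["dataset_1", "dataset_2", "dataset_3"],
)

# Sort each row in ascending order: column 0 is the best, column 1 the second-best
ordered = np.sort(rel_change.to_numpy(), axis=1)
best, second = ordered[:, 0], ordered[:, 1]

# Assumed definition: how much higher the second-best value is than the best one
gap = pd.Series(second / best - 1, index=rel_change.index)

# Average the gap over the datasets won by each method
summary = gap.groupby(rel_change.idxmin(axis=1)).mean()
print(summary)
```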

I was also curious to see whether the methods are correlated across the different datasets. So this is the average correlation between each method and the other 11 methods:

Average correlation between each method and all the remaining methods. [Image by Author]

As you might expect, the methods are highly correlated with one another, especially cross-validation and its variants. The least correlated methods are inverse holdout (probably because it works differently from all the other methods, with the test set preceding the training set) and prequential blocks with a gap block (probably because it has the biggest gap between the training and the test sets).
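For reference, an average correlation of this kind can be obtained from a datasets-by-methods table as follows. The table here is filled with synthetic random numbers, the method names are placeholders, and the use of Pearson correlation is my assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic table (NOT the article's results): 20 datasets x 4 methods
# that share a common component, so that the columns are correlated
common = rng.normal(size=(20, 1))
rel_change = pd.DataFrame(
    common + 0.5 * rng.normal(size=(20, 4)),
    columns=["method_a", "method_b", "method_c", "method_d"],
)

corr = rel_change.corr()  # pairwise Pearson correlation between methods
n_methods = corr.shape[0]

# Average over the other methods: remove the 1.0 on the diagonal before averaging
avg_corr = (corr.sum(axis=1) - 1.0) / (n_methods - 1)
print(avg_corr.sort_values())
```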

The performance estimation strategy is fundamental in any machine learning project, even more so in forecasting, where the order of observations is not random.

In this article, we tried 12 performance estimation methods on 58 real datasets. We found that two of the most widespread methods, plain holdout and plain cross-validation, are rarely the best choice.

The winner of our experiment was “inverse holdout”. So, is it a silver bullet? No such thing in data science!

Just as you test many predictive models on your dataset, you should look for the validation strategy that works best for your specific use case.
