# How HydroForecast Seasonal works

This article describes our process for creating the HydroForecast Seasonal model for an individual basin. The model runs at a 10-day time step and provides forecasts up to 12 months into the future.

Using our neural network approach, we train a “base” model across ~470 sites in North America. The goal of this model is to learn general hydrological principles across a variety of basins. This model is driven by historical weather from ECMWF’s ERA5, satellite observations of snow and vegetation from the MODIS and VIIRS sensors, drainage characteristics, and snow water equivalent from NSIDC’s SNODAS. Note that this model is not trained on any weather forecast data, only on historical reanalysis from ERA5. In a later step, this allows us to force the model with different forecast traces and obtain a realistic and accurate hydrological response.

This base reanalysis model is the starting point for models designed for a specific basin. Using a machine learning technique called transfer learning, we adjust the base model parameters by “tuning” the model to a specific site. This tuned model generally shows improved accuracy across a broad range of metrics. After tuning, the model represents the hydrological conditions of the basin (such as snowpack or soil moisture) well, and it captures how future weather translates into future flows.

To create seasonal scenarios operationally, we force the model with historical weather patterns. We call this the analog trace approach.

### Analog modeling approach

At each forecast issue time, the analog model combines its understanding of the current conditions of the basin with possible future weather, drawn from 40 years (1982-2021) of historical ERA5 weather and from operational weather forecasts. Combining these two pieces of information creates 40 scenarios that simulate realistic future flows, one for each historical year. Each of these 40 scenarios is represented by a full distribution of forecasted flows, as illustrated in the figure below.

*Figure 1: Stylized example of analog traces. Each trace has an associated distribution, and the trace itself is the mean of that distribution. Each blue distribution is created by forcing the model with a historical weather trace from ERA5. The black distribution represents our final forecast and is created by mixing the individual distributions.*
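
The scenario generation described above can be sketched as a loop that starts every run from the same current basin state and changes only the weather trace. This is a minimal illustration, not the HydroForecast implementation: `build_analog_scenarios`, `toy_bucket_model`, and the numbers are all hypothetical, and the toy model returns a single flow value per time step rather than a full distribution.

```python
from typing import Callable, Dict, List

Trace = List[float]  # one value per 10-day time step


def build_analog_scenarios(
    initial_storage: float,
    weather_by_year: Dict[int, Trace],
    run_model: Callable[[float, Trace], Trace],
) -> Dict[int, Trace]:
    """Run the model once per historical year, always from the same current state."""
    return {
        year: run_model(initial_storage, trace)
        for year, trace in sorted(weather_by_year.items())
    }


def toy_bucket_model(storage: float, precip: Trace) -> Trace:
    """Hypothetical stand-in for the tuned model: a simple linear reservoir."""
    flows = []
    for p in precip:
        storage += p            # add this step's precipitation to storage
        flow = 0.5 * storage    # release half of the stored water
        storage -= flow
        flows.append(flow)
    return flows


# Two made-up historical years of precipitation (assumed values).
weather_by_year = {
    1982: [10.0, 0.0, 20.0],
    1983: [0.0, 0.0, 0.0],
}
scenarios = build_analog_scenarios(40.0, weather_by_year, toy_bucket_model)
for year, flows in scenarios.items():
    print(year, flows)
```

Both scenarios share the same starting storage, so any difference between them comes only from the weather trace that drives the run.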

To create our final forecast, we have two strategies that depend on the forecast horizon:

- Within the first 30 days, we select a subset of these flow scenarios that represent the most likely of the 40 years to materialize. Each scenario is weighted based on its “similarity” to the current 35-day precipitation forecast from NOAA’s Global Ensemble Forecast System (GEFS). From this subset of scenarios, we create an equally weighted mixture of their distributions, which becomes our final forecast. Quantiles and the mean are then derived from this final forecast distribution.
- For forecast horizons of more than 30 days, we create an equally weighted mixture of all the scenario distributions. Quantiles and the mean are then derived from this final forecast distribution.
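
The two strategies above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our production code: each scenario distribution is represented by a list of flow samples, “similarity” is assumed to be the absolute difference in total precipitation between an analog year and the GEFS forecast, and the subset size `k` is a hypothetical parameter. All function names and numbers are made up.

```python
from typing import Dict, List


def select_similar_years(
    analog_precip: Dict[int, float], forecast_precip: float, k: int
) -> List[int]:
    """Return the k years whose precipitation total is closest to the forecast.

    Assumption: similarity is measured as an absolute difference in totals.
    """
    ranked = sorted(
        analog_precip, key=lambda year: abs(analog_precip[year] - forecast_precip)
    )
    return ranked[:k]


def equal_weight_mixture(
    samples_by_year: Dict[int, List[float]], years: List[int]
) -> List[float]:
    """Pool samples so that every selected scenario carries the same weight."""
    sizes = {len(samples_by_year[y]) for y in years}
    if len(sizes) != 1:
        raise ValueError("each scenario must contribute the same number of samples")
    return sorted(v for y in years for v in samples_by_year[y])


def quantile(sorted_values: List[float], q: float) -> float:
    """Quantile of sorted data using linear interpolation between neighbors."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def final_forecast_samples(horizon_days, samples_by_year, analog_precip,
                           forecast_precip, k):
    """Use the similar subset within 30 days, all scenarios beyond that."""
    if horizon_days <= 30:
        years = select_similar_years(analog_precip, forecast_precip, k)
    else:
        years = list(samples_by_year)
    return equal_weight_mixture(samples_by_year, years)


# Made-up inputs: precipitation totals per analog year and flow samples.
analog_precip = {1982: 40.0, 1983: 95.0, 1984: 60.0, 1985: 120.0}
samples_by_year = {
    1982: [10.0, 12.0, 14.0],
    1983: [20.0, 22.0, 24.0],
    1984: [15.0, 16.0, 17.0],
    1985: [30.0, 33.0, 36.0],
}
short = final_forecast_samples(20, samples_by_year, analog_precip, 65.0, k=2)
long = final_forecast_samples(60, samples_by_year, analog_precip, 65.0, k=2)
print(quantile(short, 0.5))  # median of the 1984 + 1982 mixture: 14.5
print(quantile(long, 0.5))   # median of the all-year mixture: 18.5
```

Pooling the same number of samples from each scenario is one simple way to realize an equally weighted mixture. With parametric distributions, the same result would come from averaging the scenario CDFs.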

### Synthetic modeling approach

Our research shows that in some basins, we can improve the forecast in the first 30 days by creating what we call synthetic traces instead of analog traces. For forecast horizons over 30 days, we always use the analog trace approach described in the previous section.

In this approach, we generate 40 streamflow scenarios by forcing the model with 40 weather traces generated from the most recent GEFS weather forecast. Any aggregation of distributions is then done over all scenarios instead of just a selection of scenarios.

### Accessing seasonal forecasts in the dashboard

The traces shown in the “traces” view are the means from each scenario. Within the first 30 days, they can be analog or synthetic, while in days 31+ they are always analog. The “distribution” view aggregates the distributions of the scenarios (not only their means) for a more complete picture of the possible flows. This aggregation differs slightly depending on the trace type and the forecast horizon:

- Analog traces days 1-30: we use distributions of only the most similar scenarios to compute the final quantiles.
- Analog traces days 31+: we use distributions of all scenarios to compute the final quantiles.
- Synthetic traces days 1-30: we use distributions of all scenarios to compute the final quantiles.

Note that the 90% confidence interval derived from the quantiles will almost certainly be wider than the range of trace means, with the exception of analog traces for days 1-30. For analog traces for days 1-30, we use only a subset of the scenarios, so the most extreme scenarios may not be included in the final distribution.
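
A small numeric example makes this concrete. The sketch below uses three made-up scenarios, each represented by five flow samples, and compares the 5%-95% interval of the pooled mixture with the range of the scenario means. The values and the `quantile` helper are illustrative assumptions only.

```python
from statistics import mean
from typing import List


def quantile(sorted_values: List[float], q: float) -> float:
    """Quantile of sorted data using linear interpolation between neighbors."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


# Three hypothetical scenarios with means 10, 20 and 30.
scenarios = {
    "A": [2.0, 6.0, 10.0, 14.0, 18.0],
    "B": [12.0, 16.0, 20.0, 24.0, 28.0],
    "C": [22.0, 26.0, 30.0, 34.0, 38.0],
}
trace_means = [mean(s) for s in scenarios.values()]

# All scenarios pooled (synthetic traces, or analog traces for days 31+).
pooled_all = sorted(v for s in scenarios.values() for v in s)
low_all, high_all = quantile(pooled_all, 0.05), quantile(pooled_all, 0.95)

# Only a subset pooled (analog traces for days 1-30), here A and B.
pooled_subset = sorted(scenarios["A"] + scenarios["B"])
high_subset = quantile(pooled_subset, 0.95)

print(min(trace_means), max(trace_means))  # trace means span 10 to 30
print(low_all, high_all)                   # about 4.8 and 35.2: wider than the means
print(high_subset)                         # about 26.2: below the highest trace mean
```

When all scenarios are pooled, the interval extends past the outermost trace means because each scenario contributes its own spread. When scenario C is left out of the subset, the upper bound falls below C's trace mean, which is the exception described above.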

In the final post-processing step, we replace our seasonal forecasted quantiles for days 1-10 with the quantiles forecasted by our short-term model. This takes advantage of the short-term model’s higher accuracy in those first 10 days. Because our short-term model doesn’t produce traces, the trace view and the distribution view can disagree in the first 10 days of our forecast.
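
The replacement step can be sketched as a simple splice. The data layout here is an assumption made for illustration: quantiles are stored per forecast day in a dictionary, keyed by the last day that each 10-day step covers, and `splice_short_term` is a hypothetical helper, not part of any HydroForecast API.

```python
from typing import Dict

Quantiles = Dict[float, float]  # quantile level -> forecasted flow


def splice_short_term(
    seasonal: Dict[int, Quantiles],
    short_term: Dict[int, Quantiles],
    cutoff_day: int = 10,
) -> Dict[int, Quantiles]:
    """Replace seasonal quantiles up to cutoff_day with short-term quantiles."""
    merged = {}
    for day, seasonal_q in seasonal.items():
        use_short = day <= cutoff_day and day in short_term
        # Copy the chosen quantiles so the inputs are left unmodified.
        merged[day] = dict(short_term[day] if use_short else seasonal_q)
    return merged


# Made-up quantiles for the first three 10-day steps.
seasonal = {
    10: {0.05: 80.0, 0.5: 100.0, 0.95: 130.0},
    20: {0.05: 75.0, 0.5: 98.0, 0.95: 140.0},
    30: {0.05: 70.0, 0.5: 95.0, 0.95: 150.0},
}
short_term = {10: {0.05: 90.0, 0.5: 104.0, 0.95: 115.0}}

merged = splice_short_term(seasonal, short_term)
print(merged[10])  # quantiles from the short-term model
print(merged[20])  # unchanged seasonal quantiles
```

Only the quantiles change in this step. The traces still come from the seasonal scenarios, which is why the two dashboard views can differ over the first 10 days.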

*Figure 2: Traces view in our dashboard.*

*Figure 3: Distribution view in our dashboard.*