SBUHybrid

Members

First name (team leader)

Cheng

Last name

Zheng

Organisation name

Stony Brook University

Organisation type

Research Organisation (Academic, Independent, etc.)

Organisation location

United States of America

First name

Edmund

Last name

Chang

Organisation name

Stony Brook University

Organisation type

Research Organisation (Academic, Independent, etc.)

Organisation location

United States of America

Models

Model name

HybridPPx

Number of individuals supporting model development:

1-5

Maximum number of Central Processing Units (CPUs) supporting model development or forecast production:

48-1,000

Maximum number of Graphics Processing Units (GPUs) supporting model development or forecast production:

< 4

How would you best classify the IT system used for model development or forecast production:

High-Performance Computing (HPC) Cluster

Model summary questionnaire for model HybridPPx

Please note that the list below shows all questionnaires submitted for this model.
They are displayed from the most recent to the earliest, covering each 13-week competition period in which the team competed with this model.

Which of the following descriptions best represent the overarching design of your forecasting model?

Post-processing of numerical weather prediction (NWP) data.
Statistical model focused on generating quintile probabilities.
Hybrid model that integrates physical simulations with machine learning or statistical techniques.
Ensemble-based model, aggregating multiple predictions to assess uncertainty and variability.

What techniques did you use to initialise your model? (For example: data sources and processing of initial conditions)

The model is not autoregressive (i.e., we do not initialize the model with a specific snapshot of the atmosphere). However, we do incorporate information from ECMWF IFS subseasonal forecast runs (initialized at 00z Thursday).

If any, what data does your model rely on for real-time forecasting purposes?

ECMWF IFS subseasonal forecast runs
ERA5T
Note that our model also needs some observational data from before the forecast initialization (00z Thursday). Since ERA5T has a 5-day delay, we use ECMWF medium-range forecasts (from TIGGE) to represent observed conditions on day -3 to day -1.
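The gap-filling step described above can be illustrated with a minimal sketch. It is not the team's code: the function name, the dictionary layout keyed by day offset, and the 14-day window length are assumptions made for illustration only. Each day before initialization takes its value from ERA5T when available and otherwise falls back to a forecast proxy such as a TIGGE medium-range forecast.

```python
from typing import Dict, List, Tuple


def build_observation_window(
    era5t: Dict[int, float],
    forecast_proxy: Dict[int, float],
    first_day: int = -14,  # assumed window start (days relative to initialization)
    last_day: int = -1,
) -> List[Tuple[int, float, str]]:
    """Return (day offset, value, source) for each day in the window.

    ERA5T is preferred; the forecast proxy is used only where ERA5T is
    missing because of its release delay.
    """
    window = []
    for day in range(first_day, last_day + 1):
        if day in era5t:
            window.append((day, era5t[day], "ERA5T"))
        elif day in forecast_proxy:
            window.append((day, forecast_proxy[day], "TIGGE"))
        else:
            raise ValueError(f"no data available for day {day}")
    return window


# Hypothetical values: ERA5T covers days -14..-4, the proxy covers days -3..-1.
era5t_values = {day: 280.0 + 0.1 * day for day in range(-14, -3)}
proxy_values = {-3: 279.5, -2: 279.9, -1: 280.2}
window = build_observation_window(era5t_values, proxy_values)
print([source for _, _, source in window[-4:]])
```

Tagging each value with its source keeps it visible downstream which days come from a forecast rather than from reanalysis.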

What types of datasets were used for model training? (For example: observational datasets, reanalysis data, NWP outputs or satellite data)

For data-driven model training: Community Earth System Model 2 Large Ensemble (CESM2-LE)
For post-processing: ERA5 and ECMWF IFS CY49r1 subseasonal reforecasts

Please provide an overview of your final ML/AI model architecture (For example: key design features, specific algorithms or frameworks used, and any pre- or post-processing steps)

(1) Key design feature: a data-driven reconstruction/prediction model targeting precipitation, temperature, and pressure, trained on CESM2-LE data.
(2) Hybrid "deterministic forecast": using observational data (from the past 2 weeks) and predictions from IFS realtime subseasonal forecasts as predictors in the data-driven model, we make a deterministic prediction of the target variables. Note that the IFS prediction of the target variable itself (e.g., precipitation) is NOT used in this step. There are multiple configurations of the data-driven model, incorporating different amounts of IFS-predicted information.
(3) Post-processing and transformation to a probability forecast: based on applying (1)-(2) to reforecast data, the probability forecast is generated using the hybrid model configurations with high prediction skill during the reforecast period at different grid points. The probability forecast is made using: i) the hybrid "deterministic forecast"; ii) the ensemble spread of the target variable in the IFS realtime forecast; iii) probability thresholds derived by applying the hybrid model to reforecast data; iv) simple distribution transformation techniques for certain target variables.
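One way to picture step (3) is the sketch below, which turns a deterministic value, a spread, and four quintile thresholds into five bin probabilities. The Gaussian predictive distribution is an assumption chosen for illustration; the questionnaire does not state which distribution or transformation the team uses, and the function name and inputs are hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def quintile_probabilities(
    center: float, spread: float, edges: Sequence[float]
) -> List[float]:
    """Probability of each of the five quintile bins.

    center: deterministic forecast value (assumed mean of the distribution)
    spread: ensemble spread (assumed standard deviation)
    edges:  the four thresholds separating the quintiles, in ascending order
    """
    if len(edges) != 4 or list(edges) != sorted(edges):
        raise ValueError("edges must be four ascending thresholds")
    if spread <= 0:
        raise ValueError("spread must be positive")
    dist = NormalDist(mu=center, sigma=spread)
    # Cumulative probability at each threshold, padded with 0 and 1.
    cdf = [0.0] + [dist.cdf(edge) for edge in edges] + [1.0]
    return [cdf[i + 1] - cdf[i] for i in range(5)]


# Hypothetical example: a warm-leaning forecast against standardized thresholds.
probs = quintile_probabilities(center=0.5, spread=1.0, edges=[-0.84, -0.25, 0.25, 0.84])
print([round(p, 3) for p in probs])
```

A larger spread flattens the five probabilities toward 0.2 each, while a small spread concentrates them in the bin containing the deterministic value.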

Have you published or presented any work related to this forecasting model? If yes, could you share references or links?

A paper discussing the data-driven model has been submitted.

Before submitting your forecasts to the AI Weather Quest, did you validate your model against observational or independent datasets? If so, how?

We validated our model against ERA5 data, deriving the probability thresholds from ERA5 as described on the AI Weather Quest website. The hybrid model was tested with two versions of the ECMWF IFS model, CY48r1 and CY49r1. Forecast skill, for both our hybrid model and the IFS predictions, was evaluated over both the reforecast and realtime forecast periods of the two IFS versions. This let us test how well our model performs relative to the IFS predictions.
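The answer above does not name the skill metric. As an assumed illustration only, the sketch below uses the ranked probability skill score (RPSS), a common choice for quintile forecasts, with an equal-probability climatology as the reference. The function names are hypothetical.

```python
from typing import List, Sequence


def rps(probs: Sequence[float], observed_bin: int) -> float:
    """Ranked probability score of one forecast (0 is a perfect score)."""
    cumulative_forecast = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cumulative_forecast += p
        # Cumulative observation: 0 below the observed bin, 1 from it onward.
        cumulative_observed = 1.0 if k >= observed_bin else 0.0
        score += (cumulative_forecast - cumulative_observed) ** 2
    return score


def rpss(forecasts: List[Sequence[float]], observed_bins: List[int]) -> float:
    """Skill relative to a climatology giving each quintile probability 0.2."""
    if len(forecasts) != len(observed_bins) or not forecasts:
        raise ValueError("need one observed bin per forecast")
    climatology = [0.2] * 5
    mean_forecast = sum(rps(f, o) for f, o in zip(forecasts, observed_bins)) / len(forecasts)
    mean_reference = sum(rps(climatology, o) for o in observed_bins) / len(observed_bins)
    return 1.0 - mean_forecast / mean_reference


# Hypothetical forecasts and verifying quintiles (0 = lowest, 4 = highest).
forecasts = [[0.1, 0.1, 0.2, 0.3, 0.3], [0.4, 0.3, 0.1, 0.1, 0.1]]
observed = [4, 0]
print(rpss(forecasts, observed) > 0.0)
```

Computing the same score for two forecast systems over the same cases gives a like-for-like comparison, with positive values indicating skill above climatology.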

Did you face any challenges during model development, and how did you address them?

(1) We are a very small team (two people), and the team leader (Cheng Zheng) does almost all of the coding and validation work.
(2) Our computational resources are rather limited (an HPC cluster within our university). Model training can take a few weeks, so re-training the model after a design change is very time-consuming.
(3) Access to realtime data: some of the observational data used in the hybrid model (e.g., ERA5 for the past 5 days) is not available in real time. To make the model ready for realtime forecasts, we had to find alternative sources of near-realtime observational data.

Are there any limitations to your current model that you aim to address in future iterations?

Both the data-driven model and the post-processing techniques have considerable room for improvement. We plan to work on them when time and funding allow.

Are there any other AI/ML model components or innovations that you wish to highlight?

We use climate model simulations to build the data-driven model. This appears to be an innovative approach compared with traditional methods for subseasonal forecasting.

Who contributed to the development of this model? Please list all individuals who contributed to this model, along with their specific roles (e.g., data preparation, model architecture, model validation, etc) to acknowledge individual contributions.

Hybrid model design/structure: Cheng Zheng and Edmund Chang
Data preparation, model training, post-processing techniques, model validation, generating realtime forecasts: Cheng Zheng

Which of the following descriptions best represent the overarching design of your forecasting model?

Post-processing of numerical weather prediction (NWP) data.
Statistical model focused on generating quintile probabilities.
Hybrid model that integrates physical simulations with machine learning or statistical techniques.
Ensemble-based model, aggregating multiple predictions to assess uncertainty and variability.