CLINT
Members
First name (team leader)
Harilaos
Last name
Loukos
Organisation name
The Climate Data Factory
Organisation type
Small & Medium Enterprise or Startup
Organisation location
France
First name
Ronan
Last name
McAdam
Organisation name
CMCC Foundation - Euro-Mediterranean Centre on Climate Change
Organisation type
Other (please specify)
Organisation location
Italy
First name
Jorge
Last name
Pérez Aracil
Organisation name
University of Alcalá
Organisation type
Research Organisation (Academic, Independent, etc.)
Organisation location
Spain
First name
Thomas
Last name
Noël
Organisation name
The Climate Data Factory
Organisation type
Small & Medium Enterprise or Startup
Organisation location
France
First name
Guilhem
Last name
Vignal
Organisation name
The Climate Data Factory
Organisation type
Small & Medium Enterprise or Startup
Organisation location
France
Models
Model name
CLINTDD
Number of individuals supporting model development:
1-5
Maximum number of Central Processing Units (CPUs) supporting model development or forecast production:
8-48
Maximum number of Graphics Processing Units (GPUs) supporting model development or forecast production:
< 4
How would you best classify the IT system used for model development or forecast production:
Cloud computing system
Model summary questionnaire for model CLINTDD
Please note that the list below shows all questionnaires submitted for this model.
They are displayed from the most recent to the earliest, covering each 13-week competition period in which the team competed with this model.
Which of the following descriptions best represent the overarching design of your forecasting model?
- Post-processing of numerical weather prediction (NWP) data.
- Machine learning-based weather prediction.
- Hybrid model that integrates physical simulations with machine learning or statistical techniques.
What techniques did you use to initialise your model? (For example: data sources and processing of initial conditions)
The CLINTDD model consists of two components:
(1) an ML data-driven forecasting core trained on paleoclimate data, and
(2) a hybrid ensemble component based on the ECMWF Extended-Range Ensemble Forecast.
We initialise only the data-driven component. The dynamical ensemble is acquired from ECMWF and calibrated, providing physically consistent forecasts that are later used in the hybrid post-processing.
The data-driven model was first trained on the PMIP6 Past2k paleoclimate simulation (MPI-ESM1.2-LR, 0–1850 CE) to identify the most relevant predictor variables and time-lag relationships controlling surface temperature and precipitation variability. An evolutionary optimisation algorithm was used to identify the predictor combinations that provide the best seasonal forecast skill in the model world.
Once the optimal configuration was determined in the model world, the same predictors are taken from the ERA5 reanalysis for operational forecasts.
For each forecast start date, ERA5 provides the initial predictor fields describing the atmospheric, oceanic, and land-surface state (such as 2 m temperature, mean sea-level pressure, 500 hPa geopotential height, soil moisture, sea-surface temperature, sea-ice concentration, and others). These variables are converted to anomalies relative to the ERA5 climatology and spatially clustered using an enhanced k-means algorithm (which includes weighting by geographical distance).
Lagged weekly averages of these clustered predictors (typically 1–12 weeks before the start date) form the initial condition vector of the AI model, following the lag structure optimised from the paleoclimate training.
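A minimal sketch of how lagged weekly averages could be assembled into an initial condition vector, assuming one daily series per cluster that ends the day before the forecast start date. The function names, the fixed 7-day windows, the per-day climatology input, and the use of the same number of lags for every cluster are illustrative assumptions, not the CLINTDD implementation. Plain Python is used to keep the example self-contained.

```python
from statistics import fmean


def to_anomalies(values, climatology):
    """Subtract the matching climatological value from each daily value."""
    if len(values) != len(climatology):
        raise ValueError("values and climatology must have the same length")
    return [v - c for v, c in zip(values, climatology)]


def weekly_lagged_means(daily_anomalies, n_lags):
    """Average the n_lags consecutive 7-day windows preceding the start date.

    The series is assumed to end the day before the forecast start date.
    Element 0 of the result is lag 1 (the most recent week).
    """
    if n_lags < 1 or len(daily_anomalies) < 7 * n_lags:
        raise ValueError("need n_lags >= 1 and at least 7 * n_lags daily values")
    means = []
    for lag in range(1, n_lags + 1):
        end = len(daily_anomalies) - 7 * (lag - 1)
        means.append(fmean(daily_anomalies[end - 7:end]))
    return means


def build_predictor_vector(cluster_series, cluster_climatology, n_lags):
    """Concatenate lagged weekly anomaly means over clusters (sorted by name)."""
    vector = []
    for name in sorted(cluster_series):
        anomalies = to_anomalies(cluster_series[name], cluster_climatology[name])
        vector.extend(weekly_lagged_means(anomalies, n_lags))
    return vector
```

In the described workflow the lags retained for each predictor come from the feature-selection step, so a real vector would keep only the selected cluster and lag pairs rather than every lag for every cluster.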
If any, what data does your model rely on for real-time forecasting purposes?
The CLINTDD model relies on three main datasets for real-time operation:
- ERA5 near-real-time reanalysis, providing the most recent atmospheric, land, and ocean predictors used to initialise each weekly forecast.
- Past2k paleoclimate simulation (MPI-ESM1.2-LR), used at each forecast cycle to perform feature selection and identify the most relevant predictors and lag relationships.
- ECMWF Extended-Range Ensemble Forecasts, which supply the operational ensemble members up to 45 days ahead and are post-processed through the hybrid AI framework.
What types of datasets were used for model training? (For example: observational datasets, reanalysis data, NWP outputs or satellite data)
The CLINTDD data-driven model is trained entirely on the Past2k paleoclimate simulation (MPI-ESM1.2-LR, years 0–1850 CE).
This dataset provides a long, stationary record used to optimise the feature-selection algorithm and train the machine-learning models linking lagged predictors to temperature and precipitation variability.
The ERA5 reanalysis is not used for training but to define the spatial clusters of predictors and to apply the trained model for real-world forecasts.
Please provide an overview of your final ML/AI model architecture (For example: key design features, specific algorithms or frameworks used, and any pre- or post-processing steps)
The CLINTDD model combines a data-driven forecasting framework with a hybrid ensemble post-processing step. Its architecture consists of four main stages:
1 - Dimensionality reduction – Predictor fields (e.g. temperature, soil moisture, mean sea-level pressure, 500 hPa geopotential height) are spatially clustered using an enhanced k-means algorithm to reduce input dimensionality while retaining key regional variability.
2 - Feature selection (FS) – An evolutionary algorithm (Probabilistic Coral Reef Optimization) identifies the combination of clustered predictors, domains, and time-lags that maximises predictive skill. This optimisation is performed on the Past2k paleoclimate simulation and repeated for each forecast date.
3 - Forecasting model – The selected predictors are used as inputs to machine-learning regressors (e.g. Random Forest, LightGBM, AdaBoost) that predict weekly mean temperature and precipitation anomalies.
4 - Hybrid ensemble layer – The ECMWF Extended-Range Ensemble Forecasts are first calibrated against ERA5 and then post-processed using the AI data-driven forecast as a first guess to select ensemble members, improving both deterministic and probabilistic forecast skill.
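The member-selection idea in stage 4 can be illustrated with a small sketch. The answer above does not state the selection criterion, so the choice made here (keep the members whose anomaly is closest in absolute value to the data-driven first guess, for a single location and week) is an assumption, as are the function names and the member identifiers.

```python
from statistics import fmean


def select_members(ensemble, first_guess, n_keep):
    """Return the ids of the n_keep members closest to the first guess.

    ensemble: dict mapping a member id to its calibrated forecast anomaly
              for one location and one week (hypothetical data layout).
    first_guess: anomaly predicted by the data-driven model.
    Ties are broken by member id so the result is deterministic.
    """
    if not 0 < n_keep <= len(ensemble):
        raise ValueError("n_keep must be between 1 and the ensemble size")
    ranked = sorted(ensemble, key=lambda m: (abs(ensemble[m] - first_guess), m))
    return ranked[:n_keep]


def subensemble_mean(ensemble, member_ids):
    """Mean anomaly of the selected members."""
    return fmean(ensemble[m] for m in member_ids)


# Example with four hypothetical members and a first guess of +0.4.
members = {"m1": -1.0, "m2": 0.2, "m3": 0.5, "m4": 2.0}
kept = select_members(members, first_guess=0.4, n_keep=2)
kept_mean = subensemble_mean(members, kept)
```

A probabilistic forecast could then be derived from the retained members only, which is one way a first guess can sharpen the ensemble distribution.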
Have you published or presented any work related to this forecasting model? If yes, could you share references or links?
The data-driven forecast framework, applied to seasonal forecasts of European heatwaves, was published in:
McAdam, R., Pérez-Aracil, J., Squintu, A., Peláez-Rodríguez, C., Hansen, F., Torralba, V., Loukos, H., Zorita, E., Giuliani, M., Cavicchia, L., Salcedo-Sanz, S., & Scoccimarro, E. (2025). Feature selection for data-driven seasonal forecasts of European heatwaves. Communications Earth & Environment, 6(1), 842. https://doi.org/10.1038/s43247-025-02863-4
The driver detection framework was first applied to heatwaves in:
Pérez-Aracil, J., et al. Identifying key drivers of heatwaves: A novel spatio-temporal framework for extreme event detection. Weather and Climate Extremes. https://doi.org/10.1016/j.wace.2025.100792
Before submitting your forecasts to the AI Weather Quest, did you validate your model against observational or independent datasets? If so, how?
Validation of the subseasonal version of the CLINTDD model is ongoing.
The framework follows the same approach used for the seasonal prototype: cross-validation against ERA5 reanalysis as an independent reference after training on Past2k, using deterministic (correlation, N-RMSE) and probabilistic (CRPSS) metrics.
Additional benchmarking of the hybrid ensemble configuration is being carried out within the AI Weather Quest competition, comparing forecasts to ECMWF operational outputs under the same evaluation protocol.
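For reference, the metrics named above can be sketched as follows. The definitions used are common ones and are assumptions about this team's choices: N-RMSE is taken as RMSE divided by the standard deviation of the reference values, the CRPS is the standard ensemble estimator, and CRPSS is one minus the ratio of forecast CRPS to the CRPS of a reference forecast.

```python
import math
from statistics import fmean, pstdev


def pearson_correlation(forecast, reference):
    """Pearson correlation between forecast and reference series."""
    mf, mr = fmean(forecast), fmean(reference)
    cov = sum((f - mf) * (r - mr) for f, r in zip(forecast, reference))
    var_f = sum((f - mf) ** 2 for f in forecast)
    var_r = sum((r - mr) ** 2 for r in reference)
    return cov / math.sqrt(var_f * var_r)


def nrmse(forecast, reference):
    """RMSE normalised by the reference standard deviation (assumed convention)."""
    rmse = math.sqrt(fmean([(f - r) ** 2 for f, r in zip(forecast, reference)]))
    return rmse / pstdev(reference)


def ensemble_crps(members, observed):
    """CRPS of an ensemble: mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    n = len(members)
    spread = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return fmean([abs(m - observed) for m in members]) - spread


def crpss(crps_forecast, crps_reference):
    """Skill score: 1 is perfect, 0 matches the reference, negative is worse."""
    return 1.0 - crps_forecast / crps_reference
```

With these conventions, a CRPSS above zero means the forecast beats the chosen reference (for example a climatological ensemble) in a probabilistic sense.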
Did you face any challenges during model development, and how did you address them?
The main challenge was the computational cost of the feature-selection (FS) algorithm, which must be rerun for each forecast date.
To address this, code optimisation, memory management, and batch parallelisation were implemented, reducing execution time and making the method suitable for operational use.
Are there any limitations to your current model that you aim to address in future iterations?
Yes. The subseasonal implementation is still experimental, while most validation work has so far focused on the seasonal counterpart.
The operational domain is global, and we are addressing several methodological and computational aspects during the AI Weather Quest competition, including optimisation of the feature-selection algorithm, refinement of lag and cluster choices across regions, and improvement of probabilistic calibration for weekly forecasts.
As a small team, we are carrying out this development in parallel with our other activities, progressively integrating these advances into the operational workflow.
Are there any other AI/ML model components or innovations that you wish to highlight?
The main innovation is the integration of a paleoclimate-trained feature-selection framework (Past2k) into an operational subseasonal forecasting workflow. This approach allows the model to identify physically consistent predictors and time-lags that remain robust under different climate regimes. As a result, this work provides two key outcomes: skillful and computationally economical forecasts, and identification of drivers (crucial to the scientific understanding of the processes studied, e.g. heatwaves).
Additional innovations include the hybrid ensemble post-processing guided by AI forecasts, and the global, cloud-based implementation that makes large-scale feature selection and inference feasible within the weekly competition schedule.
Who contributed to the development of this model? Please list all individuals who contributed to this model, along with their specific roles (e.g., data preparation, model architecture, model validation, etc) to acknowledge individual contributions.
The CLINTDD model was developed through a collaboration between The Climate Data Factory (TCDF), CMCC Foundation, and the Universidad de Alcalá (UAH) in the context of the H2020 CLINT project.
Harilaos Loukos (TCDF) led the design of the hybrid architecture, coordinated the integration of the data-driven and ensemble components, and supervised the operational implementation for the AI Weather Quest competition.
Ronan McAdam (CMCC) developed the data-driven forecasting framework and led the scientific validation of the method.
Jorge Pérez-Aracil (UAH) and colleagues developed the feature-selection algorithm (PCRO-SL) and the initial version of the driver-detection framework.
Thomas Noël (TCDF) implemented the cloud infrastructure, automated workflows, and operational deployment.
Guilhem Vignal (TCDF) handled code development and optimisation, data integration, and data-processing routines of the data-driven framework.
Model name
CLINTMF
Number of individuals supporting model development:
1-5
Maximum number of Central Processing Units (CPUs) supporting model development or forecast production:
8-48
Maximum number of Graphics Processing Units (GPUs) supporting model development or forecast production:
< 4
How would you best classify the IT system used for model development or forecast production:
Cloud computing system
Model name
CLINTSE
Number of individuals supporting model development:
1-5
Maximum number of Central Processing Units (CPUs) supporting model development or forecast production:
8-48
Maximum number of Graphics Processing Units (GPUs) supporting model development or forecast production:
< 4
How would you best classify the IT system used for model development or forecast production:
Cloud computing system
Submitted forecast data in previous period(s)
Please note: Submitted forecast data is only publicly available once the evaluation of a full competitive period has been completed. See the competition's detailed schedule for the publication dates of submitted data for each period.