Evaluation System
Forecast evaluation methods
During the Competition Phase, weekly leaderboards will display Ranked Probability Skill Scores (RPSSs). All scores will be area-weighted and benchmarked against climatology.
- Temperature and precipitation forecasts will be evaluated over land-dominated regions (>=80% land coverage).
- Mean sea level pressure forecasts will be assessed across all grid points.
Forecast evaluation datasets
All forecasts will be evaluated against initial ECMWF Reanalysis version 5 (ERA5) release data (ERA5T).
For temperature and pressure
Forecasts will be evaluated against weekly averages computed from six-hourly data (0, 6, 12, and 18 UTC).
For precipitation
Forecasts will be evaluated against weekly accumulations derived from hourly data.
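As an illustration, the weekly aggregation described above can be sketched as follows. The arrays and values are hypothetical stand-ins for ERA5T data at a single grid point, not the operational evaluation code:

```python
import numpy as np

# Hypothetical stand-ins for ERA5T data at a single grid point.
# Temperature/pressure: six-hourly values (00, 06, 12 and 18 UTC),
# i.e. 4 samples/day x 7 days = 28 values per forecast week.
six_hourly_tas = 285.0 + np.sin(np.linspace(0.0, 14.0 * np.pi, 28))
weekly_mean_tas = six_hourly_tas.mean()   # weekly average (tas, mslp)

# Precipitation: hourly values, i.e. 24 samples/day x 7 days = 168 values.
hourly_pr = np.full(168, 0.1)             # 0.1 mm each hour (illustrative)
weekly_accum_pr = hourly_pr.sum()         # weekly accumulation (pr) -> 16.8 mm
```

The distinction matters when post-processing model output: temperature and pressure are time-averaged, while precipitation is summed over the week.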
The evaluation system relies on a climatological reference, derived by calculating quintiles of historical data. For this challenge, the climatological quintile boundaries are calculated from 20 years of historical data aggregated into weekly statistics. To enhance the statistical spread, the sample size is expanded to 100 observations by also including weekly statistics for start dates at two-day intervals within +/- 4 days of the requested target date (five start dates x 20 years = 100 samples).
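The quintile-boundary calculation can be sketched in a few lines. This is a minimal example with a hypothetical 100-member climatological sample at one grid point, not the operational code:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 100-member climatological sample for one grid point:
# 20 years x 5 start dates (-4, -2, 0, +2, +4 days around the target date).
years, offsets = 20, [-4, -2, 0, 2, 4]
sample = rng.normal(285.0, 3.0, size=(years, len(offsets))).ravel()

# Quintile boundaries sit at the 20th, 40th, 60th and 80th percentiles,
# splitting the climatology into five equally likely categories.
quintile_boundaries = np.percentile(sample, [20, 40, 60, 80])
```

A new observation is then assigned to whichever of the five categories it falls into relative to these four boundaries.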
ERA5 provides climatological quintile boundaries, while ERA5T is used for the most recent observations.
In addition to ERA5T-based evaluations, forecasts will be compared against other global observational and reanalysis datasets such as IMERG and MSWEP. While these additional comparisons will not affect leaderboard rankings, they will offer a more comprehensive assessment of AI/ML sub-seasonal forecasts.
Forecast evaluation tools
To ensure transparency and replicability of the evaluation techniques used, participants can download the evaluation code and evaluate their forecasts using the AI-WQ-package Python package:
- retrieve_evaluation_data.py retrieves the required datasets including observations and climatological quintile boundaries.
- forecast_evaluation.py computes evaluation metrics.
The evaluation code should enable participants to self-assess their forecasts without competitive pressure during the Testing JJA period, which begins on 15 May 2025.
Join one of the two slots of the Testing Period Launch Webinar on 7 May to learn more about the Testing JJA Period and ask your questions live.
Leaderboards structure
Each week, a set of leaderboards will publicly display the latest rankings on two evaluation metrics:
- Weekly RPSSs – tracking weekly performance.
- Period-aggregated RPSSs – calculated from mean ranked probability scores across the competitive period, offering insights into average forecast accuracy.
For each metric, scores will be computed for each forecast window (days 19-25 or days 26-32) and each variable (tas, mslp, pr). Additionally, a mean RPSS will be calculated by averaging scores across all variables (variable-averaged RPSS).
Competition winners for each forecast window will be primarily determined by the variable-averaged, period-aggregated RPSS.
Beyond real-time rankings, historical leaderboards will display results by competitive period and forecast initialisation date for reference.
What is a Ranked Probability Skill Score (RPSS)?
Unlike deterministic predictions, probabilistic forecasts cannot be strictly classified as “right” or “wrong”, except in cases where probabilities of 0% or 100% are given. To evaluate probabilistic forecasts, a variety of verification methods have been developed.
In the AI Weather Quest, probabilistic forecasts are evaluated using Ranked Probability Skill Scores (RPSSs), as they allow for a comprehensive assessment across multiple categories within a probabilistic system, such as individual quintiles. The RPSS quantifies the deviation of the forecast probabilities assigned to specific categories, in our case 20% intervals, from the corresponding observations (Weigel et al., 2008).
Mathematically, the RPSS is defined by:

\[
\mathrm{RPSS} = 1 - \frac{\sum_{k=1}^{K} \left( Y_k - O_k \right)^2}{\sum_{k=1}^{K} \left( P_k - O_k \right)^2}
\]

Here:
- Yk is the kth component of the cumulative categorical forecast vector Y.
- Ok is the kth component of the corresponding cumulative observation vector O.
- Pk is the kth component of the cumulative climatological probability vector P; for quintiles, each category has probability 0.2, so Pk = 0.2k.
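The definition above can be checked numerically. A minimal sketch for a single grid point, assuming quintile categories (K = 5) and hypothetical forecast probabilities:

```python
import numpy as np

# Hypothetical quintile probabilities for one grid point (must sum to 1).
forecast = np.array([0.10, 0.15, 0.20, 0.25, 0.30])
clim = np.full(5, 0.2)                           # climatology: 0.2 per quintile
observed = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # outcome fell in quintile 5

Y = np.cumsum(forecast)   # cumulative forecast vector Y
O = np.cumsum(observed)   # cumulative observation vector O
P = np.cumsum(clim)       # cumulative climatological vector P = (0.2, 0.4, ...)

rps_forecast = np.sum((Y - O) ** 2)   # ranked probability score of the forecast
rps_clim = np.sum((P - O) ** 2)       # ranked probability score of climatology
rpss = 1.0 - rps_forecast / rps_clim  # -> 0.3625 for these illustrative numbers
```

Because the forecast shifted probability toward the quintile that verified, its RPS (0.765) beats climatology's (1.2), giving a positive skill score.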
Under this framework, RPSS values can be interpreted as follows:
- RPSS = 1; the forecast has perfect skill compared to climatology, indicating the forecast is highly beneficial.
- RPSS = 0; the forecast has no additional skill compared to the climatology.
- RPSS < 0; the forecast is less accurate than climatology, indicating that the forecast lacks skill.
In addition to calculating weekly RPSSs, the AI Weather Quest will also compute aggregated RPSS values for each competitive period and across the full competition forecasting year. The period-aggregated RPSS compares the ranked probability scores (RPSs) of the forecast and of climatology at every temporal and spatial point within the competitive period.
The aggregated RPSS is defined as:

\[
\mathrm{RPSS}_{\mathrm{agg}} = 1 - \frac{\sum_{t}\sum_{l} \mathrm{RPS}_{t,l}^{\mathrm{forecast}}}{\sum_{t}\sum_{l} \mathrm{RPS}_{t,l}^{\mathrm{clim}}}
\]

where t represents temporal points and l represents spatial points.
Rather than simply averaging weekly RPSSs, which may not provide the most meaningful comparison between forecasts and observations, we will apply this aggregation approach to ensure a more accurate evaluation.
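The aggregation can be sketched as summing forecast and climatological RPSs separately over all points before forming the ratio. The period length, grid size, and forecasts below are hypothetical placeholders (the operational evaluation also applies area weighting, omitted here for brevity):

```python
import numpy as np

def rps(probs, obs_quintile):
    """Ranked probability score for one point: probs is a length-5 quintile
    probability vector, obs_quintile the observed category index (0-4)."""
    O = np.zeros(5)
    O[obs_quintile] = 1.0
    return np.sum((np.cumsum(probs) - np.cumsum(O)) ** 2)

rng = np.random.default_rng(7)
n_weeks, n_points = 13, 50        # hypothetical period length and grid size
clim = np.full(5, 0.2)

rps_fc_sum = rps_clim_sum = 0.0
for _ in range(n_weeks * n_points):          # every temporal/spatial point
    probs = rng.dirichlet(np.ones(5))        # a hypothetical forecast
    obs = rng.integers(0, 5)                 # a hypothetical observed quintile
    rps_fc_sum += rps(probs, obs)
    rps_clim_sum += rps(clim, obs)

# Period-aggregated RPSS: ratio of summed RPSs, not an average of weekly RPSSs.
period_rpss = 1.0 - rps_fc_sum / rps_clim_sum
```

Summing before dividing means weeks and locations where climatology itself scores poorly do not dominate the skill estimate the way they can when weekly RPSSs are averaged.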
Further details regarding different methods for evaluating probabilistic forecasts can be found in ECMWF’s forecast user guide.
Awards celebrations
Periodic awards

At the conclusion of each 13-week competitive period, teams with the top-performing AI/ML models – determined by the highest variable-averaged, period-aggregated RPSSs – will be celebrated in a webinar.
To ensure inclusivity and fairness, the competition will also spotlight exceptional models developed by participants from diverse organisation types and those created with limited computational resources. The ranking and recognition methods may be refined over time to better reflect the diversity of participants and their contributions.
To ensure eligibility for end-of-period recognition:
- Participants must submit forecasts every week with a given model (i.e. under the same model’s name) across the entire 13-week period.
- Participants must submit the associated model summary during the final four weeks of each competition period, guided by this questionnaire. The team leader is responsible for submitting the model summary via the team’s login page. Once periodic awards have been granted, all model summaries will be made publicly available.
Participants can join and compete in as many 13-week periods as they wish.
Annual awards

At the end of the first forecasting year (September 2026), a celebration will be organised to showcase the teams behind the best-performing models across the entire year, as well as models developed by participants from diverse organisation types or with limited computational resources. It will also serve as a platform for fostering valuable connections, promoting collaboration between forecast developers at ECMWF and participants, and encouraging knowledge exchange to further harness AI/ML in advancing sub-seasonal weather forecasting.
Only teams registered by 1 August 2025 that submit forecasts every week with a given model (i.e. under the same model’s name) across all four competitive periods will be eligible for annual awards.
Transparency
The competition is designed not only to identify the best-performing AI-driven sub-seasonal forecasts but also to serve as a valuable resource for the broader scientific and AI communities.
To promote transparency and innovation:
- A public ECMWF-hosted sub-seasonal AI forecasting portal will display submitted probabilistic forecasts on day 5 of the forecast submission schedule, following the closure of the submission window. This portal will enable spatial comparisons, potentially revealing patterns of model agreement and skill.
- Submitted forecast data will be made publicly accessible via the AI Weather Quest Python package at the conclusion of each competitive period.
- Submitted model summaries will be made publicly accessible on this website once periodic awards have been granted.