WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark (2024)

Tao Han (HKUST; Shanghai AI Laboratory)
Song Guo (HKUST)
Zhenghao Chen (The University of Sydney)
Wanghan Xu (Shanghai AI Laboratory)
Lei Bai (Shanghai AI Laboratory)
Corresponding Author: Lei Bai, baisanshi@gmail.com.

Abstract

Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small size, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from accurately reflecting the capabilities of current forecasting methods and from supporting the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset, a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period at one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to expose future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.

1 Introduction

Global Station Weather Forecasting (GSWF) is essential for providing precise and timely weather information, with significant implications for various sectors. Accurate weather forecasts enable airlines to enhance aviation safety[1], support agriculture in optimizing crop management[2], and assist the energy industry in managing production and mitigating weather-related risks[3]. Moreover, reliable forecasts are crucial for early warning systems, aiding in the preparation for natural disasters and extreme weather events, thereby safeguarding lives and property[4, 5].

However, current GSWF predominantly relies on Numerical Weather Prediction (NWP) models, including both physics-based NWP models[6, 7] and data-driven NWP models[8, 9, 10, 11]. These models do not treat forecasting as an end-to-end optimized task, which leads to two primary challenges. First, such models are computationally demanding, relying on extensive resources for data assimilation and medium-range forecasting. Second, their results are typically represented as grid data, which lacks the precision required to accurately represent specific weather station locations.

Recently, there have been initial efforts[12] to formulate GSWF as an end-to-end task by directly learning from global station weather observations. However, due to limitations of public global station weather datasets, the scale of stations and data used for optimizing such methods remains relatively small. Specifically, current datasets often cover a single station[13], a localized region[14], or a limited time range[12]. As a result, forecasting methods achieve only a one-day lead time, limiting their applicability in real-world scenarios. Hence, it is urgent to develop a comprehensive global station weather dataset that enables forecasting models to generalize across diverse stations and regions worldwide. We believe such a comprehensive dataset would significantly enhance the robustness and applicability of GSWF models, addressing the limitations of current approaches.

Moreover, such a large weather station dataset can also serve as an extensive time-series dataset for comprehensively benchmarking forecasting methods proposed for diverse purposes (e.g., traffic prediction models[15, 16]). In fact, due to the lack of large-scale time-series datasets, most existing forecasting methods have primarily been evaluated and analyzed on small-scale datasets. These simple datasets fail to encompass the complex scientific problems that researchers need to discover and resolve, thereby hindering progress in the field of time-series prediction. Our dataset offers a diverse temporal and spatial range of time-series data, enabling a comprehensive evaluation of time-series forecasting methods and driving significant advancements in the field. To summarize, this research offers two significant contributions, namely:

  • Introduction of the WEATHER-5K Dataset. We propose a large-scale time-series dataset for sparse weather forecasting, namely WEATHER-5K, which comprises 5,672 weather stations located worldwide. This extensive dataset provides a diverse representation of weather conditions across different regions. It also offers substantial temporal coverage, with each station containing 10 years of hourly data, enabling the analysis of long-term weather patterns and the development of forecasting models that capture seasonal and interannual variations. We also conduct thorough data analysis to uncover trends, patterns, and correlations within our weather data.

  • Extensive Time-Series Benchmarking. We implement a comprehensive set of widely recognized time-series forecasting methods encompassing diverse domains such as traffic prediction, weather prediction, and electricity prediction. To evaluate the performance of different forecasting algorithms, we conducted extensive benchmark experiments on WEATHER-5K, establishing a standardized evaluation framework. The empirical findings derived from these benchmarks not only shed light on research challenges but also identify future opportunities for the advancement of GSWF and time-series forecasting.

In the future, our newly proposed dataset can support operational weather services for the public, an essential aspect of weather forecasting research that often holds greater significance than the research itself. Additionally, weather station data serve as a crucial source of observational data for NWP models, effectively bridging the gap between numerical models and station-based predictions. This not only improves the accuracy of numerical forecasts but also plays a vital role in verifying and evaluating the predictive performance of NWP models[11].

[Figure 1: Geographical distribution of the 5,672 weather stations in the WEATHER-5K dataset.]

2 Related Work

2.1 Time-series Forecasting

Time-series forecasting spans many domains[14, 17, 18, 19], and its methods have evolved through three stages: statistical learning, machine learning, and deep learning.

Statistical learning methods, including ARIMA[20], ETS[21], StatsForecast[22], VAR[14], and the Kalman Filter (KF)[23], were among the earliest proposals and remain widely used. These methods rely on historical data to predict future values and are based on the assumption that past observations hold predictive power. While they provide a solid foundation, their performance may be limited when faced with complex patterns or nonlinear relationships.
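To make the autoregressive idea behind these statistical baselines concrete, here is a minimal sketch of an AR(1) model fit by least squares and rolled forward to forecast. The function names are illustrative; real ARIMA implementations also handle differencing and moving-average terms.

```python
# Minimal AR(1) sketch: fit phi in x_t = phi * x_{t-1} + noise by least
# squares, then iterate the fitted relation forward to forecast.
# Illustrative only -- not the interface of any specific library.

def fit_ar1(series):
    """Least-squares estimate of the AR(1) coefficient phi."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(x * x for x in series[:-1])
    return num / den

def forecast_ar1(series, phi, horizon):
    """Iteratively apply x_{t+1} = phi * x_t for `horizon` steps."""
    preds, last = [], series[-1]
    for _ in range(horizon):
        last = phi * last
        preds.append(last)
    return preds

history = [0.9 ** t for t in range(50)]  # synthetic decaying series
phi = fit_ar1(history)                   # recovers phi = 0.9 here
print(forecast_ar1(history, phi, 3))
```

The same fit-then-roll-forward pattern underlies the more elaborate statistical models, which add seasonal terms, exogenous variables, or state-space smoothing on top of it.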

Machine learning methods have gained prominence in time-series forecasting due to rapid advancements in the field. Algorithms such as XGBoost[24], GBRT[25], Random Forests[26], and LightGBM[27] offer enhanced capabilities to handle nonlinear relationships and complex patterns. These methods demonstrate flexibility in handling different types and lengths of time-series data and generally provide superior forecasting accuracy compared to traditional statistical methods.

Leveraging the representation learning capabilities of deep neural networks, Deep-Learning (DL) methods have shown promising results in time-series forecasting. DL treats time-series as sequences of vectors and utilizes architectures such as convolutional neural networks (CNNs)[28], recurrent neural networks (RNNs)[29], or Transformers[30] to capture temporal dependencies. For example, TCN[31] and DeepAR[32] implement CNNs or RNNs to model the temporal structure of the data. Transformer architectures, like Reformer[33], Informer[34], Pyraformer[35], FEDformer[15], Autoformer[13], Triformer[36], and PatchTST[37], have also been applied to time-series forecasting, allowing more complex temporal dynamics to be captured and significantly improving forecasting performance. In addition, while complex models pursue forecasting accuracy, MLP-based models such as N-HiTS[38], N-BEATS[39], and DLinear[40] employ a straightforward architecture with relatively few parameters while achieving competitive performance. Recently, Mamba[41], a selective state space model, has also gained traction due to its ability to model dependencies in sequences while maintaining near-linear complexity. Several Mamba variants[42, 43, 44] have been successfully applied to time-series forecasting.

2.2 Data-driven Numerical Weather Prediction

Since 2022, there has been growing interest in data-driven Numerical Weather Prediction (NWP) models within the AI and atmospheric science communities. NWP and GSWF are related in two ways: 1) GSWF results can be obtained by interpolating the forecasts of NWP models to specific latitudes and longitudes; 2) conversely, GSWF can be used to bias-correct the forecasts of NWP models. Data-driven models such as Pangu-Weather[9], GraphCast[8], FengWu[10], and FengWu-GHR[11] have shown the potential to outperform traditional physics-based NWP models in forecast skill and operational efficiency. However, these models operate in mesh space (e.g., at grid resolutions of 0.25° and 0.09°) and may not be the optimal solution for GSWF, as discussed in Section 1. Prior to our work, initial attempts such as Corrformer[12] treated GSWF as an independent forecasting task, demonstrating promising results but leaving a large gap compared with the data-driven NWP models.

3 WEATHER-5K: Global Station Weather Dataset

3.1 Collection and Processing

Our dataset is collected from the National Centers for Environmental Information (NCEI), specifically the Integrated Surface Database (ISD) (https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database), a global repository of hourly and synoptic surface observations gathered from numerous original data sources. The ISD encompasses a wide range of meteorological parameters recorded by each weather station, including wind speed and direction, temperature, dew point, cloud data, sea level pressure, altimeter setting, station pressure, present weather, visibility, precipitation, snow depth, and various other elements. Although the ISD contains records from over 20,000 stations spanning several decades, not all stations are suitable for machine learning applications. For instance, certain stations are no longer operational, many do not report data on an hourly basis, and numerous stations have missing values for critical weather elements. To create the WEATHER-5K dataset, a meticulous selection process was conducted to include only long-term, hourly reporting stations that are currently operational and provide essential observations such as temperature, dew point temperature, wind, and sea level pressure.

To obtain a high-quality dataset of weather stations, a series of post-processing steps was performed on the raw weather station data collected from 2014 to 2024. Initially, 10,701 commonly operating stations were identified. The first step involved selecting stations that reported data every hour on the hour. However, many stations did not meet this criterion. To address this, a replacement method estimated missing hourly data points using the nearest available observations within a 30-minute window, significantly improving the coverage of valid hourly data.

Despite this improvement, a small portion of hourly data points remained missing due to the lack of observations within the 30-minute window. To fill these gaps, linear interpolation was employed using data from the 12 consecutive hours surrounding each missing point, ensuring the reliability of the interpolated data. Moreover, only stations with more than 90% valid hourly data were retained as final candidates. To ensure the dataset's completeness, ERA5 reanalysis data[45] was used to interpolate and fill any remaining gaps in the weather station data.
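The two gap-filling steps above can be sketched as follows. This is an illustration of the described logic, not the actual WEATHER-5K processing code: step 1 snaps each target hour to the nearest raw observation within a 30-minute window, and step 2 linearly interpolates hours that are still missing from the surrounding valid hours.

```python
# Illustrative two-step gap filling: nearest observation within +/-30 min,
# then linear interpolation for the remaining gaps.

def fill_hourly(raw_obs, n_hours, window_min=30):
    """raw_obs: {minute_offset: value}; returns one value per hour."""
    hourly = []
    for h in range(n_hours):
        target = h * 60
        # step 1: nearest observation within the 30-minute window
        near = [(abs(m - target), v) for m, v in raw_obs.items()
                if abs(m - target) <= window_min]
        hourly.append(min(near)[1] if near else None)
    # step 2: linear interpolation between the nearest valid neighbours
    for i, v in enumerate(hourly):
        if v is None:
            lo = max(j for j in range(i) if hourly[j] is not None)
            hi = min(j for j in range(i + 1, n_hours) if hourly[j] is not None)
            w = (i - lo) / (hi - lo)
            hourly[i] = hourly[lo] * (1 - w) + hourly[hi] * w
    return hourly

obs = {0: 10.0, 70: 12.0, 240: 16.0}  # raw readings, minutes after midnight
print(fill_hourly(obs, 5))
```

In this toy example the 01:10 reading is snapped to the 01:00 slot, while the 02:00 and 03:00 slots have no nearby reading and are interpolated between the valid neighbours.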

As a result of these post-processing steps, a subset of 5,672 weather stations worldwide was selected, spanning the period from 2014 to 2023, ensuring a recent and relevant time frame. This selection process focused on balancing the longevity of station operation, hourly data availability, and the inclusion of diverse weather variables. Consequently, our dataset provides a robust foundation for evaluating forecasting models in the context of GSWF.

3.2 Qualitative Analysis

[Figure 2: Statistical characteristics of temperature, wind speed, wind direction, and sea level pressure across different latitudes.]

Regional disparities of stations. The WEATHER-5K dataset reveals regional disparities in the distribution of weather stations, which can significantly impact the learning and understanding of atmospheric dynamics in certain areas. As illustrated in Figure 1, some land regions have sparse data coverage compared to others. These disparities can be attributed to factors such as geographical characteristics, levels of economic development, and the strategic placement of weather stations. Note that the number of oceanic stations is also very limited due to the expensive cost of establishing stations at sea. Regions with limited station coverage may exhibit unique weather patterns and phenomena that are not adequately captured due to insufficient data. Addressing these disparities and expanding coverage in underrepresented areas is crucial for improving the accuracy and reliability of weather forecasting and analysis in those regions.

Data characteristics. We present a statistical analysis of the characteristics of different variables, and of the discrepancy of the same variable across latitudes, based on one year of data, by visualizing four noteworthy observations: temperature, wind speed, wind direction, and sea level pressure. There are three primary findings. First, as shown in Figure 2 a), b), and c), temperature characteristics in the Northern and Southern Hemispheres are opposite during the same seasons; in the equatorial region, the variance of the seasonal pattern is generally lower than at higher latitudes. Second, Figure 2 d) and e) depict wind speed and direction, showing that both are non-stationary, characterized by intense fluctuations and a lack of clear patterns, making them challenging to predict. Last, as depicted in Figure 2 f), sea level pressure shows a stable distribution with fewer fluctuations, making it relatively easier to predict.

Table 1: Comparison between WEATHER-5K and other popular time-series datasets.

| Dataset | Domain | Frequency | Lengths | Stations | Variables | Year | Volume |
|---|---|---|---|---|---|---|---|
| Exchange[46] | Exchange | 1 day | 7,588 | 1 | 8 | 1990-2010 | 623 KB |
| Electricity[47] | Electricity | 1 hour | 26,304 | 321 | 1 | 2016-2019 | 92 MB |
| ETTm2[34] | Electricity | 15 mins | 57,600 | 1 | 7 | 2016-2018 | 9.3 MB |
| Traffic[13] | Traffic | 1 hour | 17,544 | 862 | 1 | 2016-2018 | 131 MB |
| LargeST-CA[48] | Traffic | 5 mins | 525,888 | 8,600 | 1 | 2017-2021 | 36.8 GB |
| Solar[46] | Weather | 10 mins | 52,560 | 137 | 1 | 2006 | 8.3 MB |
| Wind[49] | Weather | 15 mins | 48,673 | 1 | 7 | 2020-2021 | 2.7 MB |
| Weather[13] | Weather | 10 mins | 52,696 | 1 | 21 | 2020 | 7.0 MB |
| Weather-Australia[14] | Weather | 1 day | 1,332~65,981 | 3,010 | 4 | unknown | 202 MB |
| GlobalTempWind[12] | Weather | 1 hour | 17,544 | 3,850 | 2 | 2019-2020 | 1,034 MB |
| CMA_Wind[12] | Weather | 1 hour | 17,520 | 34,040 | 1 | 2018-2019 | N/A |
| WEATHER-5K (Ours) | Weather | 1 hour | 87,648 | 5,672 | 5 | 2014-2023 | 40.0 GB |

Comparison with existing datasets. Table 1 presents a comparison between WEATHER-5K and other popular time-series datasets. The development of time-series datasets suffers from several limitations. 1) Small scale and out of date: the mainstream time-series datasets[46, 47, 13] used for research remain relatively small in scale. For instance, datasets on electricity consumption or exchange rates are sparse or outdated, which limits the practical application of forecasting models. These two shortcomings also hinder the exploration of deeper scientific challenges in time-series forecasting. 2) Lagging behind other fields: compared to natural language processing and computer vision, the time-series forecasting domain has been slower to incorporate large-scale datasets. The use of extensive datasets (e.g., CommonCrawl and LAION-5B[50]) in other fields has demonstrated unprecedented economic value and significantly advanced scientific discovery; by contrast, the first large-scale time-series dataset, LargeST[48], was introduced only recently. WEATHER-5K addresses the limitation of small-scale datasets by providing a comprehensive and large-scale collection of weather station data. This abundance of data enables researchers to tackle more complex forecasting challenges and explore the dynamics of weather patterns on a global scale.

4 Time-Series Forecasting Benchmarks on WEATHER-5K

4.1 Problem Definition

Consider $N$ weather stations around the world, each collecting $V$ meteorological variables; the data of all weather stations can be represented by a spatial-temporal time series $X \in \mathbb{R}^{N \times T \times V}$ for a given look-back window of fixed length $T$. At timestamp $t$, time-series forecasting predicts $\hat{X}_{t+1:t+\tau} = \{X_{t+1}, \dots, X_{t+\tau}\}$ based on the past $T$ frames $X_{t-T+1:t} = \{X_{t-T+1}, \dots, X_{t}\}$, where $\tau$ is the length of the forecast horizon. Using $X$ and $\hat{X}$ to denote the observed and forecasted data, respectively, GSWF can be simplified as a mapping $\hat{X} = \mathcal{M}(X)$, where $\mathcal{M}$ can be any of various time-series forecasting methods. For example, by setting $N = 1$ and ignoring spatial information, many state-of-the-art time-series forecasting methods[49, 34, 35, 13, 15, 41, 51, 37, 16] can be explored on this task. When $N$ covers multiple scattered stations, methods based on spatial-temporal modeling[12] can also be applied.
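The sliding-window formulation above can be sketched as a function that turns the full series of shape (N stations, total steps, V variables) into (look-back, horizon) training pairs. The name `make_windows` and the nested-list representation are illustrative, not from the benchmark's code.

```python
# Sketch of the look-back / horizon windowing: each training pair couples
# a `lookback`-step history with the following `horizon`-step target.

def make_windows(series, lookback, horizon):
    """series: nested list [station][time][variable] -> list of (past, future)."""
    pairs = []
    t_total = len(series[0])
    for t in range(lookback, t_total - horizon + 1):
        past = [s[t - lookback:t] for s in series]   # look-back window
        future = [s[t:t + horizon] for s in series]  # forecast target
        pairs.append((past, future))
    return pairs

# toy example: N=2 stations, 10 time steps, V=1 variable
series = [[[float(t)] for t in range(10)] for _ in range(2)]
pairs = make_windows(series, lookback=4, horizon=2)
print(len(pairs))
```

In the benchmark setting, `lookback` would be 48 and `horizon` one of 24, 72, 120, or 168.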

4.2 Evaluation Metrics

Overall performance. Mean Absolute Error (MAE) and Mean Square Error (MSE) are used to evaluate the overall performance of the GSWF. MAE measures the predictive robustness of an algorithm but is insensitive to outliers, whereas MSE is sensitive to outliers and can amplify errors.
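Written out explicitly, the two overall metrics average the absolute and squared errors element-wise over all stations, lead times, and variables:

```python
# MAE and MSE over flattened predictions and ground truth.

def mae(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)

pred, truth = [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
print(mae(pred, truth))  # (0 + 1 + 2) / 3 = 1.0
print(mse(pred, truth))  # (0 + 1 + 4) / 3 ~= 1.667
```

The worked example shows why MSE amplifies the single 2-unit error more than MAE does.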

Extreme performance. In weather forecasting, an important evaluation is the ability to predict extreme weather events, such as extremely high or low temperatures. However, MAE and MSE alone do not adequately capture this ability. Therefore, we introduce a specialized metric, the Symmetric Extremal Dependency Index (SEDI)[52, 53], to evaluate the extreme weather forecasting performance of time-series models. SEDI assesses a model's capability to predict extreme weather conditions: it classifies each prediction at its station location as either extreme or normal weather based on upper quantile thresholds (90%, 95%, 98%, and 99.5%) or lower quantile thresholds (10%, 5%, 2%, and 0.5%). The SEDI value is computed as:

$$\mathrm{SEDI}(p) = \frac{\mathrm{sum}(\hat{X} < Q^{p}_{lower} \,\&\, X < Q^{p}_{lower}) + \mathrm{sum}(\hat{X} > Q^{p}_{upper} \,\&\, X > Q^{p}_{upper})}{\mathrm{sum}(X < Q^{p}_{lower}) + \mathrm{sum}(X > Q^{p}_{upper})} \qquad (1)$$

where $\hat{X} < Q^{p}_{lower}$ and $X < Q^{p}_{lower}$ judge whether the predicted or observed data point constitutes an extreme event with respect to the threshold $Q^{p}_{lower}$, and analogously for the upper percentiles. Two opposite percentiles are used because both extremely small and extremely large values are crucial for weather forecasting: for temperature, a very small value may indicate a winter storm while a very large value may indicate a heatwave. $\mathrm{SEDI} \in [0, 1]$ quantifies the model's ability to correctly identify extreme weather events, with a higher value indicating better performance in extreme weather prediction.
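Eq. (1) can be implemented directly: a prediction/observation pair counts as a hit when both fall below the lower threshold or both exceed the upper one, and SEDI divides the hits by the number of observed extremes. The fixed thresholds below are purely for illustration; in the benchmark they are the per-station quantile values.

```python
# Direct implementation of Eq. (1) for one variable at one station.

def sedi(preds, obs, q_lower, q_upper):
    hits = sum(1 for p, o in zip(preds, obs)
               if (p < q_lower and o < q_lower) or (p > q_upper and o > q_upper))
    extremes = sum(1 for o in obs if o < q_lower or o > q_upper)
    return hits / extremes if extremes else 0.0

obs   = [-3.0, 0.0, 1.0, 5.0, 0.5]   # two observed extremes: -3.0 and 5.0
preds = [-2.5, 0.2, 0.9, 2.0, 0.4]   # only the cold extreme is caught
print(sedi(preds, obs, q_lower=-2.0, q_upper=4.0))  # 1 hit / 2 extremes = 0.5
```

Note that a forecast that never leaves the normal range scores 0, which is exactly the failure mode SEDI is designed to expose.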

4.3 Experimental Protocols

Dataset splitting. The WEATHER-5K dataset is divided into three subsets: training, validation, and testing datasets. The training set consists of weather data from 2014 to 2021, the validation set includes data from the year 2022, and the testing set comprises data from the year 2023. The division of the dataset follows an 8:1:1 ratio, allowing the model to be trained on sufficient historical data, validated on a separate year, and tested on the most recent data for accurate evaluation.
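The year-based split described above can be sketched as a simple filter over timestamped records (timestamps are assumed to be `datetime` objects; the function name is illustrative):

```python
# Year-based train/val/test split: 2014-2021 / 2022 / 2023.
from datetime import datetime

def split_by_year(timestamps):
    train = [t for t in timestamps if 2014 <= t.year <= 2021]
    val   = [t for t in timestamps if t.year == 2022]
    test  = [t for t in timestamps if t.year == 2023]
    return train, val, test

ts = [datetime(y, 6, 1) for y in range(2014, 2024)]  # one stamp per year
train, val, test = split_by_year(ts)
print(len(train), len(val), len(test))  # 8 1 1
```

Splitting by whole calendar years (rather than randomly) keeps the test year strictly after the training years, avoiding temporal leakage.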

Table 2: MAE / MSE of baseline models on WEATHER-5K at lead times of 24, 72, 120, and 168 hours (lower is better).

| Baseline | Lead time (h) | Temperature | Dew point | Wind speed | Wind direction | Sea level pressure | Overall |
|---|---|---|---|---|---|---|---|
| Informer (2021)[34] | 24 | 1.88 / 7.51 | 1.94 / 8.30 | 1.30 / 3.62 | 60.7 / 6906.9 | 2.01 / 10.56 | 13.6 / 1387.4 |
| | 72 | 2.75 / 14.84 | 2.86 / 17.24 | 1.53 / 4.86 | 71.5 / 8251.4 | 4.24 / 39.24 | 16.4 / 1631.4 |
| | 120 | 3.11 / 18.21 | 3.25 / 21.50 | 1.60 / 5.38 | 75.7 / 8504.5 | 5.15 / 54.31 | 18.3 / 1720.4 |
| | 168 | 3.24 / 20.24 | 3.43 / 24.89 | 1.63 / 5.65 | 76.2 / 8718.4 | 5.26 / 58.42 | 18.1 / 1764.4 |
| Autoformer (2021)[13] | 24 | 1.93 / 8.64 | 2.06 / 9.57 | 1.42 / 3.97 | 66.5 / 7710.0 | 2.26 / 12.78 | 15.2 / 1553.4 |
| | 72 | 2.72 / 15.14 | 2.97 / 18.38 | 1.54 / 5.14 | 75.4 / 9111.5 | 4.25 / 42.34 | 17.8 / 1846.7 |
| | 120 | 3.21 / 20.27 | 3.34 / 23.12 | 1.58 / 5.73 | 79.2 / 9143.5 | 4.83 / 48.88 | 18.1 / 1868.3 |
| | 168 | 3.43 / 21.71 | 3.56 / 22.55 | 1.64 / 5.95 | 79.8 / 9435.8 | 5.32 / 61.85 | 18.5 / 1885.7 |
| Pyraformer (2021)[35] | 24 | 1.75 / 6.92 | 1.83 / 7.88 | 1.30 / 3.58 | 61.8 / 6930.2 | 1.90 / 9.72 | 13.7 / 1391.7 |
| | 72 | 2.47 / 13.03 | 2.67 / 15.39 | 1.52 / 4.97 | 72.0 / 8222.4 | 3.76 / 33.67 | 16.5 / 1657.9 |
| | 120 | 2.77 / 16.04 | 3.00 / 18.95 | 1.59 / 5.37 | 75.1 / 8610.7 | 4.43 / 43.91 | 17.4 / 1739.0 |
| | 168 | 2.95 / 17.95 | 3.20 / 21.06 | 1.61 / 5.56 | 76.4 / 8773.5 | 4.77 / 49.97 | 17.8 / 1773.6 |
| FEDformer (2022)[15] | 24 | 1.98 / 8.45 | 2.02 / 9.25 | 1.36 / 3.91 | 66.0 / 7384.1 | 2.13 / 11.43 | 14.7 / 1483.4 |
| | 72 | 2.87 / 16.50 | 3.01 / 18.70 | 1.59 / 5.31 | 76.2 / 8824.8 | 4.15 / 37.60 | 17.6 / 1780.6 |
| | 120 | 3.19 / 20.29 | 3.36 / 23.10 | 1.66 / 5.71 | 79.0 / 9143.3 | 4.81 / 48.86 | 18.4 / 1848.3 |
| | 168 | 3.35 / 22.12 | 3.54 / 25.21 | 1.68 / 5.88 | 79.7 / 9189.2 | 5.01 / 53.39 | 18.7 / 1859.2 |
| DLinear (2023)[40] | 24 | 2.71 / 13.82 | 2.47 / 12.36 | 1.44 / 4.34 | 66.6 / 8234.5 | 3.09 / 21.34 | 15.3 / 1657.3 |
| | 72 | 3.55 / 23.05 | 3.48 / 22.85 | 1.62 / 5.37 | 75.0 / 9250.8 | 4.64 / 45.83 | 17.7 / 1869.6 |
| | 120 | 3.90 / 27.60 | 3.89 / 27.72 | 1.67 / 5.70 | 77.3 / 9510.6 | 5.19 / 56.22 | 18.4 / 1925.6 |
| | 168 | 4.11 / 30.38 | 4.11 / 30.58 | 1.69 / 5.88 | 78.4 / 9630.0 | 5.48 / 61.73 | 18.8 / 1951.7 |
| PatchTST (2023)[37] | 24 | 2.05 / 9.26 | 2.16 / 10.58 | 1.40 / 4.20 | 66.2 / 7765.8 | 2.19 / 12.54 | 14.8 / 1560.5 |
| | 72 | 2.82 / 16.60 | 3.06 / 19.96 | 1.60 / 5.39 | 75.2 / 9067.8 | 4.28 / 42.46 | 17.4 / 1830.5 |
| | 120 | 3.15 / 20.32 | 3.43 / 24.39 | 1.66 / 5.79 | 77.8 / 9452.6 | 5.09 / 57.29 | 18.2 / 1912.1 |
| | 168 | 3.33 / 22.54 | 3.63 / 26.94 | 1.69 / 6.00 | 79.0 / 9638.1 | 5.51 / 65.30 | 18.6 / 1951.7 |
| Corrformer (2023)[12] | 24 | 1.99 / 8.21 | 2.09 / 9.47 | 1.38 / 3.83 | 66.7 / 7832.3 | 2.19 / 12.39 | 14.9 / 1584.4 |
| | 72 | 2.74 / 15.16 | 2.99 / 18.40 | 1.56 / 4.91 | 75.6 / 9111.7 | 4.27 / 42.36 | 17.8 / 1846.7 |
| | 120 | 3.06 / 18.63 | 3.34 / 22.48 | 1.61 / 5.56 | 78.0 / 9477.4 | 5.08 / 57.13 | 18.1 / 1915.8 |
| | 168 | 3.09 / 18.69 | 3.36 / 22.53 | 1.63 / 5.69 | 78.9 / 9636.0 | 5.34 / 61.83 | 18.4 / 1938.8 |
| Mamba (2023)[41] | 24 | 1.98 / 8.59 | 2.01 / 9.52 | 1.37 / 4.02 | 66.0 / 7709.5 | 2.21 / 12.73 | 14.7 / 1548.9 |
| | 72 | 2.79 / 16.00 | 2.90 / 18.11 | 1.55 / 5.11 | 75.1 / 8863.9 | 4.29 / 41.88 | 17.3 / 1789.0 |
| | 120 | 3.03 / 18.47 | 3.18 / 21.02 | 1.58 / 5.28 | 76.7 / 8931.2 | 4.93 / 52.56 | 17.9 / 1805.7 |
| | 168 | 3.16 / 19.88 | 3.32 / 22.53 | 1.59 / 5.35 | 77.4 / 8958.8 | 5.21 / 57.37 | 18.1 / 1812.8 |
| iTransformer (2024)[16] | 24 | 1.82 / 7.49 | 1.93 / 8.80 | 1.32 / 3.77 | 63.2 / 7358.8 | 1.99 / 10.84 | 14.1 / 1478.0 |
| | 72 | 2.60 / 14.46 | 2.84 / 17.50 | 1.52 / 4.96 | 73.2 / 8713.3 | 4.14 / 40.65 | 16.9 / 1758.2 |
| | 120 | 2.97 / 18.36 | 3.24 / 22.16 | 1.59 / 5.42 | 76.4 / 9192.2 | 4.95 / 54.67 | 17.8 / 1858.6 |
| | 168 | 3.18 / 20.64 | 3.48 / 24.89 | 1.64 / 5.67 | 78.0 / 9441.1 | 5.36 / 62.31 | 18.3 / 1910.9 |
Table 3: SEDI (↑, higher is better) for extreme-value prediction at the 99.5th and 90th percentile thresholds.

| Baseline | Temp. 99.5th | Temp. 90th | Dew pt. 99.5th | Dew pt. 90th | Wind spd. 99.5th | Wind spd. 90th | Wind dir. 99.5th | Wind dir. 90th | Sea lvl. 99.5th | Sea lvl. 90th |
|---|---|---|---|---|---|---|---|---|---|---|
| Informer[34] | 11.8 | 49.5 | 9.2 | 39.2 | 2.1 | 6.7 | 0.12 | 2.9 | 9.8 | 35.7 |
| Autoformer[13] | 12.4 | 52.1 | 8.3 | 38.9 | 0.3 | 7.8 | 0.13 | 1.6 | 10.4 | 32.1 |
| Pyraformer[35] | 10.7 | 54.8 | 7.2 | 40.1 | 0.6 | 7.2 | 0.06 | 1.1 | 10.5 | 26.2 |
| FEDformer[15] | 11.9 | 50.9 | 9.9 | 40.7 | 2.9 | 9.5 | 0.08 | 0.7 | 7.5 | 21.4 |
| DLinear[40] | 5.8 | 18.8 | 3.2 | 19.9 | 0.3 | 5.1 | 0.13 | 1.7 | 2.8 | 17.5 |
| PatchTST[37] | 10.9 | 50.8 | 8.9 | 42.4 | 0.5 | 8.9 | 0.10 | 2.2 | 13.5 | 36.7 |
| Corrformer[12] | 10.9 | 48.9 | 8.4 | 39.9 | 1.7 | 8.4 | 0.12 | 0.9 | 8.9 | 30.9 |
| Mamba[41] | 10.0 | 51.3 | 7.5 | 40.6 | 0.9 | 8.1 | 0.05 | 1.0 | 10.1 | 31.3 |
| iTransformer[16] | 14.1 | 55.0 | 10.4 | 44.8 | 1.3 | 10.3 | 0.14 | 2.3 | 15.9 | 37.5 |

Baselines. As discussed in Section 2.1, many time-series forecasting methods can be adapted to this task. We compare nine baselines. 1) Temporal-only methods: transformer-based models have been the most popular over the past three years, so we select methods from 2021 to 2024, namely Informer (2021)[34], Autoformer (2021)[13], Pyraformer (2021)[35], FEDformer (2022)[15], PatchTST (2023)[37], and iTransformer (2024)[16], to represent the state of the art in long-term forecasting. In addition, we adopt the efficient MLP-based DLinear (2023)[40] and the trending Mamba (2023)[41] architecture as baselines. 2) Spatial-temporal method: since temporal-only deep models do not consider spatial correlations, we also include Corrformer (2023)[12] as a baseline that models the dynamic correlations among weather stations at different locations.

Task settings. To facilitate fair comparison between different baselines, we align the input length of all baselines: each model predicts the $\tau$-step future based on 48 historical steps. Specifically, we predict future weather conditions at lead times of 1, 3, 5, and 7 days, corresponding to 24, 72, 120, and 168 future steps, respectively. Note that the reported results come from a single run rather than the multiple runs used in the original implementations. This does not affect the comparison, as we observe that results are stable across different seeds due to the large dataset volume.

Implementation details. We develop and implement the baselines based on the Time-Series-Library (https://github.com/thuml/Time-Series-Library). To ensure a fair comparison, we made no changes to the intermediate network structures of the baseline models; instead, we adjusted their inputs and outputs to fit the WEATHER-5K dataset. All optimization parameters are kept the same: training runs for a total of 300,000 iterations with an initial learning rate of 1e-4, which follows a cosine decay schedule down to 0 by the end of training. The batch size is 1,024 for all models except Corrformer. During the validation phase, early stopping is triggered if the training loss does not decrease for three consecutive evaluations. The checkpoint with the lowest validation loss before the early stop is saved and used for testing. Experiments are conducted on a computing server with 224 Intel(R) Xeon(R) Platinum 8480CL CPU cores @ 3.80 GHz and 2.0 TB RAM, equipped with 8 NVIDIA H800 GPUs.
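The stated schedule (initial learning rate 1e-4, cosine decay to 0 over 300,000 iterations) can be sketched as follows; this mirrors the described hyperparameters but is not the benchmark's exact training code.

```python
# Cosine learning-rate decay from base_lr down to 0 over total_steps.
import math

def cosine_lr(step, total_steps=300_000, base_lr=1e-4):
    """Learning rate at `step`, following 0.5 * base_lr * (1 + cos(pi * s/S))."""
    step = min(step, total_steps)  # clamp past the end of training
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0))        # base_lr at the start
print(cosine_lr(150_000))  # half of base_lr at the midpoint
print(cosine_lr(300_000))  # decayed to 0 at the end
```

The half-cosine shape keeps the rate near base_lr early in training and flattens out near zero at the end, which is why it pairs naturally with a fixed iteration budget.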

4.4 Quantitative Comparison

Table 2 reports the MAE and MSE for five variables under different prediction lengths. Several findings emerge. The simple linear model DLinear[40] performs relatively poorly, while the earlier Pyraformer[35] shows a significant advantage among all baselines; the remaining baselines perform similarly overall. There could be several reasons for this. First, existing time-series forecasting models have relatively small parameter budgets and computational capacity, which achieve good fits on small-scale datasets but may not suit large-scale datasets. Additionally, these models, except Corrformer[12], only consider temporal dependencies and overlook the spatial distribution differences and correlations in meteorological data. Furthermore, Corrformer, despite considering spatial relationships, does not perform as well as some simpler models such as Informer[34]; one possible reason is that its spatial modeling relies on pre-defined local regions. We also observe that different models favor different variables: for example, Informer[34] predicts wind speed and direction better than other methods. Some models favor long-term forecasting, such as Mamba[41], which achieves the best long-term performance despite unremarkable short-term results, suggesting that the Mamba structure may be better suited to long-term forecasting. Finally, the cumulative error grows rapidly and only begins to stabilize after the third day, indicating that predictive errors are large beyond three days and that there is still ample room for improvement in long-term prediction performance.

Table 3 reports the predictive performance of the models on extreme values. Our finding is that the time-series forecasting models studied in this paper cannot effectively capture extreme values, especially at the lower and upper quantiles of 0.5% and 99.5%. Additionally, wind prediction is the worst among all variables, which can be attributed to the non-stationary nature of the wind distribution, making it extremely challenging to predict accurately. These results also point to a research direction for future time-series forecasting: paying more attention to extreme values.

4.5 Ablation Study

[Figure 3]

The ablation experiments explore the impact of different input lengths on prediction. We use the iTransformer [16] model for verification, setting the input length to 24, 48, 72, 96, and 120 while keeping the prediction length fixed at 72. The experimental results in Figure 3 illustrate how the MAE of four weather variables varies as the input sequence length increases. Performance on some variables (temperature and wind) improves slightly with longer inputs, whereas the MAE for sea level pressure rises slightly. We ultimately set the input length to 48 to balance computation and performance.
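The ablation only changes how many past hours feed the model; a minimal windowing helper (our own sketch, not the released pipeline) makes the input-length/prediction-length slicing explicit:

```python
def make_windows(series, input_len, pred_len=72):
    """Slice a 1-D hourly series into (input, target) pairs."""
    pairs = []
    for i in range(len(series) - input_len - pred_len + 1):
        x = series[i : i + input_len]                       # past observations
        y = series[i + input_len : i + input_len + pred_len]  # future targets
        pairs.append((x, y))
    return pairs

hourly = list(range(200))          # toy hourly series
pairs_48 = make_windows(hourly, input_len=48)
```

With hourly data, `input_len=48` and `pred_len=72` reproduce the setting finally adopted here.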

5 Conclusion, Discussion, and Limitation

To facilitate accurate, efficient, and scalable weather forecasting for global weather stations, we introduce WEATHER-5K as a new benchmark dataset. WEATHER-5K encompasses numerous global stations, providing comprehensive, long-term meteorological data. This dataset enables state-of-the-art time-series forecasting methods to be easily adopted and yield promising results. However, we also noticed that current methods might still lag behind numerical weather prediction models, particularly for longer lead times. This means WEATHER-5K would present new challenges and opportunities, fostering advanced techniques and innovative research. We highlight key future research opportunities inspired by this dataset.

  • Developing spatial-aware time-series forecasting methods. Our observations show that different regions exhibit distinct weather patterns, such as variations between hemispheres or between coastal and inland climates, yet current forecasting methods do not fully capture global spatial relationships. By incorporating spatial information, such as the geographical coordinates of weather stations, researchers can leverage relationships between nearby stations to enhance prediction accuracy. Spatially aware models can capture local weather patterns and dependencies more effectively, resulting in more precise forecasts.

  • Bridging GSWF with numerical prediction models. Most numerical weather prediction (NWP) models [9, 8, 11] can provide robust global atmospheric forecasts. By utilizing outputs from these models, we can develop bias-correction models tailored to individual meteorological stations; leveraging this diverse information can significantly enhance weather prediction accuracy at these stations.
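As a starting point for such spatially aware methods, station relationships can be derived directly from the latitude/longitude fields in the dataset. A sketch of nearest-station lookup via great-circle distance (the station coordinates below are illustrative):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two stations in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_stations(query, stations, k=3):
    """Return the ids of the k stations closest to query = (lat, lon)."""
    dists = sorted(
        (haversine_km(query[0], query[1], lat, lon), sid)
        for sid, (lat, lon) in stations.items()
    )
    return [sid for _, sid in dists[:k]]

# Illustrative coordinates, not entries from the dataset.
stations = {"HK": (22.3, 114.2), "SH": (31.2, 121.5), "SYD": (-33.9, 151.2)}
neighbours = nearest_stations((23.0, 115.0), stations, k=2)
```

A k-nearest-station graph built this way is the usual input to graph-based spatio-temporal models.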

Currently, our work has some limitations. First, the large size of the dataset increases development costs and computational demands, posing resource challenges for researchers. Second, existing research on spatial modeling in time-series forecasting is scarce, so the baselines we implemented do not fully leverage the potential of the WEATHER-5K dataset. We welcome and look forward to future researchers fully exploiting WEATHER-5K to develop more advanced time-series forecasting methods.

References

  • [1] Gultepe, I., R. Sharman, P. D. Williams, et al. A review of high impact weather for aviation meteorology. Pure and Applied Geophysics, 176:1869–1921, 2019.
  • [2] Ukhurebor, K. E., C. O. Adetunji, O. T. Olugbemi, et al. Precision agriculture: Weather forecasting for future farming. In AI, Edge and IoT-based Smart Agriculture, pages 101–121. Elsevier, 2022.
  • [3] Dehalwar, V., A. Kalam, M. L. Kolhe, et al. Electricity load forecasting for urban area using weather forecast information. In 2016 IEEE International Conference on Power and Renewable Energy (ICPRE), pages 355–359. IEEE, 2016.
  • [4] Wang, X., K. Chen, L. Liu, et al. Global tropical cyclone intensity forecasting with multi-modal multi-scale causal autoregressive model. arXiv preprint arXiv:2402.13270, 2024.
  • [5] Sillmann, J., T. Thorarinsdottir, N. Keenlyside, et al. Understanding, modeling and predicting weather and climate extremes: Challenges and opportunities. Weather and Climate Extremes, 18:65–74, 2017.
  • [6] Phillips, N. A. The general circulation of the atmosphere: A numerical experiment. Quarterly Journal of the Royal Meteorological Society, 82(352):123–164, 1956.
  • [7] Lynch, P. The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444, 2008.
  • [8] Lam, R., A. Sanchez-Gonzalez, M. Willson, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
  • [9] Bi, K., L. Xie, H. Zhang, et al. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.
  • [10] Chen, K., T. Han, J. Gong, et al. FengWu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. arXiv preprint arXiv:2304.02948, 2023.
  • [11] Han, T., S. Guo, F. Ling, et al. FengWu-GHR: Learning the kilometer-scale medium-range global weather forecasting. arXiv preprint arXiv:2402.00059, 2024.
  • [12] Wu, H., H. Zhou, M. Long, et al. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, 5(6):602–611, 2023.
  • [13] Wu, H., J. Xu, J. Wang, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
  • [14] Godahewa, R., C. Bergmeir, G. I. Webb, et al. Monash time series forecasting archive. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
  • [15] Zhou, T., Z. Ma, Q. Wen, et al. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
  • [16] Liu, Y., T. Hu, H. Zhang, et al. iTransformer: Inverted transformers are effective for time series forecasting. 2024.
  • [17] Chen, Z., J. Zhou, X. Wang, et al. Neural net-based and safety-oriented visual analytics for time-spatial data. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 1133–1140. IEEE, 2017.
  • [18] Bai, L., L. Yao, C. Li, et al. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in Neural Information Processing Systems, 33:17804–17815, 2020.
  • [19] Qiu, X., J. Hu, L. Zhou, et al. TFB: Towards comprehensive and fair benchmarking of time series forecasting methods. arXiv preprint arXiv:2403.20150, 2024.
  • [20] Box, G. E., D. A. Pierce. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332):1509–1526, 1970.
  • [21] Hyndman, R., A. B. Koehler, J. K. Ord, et al. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.
  • [22] Federico Garza, C. C. K. G. O., Max Mergenthaler Canseco. StatsForecast: Lightning fast forecasting with statistical and econometric models. PyCon Salt Lake City, Utah, US, 2022.
  • [23] Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. 1990.
  • [24] Chen, T., C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
  • [25] Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
  • [26] Breiman, L. Random forests. Machine Learning, 45:5–32, 2001.
  • [27] Ke, G., Q. Meng, T. Finley, et al. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.
  • [28] Lim, B., S. Zohren. Time-series forecasting with deep learning: A survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.
  • [29] Hewamalage, H., C. Bergmeir, K. Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.
  • [30] Wen, Q., T. Zhou, C. Zhang, et al. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.
  • [31] Bai, S., J. Z. Kolter, V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
  • [32] Salinas, D., V. Flunkert, J. Gasthaus, et al. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [33] Kitaev, N., Ł. Kaiser, A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  • [34] Zhou, H., S. Zhang, J. Peng, et al. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pages 11106–11115, 2021.
  • [35] Liu, S., H. Yu, C. Liao, et al. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.
  • [36] Cirstea, R.-G., C. Guo, B. Yang, et al. Triformer: Triangular, variable-specific attentions for long sequence multivariate time series forecasting – full version. IJCAI, 2022.
  • [37] Nie, Y., N. H. Nguyen, P. Sinthong, et al. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
  • [38] Challu, C., K. G. Olivares, B. N. Oreshkin, et al. NHITS: Neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pages 6989–6997, 2023.
  • [39] Oreshkin, B. N., D. Carpov, N. Chapados, et al. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437, 2019.
  • [40] Zeng, A., M. Chen, L. Zhang, et al. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pages 11121–11128, 2023.
  • [41] Gu, A., T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • [42] Wang, Z., F. Kong, S. Feng, et al. Is Mamba effective for time series forecasting? arXiv preprint arXiv:2403.11144, 2024.
  • [43] Ahamed, M. A., Q. Cheng. TimeMachine: A time series is worth 4 Mambas for long-term forecasting. arXiv preprint arXiv:2403.09898, 2024.
  • [44] Shi, Z. MambaStock: Selective state space model for stock prediction. arXiv preprint arXiv:2402.18959, 2024.
  • [45] Hersbach, H., B. Bell, P. Berrisford, et al. ERA5 hourly data on single levels from 1979 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), 10(10.24381), 2018.
  • [46] Lai, G., W.-C. Chang, Y. Yang, et al. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
  • [47] Trindade, A. ElectricityLoadDiagrams20112014. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C58C86.
  • [48] Liu, X., Y. Xia, Y. Liang, et al. LargeST: A benchmark dataset for large-scale traffic forecasting. Advances in Neural Information Processing Systems, 36, 2024.
  • [49] Li, Y., X. Lu, Y. Wang, et al. Generative time series forecasting with diffusion, denoise, and disentanglement. Advances in Neural Information Processing Systems, 35:23009–23022, 2022.
  • [50] Schuhmann, C., R. Beaumont, R. Vencu, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [51] Wang, S., H. Wu, X. Shi, et al. TimeMixer: Decomposable multiscale mixing for time series forecasting. 2024.
  • [52] Han, T., S. Guo, W. Xu, et al. CRA5: Extreme compression of ERA5 for portable global climate and weather research via an efficient variational transformer. arXiv preprint arXiv:2405.03376, 2024.
  • [53] Xu, W., K. Chen, T. Han, et al. ExtremeCast: Boosting extreme value prediction for global weather forecast. arXiv preprint arXiv:2402.01295, 2024.
  • [54] Gebru, T., J. Morgenstern, B. Vecchione, et al. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.

6 Appendix

Appendix A Dataset Documentation

We organize the dataset documentation based on the template of datasheets for datasets [54].

A.1 Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

This dataset was created with the following motivations: 1) Current weather station datasets limit the applicability of forecasting models in real-world scenarios; it is therefore urgent to build a comprehensive global station weather dataset that enables forecasting models to generalize across diverse stations and regions worldwide. 2) The limited sizes of existing time-series datasets may not reflect the real performance of forecasting models; the proposed large weather station dataset can also serve as an extensive time-series dataset for comprehensive forecasting benchmarks across various methods. 3) Existing simple datasets fail to encompass the complex scientific problems that researchers need to discover and resolve, hindering progress in time-series prediction. The proposed dataset offers a diverse temporal and spatial range of time-series data, enabling comprehensive evaluation of time-series forecasting methods and driving significant advancements in the field. 4) A large-scale weather station dataset is a crucial source of observational data for numerical weather prediction models, effectively bridging the gap between numerical models and station-based predictions; this not only improves the accuracy of numerical forecasts but also plays a vital role in verifying and evaluating their predictive performance.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g.,company, institution, organization)?

This dataset was created by Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, and Lei Bai. The authors are researchers affiliated with the Hong Kong University of Science and Technology, Shanghai AI Laboratory, and the University of Sydney.

A.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

WEATHER-5K consists of 5,672 CSV files. Each CSV file represents data from a single weather station, with hourly observations recorded from 2014 to 2023. The dataset represents a collection of weather observation data, where each instance corresponds to an hourly observation from a specific weather station, with various meteorological measurements and auxiliary information.

How many instances are there in total (of each type, if appropriate)?

WEATHER-5K contains a total of 5,672 stations with 10-year temporal coverage and includes 8 mandatory variables and 2 auxiliary features. Each station provides 87,648 hourly instances.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified.

Our dataset is collected and further processed from the National Centers for Environmental Information (NCEI), specifically the Integrated Surface Database (ISD): https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database. Although the ISD contains records from over 20,000 stations spanning several decades, not all stations are suitable for machine learning applications. For instance, some stations are no longer operational, many do not report data on an hourly basis, and numerous stations have missing values for critical weather elements. To obtain a high-quality weather station dataset, we conducted a meticulous selection process that retains only long-term, hourly reporting stations that are currently operational and provide essential observations such as temperature, dew point temperature, wind, and sea level pressure. We then applied the processing procedure detailed in Section 3 to produce the final WEATHER-5K, guided by the principle of applicability to time-series forecasting research.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

The key characteristics of each instance are:

Instance Type

Each row in the CSV file represents a single hourly weather observation from a specific weather station.

Instance Fields

Each instance (row) has the following fields:

DATE: The date of the observation
LONGITUDE: The longitude of the weather station
LATITUDE: The latitude of the weather station
TMP: The temperature observation
DEW: The dew point observation
WND_ANGLE: The wind angle observation
WND_RATE: The wind rate observation
SLP: The sea level pressure observation
MASK: A binary list indicating the quality of each observation
TIME_DIFF: An auxiliary field
Temporal Dimension

The dataset covers hourly weather observations from 2014-01-01T00:00:00 to 2023-12-31T23:00:00, a total of 87,648 time slots.
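The stated slot count can be verified directly from the hourly sampling: 87,648 slots correspond to 3,652 full days (2014 through 2023, including the 2016 and 2020 leap years) at 24 observations per day.

```python
from datetime import datetime

# 87,648 hourly slots = 3,652 days (2014-2023, incl. leap years 2016 and 2020) x 24.
start = datetime(2014, 1, 1, 0)
end = datetime(2023, 12, 31, 23)
n_slots = int((end - start).total_seconds() // 3600) + 1  # inclusive of both endpoints
```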

Spatial Dimension

Each CSV file represents data from a single weather station, identified by its geographic coordinates (LONGITUDE and LATITUDE).

Is there a label or target associated with each instance? If so, please provide a description.

No; in the forecasting task, the weather observations themselves serve as labels, so weather forecasting can be considered a self-supervised learning task.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable).

No, many efforts have been made to ensure there is no missing data in the WEATHER-5K dataset.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

Yes, the weather stations in the dataset have geographical relationships, and we have used latitude, longitude, and elevation to represent their geographic locations. This information can be leveraged in subsequent work to model the spatial relationships between the instances.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

Yes, we chronologically split the data into training (2014-01-01 to 2021-12-31), validation (2022-01-01 to 2022-12-31), and test (2023-01-01 to 2023-12-31) sets, giving a ratio of 8:1:1.
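A sketch of the chronological assignment (the training start date assumes the dataset's stated 2014-2023 coverage; the helper name is ours):

```python
from datetime import datetime

# Split boundaries; training start assumes the dataset's 2014-2023 coverage.
SPLITS = {
    "train": (datetime(2014, 1, 1), datetime(2021, 12, 31, 23)),
    "val":   (datetime(2022, 1, 1), datetime(2022, 12, 31, 23)),
    "test":  (datetime(2023, 1, 1), datetime(2023, 12, 31, 23)),
}

def split_of(ts):
    """Assign a timestamp to its chronological split."""
    for name, (lo, hi) in SPLITS.items():
        if lo <= ts <= hi:
            return name
    raise ValueError(f"{ts} is outside the dataset coverage")
```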

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

Yes, the errors and noise in the dataset arise from two main sources. Firstly, the use of meteorological automatic stations introduces a certain degree of observational error, particularly in the measurement of wind speed and direction, which are relatively difficult to measure accurately. Secondly, in our data processing efforts to ensure the completeness of the dataset, we have employed interpolation operations, which can introduce some additional error. However, the proportion of error introduced by the interpolation is relatively small.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

Yes, it is self-contained.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

No, all our data are from a publicly available data source, i.e., NCEI.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

No, all our data are numerical.

A.3 Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)?

We source the data from the Integrated Surface Database (ISD), organized and maintained by the National Centers for Environmental Information (NCEI). ISD is a global database of hourly and synoptic surface observations compiled from numerous sources into a single common ASCII format and common data model; it integrates data from more than 100 original data sources.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?

NCEI (formerly the National Climatic Data Center) began developing ISD in 1998 with assistance from partners in the U.S. Air Force and Navy, as well as external funding from several sources. The database incorporates data from over 35,000 stations around the world and includes observations dating back as far as 1901. The number of stations with data in ISD increased substantially in the 1940s and again in the early 1970s. More than 14,000 active ISD stations are currently updated daily in the database. The total uncompressed data volume is around 600 gigabytes and continues to grow as more data are added.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

The sampling strategy is deterministic.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

Our code collects publicly available data, which is free. On our side, we developed a download API to retrieve the source data efficiently; this work was done by our team members.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

The WEATHER-5K dataset was collected and processed in 2024. The timeframe of the source data matches the creation timeframe of the data associated with the instances.

Were any ethical review processes conducted (e.g., by an institutional review board)?

No, such processes are unnecessary in our case.

A.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description.

Yes. To obtain a high-quality dataset of weather stations, a series of post-processing steps were performed on the raw weather station data collected from 2014 to 2024. Initially, 10,701 commonly operating stations were identified. The first step selected stations that report data every hour on the hour; since many stations did not meet this criterion, a replacement method estimated missing hourly data points using the nearest available report within a 30-minute window, significantly improving the proportion of valid hourly data. The remaining processing steps are described in Section 3.
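The 30-minute nearest-report rule can be sketched as follows; this is a simplified illustration of the described procedure, not the actual processing code:

```python
from datetime import datetime, timedelta

def snap_to_hour(records, hour, tolerance=timedelta(minutes=30)):
    """Fill an on-the-hour slot with the nearest raw report within tolerance.

    records: {timestamp: value} of raw (possibly off-hour) reports.
    Returns the nearest value, or None if no report is close enough.
    """
    best_t, best_v = None, None
    for t, v in records.items():
        if abs(t - hour) <= tolerance and (best_t is None or abs(t - hour) < abs(best_t - hour)):
            best_t, best_v = t, v
    return best_v

# Reports at 00:20 and 01:50: the 00:00 slot is filled, the 01:00 slot is not.
raw = {datetime(2014, 1, 1, 0, 20): 5.1, datetime(2014, 1, 1, 1, 50): 6.0}
filled = snap_to_hour(raw, datetime(2014, 1, 1, 0))   # 5.1 (20 min away)
empty = snap_to_hour(raw, datetime(2014, 1, 1, 1))    # None (no report within 30 min)
```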

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

The raw data are available from NCEI at: https://www.ncei.noaa.gov/products/land-based-station/integrated-surface-database. To get the preprocessed data, run 'weather_station_api.py' in our released repository: https://github.com/taohan10200/WEATHER-5K.

Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.

No.

A.5 Uses

Has the dataset been used for any tasks already? If so, please provide a description.

The dataset is used in this paper for the global station weather forecasting task.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

No, but we may include a leaderboard and list papers using this dataset in the future.

What (other) tasks could the dataset be used for?

Weather data imputation, numerical weather prediction, and data assimilation.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

We believe that our dataset will not encounter usage limitations.

Are there tasks for which the dataset should not be used? If so, please provide a description.

No; users may use our dataset for any task as long as it does not violate laws.

A.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

No, it will always be held on GitHub.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

The instructions for building WEATHER-5K are available at: https://github.com/taohan10200/WEATHER-5K. The dataset does not currently have a digital object identifier.

When will the dataset be distributed?

On June 07, 2024.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to.

Our benchmark dataset is released under a CC BY-NC 4.0 International License: https://creativecommons.org/licenses/by-nc/4.0/. Our code implementation is released under the MIT License: https://opensource.org/licenses/MIT.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

Yes, for commercial use, please check the website: https://www.ncei.noaa.gov/.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

No.

A.7 Maintenance

Who will be supporting/hosting/maintaining the dataset?

The authors of the paper.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Please contact this email address: hantao10200@gmail.com.

Is there an erratum? If so, please provide a link or other access point.

Users can use GitHub to report issues or bugs.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

Yes, the authors will actively update the code and data on GitHub. Any updates of the dataset will be announced in our GitHub repository.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

The dataset does not relate to people.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

Yes, we will provide the information on GitHub.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

Yes, we welcome users to submit pull requests on GitHub, and we will actively validate the requests.

[Figure 4]

Appendix B More Dataset Information

Station distribution by country. Figure 4 shows a histogram of the number of weather stations in the WEATHER-5K dataset across 33 countries. WEATHER-5K is a global database, though spatial coverage is best in North America, Europe, Australia, and parts of Asia, and coverage in the Northern Hemisphere is better than in the Southern Hemisphere.

|                    | Temperature | Dewpoint | Wind Direction | Wind Speed | Sea Level Pressure |
| Mean               | 12.71       | 6.53     | 191.19         | 3.37       | 1014.85            |
| Standard Deviation | 13.08       | 12.14    | 99.67          | 2.66       | 9.17               |

Characteristics of data distribution. Figure 5 provides violin plots for several variables. Temperature and dewpoint have similarly shaped distributions, symmetric about the median. The temperature distribution is most concentrated in low-latitude regions; as latitude increases, its center shifts and the distribution becomes more dispersed, indicating larger temperature differences in mid-to-high latitudes. For sea level pressure, the distribution centers are similar across latitudes with little shift, but the dispersion grows as latitude increases.

[Figure 5]

Climate mean and standard deviation. Table 4 presents the mean and standard deviation values for five key weather variables measured at the 5,672 weather stations: temperature, dewpoint, wind direction, wind speed, and sea level pressure. The mean temperature across the stations is 12.71 degrees, with a standard deviation of 13.08 degrees. For dewpoint, the mean is 6.53 degrees and the standard deviation is 12.14 degrees. The mean wind direction is 191.19 degrees, with a standard deviation of 99.67 degrees. The mean wind speed is 3.37 meters per second, with a standard deviation of 2.66 meters per second. Finally, the mean sea level pressure is 1014.85 millibars, with a standard deviation of 9.17 millibars. These statistics provide a high-level overview of the typical weather conditions captured by the network of weather stations.
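Such climatological statistics are commonly used to standardize model inputs; a minimal sketch (using the Table 4 values for z-score normalization is our assumption, not a procedure stated in the paper):

```python
# Per-variable (mean, standard deviation) pairs taken from Table 4.
STATS = {
    "temperature":        (12.71, 13.08),
    "dewpoint":           (6.53, 12.14),
    "wind_direction":     (191.19, 99.67),
    "wind_speed":         (3.37, 2.66),
    "sea_level_pressure": (1014.85, 9.17),
}

def standardize(value, variable):
    """Z-score a raw observation using the dataset-wide statistics."""
    mean, std = STATS[variable]
    return (value - mean) / std

z = standardize(1014.85, "sea_level_pressure")  # 0.0 at the climatological mean
```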

| Method            | Train Time (Hours) | GPU Memory (MiB) |
| Informer [34]     | 21~22              | 12,880           |
| Autoformer [13]   | 36~40              | 64,688           |
| Pyraformer [35]   | 20~21              | 33,750           |
| FEDformer [15]    | 38~40              | 18,804           |
| DLinear [40]      | 1.0~1.5            | 850              |
| PatchTST [37]     | 7~8                | 22,512           |
| Corrformer [12]   | 144~168            | 46,486           |
| Mamba [41]        | 3~4                | 11,406           |
| iTransformer [16] | 3~4                | 45,672           |

Appendix C More Experimental Results

Efficiency comparisons. We summarize the efficiency comparisons in Table 5 with the following observations. In terms of training time, DLinear stands out as the most efficient, requiring only 1.0-1.5 hours for 300,000 iterations, while Mamba and iTransformer also train quickly at 3-4 hours each. At the other extreme, Corrformer has the longest training time at 144-168 hours. The remaining methods, Informer, Autoformer, Pyraformer, and FEDformer, train in the range of 20-40 hours. Regarding GPU memory usage, DLinear has the lowest requirement at 850 MiB, while Informer, Pyraformer, and Mamba have moderate needs. Autoformer, FEDformer, Corrformer, and iTransformer, on the other hand, have relatively high GPU memory requirements, ranging from 18,804 MiB to 64,688 MiB. The trade-off between training time and GPU memory should be weighed when selecting a time-series forecasting method for a specific application, depending on the available computing resources and the requirements of the task.

Visualization results. In Figures 7-13, we plot visualization results to showcase the performance of various time-series forecasting methods, including Pyraformer, FEDformer, DLinear, PatchTST, Mamba, iTransformer, and Corrformer. These visualizations provide a comparative analysis of how each forecasting approach performs on the time-series data.

This series of figures illustrates the distinct characteristics of each method, giving the reader a clearer picture of the strengths and weaknesses of the various techniques and of the types of time-series forecasting problems each is best suited to.

Appendix D Live Weather Demo

In addition to providing high-quality datasets and benchmarks to promote scientific research, we are committed to putting these research results into practice and providing weather services to the public. Figure 6 shows a demo we are currently testing internally for station-level weather forecasts; we will publish it in a GitHub repository in the future. The demo displays forecasts from a model trained on the WEATHER-5K dataset: the first data row shows the latest observation, and the remaining rows show the forecast for the upcoming 24 hours.
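The table layout just described (one observation row followed by 24 hourly forecast rows) can be sketched as follows; `build_demo_rows`, `latest_obs`, and `forecast` are hypothetical names standing in for the demo's actual data plumbing:

```python
from datetime import datetime, timedelta

def build_demo_rows(latest_obs, forecast, obs_time):
    """Assemble the demo table: the latest observation first,
    then one row per forecast hour for the next 24 hours."""
    assert len(forecast) == 24, "expected a 24-hour forecast horizon"
    rows = [{"time": obs_time, "kind": "observation", **latest_obs}]
    for h, step in enumerate(forecast, start=1):
        rows.append({"time": obs_time + timedelta(hours=h),
                     "kind": "forecast", **step})
    return rows

# Hypothetical example with temperature only
obs = {"temperature": 12.7}
fcst = [{"temperature": 12.7 + 0.1 * h} for h in range(1, 25)]
rows = build_demo_rows(obs, fcst, datetime(2024, 6, 1, 0, 0))
print(len(rows), rows[0]["kind"], rows[-1]["kind"])  # prints 25 observation forecast
```

Keeping the observation and forecast rows in one homogeneous structure makes it straightforward to render the table in a web front end or refresh it as new hourly observations arrive.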


References
