Leader: Klaus Johannsen

Co-leader: Alla Sapronava

## Objective

To manage model outputs in a Big Data infrastructure, covering the download, ingestion, preprocessing, and standardization of the data. To provide the Big Data analysis tools and algorithms required by the tasks in WP2, WP3, and WP4.

## Background

The output from the upcoming Coupled Model Intercomparison Project phase 6 (CMIP6) is scheduled to be publicly available from early 2018 onwards. The estimated number of participating models and the size of the output data are given in Table 1. With increasing model complexity and resolution, the size of the model outputs is estimated to grow by a factor of 10 to 20 compared to CMIP5. This makes it essential to resort to Big Data IT infrastructures and computational tools to manage and analyse the data efficiently.

The in-house Big Data framework at Uni Research Computing (Apache Hadoop + Spark) can manage data volumes up to ~100 TB (compared to a few terabytes on a typical Linux cluster used by climate scientists) and is in principle scalable to any desired size. In the current setup the entire cluster has a total of about 2,100 GB of RAM, 550 TB of storage, and 500 CPU cores, and ~1/3 of the cluster will be reserved for the COLUMBIA project. WP1 will first of all provide the IT infrastructure for efficient handling and streamlining of the large data sets coming from CMIP6. Secondly, the infrastructure is equipped with a large suite of Machine Learning tools, which will be customized 1) for the climate analyses required in the other WPs (see also Fig. 3) and 2) to integrate different analysis methodologies into a generalised data science approach. Over the last decade, Machine Learning techniques have proven clearly advantageous in climate science for combining the results of multiple climate models (e.g., Monteleoni et al., 2011; Lakshmanan et al., 2015), and they are an essential asset of a Big Data framework on top of the pure IT infrastructure.

## Description of work

Task 1.1. Download and preprocess the CMIP6 model data. The CMIP6 data will be downloaded from the Earth System Grid Federation database, preprocessed (including operations such as data reduction, cleaning, and filtering), and ingested in a format native to the Hadoop distributed file system, such as ORC (Optimized Row Columnar) tables. Once in the Big Data framework, the data will be prepared for analysis in the other WPs (e.g. preliminary operations such as spatial interpolation and setting a chosen resolution for the geographical grids).
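The regridding step of the preprocessing can be illustrated with a minimal sketch. This is plain NumPy standing in for the Spark pipeline, and the grid sizes, coarsening factor, and function name are illustrative only:

```python
import numpy as np

def coarsen(field, factor):
    """Block-average a 2-D field onto a coarser grid: a simple
    conservative regridding, assuming `factor` divides both axes."""
    ny, nx = field.shape
    return field.reshape(ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3))

# Illustrative high-resolution field (e.g. surface temperature anomalies)
rng = np.random.default_rng(0)
hires = rng.normal(size=(180, 360))   # 1-degree global grid
lores = coarsen(hires, 4)             # regrid to a 4-degree grid
print(lores.shape)                    # (45, 90)
```

Because the coarse cells are simple block averages, the global mean of the field is preserved exactly, which is the property one usually wants when standardizing model outputs onto a common grid.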

Task 1.2. Implement nonlinear dimensionality reduction in the Big Data ecosystem. WP1 will provide WP2 with the required computational tools as a set of algorithms and pipelines to be deployed in the Big Data framework. The process-oriented model evaluation carried out in WP2 is typically based on linear dimensionality reduction techniques (such as Singular Value Decomposition and Principal Component Analysis; von Storch and Zwiers, 2003), which are standard components of most Machine Learning libraries. WP1 will provide a fast and scalable implementation in Hadoop+Spark, suitable for the analysis of large datasets. The limitation of PCA and similar techniques is that the projection onto a lower-dimensional space is linear. Further reduction, and thus further insight into the data structure, requires nonlinear techniques, a number of which have been commonly applied to climate model evaluation, many of them based on neural networks (e.g. Nonlinear PCA, self-organizing maps, multilayer perceptrons, Isomap; Gamez et al., 2004; Leloup et al., 2007; Ross et al., 2008). WP1 will provide Big Data implementations of nonlinear dimensionality reduction approaches (e.g. NPCA), which can identify relevant low-dimensional manifolds in climate data sets arising from nonlinear processes. Beyond that, WP1 will also combine optimization (e.g. via a genetic algorithm) with nonlinear analysis to explore new ways of combining known variables into meaningful climate processes and features, which can emerge when the data structure is explored from a more abstract perspective.
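The linear baseline of this task, PCA via SVD, can be sketched as follows. This is a NumPy illustration of the technique, not the Hadoop+Spark implementation the task will deliver; data shapes and names are illustrative:

```python
import numpy as np

def pca(X, k):
    """Linear dimensionality reduction via SVD: project the centred
    data onto its k leading principal components (EOFs)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # k leading EOFs
    scores = Xc @ components.T                # projection of each sample
    explained = s[:k] ** 2 / np.sum(s ** 2)   # fraction of variance per PC
    return scores, components, explained

# Illustrative data: 200 "time steps" of a 50-dimensional field,
# mixed so the variance concentrates on a few directions
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))
scores, comps, var = pca(X, k=3)
print(scores.shape, comps.shape)   # (200, 3) (3, 50)
```

The nonlinear methods named above (NPCA, self-organizing maps, Isomap) generalize exactly this projection step: they replace the linear map `Xc @ components.T` with a learned nonlinear one, which is why PCA is the natural reference implementation to scale up first.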

Task 1.3. Development of an optimization technique for emergent constraint identification. As shown in the proof-of-concept, the linearization problem arising when the “emergent constraint” method is applied to model evaluation can be solved efficiently even for large datasets by applying an optimization technique such as a genetic algorithm, a fast and scalable implementation of which will be provided to WP3. The pilot study showcases a constraint problem based on the global ocean carbon sink seen at two points in time. Here, the concept will be extended 1) to higher dimensions, by combining more variables together, and 2) to different slices of the temporal dimension (more time points, time-space regions). As in Task 1.2, nonlinear dimensionality reduction techniques will then be applied to optimize the constraint identification.
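A minimal sketch of the genetic-algorithm idea, assuming the constraint search can be posed as finding a weighted combination of model diagnostics that correlates best with the quantity to be constrained. All names, population sizes, and the synthetic "model ensemble" below are illustrative, not the pilot-study setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def fitness(w, X, y):
    """Absolute correlation between the weighted combination Xw and
    the constrained quantity y across the model ensemble."""
    return abs(np.corrcoef(X @ w, y)[0, 1])

def genetic_search(X, y, pop=40, gens=60, sigma=0.3):
    """Minimal genetic algorithm: truncation selection (keep the best
    quarter) plus Gaussian mutation over candidate weight vectors."""
    P = rng.normal(size=(pop, X.shape[1]))
    for _ in range(gens):
        scores = np.array([fitness(w, X, y) for w in P])
        elite = P[np.argsort(scores)[-pop // 4:]]           # survivors
        kids = elite[rng.integers(len(elite), size=pop - len(elite))]
        P = np.vstack([elite, kids + rng.normal(scale=sigma,
                                                size=kids.shape)])
    scores = np.array([fitness(w, X, y) for w in P])
    return P[scores.argmax()], scores.max()

# Illustrative ensemble: 30 models with 5 diagnostics each, where y
# depends on a hidden linear combination of the diagnostics plus noise
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
w_best, r_best = genetic_search(X, y)
print(r_best)
```

Because the elite individuals survive each generation unmutated, the best fitness is non-decreasing, and the search recovers a near-perfect correlation on this synthetic problem. Extending to "higher dimensions" as described above simply means growing the number of columns of `X`.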

Task 1.4. Implement nonlinear dimensionality reduction as in Task 1.2, but for multiple future climate scenarios. WP1 will provide the algorithms required by WP4 to evaluate the performance of the models selected in WP2 and WP3 under different future climate scenarios, disentangling the scenario uncertainty from the internal-variability uncertainty. The approach of Task 1.2 will be followed, but applied separately to each future scenario (e.g. non-mitigated, mitigated, and quadrupled atmospheric CO2 scenarios).
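The disentangling step can be sketched with a simple variance partition: the spread across ensemble members within a scenario estimates internal variability, while the spread across scenario means estimates scenario uncertainty. This NumPy sketch is one common way to frame that split, not necessarily the decomposition WP4 will adopt, and the ensemble sizes are illustrative:

```python
import numpy as np

def partition_uncertainty(runs):
    """Split variance of a projected quantity into scenario uncertainty
    (variance of the per-scenario ensemble means) and internal
    variability (mean of the within-scenario variances).
    `runs` has shape (n_scenarios, n_members)."""
    scenario_unc = runs.mean(axis=1).var()
    internal_var = runs.var(axis=1).mean()
    return scenario_unc, internal_var

# Illustrative ensemble: 3 scenarios x 10 members of a projected change,
# each member = forced scenario signal + internal-variability noise
rng = np.random.default_rng(3)
forced = np.array([1.0, 2.0, 4.0])
runs = forced[:, None] + 0.3 * rng.normal(size=(3, 10))
s_unc, i_var = partition_uncertainty(runs)
print(s_unc > i_var)   # True: scenario spread dominates in this example
```

In the full task, the same per-scenario treatment applies to the dimensionality-reduced representations from Task 1.2 rather than to a single scalar.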