In the context of nation-wide collaborative research projects and initiatives, data pipelines should be interoperable between different clusters within Switzerland. Several technologies are available to help share pipelines and make them work in different HPC environments; workflow managers such as Snakemake can be combined with containerisation technologies including Docker and Singularity.
The approach of “code moving to data”, rather than moving data to computing infrastructures, is becoming increasingly necessary, partly because of projects with sensitive data and partly because of ever-growing data sets. The aim of this project was to assess whether containerization could address the need to share and run pipelines reproducibly on different HPC clusters across Switzerland; it therefore involved a number of project partners. A secondary goal was to build a community around these container technologies in order to facilitate the development, deployment and running of containers.
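As a rough illustration of "code moving to data", the sketch below builds the same `singularity exec` command line for two clusters, changing only the host directory that is bound into the container. The cluster names, data paths, image file and pipeline script are all hypothetical and are not taken from the project; only `singularity exec` and its `--bind` option are real Singularity features.

```python
import shlex
from typing import Dict, List


def singularity_command(image: str, command: List[str], binds: Dict[str, str]) -> List[str]:
    """Build the argv for running `command` inside `image` with Singularity.

    `binds` maps host directories to the paths they appear under in the container.
    """
    argv = ["singularity", "exec"]
    for host_path, container_path in sorted(binds.items()):
        argv += ["--bind", f"{host_path}:{container_path}"]
    argv.append(image)
    argv.extend(command)
    return argv


# Hypothetical per-cluster data locations; only the bind source differs.
clusters = {
    "cluster_a": "/scratch/project/data",
    "cluster_b": "/data/shared/project",
}

for name, data_dir in clusters.items():
    argv = singularity_command(
        image="pipeline.sif",  # hypothetical image file
        command=["python", "/opt/pipeline/run.py", "--input", "/data"],  # hypothetical script
        binds={data_dir: "/data"},
    )
    # Print the command rather than running it, so the sketch works without Singularity.
    print(name, shlex.join(argv))
```

The image and the command stay identical across sites, so the data never has to leave the cluster on which it is stored.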
The project resulted in three tangible outcomes: a tested candidate technology stack to run the same workflow on different HPC clusters in Switzerland, a set of guidelines for pipeline interoperability using the Docker and Singularity container technologies, and a validation procedure for testing different technology stacks in the context of interoperable workflows. The project also created a highly interactive community of container developers, deployers and maintainers across the Swiss research landscape, thereby facilitating the future exchange of pipelines for collaborative research projects.
Technology stack to run the same workflow in different Swiss HPC clusters
Guidelines for pipeline interoperability using Docker and Singularity
Validation procedure for testing technology stacks
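The project's actual validation procedure is not described here. As a minimal sketch of one check such a procedure might include, the code below compares the output files of the same workflow run on two clusters by their SHA-256 checksums. The function names and the directory layout are assumptions made for illustration, and byte-identical output is a deliberately strict criterion that real pipelines may need to relax (for example, for files containing timestamps).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_outputs(output_dir: Path) -> Dict[str, str]:
    """Map each file path (relative to output_dir) to its SHA-256 digest."""
    digests = {}
    for path in sorted(output_dir.rglob("*")):
        if path.is_file():
            relative_name = path.relative_to(output_dir).as_posix()
            digests[relative_name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_runs(reference: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Return the sorted names of files that differ or exist on only one side."""
    names = set(reference) | set(candidate)
    return sorted(name for name in names if reference.get(name) != candidate.get(name))
```

An empty list from `compare_runs` would indicate that both clusters produced byte-identical results for that workflow.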
Energy production from solar radiation is becoming increasingly important in the light of current environmental challenges. With this in mind, current research aims to combine solar energy generation with urban planning in order to maximize efficiency. The goal is to evaluate the potential of building roofs located in urban areas for producing solar energy.
The iCeBOUND CTI project created a Decision Support System prototype that leverages 3D digital urban data to facilitate environmental analyses in cities. The industrial and governmental partners intended to use this application in production and needed a sustainable deployment model for the future.
However, since the clients of the software were not able to acquire hardware specifically for this purpose, this computationally intensive prototype needed to be migrated to generic Cloud infrastructures. In the first support project for the iCeBOUND consortium, the application was ported to run on public Cloud infrastructures, and made portable in order to be future-proof. Further technical details can be found here.
Building on this success, a second support project for the consortium was delivered that produced an optimized GPU deployment. This led both to reduced calculation times and to further reductions in the cost of calculating solar energy potentials. This deployment has the potential to allow the calculation of a whole city at once, something of great value to the iCeBOUND consortium.
and GPU Deployment
Many research core facilities and labs have ad hoc methods of tracking usage, billing, or resource planning for their resources (e.g. microscopy, genomics, etc.). In addition, many of these facilities are shared across institutions, and most tools do not support cross-institution concepts for discovering resources.
In this support project, a tool called Open IRIS was developed to offer these capabilities. Whilst the original scope of the project was to implement features for a number of research groups at ETH Zurich and the University of Basel, Open IRIS now has 17 Swiss organizations with users registered in the system (Open IRIS is integrated with AAI for all Swiss universities).
These organizations span universities, institutes, hospitals, and commercial organizations that are working together to optimize and share research resources.
The use of the system has also extended well beyond Switzerland, with users in many other countries. In total, approximately 4000 users are currently registered in the platform, covering 85 organizations in 12 countries. It has around 150 resource providers registered, with over 1000 resources in the system. In 2016, over 1880 users were added to the system, with hundreds of logins per day.
We are delighted that the contribution Open IRIS has made was recently recognised as the winner of the 2017 «S-Lab/UKSPA Laboratory Effectiveness Award».
Neuroscience Imaging Pipeline
A detailed understanding of brain function requires monitoring, analyzing and interpreting the activity of large networks of neurons whilst animals perform meaningful behaviors. In recent years, neuroscientists have come ever closer to achieving this goal, mainly by developing imaging techniques that allow measurements of large fractions of neurons with high spatial and temporal resolution.
In addition, behavioral monitoring has become increasingly sophisticated. For example, high-speed video recordings allow the precise evaluation of animals’ movements whilst neuronal activity is acquired in parallel. In combination, these new experimental approaches offer novel insights into the neuronal programs that organize behavior. However, these techniques are very data-intensive, and pose new analysis and data management challenges for neuroscientists.
In this project, a generic data analysis pipeline was developed for Prof. Fritjof Helmchen at the Brain Research Institute, University of Zurich. This pipeline allows concurrent processing of neuronal and behavioral data from the acquisition stage, right through to the visualization of results. One of the main design criteria, and a significant advantage over current ‘lab-internal’ solutions, is the scalability of the pipeline to ever-increasing amounts of data.
In order to achieve this, distributed processing frameworks optimized for dealing with large volumes of data (e.g. Hadoop, Spark) were used. In addition, the pipeline is not restricted to a single acquisition method, and is able to deal with data acquired by different modalities and systems (e.g. microscopy, electrophysiology, video tracking). This was achieved by combining an internal data model well suited to multi-dimensional time-series data with a set of ‘plug-ins’ for existing data stores. Finally, a number of training sessions were delivered on site at the customer to ensure maximum benefit from the pipeline.
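The combination of a common time-series data model with plug-ins for existing data stores can be sketched as follows. This is a toy illustration only, not the pipeline's actual design: the `Recording` class, the loader registry and the CSV-style plug-in are all hypothetical names, and a real plug-in would read from a data store rather than from a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Recording:
    """A multi-dimensional time series: one row of channel values per timestamp."""
    modality: str
    timestamps: List[float]
    samples: List[Tuple[float, ...]]
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if len(self.timestamps) != len(self.samples):
            raise ValueError("each timestamp needs exactly one sample row")


# Registry of plug-ins: each one converts a store-specific source into a Recording.
Loader = Callable[[str], Recording]
_LOADERS: Dict[str, Loader] = {}


def register_loader(modality: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader function for one modality."""
    def decorator(func: Loader) -> Loader:
        _LOADERS[modality] = func
        return func
    return decorator


def load(modality: str, source: str) -> Recording:
    """Dispatch to the plug-in registered for `modality`."""
    if modality not in _LOADERS:
        raise KeyError(f"no plug-in registered for modality {modality!r}")
    return _LOADERS[modality](source)


@register_loader("video_tracking")
def load_tracking_csv(source: str) -> Recording:
    """Toy plug-in: parse 'time,x,y' lines from a string."""
    timestamps, samples = [], []
    for line in source.strip().splitlines():
        t, x, y = (float(value) for value in line.split(","))
        timestamps.append(t)
        samples.append((x, y))
    return Recording("video_tracking", timestamps, samples)


rec = load("video_tracking", "0.0,1.0,2.0\n0.01,1.5,2.5")
print(len(rec.timestamps), rec.samples[-1])
```

Because downstream analysis code only ever sees `Recording` objects, supporting a new modality in this sketch means adding one loader function rather than changing the analysis steps.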
Scalable data analysis pipeline
Customized on-site training sessions
The ATLAS experiment is one of the four major experiments of the CERN Large Hadron Collider (LHC). ATLAS is a general-purpose physics experiment run by an international collaboration of scientists, with the aim of fully exploiting the potential of the LHC to address fundamental questions about the basic building blocks of matter and the fundamental forces of nature. One of the most prominent activities of ATLAS is its involvement, along with the CMS experiment, in the discovery of the Higgs boson.
Switzerland provides computational and storage resources to ATLAS. However, ATLAS faces increasing computational demands in order to fully exploit the scientific data produced. As the complexity and size of the data increase, more powerful computational infrastructure will be required to process and analyse it. One potential approach for dealing with this is the use of cloud computing resources, which have become increasingly popular in recent years.
In this support project, led by S3IT in cooperation with the physics department at the University of Bern, different cloud infrastructures, including the academic SWITCHengines platform, were tested for the processing of ATLAS data. The ElastiCluster software from the University of Zurich was used to set up a cluster of 320 SWITCHengines cores in about an hour to process ATLAS data. The performance of the infrastructure in continuous operation was then recorded for several months.
A number of conclusions can be drawn from this support project regarding the feasibility of using SWITCHengines. Firstly, setup time is low: a medium-sized cluster of 1000 virtual CPU cores can be set up in less than a day. Secondly, over the course of several months, the uptime on SWITCHengines was close to 100%. In addition, maintenance on the user side was almost non-existent, which compares very favourably with operating one's own hardware, where this would certainly not have been the case. Thirdly, the cost of SWITCHengines is competitive with commercial cloud providers. As such, this support project suggests that cloud infrastructure, and in particular SWITCHengines, is a feasible and cost-efficient computing resource for ATLAS.
ATLAS Cloud Testing