Author: Mark Woodbridge

Research Software Directories

This is a summary of a SORSE discussion session, presented by:

  • Mark Woodbridge, Imperial College London
  • Vanessa Sochat, Stanford University
  • Jurriaan Spaaks, Netherlands eScience Center

And featuring contributions from:

  • Malin Sandström, INCF
  • Alexander Struck, Humboldt University of Berlin

Introduction

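The discussion session “Research Software Directories: What, Why, and How?” was held on September 16 during SORSE, an International Series of Online Research Software Events. As presenters, we each shared efforts to develop and maintain software directories: catalogues to showcase the software outputs of an institution or community. The directories presented were:

  • The Research Software Directory by the Netherlands eScience Center (research-software.nl)
  • The Research Software Directory by Imperial College London
  • The Research Software Encyclopedia
  • GitHub Search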

Each of the above offers its own advantages and disadvantages, or is scoped for particular use cases. For example, research-software.nl provides a robust application for serving detailed metrics and metadata for software; however, it requires more manual entry. The Research Software Encyclopedia is automated and does not require hosting, but it lacks the same level of metadata. The Imperial College London and GitHub Search research software directories are much quicker to deploy, but might be too simple for some use cases. The directories are discussed in detail in the following sections. In addition to this set, we suggest the reader take a look at the Awesome Registries list to find additional examples.

How many participants use software directories?

We were quite surprised by the results of asking attendees to what extent they had used or contributed to software directories. Of 27 participants, 43% had used a directory for a relevant project, 27% had submitted software to a directory, and 58% indicated neither of the above.

Presentations

The Research Software Directory by Netherlands eScience Center

Jurriaan’s presentation started off by explaining why the Netherlands eScience Center had a need for what eventually became the Research Software Directory. Firstly, as the Center grew beyond 20 or so engineers, tracking what software was available in-house became too difficult to do ad hoc, despite the fact that all of its repositories are publicly accessible on GitHub. Secondly, the eScience Center strives to be as open as possible, and thought it was important to be able to show the outside world where the taxpayer’s money had gone. Lastly, the eScience Center has a continuous need to keep track of various metrics, both for reporting to its funders (SURF and NWO) and for helping management make informed business decisions.

Jurriaan then demonstrated the eScience Center’s instance of the Research Software Directory. While walking the viewers through the site, he explained how the product pages are designed to guide visitors towards adopting the software they present.

When designing the Research Software Directory, specific attention was paid to how an instance is filled with data, how this data is curated, and how to do this in a way that can be sustained over time. To this end, the Research Software Directory harvests much of its information automatically, for example using APIs to GitHub (code development platform), Zenodo (archiving service), and Zotero (reference manager). This setup allows engineers employed by the Netherlands eScience Center to stay mostly in their comfort zone (i.e. GitHub). They just need to make sure to follow best practices such as having publicly accessible repositories, making releases on Zenodo using the automated integration, and including software citation metadata (CFF) in their repositories. Given that they already do much of that anyway, making an entry in the Research Software Directory can be achieved in a few clicks via the Admin interface that the Research Software Directory provides.

The Research Software Directory has proven to be a great resource for managing the organization, for providing funders with relevant metrics, and for increasing the visibility of tools. Despite these upsides, there are some downsides as well. For example, it has proven difficult to carve out enough time to curate the prose on the product pages, leading to text snippets that are sometimes too difficult to read for visitors not yet familiar with the software being presented. A second problem is maintenance of the Research Software Directory software itself: the software stack includes more than 40 techniques, methods, and tools, in various languages and using a variety of frameworks. It has proven difficult to find developers who are familiar enough with all of these to be effective at maintaining the site. While this has not led to any significant downtime in the 3 years research-software.nl has been running, the eScience Center intends to start reducing the software stack in the very near future. Furthermore, they are investigating whether it’s feasible to provide Research Software Directories as a service.

The Research Software Directory by Imperial College London

Mark Woodbridge demonstrated Imperial College’s Research Software Directory, explaining how it was developed to present a manually curated list of GitHub and GitLab repositories – motivated by a desire to rapidly catalogue and demonstrate the breadth of software developed at Imperial. It is also intended to encourage collaboration by assisting researchers to identify existing expertise and projects at Imperial.

The chosen approach has resulted in a system which is easy to maintain – both in operational complexity and in adding entries to the directory (even if the latter does depend on some familiarity with git and GitHub i.e. making a commit and pull request). This simplicity comes at a price: it depends on Algolia (a freemium service), has limited features, and is not easy to customise. It also relies on manual curation and repository metadata: due to limited bandwidth and lack of incentives, developers rarely submit or annotate software themselves. Finally, it lacks the polish and level of detail that you might expect of a public-facing showcase.

The system has, however, achieved its aims in effectively showcasing research software and developers at Imperial, and has provided a set of metadata that enables the identification of trends, from preferred languages to fast-growing fields of research. A suite of standalone utility scripts ensures that the contact details and project web pages remain up-to-date, and that new repositories by known developers are added to the directory in a timely manner.

The Research Software Encyclopedia

The Research Software Encyclopedia (RSEPedia) is a community-driven, open source directory that provides a means to communicate about software. It consists of three components – a set of criteria and taxonomy items used to describe or otherwise communicate about software categorization preferences, a database, and a command line client to interact with those components. The criteria and taxonomy items are maintained in their own GitHub repository, https://github.com/rseng/rseng, and render to an interface to allow for exploration and visualization. Importantly, the site for these items hosts a weekly software showcase, allowing the community to learn more about open source libraries that might be useful for their work. The terms are also served programmatically to a RESTful application programming interface (API) that makes them readily available for the RSEPedia software, which is also provided on GitHub (https://github.com/rseng/rse). Using the software, an individual or institution is empowered to easily generate a database and interface for a set of software they care about. They can inspect, add, search, or otherwise interact with metadata. While relational databases can be created, the community maintained database is a flat file database hosted on GitHub (https://rseng.github.io/software) that allows an interested contributor to browse, and annotate software with criteria and taxonomy items in an online interface. Annotation only takes a few clicks, and the process to make changes and update the database is fully automated via GitHub Actions. Annotation in bulk is also easy to do locally after cloning the software repository, starting the annotation interface, and opening a pull request with changes. Importantly, although annotation can help to share ideas about software, it is not required to make the RSEPedia useful. By way of being able to communicate about software via asking questions, and by way of the software showcase, the RSEPedia can be successful for your needs if you just need a way to describe what you are looking for (e.g., for a grant or journal) or just want to share your set of software to be easily searchable.

GitHub Search is a derivation of the Research Software Directory by Imperial College London, but it removes the Algolia dependency: it derives its list of software repositories from the GitHub API listing for an organization and serves the result directly on GitHub Pages. This means that deployment is easy, coming down to simply creating the repository with a GitHub Action that rebuilds the pages at some frequency.
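To make the idea concrete, here is a minimal Python sketch of how a static directory could be derived from GitHub's REST endpoint for listing an organization's repositories. It is not the GitHub Search implementation: the function names, the choice to skip forks and archived repositories, and the ordering by stars are all assumptions made for illustration.

```python
import json
import urllib.request

API = "https://api.github.com"


def page_url(org, page, per_page=100):
    """URL for one page of an organization's public repositories."""
    return f"{API}/orgs/{org}/repos?per_page={per_page}&page={page}"


def fetch_all(org):
    """Walk the paginated listing until an empty page is returned.

    Makes unauthenticated network calls, which GitHub rate-limits.
    """
    repos, page = [], 1
    while True:
        request = urllib.request.Request(
            page_url(org, page),
            headers={"Accept": "application/vnd.github+json"},
        )
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)
        if not batch:
            return repos
        repos.extend(batch)
        page += 1


def to_entry(repo):
    """Keep only the fields a simple search page needs (an assumed selection)."""
    return {
        "name": repo["name"],
        "description": repo.get("description") or "",
        "url": repo["html_url"],
        "language": repo.get("language"),
        "stars": repo.get("stargazers_count", 0),
    }


def build_index(repos):
    """Drop forks and archived repositories, then order by stars and name."""
    entries = [
        to_entry(r) for r in repos if not r.get("fork") and not r.get("archived")
    ]
    return sorted(entries, key=lambda e: (-e["stars"], e["name"].lower()))
```

A scheduled job could call `build_index(fetch_all("my-org"))` (with a hypothetical organization name) and write the result to a JSON file that the published page searches client-side.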

Discussions

After the presentations, attendees were divided into three groups for a 20-minute discussion session. All three groups had lively discussions covering a range of relevant subjects, a selection of which is included below.

How do software directories interact with high performance computing (HPC)?

As several attendees work as HPC administrators, the question of the relationship between software directories and HPC centers quickly came up. Indeed, these centers typically maintain a large catalog of software for a user base, and it could be beneficial to link this software catalog, or the strategy used to maintain it, with a software directory. For example, if you are familiar with Spack or EasyBuild you could imagine an integration that uses a software directory to look up metadata, or to generate user-friendly documentation pages. The pages might have install instructions, examples, and optimization hints for different architectures.

Guix-HPC is a package manager for a variety of software that is developed to allow reproducible HPC environments. It may interact with existing instances of Research Software Directories.

Curation policies

The main concern related to the “curation” of software directories was the criteria for inclusion. There was a lively discussion about the definition of “research software”, particularly in relation to scale and licensing. In the broadest sense there was agreement in principle that it could refer to any tool or library used to produce scientific results.

In terms of scale, attendees working in life sciences research emphasized that research software in their context could be a standalone script, and software directories should therefore “scale-down” appropriately. Scripts of this type may be less substantial but their quality could well be assessed similarly to more prototypical projects in terms of documentation, design for re-use and version control.

Licensing was a more challenging topic – an argument was made for directories enabling users to find any tool that might accelerate research, including commercial software – as long as an appropriate licence was available.

In broader terms, there was consensus that curators should avoid making assumptions about software applicability and relevance, even if they do have domain knowledge. More important than strict policies is effective annotation and filters so that users can apply their own criteria when searching for relevant software.

Searching for software

Searching for software presents its own challenges as an RSD only presents local results and many other platforms would need to be consulted for an exhaustive overview of relevant packages. Here, some registry lists prove to be helpful, for example Awesome Research Software Registries.

The purpose and minimum features of Research Software Directories

Participants identified discoverability as a major issue in relation to research software, particularly for domain specialists (i.e. end-users). This led to the following features being considered of primary importance:

  • Metadata clearly explaining the purpose and value of individual software tools in non-technical terms. The community is currently working on metadata standards like CFF or CodeMeta.
  • Contact details for the authors of the software in case further advice or support is required
  • Installation and getting started instructions
  • Guidance on how to cite the software
  • Licensing terms. This was discussed not only in relation to terms of use but also, for non-free software, ensuring cost-efficiencies by avoiding unilateral purchasing decisions and promoting the use or procurement of shared/group licences.
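As an illustration only, the features listed above could be captured as a minimal record with a completeness check. The class and field names below are hypothetical and do not follow CFF, CodeMeta or any of the directories presented.

```python
from dataclasses import dataclass, fields


@dataclass
class DirectoryEntry:
    """One software entry; field names are illustrative, not a standard."""

    name: str
    purpose: str = ""          # plain-language purpose and value
    contact: str = ""          # how to reach the authors
    getting_started: str = ""  # installation and first steps
    citation: str = ""         # how to cite the software
    licence: str = ""          # licensing terms


def missing_fields(entry):
    """Names of the fields that are still empty, in declaration order."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


entry = DirectoryEntry(name="example-tool", purpose="Aligns images", licence="MIT")
print(missing_fields(entry))
```

A curator could use a check like this to flag entries that need attention before they are published.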

Many other features may benefit researchers, for example, linking from an RSD entry to its accompanying paper and data, as suggested in “Generalist Repository Comparison Chart” or listing received software citations, as implemented in swMATH.

Organization-based registry vs community-based registry

Some registries out there are scoped to serve an organization, whereas other registries like ascl.net or bio.tools aim to serve an entire research community. An advantage of the latter is increased traffic to the registry, and real benefits for users to browse the registry to see if somebody else in the community already created a solution. However, because the social structure across the community is quite loose, it will be more difficult to keep people involved, to discover new tools that could be added to the registry, and to make sure that the language used on the registry’s pages is understandable by everyone in the community. Furthermore, governance of the instance will be more difficult. For example, within the community there may exist different opinions on what metadata should be kept, and weighing these opinions will be more difficult in a larger community than a small one.

In contrast, organizational registries are easier to run and govern: discovering tools that could be added is (or used to be) a matter of hanging out at the coffee machine and asking your colleague what they are working on right now. Helping your colleague enter their data, and making sure they do it correctly, is easier as well, and some good old-fashioned peer pressure can be applied if needed. Funding policies currently do not mandate the publication of research software in the way that Horizon 2020 required for research data (if possible).

Further resources

Recommendations and Next Steps

By discussing topics of curation, federation, technology and sustainability of research software directories with a wider audience, this discussion session hoped to not only promote the benefits of such directories and encourage their deployment, but also to identify issues and gather ideas to address them. From the discussion above, it’s clear that there are interesting projects and updates to existing directories that might be pursued.

Running Jupyter notebooks on Imperial College’s compute cluster

We were really glad to see James Howard (NHLI, Faculty of Medicine) announcing on Twitter that he’d published a Kaggle kernel to accompany his recent publication on MR image analysis for cardiac pacemaker identification using neural networks via PyTorch and torchvision. Sharing code in this way is a great way to promote open research, enable reproducibility and encourage re-use.

Figure 3 from Cardiac Rhythm Device Identification Using Neural Networks

We thought it might be helpful to explain how to run similar notebooks on Imperial’s cluster compute service, given that it can provide some benefits while you’re developing code:

  • Your code and data remain securely on-premise, thanks to the RCS Jupyter Service and Research Data Store
  • You can run parallel interactive and non-interactive jobs that span several days, across multiple GPUs

With James’ permission we’ve lightly modified his notebook and published it in an exemplar repository alongside some instructions to run it on the compute cluster. We hope this can help others to use a combination of Conda, Jupyter and PBS in order to conduct GPU-accelerated machine learning on infrastructure managed by the College’s Research Computing Service – without incurring any cost at the point of use.

Many thanks to James Howard for sharing his notebook and reviewing our instructions.

RSLondonSouthEast 2020

RSLondonSouthEast 2020, the annual gathering for Research Software Engineers based in or around London, took place on the 6th February at the Royal Society. The College was strongly represented by contributions from RSEs based at Imperial.

Full talks:

Lightning talks:

Posters:

Jeremy Cohen introduces RSLondonSouthEast 2020 at the Royal Society

Jeremy Cohen (Department of Computing) was the chair of the organising committee. Stefano Galvan (Department of Mechanical Engineering), Alex Hill (Department of Infectious Disease Epidemiology) and Jazz Mack Smith (Department of Metabolism, Digestion and Reproduction) served on the programme committee.

Many thanks to all the committee members and everyone who presented, submitted proposals or attended on the day, and to EPSRC and the Society of Research Software Engineering for their support. For more information from the event check Jeremy’s full report, RESIDE’s blog post or #RSLondonSE2020 on Twitter.

A review of the RSE team’s activities in 2019

2019 has been another very busy and productive year for the RSE team in the Research Computing Service at Imperial College. Our core mission is to accelerate the research conducted at Imperial through collaborative software development, and we have now completed 24 projects since our inception 2 years ago with 75% of our first-year projects resulting in follow-on engagements. We’ve highlighted 5 of our most fruitful collaborations on our new webpages, which also provide more information about the team and the services we offer. We are about to appoint our fifth team member, reflecting the value we’ve offered to research projects (and proving that there is a career pathway for RSEs!).

In addition to our project work we’ve assisted researchers at over 40 RCS clinics this year and played a strong supporting role in Imperial’s Research Software community, from Hacktoberfest to departmental events. We’ve developed two brand new Graduate School courses in Research Software Engineering (to be delivered next term) and have helped deliver 4 Software Carpentry workshops. We’ve also played an increasingly active role in promoting the benefits of RSE (and the role itself) to relevant stakeholders in the College. This has complemented our broader engagement activities: acting as expert reviewers for JOSS submissions, contributing to numerous OSS projects, presenting at 3 international RSE conferences (deRSE19, UKRSE19 and NL-RSE19), and promoting our work via blogging, social media and attendance at several other relevant events – locally (e.g. RSLondonSouthEast 2019) and nationally (e.g. CW19, CIUK).

RSE19 conference photograph
The team (amongst many other RSEs!) at UKRSE19. Photo courtesy @RSEConUK.

We continue to develop tools and infrastructure to support RSE within the College. The nascent Research Software Directory aims to showcase the breadth of software developed at Imperial, encouraging collaboration, re-use and citation. We’re also attempting to give software a stronger position amongst research outputs through our current work on the Research References Tracking Tool (R2T2) and helping researchers submit their software to Spiral via Symplectic. Finally, we continue to share advice and guidance on how to adopt better RSE practices, such as QA and CI.

As we look forward and further develop the Research Computing Service’s RSE capacity and expertise we’d like to thank all the academics who have trusted us with their projects, and all the researchers who’ve taken the time to explain their work and have enthusiastically embraced good software engineering practices. We’re looking forward to another 12 months of strengthening RSE at Imperial!

NL-RSE19

On 20 November 2019 Mark Woodbridge and Jeremy Cohen represented Imperial College at NL-RSE19, the first annual conference of the Netherlands Research Software Engineer community.

NL-RSE19 poster session

Their presentation, Strength in Numbers: Growing RSE Capacity at Imperial College London (10.5281/zenodo.3548308) described the expanding groups involved in RSE at Imperial, their respective activities, and how examples of these are fostering collaboration and awareness across the College. They also took the opportunity to display a poster first shown at UKRSE19 that highlights key aspects of these initiatives. The talk and poster generated much interest and resulted in productive discussions with members of the NL-RSE community in relation to building inclusive communities, long-term support for research software, personal development opportunities for RSEs, and how best to support the broad range of research typically carried out in larger institutions.


Many thanks to the organisers (in particular Niels Drost and Ben van Werkhoven of the Netherlands eScience Center) for the opportunity to engage with the vibrant and rapidly growing RSE community in the Netherlands.

Using the Cloud for Research Software Engineering

We previously described three RSE-related use cases for Microsoft’s Azure platform, ranging in deployment granularity from VMs to individual JavaScript functions. In this post we’ll explain further how we use those and other Azure services to complement our on-premise infrastructure – helping us to deliver our RSE projects faster.

At Imperial we’re fortunate to have a powerful and well-maintained high-performance computing (HPC) system. We use this as a batch processing back-end for user-facing web applications that we have developed (such as Smart Forming) and for benchmarking projects including MUSE. The web applications themselves are typically hosted on CentOS VMware virtual machines hosted in our data centre and maintained by a dedicated team within ICT. These servers are set up to authenticate against our institutional sign-on system, are pre-configured with monitoring and alerting, and can directly access other on-premise systems (such as the HPC cluster and our Research Data Store).

Despite this local infrastructure we still derive a lot of value from access to our institutional Azure subscription, in both ad hoc and longer-term use of cloud resources. This gives us capabilities that would be difficult or costly to replicate on-premise. These include:

  • The ability to rapidly provision and tear-down systems and services
  • Access to higher-level (lower-maintenance) abstractions i.e. PaaS and FaaS
  • Access to a diverse range of operating systems and configurations, from VMs for multiple versions of Windows to macOS build agents

In particular we rely on the following services:

  • DevOps Pipelines: Cross-platform QA (primarily testing and linting) and packaging (including PyInstaller builds on macOS and Windows). Build failures are pushed to relevant Teams channels.
  • Functions: Our Trending app provides us with information about active repositories in our institutional GitHub organisation. Using Functions makes its deployment zero-maintenance.
  • App Service: Our GtR app provides us with alerts for new UKRI grants to Imperial College. It is deployed to App Service to avoid the setup and maintenance required of a standalone VM.
  • Cosmos DB: Both GtR and Trending use the MongoDB API provided by Cosmos.
  • Virtual Machines: We use Azure when we need VMs for long-running services that are required to accept incoming requests from other systems but don’t need access to on-premise resources, or when we need short-lived VMs for testing purposes.
  • Container Registry: We use continuous deployment for all our web apps (including MAGDA and POWBAL), meaning that pushing to the master branch in GitHub is sufficient to run our QA pipeline, build a Docker image which is pushed to the Azure registry, and for Watchtower to pull the image onto the target server and restart the relevant service(s).
  • Single Sign-On: This allows users of our internal apps to authenticate using their existing Office 365 accounts – avoiding the need for further login details.
  • Notebooks: We have our own Jupyter server attached to our cluster and data store, but Azure Notebooks are very useful for sharing externally, and for teaching large classes.

In short, Azure provides us with services that work alongside our existing systems, enabling us to deliver RSE projects more effectively and with much lower operational overheads than if we tried to replicate the same features on-premise. And by becoming familiar with these services we’re better equipped to advise and assist researchers across Imperial College who wish to take advantage of all the compute resources at their disposal – on-premise and in the cloud.

deRSE19

The first German national RSE conference took place in Potsdam on 4th-6th June 2019 with 187 attendees. deRSE19 was a really vibrant, welcoming and well-organised event in a great location and had a diverse agenda, encouraging participants from across Europe to share experiences of software engineering in research.

deRSE19 group photo
deRSE19 aerial group photo (CC-BY Antonia Cozacu, Jan Philipp Dietrich, de-RSE e.V.)

In terms of presentations Imperial College was the best-represented institution from outside Germany, with the following speakers:

  • Jeremy Cohen (EPSRC RSE Fellow, Department of Computing) who presented a talk on building research software communities and a poster about RSLondon.
  • Alex Hill (Senior Web Application Developer, Department of Infectious Disease Epidemiology) who spoke about the challenges of conducting constructive code reviews, particularly in a research setting.
  • Mark Woodbridge (RSE Team Lead, Research Computing Service) who gave a talk on RSE 2.0, reflecting on progress in Research Software Engineering and how it may develop in the near future.

Many thanks to all the event organisers and sponsors for giving us the opportunity to present.

Also during the conference a keynote on RSE collaboration was delivered by Alys Brett, chair of the newly established Society of Research Software Engineering and head of the Software Engineering Group at the UKAEA. UK RSEs also attended deRSE19 from the Software Sustainability Institute, the University of Westminster, and the University of Southampton. We look forward to reuniting with them, as well as colleagues from Germany and beyond at UKRSE19 in September!

RSLondonSouthEast 2019

Research Software London’s first annual workshop took place at the Royal Society on February 8, 2019, bringing together a regional community of research software users and developers from over 20 institutions. It featured a diverse schedule of talks and discussions about software engineering, community building and both domain-specific and general-purpose tools of relevance to research.

There were four talks from Imperial researchers, including the keynote from Professor Spencer Sherwin, Director of Research Computing. The College’s Research Computing Service was also represented by Dr Diego Alonso Álvarez, who presented an introduction to xarray and described the RSE team’s work on integrating it into the MUSE energy systems model.

Please see Diego’s slides for more information. Other talks and media from the event are available via #rslondonse19.

Thanks to RSLondon, the programme committee and its chair Dr Jeremy Cohen for organising an informative and stimulating day, and to the EPSRC for supporting the event. We’re looking forward to participating in future meetings and helping further strengthen the regional RSE community.

Cloud-first: Serverless alerts for trending repositories

This is the third and final post in a series describing activities funded by our RSE Cloud Computing Award. We are exploring the use of selected Microsoft Azure services to accelerate the delivery of RSE projects via a cloud-first approach.

In our previous two posts we described two ways of deploying web applications to Azure: firstly using a Virtual Machine in place of an on-premise server, and then using the App Service to run a Docker container. The former provides a means of provisioning an arbitrary machine much more rapidly than would traditionally be possible, and the latter gives us a seamless route from development to production – greatly reducing the burden of long-term maintenance and monitoring.

By taking these steps we’ve reduced our unit of deployment from a VM to a container and simplified the provisioning process accordingly. However, building a container, even when automated, incurs an overhead in time and space and the resultant artifact is still one step removed from our code. Can we do any better – perhaps by simply bundling our code and submitting it to a suitably capable runtime – without needing to understand a technology such as Docker?

Azure Functions provide a “serverless” compute service that can run code on-demand (i.e. in response to a trigger) without having to explicitly provision infrastructure. There are similarities with the App Service in terms of ease of management, but also some differences: principally that in return for some loss of flexibility in runtime environment you get an even simpler deployment mechanism and potentially much lower usage charges. Your code can be executed in response to a range of events, including webhooks, database triggers, spreadsheet updates or file uploads.

In this post we’ll demonstrate how to deploy a simple scheduled task: a Node.js script that sends a periodic email identifying the most active repositories within a GitHub organisation. It uses the GitHub GraphQL API to get the latest statistics (stars, forks and commits) and tracks the changes in a database. I use this script to receive weekly updates for trending repositories under ImperialCollegeLondon, but it’s easy to reconfigure for your own organisation.
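To give a feel for the logic involved, the following is a rough Python sketch of the two steps described above: requesting per-repository statistics from the GitHub GraphQL API, and comparing them with a stored snapshot. The actual script is written in Node.js and is not reproduced here; the query selection, the function names and the scoring rule (the sum of new stars, forks and commits) are assumptions for illustration.

```python
import json

# Query for one page of an organisation's repositories. The field names come
# from GitHub's public GraphQL schema; which fields to select is an assumption.
QUERY = """
query ($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        name
        stargazerCount
        forkCount
        defaultBranchRef { target { ... on Commit { history { totalCount } } } }
      }
    }
  }
}
"""


def build_request_body(org, cursor=None):
    """JSON body to POST to https://api.github.com/graphql with a bearer token."""
    return json.dumps({"query": QUERY, "variables": {"org": org, "cursor": cursor}})


def flatten(node):
    """Reduce one repository node to the three counters being tracked."""
    branch = node.get("defaultBranchRef")  # None for an empty repository
    commits = branch["target"]["history"]["totalCount"] if branch else 0
    return {
        "stars": node["stargazerCount"],
        "forks": node["forkCount"],
        "commits": commits,
    }


def rank_by_activity(previous, current):
    """Rank repositories by the total increase in their counters.

    Both arguments map repository name to a dict as returned by flatten().
    Repositories missing from the previous snapshot are counted from zero,
    and repositories with no change are left out.
    """
    zero = {"stars": 0, "forks": 0, "commits": 0}
    deltas = []
    for name, stats in current.items():
        before = previous.get(name, zero)
        delta = sum(stats[key] - before[key] for key in zero)
        if delta > 0:
            deltas.append((name, delta))
    return sorted(deltas, key=lambda item: (-item[1], item[0]))
```

The ranked list would form the body of the email, and the current snapshot would then replace the previous one in the database.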

As previously, we’ll use the Azure Cloud Shell, and arguments that you’ll want to set yourself are highlighted in bold.

Getting started

As usual we first create a resource group, and then add a storage account for our function:

az group create --name myResourceGroup --location westeurope
az storage account create --resource-group myResourceGroup --name ictrendingstore --sku Standard_LRS

Creating our function app

Then we create our app (a container for one or more functions):

az functionapp create --resource-group myResourceGroup --name ictrending --storage-account ictrendingstore --consumption-plan-location westeurope

And upgrade Node.js so that we can use ES6 features including async functions:

az functionapp config appsettings set --resource-group myResourceGroup --name ictrending --settings FUNCTIONS_EXTENSION_VERSION=beta WEBSITE_NODE_DEFAULT_VERSION=8.9.4

Deploying our code

Before we upload our code we configure the runtime with some required settings (organisation name, GitHub token, MongoDB URL and email settings):

az functionapp config appsettings set --resource-group myResourceGroup --name ictrending --settings GITHUB_ACCESS_TOKEN=xxx ORGANISATION=ImperialCollegeLondon MONGO_URL=mongodb://username:password@example.com/db SMTP_URL=smtp://username:password@example.com EMAIL_FROM=from@example.com EMAIL_TO=to@example.com

I’m using Azure’s MongoDB-compatible service (Cosmos DB) but there are many other hosting providers, including MongoDB themselves (Atlas).

We then simply upload a zipped copy of our code, its dependencies, and a trigger configuration (a timer for 8am on Mondays):

curl -LO https://github.com/ImperialCollegeLondon/trending/releases/download/v1.0.0/trending.zip
az functionapp deployment source config-zip --resource-group myResourceGroup --name ictrending --src trending.zip
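The trigger configuration inside the archive isn’t reproduced in this post. As a rough sketch, a timer binding for 8am on Mondays could be described as follows. The binding name `timer` is a placeholder, and the file actually shipped in trending.zip may differ.

```javascript
// Sketch of the contents of a function.json timer binding.
// Azure Functions timer schedules use six fields:
//   second  minute  hour  day-of-month  month  day-of-week
const functionJson = {
  bindings: [
    {
      name: "timer", // hypothetical parameter name passed to the function
      type: "timerTrigger",
      direction: "in",
      schedule: "0 0 8 * * 1", // 08:00:00 on day-of-week 1 (Monday)
    },
  ],
};

// Print the JSON that would be saved alongside the function's code.
console.log(JSON.stringify(functionJson, null, 2));
```

Timer schedules are evaluated in UTC unless the app is configured otherwise, so “8am” here means 8am UTC.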

You’ll subsequently receive your weekly email on Monday morning, assuming there has been some activity in your chosen organisation!

Inspecting the code reveals that it needs to comply with a (very lightweight) calling convention by exporting a default function and invoking a callback on the provided context, and it needs to be written in one of several supported languages. We uploaded our source as an archive but you can also deploy (and then update) code directly from source control.

Tidying up

As usual, you can delete your entire resource group, including your storage account and function app, by running:

az group delete --name myResourceGroup

Summary

In this post we’ve shown how zipping and uploading your source code can be sufficient to get an app into production. This is all without knowledge of any particular operating system or virtualisation technology, and at very low cost thanks to consumption-based charging and on-demand activation. Whether you choose to deliver your software as a VM, container or source archive will obviously depend on the nature of the application and its usage patterns, but this flexibility provides potentially great productivity gains, not only in deployment but also in long-term maintenance. In this instance it’s a great fit for short-lived scheduled tasks, but there are a huge number of alternative applications.

We’d like to thank Microsoft Azure for Research and the Software Sustainability Institute for their support of this project.

Cloud-first: Rapid webapp deployment using containers

This is the second in a series of posts describing activities funded by our RSE Cloud Computing Award. We are exploring the use of selected Microsoft Azure services to accelerate the delivery of RSE projects via a cloud-first approach.

In our previous post we described the deployment of a fairly typical web application to the cloud, using an Azure Virtual Machine in place of an on-premise server. Such VMs offer familiarity and a great deal of flexibility, but require initial provisioning followed by ongoing maintenance and monitoring. Our team at Imperial College is increasingly using containers to package applications and their dependencies, using Docker images as our unit of deployment. Can we do better than provisioning servers on a case-by-case basis to get web applications into production, and thereby more rapidly deliver services to our users?

The Azure App Service provides a solution named Web App for Containers, which essentially allows you to deploy a container directly without provisioning a VM. It handles updates to the underlying OS, load balancing and scaling. In this post we’ll demonstrate how to run pre-built and custom Docker images on Azure, without having to manually configure any OS or container runtime. As previously, we’ll use the Azure Cloud Shell, and arguments that you’ll want to set yourself are highlighted in bold.

Getting started

First of all we create an App Service plan. This only needs to be performed once for your active subscription:

az group create --name myResourceGroup --location "West Europe"
az appservice plan create --name myAppServicePlan --resource-group myResourceGroup --sku S1 --is-linux

Deploying a pre-built, public container image

It’s then just one command to run a Docker container. In this case we’ll deploy Nginx using its Docker Hub image:

az webapp create --resource-group myResourceGroup --plan myAppServicePlan --name ic-nginx --deployment-container-image-name nginx

We can then visit our public site at https://ic-nginx.azurewebsites.net/

You can use a custom DNS name by following these further instructions. Note that the site automatically has HTTPS enabled.

Decommissioning the webapp (thereby avoiding any further charges) is similarly straightforward:

az webapp delete --resource-group myResourceGroup --name ic-nginx

Deploying a custom container image

Running your own app is as simple as providing a valid container identifier to az webapp create.  This can point to either a public or private image on Docker Hub or any other container registry, including Azure’s native registry.

For demonstration purposes we’ll build a Datasette image to publish the UK responses from the 2017 RSE Survey. Datasette is a great tool for automatically converting an SQLite database to a public website, providing not only a means to browse and query the data (including query bookmarking) but also an API for programmatic access to the underlying data. It has a sister tool, csvs-to-sqlite, that takes CSV files and produces a suitable SQLite file.

First we need to install both tools, download the survey data, and convert it from CSV to SQLite:

pip install https://github.com/simonw/csvs-to-sqlite/zipball/master datasette
curl -O https://raw.githubusercontent.com/softwaresaved/international-survey/master/analysis/2017/uk/data/cleaned_data.csv
csvs-to-sqlite --table responses cleaned_data.csv uk-rse-survey-2017.db

Then we can create a Docker image containing the data and the Datasette app with one command, annotating with the appropriate licence information:

datasette package uk-rse-survey-2017.db \
  --tag mwoodbri/uk-rse-survey:2017 \
  --title "UK RSE Survey (2017)" \
  --license "Attribution 2.5 UK: Scotland (CC BY 2.5 SCOTLAND)" \
  --license_url "https://creativecommons.org/licenses/by/2.5/scotland/deed.en_GB" \
  --source "The University of Edinburgh on behalf of the Software Sustainability Institute" \
  --source_url "https://github.com/softwaresaved/international-survey"

Then we push the image to Docker Hub:

docker push mwoodbri/uk-rse-survey:2017

And, as previously, create an Azure Web App:

az webapp create --resource-group myResourceGroup --plan myAppServicePlan --name rse-survey --deployment-container-image-name mwoodbri/uk-rse-survey:2017

Using Datasette

After a brief delay the app is publicly available: https://rse-survey.azurewebsites.net/

Note that the App Service automatically detects the right port to expose (8001 in this case) and maps it to port 80.

Datasette enables you to run and bookmark SQL queries, for example a query that lists the contributors’ organisations in order of the number of responses received.
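As a hedged sketch, such a query can also be sent to Datasette’s JSON API by passing it in the `sql` query parameter. The column name `organisation` below is a placeholder, since the survey’s real column names are not listed in this post. The database name follows from the file name used above and the table name from the `--table responses` option.

```javascript
// Hypothetical column name: replace with the real column from the survey data.
const column = "organisation";

// Count responses per organisation, most frequent first.
const sql = `select "${column}" as organisation, count(*) as responses
from responses
group by "${column}"
order by responses desc`;

// Datasette serves query results as JSON at /<database>.json?sql=...
const url = new URL("https://rse-survey.azurewebsites.net/uk-rse-survey-2017.json");
url.searchParams.set("sql", sql); // handles the URL encoding for us

console.log(url.toString());
```

Because the whole query lives in the URL, the printed address can be bookmarked or shared like any other link.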

Private registries

If you’re hosting your images on a registry that requires authentication then you can split the previous az webapp create command into two steps: one to create the app and one to assign the relevant image. In this case we’ll use the Azure Container Registry, but this approach works with any Docker Hub-compatible registry.

First we’ll provision a container registry. These steps are unnecessary if you already have one:

az acr create --name myrepo --resource-group myResourceGroup --sku Basic --admin-enabled true
az acr credential show --name myrepo

Then we can log in to our private registry, tag our image with the registry’s name, and push it:

docker login myrepo.azurecr.io --username username
docker tag mwoodbri/uk-rse-survey:2017 myrepo.azurecr.io/uk-rse-survey:2017
docker push myrepo.azurecr.io/uk-rse-survey:2017

Finally we can create our webapp and configure it to use the image from our private registry:

az webapp create --resource-group myResourceGroup --plan myAppServicePlan --name rse-survey
az webapp config container set --resource-group myResourceGroup --name rse-survey --docker-custom-image-name myrepo.azurecr.io/uk-rse-survey:2017 --docker-registry-server-url https://myrepo.azurecr.io --docker-registry-server-user username --docker-registry-server-password password

The end result should be exactly the same as when using the same image but from the public registry.

Tidying up

As usual, you can delete your entire resource group, including your App Service plan, registry (if created) and webapps by running:

az group delete --name myResourceGroup

Summary

In this post we’ve demonstrated how a Docker image can be run on Azure using one command, and how to build and deploy a simple app that presents an interface for exploring data provided in CSV format. We’ve also shown how to use images from private registries.

This approach is ideal for deploying self-contained apps, but doesn’t present an immediate solution for orchestrating more complex, multi-container applications. We’ll revisit this in a subsequent post.

Many thanks to the Software Sustainability Institute for curating and sharing the RSE survey data (reused under CC BY 2.5 SCOTLAND) and Simon Willison for Datasette.