Dealing with the data challenges

The OUTBREAK platform takes a world-leading AI approach to predicting the threat of an antibiotic-resistant infection.

Cloud and Data Management Lead Denis Bauer explains the novel approach to gathering and analysing data from humans, animals and the environment…

Q: What is your role with the OUTBREAK project?

We quickly realised that OUTBREAK will be an IT-heavy project, as well as a science-heavy project, because it is pulling together so many different data sources from so many different agencies for the first time. I am on the data analytics, maintenance and governance sides: working out what kind of novel cloud-based architecture we need so that we can bring in data from many different sources, with many different constraints, and still be able to run analyses in real time.

Q: You’ve been working with those cloud-based solutions for a long time. What are the particular challenges for OUTBREAK?

Typically, when you have a cloud solution, you stay with one provider, but OUTBREAK is bringing together information from different agencies, which store their information with different cloud providers. So we can either replicate all their data into one cloud provider or try to integrate their different clouds.

Ultimately, a cloud is just a large number of computers that have connections between them natively. An AWS computer talking to another AWS computer is conceptually not too different from an AWS computer talking to an Oracle computer. It’s more a matter of putting the right framework in place for authentication and data transfer.

Q: Why have you chosen the harder path of trying to integrate the cloud providers?

It’ll be much more flexible to integrate the data. If you try to replicate all the data, there will always be a time lag and a source of error. In the future, talking between cloud providers will be straightforward because technically there’s nothing that stops it.

Q: What are the main cloud providers you need to integrate?

There are lots popping up all the time. The main ones are Amazon Web Services, Google Cloud Platform, Oracle, Microsoft Azure and Intel.

Q: When you consider the amount of data that potentially could be included in OUTBREAK, is it a large amount of data compared to what is already currently being analysed and monitored?

It is hard to benchmark against other initiatives because OUTBREAK is the first time we are attempting to combine data sets from so many different sources. OUTBREAK’s largest data set will be smaller than, for example, that of a human genome consortium, with three billion letters in the genome and a couple of hundred thousand individuals (a matrix of several hundred trillion entries).

However, we will be dealing with metagenomic species, where the number of individual organisms can range in the billions, combined with human health records and environmental sensor data. Who knows how big this collective data set is going to get.

As another example, let’s assume we have clinical data (patient information) and want to link that to information about sewage. To do that, we need a link between the patient and the hospital, and we need to know where the sewage is flowing to or from. What kind of waterways are in that environment, and how frequently are they flushed? Ultimately, in order to make a conceptual statement of “because A happened, D was the outcome”, there were probably two other things in between that needed to be in place for this particular pathway to occur.

At this stage, we don’t have that data, and for some of it we don’t yet know where the sources of data are, so we need to plan this uncertainty into our cloud architecture.
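The chain of links described in this answer (patient to hospital, hospital to sewage catchment, catchment to samples) can be pictured as a series of lookups in which any step may still be missing. The following is only a minimal sketch of that idea, not OUTBREAK's actual design: every identifier, table and record in it is made up for illustration.

```python
from typing import Optional

# Hypothetical lookup tables; all identifiers and records are invented.
patient_to_hospital = {"patient-1": "hospital-A", "patient-2": "hospital-B"}
# The catchment for hospital-B is deliberately absent: a link not yet found.
hospital_to_catchment = {"hospital-A": "catchment-north"}
catchment_to_samples = {"catchment-north": ["sample-17", "sample-18"]}


def sewage_samples_for(patient_id: str) -> Optional[list[str]]:
    """Follow patient -> hospital -> catchment -> samples.

    Returns None as soon as any link in the chain is missing.
    """
    hospital = patient_to_hospital.get(patient_id)
    if hospital is None:
        return None
    catchment = hospital_to_catchment.get(hospital)
    if catchment is None:
        return None
    return catchment_to_samples.get(catchment)


print(sewage_samples_for("patient-1"))  # ['sample-17', 'sample-18']
print(sewage_samples_for("patient-2"))  # None
```

Returning None for an incomplete chain, rather than raising an error, is one simple way to treat a missing link as an expected state instead of a failure.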

Q: So, it’s a bit of a treasure hunt, trying to find the data and then working out how you’re going to integrate the data so that it means something?

Yes, and we are learning from other incidents along the way. For example, the COVID-19 outbreak is not too dissimilar to Antimicrobial Resistance (AMR). For COVID-19, we don’t have treatments available so the consequences are very direct and visible. For AMR we still have working drugs to treat people but we don’t know exactly how many people die from AMR.

AMR is currently a hidden pandemic but is on a trajectory to reach 10 million deaths per year by 2050 without intervention. So, everything we learn from dealing with COVID, we can apply to AMR.

Q: What other tech issues do you see for making OUTBREAK a reality?

The biggest challenge for me is identifying what kind of data sets should go in and getting access to those data sets. The next biggest challenge will be the quality of the data because those data sets were never meant to be integrated. We’ll need to find ways of potentially salvaging or putting in quality control metrics to say, “this dataset provides high quality information but for this one we need to do XYZ to reach a quality threshold.”
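One way to picture a quality control metric with a threshold is a completeness check: the share of required values that are actually present in a data set. This is a sketch under stated assumptions rather than the project's method; the function names, field names, sample records and the 0.9 threshold are all hypothetical.

```python
def completeness(records, required_fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return present / total


def passes_quality_gate(records, required_fields, threshold=0.9):
    """True if the data set meets the (hypothetical) completeness threshold."""
    return completeness(records, required_fields) >= threshold


# Invented example records: one of the four required values is missing.
samples = [
    {"site": "S1", "collected_on": "2020-03-01"},
    {"site": "S2", "collected_on": None},
]
fields = ["site", "collected_on"]

print(completeness(samples, fields))         # 0.75
print(passes_quality_gate(samples, fields))  # False
```

A data set that fails the gate would be the kind that needs further work before it reaches the quality threshold.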

The third biggest challenge will be around the cloud architecture. The cloud-native architectures we need to use are quite new, so there will be no stock-standard solution that company X has used for 20 years that we can just replicate. It will be our very own purpose-built, bespoke concoction. That gives us fantastic flexibility but it also puts a lot of emphasis on development; it becomes a science of developing the right cloud architecture that is fit for purpose.

Q: Do you have any inkling of how long it might take to build a useful system?

I think it would be six months to get a first prototype up and running; we’re very keen on agile development, where we put something together rapidly to get it out for people to test in their environment – not to make inferences or decisions but to see whether the information coming out is robust and fits with their workflows and thinking. From there, we build in more functionality, bringing more datasets in a co-design process with the stakeholders.

Q: What gets you excited about the OUTBREAK project?

Antimicrobial Resistance is probably the single biggest challenge we face as humans and time is running out fast for finding a solution. This means we need to be able to evaluate our risk and change our behaviour accordingly, and this system allows us to do so in a data-driven way.

Dr Denis Bauer is an internationally recognised expert in machine learning. She is the transformational bioinformatics leader at Australia’s national science agency, CSIRO.
