Telling Stories with Data
Let’s start with an easy one. What kind of data do you work with and what do you do with it?
We use population-level sequencing datasets from TCGA, mutation datasets from COSMIC, ClinVar, and HGMD, and curated databases from other labs. We use discarded datasets, negative datasets, already published datasets, anything and everything. We develop and use structural genomics, mathematical modelling, and machine learning tools to analyse mutations that map to noncoding regions of the human genome.
Tell us how you think you can use data to make a difference in your field.
We live on these datasets. Biological data is going to exceed 2.5 exabytes in the next two years, and the bottleneck is the analysis of these datasets. Our job is to find patterns in them. Rare variants and driver mutations become significant and identifiable only when we look for them in a population context.
How do you talk about your data to someone outside of academia?
For us it is not difficult. The datasets we use are generated and curated by governmental and international consortiums, and they have done the bulk of the publicity. For example, the TCGA dataset has all kinds of data from thousands of cancer patients and is curated by the NIH. The power of this data is there for all to see. I just say we try to aid cancer diagnosis by crawling through these datasets to find patterns.
What data-related challenges do you have to deal with in your research environment?
We are happy with the publicly available datasets. Our problem starts with the datasets we collect ourselves. How to store them, analyse them, and make them available for everyone to use are the questions we are trying to answer all the time.
How do you think these challenges might be overcome?
I am an ardent proponent of cloud storage and computation. I believe that is the future. I am also aware that some countries are concerned about data migrating outside their geographical boundaries.
If you were in charge, what data-related rule would you introduce?
I am not going to make up anything new. Past US Presidents have made laws such as the requirement that any data generated with public funds be made available.
Governmental organisations should demystify cloud-based storage and computation processes. People are unduly worried. People wilfully give away more personal data on Facebook, Twitter, and Instagram than they do through genome sequences collected by public consortiums.
Tell us about your happiest data moment.
It is not one moment, it is a series of moments up until now. I can run a viable research program with no startup money or funds just by scavenging through publicly available datasets.
What advice do you have for someone who is just embarking on a career in your field?
Learn machine learning and cloud computing.
What do you think the future of research data looks like?
A lot more data analysis than data generation.
There is A LOT of data out there about all sorts of things and it is being collected all the time. Does anything frighten you about data?
I am in fact excited. I believe we need to train more data scientists. We are in good times. Data is becoming truly democratic!