By Nik Stanbridge, VP Marketing, Arkivum.
One of the biggest phenomena of modern technology is Big Data. Even seeing the phrase written down makes it look big, and because it is generally capitalised it looks and sounds bigger still. But how many organisations actually know what it means? And how big is big?
Over the last few years there have been plenty of definitions doing the rounds. Data volumes, however, were already a concern back in 2001, when industry analyst Doug Laney described the "3Vs" (volume, variety and velocity) as the key "data management challenges" for enterprises. These are the same "3Vs" that just about anyone attempting to define or describe Big Data has used in the last few years.
The Oxford English Dictionary (OED) defines Big Data as "data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges", while trusty Wikipedia defines Big Data as "an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications". Right now, in the world of academic research and also in life sciences, I think many organisations are truly starting to get to grips with what Big Data really means and how big is big. Certainly scientists, research institutes and governments all over the world are starting to struggle to find reliable, long-term storage space for their ever-increasing volumes of data.
I recently read two articles that explain how Big Data is having an impact on both life sciences and academia, and that attempt to put some numbers around 'how big is big'.
The first article, in Inquisitr, talks about how scientists are generating data about our genomes and the fact that they will soon be facing an acute shortage of storage space. Raw genome data is being generated at a faster pace even than YouTube data. The article goes on to say that companies like Google, Facebook and Twitter are taking this all in their stride simply because their businesses are all about helping us share our 'small data' while making it Big in the process. The numbers Google faces are eye-watering: 100 petabytes (PB) of data a year from YouTube alone. That's Big. And it all has to be on fast, expensive, always-on storage because their job is to serve data up immediately.
However, one organisation's Big is another organisation's really quite small. The article also says that next-generation gene sequencing (NGS) will dwarf what we currently call Big Data and will truly redefine what we mean by the term. By 2025 (that's just ten years away) NGS activities will be generating anything up to 40 exabytes (EB) a year. That's 400 times the YouTube figure.
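The comparison between those two figures is simple unit arithmetic. As a quick sketch, assuming decimal (SI) units where 1 EB = 1,000 PB, and using only the two numbers quoted above (the variable names are purely illustrative):

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: the article's PB and EB are decimal, not binary (PiB/EiB).
PB = 10**15
EB = 10**18

youtube_per_year = 100 * PB  # YouTube figure quoted in the article
ngs_per_year = 40 * EB       # upper NGS estimate for 2025 quoted in the article

# How many "YouTubes" of data would NGS produce each year?
ratio = ngs_per_year // youtube_per_year
print(f"NGS upper estimate is {ratio} times the YouTube figure")
```

On those figures the upper NGS estimate works out at 400 times the yearly YouTube volume.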
The gene sequencing that will generate this huge volume of data will bring massive benefits to scientists and ultimately to us in terms of our health. That's because sequencing makes it possible to predict and diagnose illnesses and inherited conditions, so the value of this data is incredibly high. Even more crucially, the cost of performing the sequencing is coming down at a rapid rate. So what is happening now is that rather than requesting a single test for a specific condition, clinicians are simply requesting that the whole genome be sequenced, as the benefits are considered to outweigh the cost.
The other article, in The Guardian, asks a range of scientists in different fields about their use of (and contribution to) Big Data. One of the messages coming out of this article is that the instruments and techniques used in their specific fields are generating a lot of data, data that has far-reaching implications and uses, and data that needs to be kept for the long term. In fact, the value of the data may not be realised now but can and will be at some point in the future. The scientists in the article are saying that they don't have the full picture of the value of their data; all they know is that it is valuable.
So this leads me to ask: what do you do with something that is clearly valuable? With something that you can't exactly value today, but that you just know is going to be more valuable tomorrow?
The answer is that you keep it very, very safe. And if that something is data, where is this safe place? It's in a data archive that is designed for economical long-term storage, where the potential value of the data far outweighs the storage costs. In this way, the data will actually be archived and its value will actually be realised in the long term.
My closing thought is that, in the years ahead, the largest generators of data will be the people who want to get their genome sequenced. Over one billion people, mostly in developed countries, will soon demand to have their genome sequenced, primarily because it opens up some amazing health solutions. That's when the scientists will truly begin to worry about Big Data and appropriate storage facilities. That's also when we will all start to get to grips with and truly understand how big 'big' really is.