‘Big Data’ refers to large, complex, longitudinal and dispersed data sets such as those generated by the healthcare industry. According to the McKinsey Global Institute, such data could generate substantial financial value for the U.S. healthcare system: an estimated $300 billion per year and ~0.7% annual productivity growth.
The largest share of healthcare data probably originates from medical imaging, including computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI) scans, which produce hundreds of gigabytes of data each day in a single medical institution. These data have to be analyzed, backed up and archived.
Large data volumes also derive from Electronic Medical Records (EMRs), with terabytes of information stored in medical databases (patient histories, test results, prescription records, treatment records and other sensitive information).
Genomics is increasingly pushing the limits of electronic storage and data analytics because genomic sequencing takes up hundreds of gigabytes of storage capacity.
Finally, the volume of biomedical literature doubles every seven years and already amounts to more information than a single computer can store and process.
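To make the doubling figure concrete, the short sketch below converts a seven-year doubling time into the equivalent compound annual growth rate. It assumes steady exponential growth, which is a simplification; the function names are illustrative and not taken from any cited source.

```python
DOUBLING_TIME_YEARS = 7.0  # doubling time quoted in the text


def annual_growth_rate(doubling_time_years: float) -> float:
    """Compound annual growth rate implied by a fixed doubling time."""
    return 2.0 ** (1.0 / doubling_time_years) - 1.0


def growth_factor(years: float, doubling_time_years: float) -> float:
    """Factor by which the volume multiplies after the given number of years."""
    return 2.0 ** (years / doubling_time_years)


rate = annual_growth_rate(DOUBLING_TIME_YEARS)
print(f"Implied annual growth: {rate:.1%}")  # roughly 10.4% per year
print(f"Growth over 21 years: x{growth_factor(21, DOUBLING_TIME_YEARS):.0f}")  # three doublings, x8
```

Under this assumption, a seven-year doubling time corresponds to roughly 10% growth per year, so storage sized for today's literature would need about eight times the capacity after 21 years.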
Innovative technologies already used in other areas may be adapted to healthcare demands, and workload-optimized systems may ensure that applications use the best resources for high performance, reliability and efficiency.
A number of innovative technologies have been developed to handle ‘Big Data’ (Hadoop, NoSQL, data warehouse appliances and columnar databases), and numerous analytical tools are available (e.g. Apache Mahout).
The current and future focus of ‘Big Data’ management in medicine includes:
- Information Gathering: Collection, aggregation and storage of information
- Data Mining: Finding all significant factor combinations that potentially solve problems, using Natural Language Processing (NLP), Hadoop, Machine Learning and data reporting techniques
- Constraint Resolution: Reducing data to only valid combinations and bringing data volume down to unique sets
- Data Optimization: Scoring combinations against pre-defined criteria
- Visualization: Leveraging visualization technology to present the unique data sets
- Centralizing Data Management: Reducing the data footprint and virtualizing recycle processes and data storage through centralization, which allows ‘Big Data’ to be converted into ‘Small Data’ that can be managed like virtual data.
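The gathering, constraint resolution and optimization steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any specific product: the `Record` type, the validity rule (a positive dose), the scoring criterion (a maximum dose) and all sample values are hypothetical and synthetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical prescription record; frozen so it is hashable."""
    patient_id: str
    drug: str
    dose_mg: float


def gather(sources):
    """Information gathering: merge records from several sources into one list."""
    return [rec for source in sources for rec in source]


def resolve_constraints(records):
    """Constraint resolution: keep only valid records and drop exact duplicates."""
    seen = set()
    unique = []
    for rec in records:
        if rec.dose_mg <= 0:  # hypothetical validity rule
            continue
        if rec in seen:  # exact duplicate of an earlier record
            continue
        seen.add(rec)
        unique.append(rec)
    return unique


def score(record, max_dose_mg):
    """Data optimization: score a record against a pre-defined criterion."""
    return 1.0 if record.dose_mg <= max_dose_mg else 0.0


# Synthetic example data from two hypothetical sites.
site_a = [Record("p1", "drug_x", 50.0), Record("p2", "drug_x", 500.0)]
site_b = [Record("p1", "drug_x", 50.0), Record("p3", "drug_y", -1.0)]

unique = resolve_constraints(gather([site_a, site_b]))
scores = {rec.patient_id: score(rec, max_dose_mg=100.0) for rec in unique}
print(len(unique), scores)  # 2 {'p1': 1.0, 'p2': 0.0}
```

Four gathered records shrink to two unique, valid ones, which shows on a small scale how constraint resolution reduces the data volume before scoring.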
Benefits of this ‘Big Data’ management include less time required to process data, improved data security and more precise data analysis, since entire data copies are visible. Besides improving diagnostics, prevention and patient care, this can help prevent drug interaction problems and other diagnostic mistakes, which account for 20% of medical errors today.