June 12, 2018

Each year, billions of dollars are spent globally on research, experiments and trials that aim to advance the world’s understanding of diseases such as cancer, heart disease, diabetes and Alzheimer’s. The goal is simple: to understand how the body fosters and grows disease, and to work towards solutions that prolong life and eventually eradicate these diseases entirely. The journey, however, is more complex. To make progress, researchers must apply a global library of knowledge and data, using advanced structures and frameworks to test new results and compare them against previous ones. One example is the study of how treatments affect cell nuclei, the topic at the core of this year’s Kaggle Data Science Challenge.

Neudesic’s Data Science Team, led by Leith Akkawi, joined a competition of 5,000 advanced analytics experts to create deep learning algorithms that shorten the time a researcher needs to accurately identify a cell’s nucleus, an otherwise labor-intensive, manual and time-consuming process.

Using Amazon Web Services’ suite of data and advanced analytics tools, the team leveraged Neudesic’s Insights Development Workbench and Advanced Analytics Solution Framework to earn a top 5% placement in the competition, which was judged by the overall average accuracy with which each pixel was identified as part of a nucleus. Here are a few of the major steps taken on the journey.

Finding the Data

Once the objective was clearly defined, the team partnered with Kaggle to acquire data for visual analysis from the Broad Institute of Harvard and MIT, a renowned institution for advancing the understanding of biology and human disease, and the non-profit data partner of the 2018 Data Science Bowl.

Building a Modern Data Platform for Advanced Analytics (the tools)

The model creation and testing phases of the competition required a robust platform capable of analyzing thousands of images at a rapid pace. Leveraging Amazon Web Services, the team architected a modern data platform built on graphics processing units (GPUs) capable of handling image analysis 100X faster than a basic computer. The architecture consisted of an Ubuntu Server 16.04 AMI running on a p2.xlarge instance. In total, the team built the environment and was ready to begin testing models in less than one hour.

Designing the Model

The model selection process led the team to a late 2017 paper published by Facebook AI Research (FAIR) on object instance segmentation. The approach described in the paper was later implemented as the Mask_RCNN model, demonstrated on detecting circles, squares and triangles, and placed on GitHub by Matterport. The team chose this model as its baseline because of its accuracy in object detection and instance segmentation. The Neudesic team then customized Mask_RCNN to detect and identify nuclei in medical images.

The final model was written in Python 3 using Keras libraries that leverage TensorFlow for mathematical computations. The team continually tuned the parameters and training schedule to increase the accuracy of the model. It took 160 experiments, conducted over the course of three months, to find the optimal model.

Achieving Results

With an overall score of 0.494, which represented the average accuracy with which each pixel was correctly identified as part of a nucleus, the Neudesic Data Science Team finished 167th out of 3,643 global teams, earning a silver medal in this year’s competition. The overall winner of the competition finished with a score of 0.631.
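To make the scoring description above concrete, here is a minimal sketch of a per-pixel accuracy averaged over images. It is an illustration only: the function names, the 0/1 mask representation and the tiny example masks are assumptions made for this sketch, and the competition's official scoring may be defined differently. The last helper simply checks the placement arithmetic from the figures quoted above.

```python
from typing import List, Sequence

# Hypothetical representation: a mask is a 2-D grid of 0/1 labels,
# where 1 means "this pixel belongs to a nucleus".
Mask = Sequence[Sequence[int]]


def pixel_accuracy(predicted: Mask, truth: Mask) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of rows")
    total = 0
    correct = 0
    for pred_row, true_row in zip(predicted, truth):
        if len(pred_row) != len(true_row):
            raise ValueError("mask rows must have the same length")
        for pred_pixel, true_pixel in zip(pred_row, true_row):
            total += 1
            correct += int(pred_pixel == true_pixel)
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    return correct / total


def mean_pixel_accuracy(predictions: List[Mask], truths: List[Mask]) -> float:
    """Average the per-image pixel accuracy over a set of images."""
    if len(predictions) != len(truths) or not predictions:
        raise ValueError("need the same, non-zero number of masks on both sides")
    scores = [pixel_accuracy(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)


def placement_fraction(rank: int, teams: int) -> float:
    """Share of the field at or above a given rank (smaller is better)."""
    return rank / teams


if __name__ == "__main__":
    # Two tiny made-up 2x2 images: one with a single wrong pixel, one perfect.
    predicted = [[[1, 0], [0, 1]], [[0, 0], [1, 1]]]
    truth = [[[1, 0], [1, 1]], [[0, 0], [1, 1]]]
    print(mean_pixel_accuracy(predicted, truth))
    print(f"{placement_fraction(167, 3643):.1%}")
```

With the figures quoted above, 167 out of 3,643 works out to roughly 4.6% of the field, which is consistent with a top 5% placement.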

Next Steps

The global effort behind this year’s Data Science Bowl gave researchers a technological advancement in the process of identifying nuclei. The program was so impactful that the code and models from the top five winners will be applied to enhance research going forward.

This competition showed that effective predictive analytics can accelerate a researcher’s ability to understand the drivers of common yet fatal diseases. In this scenario, the effort was meant to save lives, but in other scenarios the Neudesic Advanced Analytics Solution Framework can be cleanly transferred to predict other important business metrics such as load forecasting, theft detection, manufacturer defects, and customer loyalty and churn, to name a few.

How does your data strategy map to predictive scenarios that can accelerate you to positive business outcomes?