Building Visualizations and Predictive Models - iSchool

A summer internship can be one of the key aspects of a student’s educational journey within the iSchool’s Applied Data Science Master’s program. It lets us dip our toes in the ‘real world’ and apply what we have learned in a practical setting. It can also give us a sense of fulfillment (and money!).

After a rigorous process of interviewing, I was elated to finally bag a summer internship at Mutual of Omaha as a data science intern. As an early professional in the data science world, I saw this as the ideal opportunity to get a sense of how data is leveraged in an organization, and how the insurance industry benefits from it.

Mutual of Omaha is a Fortune 500 mutual insurance and financial services company based in, not surprisingly, Omaha, Nebraska (in the USA). The company provides a variety of financial services, including Medicare Supplement, life insurance, long-term care coverage and annuities, as well as group coverage including life, disability and 401(k) plans.

In this post, I will describe three projects I did during the summer as well as provide three key tips for others.

Project 1: Code-as-a-Service: An Auto EDA reporting tool

One of my first summer tasks was to optimize the exploratory data analysis (EDA) that was done within my organization.

With the help of my manager, I envisioned a Code-as-a-Service tool that would automate this manual, time-intensive process and reduce the turnaround time for EDA. I developed a Python Jupyter notebook that took the data and some user-specified columns as input and returned summaries, analysis, and underlying trends. The steps are outlined below.

Get the user data and the specified columns as input. This helped us focus our analysis rather than shooting in the dark and getting verbose results.
Clean the data. For example, I checked for NaN values, but I needed to be careful, as we did not want to eliminate any feature just because it had a few NaN values.
Change data types. This was important because of the limitations of the Linux server we were using. In short, it was important to save memory and allocate data types based on the values in the column.
Perform analysis. The meat of the dish, this part was the most challenging and also the most fun! After some research, we settled on three ways to analyze our data: univariate, bivariate and multivariate analysis, each with its own visualizations.
Create interactive visualizations. Finally, to enhance the tool, I used Plotly to develop an interactive GUI that would be more user-friendly for people across the organization.
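
The cleaning, data type and analysis steps above can be sketched roughly as follows. This is a minimal illustration with pandas, not the actual tool: the function names, the 50% NaN threshold, the category threshold and the sample data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean(df, columns, max_nan_frac=0.5):
    """Keep the requested columns, dropping only those that are mostly NaN."""
    subset = df[columns]
    nan_frac = subset.isna().mean()  # share of missing values per column
    keep = nan_frac[nan_frac <= max_nan_frac].index
    return subset[keep].copy()


def shrink_dtypes(df, max_category_frac=0.5):
    """Downcast numeric columns and store repetitive text as categories."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            # float32 halves memory but also reduces precision
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_object_dtype(col) or isinstance(col.dtype, pd.StringDtype):
            if col.nunique() <= max_category_frac * len(col):
                out[name] = col.astype("category")
    return out


def summarize(df):
    """Return simple univariate and bivariate summaries of numeric columns."""
    numeric = df.select_dtypes("number")
    return {
        "univariate": numeric.describe().T,  # count, mean, std, quartiles
        "bivariate": numeric.corr(),         # pairwise correlations
    }


# Made-up data for illustration only
raw = pd.DataFrame({
    "age": [34, 51, 29, 62, 45, 38],
    "premium": [120.5, 210.0, np.nan, 305.25, 180.0, 150.75],
    "state": ["NE", "NY", "NE", "IA", "NY", "NE"],
    "notes": [np.nan, np.nan, np.nan, np.nan, "call back", np.nan],
})

cleaned = clean(raw, ["age", "premium", "state", "notes"])
compact = shrink_dtypes(cleaned)
report = summarize(compact)
```

Here the "premium" column survives despite one missing value, while "notes" is dropped because it is almost entirely empty. The thresholds would need tuning for real data.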

My tool was well-received by the organization.

Project 2: Unbundling case study

Unbundling is a common scenario in medical billing, where procedures that should be billed together under a single code are billed separately, so people end up being charged more than they should be. This is a hard problem in the industry. To address it, we developed an analytical approach to identify bills that showed signs of unbundling.

For this task, I enlisted the help of a data engineer who helped me pull roughly 3.5 million rows of data. Next, I created potential scenarios of unbundling. I read up on which procedure codes could and could not be billed together. I also explored the impact of this type of billing on the insurance industry. Then, I analyzed Medicare providers, individual bills, US states, and costs incurred to identify potential cases of unbundling.
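
One way to picture the core check is to flag any bill that contains a pair of procedure codes that should not be billed separately, then count flags per provider. The sketch below is only an illustration of that idea: the codes, the disallowed pairs, the record layout and the function name are all hypothetical, and real rules would come from a published code-edit table.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical code pairs that should not be billed separately on one claim
DISALLOWED_PAIRS = {
    frozenset({"A100", "A101"}),
    frozenset({"B200", "B205"}),
}


def flag_unbundling(claim_lines, disallowed_pairs=DISALLOWED_PAIRS):
    """Return (claim_id, provider_id, code_pair) for each suspicious pair.

    claim_lines is an iterable of (claim_id, provider_id, procedure_code).
    """
    codes_by_claim = defaultdict(set)
    provider_of = {}
    for claim_id, provider_id, code in claim_lines:
        codes_by_claim[claim_id].add(code)
        provider_of[claim_id] = provider_id

    flagged = []
    for claim_id, codes in codes_by_claim.items():
        # Check every pair of codes that appears on the same claim
        for pair in combinations(sorted(codes), 2):
            if frozenset(pair) in disallowed_pairs:
                flagged.append((claim_id, provider_of[claim_id], pair))
    return flagged


# Made-up claim lines for illustration only
claims = [
    ("C1", "P1", "A100"),
    ("C1", "P1", "A101"),
    ("C2", "P1", "A100"),
    ("C3", "P2", "B200"),
    ("C3", "P2", "B205"),
    ("C3", "P2", "Z999"),
    ("C4", "P2", "Z999"),
]

flagged = flag_unbundling(claims)
flags_per_provider = Counter(provider for _, provider, _ in flagged)
```

A flag here is only a candidate for review, not proof of incorrect billing, which is why discussing the results with a billing specialist matters.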

I then shared and discussed the results with an external medical billing specialist.

Project 3: Model interpretability using LIME

In recent times, the regulations on machine learning models have increased considerably. Thus, it was important that we could ‘explain’ why our model was making a certain prediction.

Toward the goal of explainability, I worked on producing explanations for the predictions of a model that was already in production. For this work, I used LIME (Local Interpretable Model-agnostic Explanations), which is an approach to interpreting machine learning (ML) models. LIME assumes that a complex model can be approximated, in the neighborhood of any single prediction, by a simple linear model. These local linear models are easier to explain, as we can draw a decision boundary in a 2D space.

I applied LIME to an existing model that accepted or rejected claims. Using LIME, I aggregated the explanations over the entire dataset. I then gave a presentation to 50+ people about the importance of model interpretability, regulations surrounding ML models, and my research on this specific topic.
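
To make the idea concrete, here is a from-scratch sketch of a LIME-style local explanation and of one possible way to aggregate explanations over a dataset. It uses only NumPy and does not use the lime package's API. The black-box model, the kernel settings and the choice of aggregating by mean absolute coefficient are all assumptions made for the example.

```python
import numpy as np


def black_box(X):
    """Stand-in for a production model: probability that a claim is accepted."""
    # Hypothetical nonlinear rule; the third feature is deliberately unused
    z = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2
    return 1.0 / (1.0 + np.exp(-z))


def explain_instance(predict, x, n_samples=2000, scale=0.5,
                     kernel_width=0.75, rng=None):
    """Fit a proximity-weighted linear model around x; return its slopes."""
    rng = np.random.default_rng(rng)
    # 1. Perturb the instance with Gaussian noise and query the model
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict(Z)
    # 2. Weight each sample by its closeness to x (exponential kernel)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Weighted least squares with an intercept column
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # drop the intercept, keep one slope per feature


def aggregate_explanations(predict, X, **kwargs):
    """Average the absolute local slopes over all rows of X."""
    coefs = np.array([explain_instance(predict, x, **kwargs) for x in X])
    return np.abs(coefs).mean(axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # made-up data for illustration only

local = explain_instance(black_box, np.zeros(3), rng=0)
importance = aggregate_explanations(black_box, X, rng=rng)
```

In this toy setup the unused third feature should get a local slope and an aggregated importance close to zero, which is the kind of signal that can guide feature engineering.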

We also used these results to perform selective feature engineering and improve the model’s performance and interpretability.

Conclusion and three key tips

Data science is an ever-growing field and there is a lot you can learn from your internship team. Here are my three key takeaways:

Don’t be afraid to ask questions
Try and test all available tools to decide which one fits the needs of your project
Make mistakes, and then learn from them!
