Each MPS student completes a two-semester project, which is supported by core courses. The project involves large-scale data analysis and is often completed in collaboration with a private company.
Four projects from the 2022 Spring semester
The three Best MPS Project Awards are given to
Team 2: Qingyi Fang, Sukriti Poddar, Toshihiro Tokuyama, and Xiangnan Zheng
Advisor: Dr. Martin Wells
Project Title: Identifying Spending Trajectories for Medicare Patients and the Factors that Affect the Pattern
Our project was sponsored by Trinity, a leading life sciences consulting firm that provides strategic and tactical insights to clients worldwide. To understand the overall distribution of costs of Medicare patients, identifying high-cost Medicare patients is crucial, because total net spending on Medicare is dominated by patients with very high spending. Therefore, our objective was to evaluate and gain insight into variables that can predict high spenders using the Centers for Medicare & Medicaid Services (CMS) data.
Quantile regression and logistic regression were applied to the CMS claims data to determine which variables were significant and how much predictive power they had. The significance of each variable was assessed using the p-values from the quantile regression output. The predictive power was assessed using two models. The purpose of the first model was to determine whether the variables could predict whether the patient’s next year’s payment would be higher than a specific quantile. The second model’s purpose was to predict patients’ membership in cost groups. This model focused on whether a patient remained in the same cost group the next year, based on the current year’s data. Models were fitted with prior-year total payment and other variables that were feature-engineered from the raw claims data. The accuracy score and F1 score were used to evaluate the models.
Most of the feature-engineered variables were statistically significant under quantile regression at the 60% quantile. However, the predictive power of these variables was not very high when the outcome was converted to a binary label at the 50%, 60%, 70%, and 80% quantiles. The label was encoded as 1 if the value was higher than the set quantile and 0 if lower. The F1 score ranged from 0.27 to 0.68, with the accuracy score ranging from 0.66 to 0.76. Similar results were found for predicting cost group membership, where the accuracy score ranged from 0.631 to 0.676.
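The label encoding and the two evaluation metrics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's code: the function names (`quantile_threshold`, `encode_labels`, `accuracy`, `f1_score`) and the toy payment values are hypothetical, and the quantile is computed with the standard library's inclusive interpolation, which may differ from the method used in the project.

```python
from statistics import quantiles


def quantile_threshold(values, q):
    """Return the q-th percentile (q an integer from 1 to 99) of values.

    Assumption: inclusive linear interpolation, as implemented by
    statistics.quantiles; the project may have used another convention.
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[q - 1]


def encode_labels(values, threshold):
    """Encode 1 if the value is higher than the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def accuracy(y_true, y_pred):
    """Share of observations whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical next-year payments, for illustration only.
    payments = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    threshold = quantile_threshold(payments, 50)
    labels = encode_labels(payments, threshold)
    print(threshold, labels)
```

Because high spenders become rarer as the quantile rises, accuracy can stay high even when few high spenders are found, which is why a metric such as F1 is reported alongside it.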
By using quantile regression, our study identified statistically significant predictors of high-cost patients. However, the binary classification models built on these statistically significant variables did not show high predictive power. Although the significant variables provide a substantial baseline, more predictors would be necessary to achieve higher predictive power.
The study examines cost distribution through various scopes and could help inform the design, application, and timing of potential interventions.
Team 5: Jimi He, Joo Hyun Kim, Yizhou Liu, Namratha Sathish, and Shuang Wu
Advisor: Dr. David Matteson
Project Title: Critical Risk Index for Space Weather
ABSTRACT
This project creates a stability index that captures variations in magnetic fields using large-scale space-weather data obtained from ground stations across the globe, with a focus on stations in the North American region. The index can be used as a proxy for potential space or solar activity over a period of time.
Keywords: Space weather, risk index, time series analysis, big data
Team 15: Ziyi Li, Tsz Fei Luk, Yifan Qian, and Zhe Song
Advisor: Dr. Xiaolong Yang
Project Title: Machine Learning Methods to Predict State of Charge and Battery Remaining Life in Electric Aircrafts
The honorable-mention award is given to
Team 11: Yimin Chen, Minxuan Hu, Miranda Lund, Weichen Zhang, and Wenzhuo Zhao
Advisor: Dr. Sreyoshi Das
Project Title: Statistical Evaluation of Ithaca Crashes
Four projects from the 2019-2020 academic year
Virtual Reality (VR) Simulator Operator Behavior Analysis
Project team: Yukun Cheng, Zackary Downey, Shan Lu, and Yangyang Wang
Sponsor: Raymond Corporation
Advisor: Dr. Xiaolong Yang
Recognition: Best Project Award recipient
In January of 2018, Raymond launched the Raymond VR Simulator, a product that enables operators to learn how to operate a Raymond forklift in a simulated warehouse environment while standing on a real truck. The system connects directly to the lift truck through the Simulation Port, disabling all vehicle motion so that manipulating the controls drives a simulated truck in virtual reality. Operators complete a series of guided lessons in VR and receive a score at the end of each lesson. A recording of the operator completing the lessons is available for replay so it can be reviewed with a trainer for further coaching.
One major appeal of the VR simulator is the amount of data it can collect on operators and the resulting insights this data can provide. This enables customers to continuously improve the performance of operators by identifying trends and correlations between leading indicators such as operator performance on the simulator and how these translate to how the operator will behave in the real warehouse. The objective of this project is to analyze the data in the replay files to determine if any of the data can be correlated with operator behavior.
In our paper, we introduce The Raymond Corporation, a major forklift manufacturer located in Greene, NY; its VR Simulator; and the data analysis performed on the output of the VR Simulator. To meet the requirements of the project, we needed to use a variety of methods to answer a list of potential questions provided by Raymond Corporation.
First, we describe some of the questions provided. Then, we describe the data sets provided, both core and supplementary, as well as updates made over the semester. We then walk through some initial data exploration techniques and results, and take an initial look at using supervised machine learning to answer one of the main questions. We propose a new variable called the “Aggressiveness Index”, which allows us to quantify the activity of a user at a specific time and simplify the data set. Our supervised machine learning methods include (but are not limited to) LDA, Naive Bayes, KNN, Hidden Markov Models, and XGBoost.
Due to data classification imbalances described later, we will propose an alternate data grouping technique to improve algorithm inputs. Using these new inputs, we refer to some improved supervised machine learning methods for predicting existence of user penalties at specific times. Finally, we use some unsupervised machine learning methods to answer a couple of the other originally proposed questions. These methods mostly center around clustering and visualization of proposed clusters to map certain characteristics to types of users.
We summarize our work, provide final conclusions, and describe some possible future enhancements to our efforts. In summary, we answer four of the main questions provided by Raymond Corporation:
- Is there a relationship between head rotation, vehicle position, and penalties?
- Can we use request variables to cluster users?
- Is there a relationship between horn use and the total score for a user?
- Can we predict success or failure from all provided variables?
The proposed answers to these may provide insight to future data analysis and data usage done by the company.
Default Risk: Swapping how we Measure Loan to Values
Project team: Junyi Bao, Yutong Hou, Jacob Schoifet, and Yuyang Ye
Sponsor: Home Diversification Corp.
Advisor: Dr. Sumanta Basu
Recognition: Best Project Award recipient
Home ownership is one of the most meaningful ways to create wealth for most families; however, it carries risks. Mortgage defaults negatively impact both lenders and homeowners. One of the most meaningful indicators of the default likelihood is loan-to-value (“LTV”), which is the loan amount owed by the homeowner as a percentage of the value of the house. More specifically, while the LTV on the date of purchase is a strong indicator of risk, the LTV at every point going forward from the purchase date (the “mark-to-market LTV” or “MLTV”) gives an indication at any time whether or not the homeowner has more value in the home than they owe the lender.
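The two ratios defined above can be illustrated with a short Python sketch. It is an assumption-laden example, not the team's method: the function names are hypothetical, the dollar amounts and index levels are made up, and the current home value is approximated by scaling the purchase price by the change in a home price index, which is one common way to mark a value to market.

```python
def ltv(loan_balance, home_value):
    """Loan-to-value: loan amount owed as a fraction of the home's value."""
    if home_value <= 0:
        raise ValueError("home_value must be positive")
    return loan_balance / home_value


def marked_value(purchase_price, index_at_purchase, index_now):
    """Estimate the current home value from a home price index.

    Assumption: the home's value moves in proportion to the index.
    """
    if index_at_purchase <= 0:
        raise ValueError("index_at_purchase must be positive")
    return purchase_price * index_now / index_at_purchase


def mltv(loan_balance, purchase_price, index_at_purchase, index_now):
    """Mark-to-market LTV: current balance over the index-adjusted value."""
    current_value = marked_value(purchase_price, index_at_purchase, index_now)
    return ltv(loan_balance, current_value)


if __name__ == "__main__":
    # Hypothetical loan: $200,000 borrowed against a $250,000 home.
    print(ltv(200_000, 250_000))
    # Hypothetical later date: balance paid down to $195,000 while the
    # index fell from 100 to 80, so the owner now has very little equity.
    print(mltv(195_000, 250_000, 100.0, 80.0))
```

An MLTV above 1 means the homeowner owes more than the estimated value of the home. Which index feeds `marked_value` (a local one or a national one) changes the resulting MLTV for the same loan.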
Our report investigates how MLTV can impact default rates and how swapping the mechanism used to measure MLTVs can help lower the overall risk of default. We analyzed the Freddie Mac loan-level data from 2008 to 2010 and built graphs showing the risk-reducing potential of swapping our measuring mechanism from local home price indices to a national one. We also used graphs to show the potential to reduce residential mortgage default credit losses through the swap, and then we built a logistic regression model to capture the effect of the MLTV on default rates. We tested our model against data from 2014 to 2016. The model incorporated LTV and Combined Loan-to-Value at origination, MLTV, origination channel, Debt-to-Income, credit score, and first-time homebuyer flag. Finally, we discussed further studies to explore more datasets and to determine whether swapping to the national index is ideal.
End-To-End Analysis of Energy Data
Project team: Hyun Do Cha, Ruoqi Ge, Shan Huang, and Jian Shi
Sponsor: Ursa Space Systems
Advisor: Dr. Sumanta Basu
Recognition: Honorable Mention Award recipient
Abstract: Understanding the movements of global oil flow is critical to forecasting trends in the energy industry and its associated economic sectors. This project assessed the viability of utilizing satellite-imaged oil inventory figures, energy analytics data on offtakes and loads, as well as vessel traffic records of nearby ships to compute pipeline flows and investigate oil terminal operations. In this project, spreadsheet modelling methods were applied to reconfigured data in order to estimate the total crude oil flow through the Sumed pipeline over 2019. Additionally, geospatial visualization techniques and canonical correlation analyses were employed to assess the existence of licensing or ownership relations between individual oil tanks and shipowning entities. The results indicate that tank fill data can yield a reliable lower-bound estimate of total oil flow, with robustness improving with either higher granularity of data or less frequent internal fluctuations of port inventory totals. Furthermore, it was found that most ships or shipowning entities are unlikely to be extracting crude oil from licensed tanks. These conclusions offer insights into the Sumed Company’s Egyptian closed oil system and illuminate the potential of a cost-effective, high-frequency imaging project over a shortened period in order to accurately model oil flow and refine a list of ship-tank pairs.
Financial Markets and Housing Sector: A Cross-Country Empirical Study
Project Team: Harshita Garg, Jiayin Liu, Shutong Wu, and Yi Zhu
Sponsor: Dr. Yunhui Zhao (IMF Economist)
Advisor: Dr. Yang Ning
Recognition: Honorable Mention Award recipient
Housing is by far the most important asset in households’ balance sheets across the world, so it is extremely important to understand the driving forces behind high housing prices. However, despite forceful and frequent government interventions, housing prices in some countries such as China have been on a fast-increasing trend. Motivated by this dilemma, the project applies a variety of empirical approaches to a cross-country panel dataset, including panel regressions (which in turn select variables based on Lasso regressions), difference-in-differences models (which in turn include fixed effect and random effect models), and machine-learning models that ensure higher similarity among the restricted subsample. Results from all these approaches support the findings in Bayoumi, Xie and Zhao (2020) (which studies the housing market in China), and suggest that countries with “underdeveloped” or incomplete financial markets (such as shallow bond markets and stock markets that are plagued by insider trading) tend to experience higher housing price growth, after controlling for other key supply-side and demand-side factors in the housing market. The results imply that to eradicate the root causes of the high housing price issue, policymakers need to go beyond the housing market itself; instead, it may be desirable to deepen the financial markets because these markets would help channel financial resources to productive sectors instead of to housing speculation and help enhance the overall efficiency of the economy.
Disclaimer: The views expressed here are those of the author(s) and do not necessarily represent the views of the IMF, its Executive Board, or IMF management.