Data Analysis

Projects

Norwich IPEDS Data

The goal of this project was to help the faculty at Norwich University make informed decisions about the future of the school, with a particular focus on budget matters. Using the Integrated Postsecondary Education Data System (IPEDS), I collected information on core revenue and expenses from 2006 to 2022 and on staffing from 2012 to 2022. I also used data from the American Association of University Professors to analyze tenure status from 2002 to 2022. The data from these sources is already mostly clean, so I only had to clean up names and perform some pivots. I then used the tidyverse package to create the visuals and compute some extra columns such as the discount rate. Some of my key observations are that the discount rate when from 27% in 2008 to 59% in 2022 and that the number of part-time faculty has grown from 72 in 2002 to 239 in 2022 while the number of full-time tenure track and tenured faculty numbers have remained relatively constant. Below you will find a summary and full report of the data, a dictionary outlining the specific meanings of the categories, and a repository of the files I used for the analysis.

Best Original Score Data Analysis

In analyzing the best original score category for the Oscars, I hoped to challenge myself with a particularly unclean data set while learning more about two of my interests: movies and music. I scraped the data from the wikipedia page. On that page the nominees and winners are stored in separate tables by decade, values span across multiple rows and columns, category names are intermixed with film names, the nominees are listed in the same cell, separated by commas and the word "and," and the winners are merely highlighted in the tables. In addition to this the category name changed over a dozen times over the years and there is a many-many relationship between years and category names. There is also a many-many relationship between films and nominees. Using power query, I organized this data into one table with columns for year, award show number, film name, category, whether or not it won, and the nominee name. I then used the organized data to show the films and nominees, find the nominees with the most nominations and the most wins, and highlight how the categories have changed over time. I made sure to use slicers and drill-through effectively. The results clearly showed that John Williams was an outlier and that the Newman family was exceptional. Acting on this, I made separate pages dedicated to John Williams and the Newman family. To understand the Newman family I imported the hierarchy chart visual, wrote a table summarizing the family tree and joined it with the table of films. Below you will find a pdf sample of the power BI report as well as the power BI file itself. I do not have the correct license to share a web version of the interactive dashboard, otherwise I would.

Medical Spending of the US Elderly (Updated for 2021)

In 2016, De Nardi, French, Jones, and McCauley analyzed medical spending of the US elderly (link here). They used the Medicare current beneficiary survey (MCBS). I used this paper and the MCBS as on opportunity to learn about processing medical information. In the 2016 paper they found that when looking from 1996 to 2010, a higher percentage of women's health care is paid by Medicaid, and a higher percentage of women's health care costs come from nursing homes. I pulled the MCBS data from the most recent year (2021) to see if these trends had changed. The MCBS data is weighted, so they recommend using the survey package for R to process it. I followed their instructions along with the definitions in their legend and recreated the tables from the 2016 paper with the data from 2021. One major limitation is that for 2021 the MCBS excluded information on people who had spent time in nursing homes from the publicly available data set. Still, I was able to detect some interesting new trends in increased medicine expenses and decreased hospital expenses. Below you will find my report as well as a repository of the files I used for my analysis.

Training

You can see a good portion of my training progress on my LinkedIn Page. Beyond that I've worked through the following resources:

Microsoft Power BI training (source).
R for Data Science (source).
Python Data Science Handbook (source).
Database Design and Relational Theory (source).
SQL for Data Analysis (source).