The purpose of this project is to demonstrate data exploration in Python using the pandas library. The dataset I chose contains SAT and GPA scores for 1000 students at an unnamed college. The data was obtained from www.openintro.org.

Import Data

As a first step, let’s import the data into Jupyter Notebook.
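As a rough sketch of the import step, the snippet below reads a CSV into a pandas DataFrame. To keep it runnable on its own it reads a few made-up rows from an in-memory string; the rows are illustrative only and are not taken from the real dataset, and the filename mentioned in the comment is an assumption.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few MADE-UP rows that only reuse
# the column names described in this post.
# With the real file you would pass its path instead, for example
# pd.read_csv("satgpa.csv") (that filename is an assumption).
csv_text = """sex,sat_m,sat_sum,hs_gpa,fy_gpa
1,62,127,3.40,3.18
2,58,116,4.00,3.33
2,54,108,3.75,2.63
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```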

What does each column mean?

In short, the dataset contains gender, SAT scores, high school GPA and first year college GPA.

The column description and format are listed below:
1. sex – Gender of the student

2. sat_m – Math SAT percentile

3. sat_sum – Total of verbal and math SAT percentiles

4. hs_gpa – High school grade point average

5. fy_gpa – First year (college) grade point average

Since ‘sex’ takes only two values, I will assume that 1=male and 2=female. This is purely an assumption. To make things less confusing let’s add a ‘gender’ column.
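One way to add the ‘gender’ column is to map the numeric codes to labels. This is a minimal sketch on a tiny made-up frame; the 1=male, 2=female mapping is the assumption stated above, not something confirmed by the data source.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({"sex": [1, 2, 2, 1]})

# ASSUMPTION: 1 = male, 2 = female.
gender_labels = {1: "male", 2: "female"}

# Series.map looks up each code in the dictionary.
df["gender"] = df["sex"].map(gender_labels)

print(df)
```

Any code missing from the dictionary would come out as NaN, which is a handy way to spot unexpected values.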

QC the data

Before doing any analysis or interpretation it’s always a good idea to review the data. As a start I like to review the data types, then check for null values, duplicates and zero values.
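The checks described here can be sketched with a handful of pandas calls. The frame below is made up for illustration, so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "sex": [1, 2, 2],
    "sat_m": [62, 58, 54],
    "hs_gpa": [3.40, 4.00, 3.75],
    "fy_gpa": [3.18, 3.33, 2.63],
})

print(df.dtypes)                             # data type of each column

null_counts = df.isnull().sum()              # missing values per column
duplicate_rows = int(df.duplicated().sum())  # fully duplicated rows
zero_counts = (df == 0).sum()                # zero values per column

print(null_counts)
print(duplicate_rows)
print(zero_counts)
```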

Good news! Our data is in great shape and I have to say this is rare. In cases where we have zeroes and nulls we need to figure out how to deal with them; however, that is another post in itself.

For now let’s move on and see some basic statistics.
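A quick way to get basic statistics is `describe()`, which reports the count, mean, standard deviation, min, quartiles and max for each numeric column. The sketch below uses made-up rows, so the numbers are not the ones from the real dataset.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "sat_m": [62, 40, 54],
    "hs_gpa": [3.40, 4.50, 3.75],
    "fy_gpa": [3.18, 3.90, 2.63],
})

summary = df.describe()
print(summary)

# Individual statistics can be pulled out by row label and column name.
print(summary.loc["min", "sat_m"], summary.loc["max", "hs_gpa"])
```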

We know that the SAT scores are provided in percentiles, so we expect the values to be between 0 and 100. The min/max values of the verbal and math SAT scores fall within this range, which is a good sign.

Another point of interest is the mean high school and first year GPA. The first year GPA average is much lower than the high school GPA average, suggesting that most students struggled in their first year of college.

I also noted that within the GPA columns we would expect the values to be between 0 and 4.0. So why is the max value for high school GPA 4.5? After some research I have concluded that AP and honours classes often carry an extra point on the GPA scale. Consider a student who takes three regular classes and three AP classes and obtains A’s in all of them. Each AP class is awarded an extra point, so the total is:

3×4.0 + 3×5.0 = 27 points

Divide that by 6 classes, and you get a 4.5 GPA average.
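The same arithmetic can be checked in a few lines of Python. This sketch assumes the weighting described above (an A is worth 4.0 in a regular class and 5.0 in an AP class); actual weighting schemes vary between schools.

```python
# ASSUMPTION: A = 4.0 in a regular class, A = 5.0 in an AP class.
regular_points = [4.0, 4.0, 4.0]   # three regular classes, all A's
ap_points = [5.0, 5.0, 5.0]        # three AP classes, all A's

total_points = sum(regular_points) + sum(ap_points)
class_count = len(regular_points) + len(ap_points)
weighted_gpa = total_points / class_count

print(total_points)   # 27.0
print(weighted_gpa)   # 4.5
```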

Let’s check how many students achieved a GPA above 4.0.
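A boolean mask is one simple way to run this check. The rows below are made up for illustration, so the count they produce is not the count in the real dataset.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "sat_m": [62, 40, 54],
    "hs_gpa": [3.40, 4.50, 4.00],
    "fy_gpa": [3.18, 3.90, 2.63],
})

# Keep only the rows where high school GPA is strictly above 4.0.
above_four = df[df["hs_gpa"] > 4.0]

print(len(above_four))
print(above_four)
```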

This row looks erroneous. This student’s verbal and math scores were well below the 50th percentile, yet their high school GPA was 4.5. Their first year college GPA also looks to be in line with their high school GPA. For now we will leave the row in place and potentially remove it later. My preference is to remove as little data as possible, and the only suspect value here is the high school GPA.

Explore the Data

Now let’s have a look at the gender split in pandas and visualise this. I really like utilising the ‘groupby’ and ‘size()’ functions in pandas. Pandas also has some basic plotting functions for quick and easy visualisation.
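Here is a minimal sketch of the ‘groupby’ and ‘size()’ combination on a made-up ‘gender’ column; the counts are illustrative and not those of the real dataset.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
})

# size() counts the rows in each group.
counts = df.groupby("gender").size()
print(counts)

# For a quick chart, counts.plot(kind="bar") draws one bar per group
# (pandas plotting needs matplotlib to be installed).
```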

In order to have a quick look at the data we can utilise some of the functions in the seaborn library. Since our dataset is not too large I will run the pairplot function. Keep in mind that if your dataset is large or has a lot of columns, it will take some time to run.

I love this plot. Right away you can see some trends. Starting with the histograms, we can see that the math, verbal and total SAT scores look roughly normally distributed, with males achieving a slightly higher average in math.

In terms of high school GPAs, the histograms indicate a bimodal distribution, with females performing slightly better than males. To better visualise this I like to use the seaborn boxplot technique.
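As a numeric companion to the boxplot, the quartiles that a boxplot draws can be computed per gender with pandas alone. This is a sketch on made-up values and is not a substitute for the plot itself.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male"],
    "hs_gpa": [3.0, 3.5, 4.0, 2.5, 3.0, 3.5],
})

# describe() on a grouped column returns one row of statistics per group;
# these five columns are the values a boxplot summarises.
quartiles = df.groupby("gender")["hs_gpa"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]
print(quartiles)
```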

There is also a very loose positive correlation between high school GPA and first year college GPA. It’s definitely not as simple as creating a linear regression using high school GPA to predict first year college GPA.
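To put a number on how loose the relationship is, one option is the Pearson correlation coefficient, which pandas computes with `Series.corr`. The values below are made up, so the coefficient they give says nothing about the real data.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "hs_gpa": [3.0, 3.5, 4.0, 2.5],
    "fy_gpa": [2.4, 2.6, 3.5, 2.2],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one.
r = df["hs_gpa"].corr(df["fy_gpa"])
print(round(r, 2))
```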

Now let’s look at high school GPA versus first year college. A simple way to look at this is to take the difference of first year GPA minus high school GPA. If the values are positive then students did better in college. Conversely if the values are negative students did worse.
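The difference is a single column subtraction. In this sketch the column name `gpa_diff` is my own choice for illustration and the rows are made up.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "hs_gpa": [3.5, 3.0, 4.0],
    "fy_gpa": [3.0, 3.25, 3.0],
})

# Positive = did better in college, negative = did worse.
# "gpa_diff" is a hypothetical column name.
df["gpa_diff"] = df["fy_gpa"] - df["hs_gpa"]

print(df)
```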

Wow! The plot indicates most students did worse. Let’s calculate the average difference and count the number of students by gender that did worse in first year.
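A sketch of both calculations, again on made-up rows and with `gpa_diff` as a hypothetical column name holding first year GPA minus high school GPA:

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "gender": ["male", "male", "female", "female"],
    "gpa_diff": [-0.5, 0.25, -1.0, -0.5],
})

# Average difference overall and per gender.
mean_diff = df["gpa_diff"].mean()
mean_diff_by_gender = df.groupby("gender")["gpa_diff"].mean()

# Number of students per gender whose GPA went down.
worse_by_gender = df[df["gpa_diff"] < 0].groupby("gender").size()

# Share of all students whose GPA went down
# (the mean of a boolean column is the fraction of True values).
share_worse = (df["gpa_diff"] < 0).mean()

print(mean_diff)
print(mean_diff_by_gender)
print(worse_by_gender)
print(share_worse)
```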

On average, both males and females had lower GPAs in their first year of college. The average change was -0.73 GPA points and was pretty much the same for males and females. In total, 89% of students saw a decline in GPA in their first year of college.

Summary

In this dataset we looked at SAT and GPA scores in high school and how these translated to first year college performance. Overall we observed that students’ GPAs declined by an average of 0.73 points in their first year of college. The decline was similar for males and females. We also observed that females tended to perform slightly better in both high school and first year. I have completed the exploration summary with a viz below.