
Overview

What is linear regression?

  • Linear regression models the relationship between one or more independent variables and a dependent variable, then uses that relationship to predict new values of the dependent variable.

An independent variable, or predictor, is a variable whose value does not change in response to the other variables.

For example: When selling a house, the price of the home depends on the number of rooms, the square footage, the plot size, etc. In this case, the number of rooms, the square footage, and the plot size are not going to change, but they will definitely influence the price the home sells for.

sample linear regression

In this photo, linear regression takes the data points and tries to find a line of best fit that summarizes them. The line will not go through every point, but it gives a best guess based on the data provided.
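One common way to find a line of best fit is ordinary least squares. The sketch below is written in Python rather than R, and the house data in it is invented purely for illustration; `fit_line` is a hypothetical helper name, not something from the original analysis.

```python
# Hypothetical data, invented for illustration only:
# square footage of five homes and their sale prices.
square_feet = [1000, 1500, 2000, 2500, 3000]
price = [200_000, 260_000, 330_000, 380_000, 450_000]

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(square_feet, price)
print(f"price = {slope:.2f} * square_feet + {intercept:.2f}")

# The line summarizes the points without passing through all of them.
for x, y in zip(square_feet, price):
    print(x, y, round(slope * x + intercept))
```

The last loop prints each actual price next to the fitted one, which makes the small gaps between the points and the line visible.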

​

The closer the points are to the line, the stronger the linear relationship between the two variables. For example, the number of rooms in a house is closely tied to its square footage, but the exterior paint may not be a good predictor of a home's price. It may influence whether the home sells, but not the actual sale price.
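How close the points sit to a line is often summarized with the Pearson correlation coefficient, which ranges from -1 to 1. The Python sketch below uses two invented data sets (not the data discussed on this page) to contrast a tight relationship with a scattered one; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: near +1 or -1 means points hug a line, near 0 means they do not."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, invented for illustration only.
xs = [1, 2, 3, 4, 5]
tight_ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # almost exactly on a line
scattered_ys = [5, 1, 6, 2, 4]          # no clear trend

r_tight = pearson_r(xs, tight_ys)
r_scattered = pearson_r(xs, scattered_ys)
print(f"tight: r = {r_tight:.3f}, scattered: r = {r_scattered:.3f}")
```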

​

Information sourced from: DataCamp & SpiceWorks

Advantages

  • Simple implementation 

  • Easy to interpret results

  • If the relationship is already known to be linear, this is a strong choice, since it is the least complex model that fits

Limitations

  • Susceptible to overfitting

  • Outliers affect results heavily

  • Assumes the attributes are independent of one another, even when they are not

  • Does not give the full picture; it is not a complete description of the relationships between variables

  • Oversimplifies relationships between complex variables
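The sensitivity to outliers listed above can be seen with a small Python sketch. The numbers are invented for illustration, and `fit_slope` is a hypothetical helper that computes the ordinary least squares slope; a single extreme point is enough to change the fitted slope substantially.

```python
def fit_slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
clean_slope = fit_slope(xs, ys)

# The same data plus one invented outlier at (6, 40).
outlier_slope = fit_slope(xs + [6], ys + [40])

print(f"slope without outlier: {clean_slope:.2f}")
print(f"slope with outlier:    {outlier_slope:.2f}")
```

With only five clean points, one outlier dominates the fit; with more data, the same point would pull the line less.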

Data Prep

For this linear regression portion, a reduced version of the Harvard data set will be used.

The data set has been reduced to the following columns: Age at any headache, Age at chronic headache diagnosis, Duration of care (years), and BMI.

The two relationships that will be explored are Age at any headache & Duration of care (years), and Age at chronic headache diagnosis & BMI.

Code & Results

For the linear regression analysis, the code is written in R.

The first relationship being explored is Age at any headache & Duration of care (in years).

r studio data printout

The formula generated by the linear regression model is:

Duration of care (years) = -0.3240 × (Age at any headache) + 17.0540

This means that for someone who had their first headache at 10 years old, the model predicts about -0.3240(10) + 17.0540 = 13.814 years of care for chronic migraines.
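The arithmetic above can be checked with a short Python sketch (the analysis itself was done in R). The slope and intercept are the values reported in the text; the function name `predict_duration_of_care` is hypothetical.

```python
# Coefficients as reported in the text for the first model.
SLOPE = -0.3240
INTERCEPT = 17.0540

def predict_duration_of_care(age_at_first_headache):
    """Predicted years of care from the fitted line (hypothetical helper)."""
    return SLOPE * age_at_first_headache + INTERCEPT

for age in (10, 30, 60):
    print(age, round(predict_duration_of_care(age), 3))
```

Because the slope is negative, the fitted line drops below zero years of care for ages above roughly 52.6 (17.0540 / 0.3240), which is a reminder that the line is only a summary of the sampled data and should not be stretched far beyond it.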

​

The p-value of 0.006452 is below the conventional 0.05 threshold, so there is statistically significant evidence of a relationship between these two variables.

Visualizing the results shows clear evidence of a linear relationship between these two variables. The negative slope of the line indicates that the older a patient is at their first headache, the fewer years of care are needed.

Most of the points fall fairly close to the trend line as well. However, there are some significant outliers, such as patients who were 5 at their first headache and needed 75 years of care! Thankfully, these outliers did not seem to pull the slope of the line very much.

graphed linear regression

The second relationship being explored is Age at chronic headache diagnosis & BMI

secondary r studio data printout

The formula generated by the linear regression model is:

Age at chronic headache diagnosis = 0.1916 × BMI + 28.8699

This means that for someone with a BMI of 30, the model predicts a chronic migraine diagnosis at about 0.1916(30) + 28.8699 = 34.6179 years old.
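To get a feel for how shallow this slope is, the Python sketch below tabulates the predicted age at diagnosis over a range of BMI values. The coefficients are the ones reported in the text; the function name `predict_age_at_diagnosis` and the chosen BMI values are assumptions made for illustration.

```python
# Coefficients as reported in the text for the second model.
SLOPE = 0.1916
INTERCEPT = 28.8699

def predict_age_at_diagnosis(bmi):
    """Predicted age at chronic headache diagnosis (hypothetical helper)."""
    return SLOPE * bmi + INTERCEPT

# Illustrative BMI values, not taken from the data set.
for bmi in (20, 25, 30, 35):
    print(bmi, round(predict_age_at_diagnosis(bmi), 4))

# A 10-point difference in BMI moves the prediction by under two years.
shift = predict_age_at_diagnosis(30) - predict_age_at_diagnosis(20)
print(round(shift, 3))
```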

​

The p-value of 0.3452 is well above 0.05, so there is not statistically significant evidence of a relationship between these two variables.

Visualizing the results shows that there is not a strong linear relationship between these variables.

Quite a few points lie close to the trend line, but many others lie well above or below it.

From this visualization it can be concluded that BMI and age at chronic diagnosis might be related, but only weakly, and not as strongly as some other variable might be.

secondary graphed linear regression

Conclusion

From these two visualizations, there is a discernible relationship in the first pair of variables and, at best, a weak one in the second. However, it cannot be definitively said that these variables are the only things contributing to the outcomes.

​

Logically, a person who has their first headache at 10 years old may or may not have 13 years of chronic migraine treatment. Similarly, a person who has a BMI of 30 most likely won't be diagnosed with chronic migraines at exactly 34 years old. This is where the pitfall of linear regression comes into play. Many additional factors need to be considered for an accurate prediction of what could cause a migraine, for example weather pattern changes, diet, screen time, genetics, etc. While these results are interesting and show some promise, they do not give the full picture. They can, however, point research toward the relationships that are more closely linked instead of using all variables to make predictions. Additionally, this is just one sample of 50 individuals. If this test were run on 5000 individuals, the results might be entirely different!
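The point about sample size can be illustrated with a small simulation. This Python sketch does not use the data set from this page: the "true" slope, intercept, noise level, and age range are all hypothetical values chosen for the demonstration. It repeatedly draws samples of 50 and of 5000 simulated patients and compares how much the fitted slope varies from sample to sample.

```python
import random
import statistics

# Hypothetical "true" relationship and noise level, chosen only for this simulation.
TRUE_SLOPE = -0.3
TRUE_INTERCEPT = 17.0
NOISE_SD = 10.0

def fit_slope(xs, ys):
    """Ordinary least squares slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(sample_size, repeats, rng):
    """Fit a line to `repeats` simulated samples and return the fitted slopes."""
    slopes = []
    for _ in range(repeats):
        xs = [rng.uniform(5, 60) for _ in range(sample_size)]
        ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, NOISE_SD) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes

rng = random.Random(0)  # fixed seed so the run is repeatable
spread_50 = statistics.stdev(simulate_slopes(50, 50, rng))
spread_5000 = statistics.stdev(simulate_slopes(5000, 50, rng))

print(f"spread of fitted slopes, n=50:   {spread_50:.4f}")
print(f"spread of fitted slopes, n=5000: {spread_5000:.4f}")
```

Under these assumptions the slope estimates from samples of 50 vary far more than those from samples of 5000, so a single small sample can land noticeably away from the underlying relationship.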

​

The key takeaway is that while linear regression is a good way to explore relationships, it's not the be-all and end-all. Its results should be taken with a grain of salt rather than at face value, and used as a stepping stone into other algorithms to help guide research, especially when the relationships between variables are complex and can change on a whim.

© 2027 by Bridget Litostansky
