top of page

Exploratory Data Analysis - EDA

Gathering, Cleaning, & Exploration

Data Sources

For this project data has been collected from four different web sources.

​

The ClinicalTrials.gov dataset contains 500 records of clinical trials in various stages of research phases.

 

The Harvard dataset contains extensive records about migraine locations, age when diagnosed, risk factors, MRI and CT record results, symptoms, and medications taken for 51 patients. 

​

The Kaggle dataset contains data on patient symptoms experienced during a migrainous episode along with age.

​

These three different datasets contain valuable information regarding proposed questions and hopefully will shed light into migraines and further understanding of this disease.

Data Cleaning

Kaggle Data Set

The Kaggle dataset did need some minor adjustments and cleaning. There were no null records, no missing values, and all categorical values were identically matching text, there were no weird fill-in-the-blank type values. When reading in the csv, R converted all 1-0 values to be dbl type so those were converted back to int. Column names were named meaningfully and did not add any extra complexity to understanding the data.

cleaned data set 1
cleaned data set 2

ClinicalTrials.gov Data Set

The ClinicalTrials.gov data set came from the API hosted on the group's website. For the project, the endpoint for studies is being used. 

uncleaned data

A look at the raw data, it's breathtakingingly beautiful!

There were quite a lot of data columns returned that were unnecessary for the research questions posed on the introduction slide and contained numerous null values. For this project's sake, those columns were dropped as they did not add any value to the table and data visualizations. There were 146 unique columns initially in this data set and tables had multiple words with multiple sectioned names. For readability, those names were stripped off and all that remains is just the rightmost portion of text.

 

​

After the drop and renaming, there are 48 columns of data about migraine-related clinical studies. There are still some null values in secondary fields, but that should not affect the data being pulled.

all columns data set

Just look at those long column names, YIKES!

Renamed columns

Much improved, looks much cleaner.

Harvard Data Set

The Harvard data set had an extensive history of patients suffering from chronic migraines. There is tons of information on prescriptions, migraine locations, MRI and CT scan results, and other great unidentified HIPAA compliant clinical data. 

csv before processing

On initial exploration, data looks good. Column names not so much.

For the cleaning portion of this data set, the data columns were renamed to not have spaces in between words.

cleaned csv data

Much improved!

Visualization #1

status of 500 migraine clinical trials barchart

Figure 1 shows the current status of 500 migraine clinical trials. Roughly 280 of these trials are closed, 75 are actively recruiting, and then the rest fall under various statuses like withdrawn, terminated, etc. 

​

This figure was created to see the classification of clinical trials so the conclusions drawn about them would be applied correctly.

Visualization #2a 2b

Minimum age of study participants
Maximum age of study participants

Figures 2a and 2b show both the minimum and maximum age required for the clinical trials based on their status.

 

For the minimum age, most studies required the participants to be 16-18 years of age to participate.

 

For the maximum age, most studies required participants to be no older than 65-70 years of age. 

​

It is interesting to see the maximum age at 65. Intuitively, it would make sense to cut the age to 65 due to complications in older adults that may potentially skew data sets. However, valuable data still could be found if they were to be included.

​

These figures were made to get a better understanding of where the age groupings should be drawn for future analysis, which is 18-65.

Visualization #3

donut chart

Figure 3 shows data from the Kaggle set by taking the Duration in days as the inner ring of the pie chart and the frequency of days in the outer ring. 

​

A majority, ~57% of patients have a duration of 1 day, ~25% have a duration of two, and the final ~18% have a duration of 3 days.

​

Upon further examination, migraines trend in all three durations to either frequent for 1-2 times or 5 times. The polar extremes of each other. 

​

Visualization #4

histogram of age range
min age of participants 15
mean age of participants 31
max age of participants 77

Figure 4 shows data from the Kaggle set by grouping the ages and turning it into a density plot.

​

Upon review, the data set is heavily skewed below the mean (black line) of 31.705 however there is hardly any patients who belong to that bucket. There are quite a lot of older patients who balance out and bring the average higher. 

​

The youngest patient in the data set is 15 years old and the oldest patient in the data set is 77 years old.

​

This confirms what was seen in Visualization #2 with clinical trials being roughly from ages 18-25 with a few outliers.

Visualization #5

intensity of migraine box and whisker plot
mean median scores

Figure 5 shows a box plot summary from the Kaggle data set of the intensity scores given by patients. 

​

From this visualization, the minimum score given for intensity pain is 0, the median intensity pain is 3, and the average intensity pain is 2.47.

​

Surprisingly, the intensity scores are not as high. 

Visualization #6

histogram of bmi distribution
mean bmi 30.7
max bmi 58
min bmi 15

Figure 6 shows a histogram breakdown of the BMI scores of all 50 patients in the Harvard data set. 

​

From this visualization, the minimum BMI is 15.19, the maximum BMI is 58, and the average BMI is 30.75.

​

Interestingly, the BMI results tend to center around the mid-20s but due to the few very high BMI scores, it is pushing the mean further to the right. Similar to the behavior seen in Figure 4

Visualization #7

duration of care in days graph

Figure 7 the comparison of the duration of care in days concerning years from the Harvard data set.

​

From this visualization, it can be discerned that there are three distinct clusters. The first being 0-150 days, the second being from 300-725 days, and the third being from 750-1000+ days

​

It is a good mix of patients and should provide some valuable insights with further machine-learning techniques.

Visualization #8

bar chart of medications prescribed

Figure 8 shows the top 6 most prescribed medications to chronic migraine sufferers from the Harvard data set.

​

The top 6 medications are topiramate, botox, amlodipine, propranolol, gabapentin, and amitriptyline.

​

These medications are commonly given to migraine patients. They vary in class from tricyclic antidepressants to anti-seizure drugs. 

​

Learn more about migraine medications

Visualization #9

pulse variations chart

Figure 9 shows the different types of pulse classifications and how each patient was recorded.

​

It looks like 50% of patients had normal pulses and the other 50% fell into an unnormal pulse classification.

​

Additionally, the largest abnormal pulse deviation was in the weak, bilateral category, meaning both sides of the brain were registering a weak pulse. 

​

Visualization #10

donut chart vertigo

Figure 10 shows if a patient has vertigo on the inner ring and if they have vision loss on the outer ring.

​

54% of patients fall into having vertigo and 46% fall into unknown or unsure. 

​

From there, of those 54%, 33% experience vision loss. About 15% experience Ipsi or ipsilateral vision loss, meaning "same side" as the migraine is on. And another ~4% have diplopia which is the medical term for double vision. 

​

It is strange that those who do not get vertigo lose their vision at a pretty high rate, almost 50%!

​

© 2027 by Bridget Litostansky

bottom of page