Python and Libraries for Data Science
Case Study: FitFlow
Words: 5180 | Reading time: 13 min
Jan 4, 2025
Jan 19, 2025

★ღ Introduction ★ღ

FitFlow is a fitness tracking and wellness app that enables users to track their workouts and health metrics. This case study analyzes a sample of data from the app to identify patterns in health habits and workout behaviors. By examining variables such as activity type, workout duration, intensity, calories burned, and health metrics like heart rate, sleep hours, stress levels, and hydration, the goal is to uncover trends and relationships that provide deeper insights into how users maintain their fitness routines and health status. This analysis will focus on understanding the correlations between health conditions, lifestyle factors, and fitness levels across the dataset.
 

★ღ Disclaimer ★ღ

This page outlines the process of my project and demonstrates how I applied my recent knowledge. The final code is available on my GitHub account; however, it does not include detailed explanations or numerous visualizations:
notion image

★ღ Dataset Overview ★ღ✰

Demographic Information

participant_id: Unique identifier for each participant.
age: Age of participant (18-65 years).
gender: Gender (M/F/Other).
height_cm: Height in centimeters.
weight_kg: Weight in kilograms.
bmi: Body Mass Index calculated from height and weight.
notion image
notion image

Activity Metrics

activity_type: Type of exercise (Running, Swimming, Cycling, etc.)
duration_minutes: Length of activity session.
intensity: Exercise intensity (Low/Medium/High).
calories_burned: Estimated calories burned during activity.
daily_steps: Daily step count.
 

Health Indicators

avg_heart_rate: Average heart rate during activity.
resting_heart_rate: Resting heart rate.
blood_pressure_systolic: Systolic blood pressure.
blood_pressure_diastolic: Diastolic blood pressure.
health_condition: Presence of health conditions.
smoking_status: Smoking history (Never/Former/Current).
notion image
notion image

Lifestyle Metrics

hours_sleep: Hours of sleep per night.
stress_level: Daily stress level (1-10).
hydration_level: Daily water intake in liters.
fitness_level: Calculated fitness score based on cumulative activity.
 

★ღ Data Exploration ★ღ✰

  • To get a general idea of how the DataFrame is constructed:
| participant_id | date | age | gender | height_cm | weight_kg | activity_type | duration_minutes | intensity | calories_burned | stress_level | daily_steps | hydration_level | bmi | resting_heart_rate | blood_pressure_systolic | blood_pressure_diastolic | health_condition | smoking_status | fitness_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024-01-01 | 56 | F | 165.3 | 53.7 | Dancing | 41 | Low | 3.3 | 3 | 7128 | 1.5 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.04 |
| 1 | 2024-01-04 | 56 | F | 165.3 | 53.9 | Swimming | 28 | Low | 2.9 | 7 | 7925 | 1.8 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.07 |
| 1 | 2024-01-05 | 56 | F | 165.3 | 54.2 | Swimming | 21 | Medium | 2.6 | 7 | 7557 | 2.7 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.09 |
| 1 | 2024-01-07 | 56 | F | 165.3 | 54.4 | Weight Training | 99 | Medium | 10.7 | 8 | 11120 | 2.6 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.21 |
| 1 | 2024-01-09 | 56 | F | 165.3 | 54.7 | Swimming | 100 | Medium | 12.7 | 1 | 5406 | 1.5 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.33 |
  • To identify the data types I’m working with:
  • Quick overview of the columns present in this DataFrame:
  • To determine the size of the DataFrame:
  • To identify null values:
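The exploration steps listed above can be sketched with standard pandas calls. This is a minimal sketch on a small made-up DataFrame that only borrows a few column names from the dataset; the real FitFlow file and its loading step are not shown on this page.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the FitFlow data (made-up rows, not the real dataset).
df = pd.DataFrame({
    "participant_id": [1, 1, 2, 2],
    "date": ["2024-01-01", "2024-01-04", "2024-01-01", "2024-01-02"],
    "age": [56, 56, 34, 34],
    "activity_type": ["Dancing", "Swimming", "Running", "Yoga"],
    "calories_burned": [3.3, 2.9, 15.2, 4.1],
    "health_condition": [np.nan, np.nan, "Asthma", "Asthma"],
})

print(df.head())            # first rows, to see how the DataFrame is laid out
print(df.dtypes)            # data type of each column
print(df.columns.tolist())  # column names
print(df.shape)             # (number of rows, number of columns)

# Count of missing values per column.
null_counts = df.isnull().sum()
print(null_counts)
```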
🔍
We can observe that the health_condition column has a large number of null values, while the rest of the columns are free of null values.

★ღ Handling Missing Values ★ღ

Missing values can have a significant impact on our analysis or models. By calculating the percentage of null values, we get a clear idea of the magnitude of the problem:
🔍
The health_condition column is 71.29% null, which is quite high.
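One way to compute such a percentage is to take the mean of the boolean mask returned by isnull(). The sketch below uses made-up values, so its result differs from the 71.29% reported for the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample: three of the four rows have no recorded health condition.
df = pd.DataFrame({
    "age": [56, 34, 41, 29],
    "health_condition": [np.nan, "Asthma", np.nan, np.nan],
})

# isnull() yields True/False per row; the mean of booleans is the share of True.
null_pct = df["health_condition"].isnull().mean() * 100
print(f"Null percentage of health_condition: {null_pct:.2f}%")
```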
  • To see the values in the health_condition column, including the missing ones:
🔍
In this case, the null values represent the absence of a health condition, so it is appropriate to use fillna to replace them with 'No health condition'.
  • To replace null values:
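A minimal sketch of both steps, inspecting the values including the missing ones and then filling them, on a made-up column:

```python
import numpy as np
import pandas as pd

# Made-up sample column with missing entries.
df = pd.DataFrame({
    "health_condition": [np.nan, "Hypertension", np.nan, "Diabetes", np.nan],
})

# dropna=False keeps NaN as its own row in the counts.
counts_with_nan = df["health_condition"].value_counts(dropna=False)
print(counts_with_nan)

# Replace missing values with an explicit label.
df["health_condition"] = df["health_condition"].fillna("No health condition")
print(df["health_condition"].value_counts())
```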

★ღ 1. Demographic Analysis ★ღ✰

This category focuses on understanding the characteristics of the participants, including their age, gender, BMI, and smoking habits. It serves as the foundation for segmenting the population into meaningful groups for deeper analysis. The objective is to identify patterns in participant demographics and understand how these characteristics influence other aspects of health and fitness.

★ 1.1 Age Distribution ★

  • To count the values in the age column:
  • To get basic summary statistics:
🔍
Age in this dataset has a mean of 41.66 years with a fairly wide spread (standard deviation of 13.58). Ages range from 18 to 64, the median is 42, and the middle half of the entries falls between 30 and 53.
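Figures like these come from value_counts() and describe(). A small sketch on made-up ages (the numbers will not match the real dataset):

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([18, 25, 30, 42, 42, 53, 64], name="age")

# How many entries there are for each age, ordered by age.
age_counts = ages.value_counts().sort_index()
print(age_counts)

# count, mean, std, min, quartiles (25%, 50%, 75%) and max in one call.
summary = ages.describe()
print(summary)
```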
  • To visualize this information:
notion image

★ 1.2 Gender Distribution ★

  • Before breaking the age data down by gender, we first need to examine the gender distribution in our dataset:
  • To visualize the gender distribution:
notion image
🔍
  • Females (F): There are 338,856 female entries, which is the largest group in the dataset (about 49% of the 687,701 entries), though just short of an outright majority.
  • Males (M): There are 334,023 male entries, which is a close second to females. The gender distribution appears fairly balanced between males and females.
  • Other: There are 14,822 entries classified as "Other," which represents a much smaller portion of the dataset. This could include individuals who identify as non-binary or another gender outside the traditional male/female categories.

★ 1.3 Age by Gender ★

  • To visualize the age distribution by gender:
notion image
  • We have ages ranging from 18 to 64 years. To simplify our analysis, we can create the following groups:
| participant_id | date | age | gender | height_cm | weight_kg | activity_type | duration_minutes | intensity | calories_burned | stress_level | daily_steps | hydration_level | bmi | resting_heart_rate | blood_pressure_systolic | blood_pressure_diastolic | health_condition | smoking_status | fitness_level | age_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2024-01-01 | 56 | F | 165.3 | 53.7 | Dancing | 41 | Low | 3.3 | 3 | 7128 | 1.5 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.04 | Older Adult |
| 1 | 2024-01-04 | 56 | F | 165.3 | 53.9 | Swimming | 28 | Low | 2.9 | 7 | 7925 | 1.8 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.07 | Older Adult |
| 1 | 2024-01-05 | 56 | F | 165.3 | 54.2 | Swimming | 21 | Medium | 2.6 | 7 | 7557 | 2.7 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.09 | Older Adult |
| 1 | 2024-01-07 | 56 | F | 165.3 | 54.4 | Weight Training | 99 | Medium | 10.7 | 8 | 11120 | 2.6 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.21 | Older Adult |
| 1 | 2024-01-09 | 56 | F | 165.3 | 54.7 | Swimming | 100 | Medium | 12.7 | 1 | 5406 | 1.5 | 1.96 | 69.5 | 110.7 | 72.9 | NaN | Never | 0.33 | Older Adult |
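A column like age_category can be built with pd.cut. The page does not state the bin edges that were used, so the edges below (18-29, 30-39, 40-49, 50-64) are an assumption chosen only to be consistent with age 56 landing in "Older Adult"; treat this as a sketch rather than the exact grouping.

```python
import pandas as pd

# Made-up ages covering the 18-64 range.
df = pd.DataFrame({"age": [18, 29, 30, 39, 40, 49, 50, 56, 64]})

# ASSUMED bin edges; right=False makes each bin include its left edge only,
# i.e. [18, 30), [30, 40), [40, 50), [50, 65).
bins = [18, 30, 40, 50, 65]
labels = ["Young Adult", "Adult", "Middle-Aged Adult", "Older Adult"]

df["age_category"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

category_counts = df["age_category"].value_counts()
print(df)
print(category_counts)
```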
  • To visualize our age data sorted in groups:
notion image
🔍
  • Older Adult (218,625): This group has the largest representation in the dataset. It suggests that older adults are the most prevalent demographic in the data.
  • Young Adult (175,403): This is the second largest group, with a sizable proportion of the dataset representing younger adults.
  • Middle-Aged Adult (147,698): This group is also fairly large but comes after young adults in terms of representation.
  • Adult (145,975): The "Adult" category, which sits between young adults and middle-aged adults, is the smallest group but still represents a significant portion of the data.
  • Now that data is organized, let’s visualize age categories by gender:
The bar chart is a good choice in this case because it provides clarity for comparison while offering multi-variable insights:
notion image
🔍
  • Young Adult: Males have a slightly higher representation in the "Young Adult" category, though the gap is small. The "Other" category still has a smaller presence, though higher than in the "Adult" category.
  • Adult: Females are slightly more represented in the "Adult" category compared to males, but the gap is not very wide. The "Other" category represents a smaller proportion in this age group.
  • Middle-Aged Adult: The distribution between males and females in the "Middle-Aged Adult" category is nearly equal, with a very small difference. The "Other" category remains relatively small.
  • Older Adult: There are more males than females in the "Older Adult" category, in contrast to the "Adult" category, where females lead. The "Other" category also has a slightly larger representation here compared to other age categories.
  • Subplotting the data to make it simpler:
notion image
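A subplot layout of this kind can be sketched with matplotlib, one panel per gender. The data, figure size, and labels below are made up for illustration, and the Agg backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up entries for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "Other", "F", "M", "F"],
    "age_category": ["Young Adult", "Young Adult", "Adult", "Adult",
                     "Adult", "Older Adult", "Older Adult", "Older Adult"],
})

# Rows: age categories, columns: genders, values: number of entries.
counts = pd.crosstab(df["age_category"], df["gender"])

# One bar chart per gender, sharing the y-axis so heights are comparable.
fig, axes = plt.subplots(1, len(counts.columns), figsize=(12, 4), sharey=True)
for ax, gender in zip(axes, counts.columns):
    counts[gender].plot(kind="bar", ax=ax)
    ax.set_title(f"Gender: {gender}")
    ax.set_xlabel("Age category")
    ax.set_ylabel("Entries")
fig.tight_layout()
```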

★ 1.4 BMI Categories ★

  • Quick overview of the BMI column:
💡
Raw BMI values are numeric and can be harder to interpret for most audiences. Categorizing them into groups like "Underweight," "Normal," "Overweight," and "Obese" makes the data easier to understand and communicate.
  • To categorize the bmi column:
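A minimal sketch of this step with pd.cut, using the standard adult BMI cut-offs (below 18.5, 18.5 to under 25, 25 to under 30, 30 and above). Whether the project used exactly these thresholds is an assumption, and the BMI values are made up.

```python
import numpy as np
import pandas as pd

# Made-up BMI values, including the boundary cases.
df = pd.DataFrame({"bmi": [17.9, 18.5, 22.4, 24.9, 25.0, 29.9, 30.0, 35.2]})

# ASSUMED thresholds (standard adult cut-offs); right=False gives
# [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

df["bmi_category"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)
print(df)
```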
notion image
🔍
Most individuals in this population fall within the healthy BMI range, which could indicate a generally healthy demographic. A substantial portion of the population is at risk of health issues related to being overweight, such as cardiovascular disease or diabetes. It could suggest the need for targeted health or fitness programs.
Understanding how BMI distributes by gender helps reveal potential differences in health trends, risks, and needs across genders.
  • Quick overview of BMI category by gender:
  • To visualize this data:
notion image
🔍
The majority of individuals are classified as "Normal" weight, with females having higher numbers, especially in older age groups. "Overweight" is most common in males, while "Obese" cases are more frequent in middle-aged and older adults, particularly among males. "Underweight" is more common in young adults, particularly females. The "Other" gender category has lower counts across all BMI and age groups.

★ 1.5 Smoking Status ★

Knowing the smoking status in this dataset is important because smoking can have significant effects on various health and fitness metrics.
  • Quick overview of smoking_status column:
🔍
  • Never (416,800): This group represents the participants who have never smoked.
  • Former (164,570): This group includes participants who used to smoke but have quit.
  • Current (106,331): This group contains participants who are currently smokers.
  • To know the percentages:
  • To visualize this data:
notion image
To analyze smokers versus non-smokers, it is ideal to create a new column that categorizes them:
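A sketch of one way to do this. It assumes that only "Current" counts as a smoker, with "Never" and "Former" both treated as non-smokers; the column name smoker and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up entries for illustration only.
df = pd.DataFrame({
    "smoking_status": ["Never", "Former", "Current", "Never",
                       "Current", "Never", "Former", "Never"],
    "age_category": ["Young Adult"] * 4 + ["Older Adult"] * 4,
})

# Share of each smoking status, as a percentage of all entries.
status_pct = df["smoking_status"].value_counts(normalize=True) * 100
print(status_pct)

# ASSUMPTION: only current smokers are labeled "Smoker".
df["smoker"] = df["smoking_status"].map(
    lambda status: "Smoker" if status == "Current" else "Non-smoker"
)

# Percentage of smokers vs non-smokers within each age group (rows sum to 100).
smoker_pct_by_age = pd.crosstab(
    df["age_category"], df["smoker"], normalize="index"
) * 100
print(smoker_pct_by_age)
```

The same crosstab call works for gender or BMI category by swapping the first argument.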
  • Percentage of smokers by age group:
  • To visualize this data:
notion image
  • Percentage of smokers by gender:
  • To visualize this data:
notion image
  • Percentage of smokers by BMI:
  • To visualize this data:
notion image
  • Percentage of smokers vs non-smokers:
  • To visualize this data:
notion image
  • To make it simpler, I make a whole chart with subplots:
notion image
🔍
The data reveals that across all age categories, genders, and BMI categories, the proportion of non-smokers consistently outweighs that of smokers. The percentage of smokers is relatively low, ranging from around 14% to 16%, while non-smokers account for roughly 84% to 86%. The trend is similar across age categories, with young adults showing the highest percentage of non-smokers (85.5%) and the lowest percentage of smokers (14.5%). Gender and BMI category comparisons show slight variations, but overall, the data suggests that non-smoking behavior is predominant in the population, regardless of age, gender, or BMI.

★ღ 2. Physical Activity ★ღ✰

★ 2.1 Activity Type Popularity ★

  • To know the unique values in the activity_type column:
  • To visualize the data:
notion image
🔍
The data shows nearly equal participation across most activities.
  • Percentage of Activity Type by Gender:
  • To visualize the data:
notion image
  • Percentage of Activity Type by Age:
  • To visualize the data:
notion image
 
  • Subplotting the data to make it simpler:
notion image
🔍
Yoga, Weight Training, and HIIT are the most popular activities, while Running is the least preferred. Females dominate Yoga and Dancing, males prefer Basketball and HIIT, Older Adults favor Tennis and Walking, and Young Adults opt for HIIT and Weight Training. These trends can guide tailored fitness programs for different demographics.

★ 2.2 Duration vs. Calories Burned ★

  • Grouping by activity_type and calculating the mean duration and calories burned:
  • To visualize the data:
notion image
🔍
  • The average duration for all activities is remarkably consistent, hovering around 70 minutes.
  • HIIT has the highest average calories_burned value (25.99), so with durations this similar it is the most efficient activity for burning calories.
  • Activities like Running (21.33), Cycling (18.50), and Basketball (17.36) fall in the middle range for calorie burning.
  • Walking (8.23) and Yoga (6.51) are the least calorie-intensive activities.
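The grouping behind these averages can be sketched as a groupby followed by mean(). The rows below are made up, so the resulting numbers are illustrative only.

```python
import pandas as pd

# Made-up sessions for illustration only.
df = pd.DataFrame({
    "activity_type": ["HIIT", "HIIT", "Yoga", "Yoga", "Walking", "Walking"],
    "duration_minutes": [60, 80, 65, 75, 70, 70],
    "calories_burned": [24.0, 28.0, 6.0, 7.0, 8.0, 8.5],
})

# Mean duration and mean calories burned per activity, highest burn first.
activity_means = (
    df.groupby("activity_type")[["duration_minutes", "calories_burned"]]
    .mean()
    .sort_values("calories_burned", ascending=False)
)
print(activity_means)
```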

★ 2.3 Activity Intensity ★

  • Looking for outliers in the dataset:
  • Using boxplots to visualize the data:
notion image
💡
There are outliers in the avg_heart_rate and calories_burned columns. The percentage of outliers in avg_heart_rate (0.27%) and calories_burned (2.54%) is small relative to the total dataset, so their influence on the overall averages is minimal, though some activities may naturally produce higher values.
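The page does not say how outliers were counted. A common choice, and the rule boxplots use for their whiskers, is the 1.5 × IQR rule, sketched below on made-up values; the helper name outlier_percentage is hypothetical.

```python
import pandas as pd

def outlier_percentage(series: pd.Series) -> float:
    """Percentage of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    is_outlier = (series < lower) | (series > upper)
    return is_outlier.mean() * 100

# Made-up values: one unusually high heart rate, no unusual calorie values.
df = pd.DataFrame({
    "avg_heart_rate": [120, 122, 125, 127, 130, 131, 133, 135, 138, 190],
    "calories_burned": [2.6, 2.9, 3.3, 5.0, 6.1, 7.4, 8.8, 10.7, 12.7, 13.0],
})

for column in ["avg_heart_rate", "calories_burned"]:
    print(f"{column}: {outlier_percentage(df[column]):.2f}% outliers")
```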
  • To visualize the data:
notion image
🔍
The data shows that heart rate remains similar across activities, while calories burned vary more significantly, with higher values for activities like HIIT and Running. Stress levels and fitness levels are relatively consistent across all activities, with minimal variation. High-intensity activities tend to burn more calories, while others show steadier heart rates and lower calorie burn.

★ღ 3. Health Conditions ★ღ✰

  • Percentage of Health Conditions:
  • To visualize the data:
notion image
  • Grouping health condition by gender:
  • To visualize the data:
notion image
  • Grouping health condition by age:
  • To visualize the data:
notion image
  • Subplotting the data to make it simpler:
notion image
🔍
The data shows that Hypertension and Diabetes are most common in Older Adults and Middle-Aged Adults, with women having more cases. Asthma is less prevalent, while the largest group has No Health Condition, especially in Adults. The data suggests a focus on managing Hypertension and Diabetes in older adults and women.
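The percentage and grouping steps in this section can be sketched as follows, on made-up rows where missing conditions have already been filled with 'No health condition'. Grouping by age category works the same way with age_category in place of gender.

```python
import pandas as pd

# Made-up entries for illustration only.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "F", "M"],
    "health_condition": [
        "Hypertension", "No health condition", "No health condition",
        "Diabetes", "No health condition", "No health condition",
        "Diabetes", "No health condition",
    ],
})

# Share of each health condition across all entries.
condition_pct = df["health_condition"].value_counts(normalize=True) * 100
print(condition_pct)

# Entry counts per condition and gender; unstack turns gender into columns,
# and fill_value=0 covers combinations that never occur.
by_gender = (
    df.groupby(["health_condition", "gender"]).size().unstack(fill_value=0)
)
print(by_gender)
```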

★ღ 4. Sleep and Recovery ★ღ✰

  • To know the unique values in the hours_sleep column:
  • Creating categories for the hours_sleep column:

★ 4.1 Sleep Demographics ★

  • Grouping gender and sleep_category columns:
  • To visualize the data:
notion image
  • Grouping age and sleep_category columns:
  • To visualize the data:
notion image
  • Subplotting the data to make it simpler:
notion image
🔍
Most people fall into the Optimal sleep category, with Older Adults and Young Adults having the highest counts. Short sleep is also common, especially among Older Adults and Young Adults, while Long and Very Short sleep are less common. Women have higher counts in the Optimal and Short categories, while men have slightly higher counts in the Very Short category.

★ 4.2 Sleep vs Stress ★

  • Grouping sleep and stress level columns:
  • To visualize the data:
notion image
🔍
The data shows minimal variation in stress levels across sleep categories: average stress is slightly higher in the Long category (5.27) and slightly lower in the Very Short category (5.20). Overall, sleep duration shows very little association with stress level in this dataset.
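A sketch of how the sleep categories and the per-category stress averages could be produced. The thresholds (under 5 hours, 5 to under 7, 7 to under 9, 9 or more) are an assumption, since the page does not list the ones actually used, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up entries for illustration only.
df = pd.DataFrame({
    "hours_sleep": [4.0, 4.5, 6.0, 6.5, 7.5, 8.0, 9.5, 10.0],
    "stress_level": [5, 6, 4, 6, 5, 5, 6, 4],
})

# ASSUMED thresholds; right=False gives [0, 5), [5, 7), [7, 9), [9, inf).
bins = [0, 5, 7, 9, np.inf]
labels = ["Very Short", "Short", "Optimal", "Long"]
df["sleep_category"] = pd.cut(
    df["hours_sleep"], bins=bins, labels=labels, right=False
)

# Mean stress level within each sleep category.
stress_by_sleep = (
    df.groupby("sleep_category", observed=True)["stress_level"].mean()
)
print(stress_by_sleep)
```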
 