Original article

Social media captures demographic and regional physical activity

Abstract

Objectives We examined the use of data from social media for surveillance of physical activity prevalence in the USA.

Methods We obtained data from the social media site Twitter from April 2015 to March 2016. The data consisted of 1 382 284 geotagged physical activity tweets from 481 146 users (55.7% men and 44.3% women) in more than 2900 counties. We applied machine learning and statistical modelling to demonstrate sex and regional variations in preferred exercises, and assessed the association between reports of physical activity on Twitter and population-level inactivity prevalence from the US Centers for Disease Control and Prevention.

Results The association between physical activity tweet patterns and physical inactivity prevalence varied by sex and region. Walking was the most popular physical activity for both men and women across all regions (15.94% (95% CI 15.85% to 16.02%) and 18.74% (95% CI 18.64% to 18.88%) of tweets, respectively). Men and women mentioned performing gym-based activities at approximately the same rates (4.68% (95% CI 4.63% to 4.72%) and 4.13% (95% CI 4.08% to 4.18%) of tweets, respectively). Among gym-based tweets, CrossFit was most popular among men (14.91% (95% CI 14.52% to 15.31%)), whereas yoga was most popular among women (26.66% (95% CI 26.03% to 27.19%)). Men mentioned engaging in higher intensity activities than women. Overall, counties with more physical activity tweets also had lower leisure-time physical inactivity prevalence for both sexes.

Conclusions The regional-specific and sex-specific activity patterns captured on Twitter may allow public health officials to identify changes in health behaviours at small geographical scales and to design interventions best suited for specific populations.

What are the findings?

  • Men mentioned engaging in higher intensity physical activities than women, which agrees with previous studies suggesting that women are less likely to meet recommendations for aerobic physical activity.

  • There were differences in the types of physical activities reported across the four US regions.

How might it impact clinical practice?

  • Differences in the types of physical activities reported across sex and regions in the US can encourage discussions between clinicians and patients regarding exercise choices for weight loss and cardiovascular health.

  • In the future, with patient consent, clinicians can use individual patient reports on physical activity posted on social media for personalized guidance.

Introduction

Insufficient physical activity is considered a modifiable risk factor for non-communicable diseases (such as cardiovascular diseases and diabetes) and has been associated with loss of life1 2 and significant global economic cost. In 2016, worldwide healthcare costs associated with physical inactivity were approximately $53.8 billion, and inactivity contributed to productivity losses of about $13.7 billion.3 The WHO member states have agreed to develop and implement policies aimed at reducing physical inactivity rates by 10% by 2025.4 5 Achieving this target requires timely surveillance of physical activity prevalence across populations.

In order to decrease physical inactivity prevalence, it is important to target interventions towards at-risk groups and regions with higher prevalence. However, estimating inactivity prevalence using traditional survey approaches can be costly and delayed, and may be subject to social desirability or recall bias.6–8 Moreover, estimates of inactivity prevalence may not include information regarding which forms of exercise individuals prefer.

Digital technology, including sensors found in cell phones and wrist bands9–11 and mobile fitness applications (such as RunKeeper or Strava), may be used to document physical activity.12–14 While these tools provide valuable information about movement, social media applications (such as Twitter and Instagram) can provide insight into both preferred activities and attitudes towards physical activity.15 16 Reports of physical activity on these platforms are not restricted to preset choices, thereby enabling the use of descriptive textual information that may publicly capture a diverse array of preferred activities, exercise intensity and attitudes towards physical activity in real time and at scale. Timely estimates of physical activity prevalence from combining social media data with other data sources could be useful for monitoring spatial and temporal trends and for augmenting traditional methods for monitoring physical activity.

Here, we use data from Twitter to assess sex-specific similarities and differences in self-reported leisure-time physical activity across US counties and regions. First, we assess differences in the types of activities in which users report engaging and the intensity of these activities as measured by calories burned in 30 min of activity. Second, we quantify the association between estimates of physical inactivity from the US Centers for Disease Control and Prevention (CDC) and physical activity postings and sentiment while controlling for internet search activity, county demographics and environmental variables associated with physical activity.

Methods

Extraction of Twitter data

Mentions of physical activity were retrieved from the Twitter streaming application programming interface (API) from April 2015 to March 2016 using a set of 376 keywords (online supplementary table S1) gathered from fitness questionnaires and apps that document a range of activities.17 18 These included team sports, gym exercises, outdoor recreational activities and household chores. These data were collected as part of a larger project designed to assess the association between happiness and health indicators constructed from social media data with known public health measures. Initial findings using these data are reported in cited references.19 20

Table 1 Mixed-effects regression for county-level, female-specific inactivity by region

Data processing

Several steps were taken to clean the data and ensure their reliability. Content from users above the 99th percentile of tweet activity was removed. A keyword-matching algorithm was used to identify relevant tweet content. Tweets containing popular phrases that denote irrelevant content (eg, ‘walk away’ or ‘running late’) and mentions of the television show ‘Walking Dead’ were removed. For team sports, only tweets that contained the word ‘play’, ‘playing’ or ‘played’ in conjunction with the activity were retained. This was to ensure that only tweets indicating engagement in physical activity were used in our analysis. To assess the performance of this algorithm, a subset of categorised tweets was compared with hand-labelled tweets. The accuracy was 85% with an F1 score (defined as the harmonic mean of precision and recall/sensitivity; 1 is the highest value) of 0.90. We also tested several supervised machine learning classifiers, including a feed-forward neural network, support vector machines (SVMs), gradient boosting and fastText.21 The keyword-matching algorithm performed best.
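The filtering rules described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the keyword sets are small hypothetical stand-ins for the 376 keywords in online supplementary table S1, and the function names are our own.

```python
import re

# Hypothetical miniature keyword lists (the study used 376 keywords).
ACTIVITY_KEYWORDS = {"walk", "walking", "run", "running", "yoga", "hike", "hiking", "swim"}
TEAM_SPORTS = {"basketball", "soccer", "volleyball"}
# Phrases that signal irrelevant content, as described in the text.
EXCLUDED_PHRASES = ("walk away", "running late", "walking dead")
PLAY_PATTERN = re.compile(r"\bplay(?:ing|ed)?\b")


def is_activity_tweet(text):
    """Return True if the tweet appears to report engagement in physical activity."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in EXCLUDED_PHRASES):
        return False
    tokens = set(re.findall(r"[a-z]+", lowered))
    if tokens & ACTIVITY_KEYWORDS:
        return True
    # Team sports count only when paired with play/playing/played.
    if tokens & TEAM_SPORTS:
        return bool(PLAY_PATTERN.search(lowered))
    return False


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Comparing the output of `is_activity_tweet` with hand-labelled tweets via `f1_score` mirrors the evaluation step described above, although the labelled sample itself is not reproduced here.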

Exercise intensity was calculated as calories burned during 30 min of a particular physical activity by a 155 lb individual, the average weight of an American adult.17 22 Each tweet was mapped to a US county based on its geocode (ie, latitude and longitude). See online supplementary table S2 for a sample of tweets. Note that select words and characters in these tweets (online supplementary table S2) have been changed to maintain user privacy (eg, ‘went running with the bf’ may be changed to ‘went running with my bf’).23
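As a rough illustration of this intensity measure, the sketch below averages per-activity calorie values over the tweets mapped to each county. The calorie figures are placeholders chosen for the example, not the values used in the study, and the county identifiers and function name are hypothetical.

```python
from collections import defaultdict

# Placeholder values (kcal per 30 min for a 155 lb person); illustrative only.
CALORIES_PER_30_MIN = {"walk": 150.0, "yoga": 120.0, "run": 300.0}


def county_mean_intensity(tweets):
    """Average calories per 30 min over each county's recognised activities.

    `tweets` is an iterable of (county_id, activity) pairs; activities with no
    calorie value are skipped.
    """
    totals = defaultdict(lambda: [0.0, 0])  # county -> [sum of kcal, count]
    for county, activity in tweets:
        kcal = CALORIES_PER_30_MIN.get(activity)
        if kcal is None:
            continue
        totals[county][0] += kcal
        totals[county][1] += 1
    return {county: total / n for county, (total, n) in totals.items()}
```

Because every mention of an activity receives the same fixed value, this measure reflects the mix of activities mentioned rather than actual exertion or duration.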

Table 2 Mixed-effects regression for county-level, male-specific inactivity by region

Analysing sentiment

A maximum entropy text classifier in the Java-based MALLET toolkit and ground truth data from Kaggle, Sentiment 140 and Sanders Analytics were used to categorise tweet sentiment and to assign each tweet a ‘happiness’ score between 0 and 1. As previously noted, this dataset was originally constructed to assess the association between happiness and health indicators constructed from social media data with public health outcomes. Tweets with a score of 0.80 or higher were classified as ‘happy’. Sentiment, in the context of this study, represents the proportion of tweets that are labelled as happy. For more information on data processing and sentiment analysis, refer to cited references.19 20
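The sentiment measure defined above reduces to a thresholded proportion. A minimal sketch follows, assuming the per-tweet happiness scores have already been produced by the classifier; the function name is ours.

```python
HAPPY_THRESHOLD = 0.80  # tweets scoring at or above this are labelled 'happy'


def happy_proportion(scores, threshold=HAPPY_THRESHOLD):
    """Proportion of tweets labelled happy; None when there are no tweets."""
    if not scores:
        return None
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

For example, scores of 0.9, 0.8, 0.5 and 0.1 give a sentiment value of 0.5, since two of the four tweets meet the threshold.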

Classification of Twitter users’ sex

We applied a previously developed gender classifier that employs a weighted ensemble approach and uses features/information from users’ names to predict whether users are male or female. We acknowledge that there is a distinction between gender and sex (West and Zimmerman, 1987), but we use gender estimates as a proxy for sex in order to be consistent with CDC measures. This technique combines three classification approaches: (a) matching users’ first names to data from the US Social Security Administration24 (which captures approximately 60% of Twitter names), (b) an SVM classifier applied to word and character n-gram features from users’ names25 and (c) a decision tree classifier applied to features constructed from the linguistic structure of users’ names, including the count of syllables, vowels, consonants, bouba (round) and kiki (sharp) vowels and consonants,26 27 and whether or not the last character is a vowel.28 For each user, we combined the predictions from all three classifiers using a weighted stacked logistic regression framework.29 The ensemble classifier achieved an accuracy of 0.83, a recall of 0.85 and an F1 score of 0.84. The ensemble classifier performed better than methods b and c, and, unlike method a, captured all users with alphanumeric names. See Cesare et al for a detailed description of this classification framework.30 31
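To make the ensemble idea concrete, the sketch below shows two pieces: a few of the name-structure features mentioned for method c, and a logistic combination of three base-classifier probabilities. The weights, bias, centring at 0.5 and treatment of ‘y’ as a consonant are all assumptions made for illustration; the fitted parameters and exact feature definitions of the published classifier are not given in this article.

```python
import math

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def name_features(name):
    """Simple structural features of a first name (illustrative subset)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    vowels = sum(1 for ch in letters if ch in VOWELS)
    return {
        "vowels": vowels,
        "consonants": len(letters) - vowels,
        "ends_with_vowel": bool(letters) and letters[-1] in VOWELS,
    }


def stacked_probability(p_ssa, p_svm, p_tree, weights=(1.5, 1.0, 0.8), bias=0.0):
    """Combine three base probabilities with a logistic (stacking) layer.

    Each input is the probability that the user is female according to one
    base classifier. Inputs are centred at 0.5 so that three uninformative
    predictions yield 0.5. The weights are hypothetical, not fitted values.
    """
    z = bias + sum(w * (p - 0.5) for w, p in zip(weights, (p_ssa, p_svm, p_tree)))
    return 1.0 / (1.0 + math.exp(-z))
```

In a fitted stacking model the weights would be learnt from labelled users, so that more reliable base classifiers contribute more to the final prediction.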

Leisure-time physical inactivity estimates

We obtained 2009–2013 county-level measures of leisure-time physical inactivity from the US CDC. These measures were generated from self-reported physical activity engagement from the Behavioral Risk Factor Surveillance System survey using small-area estimation techniques.32 Note that the most recent estimates from 2013 do not overlap with our Twitter data, which were collected between 2015 and 2016. We therefore used linear autoregressive models to forecast physical inactivity prevalence 2 years ahead. When predicting 2013 inactivity prevalence based on 2011 data, our models captured approximately 88% and 83% of the variation in inactivity estimates for men and women, respectively. We applied the same models to forecast physical inactivity for 2015.
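One simple reading of this forecasting step is a lag-2 linear regression fitted across counties: regress 2013 prevalence on 2011 prevalence, check the variance explained, then apply the fitted line to 2013 values to obtain 2015 forecasts. The sketch below follows that reading. The exact model specification used in the study is not stated here, so the single-lag form is an assumption, and the numbers in the usage note are invented.

```python
def fit_lagged_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forecast_two_years_ahead(latest, intercept, slope):
    """Apply the fitted lag-2 relationship to the most recent estimates."""
    return [intercept + slope * value for value in latest]
```

With invented county prevalences of 20, 25, 30 and 35 for 2011 and 21, 26, 31 and 36 for 2013, the fit gives an intercept and slope of 1, and applying it to the 2013 values yields 2015 forecasts of 22, 27, 32 and 37.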

Google trends

Google Trends provides an index between 1 and 100 representing relative search volume related to terms and topics across time and space. We searched for terms related to physical activity and fitness fads, including ‘intermittent fasting’, ‘workout’, ‘fitness centre’, ‘gym’, ‘weight loss’ and ‘physical fitness’ within the same time span as our Twitter data (April 2015–March 2016). To select which terms may be relevant in our model, we analysed the correlation between each term and inactivity prevalence, as well as the correlation between each term and the others to avoid multicollinearity. We selected the terms fitness centre and weight loss for this analysis. The geographical distribution of Google Search Index values used can be found in online supplementary figure S1.

Statistical analysis

To examine the association between indicators of physical activity constructed from Twitter data and county-level estimates of physical inactivity based on CDC data, we used a series of linear mixed-effects regression models with varying state intercepts.33

We accounted for socioeconomic and demographic variables that have been associated with inactivity prevalence.34–39 These included median household income, racial/ethnic composition and median county age. Per cent non-Hispanic white was strongly correlated with per cent non-Hispanic black and per cent Hispanic. We therefore used only per cent non-Hispanic black and per cent Hispanic. These data were obtained from the 2015 5-year American Community Survey.

We also accounted for environmental factors that may impact community health, including community and road safety and access to usable exercise space.40–42 We obtained data on the per cent of individuals in each county who have access to exercise space, the violent crime rate in the county and the rate of driving deaths (measured as the total number of driving deaths divided by the county population) from the County Health Rankings and Roadmaps project.43 Data on walkability were unavailable at the county level. We report our findings with 95% CI.
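A linear mixed-effects model with varying state intercepts, as described in this section, can be written in the following general form. The notation is ours and is offered as a sketch: the normality assumptions are the conventional ones for such models and are not stated explicitly in the text.

```latex
\begin{align}
y_{cs} &= \beta_0 + u_s + \boldsymbol{\beta}^{\top}\mathbf{x}_{cs} + \varepsilon_{cs}, \\
u_s &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{cs} \sim \mathcal{N}(0, \sigma^2)
\end{align}
```

Here $y_{cs}$ is the forecasted leisure-time physical inactivity prevalence for county $c$ in state $s$, $\mathbf{x}_{cs}$ collects the county-level covariates (Twitter indicators, Google search indices, and the demographic, socioeconomic and environmental variables), $\boldsymbol{\beta}$ holds the corresponding fixed-effect coefficients, and $u_s$ is the state-specific deviation from the overall intercept $\beta_0$.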

Patient and public involvement

There were no patients involved in this study. We used publicly available data, but members of the public were not involved to comment on study design, to interpret results or to contribute to the writing and editing of this document.

Results

There were sex and regional differences in physical activities

Our analysis included 1 382 284 geotagged physical activity tweets (from 80 million tweets collected) posted by 481 146 users (55.65% men and 44.35% women) in 2992 counties for men and 2932 counties for women. We grouped our findings into four geographical regions in the USA: West, South, Northeast and Midwest.44 Sentiment towards exercise was distributed with some uniformity across the USA (online supplementary figure S2). Overall, men and women shared similar sentiments towards physical activity (sentiment scores 0.660 (95% CI 0.660 to 0.661) and 0.657 (95% CI 0.656 to 0.657), respectively).

The top exercise terms were ‘walk’, ‘dance’, ‘golf’, ‘workout’, ‘run’, ‘pool’, ‘hike’, ‘yoga’, ‘swim’ and ‘bowl’. Walking was the most popular physical activity for both groups across all regions. We note overall and sex-specific regional variations in preferred activities (figure 1). For women, hiking was the second most popular activity in the West, representing 15.18% (95% CI 15.01% to 15.35%) of tweets. This activity represented only 3.24% (95% CI 3.13% to 3.35%) to 3.79% (95% CI 3.71% to 3.87%) of tweets in the Midwest and South, respectively. Mentions of participation in yoga also varied by region for women, representing 6.97% (95% CI 6.83% to 7.12%) of tweets in the Northeast, but only 3.87% (95% CI 3.79% to 3.95%) of tweets in the South. We saw similar patterns for hiking among men, representing 12.23% (95% CI 12.10% to 12.36%) of tweets in the West but only 2.38% (95% CI 2.30% to 2.47%) to 4.31% (95% CI 4.20% to 4.42%) of tweets elsewhere. Golf also varied in popularity among men, representing 12.29% (95% CI 12.11% to 12.48%) of tweets in the Midwest but only 7.85% (95% CI 7.72% to 8.00%) of tweets in the Northeast.

Figure 1

The 10 most frequently mentioned activities and the proportion of tweets represented by sex and region.

Men and women mentioned performing gym-based activities at approximately equivalent rates (4.68% (95% CI 4.63% to 4.72%) and 4.13% (95% CI 4.08% to 4.18%) of tweets, respectively). CrossFit was the most popular workout class among men (14.91% (95% CI 14.52% to 15.31%) of gym-based tweets), whereas yoga was the most popular workout class among women (26.66% (95% CI 26.03% to 27.19%) of gym-based tweets). However, there were some differences, although not significant, in the estimated intensity of exercises reported by men and women as measured in calories burned (figure 2). The average estimated calories burned per 30 min of reported exercise was 201.27 (95% CI 201.01 to 201.54) for men, and 191.66 (95% CI 191.37 to 191.95) for women. There were also regional variations in reported exercise intensity within sex. Women in the West reported exercises with the highest average caloric expenditure (ie, 194.78 (95% CI 194.25 to 195.31)), followed by the Northeast (193.26 (95% CI 192.60 to 193.92)), the Midwest (192.62 (95% CI 191.92 to 193.32)) and the South (187.96 (95% CI 187.48 to 188.44)). In contrast, men in the Midwest reported exercises with the highest caloric expenditure (202.71 (95% CI 202.07 to 203.36)), followed by the South (202.58 (95% CI 202.13 to 203.04)), the West (200.34 (95% CI 199.86 to 200.83)) and the Northeast (198.93 (95% CI 198.32 to 199.54)). The largest sex disparities were noted in counties within Southern states; the average difference between men and women was 8.51 calories per 30 min of activity.

Figure 2

Comparison, by state and region, of calories burned estimated from the physical activities mentioned by men and women on social media. Sex-based disparities are, on average, larger in the South.

Overall, counties that reported higher levels of physical activity on Twitter also had lower physical inactivity prevalence

The proportion of exercise tweets in a county was negatively associated with leisure-time physical inactivity prevalence for both men and women across regions (see figure 3). These correlations were strongest in the Northeast (r=−0.234 and −0.373 for men and women, respectively) and in the West (r=−0.217 and −0.267 for men and women, respectively). The national association between tweet sentiment and physical inactivity was similar for both men and women (r=−0.115 for men and r=−0.116 for women), but regional disparities exist. This relationship was stronger for men in the West (r=−0.194 for men and r=−0.076 for women) and stronger for women in the Northeast (r=−0.271 for women and r=−0.063 for men). There was a weak negative association between exercise intensity and physical inactivity for both groups (r=−0.061 for men and −0.001 for women), but stratified by region, this effect was much stronger for men in the West (r=−0.203) and the Midwest (r=−0.123). The association between each Twitter variable and inactivity prevalence by sex and region can be found in online supplementary table S3.
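The r values reported above are correlation coefficients between a county-level Twitter variable and inactivity prevalence. Assuming they are Pearson correlations (the text does not name the coefficient), the calculation can be sketched as follows; the function name and the sample data in the usage note are ours.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)
```

For invented data in which the exercise tweet proportion rises from 0.02 to 0.06 across three counties while inactivity prevalence falls from 30% to 20%, the function returns a value of −1, the limiting case of a perfectly negative linear association.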

Figure 3

The relationship between model-estimated and Centers for Disease Control and Prevention-forecasted inactivity prevalence based on mixed-effects linear models that control for measures of physical activity via Twitter, demographic variables and built environment contextual variables. Lines represent a linear fit.

The association between Twitter variables and inactivity remained in models that accounted for demographic, socioeconomic and built environment variables associated with physical inactivity (tables 1 and 2). This relationship was statistically significant for all regions for men, and all regions except the Midwest for women. Also, counties with more positive sentiment towards physical activity had lower inactivity prevalence in the West for both men and women, and in the Midwest for women. Furthermore, counties that reported high-intensity exercises on Twitter also had lower inactivity prevalence for men in the Midwest and the Northeast. There was no significant relationship between exercise intensity and physical inactivity prevalence for women.

We also observed different patterns in the association between Google searches for fitness centres and weight loss and physical inactivity prevalence in the two demographic groups. Specifically, counties in the Northeast with higher searches for ‘fitness centres’ also had lower physical inactivity for women, while counties in the Northeast and South with higher searches for weight loss had higher inactivity for women. Among men, counties with higher searches for fitness centre had lower inactivity prevalence, while counties with higher weight loss searches had higher inactivity prevalence in the Midwest. The directionality of these relationships suggests that populations seeking weight loss information online tend to have higher physical inactivity prevalence, while those seeking information on fitness centres are more likely to be active.

Notable disparities exist in the 2013 and 2015 (forecasted) prevalence of physical inactivity between men and women, with 79.7% and 78.4% of counties showing higher prevalence for women, respectively (figure 43). Our estimates of physical inactivity prevalence using physical activity tweet volume and sentiment, and Google search volume, while controlling for county demographics, and access to exercise space, are overall reflective of the disparities reported in CDC physical inactivity prevalence estimates. Overall, out-of-sample estimates are better for women than for men (average r=0.89 for women, average r=0.82 for men). Correlations between estimated and actual values were higher in the South for both men and women (r=0.79 and r=0.82, respectively).

Discussion

This is the first study to assess the effectiveness of using Twitter for monitoring physical activity for men and women separately. Our main findings were (1) there were sex and regional differences in physical activities reported on Twitter, and (2) counties that reported higher levels of physical activity on Twitter tended to have lower physical inactivity prevalence based on survey estimates from the US CDC.

Some regions demonstrated higher associations between Twitter postings and survey estimates of physical inactivity from the CDC than others. For example, both tweet volume and sentiment were negatively associated with inactivity in the West for men and women. This agrees with research suggesting that the lowest inactivity prevalence in the USA is in the West.37 Furthermore, the popularity of ‘hiking’—an outdoor activity—in the West comports with research that finds a negative association between urbanicity and inactivity in this region.45 High levels of inactivity have been documented in the South, particularly among women.46 47 Interestingly, we noted a strong negative relationship between tweet volume and inactivity, and a strong positive association between Google searches for weight loss and inactivity in this region. These may be important measures to monitor for addressing female-specific inactivity in this region.

Our findings also support research that states women are less likely to meet federal physical activity guidelines compared with men.48 Women reported lower intensity exercises on Twitter compared with men, particularly in the South. Public health officials could focus on promoting other forms of leisure-time physical activity, such as transportation-based activity, which has documented health benefits.49 50 Future analysis will focus on assessing deviations in activities undertaken by diverse sex and age populations during different times of the year, which may lead to more effective targeting of interventions.

Sex and regional deviations indicate that social media-based physical activity interventions should not be applied uniformly across the USA. Monitoring physical activity prevalence using social media and other digital sources can enable timely, geographically fine-grained estimates compared with surveys, thereby allowing for early intervention aimed at improving health over the life course. However, social media use varies by sex51 and may also vary by place of residence, age or ethnicity. Public health officials should understand how individuals within specific groups use social media and target areas using appropriate platforms.

Findings from this study also suggest the need to combine data from Twitter with other digital sources because Twitter users may be more likely to report physical activities that involve social interaction. Combining Twitter data with other data sources might mitigate limitations inherent in a single digital source.

Limitations

The measure of exercise intensity is arbitrary since actual reported exercise duration is unknown. This might have affected the reported similarities and differences in exercise intensity across sex and regions. There are also several forms of bias that may impact measures of physical activity from Twitter. For one, individuals have a tendency to over-report or overestimate their actual activity time and exertion.52–54 Additionally, reports of physical activity on Twitter may be subject to selective residential bias. Individuals who are more likely to electively discuss personal physical activity may elect to live in areas that are more socially and structurally amenable to physical activity.55–58 Furthermore, reports of physical activity on Twitter may not precisely correspond with engagement in physical activity; individuals may reflect on recent activity or express intent to engage in a physical activity. Finally, these data are subject to representation bias as well. Because our unit of analysis is the county, regions in which county density is proportionally lower are less likely to be represented. Given that several of the most densely populated states are in the Northeast, for instance (eg, Rhode Island, Massachusetts and Connecticut),59 this region may be over-represented in our data.

One other limitation is the lack of county-level estimates of leisure-time physical inactivity from the CDC for 2015. We address this limitation by using autoregressive linear models to predict values for 2015 based on data from 2009 to 2013. These estimates are approximate, and interpretation of coefficients must consider this uncertainty. Future work will develop processes for combining non-traditional data sources to estimate small-area health outcomes, which are currently delayed by years.

Conclusions

Digital data, including social media, provide valuable information for monitoring health behaviours.16 19 20 60–65 This study illustrates that Twitter is a useful tool for measuring small-area trends in physical activity, an important risk factor for non-communicable diseases, but that its usefulness might vary by sex and by US region. Monitoring physical activity using social media will allow public health officials to identify changes in health behaviours at small geographical scales across the USA. Findings from this study provide an important step in this direction.