Subtitle: Can we Really Use Strava Data as a Proxy for Total Cycling Traffic?
Wouldn’t it be great if you could log all cycle activity to create a map of where every cyclist actually rode? With that sort of data you could target new cycling infrastructure to fulfil the cycle “desire lines” and simultaneously minimise the capital outlay on designing and building routes that few cyclists will want to use. Cycle tracking apps exist and are used by a subset of riders, but if you are ever bold enough to suggest that the aggregated data from these is sufficient to model the bulk of cyclists’ activity, be prepared to mount a robust defence. Here is mine, based on data from the Strava tracking app.
In cycle campaigning circles, when Strava is mentioned, objections will generally fly regarding its validity as a campaigning tool: “Strava is mainly used by men, it’s the Lycra crowd, users are sports obsessives, users are a self-selecting sub-group” and so on. There is truth in the objections; many users do have some or all of these characteristics, but does this stop the aggregated tracking data from being meaningful for cyclists’ behaviour in general? In practical terms, the subset of cyclists who use the app generate a map of the roads they prefer to use (see the dedicated Strava heatmap portal), with more frequent road usage showing as brighter colours and wider tracks on the heatmap. If you cycle and look at the Strava heatmap, you will probably notice that the roads in your area showing high rates of Strava activity are likely the ones you use yourself, Strava user or not. Can we thus apply Strava activity, as illustrated by the heatmap, to cycle usage in general, or is it highly skewed by the self-selecting user group?
CONSIDERATIONS & ASSUMPTIONS
Although Strava users are a self-selecting group, their recorded cycling activity covers the whole road network, and records are made 365 days a year, 24 hours a day. By contrast, “gold-standard” roadside cycle surveys capture all cyclists but are very limited in time, usually a few hours or days, and even more limited in space, covering just a few selected point locations in any given region. Table 1 summarises some of the pluses and minuses of Strava vs roadside survey data.
One can thus objectively say that neither data type is an ideal record of cycling activity, each having its own real or perceived limitations. Nonetheless, despite their limitations, we can take the roadside surveys as an accurate record of cycle activity at specific locations and specific times. If we can then show a relationship between the survey counts and the Strava activity at those locations, we can demonstrate that Strava data could be used to model cycling activity away from the survey locations.
One of the properties of the Strava dataset is an aggregation of the total number of cycle journeys over a road segment during a calendar year, denoted by the parameter TACTCNT (total activity count). Strava road segments are typically >200m in length, and it is possible to cross-plot the TACTCNT values against the cycle counts at the survey locations that lie on the relevant Strava segment. In the county of East Renfrewshire, located to the SW of Glasgow, Scotland (Figure 1), I have access to both the Strava data and roadside survey data (Figure 2) for the years 2014 and 2015. Simple cross-plots of these two datasets were made, and the results for the two years are shown as Figures 3 to 6.
Figure 3 shows the cross-plot for 2014. The correlation is clear; higher Strava usage correlates well with higher survey counts. However, a single survey location on the A77 (survey point 15) appears anomalous. If this is excluded, the R² value improves from 0.784 to 0.9471 (Figure 4). The data for the year 2015 show no anomalous points and a very good correlation coefficient of 0.9249 is revealed (Figure 5). In Figure 6, the data from both years are aggregated and specific roads highlighted. Both datasets concur that the busiest road for cycling in East Renfrewshire is far and away the main road, the A77 (survey points 7, 14, 15, 16, 17), with other major roads also taking a major share of the cycle traffic (Figure 6). The side roads have little traffic in comparison, though the direct correlation between the datasets is poorer at these lower usage levels.
(Note: Correlating the roadside survey data with the Strava activity on the specific survey days is possible in principle, but that granularity of data was not available to me.)
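For readers who want to try the same kind of cross-plot check on their own data, here is a minimal Python sketch of the R² calculation with and without a suspect survey point. It is not the workflow used to produce Figures 3 to 6. The function name `r_squared`, the data layout and every number in `points` are hypothetical and purely illustrative; they are not the East Renfrewshire values.

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares straight-line fit of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a series with no variation has no defined R-squared")
    # For a one-variable linear fit, R-squared is the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# HYPOTHETICAL data: survey point id -> (annual Strava TACTCNT, survey count).
# These numbers are invented for illustration only.
points = {
    1: (500, 12),
    2: (1500, 30),
    3: (4000, 85),
    4: (9000, 180),
    5: (12000, 250),
    15: (11000, 90),  # deliberately off-trend, standing in for an anomalous point
}


def r2_for(point_ids):
    """R-squared of survey count against TACTCNT for the chosen survey points."""
    xs = [points[p][0] for p in point_ids]
    ys = [points[p][1] for p in point_ids]
    return r_squared(xs, ys)


r2_all = r2_for(list(points))
r2_excluding_15 = r2_for([p for p in points if p != 15])

print(f"R-squared, all points:         {r2_all:.3f}")
print(f"R-squared, point 15 excluded:  {r2_excluding_15:.3f}")
```

Dropping a point should always be justified by something on the ground, such as a known problem with that survey, rather than by the improvement in fit alone.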
- Despite some data uncertainties, the correlations between the East Renfrewshire roadside surveys and the 2014 and 2015 Strava data are good to very good.
- Roads that are shown to be busy by survey are also shown to be busy on Strava, but on side roads with lower cycle usage, Strava data is less useful as a proxy.
- Strava thus arguably offers an excellent method to infill between limited survey points. Despite the self-selecting nature of Strava users, their cycling behaviour closely mirrors the behaviour of the cycling population as a whole.
- The analysis shows that cyclists in East Renfrewshire (as elsewhere) have a strong preference for moving along the main roads, with the A77 revealed as the primary cycle corridor in the county (slide 2). Other major roads are also busy. Side roads have limited cycle traffic in comparison.
- Strava data can therefore be adopted with confidence in unsurveyed areas as a proxy for total cycle traffic, but with less confidence on low-usage roads. Note that there are several major roads in East Renfrewshire that show high cycle traffic on Strava but have no survey points. Without Strava data, the cycle traffic on these roads would be in considerable doubt.
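The "infill" idea in the points above can be sketched in a few lines of Python: fit a straight line to the surveyed locations, then apply it to the TACTCNT of segments that have no survey. This is only one simple way to do it, under the assumption that the survey-to-Strava relationship is roughly linear and holds away from the survey points. The helper names, segment labels and all numbers are hypothetical, and estimates for low-usage side roads should be treated with the caution noted above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_counts(tactcnt_by_segment, slope, intercept):
    """Estimated survey-equivalent count per unsurveyed segment, floored at zero."""
    return {
        segment: max(0.0, intercept + slope * tactcnt)
        for segment, tactcnt in tactcnt_by_segment.items()
    }


# HYPOTHETICAL surveyed locations: annual Strava TACTCNT and the matching survey count.
surveyed_tactcnt = [500, 1500, 4000, 9000, 12000]
surveyed_counts = [12, 30, 85, 180, 250]

# HYPOTHETICAL unsurveyed segments: label -> annual Strava TACTCNT.
unsurveyed = {"major road, no survey point": 8000, "side road": 300}

slope, intercept = fit_line(surveyed_tactcnt, surveyed_counts)
estimates = estimate_counts(unsurveyed, slope, intercept)

for segment, count in estimates.items():
    print(f"{segment}: about {count:.0f} cyclists per survey period")
```

The estimate is expressed in the same units as the survey counts used for the fit, so surveys of different durations would need to be normalised to a common period first.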
(Strava data provided under licence courtesy of http://www.Strava.com and Urban Big Data Centre, University of Glasgow, https://www.ubdc.ac.uk. Roadside survey data provided courtesy of East Renfrewshire Council.)