Take-home Ex 2

Published

December 7, 2023

Modified

December 18, 2023

Singapore Train commuter flows

1 Project Brief

As city-wide urban infrastructure such as public buses, mass rapid transit, public utilities and roads become digital, the data sets obtained can be used for tracking movement patterns through space and time. This is particularly true with the recent trends in massive deployment of pervasive computing technologies such as GPS on vehicles and SMART cards used by public transport commuters.

Unfortunately, this explosive growth of geospatially-referenced data has far outpaced the planner’s ability to utilize and transform the data into insightful information. There has not been significant practice research carried out to show how these disparate data sources can be integrated, analysed, and modelled to support policy making decisions, and a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.

This study aims to demonstrate the potential value of GDSA to integrate publicly available data from multiple sources for building spatial interaction models to determine factors affecting urban mobility patterns in public bus transit.

2 Loading Required packages

The following packages are used in this exercise:

  • tmap for cartography
  • mapview for interactive map backgrouds & leaflet.providers for basemap customisation
  • sf and sp for geospatial data handling
  • stplanr for creating ‘desire lines’
  • tidyverse and reshape2 for aspatial data transformation
  • sfdep and spdep for computing spatial autocorrelation
  • Hmisc for summary statistics
  • kableExtra and DT for formatting of dataframes
  • ggplot2, patchwork and ggrain for visualising attributes
  • performance ang ggpubr for Spatial Interaction modeling
  • urbnthemes for consistent plot themes
  • scales for ggplot axis break formatting
code block
options(repos = c(CRAN = "https://cran.rstudio.com/"))

pacman::p_load(tmap, sf, sp, tidyverse, sfdep, stplanr,
               mapview, leaflet.providers,
               Hmisc, kableExtra, DT, reshape2,
               ggplot2, patchwork, ggrain, urbnthemes, knitr,
               performance, ggpubr, scales)

3 Importing the Data

3.1 Aspatial Data

This study uses 2 aspatial datasets pertaining to Public Train Trips:

  • train, a dataset from LTA Datamall, Passenger Volume by Origin Destination Train Stations for October 2023.
  • train_codes, also from LTA Datamall, lists train station codes and names. This is used to assign MRT train station codes to the geospatial data set with train station locations.

The downloaded dataset is in .csv format. We use the function read_csv() to import the data into the R environment.

train_oct23 <- read_csv("data/aspatial/origin_destination_train_202310.csv")

# remove any duplicated rows
train_oct23 <- distinct(train_oct23)

str(train_oct23)
tibble [800,595 × 7] (S3: tbl_df/tbl/data.frame)
 $ YEAR_MONTH         : chr [1:800595] "2023-10" "2023-10" "2023-10" "2023-10" ...
 $ DAY_TYPE           : chr [1:800595] "WEEKENDS/HOLIDAY" "WEEKDAY" "WEEKENDS/HOLIDAY" "WEEKDAY" ...
 $ TIME_PER_HOUR      : num [1:800595] 9 6 12 12 12 12 19 19 9 9 ...
 $ PT_TYPE            : chr [1:800595] "TRAIN" "TRAIN" "TRAIN" "TRAIN" ...
 $ ORIGIN_PT_CODE     : chr [1:800595] "EW32" "BP4" "NE15" "SW5" ...
 $ DESTINATION_PT_CODE: chr [1:800595] "DT24" "EW31" "SW5" "NE15" ...
 $ TOTAL_TRIPS        : num [1:800595] 1 22 51 87 48 73 1 3 1 2 ...

train_oct23 is a tibble dataframe consisting of the following variables:

  • YEAR_MONTH: Month of data collection in YYYY-MM format
  • DAY_TYPE: Category of Day
  • TIME_PER_HOUR: Extracted hour of day
  • PT_TYPE: Public transport type
  • ORIGIN_PT_CODE: ID of Trip Origin Train Station
  • DESTINATION_PT_CODE: ID of Trip Destination Train Station
  • TOTAL_TRIPS: Sum of trips made per origin-Destination
Hmisc::describe(train_oct23)
train_oct23 

 7  Variables      800595  Observations
--------------------------------------------------------------------------------
YEAR_MONTH 
       n  missing distinct    value 
  800595        0        1  2023-10 
                  
Value      2023-10
Frequency   800595
Proportion       1
--------------------------------------------------------------------------------
DAY_TYPE 
       n  missing distinct 
  800595        0        2 
                                            
Value               WEEKDAY WEEKENDS/HOLIDAY
Frequency            424314           376281
Proportion             0.53             0.47
--------------------------------------------------------------------------------
TIME_PER_HOUR 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  800595        0       20    0.997    14.01    6.042        6        7 
     .25      .50      .75      .90      .95 
      10       14       18       21       22 
                                                                            
Value          0     5     6     7     8     9    10    11    12    13    14
Frequency   4083 23741 37878 42772 43653 43289 43466 44412 45463 45298 44883
Proportion 0.005 0.030 0.047 0.053 0.055 0.054 0.054 0.055 0.057 0.057 0.056
                                                                
Value         15    16    17    18    19    20    21    22    23
Frequency  45358 46141 47505 46815 44605 42622 41491 38360 28760
Proportion 0.057 0.058 0.059 0.058 0.056 0.053 0.052 0.048 0.036

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
PT_TYPE 
       n  missing distinct    value 
  800595        0        1    TRAIN 
                 
Value       TRAIN
Frequency  800595
Proportion      1
--------------------------------------------------------------------------------
ORIGIN_PT_CODE 
       n  missing distinct 
  800595        0      171 

lowest : BP10 BP11 BP12 BP13 BP2 , highest: TE4  TE5  TE6  TE7  TE8 
--------------------------------------------------------------------------------
DESTINATION_PT_CODE 
       n  missing distinct 
  800595        0      171 

lowest : BP10 BP11 BP12 BP13 BP2 , highest: TE4  TE5  TE6  TE7  TE8 
--------------------------------------------------------------------------------
TOTAL_TRIPS 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
  800595        0     4672    0.998      104    171.3        1        1 
     .25      .50      .75      .90      .95 
       4       15       60      214      455 

lowest :     1     2     3     4     5, highest: 16799 17670 18763 19585 21050
--------------------------------------------------------------------------------

From the summary statistics above, we can derive that:

  • There are 800,959 Origin-Destination (OD) trips made in October 2023
  • Data is collected for 24 hours, starting from 0 Hrs to 23 Hrs in TIME_PER_HOUR
  • The highest number of trips for an OD route is 21,050, while the 95th Percentile is only 455. This suggests a highly right-skewed distribution, with particularly busy routes.

As we are interested in studying the passenger flows for peak hour periods only, the number of trips are calculated for each period as defined below:

Peak Period Tap in Time (hr)
Weekday Morning 6 - 9
Weekday Evening 17 - 20
Weekend/PH Morning 11 - 14
Weekend/PH Evening 16 - 19

The dataframe now shows the traffic volume by peak period for each Origin-Destination route:

code block
train_od <- train_oct23 %>%
   # Categorize trips under period based on day and timeframe
  mutate(period = ifelse(DAY_TYPE == "WEEKDAY" & 
                         TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9, 
                         "Weekday morning peak",
                    ifelse(DAY_TYPE == "WEEKDAY" & 
                           TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20,
                           "Weekday evening peak",
                      ifelse(DAY_TYPE == "WEEKENDS/HOLIDAY" &
                             TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14,
                              "Weekend/PH morning peak",
                        ifelse(DAY_TYPE == "WEEKENDS/HOLIDAY" & 
                              TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19,
                               "Weekend/PH evening peak",
                    "Others"))))
  ) %>%
  # Only retain needed periods for analysis
  filter(
    period != "Others"
  ) %>%
 # compute number of trips per origin busstop per month for each period
  group_by(
    ORIGIN_PT_CODE,
    DESTINATION_PT_CODE,
    period
  ) %>%
  summarise(
    num_trips = sum(TOTAL_TRIPS)
  ) %>%
  # change all column names to lowercase
  rename_with(
    tolower, everything()
  ) %>%
  ungroup()

There are several instances where ORIGIN_PT_CODE and DESTINATION_PT_CODE are not composite codes, representing MRT Stations that are interchanges with multiple station lines. For the purpose of this investigation, only a single station code is required.

code block
train_od <- train_od %>%
  separate_wider_delim(origin_pt_code,
                       delim = "/", 
                       names = c("origin_station_code", "unused_origin"),
                    # to capture first station for those without multiples
                       too_few = "align_start",
                    # to remove any other unused station columns for 3 or more
                       too_many = "drop"
  ) %>%
  separate_wider_delim(destination_pt_code,
                       delim = "/", 
                       names = c("dest_station_code", "unused_dest"),
                       too_few = "align_start",
                       too_many = "drop"
  ) %>%
  select(
    origin_station_code,
    dest_station_code,
    period,
    num_trips
  )

DT::datatable(head(train_od,20))

This dataset lists train station codes and names. It is used to join with the geospatial dataset for station identification.

train_codes <- read_csv("data/aspatial/train_codes.csv")

str(train_codes)
spc_tbl_ [203 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ stn_code           : chr [1:203] "NS1" "NS2" "NS3" "NS4" ...
 $ mrt_station_english: chr [1:203] "Jurong East" "Bukit Batok" "Bukit Gombak" "Choa Chu Kang" ...
 $ mrt_line_english   : chr [1:203] "North-South Line" "North-South Line" "North-South Line" "North-South Line" ...
 $ type               : chr [1:203] "MRT" "MRT" "MRT" "MRT" ...
 - attr(*, "spec")=
  .. cols(
  ..   stn_code = col_character(),
  ..   mrt_station_english = col_character(),
  ..   mrt_line_english = col_character(),
  ..   type = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

4 What is the Distribution of Passenger Traffic across peak periods?

To determine which time period to analyze further, we visualize the distribution of trips by peak period:

code block
set_urbn_defaults(style = "print")

trip_density <- train_od %>%
  ggplot(
    aes(x = period,
        y = num_trips,
        fill = period,
        color = period)
  ) +
  geom_violin(
    # shift violin plot upwards
    position = position_nudge(x = .3, y = 0), alpha = .8
  ) +
  geom_point(
    aes(y = num_trips,
        color = period),
    position = position_jitter(width = .15),
    size = .5,
    alpha = 0.8
  ) +
  labs(
    title = "Oct 2023: Widest range of MRT trips during Weekday Evening Peak"
  ) + 
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"
  ) +
  coord_flip()

trip_density

The density plot reveals a highly right-skewed distribution for all trips, especially during weekday evening peak periods. This could point towards congested MRT train stations, and would be useful to analyse further to determine the possible factors leading to this high passenger volume.

The OD data for weekday evening peak periods will thus be the focus of the study, and data is extracted using the following code chunk:

weekday_pm_od <- train_od %>%
  filter(period == "Weekday evening peak")

write_rds(weekday_pm_od, "data/rds/weekday_pm_od.rds")

4.1 Geospatial Data

The following geospatial dataframes are used for this exercise:

  • train_station, the location of train stations (RapidTransitSystemStation) from LTA Datamall
  • Business, entertn, fnb, finserv, recreation and retail, geospatial data sets of the locations of business establishments, entertainments, food and beverage outlets, financial centres, leisure and recreation centres, retail and services stores/outlets compiled for urban mobility studies
  • mpsz, masterplan boundary 2019

4.2 Preparing Geospatial Files: train station

As the files are all based on Singapore Maps, they are in SVY21 coordinate reference system (CRS) and projected in ESPG code 3414 using st_transform()

train_station is a Simple feature polygon layer based on SVY21 coordinate reference system (CRS).

train_station <- st_read(dsn = "data/geospatial",
                         layer = "RapidTransitSystemStation") %>%
          st_transform(crs = 3414)
Reading layer `RapidTransitSystemStation' from data source 
  `C:\haileycsy\ISSS624-AGA\Take-home_Ex\the2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21

Dataset limitations

  • train_station only lists the Type of station (MRT or LRT), Station name and geolocation of each station. There is no station code to join with the aspatial dataset. Thus, the use of train_codes is needed to assign a unique train code identifier to each station

  • There are only 203 Station codes in train_codes, compared to 220 entries in train_station. There may be duplicated station names, which needs further investigation.

  • GDAL Message 1: Non closed ring detected – This warning message was received when loading in the geospatial layer. This means that there are polygons that are not closed (starting and ending points are not joined). There is a need for further geospatial data wrangling to rectify this.

code block
mpsz <- st_read(dsn = "data/geospatial",
                layer = "MPSZ-2019") %>%
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\haileycsy\ISSS624-AGA\Take-home_Ex\the2\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

Checking for duplicated station entires in train_station:

train_duplicates <- train_station %>%
  # remove geometry layer
  st_set_geometry(NULL) %>%
  group_by(TYP_CD_DES,
           STN_NAM_DE
  ) %>%
  # count number of rows per train station
  summarise(
    count = n()
  ) %>%
  ungroup() %>%
  # retrieve stations with more than single row
  filter(count > 1)

DT::datatable(train_duplicates)

The code above reveals that there are several entries for the stations above in train_station. This may be due to the fact that some stations are interchanges and have multiple train lines. However, there is no way to differentiate this without station code, and this study will assume that the location of these stations are within the same boundary anyway. Only one geospatial entry per station is selected using slice():

# Remove duplicates on selected columns
train_station <- train_station %>%
  group_by(STN_NAM_DE) %>%
  slice(1) %>%
  st_as_sf()

As train_station does not have station codes that may be joined to the OD dataset for analysis, it is joined to train_codes by train station name. However, the station names in train_codes are in lowercase and without the suffix “MRT STATION” in the station names – there is thus an extra layer of data wrangling to be done before joining.

train_codes_new <- train_codes %>%
  # Assign suffix to train station names and change to upper case
  mutate(station_name = ifelse(type == "MRT", 
                               paste0(toupper(mrt_station_english), " MRT STATION"),
                               paste0(toupper(mrt_station_english), " LRT STATION"))
  )

The code chunk below assigns train station code to the geospatial layer to identify each station by code.

train_station_comb <- train_station %>%
  left_join(train_codes_new,
            by = join_by(STN_NAM_DE == station_name,
                         TYP_CD_DES == type)
  ) %>%
  select(
    stn_code,
    STN_NAM_DE,
    TYP_CD_DES
  )

There are several stations that have no stn_code:

code block
train_station_comb %>%
  filter(is.na(stn_code)) %>%
  datatable()

As these are mainly train depots that are not identified by code and irrelvant to analysis of OD flows, these are removed from the dataset and saved as a new file, train_station_list:

train_station_list <- train_station_comb %>%
  filter(!is.na(stn_code))

We use st_is_valid() to retrieve the invalid polygon datapoints in train_station_list:

# Retrieve invalid geometries
invalid_indices <- which(!st_is_valid(train_station_list))
print(invalid_indices)
[1] 195

The code chunk above reveals that index 195 is invalid. This is identified as UPPER THOMSON MRT STATION:

print(train_station_list[195,])
Simple feature collection with 1 feature and 3 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 27808.12 ymin: 37278.27 xmax: 28080.89 ymax: 37543.25
Projected CRS: SVY21 / Singapore TM
# A tibble: 1 × 4
# Groups:   STN_NAM_DE [1]
  stn_code STN_NAM_DE                TYP_CD_DES                         geometry
  <chr>    <chr>                     <chr>                         <POLYGON [m]>
1 TE8      UPPER THOMSON MRT STATION MRT        ((27808.12 37518.2, 27811.57 37…

To rectify this invalid geometry, use st_make_valid():

train_station_valid <- st_make_valid(train_station_list)

Check validity of the geometry layer again:

invalid_indices2 <- which(!st_is_valid(train_station_valid))
print(invalid_indices2)
integer(0)

The code chunk shows that there are no more invalid geometries in the file.

# Save as file 
write_rds(train_station_valid, "data/rds/train_station_valid.rds")
code block
train_station_valid <- read_rds("data/rds/train_station_valid.rds")

tmap_mode("view")

tm_basemap("CartoDB.Positron") +
  tm_shape(train_station_valid) +
  tm_dots()

4.3 Creating hexagon grid

From the mapview above, the current simple features dataframe only shows the train stations as points on the map, which not suitable for understanding commuter flows between areas. In urban transportation planning, Traffic Analysis Zones (TAZ) are the basic units of spatial areas delineated to tabulate traffic-related models.

The following code chunks create a hexagonal grid frame spanning 750m between edges for each hexagon:

# create hexagon frame
train_hex <- st_make_grid(
    train_station_valid, 
    # for hexagonal cells, cellsize = the distance between opposite edges
    cellsize = 750, 
    square = FALSE
  ) %>%
  st_sf() %>%
  rowid_to_column("hex_id")

This creates a hexagonal grid over the entire area:

code block
tmap_mode("plot")
qtm(train_hex)

# join to train station names
train_stops <- st_join(
    train_station_valid,
    train_hex,
    join = st_intersects
  ) %>%
  st_set_geometry(NULL) %>%
  group_by(stn_code) %>%
  summarise(
  # Ensure that each station gets assigned to a single hex_id
    hex_id = first(hex_id)
  ) %>%
  ungroup() %>%
  group_by(hex_id) %>%
  summarise(
    station_count = n(),
    station_codes = str_c(stn_code, collapse = ","),
  ) %>%
  ungroup()
train_hex_final <- train_hex %>%
  left_join(train_stops,
          by = "hex_id"
  ) %>%
  replace(is.na(.), 0)
code block
# Get hexagons with at least one station 
train_hex_filtered <- train_hex_final %>%
  filter(station_count > 0)

tmap_mode("view")

tm_basemap("CartoDB.Positron") +
  tm_shape(train_hex_filtered) +
  tm_fill(
    col = "station_count",
    palette = "-RdYlGn",
    style = "cont",
    id = "hex_id",
    popup.vars = c("No. of Train Stations: " = "station_count",
                   "Train station codes: " = "station_codes"),
    title = " "
  ) +
  tm_layout(
    # Set legend.show to FALSE to hide the legend
    legend.show = FALSE
  )

5 Which trips are the most congested on Weekday Evening Peaks?

We seek to understand the inter-zone trip volume per TAZ, to understand which routes are the most popular during weekday evening peak periods.

5.1 Preparing Origin-Destination flow data

The following steps are done to prepare O-D flow dataframes at hexagon level:

station_by_hex <- train_hex_final %>%
  select(
    hex_id,
    station_codes
  ) %>%
  st_drop_geometry() %>%
  # create separate rows for each hex_id - station_code pair
  separate_rows(station_codes, 
                sep = ","
  ) %>%
  # drop any hexagon without stations
  filter(station_codes != 0)
od_data <- left_join(
    weekday_pm_od, 
    station_by_hex,
    by = c("origin_station_code" = "station_codes")
  ) %>%
  rename(
    origin_stn = origin_station_code,
    origin_hex = hex_id,
    dest_stn = dest_station_code
  ) %>%
  distinct()
od_data <- left_join(
    od_data,
    station_by_hex,
    by = c("dest_stn" = "station_codes")
  ) %>%
  rename(
    dest_hex = hex_id
  ) %>%
  group_by(
    origin_hex,
    dest_hex
  ) %>%
  summarise(
    weekday_pm_trips = sum(num_trips)
  ) %>%
  ungroup()
write_rds(od_data, "data/rds/od_data.rds")

5.2 Visualising Spatial Interation

The following steps are taken to visualise the traffic flows between TAZs.

As we are only interested in inter-zonal flows, we remove the intra-zonal trips from od_data dataframe:

od_data <- read_rds("data/rds/od_data.rds")

od_data_inter <- od_data[od_data$origin_hex!=od_data$dest_hex,]

Desire lines are straight lines that connect origin to destination. od2line() function is used to create these:

od_flow <- od2line(flow = od_data_inter, 
                    zones = train_hex_filtered,
                    zone_code = "hex_id")

quantile() reveals that there is a large range between 75th and 100th quantiles. This will affect the visualisation of OD flows.

quantile(od_flow$weekday_pm_trips)
      0%      25%      50%      75%     100% 
    1.00    36.00   161.00   648.75 63626.00 

To visualise this further, we bin the number of trips into custom limits using cut():

od_flow <- od_flow %>%
  mutate(
    trips_quantile = cut(weekday_pm_trips, 
                         breaks = c(0, 50, 100, 250, 500, 1000, 5000, 10000, 20000, Inf),
                         labels = c("< 50", "< 100", "100 ~ 250", "250 ~ 500",
                                    "500 ~ 1000", "1000 ~ 5000", "5000 ~ 10000",
                                    "10000 ~ 20000", "> 20000"),
                         ordered_result = TRUE
  ))
code block
wd_pm_trips <- tm_shape(mpsz) +
  tm_fill(
    col = "#dfdfeb"
    ) +
tm_shape(train_hex_filtered) +
  tm_fill(
    col = "station_count",
    palette = "-RdYlGn",
    alpha = .7,
    style = "cont"
  ) +
  tm_borders(
    col = "#dfdfeb",
    lwd = .5
  ) +
tm_shape(od_flow) +
  tm_lines(
    lwd = "weekday_pm_trips",
    scale = 1.5,
    col = "#1F363D",
    alpha = .7
  ) +
  tm_layout(
    title = "Weekday Evening Peak Traffic Flow",
    scale = .9,
    frame = FALSE
  ) +
  tm_facets(
    along = "trips_quantile",
    free.coords = FALSE
  )

# Save animation as gif
tmap_animation(wd_pm_trips,
               "wd_pm_trips.gif",
               loop = TRUE,
               delay = 80,
               outer.margins = NA,
               restart.delay = 100)

The animated O-D map shows that the flows with trip count < 10,000 is too dense for effective visualisation. As the number of trips increases, we also observe a slight concentration of flow lines within the Central/CBD district. To investigate this further, we focus on looking at the most popular O-D passenger flows for Weekday PM trips > 20,000:

code block
tmap_options(check.and.fix = TRUE)
tmap_mode("plot")

tm_shape(mpsz) +
  tm_fill(
    col = "#dfdfeb"
    ) +
tm_shape(train_hex_filtered) +
  tm_fill(
    col = "station_count",
    palette = "-RdYlGn",
    alpha = .8,
    style = "cont",
    # set legend title
    title = "Station Count",
    popup.vars = c("No. of Train Stations: " = "station_count",
                   "TAZ: " = "hex_id",
                   "Train station codes: " = "station_codes")
  ) +
  tm_borders(
    col = "#dfdfeb",
    lwd = .5
  ) +
od_flow %>% 
  filter(weekday_pm_trips > 20000) %>%
tm_shape() +
  tm_lines(
    lwd = "weekday_pm_trips",
    scale = 1.5,
    style = "quantile",
    n = 6,
    col = "#451F55",
    alpha = .7
  ) +
  tm_layout(
    title = "Weekday Evening Peak Traffic Flow",
    scale = .9,
    legend.stack = "horizontal",
    legend.position = c("right", "bottom"),
    frame = FALSE
  ) +
  tmap_style("white")

.

Insights from OD flow map

  • A higher train station density in TAZs does not correlate to higher traffic. This is possibly due to the fact that a single station may encompass multiple lines (for instance, TAZ #1074 has the highest station density with 5 station codes, but this only corresponds to 3 stations: Marina Bay, Downtown & Shenton Way)

  • There are several TAZs with more flowlines, that are scattered across the country. This is suggestive of either being a key origin or destination zone for Weekday evenings. Namely:

    • TAZ #1058 (Telok ayer and Raffles Place MRT Stations)
    • TAZ #534 (Jurong East MRT Station)
    • TAZ #988 (Yishun MRT Station)
  • There are also TAZs with fewer but thicker flowlines, indicating highly popular origin or destination train stations, that are likely to be congested during Weekday evening peak periods. These are:

    • TAZ #1544 (Pasir Ris MRT Station)
    • TAZ #1028 (Novena MRT Station)
    • TAZ #774 (Woodlands MRT Station)
    • TAZ #573 (Yew Tee MRT Station)

Top 15 O-D flows by number of trips during Weekday Evening Peak Period:

code block
od_data %>%
  left_join(train_hex_filtered,
            by = c("origin_hex" = "hex_id")
  ) %>%
  rename(
    origin_stations = station_codes
  ) %>%
  left_join(train_hex_filtered,
            by = c("dest_hex" = "hex_id")
  ) %>%
  rename(
    dest_stations = station_codes
  ) %>%
  st_drop_geometry() %>%
  select(origin_hex,
         origin_stations,
         dest_hex,
         dest_stations,
         weekday_pm_trips
  ) %>%
  arrange(desc(weekday_pm_trips)) %>%
  slice_head(n = 15) %>%
  datatable()

The table above reveals that Jurong East MRT Station is the busiest origin station, with top OD trips originating from there. Only 2 TAZs occurred multiple times as top destination stations: Boon Lay and Newton MRT Stations. The fact that these stations are not in the city centre is slightly surprising, as one would assume that the busiest origin station would be in the CBD area where people would be commuting from after work.

Further analysis is conducted by using Spatial Interaction Models (SIMs) to determine factors explaining flow density between these TAZs.

7 Key Takeaways from Study

From the Goodness-of-fit and Linearity tests conducted, the doubly constrained SIM consistently outperformed origin-constrained and destination-constrained SIMs. This suggests that:

  • The simplicity of the model is effective in capturing underlying patterns in the data. Despite having fewer variables compared to the other two models, the doubly-constrained SIM achieved better performance – thus indicating that the additional complexity of the other models may not necessarily improve the explanatory power of the models.

  • Distance between TAZs is thus the key explanatory variable influencing number of trips during Weekday Evening Peak Periods, where shorter distances tend to lead to higher number of trips. Even in the origin and destination-constrained models, distance had the highest negative association (around -0.79); this figure represents the effect of a unit of increase in distance on the expected number of trips. When we compare the top 10 trips by distance versus number of trips below, we see that the most frequent trips by passenger volume are ~10 times shorter in distance than the least popular trips:

Most popular trips recorded an average distance of ~3500m

code block
od_data_dist %>%
  st_drop_geometry() %>%
  arrange(desc(weekday_pm_trips)) %>%
  mutate(
    distance = round(distance,0)
  ) %>%
  slice_head(n = 10) %>%
  select(
    origin_hex,
    dest_hex,
    weekday_pm_trips,
    distance
  ) %>%
  datatable()

Least popular trips recorded an average distance of ~37000m

code block
od_data_dist %>%
  st_drop_geometry() %>%
  arrange(desc(distance)) %>%
  mutate(
    distance = round(distance,0)
  ) %>%
  slice_head(n = 15) %>%
  select(
    origin_hex,
    dest_hex,
    weekday_pm_trips,
    distance
  ) %>%
  datatable()
  • However, it is not indicative that the origin-constrained and destination-constrained SIMs are not valid, or that other variables have no explanatory value. There is a need for further calibration of the model, such as including data from a longer time-range sample (over 6 months)

  • The explanatory variables used for the SIMs are related to the quantity of specific types of facilities within each TAZ, but does not account for the quality of these features. More qualitative data such as types of shops in retail areas or popularity of F&B joints could be considered as other factors

  • Spatial Interaction Models assume independence among observations in each TAZ, and does not consider the relative attractiveness or propulsiveness of the neighbouring areas. Further spatial econometrics modifications could be made to the SIMs by adding weighted metrics to understand the relative influence of the neighborhood on the TAZs.