Propensity Score Matching

Introduction

Propensity score (PS) matching is a statistical technique widely employed in Real World Evidence (RWE) studies to address confounding bias and facilitate the estimation of causal treatment effects. By estimating the probability of treatment assignment based on observed covariates, PS matching aims to create comparable treatment and control groups. While both SAS and R provide methods to do PS matching, the syntax and available options differ considerably, potentially leading to variations in analytical outcomes. The PS matching process generally involves several key stages: first, the estimation of propensity scores using a regression model; second, the specification of a region of common support to ensure overlap between treatment and control groups; and third, the matching of treated and control subjects based on their calculated propensity scores.

Differences

Given the extensive number of parameters and arguments available in both PROC PSMATCH and matchit(), we will focus only on the observed differences.

Options

Option	PROC PSMATCH	matchit
Distance	PS, LPS; only for matching: mahalanobis, euclidean	PS, euclidean, scaled_euclidean, mahalanobis, robust_mahalanobis
PS methods	logistic regression	glm, gam, gbm, lasso, ridge, elasticnet with different links, partitioning tree, RF, single-hidden-layer neural network, covariate balancing propensity score (CBPS) algorithm, Bayesian additive regression trees (BART)
Region	Always used; allobs, treated or cs (common support), with allowed extension	none, treated, control, both (common support)
Caliper	Applies only to the distance metric, such that distance<=caliper	Can be applied to both covariates and distance metric. If a positive value is supplied, it functions as in SAS; If a negative value is supplied, it imposes the condition distance > abs(caliper).
PS std.dev formula when caliper applied as multiplier	sqrt((sd(PS_trt)²+sd(P_control)²)/2)	sd(PS_all)
Matching methods	greedy (greedy nn), full, optimal, replace, varratio	nearest (greedy nn), full, optimal, quick, genetic, cem, exact, cardinality, subclass
Method=optimal options	Only K should be supplied	In addition to k (number of matches), the tol option can also be specified, and its value can significantly impact the matching results.
Method=full options	n max controls, n max treated, mean n treated, n controls, pct controls	n min controls, n max controls, mean n controls, fraction to omit, tol, solver
Mahalanobis distance	PS are always calculated to determine the region; only numeric covariates are allowed in MAHVARS option	mahvars accepts any var type; region calculation is ommited when distance=mahalanobis, but can be performed when distance=glm and mahvars is supplied with a formula
Covariance matrix for mahalanobis distance	Computed from trt obs and control obs	Is pooled within-group, computed from trt mean-centered covariates of the full sample
Replace	Could be specified only as method=replace; in the output, match IDs are shared among all subjects within a matched set.	Could be specified as an argument (e.g. replace=TRUE). In the output, each treated subject receives up to K matched controls
Estimands	ATT, ATE	ATT, ATE, ATC
Reestimate	Not available	Could be specified as an argument (e.g. reestimate=TRUE)
Exact, anti-exact	Only exact matching is available and can only be performed on variables listed in the CLASS statement.	Both exact and anti-exact matching are available, with no restrictions on the variable types.
Normalization	Not available	Could be specified as an argument (e.g. normalize=TRUE)

Output

The way SAS and R show the results of matching is different. The output from SAS matching is presented as a dataset where each treated subject (or all subjects) is given a row, and a match ID column containing a unique pair (or group) identifier is included:

	TRTP	_MatchID
1	trt	1
2	trt	2
3	control	1
4	control	2

In contrast, the matchit() function in R returns a matrix, where each treated unit’s identifier is associated with the identifiers of its k matched control units:

	[,1]
1	“3”
2	“4”

Statistics

In SAS, descriptive statistics for assessing balance are primarily generated through the ASSESS statement within PROC PSMATCH. Conversely, in R, balance diagnostics could be obtained by applying the summary() function to the output object returned by the matchit(). The following table summarizes the balance statistics available in SAS and R:

Stat	All	Region	Matched
N	SAS & R	SAS	SAS & R
PS mean	SAS & R	SAS	SAS & R
PS std	SAS	SAS	SAS
PS min	SAS	SAS	SAS
PS max	SAS	SAS	SAS
Vars mean	SAS & R	SAS	SAS & R
Vars std	SAS	SAS	SAS
Vars min	SAS	SAS	SAS
Vars max	SAS	SAS	SAS
PS mean diff	SAS	SAS	SAS
PS SMD	SAS & R	SAS	SAS & R
PS perc. red.	SAS	SAS	SAS
PS var ratio	SAS & R	SAS	SAS & R
Vars mean diff	SAS	SAS	SAS
Vars SMD	SAS & R	SAS	SAS & R
Vars perc. red.	SAS	SAS	SAS
Vars var ratio	SAS & R	SAS	SAS & R
PS eCDF min	R	-	R
PS eCDF max	R	-	R
Vars eCDF min	R	-	R
Vars eCDF max	R	-	R
PS std pair distance	R	-	R
Vars std pair distance	R	-	R

Figures

Plot type	R	SAS
Love plot	plot(): - cobalt: many different settings.	Displayed for: all/region/matched/weighted matched, trt/control. Includes PS, all numeric and binary variables.
General distribution plots	plot(): Displayed for: all/matched. Includes all variables; numeric - distribution plots, character and factor - histograms. cobalt : Highly customizable plots.	Displayed for: all/region/matched/weighted matched, trt/control. Includes PS and all variables. For PS and numeric - boxplots, character - barplots.
eCDF plots	plot(): Displayed for: all/matched. Includes all variables.	Displayed for: all/region/matched/weighted matched, trt/control. Includes PS and all numeric variables.
eQQ plots	plot(): Displayed for: all/matched. Includes all variables.	-
Cloud plots	plot(): Displayed for: all/matched, trt/control. Includes PS. Presented as 4 separate clouds.	Displayed for: all/region/matched/weighted matched, trt/control. Includes PS and all numeric variables. Presented as 2 separate clouds per variable.
PS histogram	plot(): Displayed for: all/matched, trt/control. Includes PS. cobalt: Highly customizable plots.	-

Dataset

The dataset used in the example below can be found here: [ps_data.csv](https://github.com/PSIAIMS/CAMIS/blob/main/data/ps_data.csv)


                           trt              control       Standardized
Characteristic          (N = 120)          (N = 180)       Mean Diff.

  sex                                                        0.1690
    F                  70 (58.3 %)        90 (50.0 %)        
    M                  50 (41.7 %)        90 (50.0 %)
    
  age                 61.5 (17.12)       49.4 (10.55)        0.7057
  
  weight              67.3 ( 7.33)       63.8 ( 9.64)        0.4741
  
  bmi_cat   
    underweight        34 (28.3 %)        63 (35.0 %)       -0.1479  
    normal             57 (47.5 %)        61 (33.9 %)        0.2726
    overweight         29 (24.2 %)        56 (31.1 %)       -0.1622

Matching Examples

Greedy Nearest Neighbor 1 to 1 matching with common support region

SAS

proc psmatch data=data region=cs(extend=0);
    class trtp sex bmi_cat;
    psmodel trtp(Treated="trt")= sex weight age bmi_cat;
    match distance=PS 
        method=greedy(k=1 order=descending) 
        caliper(MULT=ONE)=0.25;
    output out(obs=match)=ps_res matchid=_MatchID ps=_PScore;
run;

R

ps_res <- matchit(trtp ~ sex + weight + age + bmi_cat, 
                  data=data, 
                  method="nearest", 
                  distance="glm", 
                  link="logit", 
                  discard="both",
                  m.order="largest",
                  replace=FALSE,
                  caliper=0.25, 
                  std.caliper=FALSE,
                  ratio=1,
                  normalize=FALSE)

The following arguments, when altered in the previous example, can still produce matching results comparable between SAS and R:

region (discard)
caliper value
order
k (ratio)
exact