Study design and participants
Inclusion criteria were as follows:
1. Class II Division 1 malocclusion (Class II molar relationship with proclined upper incisors)
2. Class II molar relationship
3. Initial overjet of ≥6 mm when measured on casts
4. Debanding between 13 and <20 years of age
5. One of two treatment types: non-surgical orthodontic-only treatment or combined orthodontic/surgical treatment
6. Availability of full records
Patients with craniofacial anomalies or syndromes were not included in the study.
Sample size estimation
The sample size estimation for this study was based on the cephalometric measurements presented by Proffit et al [22] and the American Board of Orthodontics-Cast Occlusal Grading System (ABO-COGS) scores presented in the report by Cansunar and Uysal [33]. We deemed a 1- to 2-point difference in cephalometric outcomes between the surgical and non-surgical groups to be clinically significant. The mean ABO-COGS score in the study by Cansunar and Uysal ranged from 16.80 (standard deviation of 8.54) to 19.05 (standard deviation of 8.41) [33]. We deemed a difference of one standard deviation in ABO-COGS scores between the surgical and non-surgical groups to be clinically significant. We set alpha at 0.007 (to account for multiple testing) and power at 80%, and planned to use two-sided tests. Based on our sample size and power calculations, we estimated that each group should have 18 patients (for ABO-COGS scores) to 20 patients (for cephalometric variables). We planned to include 20 patients in the surgical group and 40 patients in the non-surgical group. We intentionally doubled the number of patients in the non-surgical group because this group is likely to have a wider range of biomechanical strategies (extraction, non-extraction, use of functional appliances, etc.), and a larger sample size would enable us to examine within-group variations in outcomes.
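For readers who want to explore how alpha, power, and effect size interact in a calculation of this kind, the following is a minimal sketch using the standard normal-approximation formula for a two-sided comparison of two means. The text does not state which formula or software was used, so the function name `n_per_group` and the choice of formula are assumptions, and the output is not expected to reproduce the group sizes reported above.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha, power):
    """Approximate patients per group for a two-sided, two-sample
    comparison of means (normal approximation; an assumed formula,
    not necessarily the one used in the study).

    effect_size is the between-group difference divided by the
    standard deviation (1.0 means a one-standard-deviation difference).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Inputs taken from the text: one-SD difference, alpha = 0.007, power = 80%.
print(n_per_group(effect_size=1.0, alpha=0.007, power=0.80))
```

Under this approximation, halving the effect size roughly quadruples the required group size, and a stricter alpha also raises it.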
Key study variables
Treatment data was gathered from all subjects. Initial and final lateral cephalometric radiographs were scanned into Dolphin Imaging software. The following cephalometric landmarks were traced and used for recording measurements: Sella, Porion, Orbitale, Nasion, A Point, B Point, U1 Incisal Edge, U1 Root Tip, L1 Incisal Edge, L1 Root Tip, Menton, and Constructed Gonion. Figures 1 and 2 provide a visual representation of these landmarks.
Cast grading was performed on pre- and post-treatment casts. Initial casts were graded using the parameters of the ABO Initial Discrepancy Index (DI) form, which is used to quantify the difficulty of an untreated case. Final casts were graded using the Final Cast Grading Form, also provided by the ABO, which gives a numerical representation of the finish of a case; higher numbers indicate more occlusal discrepancies in the finished case.
Outcomes examined
Outcomes gathered in this study were as follows: deband lateral cephalometric outcomes (ANB, FMIA, IMPA, U1 to SN, overbite, overjet), cast occlusion grading outcomes (measured through the ABO-COGS), and retention protocol. Independent variables in this study were as follows: the type of treatment (surgical versus non-surgical), the initial discrepancy index (DI), initial cephalometric variables (ANB, FMIA, IMPA, U1 to SN, overbite, overjet), starting age of treatment, and gender.
Examiner reliability
Inter-examiner and intra-examiner reliability analyses were performed using intra-class correlation coefficients (Cronbach alpha values) for each of the outcome variables. To compute intra-examiner reliability, one researcher measured the initial and final casts of 20 cases twice within a 1-week interval; the correlation between the two sets of measurements was >0.90. Inter-examiner reliability was assessed between two different examiners. Both examiners used the initial discrepancy index form provided by the ABO and the ABO Cast Grading form, which gives detailed instructions for cast grading at deband. In addition, both examiners took the same online tutorial for final cast grading and therefore had the same degree of training before measuring data. Both examiners were blinded as to whether the cases had been treated surgically or non-surgically. The correlation for inter-examiner reliability was >0.90. Cephalometric tracing also showed a correlation of >0.90 for both intra- and inter-examiner reliability: two examiners independently traced the same ten radiographs two times over the course of two consecutive weeks (intra-examiner), and a second examiner later traced the same ten radiographs to compare results (inter-examiner).
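The reliability values above are reported as Cronbach alpha values. As a minimal sketch of how that statistic can be computed from repeated measurements, the function below treats each measurement pass (or each examiner) as one "item" and each case as one observation. The function name `cronbach_alpha` and the scores are hypothetical illustrations, not study data, and the study itself used SPSS rather than this code.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rating series.

    ratings: list of equal-length lists, one per examiner or
    measurement pass, each holding one score per case.
    """
    k = len(ratings)
    # Sample variance of each examiner's (or pass's) scores
    item_vars = [variance(r) for r in ratings]
    # Variance of the per-case totals across examiners/passes
    totals = [sum(case) for case in zip(*ratings)]
    return k / (k - 1) * (1 - sum(item_vars) / variance(totals))


# Hypothetical cast-grading scores for five cases, measured twice
first_pass = [12, 25, 18, 30, 22]
second_pass = [13, 24, 18, 31, 21]
alpha = cronbach_alpha([first_pass, second_pass])
print(round(alpha, 3))
```

Values close to 1 indicate that the two passes rank and score the cases almost identically.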
Statistical analysis
The baseline descriptives and outcomes were compared between the two groups using Mann-Whitney tests. Multivariable linear regression analyses were performed to examine the association between treatment (surgical versus non-surgical orthodontic treatment) and final lateral cephalometric values (adjusted for initial cephalometric values, age at start of treatment, initial DI, and gender) and ABO-COGS scores. The multivariable linear regression models were fit using the ordinary least squares method. Sensitivity analyses were conducted using a propensity scoring approach to account for the non-randomized nature of treatment assignment (surgical versus non-surgical). In this approach, we first computed the probability of a patient having undergone surgical or non-surgical treatment by using patient-level covariates (age at start of treatment, gender, initial discrepancy index, initial ANB angle, initial FMIA angle, initial IMPA angle, initial U1 to SN angle, initial overbite, and initial overjet) as predictors in a logistic regression model fit by the maximum likelihood method. The fit of this model was assessed by the Hosmer and Lemeshow goodness-of-fit test. After confirming that the model fit was good (Hosmer and Lemeshow goodness-of-fit chi-square value of 3.10, p = 0.93), we used the predicted probability (propensity score) of being treated surgically or non-surgically as a covariate in the second-stage model. The second-stage model was fit using generalized linear model (GLM) methods. In this model, the primary independent variable was the type of treatment (surgical or non-surgical), and the propensity score was used as a covariate along with all other patient-level variables. This approach was used to account for imbalances between treatment groups; it reduces bias by mimicking randomization of subjects into treatment groups (surgical or non-surgical) [34].
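The first stage described above, a logistic regression fit by maximum likelihood whose predicted probabilities serve as propensity scores, can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: it uses a single covariate instead of the nine listed in the text, fits the model with Newton-Raphson iterations, and runs on synthetic data. The names `fit_logistic_1d` and `propensity_scores` are hypothetical; the study used SPSS, and neither the Hosmer and Lemeshow test nor the second-stage model is shown.

```python
import math


def fit_logistic_1d(x, y, iters=25):
    """Maximum-likelihood fit of P(y = 1) = sigmoid(b0 + b1 * x)
    using Newton-Raphson (one covariate only, for illustration)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g0 += yi - p
            g1 += (yi - p) * xi
            # Information matrix (negative Hessian)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def propensity_scores(x, b0, b1):
    """Predicted probability of surgical treatment for each patient."""
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]


# Synthetic data (NOT study data): initial overjet in mm and
# treatment indicator (1 = surgical, 0 = non-surgical).
overjet = [6, 7, 8, 9, 10, 11, 12, 13]
surgical = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic_1d(overjet, surgical)
scores = propensity_scores(overjet, b0, b1)
```

In a full analysis, each patient's score would then enter the outcome regression as a covariate alongside the treatment indicator, as the text describes.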
The end-of-treatment outcomes of the surgical and non-surgical groups were compared using propensity score regression adjustment and propensity score stratification approaches. In the stratification approach, five bins (quintiles) were used to stratify the propensity scores, and the quintile was used as a covariate in the regression models. All the regression models were assessed for fit. Several sensitivity analyses with different mixes of covariates were conducted, and the best-fitting models with the highest R-square values are presented in this study. Since seven different end-of-treatment outcomes were assessed, we set the threshold for statistical significance at p < 0.007 to account for the assessment of multiple outcomes and to minimize type 1 errors. For comparing the baseline descriptives between the surgical and non-surgical groups, a p value of <0.05 was deemed to be statistically significant. All statistical tests were two-sided. All statistical analyses were conducted using SPSS version 23.0 (IBM Corp, New York City, NY) software.
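The stratification step described above, in which propensity scores are divided into five bins and the bin number is used as a covariate, can be sketched as follows. The rank-based binning rule, the function name `quintile_bins`, and the scores are assumptions for illustration; the text does not specify how ties or uneven group sizes were handled.

```python
def quintile_bins(scores):
    """Assign each propensity score to a quintile (1 = lowest fifth,
    5 = highest fifth) based on its rank among all scores.

    Ties are broken by original order, which is an arbitrary choice
    made for this sketch.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // n + 1
    return bins


# Hypothetical propensity scores for ten patients (not study data)
example_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.95]
example_bins = quintile_bins(example_scores)
```

The resulting bin numbers can then be entered into the regression model in place of the continuous propensity score.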