The igate package implements the initial Guided Analytics for parameter Testing and control band Extraction (iGATE) framework for manufacturing data. The goal of iGATE is to enable guided analytics in industry, that is, to provide a statistically sound framework for process optimization that is simple enough to be used and understood by employees without statistical training. The emphasis is on simplicity and interpretability while maintaining statistical rigor. The full methodology has been published in (Stein et al. 2021).
Having identified a manufacturing ‘problem’ to be investigated, a
data set is assembled for a ‘typical’ period of operation,
i.e. excluding known disturbances such as maintenance or equipment
failures. This data set includes the so-called target variable, a direct indication of or proxy for the problem under consideration and the variable whose variation we want to explain, as well as a number of parameters
representing suspected sources of variation (SSVs),
i.e. variables that we consider potentially influential for the value of the target. Parameters with known and explainable relationships with the target variable should be excluded from the analysis, although this can also be handled iteratively by excluding such parameters and repeating the procedure. Care has been taken
to robustify the approach against outliers and missing data, in order to
make it a reliable tool that can be used with possibly messy or
incomplete real-world data sets. The iGATE procedure consists of the
following steps (detailed explanations follow below):
1. Identify the observations with the best and worst values of the target variable ("best of the best", BOB, and "worst of the worst", WOW); by default 8 of each are selected, which can be changed via the versus argument of igate / categorical.igate.
2. For each SSV, compare its values for the BOB and the WOW using the Tukey-Duckworth test and retain the SSV only if it is potentially influential.
3. Perform a Wilcoxon rank test for each retained SSV as a follow-up test.
4. Extract good and bad control bands for each retained SSV.
5. As a sanity check, produce a regression plot (continuous target) or frequency plot (categorical target) for each retained SSV and have a domain expert review it.
6. Validate the retained SSVs and their control bands on a separate validation data set.
7. Summarize the analysis in a standardized report.

Steps 1-4 are performed using the igate function for continuous target variables or the categorical.igate function for categorical target variables. Especially for categorical targets with few categories, robust.categorical.igate, a robustified version of categorical.igate, should be considered.
When running igate / categorical.igate with default settings, any outliers in the target variable are excluded and the observations corresponding to the best 8 (BOB) and worst 8 (WOW) values of the target variable are identified. Each SSV is then inspected in turn across these 16 observations: the distributions of the SSV values of the 8 BOB and the 8 WOW are compared by applying the Tukey-Duckworth test (Tukey 1959). If the count statistic returned by the test is larger than 6 (this corresponds to a p-value of less than 0.05), the SSV is retained as potentially significant. This test was chosen for its simplicity and ease of interpretation and visualization. SSVs failing the test are highly unlikely to be influential, whilst SSVs passing the test may be influential. The Wilcoxon rank test performed in step 3 of iGATE serves as a more widely known alternative that might, however, be harder to explain to non-statisticians. The main purpose of these steps is dimensionality reduction: they cut the set of SSVs down to a manageable population for expert consideration.
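The counting idea behind the Tukey-Duckworth test is easy to illustrate. The following standalone sketch (not the implementation used inside igate; ties are ignored for simplicity and the two samples are made-up SSV values) computes the count statistic for a hypothetical SSV of the 8 BOB and the 8 WOW:
# Standalone illustration of the Tukey-Duckworth ("Tukey's quick test") count
# statistic; ties are ignored for simplicity. Each value of one sample lying
# beyond the entire range of the other sample adds one to the count.
tukey_duckworth_count <- function(high, low) {
  sum(high > max(low)) + sum(low < min(high))
}
bob_ssv <- c(6.9, 6.7, 6.7, 6.6, 6.3, 6.1, 6.0, 5.9)  # made-up SSV values of the 8 BOB
wow_ssv <- c(1.0, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7)  # made-up SSV values of the 8 WOW
tukey_duckworth_count(bob_ssv, wow_ssv)  # 16 here; counts larger than 6 are significant at the 5% level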
Step 5 is performed by calling igate.regressions for continuous targets or categorical.freqplot for categorical targets. These functions produce a regression plot or a frequency plot, respectively, for each retained SSV and save it to the current working directory. The domain expert should review these plots and decide which parameters to keep for further analysis, based on how well the data fit the plotted relationship.
For the validation step, the production period from which the validation data is selected depends on the business situation, but it should be a period of operation consistent with the one from which the initial population was drawn, i.e. similar product types, similar equipment status etc. The validation step then considers all the retained SSVs collectively in terms of their good and bad bands and extracts from the validation sample all the records for which every retained SSV lies within these bands. The expectation is that where all the SSVs lie within the good bands, the target should also correspond to the best performance, and, conversely, where all the retained SSVs lie within the bad bands we expect to see bad performance. The application gives feedback on the extent to which this criterion is satisfied, in order to help the user conclude the exploration and make recommendations for subsequent improvements. Validation is performed via the validate function.
We consider the last step, the reporting of the results in a
standardized manner, an integral part of iGATE that ensures that
knowledge about past analyses is retained within a company. This is
achieved by calling the report
function.
Install igate directly from CRAN just like any other R package and load it afterwards by running:
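install.packages("igate")
library(igate)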
We recommend changing the working directory to a new, empty
directory, as various functions in the igate
package will
save plots to the current working directory. The working directory can
be changed using the setwd() function or, when using RStudio, by clicking Session -> Set Working Directory -> Choose Directory.
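For example, with a placeholder path that you would replace by a directory of your own:
setwd("path/to/my/igate/analysis")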
We use the iris
data as an example for performing igate
on a continuous target.
set.seed(123)
# draw a random two-thirds sample of the iris data for the analysis
n <- nrow(iris)*2/3
rows <- sample(1:nrow(iris), n)
df <- iris[rows, ]
# run iGATE with Sepal.Length as the target; high values are considered good
results <- igate(df, target = "Sepal.Length", good_end = "high", savePlots = TRUE)
#> 0 outliers have been removed.
#> Retaining 100 observations.
#> Using pairwise comparison with 8 BOB vs. 8 WOW.
#> Using counting method with Wilcoxon rank test as follow up test.
#> Warning in wilcox.test.default(x, y): cannot compute exact p-value with ties
#> Warning in wilcox.test.default(x, y): cannot compute exact p-value with ties
results
#> Causes Count p.values good_lower_bound good_upper_bound
#> 1 Petal.Length 16 0.0008831660 6.1 6.9
#> 2 Petal.Width 16 0.0007638596 1.6 2.3
#> bad_lower_bound bad_upper_bound na_removed ties_lower_end
#> 1 1.0 1.5 0 0
#> 2 0.1 0.3 0 0
#> competition_lower_end ties_upper_end competition_upper_end adjusted.p.values
#> 1 0 3 1 0.001527719
#> 2 0 3 1 0.001527719
The significant variables are shown alongside their count summary statistic from the Tukey-Duckworth test as well as the p-value from the Wilcoxon rank test. We also see the good and bad control bands as well as several summary statistics for assessing the randomness in the results (see the documentation of igate for details). Remember to use the option savePlots = TRUE if you want to save the boxplot of the target variable as a PNG file; this file will be needed for producing the final report of the analysis.
Next, we perform a sanity check on these results:
igate.regressions(df, target = "Sepal.Length", ssv = results$Causes, savePlots = TRUE)
#> 0 outliers have been removed.
#> Retaining 100 observations.
#> Causes outliers_removed observations_retained regression_plot r_squared
#> 1 Petal.Length 0 100 TRUE 0.7813656
#> 2 Petal.Width 0 100 TRUE 0.6743999
#> gradient intercept
#> 1 0.4347797 4.196576
#> 2 0.9524464 4.690730
A data frame is returned, showing that the regression succeeded for both SSVs (column regression_plot) and displaying the respective r², gradient and intercept values. Regression plots of each SSV against the target are also produced. Remember to set the option savePlots = TRUE in the call to igate.regressions to save the regression plots as PNG files; these will be needed if you want to produce a report with the report function. Upon visual inspection, the expert can decide whether to keep each SSV for further analysis.
validation_df <- iris[-rows,]
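# Note: with real data you would pass the held-out validation data
# (validation_df above) to validate(); this example passes the full iris data.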
val <- validate(iris, target = "Sepal.Length", causes = results$Causes, results_df = results)
#> Guessing that perfromed igate was continuous. Using type = 'continuous'.
#> Warning: `data_frame()` was deprecated in tibble 1.1.0.
#> ℹ Please use `tibble()` instead.
#> ℹ The deprecated feature was likely used in the igate package.
#> Please report the issue at <https://github.com/stefan-stein/igate/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
If the type of igate to be validated (continuous or categorical) is not specified, validate will guess it automatically. The output val is a list of three data frames: the first contains all observations in the validation data set that fall into either all of the good or all of the bad control bands, plus an additional column expected_quality indicating whether the observation falls into the good or the bad bands.
head(val[[1]])
#> Sepal.Length Petal.Length Petal.Width expected_quality
#> 106 7.6 6.6 2.1 good
#> 108 7.3 6.3 1.8 good
#> 118 7.7 6.7 2.2 good
#> 119 7.7 6.9 2.3 good
#> 123 7.7 6.7 2.0 good
#> 131 7.4 6.1 1.9 good
The second data frame has one row for each validated SSV and columns Good_count and Bad_count giving the number of observations from the validation data frame that fall into the good and bad control bands, respectively, for that SSV. The first data frame is the intersection of these observations across all SSVs.
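These counts can be inspected by printing the second list element:
val[[2]]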
Lastly, the third data frame summarizes the first: if our target was continuous, it contains the minimum and maximum target values of the observations in the first data frame, separately for expected_quality good and bad.
val[[3]]
#> # A tibble: 2 × 3
#> expected_quality max_target min_target
#> <chr> <dbl> <dbl>
#> 1 good 7.9 7.3
#> 2 bad 5.8 4.3
As we can see, the observations with expected good quality do indeed have higher target values than those with expected bad quality, indicating that our analysis was successful and that we have found significant process parameters.
Finally, if we specified savePlots = TRUE
in
igate
and igate.regressions
and saved the
corresponding plots to the current working directory, we can produce a
standardized report summarizing our results by running
validatedObs <- val[[1]]
validationCounts <- val[[2]]
validationSummary <- val[[3]]
# choose a directory you want to save the report into
output_dir <- "YOUR_DIRECTORY"
report(df = df,
       target = "Sepal.Length",
       type = "continuous",
       good_outcome = "high",
       results_path = "results",
       validation = TRUE,
       validation_path = "validatedObs",
       validation_counts = "validationCounts",
       validation_summary = "validationSummary",
       output_name = "testing_igate",
       output_directory = output_dir)
This will create an HTML file with the specified output name ("testing_igate") in the chosen output directory.
Using igate for categorical target variables is completely analogous: simply run categorical.igate and categorical.freqplot instead of igate and igate.regressions.
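A minimal sketch of what this could look like for the iris data, treating Species as the categorical target, is given below. The argument names best.cat and worst.cat (the best and worst category of the target) and the Causes column of the result are assumptions made by analogy with igate, and the choice of best and worst category is purely illustrative; check ?categorical.igate and ?categorical.freqplot for the exact interfaces.
# Hypothetical sketch: best.cat / worst.cat and the Causes column are assumed
# by analogy with igate; the chosen categories are purely illustrative.
cat_results <- categorical.igate(df, target = "Species",
                                 best.cat = "virginica", worst.cat = "setosa")
categorical.freqplot(df, target = "Species", ssv = cat_results$Causes)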