Thanks @rempsyc for your message (and, on the team's behalf, for your sponsorship 💌). Yes, we'll gladly look into areas of possible overlap and improvement. As for check_outliers, that's a function that could generally benefit from some love and improvements. In general, the path for change looks like this:
All that to say, given that 1) is done and 3) is WIP (I'm sure @mattansb wouldn't mind you giving #49 a go too, though), do not hesitate to open a PR in performance for 2) to improve check_outliers(). Adding a vignette on outlier detection might also be worth looking into.
MAD workflow & package overlap
Like many others, my colleagues and I have over time developed a couple of functions for data processing, before we knew about the easyverse and datawizard.

I have realized that some of these functions, which I have integrated in the rempsyc package, overlap with the easyverse. Of course, ideally one would rely entirely on the easyverse to avoid redundancy and scattering (also, less maintenance).

For example, one common workflow taught in our R stats university class is to standardize data, identify outliers, and finally winsorize, all using the MAD (median absolute deviation), based on this publication:

Here I would like to compare this workflow between the two packages, see what could be deprecated/removed from rempsyc, and what would require further implementation/refinement in datawizard.

Standardize based on MAD
Created on 2022-06-12 by the reprex package (v2.0.1)
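For reference, MAD-based standardization can be sketched in base R as below. This is only an illustration of the computation, not the source of either package: the helper name `standardize_mad_sketch` and the example vector are made up for this sketch. If I remember the API correctly, the `datawizard` counterpart is `standardize(x, robust = TRUE)`.

```r
# Hypothetical helper (not from rempsyc or datawizard): robust z-scores.
# stats::mad() multiplies the raw MAD by 1.4826 by default, so that it
# estimates the standard deviation for normally distributed data.
standardize_mad_sketch <- function(x) {
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

# Made-up example data with one extreme value
x <- c(2, 4, 4, 5, 6, 7, 30)
z <- standardize_mad_sketch(x)
round(z, 2)
```

Because the median and the MAD are barely affected by the extreme value, it stands out far more clearly here than it would with mean/SD standardization.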
Conclusion: They provide the same results (except for the extra attributes in datawizard). That's a perfect scenario. It suggests I could deprecate/remove rempsyc::scale_mad and point users/colleagues to datawizard::standardize instead. That's a good start!

Find outliers based on MAD

(Here we have to reach for performance instead of datawizard.)
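As a rough base-R sketch of what MAD-based outlier detection on a single variable amounts to: compute robust z-scores and flag values beyond a cutoff. The helper name `find_mad_outliers_sketch`, the example vector, and the cutoff of 3 are assumptions made for this illustration, not taken from either package.

```r
# Hypothetical helper (not the rempsyc or performance implementation):
# return the positions of values whose robust z-score exceeds `criteria`.
find_mad_outliers_sketch <- function(x, criteria = 3) {
  z <- (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
  which(abs(z) > criteria)  # which() silently drops NA comparisons
}

# Made-up example data with one extreme value in position 7
x <- c(2, 4, 4, 5, 6, 7, 30)
outlier_rows <- find_mad_outliers_sketch(x)
outlier_rows
```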
Same result. Fantastic! So far, so good. However, their functionality differs a bit when they are given several variables.
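To make the multi-variable case concrete, here is a base-R sketch that flags outliers per variable and counts how often each row is flagged. The helper name `flag_mad_outliers_sketch`, the toy data frame, and the cutoff of 3 are all hypothetical; the actual output format of `find_mad` or `check_outliers` may differ.

```r
# Hypothetical helper: one logical column per variable, plus a per-row count.
flag_mad_outliers_sketch <- function(data, criteria = 3) {
  flags <- vapply(data, function(x) {
    z <- (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
    !is.na(z) & abs(z) > criteria
  }, logical(nrow(data)))
  data.frame(row = seq_len(nrow(data)), flags, n_flagged = rowSums(flags))
}

# Made-up data: row 7 is extreme on a and b, row 1 is extreme on c
df <- data.frame(
  a = c(2, 4, 4, 5, 6, 7, 30),
  b = c(10, 11, 12, 12, 13, 14, 90),
  c = c(-50, 1, 2, 2, 3, 4, 5)
)
res <- flag_mad_outliers_sketch(df)
res[res$n_flagged > 0, ]
```

Keeping the per-variable flags next to the row count makes it possible to see both which variable triggered the flag and which observations are flagged repeatedly.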
Whereas check_outliers provides an undifferentiated list of outliers across all variables (so you cannot tell which variable each observation was an outlier on), find_mad reports the outliers for each variable and also counts how many times each row/observation is identified as an outlier. This can be useful, for example, to spot a participant who provided particularly bad data, which shows up as being an outlier on several variables.

Perhaps a possible area of improvement? (At the same time, it might not be possible to modify performance::check_outliers like this if it's being used in specific ways under the hood by other functions.) No big deal though.

Winsorize based on MAD
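As an illustration of what MAD-based winsorizing means, the base-R sketch below clamps values to the median plus or minus a multiple of the MAD. The helper name `winsorize_mad_sketch`, the example vector, and the multiplier of 3 are assumptions for this sketch; it is not the code of `rempsyc::winsorize_mad`.

```r
# Hypothetical helper: pull values outside median +/- criteria * MAD
# back to those bounds, leaving all other values untouched.
winsorize_mad_sketch <- function(x, criteria = 3) {
  med <- median(x, na.rm = TRUE)
  bound <- criteria * mad(x, na.rm = TRUE)
  pmin(pmax(x, med - bound), med + bound)
}

# Made-up example data with one extreme value
x <- c(2, 4, 4, 5, 6, 7, 30)
w <- winsorize_mad_sketch(x)
w
```

A percentile-based winsorization would instead clamp to quantiles of the data, which is why the two approaches can give different results on the same input.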
This time, the results differ, and for good reason. rempsyc::winsorize_mad uses the MAD, whereas datawizard::winsorize doesn't (my understanding from the documentation is that the threshold refers to the percentile). @mattansb did suggest allowing more thresholds for winsorizing in #49 (e.g., fixed values, relative Z score, relative robust Z score). I would suggest also adding the possibility to use the MAD.

Once that is done, I will be able to change my scripts and workflow to stick only to the easyverse for processing data, and encourage others to do the same.