Medusa-Overview.Rmd
This vignette dives provides details concerning the functions that comprise the medusa package. The readme file will demonstrate how to use the three main functions to clean your data. The purpose of this vignette is to provide further context concerning the standards we have set for each of the demographic variables. To a certain degree, this vignette could be thought of as an update to the SIOP Membership Analytics Data Prep Guide that was previously used to keep track of the various data cleaning strategies required to clean the yearly membership data. Most of the functions developed were based from the logic and work of previous volunteers. Therefore, we always invite others to collaborate and share information or methods that would better help streamline the work we are continually doing on the data. To begin, some background context concerning the data itself is provided below:
The information below serves as a deep dive into what the medusa
package does. However, if you are simply looking to quickly hit the road
running, medusa does contain a function within itself,
build_dataset
, that allows you to output a cleaned CSV file
to a location of your choice. Simply feed it the arguments required, of
which there are currently three:
year
- What year would you like to clean? Current
acceptable range is 2017-2022.
set_root_dir
- this specifies the location of the
folder that contains the three pertinent types of files within them: the
various dues files, the single demographic file, and the single staff
file. All must be present, in .xlsx format, for it to be picked up
appropriately by medusa.
save_output_to
- Where would you like the cleaned
CSV to be saved to?
Just as clarification, this function is not meant to output a tibble within an R environment. It will not return anything when running successfully, but you should see the csv file in the location that you specified in the function.
The original dataset provided by the AO contains a lot of columns and variables. Not all these features are equally important for the time being. We have focused our initial efforts on streamlining and cleaning the critical demographic variables that are used either for modeling or provided in the annual membership report. This section of the vignette will outline the various variables, their type, and levels.
There are several clusters for cleaning data. One of which, geography, focuses on cleaning the information regarding a member’s location information, mainly: * City * State * Country
For state, an important stipulation is noted that this variable is only populated for members within the United States.
City
The world city of a member’s residence. Smaller towns or cities may be recoded into a larger, nearby city to help streamline analytics and aggregation. Also, acronyms for cities such as East, West, South, could be consolidated into the main city name. For example, East Orange would turn into Orange. Additionally, abbreviations such as Ft. or Mt. are turned into their full names for consistency.
State
The U.S. state of member’s residence.For classifying State, the following were not included as valid states: * Guam, with the acronym GU * Virgin Islands, with the acronym VI * Puerto Rico as PR - IS THIS TRUE??????? TODO
As an aside, in Australia, there is a state known by the acronym of WA. For cases that identify as their country of residence being Australia, all State responses were recoded to NA in order to avoid any confusion with Washington state in the USA.
Country
The country of a member’s residence. There is no cleaning that goes into this function step because the country names are clean. The only significant change that happens to this demographic variable is the imputation of United States where State_US has a valid state abbreviation.
Cleaning the membership columns deals primarily with three demographic variables: career, membership category and whether the member is in an academic or applied setting.
Primary_Sector
The Primary_Sector
variable describes the type of field
the member is operating in. The fields to be determined most relevant
for analyses are included in the table below, under Levels.
TODO: Insert more background information on why this categorization exists. TODO: Make this a factor-level variable?
Primary_Occupation
The Primary_Occupation
variable is the second output
from the step_sector
cleaning function. It provides a much
more nuanced approach to reporting the occupations of SIOP members.
There exist at least thirty categorizations for this variable, ranging
from Consulting firm to Financial services. This variable, in
combination with Primary_Sector
provides an interesting
perspective on types of roles by the various sectors.
MemberGroup
MemberGroup documents whether a member in the data table is an individual member within the society, or a prospect that has not yet fully signed up.
TODO: Flesh this out more, why do we capture this and how does it work?
Academic_Applied
The AcademicApplied variable captures whether an individual is within the academic sector, or the applied sector. It is a dichotomous variable.
TODO: We categorize Other Sector as Applied, is this okay?
Education cleaning deals with demographic variables such as highest degree completed, institution attended, and the type of institution. Since the School variable is open ended, a custom database has been built that joins with the main analysis file to help streamline reporting of schools and their profit or non-profit status.
Highest_Degree
Highest_Degree specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.
Update on 02/01/2023 - We decided to do away with the include_ABD parameter and converted all ABD status people to a Master’s degree. We also included the MOP degree under Masters.
Priv_Publ
Priv_Publ specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.
ProfitNon
The final school related demographic cleaning step concerns whether a school or institution is labelled as a for-profit or non-profit. The categorizations were developed as an independent study that culminated in the creation of a database of schools
Demographic cleaning deals with traditional variables, such as gender, ethnicity, birthdate, and age. The categorizations used are considered to be best practices.
Gender
Gender is reported as male or female.
Birthdate
Birthdate represents the date of birth of the member. It is coded as a proper date variable.
TODO: Clean out birthdates that are obviously wrong, like > 100 years of age.
Ethnicity
Ethnicity represents the race of the SIOP member. The standard seven categorizations are used for this.
Age Group
Age group is not directly queried from members. Instead, we use the Birthdate variable to calculate the current age of the member, by comparing the birthdate to a function that obtains today’s date.
Interest Areas
This variable consists of a series of three columns in the original demographics dataset. Minimal cleaning was performed, such as removing multiple categories in the first, second, or third interest area. Several members put more than one category in a single variable, for example, for int_1, somebody may have put Psychometrics and Organizational Culture. In this instance, organizational culture would have been dropped and Psychometrics retained for int_1. This is to help streamline the analytics and set expectations on the data file year after year.
Involvement Seeking
A variable asks members if they are seeking any involvement in a wide range of activities that are aligned with SIOP initiatives. Each unique category is created into a column with a 0 and 1 for whether a member possesses the inclination toward that activity.
Organizational Affilitations
A variable asks members if they are associated with any other organizational or professional affilitiations. Each level is transformed into a unique column.
The second Excel file to be read into the analysis for building a SIOP membership year is the dues file. The most important variable in the dues file specifies the date of the transaction for registration as a member. This date serves as classifying the member into the appropriate SIOP calendar year. Additionally, the dues file contains information concerning the type of membership that a member is either registering for, or renewing. All cleaning functions concerning the dues excel file are enclosed in the step_dues function. They are not partitioned out as observed in the demographics excel file cleaning stage described above.
Membership Dues
Membership dues describes the type of transaction that occurred. This information helps us ensure the correct value is charged to the current type of member.
TODO: THE PRICING TABLE FOR SIOP MEMBERS
MD Number
The MD Number is back translated from the output of
MembershipDues
and categorized as a numeric value. In the
future, I am not so sure this step will be required, as we can just
continue to use MembershipDues
for any type of analysis
that requires the MD_Number
.
TODO: VARIABLE REQUIRED OR NOT?
Student or Professional
The student professor variable is linked to the
MD_Number
. It categorizes a 1, or, Student Affiliate, as a
student, and all else as a professional.