Variable Definitions • medusa

This vignette dives provides details concerning the functions that comprise the medusa package. The readme file will demonstrate how to use the three main functions to clean your data. The purpose of this vignette is to provide further context concerning the standards we have set for each of the demographic variables. To a certain degree, this vignette could be thought of as an update to the SIOP Membership Analytics Data Prep Guide that was previously used to keep track of the various data cleaning strategies required to clean the yearly membership data. Most of the functions developed were based from the logic and work of previous volunteers. Therefore, we always invite others to collaborate and share information or methods that would better help streamline the work we are continually doing on the data. To begin, some background context concerning the data itself is provided below:

Membership Year Ranges

From 2000 - 2001, a SIOP year ran from 05/01 through 08/31.
From 2002 - 2006, a SIOP year ran from 07/01 through 10/31.
From 2007 onward, a SIOP year runs from 03/01 to 06/30. (For 2021: 03/01/20 - 06/30/2021)
SID is a unique identifer, created by the SIOP AO and serves as a unique key.

The Whole Game

The information below serves as a deep dive into what the medusa package does. However, if you are simply looking to quickly hit the road running, medusa does contain a function within itself, build_dataset, that allows you to output a cleaned CSV file to a location of your choice. Simply feed it the arguments required, of which there are currently three:

year - What year would you like to clean? Current acceptable range is 2017-2022.
set_root_dir - this specifies the location of the folder that contains the three pertinent types of files within them: the various dues files, the single demographic file, and the single staff file. All must be present, in .xlsx format, for it to be picked up appropriately by medusa.
save_output_to - Where would you like the cleaned CSV to be saved to?

Just as clarification, this function is not meant to output a tibble within an R environment. It will not return anything when running successfully, but you should see the csv file in the location that you specified in the function.

Data Cleaning Steps

Global Variables

The original dataset provided by the AO contains a lot of columns and variables. Not all these features are equally important for the time being. We have focused our initial efforts on streamlining and cleaning the critical demographic variables that are used either for modeling or provided in the annual membership report. This section of the vignette will outline the various variables, their type, and levels.

`SID`

A unique way of identifying members in an anonymous fashion. The length of the SID is typically six digits long. It serves as the primary method to run joins between various datasets for further analyses.

Type: Double

Geography Cleaning Stage

There are several clusters for cleaning data. One of which, geography, focuses on cleaning the information regarding a member’s location information, mainly: * City * State * Country

For state, an important stipulation is noted that this variable is only populated for members within the United States.

`City`

The world city of a member’s residence. Smaller towns or cities may be recoded into a larger, nearby city to help streamline analytics and aggregation. Also, acronyms for cities such as East, West, South, could be consolidated into the main city name. For example, East Orange would turn into Orange. Additionally, abbreviations such as Ft. or Mt. are turned into their full names for consistency.

Function step name: step_city
Input Variable Name: City
Output Variable Name: City
Format: Open ended response
Type: Character

`State`

The U.S. state of member’s residence.For classifying State, the following were not included as valid states: * Guam, with the acronym GU * Virgin Islands, with the acronym VI * Puerto Rico as PR - IS THIS TRUE??????? TODO

As an aside, in Australia, there is a state known by the acronym of WA. For cases that identify as their country of residence being Australia, all State responses were recoded to NA in order to avoid any confusion with Washington state in the USA.

Function step name: step_state
Input Variable Name: STATE
Output Variable Name: State_US
Format: Open ended response
Type: Character

`Country`

The country of a member’s residence. There is no cleaning that goes into this function step because the country names are clean. The only significant change that happens to this demographic variable is the imputation of United States where State_US has a valid state abbreviation.

Function step name: step_country
Input Variable Name: Country
Output Variable Name: Country
Format: Open ended response ??????????
Type: Character

Membership Cleaning Stage

Cleaning the membership columns deals primarily with three demographic variables: career, membership category and whether the member is in an academic or applied setting.

`Primary_Sector`

The Primary_Sector variable describes the type of field the member is operating in. The fields to be determined most relevant for analyses are included in the table below, under Levels.

TODO: Insert more background information on why this categorization exists. TODO: Make this a factor-level variable?

Function step name: step_sector
Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary), Not Provided (Primary)
Output Variable Name: Primary_Sector
Format: Open ended response ??????????
Type: Character
Levels: Academy, Government, Private, Other, Not Provided

`Primary_Occupation`

The Primary_Occupation variable is the second output from the step_sector cleaning function. It provides a much more nuanced approach to reporting the occupations of SIOP members. There exist at least thirty categorizations for this variable, ranging from Consulting firm to Financial services. This variable, in combination with Primary_Sector provides an interesting perspective on types of roles by the various sectors.

Function step name: step_sector
Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary), Not Provided (Primary)
Output Variable Name: Primary_Occupation
Format: Open ended response ??????????
Type: Character
Levels: Too many to explicitly document

`MemberGroup`

MemberGroup documents whether a member in the data table is an individual member within the society, or a prospect that has not yet fully signed up.

TODO: Flesh this out more, why do we capture this and how does it work?

Function step name: step_membergroup
Input Variable Name: MemberGroup
Output Variable Name: MemberGroup
Format: Drop-down selection
Type: Factor
Levels: Individual, Prospect

`Academic_Applied`

The AcademicApplied variable captures whether an individual is within the academic sector, or the applied sector. It is a dichotomous variable.

TODO: We categorize Other Sector as Applied, is this okay?

Function step name: step_academicapplied
Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary)
Output Variable Name: AcademicApplied
Format: Open ended response ??????????
Type: Factor
Levels: Academic, Applied

Education Cleaning Stage

Education cleaning deals with demographic variables such as highest degree completed, institution attended, and the type of institution. Since the School variable is open ended, a custom database has been built that joins with the main analysis file to help streamline reporting of schools and their profit or non-profit status.

`Highest_Degree`

Highest_Degree specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.

Update on 02/01/2023 - We decided to do away with the include_ABD parameter and converted all ABD status people to a Master’s degree. We also included the MOP degree under Masters.

Function step name: step_highestdegree
Function Last Updated: February 1, 2023
Input Variable Name: Degree
Output Variable Name: Highest_Degree
Format: Open ended response through 2021. Then close ended response > 2021
Type: Factor
Levels: Associate, Bachelors, Doctorate, Other, Masters

`Priv_Publ`

Priv_Publ specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.

Function step name: step_privatepublic
Input Variable Name: Control
Output Variable Name: Priv_Publ
Format: Open ended response ???
Type: Factor
Levels: International, Private, Public

`ProfitNon`

The final school related demographic cleaning step concerns whether a school or institution is labelled as a for-profit or non-profit. The categorizations were developed as an independent study that culminated in the creation of a database of schools

Function step name: step_privatepublic
Input Variable Name: Control
Output Variable Name: Priv_Publ
Format: Open ended response ???
Type: Factor
Levels: International, Private, Public

Demographics Cleaning Stage

Demographic cleaning deals with traditional variables, such as gender, ethnicity, birthdate, and age. The categorizations used are considered to be best practices.

`Gender`

Gender is reported as male or female.

Function step name: step_gender
Input Variable Name: Gender
Output Variable Name: Gender
Format: Close ended response
Type: Factor
Levels: Male, Female

`Birthdate`

Birthdate represents the date of birth of the member. It is coded as a proper date variable.

TODO: Clean out birthdates that are obviously wrong, like > 100 years of age.

Function step name: step_birthdate
Input Variable Name: Birth Date
Output Variable Name: Birthdate
Format: Open ended response
Type: Date

`Ethnicity`

Ethnicity represents the race of the SIOP member. The standard seven categorizations are used for this.

Function step name: step_ethnicity
Input Variable Name: Ethnicity
Output Variable Name: Ethnicity
Format: Close ended response
Type: Factor
Levels: White, Black or African American, Asian, Two or more races, Hispanic or Latino, other, Native American/Alaska Native, Native Hawaiian or other Pacific Islander

`Age Group`

Age group is not directly queried from members. Instead, we use the Birthdate variable to calculate the current age of the member, by comparing the birthdate to a function that obtains today’s date.

Function step name: step_agegroup
Input Variable Name: Birthdate
Output Variable Name: AgeGroup
Format: Not a direct response from members
Type: Factor
Levels: 20 and under, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70+

`Interest Areas`

This variable consists of a series of three columns in the original demographics dataset. Minimal cleaning was performed, such as removing multiple categories in the first, second, or third interest area. Several members put more than one category in a single variable, for example, for int_1, somebody may have put Psychometrics and Organizational Culture. In this instance, organizational culture would have been dropped and Psychometrics retained for int_1. This is to help streamline the analytics and set expectations on the data file year after year.

Function step name: step_interests
Input Variable Name: Interest Area (first choice), Interest Area (second choice), Interest Area (third choice),
Output Variable Name: int_1, int_2, int_3
Format: Close ended response
Type: character
Levels: Coaching/Leadership Development, Testing/Assessment, Leadership, Inclusion/Diversity, Careers/Career Planning/Mentoring/Socialization/Onboarding/Retirement, Staffing, Groups/Teams, Job Analysis/Job Design/Competency Modeling, Job Attitudes/Engagement, Occupational Health/Safety/Stress & Strain/Aging, Organizational Performance/Change Management/Downsizing/OD, Measurement/Statistical Techniques, Organizational Culture/Climate, Performance Appraisal/Feedback/Management/Goal-Setting, I-O Psychology Education/Professional Development, Conflict/Counterproductive Behavior/Harassment/Violence/Workplace Deviance, Personality, Global/International/Cross-Cultural Issues, Training, Research Methodology, Work and Family/Non-Work Life/Leisure/Flexible Work Arrangements, Motivation/Rewards/Compensation, Consulting Practices/Ethical Issues, Emotions/Emotional Labor, Employee Withdrawal/Retention, Strategic HR/Utility/Changing Role of HR, Job Performance/Citizenship Behavior, Technology, Legal Issues/Employment Law/Union Issues/Labor Relations, Innovation/Creativity, Judgment/Decision Making, Other, Customer Service/Consumer Psychology, Human Factors/Ergonomics, Organizational Justice, Teaching I-O Psychology/Student Affiliate Issues/Professional Development in I-O Psychology, Pro-Social

`Involvement Seeking`

A variable asks members if they are seeking any involvement in a wide range of activities that are aligned with SIOP initiatives. Each unique category is created into a column with a 0 and 1 for whether a member possesses the inclination toward that activity.

Function step name: step_involvement
Input Variable Name: Involvement Sought
Output Variable Name: inv_share_knowledge, inv_collaborate_on_projects, inv_conduct_research, inv_join_interdisclipinary_teams, inv_other, inv_present_at_meetings_conferences, inv_join_communities_of_interest, inv_network, inv_consult_to_business
Format: Numeric
Type: Double (integer)
Levels: 0, 1

`Organizational Affilitations`

A variable asks members if they are associated with any other organizational or professional affilitiations. Each level is transformed into a unique column.

Function step name: step_other_orgs
Input Variable Name: Other Organizational Affiliations, HRCI Certification: Do you hold PHR, SPHR, or GPHR certification?, SHRM Certification: Do you hold SHRM-CP or SHRM-SCP certification?, Are you a licensed psychologist?
Output Variable Name: org_aff_AOM, org_aff_APA, org_aff_APS, org_aff_SHRM, org_aff_EAWOP, org_aff_IAAP, org_aff_CPA, org_aff_HRCI, org_aff_SHRM, licensed
Format: Numeric
Type: Double (integer)
Levels: 0, 1

Dues Cleaning Stage

The second Excel file to be read into the analysis for building a SIOP membership year is the dues file. The most important variable in the dues file specifies the date of the transaction for registration as a member. This date serves as classifying the member into the appropriate SIOP calendar year. Additionally, the dues file contains information concerning the type of membership that a member is either registering for, or renewing. All cleaning functions concerning the dues excel file are enclosed in the step_dues function. They are not partitioned out as observed in the demographics excel file cleaning stage described above.

`Membership Dues`

Membership dues describes the type of transaction that occurred. This information helps us ensure the correct value is charged to the current type of member.

TODO: THE PRICING TABLE FOR SIOP MEMBERS

Function step name: step_dues
Input Variable Name: ItemDescription
Output Variable Name: Membership_Dues
Format: Not a direct response from members
Type: Character
Levels: Associate, Fellow, Member, Retired Associate, Retired Fellow, Retired Member, Student Affiliate, International Affiliate

`MD Number`

The MD Number is back translated from the output of MembershipDues and categorized as a numeric value. In the future, I am not so sure this step will be required, as we can just continue to use MembershipDues for any type of analysis that requires the MD_Number.

TODO: VARIABLE REQUIRED OR NOT?

Function step name: step_dues
Input Variable Name: Membership_Dues
Output Variable Name: MD_Number
Format: Not a direct response from members
Type: Double (integer)
Levels: 1 = Student affiliate, 2 = Associate, 3 = International Affiliate, 4 = Member, 5 = Fellow, 6 = Retired International, 7 = Retired International Affiliate, 8 = Retired Member, 9 = Retired Fellow

`Student or Professional`

The student professor variable is linked to the MD_Number. It categorizes a 1, or, Student Affiliate, as a student, and all else as a professional.

Function step name: step_dues
Input Variable Name: MD_Number
Output Variable Name: Student_Prof
Format: Not a direct response from members
Type: Double (integer)
Levels: Student, Professional