This vignette dives provides details concerning the functions that comprise the medusa package. The readme file will demonstrate how to use the three main functions to clean your data. The purpose of this vignette is to provide further context concerning the standards we have set for each of the demographic variables. To a certain degree, this vignette could be thought of as an update to the SIOP Membership Analytics Data Prep Guide that was previously used to keep track of the various data cleaning strategies required to clean the yearly membership data. Most of the functions developed were based from the logic and work of previous volunteers. Therefore, we always invite others to collaborate and share information or methods that would better help streamline the work we are continually doing on the data. To begin, some background context concerning the data itself is provided below:

Membership Year Ranges

  • From 2000 - 2001, a SIOP year ran from 05/01 through 08/31.
  • From 2002 - 2006, a SIOP year ran from 07/01 through 10/31.
  • From 2007 onward, a SIOP year runs from 03/01 to 06/30. (For 2021: 03/01/20 - 06/30/2021)
  • SID is a unique identifer, created by the SIOP AO and serves as a unique key.

The Whole Game

The information below serves as a deep dive into what the medusa package does. However, if you are simply looking to quickly hit the road running, medusa does contain a function within itself, build_dataset, that allows you to output a cleaned CSV file to a location of your choice. Simply feed it the arguments required, of which there are currently three:

  • year - What year would you like to clean? Current acceptable range is 2017-2022.

  • set_root_dir - this specifies the location of the folder that contains the three pertinent types of files within them: the various dues files, the single demographic file, and the single staff file. All must be present, in .xlsx format, for it to be picked up appropriately by medusa.

  • save_output_to - Where would you like the cleaned CSV to be saved to?

Just as clarification, this function is not meant to output a tibble within an R environment. It will not return anything when running successfully, but you should see the csv file in the location that you specified in the function.

Data Cleaning Steps

Global Variables

The original dataset provided by the AO contains a lot of columns and variables. Not all these features are equally important for the time being. We have focused our initial efforts on streamlining and cleaning the critical demographic variables that are used either for modeling or provided in the annual membership report. This section of the vignette will outline the various variables, their type, and levels.

SID

A unique way of identifying members in an anonymous fashion. The length of the SID is typically six digits long. It serves as the primary method to run joins between various datasets for further analyses.

  • Type: Double

Geography Cleaning Stage

There are several clusters for cleaning data. One of which, geography, focuses on cleaning the information regarding a member’s location information, mainly: * City * State * Country

For state, an important stipulation is noted that this variable is only populated for members within the United States.

City

The world city of a member’s residence. Smaller towns or cities may be recoded into a larger, nearby city to help streamline analytics and aggregation. Also, acronyms for cities such as East, West, South, could be consolidated into the main city name. For example, East Orange would turn into Orange. Additionally, abbreviations such as Ft. or Mt. are turned into their full names for consistency.

  • Function step name: step_city
  • Input Variable Name: City
  • Output Variable Name: City
  • Format: Open ended response
  • Type: Character

State

The U.S. state of member’s residence.For classifying State, the following were not included as valid states: * Guam, with the acronym GU * Virgin Islands, with the acronym VI * Puerto Rico as PR - IS THIS TRUE??????? TODO

As an aside, in Australia, there is a state known by the acronym of WA. For cases that identify as their country of residence being Australia, all State responses were recoded to NA in order to avoid any confusion with Washington state in the USA.

  • Function step name: step_state
  • Input Variable Name: STATE
  • Output Variable Name: State_US
  • Format: Open ended response
  • Type: Character

Country

The country of a member’s residence. There is no cleaning that goes into this function step because the country names are clean. The only significant change that happens to this demographic variable is the imputation of United States where State_US has a valid state abbreviation.

  • Function step name: step_country
  • Input Variable Name: Country
  • Output Variable Name: Country
  • Format: Open ended response ??????????
  • Type: Character

Membership Cleaning Stage

Cleaning the membership columns deals primarily with three demographic variables: career, membership category and whether the member is in an academic or applied setting.

Primary_Sector

The Primary_Sector variable describes the type of field the member is operating in. The fields to be determined most relevant for analyses are included in the table below, under Levels.

TODO: Insert more background information on why this categorization exists. TODO: Make this a factor-level variable?

  • Function step name: step_sector
  • Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary), Not Provided (Primary)
  • Output Variable Name: Primary_Sector
  • Format: Open ended response ??????????
  • Type: Character
  • Levels: Academy, Government, Private, Other, Not Provided
Primary_Occupation

The Primary_Occupation variable is the second output from the step_sector cleaning function. It provides a much more nuanced approach to reporting the occupations of SIOP members. There exist at least thirty categorizations for this variable, ranging from Consulting firm to Financial services. This variable, in combination with Primary_Sector provides an interesting perspective on types of roles by the various sectors.

  • Function step name: step_sector
  • Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary), Not Provided (Primary)
  • Output Variable Name: Primary_Occupation
  • Format: Open ended response ??????????
  • Type: Character
  • Levels: Too many to explicitly document
MemberGroup

MemberGroup documents whether a member in the data table is an individual member within the society, or a prospect that has not yet fully signed up.

TODO: Flesh this out more, why do we capture this and how does it work?

  • Function step name: step_membergroup
  • Input Variable Name: MemberGroup
  • Output Variable Name: MemberGroup
  • Format: Drop-down selection
  • Type: Factor
  • Levels: Individual, Prospect
Academic_Applied

The AcademicApplied variable captures whether an individual is within the academic sector, or the applied sector. It is a dichotomous variable.

TODO: We categorize Other Sector as Applied, is this okay?

  • Function step name: step_academicapplied
  • Input Variable Name: Academic Sector (Primary), Government Sector (Primary), Private Sector (Primary), Other Sector (Primary)
  • Output Variable Name: AcademicApplied
  • Format: Open ended response ??????????
  • Type: Factor
  • Levels: Academic, Applied

Education Cleaning Stage

Education cleaning deals with demographic variables such as highest degree completed, institution attended, and the type of institution. Since the School variable is open ended, a custom database has been built that joins with the main analysis file to help streamline reporting of schools and their profit or non-profit status.

Highest_Degree

Highest_Degree specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.

Update on 02/01/2023 - We decided to do away with the include_ABD parameter and converted all ABD status people to a Master’s degree. We also included the MOP degree under Masters.

  • Function step name: step_highestdegree
  • Function Last Updated: February 1, 2023
  • Input Variable Name: Degree
  • Output Variable Name: Highest_Degree
  • Format: Open ended response through 2021. Then close ended response > 2021
  • Type: Factor
  • Levels: Associate, Bachelors, Doctorate, Other, Masters
Priv_Publ

Priv_Publ specifies the level of education that was the highest obtained for a specific SIOP member. This variable requires a lot of cleaning due to the messy nature of the data. Lots of members included various acronyms, and specialized degree titles for this field, making the standardization and reporting tedious.

  • Function step name: step_privatepublic
  • Input Variable Name: Control
  • Output Variable Name: Priv_Publ
  • Format: Open ended response ???
  • Type: Factor
  • Levels: International, Private, Public
ProfitNon

The final school related demographic cleaning step concerns whether a school or institution is labelled as a for-profit or non-profit. The categorizations were developed as an independent study that culminated in the creation of a database of schools

  • Function step name: step_privatepublic
  • Input Variable Name: Control
  • Output Variable Name: Priv_Publ
  • Format: Open ended response ???
  • Type: Factor
  • Levels: International, Private, Public

Demographics Cleaning Stage

Demographic cleaning deals with traditional variables, such as gender, ethnicity, birthdate, and age. The categorizations used are considered to be best practices.

Gender

Gender is reported as male or female.

  • Function step name: step_gender
  • Input Variable Name: Gender
  • Output Variable Name: Gender
  • Format: Close ended response
  • Type: Factor
  • Levels: Male, Female
Birthdate

Birthdate represents the date of birth of the member. It is coded as a proper date variable.

TODO: Clean out birthdates that are obviously wrong, like > 100 years of age.

  • Function step name: step_birthdate
  • Input Variable Name: Birth Date
  • Output Variable Name: Birthdate
  • Format: Open ended response
  • Type: Date
Ethnicity

Ethnicity represents the race of the SIOP member. The standard seven categorizations are used for this.

  • Function step name: step_ethnicity
  • Input Variable Name: Ethnicity
  • Output Variable Name: Ethnicity
  • Format: Close ended response
  • Type: Factor
  • Levels: White, Black or African American, Asian, Two or more races, Hispanic or Latino, other, Native American/Alaska Native, Native Hawaiian or other Pacific Islander
Age Group

Age group is not directly queried from members. Instead, we use the Birthdate variable to calculate the current age of the member, by comparing the birthdate to a function that obtains today’s date.

  • Function step name: step_agegroup
  • Input Variable Name: Birthdate
  • Output Variable Name: AgeGroup
  • Format: Not a direct response from members
  • Type: Factor
  • Levels: 20 and under, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70+
Interest Areas

This variable consists of a series of three columns in the original demographics dataset. Minimal cleaning was performed, such as removing multiple categories in the first, second, or third interest area. Several members put more than one category in a single variable, for example, for int_1, somebody may have put Psychometrics and Organizational Culture. In this instance, organizational culture would have been dropped and Psychometrics retained for int_1. This is to help streamline the analytics and set expectations on the data file year after year.

  • Function step name: step_interests
  • Input Variable Name: Interest Area (first choice), Interest Area (second choice), Interest Area (third choice),
  • Output Variable Name: int_1, int_2, int_3
  • Format: Close ended response
  • Type: character
  • Levels: Coaching/Leadership Development, Testing/Assessment, Leadership, Inclusion/Diversity, Careers/Career Planning/Mentoring/Socialization/Onboarding/Retirement, Staffing, Groups/Teams, Job Analysis/Job Design/Competency Modeling, Job Attitudes/Engagement, Occupational Health/Safety/Stress & Strain/Aging, Organizational Performance/Change Management/Downsizing/OD, Measurement/Statistical Techniques, Organizational Culture/Climate, Performance Appraisal/Feedback/Management/Goal-Setting, I-O Psychology Education/Professional Development, Conflict/Counterproductive Behavior/Harassment/Violence/Workplace Deviance, Personality, Global/International/Cross-Cultural Issues, Training, Research Methodology, Work and Family/Non-Work Life/Leisure/Flexible Work Arrangements, Motivation/Rewards/Compensation, Consulting Practices/Ethical Issues, Emotions/Emotional Labor, Employee Withdrawal/Retention, Strategic HR/Utility/Changing Role of HR, Job Performance/Citizenship Behavior, Technology, Legal Issues/Employment Law/Union Issues/Labor Relations, Innovation/Creativity, Judgment/Decision Making, Other, Customer Service/Consumer Psychology, Human Factors/Ergonomics, Organizational Justice, Teaching I-O Psychology/Student Affiliate Issues/Professional Development in I-O Psychology, Pro-Social
Involvement Seeking

A variable asks members if they are seeking any involvement in a wide range of activities that are aligned with SIOP initiatives. Each unique category is created into a column with a 0 and 1 for whether a member possesses the inclination toward that activity.

  • Function step name: step_involvement
  • Input Variable Name: Involvement Sought
  • Output Variable Name: inv_share_knowledge, inv_collaborate_on_projects, inv_conduct_research, inv_join_interdisclipinary_teams, inv_other, inv_present_at_meetings_conferences, inv_join_communities_of_interest, inv_network, inv_consult_to_business
  • Format: Numeric
  • Type: Double (integer)
  • Levels: 0, 1
Organizational Affilitations

A variable asks members if they are associated with any other organizational or professional affilitiations. Each level is transformed into a unique column.

  • Function step name: step_other_orgs
  • Input Variable Name: Other Organizational Affiliations, HRCI Certification: Do you hold PHR, SPHR, or GPHR certification?, SHRM Certification: Do you hold SHRM-CP or SHRM-SCP certification?, Are you a licensed psychologist?
  • Output Variable Name: org_aff_AOM, org_aff_APA, org_aff_APS, org_aff_SHRM, org_aff_EAWOP, org_aff_IAAP, org_aff_CPA, org_aff_HRCI, org_aff_SHRM, licensed
  • Format: Numeric
  • Type: Double (integer)
  • Levels: 0, 1

Dues Cleaning Stage

The second Excel file to be read into the analysis for building a SIOP membership year is the dues file. The most important variable in the dues file specifies the date of the transaction for registration as a member. This date serves as classifying the member into the appropriate SIOP calendar year. Additionally, the dues file contains information concerning the type of membership that a member is either registering for, or renewing. All cleaning functions concerning the dues excel file are enclosed in the step_dues function. They are not partitioned out as observed in the demographics excel file cleaning stage described above.

Membership Dues

Membership dues describes the type of transaction that occurred. This information helps us ensure the correct value is charged to the current type of member.

TODO: THE PRICING TABLE FOR SIOP MEMBERS

  • Function step name: step_dues
  • Input Variable Name: ItemDescription
  • Output Variable Name: Membership_Dues
  • Format: Not a direct response from members
  • Type: Character
  • Levels: Associate, Fellow, Member, Retired Associate, Retired Fellow, Retired Member, Student Affiliate, International Affiliate
MD Number

The MD Number is back translated from the output of MembershipDues and categorized as a numeric value. In the future, I am not so sure this step will be required, as we can just continue to use MembershipDues for any type of analysis that requires the MD_Number.

TODO: VARIABLE REQUIRED OR NOT?

  • Function step name: step_dues
  • Input Variable Name: Membership_Dues
  • Output Variable Name: MD_Number
  • Format: Not a direct response from members
  • Type: Double (integer)
  • Levels: 1 = Student affiliate, 2 = Associate, 3 = International Affiliate, 4 = Member, 5 = Fellow, 6 = Retired International, 7 = Retired International Affiliate, 8 = Retired Member, 9 = Retired Fellow
Student or Professional

The student professor variable is linked to the MD_Number. It categorizes a 1, or, Student Affiliate, as a student, and all else as a professional.

  • Function step name: step_dues
  • Input Variable Name: MD_Number
  • Output Variable Name: Student_Prof
  • Format: Not a direct response from members
  • Type: Double (integer)
  • Levels: Student, Professional