2022 WiD Datathon | Let's Talk Science

Women's Equality in the Workplace

"According to the World Economic Forum, in 2022, the global gender gap has been closed by 68.1% and at the current rate of progress, it will take 132 years to reach full parity.

Unpaid work, societal expectations, employer policies, and the availability of care continue to play an important role in a woman’s choice of education and career pathways.

The intended outcome for this Datathon is to provide awareness, education, and recommendations for how we improve women’s equality in the workplace."

This Datathon was the second one I had joined that was intended for women participants. The first one I participated in was in early 2022, hosted by a different organization, Women in Data Science (WiDS). There are SO many datathons and competitions out there that it can be intimidating to navigate the landscape.

If you're interested in joining a challenge, I recommend searching for one related to your interests and trying it. I think there are competitions for nearly every discipline, including healthcare, materials science, cybersecurity, logistics, and lots more. Soon I'll put up a resource page with links where you can find some of these challenges. This has been a great way for me to learn, because there are hard deadlines and milestones, and you're typically working on a team (unlike self-paced/unstructured offerings like Datacamp or Udemy).

I was nervous to join, and I'll probably be nervous the next time I sign up for one, but the organizations that are well-established and have experience hosting these competitions typically provide great support and structure to ease those anxieties. With my professional experience working in outreach and science communication, I know that people in data-related disciplines are not the most outgoing bunch of folks you might encounter. You don't have to be super social to participate and learn in these challenges, and it can be a safe place for you to practice team science and collaboration. If you've got the bandwidth, I highly recommend trying one out.

The WiD 2022 Datathon officially launched on September 7 and a toolkit was distributed to teams one week later. The toolkit included the challenge rules, suggested data sources, and the guidelines for the final submissions. From the toolkit:

Your challenge is to develop a problem statement, analyze data across our focus areas, and provide insights and recommendations for how we improve women’s equality in the workplace.

My team discussed a number of potential topics, and we decided to focus on the gender pay gap, primarily because of the large amount of publicly available relevant data. The general process was to find some data, explore and analyze it, develop some recommendations based on the results, and create a video presentation. The instructions for the final submission were to upload the video file, code files, and a written report that documented the work completed. I'm not going to post everything we did here, but I do want to share some of our submission, with a focus on the data analysis.

My team decided to explore datasets from the World Bank and ILOSTAT to see if changes in laws related to discrimination against women in the workplace affect the gender pay gap. The World Bank has a Gender Data Portal with datasets on many indicators, like education, assets, entrepreneurship, and laws related to gender discrimination all over the world. There's so much to explore, and it has built-in visualization tools and reports.

We reviewed several of the indicators and chose eight (independent variables, or features) to use in our analysis, based on the completeness of the data (the highest number of observations, or rows). The features we chose from the World Bank portal were binary, which means the values can only be one of two choices.
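If you want to compare completeness yourself, one way is to count the non-missing values in each column. This is only a sketch on a toy table: the column names (indicator_a, indicator_b, indicator_c) and values are made up, and it isn't necessarily how my team did the comparison.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a portal export; real column names and values will differ.
toy = pd.DataFrame({
    "indicator_a": [1, 0, 1, 1],
    "indicator_b": [1, np.nan, np.nan, 0],
    "indicator_c": [np.nan, np.nan, np.nan, 1],
})

# count() tallies the non-missing values in each column.
completeness = toy.count().sort_values(ascending=False)
print(completeness)

# Keep the two most complete indicators (the project kept eight).
chosen = completeness.head(2).index.tolist()
print(chosen)
```

The more non-missing values a column has, the more observations you have to work with later.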

DETOUR

A variable is something that you measure. Some examples are temperature, location, quantity, quality, color, size, or weight. An independent variable is not changed by other variables you are trying to measure in an experiment. For example, if you want to measure housing prices in different zip codes, the independent variable would be the zip code: it doesn’t change. A dependent variable is affected by other factors. In the housing prices example, the price of the house would be dependent on a number of things: the size, age, condition, and location of the house. Dependent variables are typically represented on the y-axis of a standard Cartesian plane or graph (like the one shown on the right), and independent variables are typically represented on the x-axis.

Binary variables are typically represented with yes/no, on/off, true/false, 0/1, etc. In fact, the word 'bit' is short for binary digit. A bit is the smallest unit of information, and a byte (typically the smallest addressable unit of memory in a computer) is a string of 8 bits. Using only 1 byte, you can encode 256 different combinations of bits (2^8), because each of the 8 bits can take one of 2 values. The binary alphabet is how computers read and store information, and binary digits are critical to the foundation of information science.
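If you'd like to check the arithmetic for how many values fit in one byte, here is a tiny Python sketch. The variable names are just for illustration.

```python
BITS_PER_BYTE = 8

# Each bit has 2 possible values, so n bits give 2**n combinations.
combinations = 2 ** BITS_PER_BYTE
print(combinations)  # 256

# The largest value one byte can hold: eight 1s read as a binary number.
largest = int("11111111", 2)
print(largest)  # 255

# format() shows the 8-bit pattern for a number.
pattern = format(65, "08b")
print(pattern)  # 01000001
```

The count is 256 because the values run from 0 through 255.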

After reviewing some of the Gender Data Portal datasets, we ended up with a .csv file that had 9828 rows, or observations, of the 8 binary features, or columns, that we were interested in, plus four index features. Index features are variables that contain identifying information about an observation. In this case, our dataset had a column that listed the country name, a three-letter country code abbreviation, the year (from 1970 to 2021), and a column that we created with the combination of the year and three-letter country code.
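Here is a rough sketch of how a combined year and country code column could be built in pandas. The column names (country_name, country_code, year, year_code) and the underscore separator are assumptions for illustration; the project file may label and format these differently.

```python
import pandas as pd

# Hypothetical column names; the project file may label these differently.
df = pd.DataFrame({
    "country_name": ["Canada", "Canada", "Kenya"],
    "country_code": ["CAN", "CAN", "KEN"],
    "year": [1970, 2021, 1970],
})

# Combine year and country code into one identifier per observation.
df["year_code"] = df["year"].astype(str) + "_" + df["country_code"]
print(df["year_code"].tolist())

# Each country-year pair should appear only once.
print(df["year_code"].is_unique)
```

A combined identifier like this is handy when you later need to match rows from two different sources, since each country-year pair gets one unique label.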

Here is the list of the eight binary variables related to gender discrimination laws that we included, with the description from the data dictionary. The column names are underlined, and I've included a screenshot of the .csv file below.

SG.LEG.SXHR.EM: There is legislation on sexual harassment in employment (1=yes; 0=no)
SG.DML.PRGW: Dismissal of pregnant workers is prohibited (1=yes; 0=no)
SG.LAW.EQRM.WK: Law mandates equal remuneration for females and males for work of equal value (1=yes; 0=no)
SG.LAW.NODC.HR: The law prohibits discrimination in employment based on gender (1=yes; 0=no)
SG.LAW.CRDD.GR: The law prohibits discrimination in access to credit based on gender (1=yes; 0=no)
SG.LAW.NMCN: The law provides for the valuation of non-monetary contributions (1=yes; 0=no)
SG.GET.JOBS.EQ: A woman can get a job in the same way as a man (1=yes; 0=no)
SG.BUS.REGT.EQ: A woman can register a business in the same way as a man (1=yes; 0=no)
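Because every one of these columns should hold only 0s and 1s, a quick sanity check can catch surprises early. This sketch uses two of the eight column names with made-up values (not real World Bank data), and the check itself is just one possible approach.

```python
import numpy as np
import pandas as pd

binary_cols = ["SG.LEG.SXHR.EM", "SG.DML.PRGW"]  # two of the eight, for brevity

# Made-up values for illustration only, not real World Bank data.
laws = pd.DataFrame({
    "SG.LEG.SXHR.EM": [0, 1, 1, np.nan],
    "SG.DML.PRGW": [1, 1, 0, 0],
})

# A column passes if every non-missing value is 0 or 1.
is_binary = laws[binary_cols].apply(lambda col: col.dropna().isin([0, 1]).all())
print(is_binary)

# The mean of a 0/1 column is the share of observations where the law exists.
share_yes = laws[binary_cols].mean()
print(share_yes)
```

A nice side effect of 1=yes and 0=no coding is that the average of a column reads directly as a proportion.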

[Screenshot of the .csv file]

Ready to learn?

Download the Dataset

Download the project dataset and other materials from the project repository on GitHub. If you're not familiar with GitHub, you can download them by clicking the button below. You'll need the .csv file, and you'll need to know its filepath, or where it lives on your computer after you download it.

Project Dataset (click to download)

Download the Notebook

Download this project file. It's a Jupyter Notebook file (.ipynb) that can be used in Google Colab, Visual Studio Code, and other Python developer environments. If you've never used this file type or have never used Google Colab, check out this video to walk through the Colab environment.

Project Notebook (click to download)

Follow along with me

I would suggest starting a blank notebook in Colab and typing the commands and stuff yourself, but feel free to work from my notebook and follow along with my project video series to complete the data processing and analysis yourself. Pause the videos whenever you need to, and refer back to this page for additional support.

Google Colab (click to open)

This project will take at least 30 minutes to complete, and probably an hour or more for beginners.

Take your time, and feel free to revisit this page and pause the videos whenever you need to.

Follow along with me as I walk you through the project and help you conduct the analysis yourself.

Part 1: Project Intro + When to use Python, R, or Excel

Guess what? Most programmers spend lots of time looking up function names, syntax, parameters, and error messages. This first video is designed to encourage anyone who wants to learn to code. You'll spend just as much time on Stack Overflow as you will writing code, if not more. That's normal! It's often joked about in the data science and computing community, but not widely discussed in academic settings.

Part 2: Some Coding Tips + Setting up the Project Environment

This covers some basics of Google Colab, Jupyter Notebooks, importing the project dataset, and importing the Python packages you'll need for the project. I also (briefly) discuss programming syntax and code annotation.
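As a rough idea of what the setup step looks like, here is a minimal sketch of importing pandas and reading a .csv file. To keep it runnable anywhere, a small in-memory CSV stands in for the downloaded file; the path in the comment is a placeholder, and the three columns are made up for illustration.

```python
import io

import pandas as pd

# With the real file you would pass its filepath instead, for example
# pd.read_csv("/content/your_file.csv"). That path is a placeholder.
csv_text = "country_code,year,SG.DML.PRGW\nCAN,1970,1\nKEN,1970,\n"
df = pd.read_csv(io.StringIO(csv_text))

# shape is (rows, columns); head() previews the first few rows.
print(df.shape)
print(df.head())
```

Notice that the empty cell in the second row is read in as NaN, which is how pandas marks a missing value.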

Part 3: Data Wrangling + Analysis

This video will walk you through the Python data manipulation and analysis for the project, with a side-by-side comparison of the Python functions and their Excel counterparts. You'll also encounter NaNs (missing values), learn some foundational Python skills, and get familiar with the pandas documentation.
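To give a flavor of the Python-versus-Excel comparison, here is a small sketch with made-up values. The Excel functions named in the comments are my own rough analogies, not necessarily the ones used in the video.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "year": [1970, 1970, 2021, 2021],
    "SG.LAW.EQRM.WK": [0, np.nan, 1, 1],
})

# Roughly like COUNTBLANK in Excel: how many values are missing?
missing = df["SG.LAW.EQRM.WK"].isna().sum()
print(missing)

# Roughly like AVERAGEIF in Excel: average the column for each year.
# pandas skips NaNs by default when averaging.
by_year = df.groupby("year")["SG.LAW.EQRM.WK"].mean()
print(by_year)
```

Because NaNs are skipped, the 1970 average here is based on one value instead of two, so it's worth checking how much data is missing before trusting a summary.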

More coming soon... work in progress!