Introduction

Welcome to the Canadian Salary Data repository. This repository aims to provide a comprehensive and unified dataset of Canadian salary information derived from the Stack Overflow surveys. The data spans from 2011 to 2023 and includes additional datasets to enhance the accuracy and depth of the salary analysis. This repository serves as a valuable resource for understanding salary trends and distributions across various roles, industries, and cities in Canada. The data processing steps, from collection to cleaning and integration, ensure that the dataset is robust and ready for further analysis.

The final dataset can be found at Canadian Data/CanadaData.csv.

A toy application showing the dataset in action can be found at: https://www.canada-tech-salary.tech

Kaggle Dataset: https://www.kaggle.com/datasets/moun12345/canadian-salary-data-from-stack-overflow-survey

File Structure

Canadian Data Preprocessing/*: Contains the Jupyter notebook used to clean the data.
Canadian Data/CanadaData.csv: Contains the final dataset for the Canadian data. This dataset is used to create the final machine learning model. The salary column is in USD.
Canadian Data/GovSalaryCombined.csv: Dataset constructed to map Canadian cities to the salaries in the Stack Overflow survey. All salaries in this dataset are in CAD.
Canadian Data/CanadianCitiesPopulation.csv: Contains the population of the Canadian cities.
Canadian Data/Stack Overflow/stack-overflow-canada.csv: Contains the unified, uncleaned view of all Canadian salary data from the Stack Overflow survey.
Canadian Data/Stack Overflow/*: Normally contains all the Stack Overflow survey files, but they are too large for the repo. You need to download them from the Stack Overflow Survey and store them in this format (up to 2023.csv):

Data Processing Pipeline

This is a brief description of the notable problems I had to solve to build the dataset used for this application. "User" refers to a person who answered the Stack Overflow Survey.

A visual representation of the data used can be found at: https://www.canada-tech-salary.tech/data

The repository for the website can be found at: https://github.com/MounirAia/Canadian-Tech-Salary-Prediction-App

Data Collection

Sources

My primary data source is the Stack Overflow Survey. I processed the data from 2011 to 2023. All the datasets that are not from the Stack Overflow Survey were used to enhance the Stack Overflow data. In total, I processed 25 files.

13 Stack Overflow Survey files
- Only processed the data of users working in Canada
Canada Government data
Canada's most populous cities

Data Integration

The Stack Overflow datasets differ from year to year in how they store their data (column names and column values). I had to integrate the different columns into a uniform structure to make model training possible. This step relied heavily on ChatGPT, which works well for string processing of columns.

The data collected in the Stack Overflow Survey was:

Company size of the person's company
Industry type of the company
Title of the job
Years of experience of the person
Country (fixed to Canada)
City of the person's job
Salary
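
The integration step described above can be sketched as a per-year column mapping. This is only an illustration, not the notebook's actual code: the source column names in COLUMN_MAP and the unified names (company_size, title, salary) are assumptions and should be checked against the real survey headers.

```python
# Hypothetical per-year mapping: source column name -> unified column name.
# The source names are assumptions; verify them against each survey's CSV header.
COLUMN_MAP = {
    2017: {"CompanySize": "company_size", "DeveloperType": "title", "Salary": "salary"},
    2023: {"OrgSize": "company_size", "DevType": "title", "ConvertedCompYearly": "salary"},
}


def unify_row(year, row):
    """Rename the columns of one survey row to the unified schema.

    Columns that are not in the mapping for that year are dropped.
    """
    mapping = COLUMN_MAP[year]
    unified = {new: row[old] for old, new in mapping.items() if old in row}
    unified["year"] = year
    unified["country"] = "Canada"  # only Canadian respondents are kept
    return unified


raw_2023 = {
    "OrgSize": "20 to 99 employees",
    "DevType": "Developer, full-stack",
    "ConvertedCompYearly": "80000",
    "Unrelated": "ignored",
}
unified_2023 = unify_row(2023, raw_2023)
```

Keeping the mapping in a single dictionary makes it easy to add a survey year without touching the renaming logic.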

Title problem

Each dataset stored similar role names (the title of the user's job) in different ways. This is a problem when you want to group similar roles together to strengthen the model. For example, the system needs to understand that 'Full-stack developer' == 'Developer, full-stack', and that both titles should be considered the same role.

To solve this problem, I used predefined bins of tech industry titles, taken from the 2023 survey. I mapped all the titles from the other surveys into those bins. I used ChatGPT extensively for the mapping.
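
A minimal sketch of such a title mapping, assuming a hand-built lookup table. The entries in TITLE_BINS and the "Other" fallback are illustrative and are not the bins actually used in the notebook.

```python
# Hypothetical lookup table: lowercased raw title -> canonical bin.
# The entries are illustrative, not the repository's real mapping.
TITLE_BINS = {
    "full-stack developer": "Developer, full-stack",
    "developer, full-stack": "Developer, full-stack",
    "full stack web developer": "Developer, full-stack",
    "back-end developer": "Developer, back-end",
    "developer, back-end": "Developer, back-end",
}


def normalize_title(raw_title, default="Other"):
    """Map a raw survey title to its bin, ignoring case and extra whitespace."""
    key = " ".join(raw_title.lower().split())
    return TITLE_BINS.get(key, default)


same_role = normalize_title("Full-stack developer") == normalize_title("Developer, full-stack")
```

Titles that do not appear in the table fall back to the default bin, so unmapped values are easy to spot and review.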

Data Cleaning

Missing Values - City Problem

None of the Stack Overflow datasets contained a City column, which is a key predictive feature for the model.

To solve this problem, I inferred each user's city from their salary, title, and experience, looking for the most plausible Canadian city where they work. Using the Canada Government data, I collected the average salary for each role title at each level of experience in each Canadian city. I then mapped each user to the city that minimizes the distance between the user's salary and the city's average salary for the user's title and experience.
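
The nearest-average rule described above can be sketched as follows. The AVG_SALARY table and its numbers are made up for illustration, and the sketch assumes the user's salary and the reference averages are expressed in the same currency.

```python
# Hypothetical reference table: (title, experience bucket, city) -> average salary.
# The figures are invented for illustration; they are not government data.
AVG_SALARY = {
    ("Developer, full-stack", "2-4", "Toronto"): 85000.0,
    ("Developer, full-stack", "2-4", "Montreal"): 75000.0,
    ("Developer, full-stack", "2-4", "Vancouver"): 90000.0,
}


def guess_city(title, experience, salary):
    """Return the city whose average salary for this title and experience
    is closest to the given salary, or None if there is no reference row."""
    candidates = {
        city: avg
        for (t, e, city), avg in AVG_SALARY.items()
        if t == title and e == experience
    }
    if not candidates:
        return None
    return min(candidates, key=lambda city: abs(candidates[city] - salary))


city = guess_city("Developer, full-stack", "2-4", 77000.0)
```

With these made-up figures, a salary of 77000 is 2000 away from the Montreal average, 8000 from Toronto, and 13000 from Vancouver, so Montreal is selected.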

Missing Values - Industry Problem

Many Stack Overflow datasets did not contain the industry of the company where the person worked.

To solve this problem, I computed the distribution of industries for each survey that included industry data. For each survey without industry data, I then assigned each user an industry at random, following the industry proportions of the closest survey (in terms of year) that had them.
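
A minimal sketch of this random assignment, assuming the industry proportions are already available per year. The years, industry names, and proportions in INDUSTRY_SHARES are hypothetical.

```python
import random

# Hypothetical industry proportions for the survey years that recorded industry.
INDUSTRY_SHARES = {
    2017: {"Software": 0.5, "Finance": 0.3, "Healthcare": 0.2},
    2023: {"Software": 0.6, "Finance": 0.25, "Healthcare": 0.15},
}


def closest_year(year):
    """Return the year with industry data that is nearest to `year`."""
    return min(INDUSTRY_SHARES, key=lambda y: abs(y - year))


def impute_industry(year, rng):
    """Draw one industry at random, weighted by the nearest year's proportions."""
    shares = INDUSTRY_SHARES[closest_year(year)]
    industries = list(shares)
    weights = [shares[name] for name in industries]
    return rng.choices(industries, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [impute_industry(2012, rng) for _ in range(1000)]
```

Sampling keeps the overall industry mix of the imputed year close to the reference year, but the value assigned to any individual user carries no real information.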

Outliers - Salary problem

Various salaries were too high or too low. Also, the Stack Overflow Survey did not ask for the user's base salary, but rather for the total salary (including stocks and benefits).

To solve this problem, I computed the Z-score of each user's salary using the average salary and standard deviation of the user's city for each experience category [0 to 1 year, 2-4, 5-9, 10-*] (from the Canada Government data). I removed all data points whose Z-score had an absolute value greater than 2.
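
The filter described above can be sketched as follows. The CITY_STATS table, its figures, and the bucket labels are hypothetical, and the sketch assumes salaries and reference statistics share the same currency.

```python
# Hypothetical reference statistics: (city, experience bucket) -> (mean, std).
# The figures are invented for illustration; they are not government data.
CITY_STATS = {
    ("Toronto", "0-1"): (60000.0, 10000.0),
    ("Toronto", "5-9"): (95000.0, 15000.0),
}


def z_score(salary, city, experience):
    """Z-score of a salary against its city and experience bucket."""
    mean, std = CITY_STATS[(city, experience)]
    return (salary - mean) / std


def drop_outliers(rows, threshold=2.0):
    """Keep only rows whose absolute Z-score does not exceed the threshold."""
    return [
        row
        for row in rows
        if abs(z_score(row["salary"], row["city"], row["experience"])) <= threshold
    ]


rows = [
    {"city": "Toronto", "experience": "0-1", "salary": 65000.0},   # z = 0.5, kept
    {"city": "Toronto", "experience": "0-1", "salary": 150000.0},  # z = 9.0, removed
    {"city": "Toronto", "experience": "5-9", "salary": 60000.0},   # z is about -2.33, removed
]
kept = drop_outliers(rows)
```

Scoring each salary against its own city and experience bucket means a high senior salary is not flagged just because it is far from the overall mean.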

Data Exploration

This step was mainly done with the goal of finding anomalies in the dataset.

For example, I found that the highest average salary belonged to people with 0 to 1 year of experience, which is not plausible. This is why the Z-score computation above considers the average salary per city by experience.
