
A First Data Science Project

Beau Bellamy
8 min read · Oct 12, 2019


Stackoverflow Survey 2019

When you are looking for a data science job, there is one thing you need: a portfolio of projects that shows off your skills. This is hard for people trying to get their first job in data science, like me, because they don’t have access to business data. Even when you do get hold of a good data set, it’s often hard to know what question to ask of it. On top of this, the questions need to be relevant to a business, such as:

  • What is the ROI?
  • How much time will this save?
  • What opportunities will this unlock?
  • etc…

So, why not make the search for a data science job a data science project?

Elements of a Data Science Project

  1. A Question that will add value to the business
  2. Collect the data
  3. Process the data and gain insight
  4. Communicate the insight
  5. Make a decision.

1. Question

This is usually the hardest part of the project for a budding data scientist, which is why many find it hard to start one.

Two questions are relevant to any job seeker and add value to the job search.

  1. How much can I earn?
  • You want to know what to say when a prospective employer asks “What are your salary expectations?” You don’t want the figure to be too high, because that might tell the employer that you think too highly of yourself, or that they can’t afford you, so there is no point in asking you in for further interviews. You also don’t want to ask for too little, because that suggests you don’t have the experience or don’t really understand what the role is.
  2. How much competition will there be?
  • You want to know how much competition there is in order to gauge your chances of actually getting the job.

2. Collect Data

Finding a relevant data set is typically the next hardest part of a data science project. In our case, data sets from job search websites would be useful, since they might contain details of job roles and salary expectations, and the membership of these websites could serve as a measure of the competition for each role. However, this data is typically how these websites bring in revenue, so they’re not likely to make it available to the public. The best data set for us is likely to be a survey of people in similar industries that are looking for work.

There just happens to be an organisation that surveys its global membership on various aspects of their job roles and experience in similar industries: Stack Overflow.

Each year, Stack Overflow conducts a developer survey that asks the developer community about everything from their favorite technologies to their job preferences. 2019 marks the ninth year they have published the Annual Developer Survey results, with nearly 90,000 developers taking part. They conduct their own detailed analysis, which can be found here, but they also provide the anonymised data for us to use.

So, now we have our data set.

3. Process data and gain insight

This is where you, the budding data scientist, can apply all those techniques you’ve learned. Depending on the data set and the goal of the project, this will involve some Exploratory Data Analysis (EDA) and possibly some machine learning techniques to make predictions.

For us, we won’t be making any predictions. We just need to understand what the expected salary of a data science role will be, and how many people will be looking for data science roles at the same time.

4. Communicate the insight

This can be considered one of the most important aspects of a data science project because, in a business sense, if the decision makers aren’t convinced about the outcome, the required decision won’t be made.

The main information you want to communicate is the salaries others in similar roles are getting in the industry.

5. Make a Decision

Once all the analysis has been conducted and communicated to ‘management’, they make the decision that will hopefully provide more money for the business. For this project, the decision maker will be the interviewer, and you want to convince them of the highest reasonable salary.

You have a business question and a data set. Now you need to conduct some EDA to gain some insight from the data.

First, a quick look at the data that is available.

View of available data

You can see that some features have null values. Normally, this is where you would conduct some preliminary EDA and impute values for the missing data. You can ignore them for this analysis, because none of the features you are concerned with have missing data.
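A minimal sketch of how you might check which features contain null values with pandas. The small DataFrame is toy data standing in for the survey file, and the column names (Country, Employment, OrgSize) are assumptions based on the public 2019 survey schema, not something shown in this post.

```python
import pandas as pd

# Toy stand-in for the survey data; the real file has many more columns.
# Column names are assumed from the public 2019 survey schema.
survey = pd.DataFrame({
    "Country": ["Australia", "Australia", "India", "Germany"],
    "Employment": ["Employed full-time", "Employed part-time",
                   "Employed full-time", "Retired"],
    "OrgSize": ["20 to 99 employees", None, None, "2-9 employees"],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = survey.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

If the columns you care about do not appear in that output, you can move on without imputing anything.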

Since the data set is a survey of the global community, you need to subset the data to the area that you are concerned with. For us, this is Sydney, Australia. The survey data isn’t that granular, so we will have to stick with all Australian respondents.
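Subsetting by country could look something like the sketch below. It uses toy data, and the Country column name is an assumption based on the public survey schema.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical rows).
survey = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5],
    "Country": ["Australia", "India", "Australia", "Germany", "Australia"],
})

# Keep only the Australian respondents.
australia = survey[survey["Country"] == "Australia"]

# Share of all respondents that are Australian, as a percentage.
share = 100 * len(australia) / len(survey)
print(f"{len(australia)} Australian respondents ({share:.2f}% of all respondents)")
```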

Australia

How are Australians employed

There were 1903 Australian respondents in the survey, which is 2.14% of all respondents.

Employment distribution of Australian respondents

You can see that around 90% of the Australian respondents are employed in some sense.

What developer roles do Australians do?

Australians in Development roles

Since you are a budding data scientist looking for work, you are concerned with the Australians that are in data roles, such as Data Analyst, or Data Scientist.
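Here is one way you might pick out the data roles. It assumes the developer role question is stored in a DevType column as a semicolon-separated list of answers, and that the two role labels below match the survey wording. Both are assumptions taken from the public survey schema, and the rows are toy data.

```python
import pandas as pd

# Toy stand-in for the Australian subset (hypothetical rows).
australia = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "DevType": [
        "Developer, back-end;Developer, full-stack",
        "Data or business analyst;Developer, back-end",
        "Data scientist or machine learning specialist",
        None,  # some respondents skip the question
    ],
})

# Assumed survey labels for the two data roles.
DATA_ROLES = [
    "Data or business analyst",
    "Data scientist or machine learning specialist",
]

def in_role(frame, role):
    """Return the rows whose DevType answer includes the given role label."""
    mask = frame["DevType"].str.contains(role, regex=False, na=False)
    return frame[mask]

data_roles = {role: in_role(australia, role) for role in DATA_ROLES}
for role, rows in data_roles.items():
    print(role, len(rows))
```

Because respondents can tick more than one role, the same person can show up in both groups.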

What are the Salaries of Australians employed in data roles?

Australian Data Analyst and Data Scientist Salaries
  • Average Data Analyst Salary: $107,644.97 ± $9,189.80
  • Average Data Scientist Salary: $115,734.80 ± $12,616.18
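The post doesn’t say what the ± figure represents. As a sketch, assuming it is the standard error of the mean, an average like the ones above could be produced as follows. The salaries are made-up toy values, not survey data.

```python
import math
import statistics

def mean_with_standard_error(values):
    """Return the mean and the standard error of the mean for a sample."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical salaries, for illustration only.
toy_salaries = [90000, 100000, 110000, 120000, 130000]

mean, sem = mean_with_standard_error(toy_salaries)
print(f"${mean:,.2f} ± ${sem:,.2f}")
```

If the original figure is a confidence interval instead, the standard error would need to be scaled accordingly.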

You now have the answer to the first question: what to say when an employer asks “What is your salary expectation?”

Salary expectations for Data Analysts and Data Scientists

You can now move on to the second question, “How much competition will there be?”

How many people are looking for work?

We saw earlier that around 90% of Australian respondents are employed. This breaks down as follows:

Employment                                              % of Australian respondents
Employed full-time                                      74.67
Employed part-time                                      5.47
Independent contractor, freelancer, or self-employed    9.51
Not employed, and not looking for work                  3.78
Not employed, but looking for work                      4.26
Retired                                                 0.53
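A breakdown like this can be sketched with pandas value_counts. The rows are toy data, and the Employment column name is an assumption based on the public survey schema.

```python
import pandas as pd

# Toy stand-in for the Australian subset (hypothetical rows).
australia = pd.DataFrame({
    "Employment": [
        "Employed full-time", "Employed full-time", "Employed full-time",
        "Employed part-time", "Not employed, but looking for work",
    ],
})

# Share of respondents in each employment category, as a percentage.
employment_pct = (
    australia["Employment"].value_counts(normalize=True).mul(100).round(2)
)
print(employment_pct)
```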

What about those in the data roles?

% of Australians in Data Roles

As of the 2019 Stack Overflow survey, only 4.2% of Australian respondents were looking for work, and none of the respondents in data science roles were looking for work at all. This might suggest that there is no competition, but you must realise that this is due to the selection effect of the survey.

Let’s assume the worst case, and take the 4.2% as the portion of the population that is looking for work.

In order to determine how many people you will be in competition with, you need to get some data relating to the population in each state and the employment market.

Australian state populations provides the population of each state and territory, and the LinkedIn search capability will give you an estimate of the total number of data scientists currently working in each state.

Estimate of total Working Data Scientists

as of 26 May, 2019

Percentage of data scientists in Australia: 0.062%
Percentage of data scientists in NSW: 0.07%
Number of people in NSW looking for work as a data scientist: 237
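The arithmetic behind an estimate like this can be sketched as population × share working as data scientists × share looking for work. The function and its parameter names are hypothetical, and the population of 8,000,000 is a round placeholder rather than the figure used here, so the result will not match 237 exactly.

```python
def estimate_job_seekers(population, role_share, looking_share):
    """Estimate how many people in a role are looking for work.

    population    -- number of people in the region
    role_share    -- fraction of the population working in the role
    looking_share -- fraction assumed to be looking for work
    """
    people_in_role = population * role_share
    return people_in_role * looking_share

# Illustrative inputs only: the population is a round placeholder.
estimate = estimate_job_seekers(
    population=8_000_000,
    role_share=0.07 / 100,    # 0.07% of people in NSW are data scientists
    looking_share=4.2 / 100,  # 4.2% of respondents are looking for work
)
print(round(estimate))
```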

Publish

Once you’ve done your analysis and communicated the points of interest, i.e. the salary expectations, you will need to publish your project. I have found GitHub to be a great place to host your projects. GitHub supports not only code files but also free-form text, markup and even website files. I have used a Jupyter notebook so I can include code, text and pretty figures in the one file.

Publishing content on the web or GitHub could be a blog post in itself, so I won’t go into the detail of how to do that here. However, you can find a detailed explanation of how to create a GitHub account and upload your Jupyter notebook here.

Once that is complete, you can just direct any potential employer to your GitHub site, “https://github.com/[yourname]”. This will allow you to create many projects along your learning journey, where any employer can see the progress you are making.

Conclusion

You now have a project showing off some analytical skills to any employer, which also subtly indicates your salary expectations for your first role.

In this project, you determined a reasonable estimate of what to expect for a role as a data scientist: $110,000 to $130,000. However, you should realise that this estimate includes managers and senior-level data scientists, so you might want to do some further analysis using additional features in the data set to determine a value that is specific to you.

You also have a maximum estimate of the number of people you will be in competition with: 237. You should realise that this is the maximum number of people within the entire state of NSW, although admittedly most will be within the Sydney CBD. This number also includes people looking for roles at all levels, not all of whom are relevant for people trying to get into the industry.

One issue to consider when analysing the results is the reliability of surveys in general. Even though this survey was anonymous, some people still don’t like to divulge personal information, like salary. They may enter an obscenely large number or a really small number, and these values will skew the results. This is seen in the salary estimate plots, which show around 10% of salaries below $10,000 per year.
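To see how much a couple of extreme answers can move an average, here is a small sketch comparing the mean with the median. The salaries are made-up toy values, and using the median is just one possible way to reduce the effect of outliers, not something done in this analysis.

```python
import statistics

# Hypothetical salaries with one very small and one very large answer.
toy_salaries = [500, 95000, 105000, 110000, 120000, 2_000_000]

mean_salary = statistics.mean(toy_salaries)
median_salary = statistics.median(toy_salaries)

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

The mean is pulled far above what most of the toy respondents earn, while the median stays close to the typical values.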
