The Data Science Process

Jun 24

If you’re interested in working on data science projects to build your portfolio or to solve problems at work, there is a typical structure that you should be aware of before getting started. While problem solving in data science is rarely linear, the overall process generally remains the same. Whether I’m working on a research project or a project of personal interest, these are the nine steps I generally follow to make sure my work is complete.

Step 1: Determine the Goals of the Project

Before doing anything else, you must determine the goals of the project. You need a clear understanding of what you want to discover or what questions you want to answer. In some cases, data scientists phrase their goal as a question that they intend to answer by using data modeling and other techniques. This step is very important because it sets the tone and direction of the project: it enables you to decide what is relevant and what is not, and it helps you decide what type of data you will need to answer your questions.

Step 2: Collect Data

Once you have a goal or a few goals in mind, you can begin the next step, which is to search online to see if the data you need already exists. If it does - great! If not, you can do your own collection via social media, surveys, web scraping and other methods. For personal portfolio building, there are several websites where data is available for free download. Step 2 in my previous blog on how to get started lists 13 potential data sources for personal projects. This stage is not always easy because you need to thoroughly evaluate datasets to see if they are suitable for the project. Do you see any fields that you could use? Do you only have numeric data? Do you need text data? Do you need to combine multiple datasets? Are there too many missing values in the dataset? Do you need unstructured data? These are the types of questions you should be asking yourself as you go through the data collection process.
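One of the questions above, whether a dataset has too many missing values, is easy to check with a few lines of code. Here is a minimal sketch using only Python's standard library. The sample data, the column names and the `missing_share` helper are all made up for illustration; in practice you would read the file you downloaded instead.

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """customer_id,age,city,review
1,34,Austin,Great service
2,,Boston,
3,29,,Slow delivery
4,,Denver,
"""


def missing_share(rows, columns):
    """Return the fraction of empty values for each column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                counts[col] += 1
    total = len(rows)
    return {col: (counts[col] / total if total else 0.0) for col in columns}


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
shares = missing_share(rows, reader.fieldnames)

for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

How much missing data is "too much" depends on the project, so treat the percentages as a prompt for judgment rather than a pass or fail rule.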

Step 3: Enhance Domain Knowledge

Once you’ve collected your data or even before you collect data, you should be doing research about the topic you’re going to build your project around. If you are not an industry expert, it would be in your best interest to understand the industry before you start analyzing the data. This is extremely important because towards the end of the process you may need to draw conclusions and make recommendations that are relevant to the industry and make sense in its context. It would also be helpful to understand current events that may have had an impact on the industry and, by extension, the data. This research can be ongoing, as you may learn new things while you explore the data and see patterns that prompt you to dig deeper.

Step 4: Enhance Data Understanding

In order to use appropriate models and to know how to use the variables in your dataset, you have to understand the data that you’ve collected. This is closely related to domain knowledge as the datasets may contain terms or notations that you don’t understand. Try to avoid making too many assumptions about the data, and try to understand the contents of each dataset so you’ll have an idea of what you can do with them. Visualize the data with some basic charts and graphs as you peruse the datasets and try to find general trends and patterns. The benefit of looking at the raw data is that you will start making discoveries that will help you to figure out what you can focus on in the dataset. This data understanding process is typically called Exploratory Data Analysis or EDA.
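A first pass at EDA can be as simple as summary statistics plus a rough chart. The sketch below uses only the standard library and prints a text bar chart in place of a plotted one. The order amounts, the customer segments and the `summarize` helper are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical order amounts and customer segments, invented for illustration.
order_amounts = [12.5, 18.0, 22.0, 19.5, 250.0, 21.0, 17.5, 20.0]
segments = ["new", "returning", "returning", "new",
            "returning", "vip", "returning", "new"]


def summarize(values):
    """Basic descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


summary = summarize(order_amounts)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# A text bar chart of category frequencies, standing in for a plotted chart.
segment_counts = Counter(segments)
for segment, count in segment_counts.most_common():
    print(f"{segment:<10} {'#' * count}")
```

In this made-up sample, the mean sits far above the median because of a single large order. That is the kind of pattern worth investigating before assuming anything about the data.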

Step 5: Clean Data

Going through EDA will help you to identify how the data needs to be cleaned. This process includes tasks such as correcting the data types of fields (e.g. converting a string to an integer), replacing nulls with 0s where appropriate, and anything else needed to get the data into a format that is suitable for the following steps. Data cleaning usually takes up a large portion of the time in this entire process, so don’t rush through this step. Be very thorough, and keep in mind that you may have to come back to this stage after other steps as you notice more cleaning that needs to be done. In some companies, data engineers do all of the cleaning so that data scientists can focus on the next step.
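The two cleaning tasks mentioned above, converting strings to integers and replacing nulls with 0s, might look something like this. It is only a sketch: the records, the field names and the decision to zero-fill `quantity` but not `age` are assumptions made for the example.

```python
# Hypothetical raw records, as they might arrive from a CSV file (all strings).
raw_records = [
    {"order_id": "1001", "quantity": "3", "age": "41"},
    {"order_id": "1002", "quantity": "", "age": " 27 "},
    {"order_id": "1003", "quantity": "1,200", "age": "unknown"},
]

INT_FIELDS = ("order_id", "quantity", "age")
# Assumption: a blank quantity means "none ordered", so 0 is a sensible fill.
# A blank age does not mean 0 years old, so it stays as None.
ZERO_FILL_FIELDS = ("quantity",)


def to_int_or_none(text):
    """Convert a string to an int; return None if it is blank or malformed."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")
    if cleaned == "":
        return None
    try:
        return int(cleaned)
    except ValueError:
        return None


def clean_record(record):
    """Return a new record with integer fields converted and nulls handled."""
    cleaned = {}
    for field in INT_FIELDS:
        value = to_int_or_none(record.get(field))
        if value is None and field in ZERO_FILL_FIELDS:
            value = 0
        cleaned[field] = value
    return cleaned


clean_records = [clean_record(record) for record in raw_records]
for record in clean_records:
    print(record)
```

Filling nulls with 0 is a choice made per field here because a 0 that really means "unknown" can quietly distort averages later on.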

Step 6: Run Models

Based on the questions you want to answer and the data that you’ve collected, you will have to pick a model or a few models to help you answer the questions. You could be trying to figure out customer profiles of frequent shoppers, which may require the use of clustering or classification algorithms. You may be trying to predict a number based on certain factors, which would require a regression model. This is why step 1 is so important. You don’t usually pick a model and then determine what questions you can answer with it. You must pick the question first and then select appropriate models to run. In many cases, the dataset is also split into 3 sections for the sake of ensuring rigor and accuracy: training, validation and testing. Based on the use case and the sample size, you can split up your data so that the model can train on one set of data that’s representative of the wider dataset; be checked on a different set from the same dataset to determine how well it was trained; and be checked once more on a third set before you deploy it into production. Naming varies between teams, but most commonly the set used for checking during development is called the validation set, and the set held back for the final check is called the test set.
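A three-way split like the one described above can be sketched in a few lines. The `split_dataset` function, the 70/15/15 proportions and the seed are illustrative choices, not fixed rules. Naming for the second and third subsets varies, so in this sketch "validation" is the set used while developing and "test" is the set held back for a final check.

```python
import random


def split_dataset(rows, train_frac=0.7, validation_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    if train_frac <= 0 or validation_frac < 0:
        raise ValueError("fractions must be positive")
    if train_frac + validation_frac >= 1:
        raise ValueError("train_frac + validation_frac must be below 1")

    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


rows = list(range(100))  # stand-in for 100 records
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))
```

Shuffling before slicing helps each subset resemble the wider dataset. For time-ordered data or heavily imbalanced classes, a plain random split may not be appropriate.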

Step 7: Evaluate Results (Determine the Best Model)

Personally, I often find it necessary to try different models on the data. While you may think that one type of model is the best or most appropriate one to use, a different model that is just as applicable to the problem may produce much better results. In some cases, it may also be worthwhile to combine elements from different models to produce a hybrid that performs even better than any one of them could on its own. I would encourage using different strategies to determine the best model, which is the one you will eventually use to draw conclusions.
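One simple way to compare candidate models is to score each of them on data they did not see during training and keep the one with the lowest error. The sketch below compares a baseline that always predicts the mean against a straight-line fit, using mean absolute error. The data points, the two toy models and the choice of metric are all assumptions for the example; a real project would swap in its own models and a metric that suits the question.

```python
import statistics

# Hypothetical (x, y) observations, invented for illustration.
train = [(1, 2.1), (2, 4.3), (3, 5.8), (4, 8.2), (5, 9.9)]
holdout = [(6, 12.1), (7, 13.8), (8, 16.3)]


def fit_mean_model(data):
    """Baseline: always predict the average y seen in training."""
    mean_y = statistics.mean(y for _, y in data)
    return lambda x: mean_y


def fit_line_model(data):
    """Least-squares straight line fitted to the training points."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, data):
    """Average size of the prediction error on the given data."""
    return statistics.mean(abs(model(x) - y) for x, y in data)


candidates = {
    "mean baseline": fit_mean_model(train),
    "straight line": fit_line_model(train),
}
scores = {name: mean_absolute_error(model, holdout)
          for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAE = {score:.2f}")
print("best:", best_name)
```

Including a deliberately simple baseline is a useful habit. If a more complex model cannot beat it on held-out data, the extra complexity is probably not earning its keep.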

Step 8: Conduct Explanatory Analysis

In Step 4, I mentioned Exploratory Data Analysis, but this step goes beyond initial discovery. Explanatory Data Analysis is the stage at which conclusions are drawn: you explain the results of your best model and expound on what they mean. After doing your analysis, you may need to document your findings in a report or present them, along with compelling visualizations, to relevant stakeholders. Whichever format you use, it’s important to know how to communicate what you’ve discovered to technical stakeholders as well as business stakeholders. This is where collaboration with data or business analysts becomes paramount.

Step 9: Recommend Next Steps

The final step is to recommend what’s next for the company or what actions can be taken as a result of your modeling and analysis. The work doesn’t end when you complete one project. Even if you are just building your portfolio of side projects, you should adopt the habit of being forward thinking, because when you start doing this type of work professionally, the business value and next steps are tremendously important. This stage is also one where business or data analysts may play larger roles, as they generally liaise between data scientists and business stakeholders.

In theory, this is what the end-to-end process of a data science project looks like, but in reality, the steps may not be so clear and standard. Your professional experience will depend on the organization you work for, the processes that are in place and several other factors like deadlines, project scope, project scale and the policies of the decision makers that you work for. While it’s wonderful to have an idea of how you can find structure in this field, keep in mind that the role of a data scientist differs from one company to the next. Expectations may vary too: you may be on a team with data engineers and analysts, or you may be on your own. Despite such differences, I think this is generally a good way to approach problem solving in the data world, and it would be worthwhile to keep these steps in mind as your career progresses.

Odessa Elie-David