Case Study: Using Data to Predict Storms

In February 2022, the Eastern Caribbean Central Bank, the World Bank and the University of the West Indies published a call for data scientists around the world to participate in the first-ever Caribbean-based Data and Artificial Intelligence Challenge. The challenge was part of the 6th annual Growth and Resilience Dialogue, with the goal of encouraging individuals to use data about Caribbean countries to answer a research question addressing a climate change issue. Approximately 20 teams participated, and my two-person team made it to the finals in April and placed 4th overall. Since our project is not open source and we are unable to share the data, this blog post is a summary of my team’s solution.

Defining the Problem

To begin, we started with the problem. With both of us being from Dominica, we had observed that Caribbean countries are vulnerable to the effects of climate change, including rising temperatures, reduced precipitation, and more intense and frequent hurricanes and floods. According to the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), the Caribbean and Latin American region is the second most disaster-prone region in the world. Additionally, last year the United Nations (UN) reported that the North American, Central American and Caribbean region accounted for 18% of weather-, climate- and water-related disasters worldwide, as well as 4% of climate-related deaths worldwide (>74,000 deaths). The UN also reported that the Caribbean and Central America accounted for 7% of worldwide economic losses due to climate and weather events. For a small region, 7% is quite a large share, so it was clear that this was a valid concern.

After defining the problem, my teammate and I had to determine our research question. We were not very well versed in the intricacies of climate change, so we had to do some additional research to understand important terms and get different perspectives on the climate situation in the Caribbean. We also wanted to understand the situation as a whole, so we began with a broad research question: “What are the relationships between climate-related variables in the ECCU countries?” My teammate and I wanted to understand relevant trends and significant factors to consider when making decisions, and our goal was to develop a model that could support some form of prediction for disaster preparedness.

Data Understanding & Data Preparation

Before we started building models, we had to collect the data, which proved to be a challenge as we didn’t know where to find reliable data from reputable sources on short notice. We had about one month to procure the data, clean it, analyze it, test hypotheses, build models, compare models and make recommendations. Since it was so difficult to find our own datasets, we used the data provided by the challenge organizers. We compiled the datasets into one using basic Excel functions and pivot tables, then moved to Google Colaboratory, where we uploaded the data and worked on it with the Python programming language. The data understanding and preparation phases are usually the most time-consuming phases of a data science project, and we spent more than 80% of our time addressing outliers, enforcing the correct column data types, and imputing missing values. You cannot and should not build models on unclean data, so this is without a doubt the most important part of the data science process.
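
To give a sense of what that cleanup can look like, here is a minimal pandas sketch. The file name and columns (rainfall_mm, temperature_c, country) are hypothetical stand-ins, since we can’t share the actual dataset:

```python
import pandas as pd

# Load the compiled dataset (file and column names are illustrative).
df = pd.read_csv("eccu_climate_monthly.csv")

# Enforce correct column data types: coerce any non-numeric entries to NaN.
for col in ["rainfall_mm", "temperature_c"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Cap extreme rainfall outliers at the 99th percentile instead of dropping rows.
df["rainfall_mm"] = df["rainfall_mm"].clip(upper=df["rainfall_mm"].quantile(0.99))

# Impute remaining missing temperatures with each country's median.
df["temperature_c"] = df.groupby("country")["temperature_c"].transform(
    lambda s: s.fillna(s.median())
)
```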

As we tried to understand the relationships between climate variables, we found moderate relationships between some of the numeric variables, but no strong ones (i.e., no correlations above 0.7). Still, given those moderate relationships, we concluded that prediction was worth attempting. We were not sure whether we wanted to predict rainfall or temperatures, but as we did more exploratory data analysis, we decided to create some new variables that were not originally present in the dataset. We added two binary variables that we thought were important for our model. The first was a hurricane season variable, with a 1 indicating that a month falls within the hurricane season and a 0 otherwise. The second was a storm recorded variable, with a 1 indicating that a storm was recorded that month and a 0 otherwise. Adding new variables is part of a process called feature engineering, which is important to consider when you want to use your data to tell a story.
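
A rough sketch of that feature engineering, continuing the hypothetical columns from the snippet above and assuming a numeric month column (1–12). The Atlantic hurricane season runs June through November; storm_count is an assumed column from the storm records:

```python
# Binary flag: the Atlantic hurricane season runs June through November.
df["hurricane_season"] = df["month"].between(6, 11).astype(int)

# Binary flag: 1 if at least one storm was recorded that month
# ("storm_count" is a hypothetical column from the merged storm records).
df["storm_recorded"] = (df["storm_count"] > 0).astype(int)

# Quick check of pairwise correlations among the variables of interest.
print(df[["rainfall_mm", "temperature_c", "storm_recorded"]].corr())
```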

Modeling & Deployment

After attempting different models, my team decided that our decision tree was the best model. A decision tree is a model that takes various inputs and tries to classify or predict something by essentially asking a series of yes/no questions. For example, the algorithm might check whether the monthly rainfall exceeded 133 mm; if yes, it would take action A, and if not, action B. We decided to predict storms by using one of the variables we created, “storm recorded”, as our response variable. We randomly partitioned the data, with 80% dedicated to training and 20% dedicated to testing. Creating a training dataset matters because it exposes the model to the data and gives it a chance to learn the important trends, so that it can predict accurately when new data becomes available. As the names suggest, the training set is used to train the model, and the test set is used to check its accuracy and precision. Sometimes a third set, called the validation set, is used to ensure that the model is robust, but in this case we used only training and testing. The decision tree is also great because its output is a set of rules that can be applied consistently to predict your target variable. This means that our tree gave us an understanding of the exact scenarios that must occur for a storm to be predicted; if those rules are not met, the model will not predict a storm.
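
Here is a hedged scikit-learn sketch of that workflow, reusing the hypothetical columns from the earlier snippets (our actual features and settings differed):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features and the engineered response variable (names are illustrative).
X = df[["rainfall_mm", "temperature_c", "hurricane_season"]]
y = df["storm_recorded"]

# Random 80/20 partition into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a shallow tree so the resulting rules stay interpretable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```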

The decision tree produced the following results with 88% accuracy:

  • Rule #1: Predict storm when Rainfall <= 133 mm AND Temperature <= 26 °C AND not in Hurricane Season

  • Rule #2: Predict storm when Rainfall > 133 mm

  • Rule #3: Don’t predict storm when Rainfall <= 133 mm AND Temperature > 26 °C

  • Rule #4: Predict storm when Rainfall <= 113 mm AND Temperature <= 26 °C AND in Hurricane Season
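
Rule sets like this can be read straight off a fitted scikit-learn tree; continuing the sketch above:

```python
from sklearn.tree import export_text

# Print the fitted tree as human-readable if/else rules.
print(export_text(tree, feature_names=list(X.columns)))
```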

Our goal for this model was to enable some predictive analysis and potentially support early warning systems for such storms. However, there is still much more work to be done! My team also created a front-end user interface using a combination of Google Data Studio dashboards, Heroku and GitHub. We recognized that the ability to accurately predict a storm is valuable and should not be taken lightly. If we had the opportunity to continue developing a solution, we would work on a number of things, including integrating our solution with IoT devices for real-time analysis and better predictive capabilities. We understand that the overall goal is to save lives and be prepared for disasters, so this prototype is in no way a final solution; there is a lot more that can be done.

This was our first time participating in an AI challenge, and we are so proud of our effort. In a few weeks we were able to gather, clean and explore data; build multiple models and evaluate their effectiveness; build a small web app prototype and make it to the top 4 among remarkable, experienced and brilliant competitors. If you’re interested in viewing our presentation (and that of the other finalists), feel free to watch it here.

In conclusion, this is only one example of a real-world application of data science, and there are many other cases where data can be valuable for problem-solving. I’m looking forward to finding and participating in more challenges and sharing them with you as I continue my learning journey.
