Session Summary: What to do with a World of Data

Hi DataFam! I hope you're all enjoying the holidays and looking forward to the new year. This is my final blog for 2022 and I can’t believe that it has already been one year since I started blogging!

In this blog, which is long overdue, I will be sharing a loose transcript from my Women in Tech Caribbean Conference talk dubbed “What to do with a World of Data”. During the conference in August, I shared 4 use-cases across different industries indicating what types of problems people can solve with data and technology. I’ve already shared details about the first and second use-cases in prior blogs, but the others are new to my readers. Regardless, this is a summary of a few applications of data science, so let’s jump in!

Data 101

Firstly, what is data and where does it come from? Data is everywhere and almost anything can be considered data: the contacts in your cell phone, the shows you've streamed on Netflix, the transactions against your bank account, the pictures you post on social media, and even the receipts or certificates you receive in hard copy. With the growing popularity of the Internet of Things, several trillion bytes of data are being generated every second of every day. Whether we realize it or not, we live with data, we ingest it and we produce it.

So, what can we do with a world of data? To be honest, we can do a lot - from engineering data pipelines to making predictions, analyzing trends, or prescribing action plans for an institution or company. If you have access to data, don't just let it sit there - you can do something meaningful with it. One of my favorite things to do is work on predictive analytics projects; I think it's one of the most amazing things to try to predict the future and actually be accurate. There's a lot we can do with data and I find that very exciting. So let's begin. What can we do with a world of data?

Use-Case 1: Recommend Products to Consumers

With all data projects, we are generally trying to solve a problem. For our first scenario, the problem is threefold:

1. people want good skin care products that are appropriate for their skin type and preferences;

2. some skin care products may be suitable for a person's skin type but are harmful to the environment or contain ingredients that cause cancer or other harm to humans; and

3. there is no convenient way to learn about a product's impact on the spot while shopping.

Most of us don't want to spend time researching every single ingredient on a product's label before ordering it, and this is where modern technology comes in. For this project, we needed a way to give people recommendations for products that are not harmful to them or the environment, without asking for too much effort on their part.

To solve this problem, a combination of data engineering and data science techniques was used. The goal was to recommend non-harmful skin care products to users. To do this, skin care product data was collected, along with data about the harmfulness of certain ingredients. The data came from multiple sources and had to be uploaded to a cloud platform, where we could build a data warehouse. Once the data was in, we did some transformations and calculated a product impact score based on the number of harmful versus non-harmful ingredients in a product. The higher the score, the more harmful the product, and the less interested we were in recommending it to users. After building the data warehouse, the data was prepped and a recommendation engine was built using the item-based collaborative filtering algorithm. With this algorithm, we find the correlations between products based on previous users' shopping patterns. Finally, we built the front end interface for users to receive their recommendations.
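To make the two ideas above concrete, here is a minimal sketch of an impact score plus item-based filtering. All of the product names, ingredients and ratings are made up for illustration, and the similarity measure (Pearson correlation between products' rating vectors) is one common choice, not necessarily the exact one used in the project.

```python
from math import sqrt

# Hypothetical harmful-ingredient list and toy product catalog (illustrative only)
HARMFUL = {"paraben", "triclosan"}
PRODUCTS = {
    "A": ["water", "glycerin"],
    "B": ["water", "glycerin", "paraben"],
    "C": ["water", "triclosan", "paraben"],
}
# Each vector holds ratings from the same three users, in the same order
RATINGS = {
    "A": [5, 4, 1],
    "B": [4, 5, 2],
    "C": [1, 2, 5],
}

def impact_score(ingredients, harmful=HARMFUL):
    """Share of a product's ingredients flagged as harmful (0 = clean, 1 = all harmful)."""
    return sum(1 for ing in ingredients if ing in harmful) / len(ingredients)

def pearson(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def recommend(liked, threshold=0.5):
    """Item-based filtering: rank products by correlation with `liked`,
    skipping anything whose impact score is at or above the threshold."""
    candidates = [
        (other, pearson(RATINGS[liked], RATINGS[other]))
        for other in RATINGS
        if other != liked and impact_score(PRODUCTS[other]) < threshold
    ]
    return [p for p, sim in sorted(candidates, key=lambda t: -t[1]) if sim > 0]
```

With this toy data, a user who liked product A would be recommended product B (similar rating pattern, low impact score), while product C is filtered out because two of its three ingredients are on the harmful list.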

To complete this project, Amazon Web Services (AWS), SQL, Python and Google Data Studio were used. AWS was used for data loading and storage, SQL for creating the database and tables, Python for data cleaning and building the engine, and Google Data Studio as the front end interface.

Use-Case 2: Predict Storms

The second thing you can do with data is predict something - in this case, the likelihood of a storm. Caribbean people have all experienced storms, no matter how light or how intense. Weather and climate data are recorded by various institutions, and from that data we know that global temperatures are rising, rainfall is decreasing and the Caribbean remains very vulnerable to these effects.

Given this problem, I thought it would be interesting to try to predict when a storm would occur. This was a data science solution focused on using a decision tree algorithm to come up with rules that determine a yes or a no for a storm. The yes or no was based on 3 factors: rainfall, temperature and whether the data was recorded during the hurricane season or not. We started by collecting the data, then doing some exploratory analysis and cleaning. We tried different models, but the decision tree produced the most accurate results.

The tree gave us 3 major rules for predicting that a storm will come, and 1 rule for predicting no storm. Overall, from all the rules, we learned that if rainfall falls somewhere between 113 and 133 mm and temperature falls below 26 degrees Celsius, whether we are in the hurricane season or not, there will be a storm (not necessarily a hurricane, but possibly a less intense storm). This model doesn't tell us the intensity of the storm; with the data we had, we could only detect "storm" or "no storm". The other situation where we would automatically predict a storm is when rainfall exceeds 133 mm. It's automatically a yes when there is heavy rainfall, and we don't even need to consider temperature or other factors. With more data you can get even more granular or specific, but in any case, this is a contribution towards disaster preparedness - which is completely different from the previous use-case we talked about.
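The two rules spelled out above can be written as plain if/else logic. This is only a sketch of the rules as described, not the fitted tree itself, and the thresholds come directly from the text:

```python
def predict_storm(rainfall_mm, temp_c, in_hurricane_season):
    """Encodes the two rules quoted above; a sketch, not the trained model."""
    if rainfall_mm > 133:
        return "storm"  # heavy rainfall alone is enough, regardless of other factors
    if 113 <= rainfall_mm <= 133 and temp_c < 26:
        return "storm"  # in this rainfall band, hurricane season does not matter either
    # the season flag is kept for completeness; the quoted rules ignore it
    return "no storm"
```

For example, 140 mm of rainfall predicts a storm on rainfall alone, while 120 mm with a temperature of 25 degrees Celsius also predicts a storm, in or out of season.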

A combination of Python, Heroku and Google Data Studio was used for this project, and these are all free resources (at the time of the conference). Python was used for cleaning and analyzing the data as well as building the decision tree model. Heroku was used to host some data reports on the internet, and Google Data Studio was used to build an informative dashboard that users could click through to learn more about climate in the Caribbean.

Use-Case 3: Build Customer Profiles

The third use-case here is a situation where a bank realized that they had a customer churn problem. Customers were closing their accounts and the bank wanted to identify the reasons so that they could attempt to prevent future customers from churning.

To solve this problem I decided to take a prescriptive analytics approach, where I tried to understand the data, do some research about recent events, run some models and come up with an action plan for the institution. For this scenario I ended up using the K-prototypes algorithm, which creates clusters from a mix of numeric and categorical data. The algorithm came up with clusters describing the customers who may be at risk of churning, so I could get a profile of the age, gender, account balance range and other pertinent information for each group of customers. I could then recommend to the bank how to target the at-risk customers based on these profiles. Projects like this can keep the bank in business, as it can address issues quickly and reach out to customers or market relevant services to them.
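The key idea that lets K-prototypes mix numeric and categorical fields is its dissimilarity measure: Euclidean distance on the numbers plus a weighted count of category mismatches. Here is a minimal sketch of that measure with made-up customer records and hand-picked prototypes (in the real algorithm the prototypes are learned iteratively):

```python
# Hypothetical customer records: (age, balance in $000s, gender, region)
CUSTOMERS = [
    (25, 1.2, "F", "north"),
    (27, 0.8, "F", "north"),
    (60, 55.0, "M", "south"),
]
# Two illustrative cluster prototypes (normally refined over many iterations)
PROTOTYPES = [
    (26, 1.0, "F", "north"),   # younger, low-balance profile
    (58, 50.0, "M", "south"),  # older, high-balance profile
]

def mixed_distance(x, y, num_idx=(0, 1), cat_idx=(2, 3), gamma=1.0):
    """K-prototypes dissimilarity: squared Euclidean distance on the numeric
    fields plus gamma times the number of categorical mismatches."""
    numeric = sum((x[i] - y[i]) ** 2 for i in num_idx)
    categorical = sum(1 for i in cat_idx if x[i] != y[i])
    return numeric + gamma * categorical

def assign(record):
    """Assign a customer to the nearest prototype (one step of the algorithm)."""
    distances = [mixed_distance(record, p) for p in PROTOTYPES]
    return distances.index(min(distances))
```

Each resulting cluster's prototype then doubles as the customer profile - an average age and balance plus the most common gender and region - which is what makes the output easy to hand to the bank as an action plan.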

For this project I used software from SAS called SAS Viya. It is not free, but I didn't have to do any coding because the software has various algorithms embedded. I used an application to do this, but it can also be done for free with code, such as Python. And if anyone is interested in machine learning, or in comparing your code with results from analytics software, there is a free tool called Weka that does the same thing as SAS Viya.

Use-Case 4: Analyze Risk

The last use-case is centered around helping companies analyze their data to manage their risk. In some industries, like the oil and gas industry, large amounts of data are accumulated each day and stored in hundreds of tables in a database. The transactional systems that collect this data offer limited ability to customize and develop sophisticated reports or perform analytics.

To solve the problem, a combination of data engineering and data visualization techniques was used. Data was extracted from the software, loaded into the cloud, then processed and transformed to create relevant reports. Such reports explain and break down data points of interest so that companies can understand profit and loss and examine their business risk every day. Various tools were used to solve the problem, including Microsoft Azure, SQL, Python and Power BI.
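As a flavor of the transform step, here is a tiny sketch that rolls raw transaction rows up into a daily profit-and-loss figure per business unit - the kind of aggregation a report like this sits on. The row shape and values are invented for illustration; the real pipeline ran at much larger scale in SQL and Python on Azure.

```python
from collections import defaultdict

# Hypothetical transaction rows as they might arrive from the source system
TRANSACTIONS = [
    {"date": "2022-08-01", "unit": "refining", "amount": 1200.0},
    {"date": "2022-08-01", "unit": "refining", "amount": -300.0},
    {"date": "2022-08-01", "unit": "retail", "amount": 450.0},
]

def daily_pnl(rows):
    """Aggregate net profit/loss per (date, business unit) for a daily report."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["date"], row["unit"])] += row["amount"]
    return dict(totals)
```

The output of a step like this is what gets loaded into the reporting layer (Power BI in this project) for the day-to-day risk view.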

Conclusion

The field of Data Science provides a sea of opportunities for everyone who is interested. As you can tell, none of the use-cases I described were alike, and that's what I love about working on data projects. There are so many angles you can take to solve a problem. I've used data for recommending products, disaster preparedness, banking operations and risk analytics. That tells us there is no shortage of possibilities for what we can do with data! We can use data to design economic development projects, manage risk, perform behavioral analysis, analyze retail data and analyze geographic data, among many other things. As a woman in tech, I find comfort in knowing that we can work in any industry we want, especially when it comes to using data to solve problems.

To learn more about my experience at the conference, you can find my reflection here and a video recording of the session can be found on the Women in Tech Caribbean YouTube channel. I look forward to completing more data projects at work and outside of work in the new year and can’t wait to share more resources, knowledge and tips with you. Thank you to all my readers for following my journey on social media and for supporting this blog from its inception. I wish you all a peaceful end to this year and a prosperous 2023!
