Before diving into time series forecasting and its various methods and models, it is always beneficial to understand what time series data is and what characteristics such data typically have.
With this clarity in mind, understanding the various methods and models in time series analysis and forecasting becomes easier.
Time series data is all around us. Fields such as finance, public administration, energy, retail, and healthcare are dominated by it. Some popular examples are:
- Stock value or share price variation
- The number of COVID-19 infections
- Rainfall per day
- Sensor data
- Population growth
- Revenue of an organization
- Sales of a retail store, etc.
The common property of all the above examples is that the value changes over time. So, any data that deals with changes over time is time series data. It’s a collection of data points that are stored with respect to their time.
For example, consider the following data frame:
As we can observe, we have a time component (in our case, the month and year) and the actual time series variable (in our case, the number of passengers).
Any time series data can be simply visualized by a line graph.
The dataset above is one of the most commonly used time series datasets. It shows how the number of air passengers varies over the months. As you can see, it has a time column (year/month), which tells us that this is time series data. The observations are also arranged in chronological order, which is another very important property of time series data.
With time series data, two things can typically be done: time series analysis and time series forecasting.
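As a rough illustration of such a data frame, here is a minimal sketch in pandas. The values are synthetic and the column names `month` and `passengers` are assumptions made for this example; this is not the real air passengers data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: synthetic values, NOT the real air passengers data.
months = pd.date_range(start="2020-01-01", periods=24, freq="MS")  # month-start stamps
rng = np.random.default_rng(seed=0)
values = 100 + 2 * np.arange(24) + rng.normal(scale=3, size=24)

df = pd.DataFrame({"month": months, "passengers": values.round().astype(int)})

# Use the time column as the index so pandas treats the data as a time series.
series = df.set_index("month")["passengers"]

# Observations should be in chronological order with no duplicate timestamps.
is_chronological = series.index.is_monotonic_increasing and series.index.is_unique

print(series.head())
print("chronological:", is_chronological)

# If matplotlib is installed, a line graph is one call away:
# series.plot(title="Synthetic monthly passengers")
```

Indexing by time is a design choice that makes later steps such as resampling, slicing by date, and plotting more convenient.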
Time series analysis involves developing models that best capture or describe an observed time series to understand the underlying causes. This field of study seeks the “why” behind a time series dataset. This often involves making assumptions about the form of the data and decomposing the time series into constituent components.
Time series forecasting, by contrast, involves making predictions about the future. Models are trained on historical data and are then used to predict future observations.
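To make the idea of "train on the past, predict the future" concrete, here is a minimal sketch that uses a seasonal naive forecast as a stand-in model. The helper `seasonal_naive_forecast` and the synthetic series are assumptions made for illustration; real projects would typically use a proper forecasting model.

```python
import numpy as np
import pandas as pd

def seasonal_naive_forecast(history: pd.Series, horizon: int, period: int = 12) -> pd.Series:
    """Predict each future month with the value observed one season (period) earlier."""
    last_season = history.to_numpy()[-period:]
    values = [last_season[step % period] for step in range(horizon)]
    # Continue the monthly index past the last observed timestamp.
    future_index = pd.date_range(start=history.index[-1], periods=horizon + 1, freq="MS")[1:]
    return pd.Series(values, index=future_index)

# Synthetic series (an assumption for illustration): linear trend plus a yearly pattern.
t = np.arange(36)
index = pd.date_range(start="2020-01-01", periods=36, freq="MS")
series = pd.Series(100 + 2 * t + 10 * np.sin(2 * np.pi * (t % 12) / 12), index=index)

# Fit on the past and hold out the last year to check the forecast.
train, test = series.iloc[:24], series.iloc[24:]
forecast = seasonal_naive_forecast(train, horizon=len(test))

mae = (test - forecast).abs().mean()
print(f"mean absolute error: {mae:.1f}")
```

On this synthetic series the error is about 24 for every month, because the naive method repeats last year's values and ignores the trend of 2 units per month over 12 months. That gap is the kind of thing a better model is meant to close.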
Why is time series data important?
Organizations can use time series data in many ways. Good time series analysis supports data storytelling and helps explain past events, such as a stock market crash or a fall in sales. It helps organizations understand the underlying drivers of the behavior (like trend or seasonality), and it can be used to compare data across segments. Overall, good time series analysis provides a better understanding of the underlying processes and helps make better and more informed decisions.
Time series forecasting, on the other hand, is used to predict future events, such as increased demand or increased customer churn, so that you can prepare for them in advance.
So, overall, time series analysis and forecasting are essential components for making good business decisions.
Different types of time series data
Time series datasets can be broadly of two types:
- Univariate time series
- Multivariate time series
As the name suggests, a univariate time series deals with only one variable, i.e., a single time series. Only one variable varies over time, and no causes or relationships are considered. An example is a sensor measuring the temperature of a room every second.
In contrast, a multivariate time series deals with two or more variables, i.e., multiple interdependent time series. It lets us model the relationships between these variables. For example, consider a time series capturing ice cream sales together with a second variable, temperature. Both are time series, and they depend on each other: if the temperature increases, ice cream sales typically increase, and vice versa.
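The difference can be sketched in pandas: a univariate series is a single column indexed by time, while a multivariate series is several columns sharing the same time index. The data below is synthetic, and the column names and the linear relationship between temperature and sales are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
days = pd.date_range(start="2023-06-01", periods=60, freq="D")

# Hypothetical data: temperature drives ice cream sales, plus random noise.
temperature = 25 + rng.normal(scale=4, size=60)
sales = 50 + 4 * temperature + rng.normal(scale=5, size=60)

# Univariate: one variable observed over time.
univariate = pd.Series(temperature, index=days, name="temperature_c")

# Multivariate: two interdependent variables on the same time index.
multivariate = pd.DataFrame(
    {"temperature_c": temperature, "ice_cream_sales": sales}, index=days
)

# A simple first look at the relationship between the two series.
correlation = multivariate["temperature_c"].corr(multivariate["ice_cream_sales"])
print(f"correlation between temperature and sales: {correlation:.2f}")
```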
Characteristics of time series data
We broadly discuss the following characteristics of time series: trend, seasonality, cyclicity, and residuals (errors). We then show how a time series can be decomposed into these components.
Trend is a pattern observed over a period of time that represents the mean rate of change with respect to time. It shows the tendency of the data to increase (uptrend) or decrease (downtrend) over the long run. A trend usually persists for some time and does not repeat.
In essence, it captures the overall behavior of the data.
For example, there is an overall upward trend in the number of air passengers in the time series we plotted above. Its trend can be represented as shown below:
Seasonality represents periodic fluctuation in which the same pattern occurs at regular intervals. It is similar behavior observed repeatedly over a fixed interval of time.
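One simple way to estimate a trend as a "mean rate of change" is to fit a straight line by least squares. The sketch below does this on synthetic data (not the air passengers series); the true slope of 2 units per month is an assumption built into the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
t = np.arange(48)
index = pd.date_range(start="2019-01-01", periods=48, freq="MS")

# Synthetic series: a rise of 2 units per month plus random noise.
series = pd.Series(100 + 2 * t + rng.normal(scale=3, size=48), index=index)

# Fit a straight line y = slope * t + intercept by least squares.
slope, intercept = np.polyfit(t, series.to_numpy(), deg=1)
trend = pd.Series(slope * t + intercept, index=index)

direction = "uptrend" if slope > 0 else "downtrend"
print(f"estimated rate of change: {slope:.2f} per month ({direction})")
```

A straight line is only one possible choice. A moving average is a common alternative when the trend is not linear.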
There can be many types of seasonalities:
- Hourly, etc.
For example, we can see a seasonal pattern in the air passengers time series above: around the holiday season, the number of passengers typically shoots up, and this repeats every year (hence seasonal). The seasonality of this time series can be shown as below:
Cyclical components are fluctuations that recur in a similar form over the long run but are not seasonal (i.e., they are not periodic). They may occur because of a certain event, but there is no predefined time for their occurrence.
For example, the stock price of a company may have an overall upward trend, but the price may sometimes fall because of certain ‘internal agitations’. A similar fall in the stock price can occur each time the agitation occurs, and the agitation does not happen with a predefined periodicity. It can happen at any time, but whenever it happens, similar behavior will be observed in the time series.
The key difference between the two is that a seasonal pattern has a fixed, known period, whereas a cyclic pattern exists when the data exhibit rises and falls that are not of a fixed period.
After removing the trend and seasonality patterns from our time series, whatever is left behind is typically unexplainable and is called the error, unexpected variation, or residual.
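A quick way to check for a yearly pattern is to average the values per calendar month and to measure how strongly the series correlates with itself 12 months earlier. The sketch below uses a synthetic series with a made-up December peak; the pattern values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
index = pd.date_range(start="2019-01-01", periods=48, freq="MS")

# Hypothetical yearly pattern (January to December) with a December peak,
# repeated for four years with a little noise added.
yearly_pattern = np.array([-5, -6, -2, -1, 0, 4, 8, 7, 0, -3, -2, 15], dtype=float)
series = pd.Series(np.tile(yearly_pattern, 4) + rng.normal(scale=1, size=48), index=index)

# Average value per calendar month reveals the shape of the seasonal pattern.
monthly_profile = series.groupby(series.index.month).mean()
peak_month = int(monthly_profile.idxmax())

# Correlation of the series with itself shifted by 12 months:
# a value close to 1 suggests the pattern repeats every year.
lag_12_autocorr = series.autocorr(lag=12)

print("peak month:", peak_month)
print(f"autocorrelation at lag 12: {lag_12_autocorr:.2f}")
```

This works directly here because the synthetic series has no trend. On real data, the trend is usually removed first.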
These may occur due to unforeseen circumstances and are often unavoidable. We just have to learn to live with such errors and uncertainties.
Decomposition of time series
For decomposition, a time series is typically treated as a combination of three components: trend, seasonality, and residual (error). Decomposition is the process of isolating these components.
Consider a time series Y[t]. Say the trend is written as T[t], seasonality is written as S[t], and the residual error is written as e[t].
If the components combine linearly, that is, the size of the seasonal swings stays roughly constant as the level of the series changes, we can use additive decomposition:
Y[t] = T[t] + S[t] + e[t]
On the other hand, if the components are nonlinear (for example, exponential or quadratic), so that the seasonal swings scale with the level of the series, we use multiplicative decomposition:
Y[t] = T[t] * S[t] * e[t]
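The additive case can be sketched by hand with a classical moving-average decomposition. This is a minimal illustration on synthetic data, assuming a monthly series with an even period of 12. The function name `classical_additive_decompose` is hypothetical and this is not necessarily the method used to produce the figures in this post.

```python
import numpy as np
import pandas as pd

def classical_additive_decompose(y: pd.Series, period: int = 12) -> pd.DataFrame:
    """Split a series into trend, seasonal, and residual parts (additive model).

    Assumes an even period, such as 12 for monthly data.
    """
    # Trend T[t]: centered moving average. For an even period, average two
    # adjacent period-length windows and shift the result back to the center.
    trend = y.rolling(window=period).mean().rolling(window=2).mean().shift(-(period // 2))

    # Seasonality S[t]: mean detrended value at each position within the period,
    # centered so that the seasonal effects sum to zero over one period.
    detrended = y - trend
    phase = np.arange(len(y)) % period
    phase_means = detrended.groupby(phase).mean()
    phase_means = phase_means - phase_means.mean()
    seasonal = pd.Series(phase_means.loc[phase].to_numpy(), index=y.index)

    # Residual e[t]: whatever trend and seasonality leave unexplained.
    residual = y - trend - seasonal
    return pd.DataFrame(
        {"observed": y, "trend": trend, "seasonal": seasonal, "residual": residual}
    )

# Synthetic series: Y[t] = T[t] + S[t] + e[t] with known components.
rng = np.random.default_rng(seed=4)
t = np.arange(48)
index = pd.date_range(start="2019-01-01", periods=48, freq="MS")
series = pd.Series(
    100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=48),
    index=index,
)

parts = classical_additive_decompose(series)
valid = parts.dropna()  # the moving average is undefined for the first and last 6 months
print(valid.head())
```

For a multiplicative series with strictly positive values, one common workaround is to apply the same procedure to `np.log(series)`, since taking logarithms turns the product of components into a sum. Libraries such as statsmodels also offer a ready-made `seasonal_decompose` function.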
For example, for our air passengers dataset, the time series can be split as shown below:
Basic pipeline of handling time series data
Here we go over the basic pipeline of handling time series data.
- Data Loading – Time series data comes in many formats. It can be stored as flat files, as well-formed relational tables, or in semi-structured forms. All of this data has to be loaded into the system before any modeling can begin. The data can also be spread across different tables and sources and may require some basic wrangling to get the required fields. It can also be in wide form, in which case we may need to convert it into long form for efficient modeling.
- Data Visualization and analysis – This is the crux of time series analysis. This is the process of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make it easier to identify patterns, trends, and outliers in large data sets. Getting good insights from the data will help businesses make intelligent and informed decisions.
- Data Preprocessing – Real-world data is far from perfect. It is often corrupted by noise in the form of outliers, missing values, or other randomness. As the popular data science saying goes, ‘garbage in, garbage out’. Feeding good, clean data to the models is therefore crucial for good results. This is where preprocessing techniques come into play: they clean the data and make it fit for our forecasting models.
- Modeling – Here, we build our forecasting models. Using the clean time series data, we choose from a wide array of models and use the chosen model to forecast the future values of the time series. There are many classes of models, ranging from classical statistical forecasting models to conventional machine learning models and on to state-of-the-art deep learning models.
- Model Deployment (or, more generally, MLOps) – Here, we try to streamline the process of taking the machine learning models to production and then maintaining and monitoring them. Time series forecasting models often need to be retrained frequently as new, more recent data comes in. Using MLOps, we can manage all of this efficiently and with little to no human involvement.
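The first few steps above can be sketched in a handful of lines. This is a toy illustration under stated assumptions: the wide table, the `store` and `sales` names, and the "mean of the last three months" placeholder model are all hypothetical, and deployment is left out entirely.

```python
import numpy as np
import pandas as pd

# 1. Data loading: a hypothetical wide table, one row per store and one column per month.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "2023-01": [10.0, 20.0],
    "2023-02": [np.nan, 21.0],   # a missing observation for store A
    "2023-03": [12.0, 22.0],
    "2023-04": [13.0, 23.0],
})

# Reshape from wide to long form: one row per (store, month) observation.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
long["month"] = pd.to_datetime(long["month"], format="%Y-%m")

# 2. Visualization and analysis would go here, for example a line plot per store.

# 3. Preprocessing: pick one store, sort chronologically,
#    and fill the gap by linear interpolation.
store_a = long[long["store"] == "A"].set_index("month")["sales"].sort_index()
clean = store_a.interpolate(method="linear")

# 4. Modeling: a deliberately trivial stand-in, the mean of the last three months.
forecast_next_month = clean.iloc[-3:].mean()

print(clean)
print("placeholder forecast for 2023-05:", forecast_next_month)
```

In this toy example the missing February value is filled with 11.0 and the placeholder forecast is 12.0. A real pipeline would replace step 4 with a proper forecasting model and add retraining and monitoring around it.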
We have described the pipeline very briefly here and have skipped all of the technical intricacies.
In future blogs, we aim to go over all these steps in detail and describe how Predactica’s solution makes your life easier by handling all the heavy lifting so that you can focus on using the results and insights to grow your business.
Author: Mohammed Safi Ur Rahman Khan