Time Series

How It Is Tested

concept, procedure, logic, data challenge

e.g. What is a stationary process, and why does it matter?

Type of Analysis

Exploratory Analysis

  • Plot the data to see what it looks like

  • Autocorrelation analysis, examine serial dependence

  • Spectral Analysis

  • Decomposition of time series

Curve Fitting

  • Interpolation (smoothing, regression analysis)

  • Extrapolation

Function Approximation

  • Known Target Function: the form of the function is known, and we approximate it

  • Curve Fitting

Prediction, Forecasting and Classification

Spark-TS as a third party package for large scale data

Signal Detection

Fourier transform, e.g. to denoise audio

Segmentation Analysis

Separating the voices of different speakers in a conversation

Definition

Time Series

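  1. Time series analysis studies the joint distribution of the observations.

  2. X_t is a random variable. Its distribution is generally unknown. x_t is one realization of it, i.e. a single data point.

For example, if we observe a stock price over one year, we have a price for each day. That gives 365 random variables, and we are trying to find their joint distribution, but we hold only 365 data points: one per variable.

A completely unrestricted joint distribution allows every possible combination, and we only ever see one realization of each X, so the problem is very hard. We therefore try to simplify it. Under stationarity, for example, X1...X5 are assumed to share the same distribution, so a realization of X2 can be used to infer the distribution of X1. Instead of studying the joint distribution of X1...X5, we look at the first-order expectation, the second-order expectation, and so on. In other words, we do not need the full distribution; we analyze the problem through the mean and the variance. Often, knowing the first and second moments is enough.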

Mean Function

Covariance Function

Stationary Process (weakly stationary)*****

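Remark 1: a process is strictly stationary if the joint distribution of any finite set of observations does not change when all of them are shifted in time, i.e. it is independent of t.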

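A process is weakly stationary if the mean function is independent of t and the covariance function is independent of t for each lag h (it does not depend on the start or end point, only on the distance between them). So the covariance of x1 and x3 equals the covariance of x2 and x4, and the variance does not depend on time either. Only the first and second moments matter.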
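The formulas for the headings above are not shown in these notes. As a sketch in standard textbook notation (the symbols mu_X and gamma_X are assumptions, not taken from the notes), the mean function, the covariance function, and the two weak-stationarity conditions can be written as:

```latex
\begin{align*}
\mu_X(t) &= \mathbb{E}[X_t] \\
\gamma_X(r, s) &= \operatorname{Cov}(X_r, X_s)
  = \mathbb{E}\big[(X_r - \mu_X(r))\,(X_s - \mu_X(s))\big] \\
\text{weakly stationary:}\quad
\mu_X(t) &= \mu \ \text{ for all } t, \qquad
\gamma_X(t + h,\, t) = \gamma_X(h) \ \text{ for all } t \text{ and each lag } h
\end{align*}
```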

Why do we care about stationarity? To make a prediction, we have to assume that something does not change with time.

Examples

  1. √ i.i.d. noise: with 0 mean and finite variance

  2. √ white noise: uncorrelated random variables, each with zero mean and finite variance. Uncorrelated means the correlation is 0 whenever i≠j, so the covariance depends only on the lag and the definition of weak stationarity is satisfied.

AutoCovariance Function (ACVF)

Auto Correlation Function ACF

The autocovariance divided by the variance, i.e. a standardization: rho(h) = gamma(h) / gamma(0).

Sample Mean

Sample ACVF

Sample ACF
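The three sample statistics above have no formulas in these notes. The sketch below uses the common textbook estimators (dividing by n rather than n - h), which is an assumption about what the notes intended; the function names are illustrative only.

```python
import random


def sample_mean(x):
    """Sample mean: (1/n) * sum of x_t."""
    return sum(x) / len(x)


def sample_acvf(x, h):
    """Sample autocovariance at lag h, using the divide-by-n convention."""
    n = len(x)
    h = abs(h)
    xbar = sample_mean(x)
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n


def sample_acf(x, h):
    """Sample autocorrelation: sample ACVF at lag h divided by sample ACVF at lag 0."""
    return sample_acvf(x, h) / sample_acvf(x, 0)


# Simulated Gaussian white noise: the sample ACF should be close to 0 for h >= 1.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]
acf_values = [sample_acf(noise, h) for h in range(4)]
print(acf_values)
```

At lag 0 the sample ACF is 1 by construction; for white noise the remaining lags should be small.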

PACF Partial Autocorrelation Function

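The PACF is the correlation between residuals. For example, to look at the correlation between October 1 and October 10, we do not compare the two directly, because a lot of intermediate information lies between them. The partial correlation removes the effect of those intermediate observations.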
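One common way to compute the PACF from a given ACF is the Durbin-Levinson recursion. The notes do not name a method, so this is an illustrative sketch; `pacf_from_acf` is a hypothetical helper name.

```python
def pacf_from_acf(rho, max_lag):
    """Durbin-Levinson recursion.

    rho[h] is the autocorrelation at lag h (rho[0] == 1).
    Returns [1.0, pacf(1), ..., pacf(max_lag)].
    """
    pacf = [1.0]
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.6: rho(h) = 0.6 ** h.
rho_ar1 = [0.6 ** h for h in range(5)]
print(pacf_from_acf(rho_ar1, 4))
```

For the AR(1) autocorrelations the PACF equals 0.6 at lag 1 and vanishes at higher lags: once x_{t-1} is accounted for, older values carry no extra linear information.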

General Approach to Time Series Modeling(procedure)

  1. Plot the series and examine the main features of the graph (this is the exploratory step). Check whether there is:
       - a trend
       - a seasonal component
       - a sharp change in behavior
       - any outlying observation

  2. Remove the trend and seasonal components to get stationary residuals. Set the removed components aside for now and check whether the residuals still carry information.
       - Stationary residuals: somewhere in between iid noise and a seasonal series

  3. Choose a model to fit the residuals to make use of various sample statistics including the sample autocorrelation function.

  4. Forecasting is achieved by forecasting the residuals and then inverting the transformation.
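The four steps above can be sketched end to end on a toy series. This is only an illustration under strong assumptions: a linear trend fitted by least squares, an additive seasonal component estimated by per-phase averages, and a residual forecast of 0. All function names are hypothetical.

```python
def fit_line(y):
    """Least-squares fit of y_t = a + b * t for t = 0..n-1; returns (a, b)."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    b = sum((t - t_bar) * (y[t] - y_bar) for t in range(n)) / sxx
    return y_bar - b * t_bar, b


def decompose(y, period):
    """Step 2: remove a linear trend, then a per-phase seasonal mean."""
    a, b = fit_line(y)
    detrended = [y[t] - (a + b * t) for t in range(len(y))]
    seasonal = []
    for k in range(period):
        phase_values = detrended[k::period]
        seasonal.append(sum(phase_values) / len(phase_values))
    residuals = [detrended[t] - seasonal[t % period] for t in range(len(y))]
    return a, b, seasonal, residuals


def forecast(a, b, seasonal, t, residual_forecast=0.0):
    """Step 4: forecast the residual, then add trend and seasonality back."""
    return a + b * t + seasonal[t % len(seasonal)] + residual_forecast


# Toy data: linear trend plus a period-4 pattern, no noise.
pattern = [1.0, -1.0, 2.0, -2.0]
y = [2.0 + 0.5 * t + pattern[t % 4] for t in range(40)]
a, b, seasonal, residuals = decompose(y, period=4)
prediction = forecast(a, b, seasonal, t=40)
print(prediction, 2.0 + 0.5 * 40 + pattern[0])
```

The forecast is close to the true value but not exact, because the trend and seasonal estimates contaminate each other slightly when they are fitted one after the other. Step 3 (fitting a model to the residuals) is reduced here to a constant forecast of 0.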

De-trend

1. Spline Regression (Parametric)

  1. Run a linear regression with time index

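  2. Run a polynomial regression with time index.
       - Weakness 1: the fit is non-local, so a change in y near one point changes the whole polynomial.
       - Weakness 2: a higher-degree polynomial may fit better, but it is overfitting.
       - Remedy for both: fit piecewise. Build k step functions and fit a linear combination of them.

  3. Fit a spline regression, which uses a smooth curve without sharp corners. (As a final refinement, a penalized spline regression can be used.)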

2. Smoothing (Non-Parametric)

These are all nonparametric methods. They are used for trend estimation, not for model building.

  • Moving Average

Assume the trend m_t is linear within each window. In most cases this is reasonable as long as each segment is short enough.
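A minimal sketch of a two-sided moving average, assuming a symmetric window of width 2q + 1 (the window shape is an assumption; the notes do not specify one). A series that is exactly linear passes through unchanged, which is why the local-linearity assumption matters.

```python
def moving_average(x, q):
    """Two-sided moving average of width 2q + 1.

    Returns estimates only for q <= t < n - q, where the full window fits.
    """
    width = 2 * q + 1
    return [sum(x[t - q:t + q + 1]) / width for t in range(q, len(x) - q)]


linear = [3.0 + 2.0 * t for t in range(10)]
trend = moving_average(linear, q=2)
print(trend)
```

The output has 2q fewer points than the input because the window cannot be centered on the first and last q observations.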

  • Exponential Smoothing

Exponential smoothing depends only on past values, so it can be used for forecasting. The best value of alpha takes subjective judgement and trial and error.
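A small sketch of simple exponential smoothing, assuming the usual recursion m_t = alpha * x_t + (1 - alpha) * m_{t-1} started at m_0 = x_0. Scoring each candidate alpha by its one-step-ahead squared error is one possible way to "try" values; it is an illustrative choice, not something prescribed by the notes.

```python
def exponential_smoothing(x, alpha):
    """Return the smoothed series m_t = alpha * x_t + (1 - alpha) * m_{t-1}."""
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def one_step_sse(x, alpha):
    """Sum of squared errors when m_t is used as the forecast of x_{t+1}."""
    smoothed = exponential_smoothing(x, alpha)
    return sum((x[t + 1] - smoothed[t]) ** 2 for t in range(len(x) - 1))


series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
best_alpha = min(candidates, key=lambda a: one_step_sse(series, a))
print(best_alpha)
```

Because each smoothed value uses only current and past observations, the last smoothed value can serve directly as the forecast for the next period.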

3. Differencing

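Less parametric: it does not assume that the same trend holds over the whole observation period.

Here B is the backshift ("back") operator, B x_t = x_{t-1}, and delta = 1 - B, so delta x_t = x_t - x_{t-1}. This shorthand is convenient.

Instead of studying x_t we study delta x_t, because differencing removes a linear trend. A nonlinear trend may be negligible over short enough steps. Alternatively, differencing a second time removes a quadratic term as well, and repeated differencing removes polynomial terms of any degree.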
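A short sketch of the differencing operator on two noise-free toy series (the series are made up for illustration): one difference turns a linear trend into a constant, and two differences do the same for a quadratic trend.

```python
def difference(x, lag=1):
    """Apply (1 - B^lag): return x_t - x_{t-lag} for t >= lag."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


linear = [5 + 3 * t for t in range(8)]
quadratic = [1 + 2 * t + t * t for t in range(8)]

print(difference(linear))                 # constant: the slope
print(difference(difference(quadratic)))  # constant after two differences
```

Each round of differencing shortens the series by one observation.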

De-Seasonality

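This is also done by differencing, but not against the adjacent value. Instead we difference at the seasonal period: peak minus peak, trough minus trough.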
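A minimal sketch of seasonal differencing, assuming the period d is known. The toy series (a repeating period-4 pattern plus a linear trend) is invented for illustration.

```python
def seasonal_difference(x, period):
    """Apply (1 - B^period): return x_t - x_{t-period} for t >= period."""
    return [x[t] - x[t - period] for t in range(period, len(x))]


pattern = [10, -5, 3, -8]
series = [pattern[t % 4] + 2 * t for t in range(12)]
print(seasonal_difference(series, period=4))
```

The seasonal pattern cancels exactly, and the linear trend is reduced to a constant equal to the slope times the period.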

Alternatively, fit sine or cosine terms to the series, or estimate the seasonal component and take it out directly.

Beyond Trend and Seasonality

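Beyond pure time series analysis, we can also regress on other features, because there is an extra term Y: it could be a competitor's price, the stock market, the temperature, and so on.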

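One can also use an RNN to explain the time-dependent part first, and then analyze the remaining stationary part with a time series model.

How to Tell Whether a Time Series Is Stationary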

Method 1: Visualization

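Method 2: Split and Compute

Split the time series into several segments, compute the mean, variance, and autocorrelation on each, and compare them. Similar values do not prove stationarity (this is not a sufficient condition), but it is a quick check.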
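A rough sketch of the split-and-compare check on simulated data. The segment count and the choice of lag-1 autocorrelation as the third summary are assumptions made for illustration.

```python
import random


def summarize(segment):
    """Return (mean, variance, lag-1 autocorrelation) of one segment."""
    n = len(segment)
    mean = sum(segment) / n
    var = sum((v - mean) ** 2 for v in segment) / n
    cov1 = sum((segment[t + 1] - mean) * (segment[t] - mean) for t in range(n - 1)) / n
    return mean, var, cov1 / var


def split_summary(x, parts):
    """Cut x into equal consecutive segments and summarize each one."""
    size = len(x) // parts
    return [summarize(x[i * size:(i + 1) * size]) for i in range(parts)]


random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(600)]
trended = [0.01 * t + noise[t] for t in range(600)]

for name, data in [("noise", noise), ("trended", trended)]:
    print(name, [round(m, 2) for m, _, _ in split_summary(data, 3)])
```

The segment means of the white-noise series stay close together, while those of the trended series drift upward, which flags the trend as a violation of a constant mean.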

Method 3: Hypothesis Testing

ADF Test (Augmented Dickey-Fuller Test)

H0: the series is non-stationary (it has a unit root). When p is small we reject the null hypothesis and conclude that the series is stationary.

KPSS Test

H0: the series is stationary. When p is small we reject the null hypothesis and conclude that the series is non-stationary.

If the series turns out to be not only stationary but also white noise (uncorrelated), the analysis ends here: there is nothing left to model.

Role of ACF

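What role does the autocorrelation function actually play?

Suppose a time series has mean c. With no other information, the best predictor of X (in the mean squared error sense) is c.

In short, the ACF plays an explicit and important role in prediction.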
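The formula that makes this role explicit is not shown in these notes. As a sketch of the standard result for a stationary series, writing mu for the mean (c in the text above) and predicting X_{n+h} linearly from X_n alone:

```latex
\begin{align*}
\text{minimize over } a, b:\quad & \mathbb{E}\big[(X_{n+h} - a X_n - b)^2\big] \\
\Rightarrow\quad & a = \rho(h), \qquad b = \mu\,\big(1 - \rho(h)\big) \\
\hat{X}_{n+h} &= \mu + \rho(h)\,(X_n - \mu) \\
\mathbb{E}\big[(X_{n+h} - \hat{X}_{n+h})^2\big] &= \gamma(0)\,\big(1 - \rho(h)^2\big)
\end{align*}
```

When rho(h) = 0 the predictor falls back to the mean, and the closer |rho(h)| is to 1, the smaller the prediction error.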

Interviews test the formulas; PhD exams test the derivations.

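The approach so far is non-parametric. With a model, the estimates become more stable, because pooling the data into a few parameters lowers the variance.

The original route: compute the ACF from the data, then compute the best predictor. This can mean estimating a very large number of parameters.

With a (correct) model: first estimate the model coefficients from the data, for example theta1 in an MA model. Then compute rho, since rho is a function of theta1, and from there go back to the ACF.

There is a tradeoff here. A model introduces bias but lowers the variance. Without a model there is no bias, but the variance is larger.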

Classical Time Series Model

First-order Autoregression ( AR(1) Process)

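When |phi| < 1, the AR(1) process constructed this way is stationary. (A stationary solution also exists when |phi| > 1, but it is not causal: it depends on future noise.)

Initial condition: x0 = z0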
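A simulation sketch, assuming the usual AR(1) form X_t = phi * X_{t-1} + Z_t with standard normal noise (the defining equation is not shown in these notes). For |phi| < 1 the theoretical ACF is rho(h) = phi^h, and the sample ACF of a long simulated path should be close to it.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate X_t = phi * X_{t-1} + Z_t with x_0 = z_0 and Z_t ~ N(0, 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def sample_acf(x, h):
    """Sample autocorrelation at lag h (divide-by-n convention)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    ch = sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n
    return ch / c0


phi = 0.7
series = simulate_ar1(phi, n=5000)
for h in range(1, 4):
    # Columns: lag, sample ACF, theoretical ACF phi ** h
    print(h, round(sample_acf(series, h), 3), round(phi ** h, 3))
```

Starting from x0 = z0 means the first few values have a slightly smaller variance than the stationary one, but with |phi| < 1 that effect dies out quickly.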

AR(p) AutoRegressive process of order P

Note: the bounds drawn on the correlogram are ±1.96/sqrt(n), the approximate 95% limits for a white-noise series of length n.

First-order moving average or MA(1) Process

If we plot the correlation function, the horizontal axis is the lag h from the formula above and the vertical axis is the correlation.
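The MA(1) formula itself is not shown in these notes. As a sketch in the standard parameterization (an assumption), with Z_t white noise of variance sigma^2:

```latex
\begin{align*}
X_t &= Z_t + \theta Z_{t-1}, \qquad Z_t \sim \mathrm{WN}(0, \sigma^2) \\
\gamma(0) &= \sigma^2 (1 + \theta^2), \qquad
\gamma(\pm 1) = \sigma^2 \theta, \qquad
\gamma(h) = 0 \ \text{ for } |h| > 1 \\
\rho(1) &= \frac{\theta}{1 + \theta^2}, \qquad
\rho(h) = 0 \ \text{ for } |h| > 1
\end{align*}
```

So the ACF of an MA(1) process cuts off after lag 1, and rho(1) is a function of theta alone.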

MA(q) Moving average with order q

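Note that here the PACF at lag 2 is no longer 0. The intuition is that an invertible MA process can be rewritten as an AR process of infinite order, so its PACF tails off rather than cutting off.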

ARMA(p,q) Process

Here the orders can no longer be read directly off the plots.

ARIMA(p,d,q) process

Rule 1: If a series has positive autocorrelations out to a high number of lags, then it probably needs a higher order of differencing.

Rule 2: If the lag-1 autocorrelation is zero or negative, or the autocorrelations are all small and patternless, then the series does not need a higher order of differencing. If the lag-1 autocorrelation is -0.5 or more negative, the series may be over-differenced.

These are all rules of thumb. In practice, judge case by case.

Model Estimation

There are a great many estimation methods...

  • Preliminary Estimation

  • Yule-Walker Estimation

  • Burg's Algorithm

  • The Innovation Algorithm

  • The Hannan-Rissanen Algorithm

  • Maximum Likelihood Estimation

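Q: Can OLS be used to estimate the coefficient of an AR(1) model?

The classical least-squares/regression assumptions are not satisfied here: regression assumes that the observations are independent of each other, whereas in AR(1) the regressor is a lagged value of the response. In practice the least-squares estimate is still often computed (conditional least squares). For a stationary AR(1) it is consistent, but the usual finite-sample OLS guarantees do not hold.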
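Even though the independence assumption fails, the least-squares formula can still be evaluated. The sketch below regresses x_t on x_{t-1} without an intercept on a long simulated stationary AR(1) path; the model form X_t = phi * X_{t-1} + Z_t and the helper names are assumptions for illustration.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate X_t = phi * X_{t-1} + Z_t with x_0 = z_0 and Z_t ~ N(0, 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def ols_ar1(x):
    """Least-squares slope of x_t on x_{t-1}, no intercept (zero-mean series)."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


series = simulate_ar1(0.6, n=5000, seed=42)
phi_hat = ols_ar1(series)
print(round(phi_hat, 3))
```

On a long series the estimate lands near the true coefficient. On short series it is biased, which is one reason the other estimators in the list above exist.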

Model Diagnostics

  • Check the significance of the coefficients

  • Check the ACF of residuals. (should be non-significant)

  • Check the Box-Pierce or Ljung-Box test for possible residual autocorrelation at various lags

  • Check whether the variance is non-constant (if so, use an ARCH or GARCH model)
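As a sketch of the residual autocorrelation check, the Ljung-Box statistic is Q = n(n + 2) * sum over k = 1..h of rho_hat(k)^2 / (n - k). Under the white-noise hypothesis Q is approximately chi-square with h degrees of freedom (reduced by the number of fitted ARMA parameters when applied to model residuals). The simulated series and the hard-coded critical value are illustrative assumptions.

```python
import random


def sample_acf(x, h):
    """Sample autocorrelation at lag h (divide-by-n convention)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    ch = sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n
    return ch / c0


def ljung_box(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(
        sample_acf(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


rng = random.Random(7)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
ar = [noise[0]]
for z in noise[1:]:
    ar.append(0.7 * ar[-1] + z)  # strongly autocorrelated series

CHI2_95_DF10 = 18.307  # 95% quantile of the chi-square distribution, 10 d.o.f.
q_noise = ljung_box(noise, 10)
q_ar = ljung_box(ar, 10)
print(q_noise, q_ar, CHI2_95_DF10)
```

A Q value well above the critical value, as for the autocorrelated series, indicates that the residuals still contain structure the model has not captured.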

More than 1 Model looks OK

  • Choose the one with fewest parameters

  • Pick lowest standard error

  • AIC, BIC, MSE to compare models
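For reference, the two information criteria in their standard definitions (the symbols are assumptions: k is the number of estimated parameters, n the number of observations, and \hat{L} the maximized likelihood). Lower is better for both:

```latex
\begin{align*}
\mathrm{AIC} &= 2k - 2 \ln \hat{L} \\
\mathrm{BIC} &= k \ln n - 2 \ln \hat{L}
\end{align*}
```

BIC penalizes each extra parameter by ln n instead of 2, so for all but very small samples it favors smaller models than AIC does.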

Self-study list:

ARIMA

Moving Average

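Other

This article series, read from start to finish

Distance measures for time series

Statistical features, entropy features, and segment features of time series

One paper on time series clustering: "Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection". Another on time series anomaly detection, focused on detecting upward or downward level shifts in a series: "Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection".

Time series clustering

Trend analysis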
