Exploration on statistical and time-series models. Jonathan Samama, Jonathan Horyn, Julien Bect, Emmanuel Vazquez (SUPELEC); Cécile Germain-Renaud (LRI)
The problem: Statistical characterization and models of job arrival and components load Here component = CE
Data from the RTM: more than 18M jobs, 20 GB. The first 10 CEs account for 31% of the total jobs. Figure: top 30 CEs.
Data from the RTM: 1st CE: 626K jobs (ce03-lcg.cr.cnaf.infn.it); 2nd CE: 579K jobs (lcgce01.gridpp.rl.ac.uk); 4th CE: 384K jobs (ce101.cern.ch); 33rd CE: 107K jobs (ce2.egee.cesga.es).
Examples: CE no. 3 (lcgce01.gridpp.rl.ac.uk) (percentage retained: 92%); CE no. 97 (ramses.dsic.upv.es) (percentage retained: 59%). The histograms are truncated at 2 minutes. Nominal behaviour is an open question: exponential is inadequate. Extremal behaviour is easier.
Inter-arrival time: QQ plot against exponential. Definitely not exponential. Concave: heavy tailed.
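The QQ comparison against an exponential can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function name `exponential_qq`, the plotting positions `(i + 0.5) / n` and the synthetic sample are all assumptions made for the example.

```python
import math
import random

def exponential_qq(sample):
    """Return (theoretical, empirical) quantiles against an exponential
    distribution that has the same mean as the sample."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    # Plotting positions (i + 0.5) / n avoid the infinite quantile at p = 1.
    theo = [-mean * math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    return theo, xs

# Synthetic inter-arrival times (hypothetical data, mean about 100 s).
random.seed(0)
iat = [random.expovariate(0.01) for _ in range(5000)]
theo, emp = exponential_qq(iat)
```

For a truly exponential sample the points `(theo[i], emp[i])` lie close to the diagonal; a systematic departure in the upper quantiles is the kind of signal the slide reads as a heavy tail.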
Distribution tails: restrict to values larger than a threshold u (too low: not in the tail; too high: unreliable parameter estimation). f(x) = P(X > u + x | X > u), H = density of f. Theoretical answer: generalized Pareto distribution, $H(x) = \frac{1}{\sigma}\left(1 + \xi x / \sigma\right)^{-1/\xi - 1}$ if $\xi \neq 0$, else $H(x) = \frac{1}{\sigma} e^{-x/\sigma}$, with scale $\sigma > 0$. $\xi = 0$: exponential; $\xi > 0$: heavy tailed. If $\xi > 1/k$, the k-th order moment does not exist.
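As a hedged illustration of the generalized Pareto tail, the sketch below evaluates the conditional survival function P(X > u + x | X > u) in its standard form, with a shape parameter (usually written ξ) and a scale parameter σ. The function names are hypothetical and the parameter values in the usage note are only examples.

```python
import math

def gpd_survival(x, xi, sigma):
    """P(X > u + x | X > u) under a generalized Pareto tail, for x >= 0."""
    if xi == 0.0:
        return math.exp(-x / sigma)      # exponential special case
    base = 1.0 + xi * x / sigma
    if base <= 0.0:                      # past the finite endpoint (xi < 0 only)
        return 0.0
    return base ** (-1.0 / xi)

def moment_exists(k, xi):
    """The k-th moment of a generalized Pareto law is finite iff xi < 1/k."""
    return xi < 1.0 / k
```

With a shape of 0.45 both the mean and the variance are finite, while a shape of 0.51 still has a finite mean but no finite variance (`moment_exists(2, 0.51)` is `False`).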
Threshold identification: for a proper $u_0$, the conditional expectation is linear in u: $\beta(u) = \sigma + \xi (u - \mu)$. Mean Excess Plot (MEP) method: compute the empirical expectation; identify a linear area in the graph; to be confirmed with a constant $\xi$.
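The empirical quantity behind a mean excess plot can be sketched as below: for each candidate threshold, average the excesses of the observations above it. The function name and the toy data are assumptions for illustration only.

```python
def mean_excess(sample, thresholds):
    """Empirical mean excess: for each threshold u, the mean of (x - u)
    over the observations x that exceed u."""
    points = []
    for u in thresholds:
        excesses = [x - u for x in sample if x > u]
        if excesses:                     # skip thresholds with no exceedance
            points.append((u, sum(excesses) / len(excesses)))
    return points

# Toy sample, only to show the shape of the output.
points = mean_excess([1, 2, 3, 4, 5], [0, 2, 4, 10])
```

Plotting the second coordinate against the first and looking for a range where the curve is roughly a straight line is the visual step described on the slide; the threshold 10 is dropped here because no observation exceeds it.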
Fitting a Pareto distribution (IAT): the estimate of $\xi$ should be constant. Panels: $u_0$ = 270 s, $u_0$ = 1500 s, $u_0$ = 1100 s, $u_0$ = 600 s (?).
Pareto fit for IAT: $\xi = 0.51$, $\xi = 0.45$, $\xi = 0.55$, $\xi = 0.68$ (?).
Pareto fit for IAT (QQ plots, X quantiles vs. Y quantiles, for $\xi$ = 0.51, 0.45, 0.55 and 0.68 (?)): the heavy-tail hypothesis stands; small parameter range.
Load tails: not so consistent behaviors. Classification panels: 20th percentile, 60th percentile, 90th percentile.
Are statistics relevant? Arrival process intensity: inverse of the average IAT; averaging range: day, week. Stationarity: the intensity (and the average) do not depend on the date; Poisson processes (exponential IAT) are stationary.
Stationarity: "portmanteau" whiteness test. Statistics: Box-Pierce (not implemented): Dufour-Roy [1985] rank statistics. At the day scale: always rejected on active CEs, not always on less active CEs.
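As a reminder of what a portmanteau statistic computes, here is a minimal sketch of the classical Box-Pierce statistic, Q = n times the sum of the squared sample autocorrelations up to a maximum lag. It is not the test used in the study (the rank-based Dufour-Roy variant is not shown), and the function names and example series are assumptions.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def box_pierce(series, max_lag):
    """Q = n * sum of squared autocorrelations for lags 1..max_lag.
    Under whiteness, Q is approximately chi-square with max_lag degrees
    of freedom, so large values lead to rejection."""
    n = len(series)
    return n * sum(autocorrelation(series, k) ** 2
                   for k in range(1, max_lag + 1))

rng = random.Random(1)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # synthetic white noise
q_white = box_pierce(white, 10)
q_trend = box_pierce(list(range(200)), 10)           # strongly correlated ramp
```

On the synthetic white noise the statistic stays of the order of the number of lags, while the ramp gives a value far in the rejection region.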
Whiteness tests: CE no. 3, p-value of the whiteness test.
Bursts: the goal is to exhibit a stationary process. For a given threshold, a burst is a set of jobs with inter-arrival time smaller than the threshold. When the threshold decreases, the burst size (in number of jobs) decreases; when the threshold increases, the burst duration (in seconds) increases. Example with threshold = 10 s: a first burst of 6 jobs, then an interval of more than 10 s, then a second burst of 7 jobs. Size and duration of the burst should be increasing functions of the threshold.
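The burst definition can be sketched directly from arrival times. The function name `split_bursts` and the arrival times below are hypothetical; the example only mirrors the 6-job / 7-job configuration described on the slide.

```python
def split_bursts(arrivals, threshold):
    """Group sorted arrival times into bursts: consecutive jobs stay in the
    same burst while their inter-arrival time is smaller than the threshold."""
    if not arrivals:
        return []
    bursts = []
    current = [arrivals[0]]
    for prev, t in zip(arrivals, arrivals[1:]):
        if t - prev < threshold:
            current.append(t)
        else:                            # gap too long: close the burst
            bursts.append(current)
            current = [t]
    bursts.append(current)
    return bursts

# Six jobs 2 s apart, a 15 s gap, then seven jobs 2 s apart (made-up times).
arrivals = [0, 2, 4, 6, 8, 10, 25, 27, 29, 31, 33, 35, 37]
bursts = split_bursts(arrivals, threshold=10)
sizes = [len(b) for b in bursts]             # number of jobs per burst
durations = [b[-1] - b[0] for b in bursts]   # seconds per burst
```

Re-running `split_bursts` over a range of thresholds and averaging `sizes` and `durations` gives the size-versus-threshold and duration-versus-threshold curves.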
Burst behavior: Poisson process. Simulated Poisson process (intensity $10^{-2}$): mean burst size and duration vs. threshold (semi-log plots).
Burst behavior: CE IAT. CE no. 3 (lcgce01.gridpp.rl.ac.uk): mean burst size and duration vs. threshold (semi-log plots).
Burst behavior: CE IAT. CE no. 97 (ramses.dsic.upv.es): mean burst size and duration vs. threshold (semi-log plots).
Burst intensity: reciprocal of the mean IAT of the bursts. At the week scale, the independence hypothesis of the IAT is not systematically rejected, even on the most active CEs.
Whiteness test for the burst IAT: p-value of the whiteness test for CE no. 3, CE no. 6, CE no. 13 and CE no. 97.
Stalactite diagrams. How to read a stalactite diagram: X-axis: time in days; Y-axis: threshold in minutes; color: mean burst size (note: color is normalized on each row). On a single row, clear areas indicate smaller-sized bursts while darker areas stand for bursts gathering more jobs. Dark vertical areas reveal bursts left undivided by progressive threshold reduction. Interpretation: the more the threshold is reduced, the more jobs are dispatched between shorter bursts, EXCEPT for some “stalactites”. Adequate tool: wavelets.
Stalactite diagrams: CE no. 6 (ce101.cern.ch) and simulated Poisson process (intensity $10^{-2}$).
Forecasting the load: Load(t) = sum of the execution times of the jobs queued at time t. Sampling frequency: 30 minutes. Only known information may be used: for a job arriving at time t whose execution ends at t+n, Load(t) is available only at t+n. If t0 is the present date and t1 the last date where the load was known, t0 - t1 is typically of the order of a few days in active periods. Thus we must extrapolate the load with a horizon of a few days.
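A possible reading of the load definition is sketched below. The `Job` record layout is hypothetical (it is not the RTM schema), and the sketch assumes that a job counts as queued from its arrival until its execution starts; both points are assumptions rather than facts from the slides.

```python
from typing import NamedTuple

class Job(NamedTuple):       # hypothetical record layout
    arrival: float           # submission time (s)
    start: float             # time the job leaves the queue (s)
    exec_time: float         # execution time (s), known once the job ends

def load_at(jobs, t):
    """Sum of the execution times of the jobs queued at time t."""
    return sum(j.exec_time for j in jobs if j.arrival <= t < j.start)

def known_at(jobs, t, now):
    """True if every job queued at t has finished by `now`, i.e. if
    Load(t) can be computed from information available at `now`."""
    return all(j.start + j.exec_time <= now
               for j in jobs if j.arrival <= t < j.start)

def sample_load(jobs, t_start, t_end, step=1800.0):
    """Sample the load every `step` seconds (30 minutes by default)."""
    samples = []
    t = t_start
    while t <= t_end:
        samples.append((t, load_at(jobs, t)))
        t += step
    return samples

jobs = [Job(0, 100, 50), Job(10, 200, 1000), Job(150, 160, 5)]  # made-up jobs
series = sample_load(jobs, 0.0, 3600.0)
```

In this toy example `load_at(jobs, 50)` is 1050, but it only becomes known once the second job has finished at time 1200, which is the delay the slide describes.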
Forecasting the load: simple methods. Two naive prediction strategies: (1) linear prediction from the load history; (2) the mean of the past execution times × the number of jobs in the queue. From the load history: the horizon is a few days. From the past executions: the correlation of the series of averaged execution times decreases very fast; the horizon for a linear prediction of the execution times is one day at best.
Forecasting the load: simple methods are defeated
A local approach: the load process is probably very non-stationary. Analysis on time windows where the inactive period is smaller than a few hours. In an integrated study, a window is a burst of load. Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), profile of the load averaged over 4 h.
ARCH models (I): autoregressive conditional heteroskedasticity (Engle, 1982). Widely used in finance modeling. Fat tails. Time-varying volatility clustering: changes of the same magnitude tend to follow each other. Leverage effects: volatility negatively correlated with the magnitude of the change. The one-step-ahead forecast errors are zero-mean random disturbances, uncorrelated from one period to the next, but not independent.
Log returns: if $X_n$ is the load at time n, the log return is $Y_n = \log(X_n / X_{n-1})$. $Y_n$ measures the variation of the load. ARCH: low correlation of $Y_n$, strong correlation of $Y_n^2$. Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), profile of the log returns of the load averaged over 30 min.
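A minimal sketch of the log-return computation follows, assuming the usual definition (logarithm of the ratio of consecutive load values) and a strictly positive load, which is one reason to work inside activity windows. The helper names and the toy load series are assumptions.

```python
import math

def log_returns(load):
    """Log returns log(X_n / X_{n-1}) of a strictly positive series."""
    return [math.log(b / a) for a, b in zip(load, load[1:])]

def autocorr(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

load = [120.0, 150.0, 90.0, 300.0, 280.0, 100.0, 110.0]   # made-up load values
y = log_returns(load)
y_squared = [v * v for v in y]
r1_y = autocorr(y, 1)                  # correlation of the log returns
r1_y2 = autocorr(y_squared, 1)         # correlation of their squares
```

Comparing `autocorr(y, k)` with `autocorr(y_squared, k)` over several lags is the diagnostic the slide refers to: weak correlation in the first and strong correlation in the second points towards an ARCH-type model.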
Load characteristics (IV): correlation is present in the series of the log returns. Prior AR filtering of the data; choice of the filter order: ~28. The residuals must satisfy the properties of the GARCH model. Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), profile of the log returns of the load averaged over 30 min, with confidence intervals (blue).
Load characteristics (V): no correlation in the series of the log returns; correlation is present in the series of the squared log returns; heteroskedasticity is preserved. Figures: profile of the log returns of the load averaged over 30 min; profile of the squared log returns of the load averaged over 30 min.
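ARCH models: definition. The time series $Z_n$ is ARCH(p) iff $Z_n = \sigma_n \varepsilon_n$ with $\sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2$, where $\varepsilon_n$ is an i.i.d. zero-mean, unit-variance noise. Non-stationary white noise: $\sigma_n$ depends on the past of the series. Analogy: $|Z_n|$ ~ speed, i.e. $\Delta$(load); $\sigma_n$ ~ acceleration, i.e. $\Delta$(log-return).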
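To make the definition concrete, the sketch below simulates a series under the standard ARCH(p) recursion (Engle, 1982): each value is a Gaussian noise scaled by a standard deviation whose square is a constant plus a weighted sum of the p previous squared values. The function name, the zero pre-sample values and the coefficients are assumptions chosen for illustration.

```python
import math
import random

def simulate_arch(alpha0, alphas, n, seed=0):
    """Simulate n values of an ARCH(p) series with Gaussian noise, where
    p = len(alphas) and the conditional variance at each step is
    alpha0 + sum_i alphas[i-1] * (value i steps back) ** 2."""
    rng = random.Random(seed)
    p = len(alphas)
    z = [0.0] * p                        # zero pre-sample values (assumption)
    for _ in range(n):
        var = alpha0 + sum(a * z[-i] ** 2
                           for i, a in enumerate(alphas, start=1))
        z.append(math.sqrt(var) * rng.gauss(0.0, 1.0))
    return z[p:]

z = simulate_arch(1.0, [0.5], 1000)            # ARCH(1), made-up coefficients
flat = simulate_arch(4.0, [], 5000, seed=1)    # no ARCH term: constant variance
flat_variance = sum(x * x for x in flat) / len(flat)
```

With an empty coefficient list the series reduces to plain Gaussian noise of variance `alpha0`; with positive coefficients, large values tend to be followed by large values of either sign, which is the volatility clustering the model is meant to capture.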
ARCH model: estimation. Parameter estimation for a given order and noise distribution (Gaussian or Student). Usual tests on the normalized residuals: block variance: Bartlett; distribution identity: Kolmogorov-Smirnov; normality: Shapiro-Francia, Lilliefors. p in the range 1-20: limit for convergent estimation.
ARCH models: experiment. Gaussian noise, ARCH(5): inadequate, rejected by all tests; high kurtosis, hence Student's distribution. Student noise, ARCH(5): better, but still rejected by all tests -> GARCH model.
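GARCH models: definition. ARCH model supplemented with an AR part on the variance: the time series $Z_n$ is GARCH(p, q) iff $Z_n = \sigma_n \varepsilon_n$ with $\sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{n-j}^2$. Empirical order selection: p, q < 5.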
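The variance recursion of a GARCH model can be sketched as follows, under the standard formulation: the conditional variance is a constant, plus a weighted sum of past squared values (the ARCH part), plus a weighted sum of past conditional variances (the autoregressive part). Which of the two orders is called p and which q varies between references, so the argument names `alphas` and `betas` are used instead; the start-up value (sample second moment) and the coefficients are assumptions.

```python
def garch_variance(z, alpha0, alphas, betas):
    """Conditional variances of a GARCH model for the observed series z:
    var[n] = alpha0 + sum_i alphas[i-1] * z[n-i]**2
                    + sum_j betas[j-1] * var[n-j],
    with missing pre-sample terms replaced by the sample second moment."""
    init = sum(x * x for x in z) / len(z)
    sigma2 = []
    for n in range(len(z)):
        var = alpha0
        for i, a in enumerate(alphas, start=1):
            var += a * (z[n - i] ** 2 if n - i >= 0 else init)
        for j, b in enumerate(betas, start=1):
            var += b * (sigma2[n - j] if n - j >= 0 else init)
        sigma2.append(var)
    return sigma2

z = [1.0, -2.0, 0.5]                                   # made-up residual series
sigma2 = garch_variance(z, 0.1, [0.2], [0.3])
normalized = [x / s ** 0.5 for x, s in zip(z, sigma2)] # normalized residuals
```

The normalized residuals are the quantities on which whiteness, variance and distribution tests are usually run to validate a fitted model; with an empty `betas` list the recursion reduces to the ARCH case.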
GARCH model: GARCH(1,3) with Student's distribution is validated.
Summary of results: only per CE (IAT and load). IAT: heavy tailed, consistent Pareto distributions; limits on statistics: non-stationary process; bursts might be stationary. Load: simple predictors don't work; might be heteroskedastic: only the variance could be predicted, BUT only inside activity windows.
Conclusion and future work: multi-scale phenomenon. The CE model largely remains to be elucidated. Models for the overall system, the VOs and the users have not yet been touched. Data extraction, analysis and results must be automated and organized.
Conclusions and research directions: essentially discovery and simple analyses of the data. Inter-arrival times: modeling leads exist if the scale is chosen appropriately; extreme values: of interest for failure diagnosis, etc. Load: preliminary study with classical time-series tools; possible use of APGARCH models; no prediction result (in this study).