Exploration on statistical and time-series models

Jonathan Samama, Jonathan Horyn; Julien Bect, Emmanuel Vazquez (SUPELEC); Cécile Germain-Renaud (LRI)

The problem: statistical characterization and models of job arrival and component load
Here, component = CE (Computing Element)

Data from the RTM
More than 18M jobs, 20 GB
The first 10 CEs account for 31% of total jobs
(Figure: top 30 CEs)

Data from the RTM
1st CE: 626K jobs, ce03-lcg.cr.cnaf.infn.it
2nd CE: 579K jobs, lcgce01.gridpp.rl.ac.uk
4th CE: 384K jobs, ce101.cern.ch
33rd CE: 107K jobs, ce2.egee.cesga.es

Examples
CE n°3 (lcgce01.gridpp.rl.ac.uk), percentage kept: 92%
CE n°97 (ramses.dsic.upv.es), percentage kept: 59%
The histograms are truncated at 2 minutes
Nominal behaviour is an open question; the exponential model is inadequate
Extremal behaviour is easier

Inter-arrival time (IAT): QQ plot against exponential
Definitely not exponential
Concave: heavy tailed
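The comparison behind such a QQ plot can be sketched as follows. This is only an illustration, not the authors' code: the function name `exponential_qq`, the plotting positions (i - 0.5)/n and the synthetic heavy-tailed sample are all assumptions. The reference exponential distribution is given the same mean as the sample.

```python
import math
import random

def exponential_qq(sample):
    """Return (theoretical, empirical) quantile pairs against an exponential
    distribution that has the same mean as the sample."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n                        # plotting position (assumption)
        theoretical = -mean * math.log(1.0 - p)  # exponential quantile function
        pairs.append((theoretical, x))
    return pairs

# Synthetic heavy-tailed inter-arrival times (hypothetical data, not RTM data).
rng = random.Random(0)
heavy = [2.0 * ((1.0 - rng.random()) ** -0.5 - 1.0) for _ in range(5000)]
pairs = exponential_qq(heavy)
```

For a heavy-tailed sample, the largest empirical quantiles lie far beyond the exponential ones, which is what bends the plot away from the diagonal.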

Distribution tails
Tail: restrict to values larger than a threshold u
Too low: not in the tail. Too high: unreliable parameter estimation
$f(x) = P(X > u + x \mid X > u)$, H = density of f
Theoretical answer: generalized Pareto distribution
$H(x) = \frac{1}{\sigma}\left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi - 1}$ if $\xi \neq 0$, else $H(x) = \frac{1}{\sigma} e^{-x/\sigma}$, with $\sigma > 0$
$\xi = 0$: exponential; $\xi > 0$: heavy tailed
If $\xi > 1/k$, the k-th order moment does not exist

Threshold identification
For a proper $u_0$, the conditional expectation is linear in u: $\beta(u) = \sigma + \xi (u - \mu)$
Mean Excess Plot (MEP) method: plot the empirical conditional expectation against u and identify a linear area in the graph
To be confirmed by a constant estimate of $\xi$
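The empirical quantity plotted in a mean excess plot can be sketched as below. The function name `mean_excess` and the toy data are hypothetical. For a generalized Pareto tail with $\xi < 1$ the mean excess equals $\beta(u) / (1 - \xi)$, so it is linear in u, which is why one looks for a straight section of the graph.

```python
def mean_excess(sample, thresholds):
    """Empirical mean excess e(u) = mean(X - u | X > u) for each threshold u.
    Thresholds with no observation above them are skipped."""
    result = []
    for u in thresholds:
        excesses = [x - u for x in sample if x > u]
        if excesses:
            result.append((u, sum(excesses) / len(excesses)))
    return result

# Toy data: the mean excess decreases here, as expected for a bounded sample.
points = mean_excess([1, 2, 3, 4, 5], [0, 2, 4, 5])
```

Plotting the second coordinate against the first gives the MEP; the few largest thresholds rely on very few points and should be read with caution.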

Fitting a Pareto distribution (IAT)
The estimate of $\xi$ should be constant
(Figure panels: $u_0$ = 270 s, $u_0$ = 1500 s, $u_0$ = 1100 s, $u_0$ = 600 s (?))

Pareto fit for IAT
The heavy-tail hypothesis stands
Small parameter range
(Figure: four QQ plots, X quantiles vs. Y quantiles, with $\xi$ = 0.51, 0.45, 0.55 and 0.68 (?))
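The slides do not say which estimator produced these values of $\xi$. As one possible illustration, the sketch below fits a generalized Pareto distribution to the excesses over a threshold with probability-weighted moments, a swapped-in technique chosen because it needs no numerical optimizer. The function names, the plotting positions (i - 0.35)/n and the synthetic data are assumptions. The estimator requires $\xi < 1$ and becomes noisy as $\xi$ approaches 0.5 and beyond.

```python
import random

def fit_gpd_pwm(excesses):
    """Probability-weighted-moment estimates (xi, sigma) of a generalized
    Pareto distribution fitted to excesses over a threshold."""
    xs = sorted(excesses)
    n = len(xs)
    a0 = sum(xs) / n                                  # estimates E[X]
    # a1 estimates E[X * (1 - F(X))] with plotting positions (i - 0.35) / n
    a1 = sum((1.0 - (i - 0.35) / n) * x for i, x in enumerate(xs, start=1)) / n
    xi = 2.0 - a0 / (a0 - 2.0 * a1)
    sigma = 2.0 * a0 * a1 / (a0 - 2.0 * a1)
    return xi, sigma

def sample_gpd(xi, sigma, n, rng):
    """Draw n GPD variates by inverting the survival function (xi != 0)."""
    return [sigma / xi * ((1.0 - rng.random()) ** (-xi) - 1.0) for _ in range(n)]

# Synthetic check with known parameters (hypothetical values, not the CE data).
rng = random.Random(0)
xi_hat, sigma_hat = fit_gpd_pwm(sample_gpd(0.3, 100.0, 20000, rng))
```

On real inter-arrival times, the excesses would be `[x - u0 for x in iat if x > u0]`, and the fit would be repeated for several thresholds to check that the estimate of $\xi$ stays roughly constant.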

Load tails
Not so consistent behaviors
Classification
(Figure panels: 20th, 60th and 90th percentiles)

Are statistics relevant?
Arrival process intensity: inverse of the average IAT
Averaging range: day, week
Stationarity: intensity (and average) do not depend on the date
Poisson processes (exponential IAT) are stationary

Stationarity
Portmanteau whiteness test
Statistics: Box-Pierce (not implemented): Dufour-Roy [1985] rank statistics
At the day scale, whiteness is always rejected on active CEs, but not always on less active CEs
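To make the idea of a portmanteau test concrete, here is a minimal sketch of a Box-Pierce statistic computed on the ranks of the series. It is a simplified stand-in and not the exact Dufour-Roy statistic used in the study. The function names and the choice of 10 lags are assumptions, ties are ranked in arbitrary order, and the p-value uses the closed-form chi-square tail that holds for an even number of lags.

```python
import math
import random

def ranks(series):
    """Rank of each observation (1 = smallest); ties are broken arbitrarily."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    r = [0] * len(series)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def box_pierce_on_ranks(series, lags=10):
    """Return (Q, p-value) with Q = n * sum of squared rank autocorrelations,
    compared with a chi-square distribution with `lags` degrees of freedom."""
    if lags % 2:
        raise ValueError("this sketch needs an even number of lags")
    r = ranks(series)
    q = len(r) * sum(autocorr(r, k) ** 2 for k in range(1, lags + 1))
    half = q / 2.0
    # chi-square survival function for 2m degrees of freedom:
    # exp(-x/2) * sum_{j < m} (x/2)^j / j!
    p = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(lags // 2))
    return q, p

rng = random.Random(1)
q_iid, p_iid = box_pierce_on_ranks([rng.random() for _ in range(500)])
q_trend, p_trend = box_pierce_on_ranks(list(range(500)))
```

A small p-value rejects whiteness; the trending series is rejected overwhelmingly, whereas an independent series usually is not.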

Whiteness tests
CE n°3: p-value of the whiteness test

Bursts
Goal: exhibit a stationary process
For a given threshold, a burst is a set of jobs with inter-arrival times smaller than the threshold
Threshold decreases → burst size (in number of jobs) decreases
Threshold increases → burst duration (in seconds) increases
Size and duration of the burst should be increasing functions of the threshold
Example with threshold = 10 s: a first burst of 6 jobs, then an interval of more than 10 s, then a second burst of 7 jobs
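The burst definition can be sketched in a few lines. The function names and the arrival times are hypothetical; the toy data reproduce the example of a 6-job burst followed, after a gap longer than 10 s, by a 7-job burst.

```python
def split_into_bursts(arrivals, threshold):
    """Group sorted arrival times into bursts: a new burst starts whenever the
    inter-arrival time is not smaller than the threshold."""
    bursts = []
    for t in arrivals:
        if bursts and t - bursts[-1][-1] < threshold:
            bursts[-1].append(t)      # still inside the current burst
        else:
            bursts.append([t])        # gap too long: open a new burst
    return bursts

def burst_stats(bursts):
    """Return (size in jobs, duration in seconds) for each burst."""
    return [(len(b), b[-1] - b[0]) for b in bursts]

# Hypothetical arrival times in seconds: 6 jobs, a 25 s gap, then 7 jobs.
arrivals = [0, 2, 5, 9, 12, 14, 39, 41, 44, 50, 53, 58, 60]
stats_10s = burst_stats(split_into_bursts(arrivals, 10))
stats_3s = burst_stats(split_into_bursts(arrivals, 3))
```

Lowering the threshold from 10 s to 3 s splits the two bursts into many smaller and shorter ones, in line with the monotonicity stated above.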

Burst behavior: Poisson process
Simulated Poisson process (intensity $10^{-2}$): mean burst size and duration vs. threshold (semi-log plots)
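For a homogeneous Poisson process with intensity $\lambda$, the inter-arrival times are independent and exponential, so a burst ends at each gap with probability $e^{-\lambda \tau}$ for a threshold $\tau$. The burst size is therefore geometric with mean $e^{\lambda \tau}$, a straight line on a semi-log plot. The sketch below checks this by simulation; the function names are hypothetical and the unit of the intensity (jobs per second) is an assumption, since the slide does not state it.

```python
import math
import random

def simulate_poisson_iats(intensity, n, rng):
    """Draw n exponential inter-arrival times for the given intensity."""
    return [rng.expovariate(intensity) for _ in range(n)]

def mean_burst_size(iats, threshold):
    """Mean number of jobs per burst, a burst ending at each IAT >= threshold.
    n inter-arrival times correspond to n + 1 jobs."""
    n_bursts = 1 + sum(1 for d in iats if d >= threshold)
    return (len(iats) + 1) / n_bursts

rng = random.Random(0)
iats = simulate_poisson_iats(1e-2, 200000, rng)
size_100 = mean_burst_size(iats, 100.0)   # theory: exp(1)
size_200 = mean_burst_size(iats, 200.0)   # theory: exp(2)
```

Departures of a real CE from this exponential growth in the threshold are what the following slides examine.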

Burst behavior: CE IAT
CE n°3 (lcgce01.gridpp.rl.ac.uk): mean burst size and duration vs. threshold (semi-log plots)

Burst behavior: CE IAT
CE n°97 (ramses.dsic.upv.es): mean burst size and duration vs. threshold (semi-log plots)

Burst intensity
Reciprocal of the mean IAT of the bursts
At the week scale, the independence hypothesis of the IAT is not systematically rejected, even on the most active CEs

Whiteness test for the burst IAT
(Figure panels: p-value of the whiteness test for CE n°3, CE n°6, CE n°13 and CE n°97)

Stalactite diagrams
How to read a stalactite diagram:
X-axis: time in days
Y-axis: threshold in minutes
Color: mean burst size (normalized on each row)
On a single row, light areas indicate smaller bursts, while darker areas stand for bursts gathering more jobs
Dark vertical areas reveal bursts left undivided by progressive threshold reduction
Interpretation: the more the threshold is reduced, the more jobs are dispatched among shorter bursts, EXCEPT for some “stalactites”
Adequate tool: wavelets

Stalactite diagrams
(Figure panels: CE n°6 (ce101.cern.ch); simulated Poisson process (intensity $10^{-2}$))

Forecasting the load
Load(t) = sum of the execution times of the jobs queued at time t
Sampling period: 30 minutes
Only known information may be used: for a job arriving at time t whose execution ends at t+n, Load(t) is available only at t+n
If t0 is the present date and t1 the last date at which the load was known, t0 - t1 is typically of the order of a few days in active periods
Thus we must extrapolate the load with a horizon of a few days

Forecasting the load: simple methods
Two naive prediction strategies:
Linear, from the load history
As the mean of the past execution times × the number of jobs in the queue
From the load history: the horizon is a few days
From the past executions: the correlation of the series of averaged execution times decreases very fast; the horizon for a linear prediction of the execution times is one day at best

Forecasting the load: simple methods are defeated

A local approach
The load process is probably strongly non-stationary
Analysis on time windows where the inactive period is smaller than a few hours
In an integrated study, a window is a burst of load
(Figure: CE n°3 (lcgce01.gridpp.rl.ac.uk), load averaged over 4 h)

ARCH models (I)
Autoregressive conditional heteroskedasticity (Engle, 1982)
Widely used in financial modeling
Fat tails
Time-varying volatility clustering: changes of the same magnitude tend to follow one another
Leverage effects: volatility negatively correlated with the magnitude of the change
The one-step-ahead forecast errors are zero-mean random disturbances, uncorrelated from one period to the next, but not independent

Log returns
If $X_n$ is the load at time n, the log return is $Y_n = \log(X_n / X_{n-1})$
$Y_n$ measures the variation of the load
ARCH: low correlation of $Y_n$, strong correlation of $Y_n^2$
(Figure: CE n°3 (lcgce01.gridpp.rl.ac.uk), log-returns of the load averaged over 30 min)
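The two diagnostics mentioned here, the log-returns and the autocorrelation of a series and of its squares, can be sketched as follows. The function names and the toy values are hypothetical; the log-return assumes strictly positive load samples, so periods of zero load would have to be excluded or handled separately.

```python
import math

def log_returns(load):
    """Y_n = log(X_n / X_{n-1}); requires strictly positive load values."""
    return [math.log(b / a) for a, b in zip(load, load[1:])]

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Toy load values (hypothetical): one jump by a factor e, then no change.
y = log_returns([1.0, math.e, math.e])
alternating = [1.0, -1.0] * 5
r1 = autocorrelation(alternating, 1)
```

An ARCH-like series would show `autocorrelation(y, k)` close to zero while `autocorrelation([v * v for v in y], k)` stays clearly positive for small lags.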

Load characteristics (IV)
Correlation is present in the log-return series
Preliminary AR filtering of the data; chosen filter order: ~28
The residuals must satisfy the properties of the GARCH model
(Figure: CE n°3 (lcgce01.gridpp.rl.ac.uk), log-returns of the load averaged over 30 min, with confidence intervals in blue)

Load characteristics (V)
No correlation in the filtered series
Correlation is present in the series of its squares
Heteroskedasticity is preserved
(Figure panels: log-returns and squared log-returns of the load averaged over 30 min)

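ARCH models: definition
The time series $Z_n$ is ARCH(p) iff $Z_n = \sigma_n \varepsilon_n$ and $\sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2$, where $\varepsilon_n$ is an i.i.d. zero-mean, unit-variance noise
White noise whose conditional variance varies over time
Analogy: $|Z_n|$ ~ speed, i.e. $\Delta$(load); $\sigma_n$ ~ acceleration, i.e. $\Delta$(log-return)
$\sigma_n$ depends on the past of the series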
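A short simulation illustrates the standard ARCH(p) recursion (Engle, 1982) and the signature used earlier: little correlation in the series, clear correlation in its squares. This is a sketch under assumptions: Gaussian noise, zero pre-sample values, and illustrative parameters $\alpha_0 = 1$, $\alpha_1 = 0.3$ that are not taken from the study.

```python
import math
import random

def simulate_arch(alpha0, alphas, n, rng):
    """Simulate Z_n = sigma_n * eps_n with
    sigma_n^2 = alpha0 + sum_i alphas[i-1] * Z_{n-i}^2 and Gaussian eps_n."""
    p = len(alphas)
    z = [0.0] * p                      # zero pre-sample values (assumption)
    for _ in range(n):
        sigma2 = alpha0 + sum(a * z[-i] ** 2 for i, a in enumerate(alphas, start=1))
        z.append(math.sqrt(sigma2) * rng.gauss(0.0, 1.0))
    return z[p:]

def autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

z = simulate_arch(1.0, [0.3], 20000, random.Random(0))
r_z = autocorr(z, 1)                       # expected to be near zero
r_z2 = autocorr([x * x for x in z], 1)     # expected to be clearly positive
```

With these parameters the theoretical lag-1 autocorrelation of the squares is $\alpha_1 = 0.3$, while the series itself is uncorrelated.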

ARCH model: estimation
Parameter estimation for a given order and noise distribution (Gaussian or Student)
p in the range 1-20: limit for convergent estimation
Usual tests on normalized residuals:
Block variance: Bartlett
Distribution identity: Kolmogorov-Smirnov
Normality: Shapiro-Francia, Lilliefors

ARCH models: experiment
Gaussian, ARCH(5): inadequate, rejected by all tests
High kurtosis → Student's distribution
Student, ARCH(5): better, but rejected by all tests → GARCH model

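GARCH models: definition
ARCH model supplemented with an AR part on the variance
The time series $Z_n$ is GARCH(p, q) iff $Z_n = \sigma_n \varepsilon_n$ and $\sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{n-j}^2$, where $\varepsilon_n$ is an i.i.d. zero-mean, unit-variance noise
Empirical order selection: p, q < 5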
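The variance recursion of a GARCH model can be sketched as a filter that turns an observed series into conditional variances. Several points are assumptions rather than facts from the slides: the function name, the convention that `alphas` weight past squared values and `betas` weight past variances (the order of p and q differs between authors), and the initialization of all pre-sample terms at the unconditional variance $\alpha_0 / (1 - \sum \alpha_i - \sum \beta_j)$.

```python
def garch_variance(z, alpha0, alphas, betas):
    """Conditional variances
    sigma_n^2 = alpha0 + sum_i alphas[i-1] * z_{n-i}^2 + sum_j betas[j-1] * sigma_{n-j}^2
    for the observed series z."""
    persistence = sum(alphas) + sum(betas)
    if persistence >= 1.0:
        raise ValueError("needs sum(alphas) + sum(betas) < 1")
    uncond = alpha0 / (1.0 - persistence)       # unconditional variance
    p, q = len(alphas), len(betas)
    z2 = [uncond] * p + [x * x for x in z]      # pre-sample squares (assumption)
    s2 = [uncond] * q                           # pre-sample variances (assumption)
    for n in range(len(z)):
        arch = sum(a * z2[p + n - i] for i, a in enumerate(alphas, start=1))
        garch = sum(b * s2[q + n - j] for j, b in enumerate(betas, start=1))
        s2.append(alpha0 + arch + garch)
    return s2[q:]

# With a series of zeros, the variance decays from 10 towards alpha0 / (1 - beta_1) = 5.
variances = garch_variance([0.0, 0.0, 0.0], 1.0, [0.1], [0.8])
```

Dividing the series by the square root of these variances gives the normalized residuals on which whiteness and distribution tests can be run.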

GARCH model
GARCH(1,3) with Student's distribution is validated

Summary of results
Only per CE
IAT: heavy tailed, consistent Pareto distributions; limits on statistics: non-stationary process; bursts might be stationary
Load: simple predictors don't work; might be heteroskedastic: only the variance could be predicted, BUT inside activity windows

Conclusion and future work
Multi-scale phenomenon
The CE model largely remains to be elucidated
Models for the overall system, the VOs and the users have not yet been touched
Data extraction, analysis and results must be automated and organized

Conclusions and research directions
Mostly data discovery and simple analyses
Inter-arrival times: modeling leads exist if the scale is chosen appropriately; extreme values are of interest for failure diagnosis, etc.
Load: preliminary study with classical time-series tools; possible use of APGARCH models
No prediction result (in this study)
