Exploration on statistical and time-series models

Jonathan Samama, Jonathan Horyn, Julien Bect, Emmanuel Vazquez (SUPELEC); Cécile Germain-Renaud (LRI)

The problem: statistical characterization and models of job arrival and component load. Here, component = CE (computing element).

Data from the RTM
More than 18M jobs, 20 GB
The 10 most active CEs account for 31% of all jobs
[Figure: top 30 CEs]

Data from the RTM
1st CE: 626K jobs, ce03-lcg.cr.cnaf.infn.it
2nd CE: 579K jobs, lcgce01.gridpp.rl.ac.uk
4th CE: 384K jobs, ce101.cern.ch
33rd CE: 107K jobs, ce2.egee.cesga.es

Examples
CE no. 3 (lcgce01.gridpp.rl.ac.uk), percentage of data kept: 92%
CE no. 97 (ramses.dsic.upv.es), percentage of data kept: 59%
The histograms are truncated at 2 minutes.
The nominal behaviour is an open question: the exponential distribution is inadequate.
Extremal behaviour is easier to characterize.

Inter-arrival time: QQ plot against the exponential distribution
Definitely not exponential
Concave: heavy-tailed
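
A QQ comparison of this kind can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name `exponential_qq`, the plotting positions (i - 0.5)/n and the rescaling of the sample by its mean are all assumptions made here.

```python
import math

def exponential_qq(sample):
    """Return (theoretical, empirical) quantile pairs against Exp(1).

    The sample is divided by its mean so that an exponential sample
    would fall close to the diagonal.
    """
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n              # plotting position (assumed convention)
        theoretical = -math.log(1.0 - p)  # Exp(1) quantile of order p
        pairs.append((theoretical, x / mean))
    return pairs
```

Upper-tail points whose empirical quantile lies far from the theoretical one indicate a tail that is not exponential.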

Distribution tails
Tail: restrict to values larger than a threshold u. Too low: not in the tail. Too high: unreliable parameter estimation.
f(x) = P(X > u + x | X > u), H = density of f
Theoretical answer: generalized Pareto distribution, with shape ξ and scale β:
$$f(x) \approx \begin{cases} (1 + \xi x / \beta)^{-1/\xi} & \text{if } \xi \neq 0 \\ e^{-x/\beta} & \text{else} \end{cases}$$
ξ = 0: exponential
ξ > 0: heavy-tailed
If ξ > 1/k, the k-th order moment does not exist
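
To make the role of the shape parameter concrete, here is a small Python sketch of the generalized Pareto survival function and of the moment condition. The names `gpd_survival` and `highest_finite_moment`, and the use of beta for the scale, are choices made for this illustration only.

```python
import math

def gpd_survival(x, xi, beta):
    """P(excess > x) under a generalized Pareto law (shape xi, scale beta)."""
    if x < 0 or beta <= 0:
        raise ValueError("x must be >= 0 and beta must be > 0")
    if xi == 0.0:
        return math.exp(-x / beta)     # exponential case
    base = 1.0 + xi * x / beta
    if base <= 0.0:                    # past the finite end point (xi < 0)
        return 0.0
    return base ** (-1.0 / xi)

def highest_finite_moment(xi):
    """Largest integer k such that k < 1/xi, i.e. the k-th moment is finite."""
    if xi <= 0:
        return math.inf                # all moments exist
    return math.ceil(1.0 / xi) - 1
```

For example, `highest_finite_moment(0.51)` returns 1: with a shape slightly above 0.5 the mean is finite but the variance is not.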

Threshold identification
For a proper threshold u0, the conditional expectation is linear: β(u) = σ + ξ(u − μ)
Mean Excess Plot (MEP) method: plot the empirical expectation of the excess against u and identify a linear area in the graph
To be confirmed by a constant estimate of ξ
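
The empirical expectation used by the MEP method can be computed as in the following sketch. The function name `mean_excess` and the choice to skip thresholds above the sample maximum are assumptions of this example.

```python
def mean_excess(sample, thresholds):
    """Empirical mean excess e(u) = mean of (x - u) over the x > u.

    Returns a list of (u, e(u)) points to plot; thresholds that leave
    no observation above them are skipped.
    """
    points = []
    for u in thresholds:
        excesses = [x - u for x in sample if x > u]
        if excesses:
            points.append((u, sum(excesses) / len(excesses)))
    return points
```

Plotting these points and looking for the region where they align on a straight line gives a candidate u0.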

Fitting a Pareto distribution (IAT)
The estimate of ξ should be constant
[Figure panels: u0 = 270 s; u0 = 1500 s; u0 = 1100 s; u0 = 600 s (?)]

Pareto fit for IAT
[Figure panels: ξ = 0.51; ξ = 0.45; ξ = 0.55; ξ = 0.68 (?)]
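
The slides do not say which estimator produced these shape values. As a stand-in, the sketch below uses the Hill estimator, a standard estimator of a positive tail index; it is not necessarily the method used in the study, and the name `hill_estimator` and the parameter `k` (number of upper order statistics) are choices made here.

```python
import math

def hill_estimator(sample, k):
    """Hill estimate of the shape xi from the k largest observations."""
    xs = sorted(sample, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    if xs[k] <= 0:
        raise ValueError("the k+1 largest values must be positive")
    reference = math.log(xs[k])        # log of the (k+1)-th largest value
    return sum(math.log(x) - reference for x in xs[:k]) / k
```

Repeating the estimate for several values of `k` plays the same role as varying the threshold u0: a stable estimate supports the fit.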

Pareto fit for IAT
The heavy-tail hypothesis stands
Small parameter range
[Figure: QQ plots (X quantiles vs. Y quantiles) for ξ = 0.51, ξ = 0.45, ξ = 0.55 and ξ = 0.68 (?)]

Load tails
Not so consistent behaviors
Classification
[Figure panels: 20% percentile; 60% percentile; 90% percentile]

Are statistics relevant?
Arrival process intensity: inverse of the average IAT. Averaging range: day, week.
Stationarity: the intensity (and the average) does not depend on the date. Poisson processes (exponential IAT) are stationary.
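
A windowed intensity estimate of this kind could look like the sketch below. The function name `windowed_intensity`, the timestamp unit (seconds) and the handling of windows with fewer than two arrivals are assumptions of this example.

```python
def windowed_intensity(arrival_times, window):
    """Map each window index to 1 / (mean IAT inside that window).

    `window` is the averaging range in the same unit as the timestamps
    (e.g. 86400 for a day). Windows with fewer than two arrivals, or
    with only simultaneous arrivals, are left out.
    """
    buckets = {}
    for t in sorted(arrival_times):
        buckets.setdefault(int(t // window), []).append(t)
    intensity = {}
    for index, times in buckets.items():
        if len(times) >= 2:
            mean_iat = (times[-1] - times[0]) / (len(times) - 1)
            if mean_iat > 0:
                intensity[index] = 1.0 / mean_iat
    return intensity
```

If the process were stationary, the values returned for successive windows would fluctuate around a common level.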

Stationarity: "portmanteau" whiteness test
Statistics: Box-Pierce (not implemented): Dufour-Roy [1985] rank statistics
At the day scale: always rejected on active CEs, not always on less active CEs
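
For reference, the classical Box-Pierce portmanteau statistic named on this slide can be sketched as below. This is only the textbook statistic Q = n Σ r_k², not the rank-based Dufour-Roy variant; the function names are illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("constant series has no autocorrelation")
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def box_pierce(series, max_lag):
    """Box-Pierce statistic Q = n * sum of squared autocorrelations."""
    n = len(series)
    return n * sum(autocorrelation(series, k) ** 2
                   for k in range(1, max_lag + 1))
```

Under the whiteness hypothesis Q is approximately chi-square distributed with `max_lag` degrees of freedom, so a large Q (for instance above 3.84 at the 5% level when `max_lag` is 1) leads to rejection.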

Whiteness tests
[Figure: CE no. 3, p-value of the whiteness test]

Bursts
Goal: exhibit a stationary process
For a given threshold, a burst is a set of consecutive jobs whose inter-arrival times are smaller than the threshold
When the threshold decreases, the burst size (in number of jobs) decreases; when the threshold increases, the burst duration (in seconds) increases
Example with threshold = 10 s: a first burst of 6 jobs, then an interval of more than 10 s, then a second burst of 7 jobs
Size and duration of the bursts should be increasing functions of the threshold
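
The burst definition above translates directly into code. The sketch below is an illustration; the names `split_bursts` and `burst_stats` are not from the slides.

```python
def split_bursts(arrival_times, threshold):
    """Group arrivals into bursts: a gap >= threshold starts a new burst."""
    times = sorted(arrival_times)
    if not times:
        return []
    bursts = [[times[0]]]
    for previous, current in zip(times, times[1:]):
        if current - previous < threshold:
            bursts[-1].append(current)   # same burst
        else:
            bursts.append([current])     # gap too long: new burst
    return bursts

def burst_stats(bursts):
    """Return (size in jobs, duration in time units) for each burst."""
    return [(len(b), b[-1] - b[0]) for b in bursts]
```

With arrivals every 2 s for six jobs, a 15 s pause, then seven jobs 1 s apart, a 10 s threshold yields two bursts of sizes 6 and 7, matching the example on the slide.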

Burst behavior: Poisson process
[Figure: simulated Poisson process (intensity 10^-2), mean burst size and duration vs. threshold (semi-log plots)]
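
For a Poisson process the reference curve can be derived by hand: each inter-arrival time exceeds a threshold s with probability exp(-λs), independently of the others, so the burst size is geometric with mean exp(λs), a straight line on a semi-log plot. The sketch below checks this by simulation; the sample size, seed and thresholds are arbitrary choices for this example.

```python
import math
import random

def mean_burst_size(iats, threshold):
    """Mean number of jobs per burst, from consecutive inter-arrival times."""
    n_jobs = len(iats) + 1
    n_bursts = 1 + sum(1 for gap in iats if gap >= threshold)
    return n_jobs / n_bursts

rng = random.Random(0)                 # fixed seed for reproducibility
rate = 1e-2                            # intensity used on the slide
iats = [rng.expovariate(rate) for _ in range(100_000)]

for s in (50, 100, 200):
    # empirical mean burst size next to the theoretical value exp(rate * s)
    print(s, mean_burst_size(iats, s), math.exp(rate * s))
```

A measured curve that departs from this line indicates arrivals that are not Poisson at that scale.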

Burst behavior: CE IAT
[Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), mean burst size and duration vs. threshold (semi-log plots)]

Burst behavior: CE IAT
[Figure: CE no. 97 (ramses.dsic.upv.es), mean burst size and duration vs. threshold (semi-log plots)]

Burst intensity: reciprocal of the mean IAT of the bursts
At the week scale, the independence hypothesis of the IAT is not systematically rejected, even on the most active CEs

Whiteness test for the burst IAT
[Figure panels: p-value of the whiteness test for CE no. 3, CE no. 6, CE no. 13 and CE no. 97]

Stalactite diagrams
How to read a stalactite diagram:
X-axis: time in days
Y-axis: threshold in minutes
Color: mean burst size (normalized on each row)
On a single row, clear areas indicate smaller bursts, while darker areas stand for bursts gathering more jobs
Dark vertical areas reveal bursts left undivided by progressive threshold reduction
Interpretation: the more the threshold is reduced, the more the jobs are split among shorter bursts, except for some "stalactites"
Adequate tool: wavelets
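
The data behind such a diagram can be sketched as a matrix with one row per threshold and one column per day. The code below is a simplified illustration: it cuts bursts at day boundaries and normalizes each row by its maximum, both of which are assumptions of this example, as are the function names.

```python
def mean_burst_size(times, threshold):
    """Mean jobs per burst for sorted arrival times (0.0 if no arrival)."""
    if not times:
        return 0.0
    n_bursts = 1 + sum(1 for a, b in zip(times, times[1:])
                       if b - a >= threshold)
    return len(times) / n_bursts

def stalactite_matrix(arrival_times, thresholds, day=86400.0):
    """Rows: thresholds. Columns: days. Values: row-normalized mean burst size."""
    times = sorted(arrival_times)
    n_days = int(times[-1] // day) + 1
    per_day = [[] for _ in range(n_days)]
    for t in times:
        per_day[int(t // day)].append(t)
    matrix = []
    for s in thresholds:
        row = [mean_burst_size(day_times, s) for day_times in per_day]
        top = max(row)
        matrix.append([v / top if top > 0 else 0.0 for v in row])
    return matrix
```

Rendering the matrix as a heat map, with the thresholds sorted along the Y-axis, gives the row-normalized picture described above.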

Stalactite diagrams
[Figure panels: CE no. 6 (ce101.cern.ch); simulated Poisson process (intensity 10^-2)]

Forecasting the load
Load(t) = sum of the execution times of the jobs queued at time t. Sampling period: 30 minutes.
Only known information may be used: a job arriving at time t ends its execution at t+n, so Load(t) is available only at t+n.
If t0 is the present date and t1 the last date at which the load was known, t0 − t1 is typically of the order of a few days in active periods. Thus we must extrapolate the load with a horizon of a few days.
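
The load definition and its delayed availability can be illustrated as follows. The `Job` record and the reading of "queued" as "arrived but not yet started" are assumptions of this sketch; the slides do not give the exact data layout.

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: float   # submission time
    start: float     # start of execution
    end: float       # end of execution

def load_at(jobs, t):
    """Sum of execution times of the jobs queued at time t."""
    return sum(j.end - j.start for j in jobs if j.arrival <= t < j.start)

def known_at(jobs, t):
    """Earliest date at which load_at(jobs, t) is fully known.

    The execution time of a queued job is only observed once it ends.
    """
    ends = [j.end for j in jobs if j.arrival <= t < j.start]
    return max(ends, default=t)
```

The gap between `known_at(jobs, t)` and `t` is the delay that forces the forecast horizon to cover several days in active periods.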

Forecasting the load: simple methods
Two naive prediction strategies:
Linear prediction from the load history. The required horizon is a few days.
Mean of the past execution times × number of jobs in the queue. The correlation of the series of averaged execution times decreases very fast, so the horizon for a linear prediction of the execution times is one day at best.

Forecasting the load: simple methods are defeated

A local approach
The load process is probably highly non-stationary
Analysis on time windows where the inactive period is shorter than a few hours
In an integrated study, a window is a burst of load
[Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), load averaged over 4 h]

ARCH models (I)
Autoregressive conditional heteroskedasticity (Engle, 1982). Widely used in financial modeling.
Fat tails
Time-varying volatility and clustering: changes of the same magnitude tend to follow each other
Leverage effects: volatility negatively correlated with the magnitude of the change
The one-step-ahead forecast errors are zero-mean random disturbances, uncorrelated from one period to the next, but not independent

Log returns
If X_n is the load at time n, the log return is
$$Y_n = \log X_n - \log X_{n-1}$$
Y_n measures the variation of the load
ARCH: low correlation of Y_n, strong correlation of Y_n^2
[Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), log returns of the load averaged over 30 min]
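
The diagnostic on this slide (weak correlation of the log returns, strong correlation of their squares) can be reproduced with a few lines of Python. The sketch assumes a strictly positive load series and uses the usual log-return definition; the function names are illustrative.

```python
import math

def log_returns(load):
    """Y_n = log(X_n) - log(X_{n-1}) for a strictly positive series."""
    return [math.log(b) - math.log(a) for a, b in zip(load, load[1:])]

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def arch_diagnostic(load, lag=1):
    """Return (autocorrelation of Y, autocorrelation of Y^2) at `lag`."""
    y = log_returns(load)
    return autocorrelation(y, lag), autocorrelation([v * v for v in y], lag)
```

A small first value together with a clearly larger second one is the pattern that motivates an ARCH-type model.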

Load characteristics (IV)
Correlation is present in the series of log returns, so the data are first filtered with an AR model. Chosen filter order: ~28
The residuals must satisfy the properties of the GARCH model
[Figure: CE no. 3 (lcgce01.gridpp.rl.ac.uk), log returns of the load averaged over 30 min, with confidence intervals (blue)]

Load characteristics (V)
No correlation in the series of log returns
Correlation is present in the series of squared log returns
Heteroskedasticity is preserved
[Figure panels: log returns of the load averaged over 30 min; squared log returns of the load averaged over 30 min]

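ARCH models: definition
The time series Z_n is ARCH(p) iff
$$Z_n = \sigma_n \varepsilon_n, \qquad \sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2$$
where ε_n is an i.i.d. zero-mean, unit-variance noise
Non-stationary white noise
Analogy: |Z_n| ~ speed, i.e. Δ(load); σ_n ~ acceleration, i.e. Δ(log return)
σ_n depends on the past of the series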
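
A short simulation helps to see how the conditional variance feeds on the past of the series. The sketch below uses the standard ARCH(p) recursion with Gaussian noise; the function names, the zero initial values and the Gaussian choice are assumptions of this example (the study also considers Student noise).

```python
import random

def conditional_variance(alphas, past):
    """sigma_n^2 = alpha_0 + sum_i alpha_i * Z_{n-i}^2.

    `alphas` is [alpha_0, alpha_1, ..., alpha_p]; `past[-1]` is Z_{n-1}.
    """
    return alphas[0] + sum(a * past[-i] ** 2
                           for i, a in enumerate(alphas[1:], start=1))

def simulate_arch(alphas, n, seed=0):
    """Simulate n values of an ARCH(p) series with Gaussian noise."""
    rng = random.Random(seed)
    p = len(alphas) - 1
    past = [0.0] * p                   # assumed initial values
    series = []
    for _ in range(n):
        sigma = conditional_variance(alphas, past) ** 0.5
        z = sigma * rng.gauss(0.0, 1.0)
        series.append(z)
        past.append(z)
    return series
```

For instance, `simulate_arch([1.0, 0.5], 1000)` produces an ARCH(1) path in which large values tend to be followed by large values of either sign.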

ARCH model: estimation
Parameter estimation for a given order and noise distribution (Gaussian or Student)
Order p in the range 1-20 (limit for convergent estimation)
Usual tests on normalized residuals:
Block variance: Bartlett
Distribution identity: Kolmogorov-Smirnov
Normality: Shapiro-Francia, Lilliefors

ARCH models: experiment
Gaussian, ARCH(5): inadequate, rejected by all tests. High kurtosis suggests Student's distribution.
Student, ARCH(5): better, but rejected by all tests. Hence the GARCH model.

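GARCH models: definition
The time series Z_n is GARCH(p, q) iff
$$Z_n = \sigma_n \varepsilon_n, \qquad \sigma_n^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{n-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{n-j}^2$$
ARCH model supplemented with an AR part on the variance
Empirical order selection: p, q < 5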
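
Given a series and a set of coefficients, the GARCH variance recursion can be run as a filter, which is what is needed to compute normalized residuals for the validation tests. The sketch below follows the usual GARCH form, with `alphas` acting on past squared values and `betas` on past variances; the function names and the `initial_variance` used before the start of the series are assumptions of this example.

```python
def garch_variances(z, alpha0, alphas, betas, initial_variance):
    """Conditional variances sigma_n^2 of a GARCH model along the series z.

    sigma_n^2 = alpha0 + sum_i alphas[i-1] * z[n-i]^2
                       + sum_j betas[j-1] * sigma_{n-j}^2
    Values before the start of the series are replaced by initial_variance.
    """
    sigma2 = []
    for n in range(len(z)):
        var = alpha0
        for i, a in enumerate(alphas, start=1):
            var += a * (z[n - i] ** 2 if n - i >= 0 else initial_variance)
        for j, b in enumerate(betas, start=1):
            var += b * (sigma2[n - j] if n - j >= 0 else initial_variance)
        sigma2.append(var)
    return sigma2

def normalized_residuals(z, sigma2):
    """z_n / sigma_n, the quantities submitted to the usual tests."""
    return [x / s ** 0.5 for x, s in zip(z, sigma2)]
```

With empty `betas` the recursion reduces to the ARCH case.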

GARCH model: GARCH(1,3) with Student's distribution is validated

Summary of results
Only per CE
IAT: heavy-tailed, consistent Pareto distributions. Limits on statistics: non-stationary process. Bursts might be stationary.
Load: simple predictors don't work. Might be heteroskedastic: only the variance could be predicted, but only inside activity windows.

Conclusion and future work
Multi-scale phenomenon
The model of the CE largely remains to be elucidated
Models for the overall system, the VOs and the users have not yet been addressed
Data extraction, analysis and results must be automated and organized

Conclusions and research directions
Essentially data discovery and simple analyses
Inter-arrival times: modeling leads exist if the scale is chosen appropriately. Extreme values: of interest for failure diagnosis, etc.
Load: preliminary study with classical time-series tools; APGARCH models could possibly be used. No prediction results (in this study).