CIVA Fair Play System (FPS)

The CIVA FairPlay System
What does it do, how does it do it – and why?

CIVA ceased using its TBLP statistical software at the end of 2005 and is now running an entirely new system to produce the results at international championships, designed from the outset to fit with modern patterns of judging. The programme employs the new FairPlay scoring system to appraise judges’ raw marks against a set of clear and unambiguous criteria. It also provides each pilot with a new “Processed Marks Sheet” that shows what has been done and indicates clearly why any changes have been made. All of these pilots’ processed marks sheets are subsequently available on the CIVA website, ensuring that everyone can see not only each sequence score but also every pilot’s figure marks and scores from each judge. Any data rejected is highlighted alongside the substitution FairPlay makes on the way to the final result. This frank and even-handed appraisal of the entire data process from judging line to final results compilation is unusual in a modern sporting context, and underlines how seriously the commission takes the responsibility of providing fair and unbiased results whilst working to develop the skills of its judges and other contest officials where this is appropriate.

So …. FairPlay. Why have we got it, where did it come from, and most importantly – what does it actually do? Here is an explanation in, as far as possible, a non-technical style. It should allow you to make good sense of the key steps that this very thorough system takes. The full CIVA narrative is in their Sporting Code at section 6 chapter 8, but you’d better find a quiet room and a wet towel before getting stuck into all that….

Judges’ raw marks
To judge a complex aerobatic figure demands a sharp eye and a good understanding of the judging rules. Every detail seen must be rapidly assessed according to the established requirements, leading to an appropriate single mark for the figure. This is a testing but feasible task. We should be in no doubt that the necessary skills and interpretations are rarely identical even amongst well trained judges, and so expect the given marks to depend on just what did register during each fleeting moment in each judge’s mind’s eye. Not every judge sees the same things, the marks may not be consistent, and it is possible that some minor or major aspects will occasionally be missed altogether by one or more judges. This is human nature, and any system we design must be able to distinguish between major and minor differences, and address the anomalies it finds in a fair and even-handed way by substituting sensible and arguably correct alternatives.

Why not just add up all the marks and allow the anomalies to be ironed out in the process? The straight answer is that you wouldn’t like it: every scrap of good, overly generous, miserly low, biased and / or allegedly misjudged marking would have exactly the same influence when calculating the final numbers, and the arguments could go on for ever. Arbitrarily discard the highest and lowest marks and use the rest? Well, if the one judge who generally marks lower than the others is the most accurate, it would not be constructive to throw away most of his marks. These approaches are blunt tools and do little to eliminate poor judgements and bias from the results. Every pilot has been on the receiving end of both shallow and steep comments applied to the same 45° line, hesitations seen by some judges but not others, and simple errors that get by even at the most prestigious events. We just can’t turn a blind eye to them. If a mark looks ‘wrong’ we must employ all our skills to calculate a fairer mark, and say so.

Competition practicalities
At every contest judges do their best to record what they believe they see, and each Form-A or optical-reader card provides one judge’s assessments per pilot of how truthfully that complex path between the first and final wing-rocks matched the geometry of the sequence as intended. Most judges are well experienced at their allotted task, some of them are pilots, and all of them apply their interpretation of the CIVA bumper book of rules through long days of blistering sun, biting cold, wind and rain, often at preposterous locations and usually a long way from the social hub of pilots and refreshment. We should take comfort that their marks generally have the consistency that as pilots we aim so hard to achieve ourselves.

We now have to mix all this data together to produce an accurate and reliable set of ranked results. The protocols we employ must be robust and easily used, and trap silly input errors. The system must provide output in the many formats we need, but – and here’s the crucial bit – it must first test the thousands of recorded judgements for true acceptability and take common-sense steps to sort out things that “don’t fit” whenever this appears necessary.

Evolution
Aerobatic scoring processes have come a long way from the early add-everything-up approach, via Bauerising (remember that?) and TBLP, to FairPlay’s smart statistical engine. In fact the much developed Tarasov Bauer Long progressive system tackled its objectives quite well, but it lacked the figure grouping and analytical finesse that FairPlay provides, and pilots certainly cursed when it seemed to work against them. Worst of all, the inner workings were always hidden. CIVA realised that a fresh approach was needed, starting with a clean sheet and taking the best advice that could be found. Christened ‘FairPlay’, the adopted solution was conceived in the UK by Dr Derek Pike, a senior professor of statistics at Reading University, based on descriptions of the CIVA judging system and typical data from previous contests. Detailed information on the way CIVA contests work was provided by Steve Green and Alan Cassidy, who serve on the CIVA Judging and Rules sub-committees. The concept is much more ‘open’ than hitherto, it uses best practice in modern statistical analysis, and in worldly terms it is a generation on from its predecessors. In addition to the pilot-oriented tasks, the system is able to provide a clear review of the judging process as it evolves through each successive sequence and over the whole event – a real bonus for the Chief Judge.

So how does the system work?
Before the FairPlay statistical engine starts, at least four key issues must already have been correctly handled:

1. Hard Zeros:
In the CIVA world any figure that is “wrong” (i.e. not the one on the Form-A, or one that has a single error greater than 90° or some essential element missing) is called a Hard Zero. We must never impose such a harsh decision without review and consensus from the panel of judges, backed up by viewing the post-flight video where necessary, so before the paperwork leaves the judging line the Chief Judge must verify or deny the correctness of all Hard Zeros on a figure-by-figure, majority-decision basis. Unless the HZ view is dominant or the video settles the issue, the pilot must get the benefit of any doubt.

Any figure that is thus confirmed as a Hard Zero is recorded as “CHZ” on the Chief Judge’s card or Form-A, which becomes the definitive verdict on all Hard Zeros and penalties for each flight. The other judges must not subsequently revise their marks if their original opinion was different: FairPlay uses the Chief Judge’s official verdict to determine whether each judge is correct or not, and so judges the judges accordingly. For a CHZ figure, therefore, where a judge has supplied a non-HZ mark the system replaces it with a HZ; conversely, it will label as “Missing” a HZ that any judge has erroneously applied to a non-CHZ figure, and this will later receive a calculated Fitted Value replacement. In every case the change is tracked by FairPlay as part of its monitoring of judge performance.

2. Soft Zeros:
The mark 0.0 or SZ has two sources: when a judge’s downgrades in any figure reach ten, and as a result of decisions that are matters of perception such as flick-rolls (that don’t flick as required), spins (that don’t spin correctly) etc. The soft zero is a proper mark rather than a valid / invalid figure decision like the HZ, and so is always retained unchanged until later in the processing.

3. Not Seen:
If a figure isn’t seen fully or clearly enough then a judge can request the scoring system to provide an averaged mark. An “A” annotation on the judge’s Form-A or card shows where this has been requested, and these are also noted by FairPlay in the judge’s performance record.

4. Marks:
All other figures, plus the Positioning and Harmony assessments, must get a mark from 0.5 to 10.0 in the usual way.

The Chief Judge’s Role
For major championships as envisaged by CIVA, the Chief Judge is effectively a non-scoring manager concerned primarily with the CHZ verdicts and penalties summaries. It is however also possible to incorporate a full set of marks from the Chief Judge within the FairPlay system; in this case the confirmed HZs are simply the HZs from the Chief Judge’s marks. In either case the CHZs on the Chief Judge’s Form-A together with a penalties summary are the ones that guide FairPlay’s policing of the other judges’ figure marks, HZs and the entry of penalties.

Raw Marks Check-Sheets
Once the raw marks are in the computer the first active process is to provide each pilot with a Raw Marks Check-Sheet attached to the judges’ paperwork. These must show the judges’ marks and any penalties that have been applied to the flight, so that pilots can sign off their data entries as accurate and truthful or seek immediate redress from either the scorer (for keyboard accuracy), the Jury or the Chief Judge should there be doubt over any aspect of the judging itself. The ACRO sheet also provides the pilot with a simple evaluation of the average mark and the “equivalent” score for each figure.

Running the FairPlay system
When the marks have all been entered and signed off by the Jury, the FairPlay process can be run to analyse the marks and calculate the results. Although this can be done before all the pilots have flown, to provide a completely balanced service the system does need every scrap of sequence judging material from the first to last flight. The subsequent entry of more pilots’ marks may often cause small but inevitable changes, and interim score rankings must be considered as just that – best estimates prior to compilation of the final results.

The FairPlay system does its job in two quite separate consecutive moves:
Stage 1 – to assess the raw marks and create a completely balanced review of them that is essentially free from anomalies and bias.
Stage 2 – to combine the reviewed marks with the figure K-factors and construct final scores for each judge / each pilot, again checking for unwanted anomalies and bias.

FairPlay – Stage 1
This first stage is all about validating individual figure marks, and the process works generally within the following major task areas:

CHZs: The first step is to detect and record each judge’s submission of a mark where by common agreement there should have been a Hard Zero, or a Hard Zero where there should have been a mark. This is directed by the CHZ instructions on the Chief Judge’s Form-A or card.

Missing data: Where the above CHZ test fails and for any Averages requested, FairPlay labels the judge’s mark as “Missing”. All such missing data points will later be replaced by new ‘fitted’ numbers that are calculated during the FairPlay process.
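
The CHZ check and the “Missing” labelling described above can be pictured with a small Python sketch. The function name `reconcile`, the string labels and the dictionary layout are assumptions made purely for illustration; they are not the data format of the actual scoring software.

```python
HZ = "HZ"            # a judge's Hard Zero (label assumed for this sketch)
AVERAGE = "A"        # judge asked for an averaged mark
MISSING = "Missing"  # slot to be filled later by a fitted value

def reconcile(judge_marks, chz_figures):
    """Compare one judge's marks for one flight with the Chief Judge's verdict.

    judge_marks: dict of figure number -> mark (float), HZ or AVERAGE
    chz_figures: set of figure numbers confirmed as Hard Zeros (CHZ)
    Returns the reconciled marks and a list of (figure, reason) notes that
    could feed the judge's performance record.
    """
    reconciled, notes = {}, []
    for fig, mark in judge_marks.items():
        if fig in chz_figures:
            if mark != HZ:
                notes.append((fig, "missed CHZ"))
            reconciled[fig] = HZ              # confirmed Hard Zero always wins
        elif mark == HZ:
            reconciled[fig] = MISSING         # HZ given but not confirmed
            notes.append((fig, "HZ not confirmed"))
        elif mark == AVERAGE:
            reconciled[fig] = MISSING         # average requested by the judge
            notes.append((fig, "average requested"))
        else:
            reconciled[fig] = mark            # ordinary mark or Soft Zero
    return reconciled, notes

marks = {1: 7.5, 2: HZ, 3: 8.0, 4: AVERAGE, 5: 0.0}
reconciled, notes = reconcile(marks, chz_figures={3})
print(reconciled)  # {1: 7.5, 2: 'Missing', 3: 'HZ', 4: 'Missing', 5: 0.0}
```

Note that the Soft Zero on figure 5 passes through untouched, matching the rule that it is a proper mark.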

Figure Groups: Before analysing any marks it is essential to re-organise the sets of figure data so that the system always compares like with like – mixing judges’ opinions for loops with stall turns or flick-rolls with rolling circles would not be sensible. This is one of the major improvements that FairPlay offers over the old all-in-one TBLP approach. To achieve this, the figures and their associated marks for all the pilots are collated into separate analysis groups by using the K-factor and figure number as the “key”. This ensures that within each group the figures themselves are either the same or of very similar difficulty. For Free programmes each pilot’s figure K-factors can be very different, so the keys are prefixed by special identifiers called SuperFamilies that the scorer enters with the pilot’s sequence figure K-factors, to stream similar types of basic figure and some specific manoeuvres together. The overall sequence K-factors, plus those for Positioning and also Harmony if applicable, are grouped according to their K value (and SuperFamily in Frees) along with the rest of the marks.

In known or unknown sequences where each pilot flies the same set of figures with the same K-factors, the groups comprise the data for just one figure and thus all will be the same size. In Free sequences the main group boundaries are set by the SuperFamily divisions, then divided into similar K-factor sub-groups at or close to a preferred size as the range of K-factors here is likely to be much wider.
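
As a rough illustration of the grouping idea, the sketch below collates figure records under a key. The record layout, the function names and the SuperFamily labels "SF-A" and "SF-B" are hypothetical, and the further splitting of Free groups into sub-groups of a preferred size is not attempted here.

```python
from collections import defaultdict

def group_key(record, free_programme=False):
    """Build the analysis-group key for one figure record (assumed layout)."""
    if free_programme:
        # Free programmes: stream by SuperFamily first, then by K-factor
        return (record["superfamily"], record["k"])
    # Same figures for every pilot: figure number plus K-factor is enough
    return (record["figure"], record["k"])

def collate(records, free_programme=False):
    """Collect all records that share a key into one analysis group."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec, free_programme)].append(rec)
    return dict(groups)

records = [
    {"pilot": "A", "figure": 1, "k": 32, "superfamily": "SF-A"},
    {"pilot": "A", "figure": 2, "k": 17, "superfamily": "SF-B"},
    {"pilot": "B", "figure": 1, "k": 32, "superfamily": "SF-A"},
    {"pilot": "B", "figure": 2, "k": 17, "superfamily": "SF-B"},
]
groups = collate(records)
print(sorted(groups))  # [(1, 32), (2, 17)]
```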

Normalising the Style: Taking each group in turn, the process must first make an assessment of the style differences between all the judges and adjust each judge’s set of marks to a common style without changing the pilot ranking. In this context style is a measure of whether a judge gives predominantly higher or lower marks than the panel average, and whether their scatter of marks has a broader or narrower range than the other judges. After this step each judge’s overall marks average and scatter will be the same, and from this point the marks for every figure can be compared between all the judges on a balanced and meaningful basis.

This simple “normalisation” procedure uses the standard deviations and averages of the columns and rows of data within each group, and is a common preparation tool in all branches of statistical analysis. Here we exclude the Soft and Hard Zero marks and the “Missing” data from the normalisation process, and concern ourselves only with the important non-zero marks from 0.5 to 10.0.
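
A minimal sketch of this kind of style normalisation follows. It assumes that each judge's marks are shifted and scaled linearly to the panel's average mean and average standard deviation; the exact targets FairPlay uses are not stated here, so treat that choice, the function name and the use of `None` for excluded entries as assumptions.

```python
from statistics import mean, pstdev

def normalise(marks_by_judge):
    """Rescale each judge's marks to a shared mean and spread.

    marks_by_judge: dict of judge -> list of marks for one group.
    None stands for an excluded entry (Hard Zero, Soft Zero or "Missing")
    and is passed through untouched.
    """
    stats = {}
    for judge, marks in marks_by_judge.items():
        used = [m for m in marks if m is not None]
        stats[judge] = (mean(used), pstdev(used))
    target_mean = mean(s[0] for s in stats.values())
    target_sd = mean(s[1] for s in stats.values())
    out = {}
    for judge, marks in marks_by_judge.items():
        mu, sd = stats[judge]
        scale = target_sd / sd if sd else 0.0  # a judge with no spread maps to the mean
        out[judge] = [None if m is None else target_mean + (m - mu) * scale
                      for m in marks]
    return out

# J1 marks high and tight, J2 marks low and wide
panel = {"J1": [8.0, 7.0, 9.0], "J2": [6.0, 4.0, 8.0]}
for judge, marks in normalise(panel).items():
    print(judge, [round(m, 2) for m in marks])
```

Because the adjustment is a shift plus a positive scale factor, each judge's ordering of the pilots is left unchanged, which is the property the text asks for.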

Fitted Values: Within each group every normalised mark is reviewed by judge and by figure against all the other pilots’ marks for all judges and all figures, and FairPlay calculates a mirror table of “fitted” marks. Think of these as best-fit numbers: they are the marks you would expect the judge to give to exactly match the pattern of all of the available information about this figure / other judges and similar figures / all judges.
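
The formula behind the fitted values is not given in this article. Purely as an illustration of what a “mirror table” of best-fit numbers can look like, the sketch below uses a simple additive two-way fit (row average plus column average minus overall average), a textbook choice that sits naturally with analysis of variance. It is an assumption, not FairPlay's documented model, and with gaps in the table this simple averaging is only approximate.

```python
from statistics import mean

def fitted_table(table):
    """Additive two-way fit: row mean + column mean - grand mean.

    table: list of rows (one per pilot), each a list of normalised marks
    (one per judge). None marks an excluded or "Missing" entry and is
    ignored when the averages are taken.
    """
    cols = list(zip(*table))
    row_means = [mean(v for v in row if v is not None) for row in table]
    col_means = [mean(v for v in col if v is not None) for col in cols]
    grand = mean(v for row in table for v in row if v is not None)
    return [[r + c - grand for c in col_means] for r in row_means]

# A perfectly additive table is reproduced unchanged by the fit
table = [[8.0, 7.0],
         [6.0, 5.0]]
print(fitted_table(table))
```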

Mark Validation Criteria: The system must also calculate and store a set of local criteria for measuring the credibility of every normalised mark by comparison with the fitted mark that has been prepared. Each criterion is used as the basis for an “uncertainty” calculation which is the test for survival of every normalised mark.

Mark Uncertainty Tests: FairPlay now runs every normalised mark through a calculated uncertainty test, this time including the Soft Zeros – they are after all a true mark, so their survival must undergo the same test as marks from 0.5 to 10.0. The level of confidence required from this test is 97.5% or 2.24 Standard Deviations. In tech-speak this is where a statistical confidence test based on an analysis of variance shows whether a mark lies outside the allowable range that you would normally expect to see. At the 97.5% level of confidence, if a mark lies more than 2.24 standard deviations away from the average normalised mark for that figure, it is considered significantly different to what we would expect to see by chance and so is noted as an outlier - a rogue mark that probably cannot be trusted. Whenever this happens the normalised mark is deleted and the empty slot labelled “Missing”, to be later re-filled with a fitted value which is the best estimate of what a fair, unbiased mark for that judge for that figure should be.
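
The link between the 97.5% confidence level and the 2.24 figure can be checked with Python's standard library, treating the test as two-sided on a normal distribution. In the sketch below the function names are hypothetical, the deviation is measured against the fitted mark, and the scale `sd` is simply passed in, since the article does not spell out how it is estimated.

```python
from statistics import NormalDist

def z_threshold(confidence):
    """Two-sided standard-normal cut-off for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def flag_outliers(normalised, fitted, sd, confidence=0.975):
    """Return True for each mark lying further from its fitted value
    than the allowed number of standard deviations."""
    limit = z_threshold(confidence) * sd
    return [abs(n - f) > limit for n, f in zip(normalised, fitted)]

print(round(z_threshold(0.975), 2))  # 2.24
flags = flag_outliers([7.0, 5.5, 9.5], [7.2, 5.8, 7.0], sd=1.0)
print(flags)  # [False, False, True]
```

Any mark flagged `True` would be the one deleted and labelled “Missing” in the process described above.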

The 60% Rule: In a final measure of the acceptability of all the marks for each figure, if the proportion of marks set to “Missing” has exceeded 60% of the number of working judges then all of the remaining marks are considered unreliable and must be replaced by the corresponding initial set of fitted values.
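
The 60% Rule is easy to express in code. This sketch assumes one list of marks per figure with `None` standing for “Missing”; the function name is hypothetical. Integer arithmetic is used so that exactly 60% does not count as exceeding the limit.

```python
def apply_60_percent_rule(marks, fitted, working_judges):
    """Replace every mark for a figure by its fitted value when more than
    60% of the working judges have a "Missing" mark (None) for it.

    Returns the resulting marks and a flag saying whether the rule fired.
    """
    missing = sum(1 for m in marks if m is None)
    if missing * 10 > 6 * working_judges:
        return list(fitted), True
    return list(marks), False

marks = [7.0, None, None, None, None, None, 6.5]   # 5 of 7 judges missing
fitted = [7.1, 7.0, 6.9, 7.2, 7.0, 6.8, 7.0]
result, fired = apply_60_percent_rule(marks, fitted, working_judges=7)
print(fired)  # True
```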

Re-normalisation: For every group within which one or more of the normalised marks has been labelled “Missing” or the 60% Rule has been applied, the system must run the style normalisation procedure again and build a new fitted values table. This time therefore all of the first-pass anomalies will have been removed, and this second fitted values table is “final” and valid for every data position.

Replacement of the “Missing” data: Wherever the group contains a “Missing” mark, whether from a failed CHZ test, a judge’s request for an average, or a failed uncertainty test, the final Fitted Value is substituted so that the set of marks is complete and valid.

Marks re-assembly: Finally – when all groups of data have been fully processed in this first stage, all the processed figure marks are re-assembled into a single table in preparation for the calculation of sequence scores. This table is now considered to be completely free of anomalies and figure bias, and can be used as the basis for the calculation of each judge’s sequence scores for each pilot.

FairPlay – Stage 2
In the second stage each judge’s score is constructed for each pilot from the final processed table of marks above, and these scores are reviewed to ensure that any slight favouritism or subconscious bias is detected and removed.

Judge / Pilot scores: Using this new table of processed marks on a pilot-by-pilot basis, the system calculates each judge’s score – this is the total, for each pilot, of the final normalised or fitted marks multiplied by the figure K-factors.
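
In code, one judge's score for one pilot is a K-weighted sum of the processed marks. The function name and the sample numbers below are illustrative only.

```python
def judge_score(marks, k_factors):
    """Total of processed marks multiplied by the matching figure K-factors."""
    if len(marks) != len(k_factors):
        raise ValueError("one K-factor is needed for every mark")
    return sum(m * k for m, k in zip(marks, k_factors))

# Three figures with K-factors 10, 20 and 15
score = judge_score([8.0, 7.5, 9.0], [10, 20, 15])
print(score)  # 365.0
```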

Scores Normalisation: The sequence scores can now be assembled into another table with pilot rows and judge columns, and the usual style normalisation procedure carried out to ensure that the opinion of each judge has equal importance.

Fitted Score Values: A separate table of fitted scores is calculated to mirror the normalised score data, each being what you would ideally expect the judge to provide in the light of all other information.

Score Validation Criteria: For each normalised score a set of local criteria is established for testing the acceptability of the normalised score when compared to the fitted score.

Score Uncertainty Tests: An uncertainty test is now calculated using every pair of normalised and fitted scores. The confidence requirement for sequence score acceptance is 90.0% or 1.65 Standard Deviations – this is a little less stringent than the earlier criteria for the figure marks because FairPlay has already refined the judges’ marks by replacing untrustworthy judgements with fairer fitted values.

Fitted Score Value substitution: On any occasion where this confidence test fails, the fitted score is substituted for the normalised score. When this final process is finished we can be confident that we have a fair and realistic score from each judge for every pilot.

Where the number of pilots exceeds 30: At major events where there are more than 30 pilots there will be one final twist. Because it is likely that the variations between judges for pilots with the most figure downgrades (the pilots at the bottom of the final ranking) will be more extreme and thus more susceptible to inconsistency between judges, the higher likelihood of FP substitutions in this area could have undue influence over the way in which FairPlay treats other figure marks and scores further up the ranking. Where there are more than 30 pilots therefore the whole results process should be re-run but excluding those pilots whose final score is less than 60% for “Q”, known and Free sequences and 50% for unknown sequences. The final rankings for the higher scoring pilots will now be constructed from the re-run FairPlay scores, and the original FairPlay scores used for the pilots excluded from the second iteration.
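
The selection of pilots for the second pass can be sketched as follows. It assumes each pilot's final score is available as a percentage of the maximum; the function name and the sequence-type labels are hypothetical.

```python
def split_for_rerun(percent_scores, sequence_type):
    """Decide which pilots take part in the second FairPlay pass.

    percent_scores: dict of pilot -> final score as a percentage
    sequence_type:  "Q", "known", "free" or "unknown" (labels assumed)
    Returns (pilots kept for the re-run, pilots who keep their first-pass score).
    """
    if len(percent_scores) <= 30:
        return sorted(percent_scores), []      # no second pass needed
    cutoff = 50.0 if sequence_type == "unknown" else 60.0
    keep = sorted(p for p, s in percent_scores.items() if s >= cutoff)
    drop = sorted(p for p, s in percent_scores.items() if s < cutoff)
    return keep, drop

# 31 pilots scoring 40%, 41%, ... 70%
scores = {f"P{i:02d}": 40.0 + i for i in range(31)}
keep, drop = split_for_rerun(scores, "known")
print(len(keep), len(drop))  # 11 20
```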

The FairPlay process is now complete.
All the marks and sequence scores have been thoroughly examined for anomalies and bias, and any effect of these two influences has been sought out and carefully minimised. It is never possible to remove all trace of these disturbing factors, but this is our best estimate of truth as seen from the judging line …. although in reality it can never be “actual truth”.

Applying the Penalties:
The judges’ final scores are now averaged for each pilot, and any penalties the pilot has incurred subtracted to give each pilot’s final score.
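
As a worked example of this last step, with made-up numbers and a hypothetical function name:

```python
from statistics import mean

def final_score(judge_scores, penalties=0.0):
    """Average the judges' processed scores for a pilot, then subtract penalties."""
    return mean(judge_scores) - penalties

total = final_score([3650.0, 3700.0, 3600.0], penalties=30.0)
print(total)  # 3620.0
```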

Output of Results
At this stage these results can be used to drive three quite separate functions:

i) FP Processed Marks Check-Sheets
A key part of the ‘open’ philosophy of FairPlay is to provide each pilot with a printed assessment of how the process treated his or her marks. Here the raw marks, CHZ revisions, normalised marks and fitted value substitutions, judges’ scores and fitted score substitutions, and the penalties are all laid out in systematic order so that pilots can see where FP has made its changes and the reasoning behind each one. In the UK we have again added some simple arithmetic columns so that pilots can easily see where their marks gave them the greatest (or least!) value, and these are the basis for our web publication of each pilot’s marks via hyper-links from the contest results. Everyone can see and assess what happened to each pilot – and why. A complete ‘decode’ for these sheets is quite important too; the one we use on our website provides a short explanation of each feature so that pilots can clearly follow the annotations that the system has provided.

ii) The Final Results
The final scores after penalties have been deducted can all be copied into a composite results table, and the pilots ranked by the sum of their sequence scores to produce the contest results.

iii) Team Results
A key duty at major championships is to collate the results from each nation’s pilots in order to produce a ranking based on the performance of each national team. This requires a relatively simple grouping and subsequent sorting of the cumulated scores, but can become quite complex if the teams are of mixed male and female pilots – particularly when the split amongst the sexes depends on the numbers within each sub-group for various nations.

So – where has FairPlay got us?
Let’s get back to basics for a moment. Our primary aim is to have an objective and reliable method for handling all of the judging line data to provide the best possible ranking of all the pilots in each sequence. Then, those authorised must have access to a complete audit trail of what the system has done so that, if necessary, they can unravel where and why it has made changes.

Our ideal system therefore:
1. Must find and deal effectively with the complete spread of normal and occasionally irrational judgements and personal styles from typical judging panels, then tell us what it has done and why. Demonstrably FairPlay does this very well.
2. Must come with the very best credentials, from real-world practical working statisticians who know what they’re doing. This is where we started, tick the box.
3. Can NEVER be a substitute for lack of basic judging capability; that whole area will always be down to us and training. It can however show us where further targeted education may be valuable.
4. Typically, the more data we give it the better it should work, and with the FairPlay system this is certainly the case. Whereas at major events we can see that the uncertainty tests are quick to identify even moderate irregularities, with fewer pilots and judges it has a diminishing degree of impact – an essential quality.

This is a complicated process requiring a huge number of iterative calculations for every sequence, but that’s just fine: it is after all exactly what we have computers for. Our experience confirms that FairPlay provides a good set of methodical and practical protocols to nail the inevitable disagreements, errors and style differences in national and international aerobatic judging, that it produces reliable and sensible results, and that it has the stability to be developed even further to improve its ability to provide sensitive and educative feedback concerning judging standards.

Judging the Judges
A constant requirement at all contests is the need to review the judges’ consistency in applying the rules to arrive at the right mark, and a sensitive analysis of judges’ decisions is a valuable tool in the drive to offer targeted advice to improve all our judging performances where FP shows this may be needed. This monitoring process is important not only to CIVA but to all national judging regimes.

There are now two separate Judge Analysis formats in the ACRO software, both aimed at illuminating the finer detail of how each judge has performed in relation to his or her peers.

1. Judge Sequence Analysis:
Each judge has a personal report that shows the complete process from raw marks to final FairPlay processed data for every pilot. Marks and scores that have failed the FP confidence test are shown boxed in red with the reason for their interception, and where the score relates to a pilot of the same country both the pilot and the score are boxed in blue. There’s a summary of the FP revised marks at the foot of the table, together with a histogram showing the judge’s use of marks in blue as compared to the overall post-FP spread of marks from the whole panel, which is shown in green.

This report is of necessity a pretty daunting set of numbers, but with a little perseverance it becomes quite simple to see not only where any significant groups of discrepancies lie but also how the format of the raw marking from each judge compares to the final version as seen at the conclusion of the FairPlay process. Of the two reports this is the one with the more detailed view, able to give the Chief Judge an intimate review of each judge’s style. It can also be presented to each judge in isolation, as it gives away nothing “private” about the other judges save their collective influences after they have been smoothed and merged together by the system.

2. Overall Judge Analysis:
In a second analysis that consolidates data from the whole panel into a single report, the data from the FairPlay system is used to present an overall picture of the judging panel by detailing the relative strengths of each judge in all important areas. The report can be constructed for a single sequence or for any group of sequences, in which case the data is cumulative. Here the emphasis is on comparing the outputs from all the judges in a way that allows the Chief Judge to recognise those aspects of their performance that do not fit well with the rest of the panel, or that might indicate a degree of bias relating to the nationality of each judge compared to that of the pilots.

The analysis is divided into three main sections – the use of marks, anomalies seen by the FairPlay system at the figure level, and finally anomalies at the sequence level. The judges are also ranked left to right across the table by their calculated Ranking Index (RI), cumulated across several sequences if the report has been made this way, so the skill displayed by each judge in getting close to the final FairPlay result is clear to see.

When both of these reports are considered together a detailed and insightful view can quickly be made of the capabilities and style of the whole panel of judges at an event, representing a considerable step forward from the previous set of indexes that failed to provide the depth and scope that this far more open analysis can give.

The Ranking Index (RI):
A good deal of thought and effort has been spent deriving a single number from all the data to act as a rule-of-thumb in grading each member of the panel of judges within each sequence. The FairPlay developers’ view is that above all it is the ability of each judge to rank all of the pilots in the correct order by getting their scores right that is most important, and the new Ranking Index (RI) is derived from these two factors. The ranking of all pilots by each judge is really the key, but the relative scores are important too, to show that the judge has indeed considered each pilot’s flight in a fair and balanced way.

Nick Buckenham
February 2008