One thing makes my work life miserable.
In some domains of knowledge, people acknowledge the difficulty of the domain and seek expertise. Nobody tells rocket scientists how to build rockets. Pre-employment screening is, on the other hand, similar to politics. Everybody thinks they would be the greatest president ever, although they have zero qualifications for it. Similarly, at every tech event I go to, I meet at least one person claiming their company has a great employment procedure. But when they start explaining, I try really hard not to roll my eyes.
Here are some of the proposed methods, and my reasons for disagreeing:
|Proposed screening method||Counter-argument|
|“We only hire people who finished university X or Y. Don’t get me wrong, other universities are good, but we need the best.”||So, you wouldn’t hire a young Bill Gates (dropped out of Harvard University), or Steve Jobs (dropped out of Reed College)? I understand, Gates and Jobs are good, but your company needs the best.|
|“We only hire people who have X+ years of experience. We need people who know stuff.”||So you wouldn’t hire Nick D’Aloisio? Although he was only 17 years old when he sold his app to Yahoo for 30 million dollars, I think that proves he knows stuff.|
|“We only hire people who are team players.”||In the screening process, this usually means discriminating against introverts. A friend of mine was able to cheat on psychometric tests every single time; it is actually easy to spot such questions. He passed as an extrovert and got a job at a bank.|
Why are companies confident in their screening process?
In my opinion, that is because they don’t measure the long-term effects of a screening method. Because of that, there is no feedback loop that can correct the mistakes.
There are two types of mistakes that happen in the screening process:
- A false positive is when a candidate who was great at testing does poorly on the job. It can happen because a test measures the wrong thing, and not the skill important for the job.
For example, call centers sometimes screen customer service representatives by their typing skills. Five months later, new employees quit the call center job, not because they couldn’t type fast enough, but because they lost the motivation to do the job.
- A false negative is when a candidate who would be great at the job fails the screening process. For example, good software developers often get rejected because they are introverted and shy. Introversion influences their psychometric test results, and they also don’t like to brag in the résumé or the interview.
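To make the two definitions concrete, here is a minimal Python sketch that labels hiring outcomes. The function name `classify` and the sample outcomes are made up for illustration; in practice the "good on the job" value is rarely known for candidates who were rejected.

```python
from collections import Counter


def classify(passed_screening: bool, good_on_job: bool) -> str:
    """Label one hiring outcome relative to the screening decision."""
    if passed_screening and good_on_job:
        return "true positive"
    if passed_screening and not good_on_job:
        return "false positive"  # great at testing, poor on the job
    if not passed_screening and good_on_job:
        return "false negative"  # would have been great, but was rejected
    return "true negative"


# Hypothetical outcomes: (passed_screening, good_on_job)
outcomes = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(classify(passed, good) for passed, good in outcomes)

if __name__ == "__main__":
    for label, count in counts.items():
        print(f"{label}: {count}")
```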
How to correct for those mistakes? It is possible, but it is hard.
To measure false positives, a company needs to:
- A. Record in a database all the nitty-gritty details of the hiring procedure: résumé score, test scores, scores from different rounds of the interview, opinions from recruiters and hiring managers.
- B. Follow up a few years after the hire and correlate actual job performance with the original hiring scores. Companies are always surprised that some scores have a positive, some a zero, and some a negative correlation with actual job performance.
Most companies fail at A, because they have an informal hiring process. If recruiters and managers don’t write down their reasons for a pass or a reject, those reasons are lost forever. But almost all companies fail at B, because they don’t invest the time to follow up a few years after the hire.
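The follow-up in step B can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`resume`, `test`, `interview`, `performance`) and all the numbers are invented, and a real analysis would need far more hires than five.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical records: hiring scores written down at hiring time (step A),
# plus a performance rating collected a few years later (step B).
hires = [
    {"resume": 8, "test": 60, "interview": 3, "performance": 3.1},
    {"resume": 5, "test": 85, "interview": 4, "performance": 4.2},
    {"resume": 9, "test": 70, "interview": 2, "performance": 3.4},
    {"resume": 6, "test": 90, "interview": 5, "performance": 4.6},
    {"resume": 7, "test": 50, "interview": 3, "performance": 2.8},
]

performance = [h["performance"] for h in hires]
validities = {
    name: pearson([h[name] for h in hires], performance)
    for name in ("resume", "test", "interview")
}

if __name__ == "__main__":
    for name, r in validities.items():
        print(f"{name}: correlation with job performance = {r:+.2f}")
```

A score whose correlation comes out near zero or negative is a candidate for removal from the hiring procedure.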
One company that does false positive analysis correctly is Google. They found some surprising results. Brain-teasers, like the question “How many golf balls can you fit into an airplane?”, are a complete waste of time. In the words of Google’s VP, “they serve primarily to make the interviewer feel smart.” Another surprising discovery was that G.P.A.s are worthless as a criterion for hiring, except for brand-new college grads. Those results don’t seem logical, but they are correct. Data doesn’t lie.
Measuring false negatives is even harder. Similar to the false positives procedure, you need to record all the details of the hiring procedure, and follow up a few years later. But the candidates you need to follow up with were rejected, so they are working at another company! You can’t just call another company and ask them to fill in a 20-question form about that employee. None of the companies I know does that kind of research.
On the other hand, confirmation bias is exactly why companies are so convinced of the black magic of their assessment process. Out of the noise of all possible signals, they choose the ones that seem logical to them. Everyone who gets hired passes the criteria, so those criteria become a part of the company culture. The company grows and makes more money. It is hard for an employee in such a company to be critical and suggest that maybe they would be even more successful if they hired differently.
There is a way to improve this, but it requires people to look at testing science.
The first thing you need to know for a testing procedure is predictive validity:
Predictive validity is a correlation between test scores and future job performance.
For example, a predictive validity of 0.5 means that 25% of the variance in job performance (the correlation squared) is explained by differences in test scores. As I explained with false positives and false negatives, predictive validity is very hard to calculate. That is why many test creators mostly advertise another measure, which is a bit misleading:
Reliability is a measure of a test’s consistency, when taken multiple times.
Reliability is in the range [0, 1], where 0 means that the test gives completely random results, and 1 means that the same person would always get the same score on the same test. Reliability is easy to calculate; the simplest method is to compare the score on one half of the test with the score on the second half of the test (split-half method).
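A minimal sketch of the split-half method in Python, assuming each candidate's answers are stored as a row of per-question scores (the data below is invented). The Spearman-Brown step is a standard correction from test theory that is not part of the description above: it adjusts for the fact that each half is only half as long as the full test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Correlate each person's score on the first half with the second half."""
    half = len(item_scores[0]) // 2
    first = [sum(row[:half]) for row in item_scores]
    second = [sum(row[half:]) for row in item_scores]
    return pearson(first, second)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical data: one row per test taker, one column per question (1 = correct)
item_scores = [
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0],
]

if __name__ == "__main__":
    r_half = split_half_reliability(item_scores)
    print(f"split-half correlation: {r_half:.2f}")
    print(f"Spearman-Brown estimate: {spearman_brown(r_half):.2f}")
```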
Let’s give an example of this. Measuring a person’s height has a higher reliability than measuring weight. That is because a person’s weight changes during a day and from month to month, while height is more constant over the years. Does it mean height is better for accountant screening? No, because both height and weight have zero predictive validity with future accountant performance (they probably have predictive validity for basketball performance, but not for accounting).
However, testing companies often give reliability in their test specifications. People without training in statistics often mistakenly think it is how “reliable” a test is for predicting job success. It is not! The only link between validity and reliability is that reliability limits the maximum validity. If the reliability is zero (random), then validity can maximally be zero (no correlation whatsoever).
Real human resources example: Myers-Briggs is used by 89 of the Fortune 100 companies to decide on the future of their candidates. M-B proponents claim a reliability between 0.61 and 0.87. Note that, if read as a retest agreement rate, an average reliability of 0.74 would mean that in 26% of cases the same person on the same test will be assigned to a different personality type. But it gets even worse, as more than 20 studies concluded that M-B has nonexistent or low construct validity. Construct validity means correlation with other psychometric tests. Proof of occupational validity (for job screening) is nonexistent. As one professor put it, “do not treat the archetype scores of M-B as anything more than Astrology.”
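The ceiling that reliability puts on validity can be made explicit. In classical test theory, the correlation between a test and a criterion cannot exceed the square root of the product of their reliabilities. The sketch below assumes that bound; the function name `max_validity` and the default of a perfectly reliable performance measure are my own simplifications.

```python
from math import sqrt


def max_validity(test_reliability, criterion_reliability=1.0):
    """Upper bound on validity, given the reliabilities of test and criterion.

    Assumes the classical test theory bound r_xy <= sqrt(r_xx * r_yy).
    The default treats job performance as measured without error,
    which is optimistic.
    """
    return sqrt(test_reliability * criterion_reliability)


if __name__ == "__main__":
    for reliability in (0.0, 0.25, 0.74, 1.0):
        print(f"reliability {reliability:.2f} -> validity at most {max_validity(reliability):.2f}")
```

Note that this is only a ceiling. A perfectly reliable test, like a height measurement, can still have zero validity for the job.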
But what does the science say on the quality of methods?
A test’s occupational validity can be different for different occupations. As we said, height is a good predictor for basketball players but bad for accountants. Unfortunately, in 2018 there is no good public data on a per-occupation basis. But there is data on general method validity aggregated over all occupations.
Here is the big insight, based on Schmidt and Hunter’s 1998 research, a meta-analysis of thousands of studies spanning more than eight decades:
As before, there are some very surprising results.
Years of education and interests, both common in résumés, have quite a low validity of 0.1.
Years of job experience (another résumé favourite) has a validity of only 0.18.
Reference checks and unstructured interviews (what most companies do) are slightly better, with validities of 0.26 and 0.38, respectively.
If that doesn’t work, then what does? Research has found only three methods have validity above 0.5:
- Structured employment interviews – 0.51 validity – The same interviewer needs to ask different candidates the same questions, in the same order, write down the answers (or record the entire session), and give marks to the answers. That way different candidates can transparently be compared with the same criteria. However, if an interview steers into a chat about a candidate’s résumé, that is an unstructured interview (0.38 validity).
- General mental ability (GMA) tests – 0.51 validity – Measure the ability of a candidate to solve generic problems, like numerical reasoning, verbal reasoning or problem solving. They don’t guarantee that a candidate has the required skills, but a candidate with the mental capability can probably be trained.
- Work sample tests – 0.54 validity – The basic idea is: to test if a candidate will be good at work, give them a sample of the actual work. Quite simple, but this has the highest validity of all methods! Note that work sample tests need to be representative of the real work, and need to be transparently scored.
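The structured interview in the list above is mostly a matter of bookkeeping, which a small Python sketch can show. Everything here is hypothetical: the questions, the 1 to 5 marking scale, and the class name `StructuredInterview` are placeholders for whatever a company actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical question list: every candidate gets these, in this order.
QUESTIONS = (
    "Tell me about a production bug you diagnosed.",
    "How would you test a function that parses dates?",
    "Describe a design decision you later reversed.",
)
MIN_MARK, MAX_MARK = 1, 5  # assumed marking scale


@dataclass
class StructuredInterview:
    candidate: str
    answers: list = field(default_factory=list)  # (written answer, mark) per question

    def record(self, answer: str, mark: int) -> None:
        """Write down the answer to the next question together with its mark."""
        if len(self.answers) == len(QUESTIONS):
            raise ValueError("all questions already answered")
        if not MIN_MARK <= mark <= MAX_MARK:
            raise ValueError("mark out of range")
        self.answers.append((answer, mark))

    def total(self) -> int:
        """Total mark; only defined once every question has been asked."""
        if len(self.answers) != len(QUESTIONS):
            raise ValueError("interview incomplete; totals would not be comparable")
        return sum(mark for _, mark in self.answers)


if __name__ == "__main__":
    alice = StructuredInterview("Alice")
    bob = StructuredInterview("Bob")
    for mark_a, mark_b in ((4, 3), (5, 3), (3, 4)):
        alice.record("notes taken during the interview", mark_a)
        bob.record("notes taken during the interview", mark_b)
    for interview in sorted((alice, bob), key=lambda i: i.total(), reverse=True):
        print(interview.candidate, interview.total())
```

Refusing to compute a total for an incomplete interview is deliberate: once an interview drifts away from the fixed questions, the scores are no longer comparable.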
We can draw an important conclusion from the above chart:
No single method of screening has high enough validity, so you need to combine multiple methods for better validity.
For example, in 2014, Google did three rounds of off-site interviews, and then flew a candidate in for another five interviews! By having so many rounds with different people, they minimize the probability of a false positive. The disadvantage of their approach is that they get false negatives, for example when a candidate fails round two because they were nervous. Google can still afford that approach, as they have millions of candidates applying.
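The trade-off in the Google example can be put into rough numbers. The sketch below assumes that interview rounds are independent and that a candidate must pass every round, which is a simplification; the per-round pass probabilities are invented, not Google's.

```python
def pass_all_rounds(p_pass_one_round, rounds):
    """Probability of passing every round, assuming independent rounds."""
    return p_pass_one_round ** rounds


# Hypothetical per-round pass probabilities
P_WEAK_PASSES = 0.30    # a poor fit slips through a single round
P_STRONG_PASSES = 0.90  # a strong candidate survives a single round

if __name__ == "__main__":
    for rounds in (1, 3, 8):
        false_positive = pass_all_rounds(P_WEAK_PASSES, rounds)
        false_negative = 1 - pass_all_rounds(P_STRONG_PASSES, rounds)
        print(
            f"{rounds} round(s): weak candidate hired {false_positive:.4f}, "
            f"strong candidate rejected {false_negative:.4f}"
        )
```

Under these assumptions, every extra round pushes the false positive rate toward zero while the false negative rate keeps climbing, which is only affordable with a very large pool of applicants.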
As I said, the above validity numbers are for general hiring. The situation changes for different occupations, so let’s examine that.
GMA tests (also known as IQ tests) can be used across occupations, which is both a blessing and a curse. Employers need to prove that a specific general ability is both needed for a job and non-discriminatory. For example, testing nurses with a numerical reasoning test is a case for a lawsuit. IQ tests have a general validity between 0.2 and 0.6, but how much “IQ” is needed for a specific job? In a 1971 US Supreme Court case, Duke Power Company’s use of IQ tests was ruled illegal, because it discriminated against minorities and there was no tangible link to actual job performance. As I said, the data on per-occupation validity is still missing, so US companies generally avoid IQ tests.
Work sample tests have another problem. Companies need a different test for every profession: accounting, programming, customer support, etc. In addition to that, HR personnel often can’t evaluate the answers; they need to ask an expert in the company to review the candidate’s work samples. But despite these drawbacks, work sample testing has the best validity. Even better, the validity of work sample testing increases with the complexity of the job. You can train a rookie to do customer support in a few months. But to become a software architect, a surgeon, or a musician takes years of practice. It doesn’t matter if the candidate has the aptitude for that profession. Companies need someone with the expertise now.
Let me end the article with a story that will maybe convince you of the importance of transparent screening.
Up until 1970, US and European orchestras were proud of their hiring procedure. Aspiring musicians would get on stage in front of the committee. All candidates were asked to play the same songs, to demonstrate their musical ability.
What could be more fair—all candidates were given the same task, and isn’t playing an instrument work sample testing? However, one thing was lacking. The selection process was not really transparent, as committee members were doing subjective evaluation of one’s performance. That didn’t seem to be a problem, as committee members were the crème de la crème of society: conductors, orchestra directors and professional classical musicians.
However, some women were suggesting that orchestras don’t hire female musicians and proposed that auditions should be held behind a thick curtain. Some orchestras started doing blind auditions and results soon followed. They showed that male musicians were still more likely to pass. That was expected: it was preposterous to claim that the refined, liberal, artistic segment of society would discriminate by race, nationality or gender.
However, women were still frustrated and suggested that for a blind test to be really blind, one thing was missing: a carpet. Female musicians usually wore high heels, and the sound of heels on the stage gave them away before they hit the first note. If the carpet was not present, female candidates took their shoes off, or even tromped around in hiking shoes to sound “male”. The probability of female musicians passing an audition increased by 50% and the probability of hiring increased severalfold. In the 1970s about 10% of orchestra members were female, compared to about 35% in the 1990s.
Now, the above data shows correlation, which doesn’t imply causation. Maybe the percentage of female musicians increased because of other factors that changed in those decades. But, if having a proper blind audition requires only a curtain and a carpet, why not do it? A musician friend of mine surprised me when she explained such blind auditions are a standard in both the US and Europe. I kept thinking that, if she was in another profession, the first interview request would be “Tell me more about yourself?”, not the best way to demonstrate your talent.
In my modest opinion, the lesson of the story is:
Because differentiating between candidates is hard, you need to minimize subjective bias, otherwise it will dominate the hiring process.
If you are not doing transparent, reproducible, work sample testing in your company, you are destined to have a lot of false positives and false negatives in your hiring process. Even if that is not a big problem as your company is floating in money, it is a bit unfair to judge other people on irrational factors. There is little excuse, as today there are many available work-sample tests, GMA tests, and templates for structured interviews, that don’t require a lot of effort from either candidates or companies.
Posted on November 27, 2018 by zeljkos