Posted on August 14, 2019 by Petar-Krešimir Marić
Screening programmers is different from screening other employees, and we learned that the hard way.
First, you will almost always want to hire an already good programmer. Why not hire an average one and then train them up? Years of experience and training don’t guarantee that somebody will become a great programmer. The large majority of great programmers love programming and have done it from an early age. In that sense, programmers are similar to musicians. Would you ever hire a mediocre 30-year-old musician with the hope that they will improve? Absolutely not—if they were really into music, they would already be great.
Second, since the output of programming is a working program, the testing of programmers can be mostly automated. And the great thing about automated testing is that it can find really good outliers. For example, one of the top competitors at the data-science-challenge platform Kaggle.com is a 16-year-old high schooler.
Third, productivity varies enormously among programmers. We are not sure why, but good programmers write code that is shorter, more elegant, faster, and less buggy, and they write it ten times quicker than bad programmers.
Most people who apply for a programming job can’t solve even the simplest coding task. This effect has been observed for a long time. Imran Ghory wrote in 2007, “I’ve come to discover that people who struggle to code don’t just struggle on big problems, or even smallish problems (i.e., write a implementation of a linked list). They struggle with tiny problems.” He proceeded to develop the famous FizzBuzz question. I don’t recommend using FizzBuzz today, as even programmers who have been living under a rock have heard about this task. But I can confirm that the effect is real, because at TestDome we accidentally bumped into a similar question, which is even shorter than FizzBuzz.
We discovered our FizzBuzz by accident while creating a user-interface tutorial for candidates. We provided them with a complete template and a formula that works in some cases. All they needed to do was correct two minor bugs in the formula. Take a look at our average-method problem:
Fix the bugs.
public static double Average(int a, int b)
{
    return a + b / 2;
}
Can you find both bugs in the allotted time of five minutes? We didn’t want it to be a trick question, so we also provided reasons why the formula failed:
These hints make it obvious that operator priority makes the calculation wrong (primary school math), and that dividing an integer by another integer performs integer division (this is the same in C++, Java, or C#). But even if a candidate can’t find the bugs in the current code, they could write their own. A good programmer should be able to write an average method in less than 30 seconds.
To our great surprise, a third of people were not able to come up with any of the correct solutions, such as “return (a+b)/2.0;”, “return (a+b)/2d;”, or “return (double)(a+b)/2;”. And our FizzBuzz is just one line of code.
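To make the two bugs concrete, here is a minimal sketch ported to Python. It is an illustration, not the TestDome question itself: Python's `/` already returns a float, so `//` is used as a stand-in for C#'s int-by-int division (the two agree for non-negative operands), and the function names are made up for this example.

```python
def average_buggy(a: int, b: int):
    # Bug 1: division binds tighter than addition, so this is a + (b // 2).
    # Bug 2: // drops the fractional part, like int / int in C#.
    return a + b // 2


def average_half_fixed(a: int, b: int):
    # Parentheses fix the precedence, but the division still truncates.
    return (a + b) // 2


def average_fixed(a: int, b: int) -> float:
    # Counterpart of (a + b) / 2.0 in C#: floating-point division of the sum.
    return (a + b) / 2
```

For a = 1 and b = 2, the three versions return 2, 1, and 1.5 respectively, so only the last one is correct.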
That is why writing simple code should be the first step of your screening process. Some of the people who couldn’t solve this task had nice-looking résumés listing experience in cloud computing, multi-tier architectures, and other buzzwords. Résumé screening would provide completely wrong information.
However, despite all of the reasons presented above, it has taken two decades for the idea of automated programmer testing to catch on. The main reason is the complexity of automated testing. Because of that complexity, you’re unlikely to be developing a testing platform yourself, but using one of the existing platforms, such as Codility, HackerRank, or my own—TestDome. To be honest, exactly which platform you pick is not so important—it’s how you test. Here’s a walkthrough for programmer testing, created using TestDome but replicable on any of the testing platforms. The examples are for a .NET Web Developer job interview.
The first test is short, and part of the candidate application process. The key is to focus on just the two or three most important job requirements that can be automatically tested.
The first question asks candidates to call a few methods with proper exception handling. The second asks for a method that modifies the page after a button click. Both questions are easy, but require candidates to write at least some code.
Although the job ad clearly stated the requirements, out of 415 candidates who applied, only 140 were able to solve both questions (a candidate pass rate of 34%). This pass rate seems low, but it’s typical for both us and our customers.
It is only after the screening test that we proceed to résumé screening. We have predetermined criteria for what we search for, which reduces our biases, improves prediction, and saves time.
After we have a smaller pool of candidates, we pat them on the back, saying that we like their résumé and that their score on the screening test was good. But since we have many more candidates, we need another test. We divide necessary coding skills into discrete tasks, using short questions on specific topics. Take a look at our Detailed Web Developer test:
A few things about how the testing is organized:
- The test is 90 minutes long, which still takes up less of a candidate’s time than coming in for an interview.
- The C# skills are tested in two distinct areas: object-oriented programming (OOP), and algorithms and data structures. The OOP question asks candidates to implement a fictional IBird interface, while the algorithm question asks them to calculate where a grasshopper would be after N steps.
- The CSS skill is tested with the Ad question, which asks to style a specific <div> element.
- All questions are automatically scored.
When we used this test, we set the minimum pass score to 78%. This allows a candidate to flop one question and still be invited for an interview. In the end, only 20% of invited candidates passed (28 of the 140 that made it to this round).
The next step was conducting Skype interviews with candidates, recorded and reviewed by another staff member. We ended up hiring a candidate who didn’t have the top test score, but scored highly in the interview. Technical expertise is just one factor, but it is the easiest and cheapest to test.
How Testing Platforms Work
Let’s start with automated checking of answers. How is that implemented? There are two standard ways:
- Text input and output. The candidate writes a program that reads standard input and writes to standard output. Text output is compared to the correct output. This works in the same way for all programming languages, so test cases can be language-agnostic.
- Unit testing. The candidate needs to perform just the task (e.g., fixing a bug), and doesn’t need to write code for reading input and output. The solution is tested using a unit testing framework, specific for each language.
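The two approaches can be sketched side by side. This is a minimal illustration under stated assumptions, not how any particular platform is implemented: `CANDIDATE_SOURCE`, `candidate_average`, and both checker functions are hypothetical names, and the candidate program is run with the local Python interpreter.

```python
import subprocess
import sys

# Hypothetical submission for the text approach: reads "a b", prints the average.
CANDIDATE_SOURCE = "a, b = map(int, input().split())\nprint((a + b) / 2)\n"


def check_text_io(source: str, stdin_text: str, expected_stdout: str) -> bool:
    """Run the program as a separate process and compare its printed output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text, capture_output=True, text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# Hypothetical submission for the unit-test approach: just the function.
def candidate_average(a: int, b: int) -> float:
    return (a + b) / 2


def check_unit(func) -> bool:
    """Call the candidate's function directly with known inputs."""
    cases = [((1, 2), 1.5), ((4, 4), 4.0), ((-3, 3), 0.0)]
    return all(func(*args) == expected for args, expected in cases)
```

In the text version the candidate also has to parse input and format output; in the unit-test version the checker calls the function directly, which is why the task itself can be smaller.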
Most testing platforms use the text input and output approach, but we opted to use unit tests. It’s easier for candidates and allows us to test many more things. Because we have HTML/CSS unit tests running in headless Chrome, we can ask questions like this:
In addition to simply testing if a program’s output is valid, we can also test algorithmic complexity, performance, and memory consumption.
Algorithmic complexity is denoted with Big O. O(1) means that the program is guaranteed to run in constant time. O(N) means the maximum program running time is proportional to length N of input. There is also logarithmic complexity O(log N), quadratic complexity O(N²), and many others. The lower the complexity, the better. Determining exact O requires manual examination, but there is a hack. Testing platforms measure complexity indirectly, by measuring performance. Here are the steps:
- Limit the maximum running time of a program, e.g., 2 seconds.
- Code a solution that has the lowest possible O.
- Increase N (size of input) until your optimal solution runs close to, but still within, the time limit, e.g., in 1.6 seconds.
- Candidate solutions that are not optimal exceed the maximum running time for that N, and thus, fail the test.
In other words, we test if the answer has the optimal O or not. An example is our Binary Search Tree question, where one third of the points are given for performance.
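The steps above can be sketched in a few lines. This is a simplified illustration, not TestDome's evaluator: the duplicate-detection task, the function names, and the values of `N` and `LIMIT_SECONDS` are assumptions chosen so the example runs quickly. A real platform would tune N against its reference solution and would kill the candidate's process at the limit instead of letting it finish.

```python
import time
from typing import Callable, Sequence


def has_duplicates_optimal(values: Sequence[int]) -> bool:
    # O(N): a set keeps only one copy of each value.
    return len(set(values)) != len(values)


def has_duplicates_quadratic(values: Sequence[int]) -> bool:
    # O(N^2): compares every pair of elements.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def passes_time_limit(solution: Callable, data, expected, limit_seconds: float) -> bool:
    """Pass only if the answer is correct and was produced within the limit."""
    start = time.perf_counter()
    result = solution(data)
    elapsed = time.perf_counter() - start
    return result == expected and elapsed <= limit_seconds


N = 6000                     # input size, large enough to separate O(N) from O(N^2)
LIMIT_SECONDS = 0.1          # scaled-down time limit for this sketch
worst_case = list(range(N))  # no duplicates, so every pair must be examined
```

Both functions return the right answer for `worst_case`, but only the O(N) one is expected to finish inside the limit, which is the whole point of the trick: correctness is checked directly, complexity through the clock.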
Memory consumption can be measured in the same way. If we limit memory consumption of a candidate’s code to 12 MB, and we know that the optimal solution takes 10 MB, then candidate solutions that take more than 20% above the optimal memory will fail.
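A memory limit can be sketched the same way. The example below uses Python's `tracemalloc`, which only tracks allocations made by the Python interpreter; it is a stand-in for the process-level limits a testing platform would more likely enforce. The task, the function names, and the 1 MB limit are assumptions made for illustration.

```python
import tracemalloc


def peak_memory_bytes(func, *args) -> int:
    """Return the peak number of bytes allocated by Python while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def sum_of_squares_lean(n: int) -> int:
    # Generator expression: only one square exists at a time.
    return sum(i * i for i in range(n))


def sum_of_squares_wasteful(n: int) -> int:
    # Builds the full list of squares before summing it.
    squares = [i * i for i in range(n)]
    return sum(squares)


N = 200_000
MEMORY_LIMIT_BYTES = 1_000_000  # hypothetical limit, set just above the lean solution's needs


def passes_memory_limit(func) -> bool:
    return peak_memory_bytes(func, N) <= MEMORY_LIMIT_BYTES
```

Both functions compute the same result, but the list-building version allocates memory proportional to N and is expected to exceed the limit.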
There are other aspects of programming that would be nice to check, but are unfortunately a bit subjective.
Code length should be as short as possible, while maintaining readability. Two common measures people use are lines of code (LOC) and byte length of code. They are both simple, but problematic. The LOC measure gives an advantage to programmers who put multiple statements in the same line, which decreases readability. Similarly, byte length of source code gives an advantage to programmers who name variables “a” instead of “average”, which is bad practice. If you want to compare the code length of two solutions, there is a better way: compare the sizes of zipped source code instead. Repeating patterns of whitespace are compressed, and repeating the variable name “average” doesn’t take much more space than “a.” The popular Computer Language Benchmarks Game uses this measure.
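The effect of compression on naming is easy to check. The sketch below uses `zlib` as the compressor, which is an assumption (any deflate-style tool behaves similarly), and a made-up snippet in which the same variable appears six times under two different names.

```python
import zlib

# Hypothetical snippet; {name} is replaced by the variable name under test.
TEMPLATE = (
    "{name} = sum(values) / len(values)\n"
    "print({name})\n"
    "if {name} > limit:\n"
    "    limit = {name}\n"
    "history.append({name})\n"
    "return {name}\n"
)


def compressed_size(source: str) -> int:
    """Size in bytes of the source after zlib compression at the highest level."""
    return len(zlib.compress(source.encode("utf-8"), 9))


terse = TEMPLATE.format(name="a")
readable = TEMPLATE.format(name="average")

# Extra bytes the descriptive name costs, before and after compression.
raw_penalty = len(readable) - len(terse)
compressed_penalty = compressed_size(readable) - compressed_size(terse)
```

The raw penalty is 36 bytes (six extra characters, six occurrences). After compression the gap shrinks, because every repeat of “average” is stored as a short back-reference to its first occurrence.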
Readability of code is hard to judge. For example, functional code or code that uses a certain design pattern will be hard to read for some, but completely understandable to others. Still, checking readability of code makes sense, because people in the company need to be able to read the code employees produce.
The Evaluator code at TestDome scores program validity, performance, and memory consumption. We believe that these are the most important—a program that works is better than a shorter/more readable program that doesn’t work. If you’re not satisfied with the conclusions of the automatic tests, you can manually change a candidate’s pass/fail status.
There is one more thing that companies use to score candidates: real-time coding. Personally, I don’t know why watching someone code would give better insights than merely seeing the end result. In the case of a whiteboard interview, candidates who talk a lot often appear smarter than quiet candidates. In the case of an online test, a candidate who doesn’t code anything in the first five minutes (while thinking about the problem) would appear worse than a candidate who writes some partially correct code in the same time. My skepticism aside, we have implemented a submissions-timeline feature in TestDome, as many users requested it.
Now, we need to address one common concern regarding online tests—the fear that candidates will cheat. This cheating comes in two forms.
Finding answers online. This is a valid concern if you are a testing company or test thousands of people with the same questions. If you test fewer than that, don’t worry; it takes time for questions to leak. The majority of candidates will not share questions online because of the fear that they are going to be identified and sued. As we are a testing company, it is a concern for us. After we had more than 500 paying customers, some candidates started posting our questions online. We combatted this with automated leak detection. A script periodically queries Google for unique parts of our questions. If a question has leaked, we get a notification. You can do the same for your questions by setting automated Google Alerts.
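The leak-detection idea can be sketched as follows. This is not TestDome's script: the phrase-selection heuristic, the function names, and the `allowed_hosts` default are assumptions, and the `search` argument is a placeholder for whatever search client you have access to (no specific search API is assumed here).

```python
import re
from typing import Callable, Iterable, List


def distinctive_phrases(question_text: str, count: int = 2, min_words: int = 6) -> List[str]:
    """Pick the longest sentences of a question as exact-match search phrases."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", question_text) if s.strip()]
    long_enough = [s for s in sentences if len(s.split()) >= min_words]
    return sorted(long_enough, key=len, reverse=True)[:count]


def find_leaks(
    question_text: str,
    search: Callable[[str], Iterable[str]],  # hypothetical: query string -> result URLs
    allowed_hosts: Iterable[str] = ("testdome.com",),
) -> List[str]:
    """Return result URLs that quote the question and are not on an allowed host."""
    hosts = tuple(allowed_hosts)
    leaked: List[str] = []
    for phrase in distinctive_phrases(question_text):
        for url in search(f'"{phrase}"'):  # quotes request an exact-phrase match
            if not any(host in url for host in hosts) and url not in leaked:
                leaked.append(url)
    return leaked
```

Run on a schedule, a non-empty result from `find_leaks` would trigger the notification. Long sentences are used because short ones are more likely to match unrelated pages.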
Another person taking the test. This happens rarely with individual candidates. However, a few years ago, we tried to use oDesk (now Upwork) to find developers for low-level tasks. We were impressed with the quality of the candidates who applied, only to discover later that they had no intention of doing the work. After passing the screening process, they delegated the work to an inexperienced person, while the expert would be busy applying for further jobs to outsource. Having such an “interview expert” is a common tactic for outsourcing agencies in developing countries.
A way to combat this is to verify identity using webcam proctoring. There are three types of proctoring: live, record-and-review, and automated. At TestDome, we currently offer record-and-review proctoring, where pictures of a candidate are taken periodically:
For HR departments, the easiest way to hire is to have the same process for every applicant, and to not have separate tools for developers, students, salespeople, etc. But if you screen developers like every other employee, you are wasting your time and not getting the best talent. Programmer screening is very advanced, with 20 years of experience behind tools that work better than a phone or a whiteboard. Live coding, automated checking of correctness and memory consumption, coding replay, code analysis, and online proctoring are all available on various platforms where you can even enter your own custom questions. And let’s not forget that such testing is asynchronous, meaning candidates can take the test whenever works best for their schedule and avoid traveling to your office.
If some of your developer applicants look resentful when faced with a whiteboard interview, maybe their dissatisfaction is a bit justified. Setting up an automated developer screening process takes upfront effort, but in the end saves time and works better for both you and the candidates.