What Killed the Job Script? User Support is on the Case

This intrepid team of high-performance computing analysts has healed many a researcher’s headache.

 

Daniel Stubbs and Sofia Fassi Fehri offer frontline support.

“A lot of the time, the problems we’re dealing with are almost like a detective story,” says Daniel Stubbs, frontline support coordinator at Calcul Québec.

Graduate students vying for a degree might not see the romance, however. Their research involves huge amounts of data and complex calculations that might take a laptop months to manage, running constantly, fans blazing, and useless for anything else. Calcul Québec’s supercomputers can do the same work in days, but that doesn’t make the process any less stressful for the students.

“When your real job is being a biologist or a chemist, a physicist, a geographer,” says Stubbs, “using a supercomputer can be a pretty mysterious and obscure affair.” Understandably, researchers sometimes make mistakes, get stumped by an error message, or find that their job script runs slower than expected.

“The goal is to create a bridge between researchers and high-performance computing,” says Maxime Boissonneault, research support team lead at Calcul Québec and the Digital Research Alliance of Canada. 

That bridge is made up of Daniel and Sofia Fassi Fehri on frontline support, with Maxime supporting them and taking on training, documentation, software installation and other tasks. Daniel and Sofia resolve “tickets” (named for the ticket management software they use to receive help requests), though the users might call them catastrophes. These problems range from the very simple to long games of “whodunnit?”

“There are different kinds of tickets,” says Sofia. “Some for opening or renewing accounts […] recurrent problems with always the same answer.” Other times a single issue can take from hours to days. “You have to do a lot of investigating, asking people, finding out what really happened, to finally understand what killed the job,” says Daniel.

Maxime Boissonneault, Research Services Team Lead

A job script describes the software needed for a user’s project and the arguments or parameters to use when it runs. Because Calcul Québec has far more users than resources, users submit a job script rather than using the hardware interactively, the way they would by double-clicking on an icon. A scheduler program runs the job whenever the resources become available, and the user hopes the results will be good enough for an article in a peer-reviewed journal. Frequently, however, at least in the initial stages, the job won’t run, the software won’t install, or the results are off. It just won’t work.

The problem can be as simple as a missing comma. “When you’ve seen a thousand job scripts,” says Daniel, “it jumps out at you.” But some cases require real sleuthing: finding the right software package, consulting colleagues or cluster administrators, running elaborate diagnostics or contacting the software vendor. “It’s often like a puzzle,” says Sofia.

“If there’s any request we can’t answer,” she says, “we find the person who can.” 

Daniel and Sofia are mathematicians and programmers, but some tickets require specialized knowledge in fields like chemistry or astrophysics. Thankfully, they have 200 or so colleagues across Québec and the rest of Canada who might have experience with a type of software or background knowledge in a certain field. Maxime facilitates communication with these specialists. It becomes a team effort because a single problem can be too much for one person to handle.

“If you get the right person,” says Daniel, “you can solve the problem in ten minutes. But if you don’t, you can spend days barking up the wrong tree.” 


The Perks of the Job

For Sofia, working for Calcul Québec allows her to follow her passion. Last spring, she was recruited to train with the national women’s wheelchair basketball team. “The national team takes up a lot of my time but [Calcul Québec] manages to make it work so that I can do both.”

All three analysts agree, however, that one of the biggest perks is the contact with users.

“We’re close to researchers from all disciplines,” says Maxime. 

“[The problems] are often quite interesting,” says Daniel. “They come from backgrounds I’m not familiar with at all.”

“I enjoy interacting with people from the humanities,” he adds. “There are not that many of them but they often ask very interesting questions. The way they perceive a supercomputer is often quite different than how I do or perhaps most of our users do.”

Sofia has even discovered an interest in bioinformatics. “I saw a lot of tickets in that field and started looking into it,” she says. She now studies the subject part-time.

Ultimately, as Maxime puts it, it’s about being “close to the researchers and contributing to the advancement of science.”