The heroes we deserve
You may be aware that openSUSE Leap 42.2 is now in the release candidate stage, and there’s a lot of activity aimed at squashing those pesky bugs before they hit the final release. One particular bug proved to be quite tough to fix, and it was only solved thanks to the “heroes” mentioned in the title. This is the history of the bug.
The report
October 6th, 2016 - A bug is reported against Plasma, describing a hard freeze of Plasma when using the Nouveau driver, but not with the closed NVIDIA blob. Although the effect is deleterious to Plasma and not to other desktop environments, there is evidence that the issue is in the driver itself. However, there are only partial indications, and no conclusive proof.
The problem is that none of the current KDE team members has access to an NVIDIA card, so it’s hard to determine what is actually going on. After thinking it over for a while, I decided it was time to call in the pros. And in KDE, Martin Graesslin of KWin fame is the best bet when graphics and KWin interactions are involved. He suggested getting a backtrace of the freezes and crashes to determine exactly what was happening. At the same time, Antonio Larrosa from the KDE team tried to get hold of a test system to investigate the cause.
Antonio eventually managed to reproduce the problem with a specific NVIDIA card and Nouveau, and his initial results pointed at issues in the interactions between the Nouveau driver and KWin itself. Martin, being a nice person, also subscribed to the report, and once the backtraces came in he was able to find a solution to the riddle: when using OpenGL, Mesa waited for a buffer and in turn blocked KWin. The net result was an apparent freeze of the workspace when logging in.
Patches had been proposed to fix the issue, but according to the upstream Nouveau developers, they only made matters worse (by causing instability).
As an aside, Nouveau, despite the heroic efforts of its developers, still has several issues with apparently “normal” workflows: for example, any application using QWebEngine will crash on Nouveau because, while the driver does not work well with multi-threaded rendering, the Blink engine uses different threads even when Qt is using single-threaded rendering.
Once the problem was found, the 5 eurocent question was: what can we do to fix the situation?
The hunt for a solution
One major problem with this issue was that not all NVIDIA cards were affected. Only specific models exhibited the problem, which meant that blanket-disabling OpenGL for KWin when using Nouveau was too restrictive. But at the same time, the only environment affected was Plasma. The situation was extremely dire for the default desktop in openSUSE.
But two of today’s heroes did not give up. Martin and Antonio sat around a virtual table and tried to work out a solution. Martin suggested using the same mechanism that KWin normally uses at startup to determine whether OpenGL is “unsafe”, disabling it if any problems arise. It didn’t work in this specific case only because the freeze occurred when rendering started, that is, past this checkpoint.
The discussion was fruitful. Among the several hypotheses mentioned, Fabian Vogt, also from the openSUSE KDE team, thought of a “dead man’s switch”: KWin would get killed and restarted if a freeze occurred, with OpenGL disabled after the restart. That was enough for Antonio and Martin to come up with a strategy: a timer would check whether KWin had frozen during rendering. If the timer went off, KWin would get killed and restarted automatically with OpenGL disabled (more technically, with the “OpenGL unsafe protection” activated), and could then continue without freezes. Antonio posted his patch for review, and that is where we meet another hero of the day, David “d_ed” Edmundson. During the patch review, he asked Antonio what kind of card exhibited the issue, and promptly acted to get one to run tests himself.
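To make the “dead man’s switch” idea more concrete, here is a minimal Python sketch of the pattern. It is not the actual KWin patch (which is C++ and not reproduced in this post): the names `FreezeWatchdog`, `render_with_protection`, the `state` dictionary and the timeout values are all made up for illustration. In the real scenario the frozen render call would never return, so the handler would have to kill and restart the process; here it only records the decision so that the sketch stays runnable.

```python
import threading
import time


class FreezeWatchdog:
    """Hypothetical dead man's switch: calls `on_freeze` unless disarmed in time."""

    def __init__(self, timeout_s, on_freeze):
        self._timeout_s = timeout_s
        self._on_freeze = on_freeze
        self._timer = None

    def __enter__(self):
        # Arm the timer just before the risky operation starts.
        self._timer = threading.Timer(self._timeout_s, self._on_freeze)
        self._timer.daemon = True
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # The operation finished (or raised): disarm the timer.
        self._timer.cancel()
        return False


def render_with_protection(render_frame, state, timeout_s=0.2):
    """Run `render_frame` under the watchdog, recording a freeze in `state`."""

    def on_freeze():
        # Assumption: a real compositor would persist this flag and then
        # kill/restart itself; this sketch only records the decision.
        state["opengl_unsafe"] = True
        state["restart_requested"] = True

    with FreezeWatchdog(timeout_s, on_freeze):
        render_frame()


if __name__ == "__main__":
    ok_state = {}
    render_with_protection(lambda: None, ok_state)
    print("fast frame:", ok_state)

    slow_state = {}
    render_with_protection(lambda: time.sleep(0.5), slow_state, timeout_s=0.1)
    print("slow frame:", slow_state)
```

The key design point is that the timer is armed only around the step that can hang, so a healthy system never pays more than the cost of starting and cancelling a timer.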
Patches went back and forth for a number of days, scrapping one solution after the other, until Martin was finally able to accept the final revision, which was merged by Antonio into the Plasma/5.8 branch of kwin (meaning, everyone will benefit from it). Fabian then proceeded to submit these patches to openSUSE Leap and to openSUSE Tumbleweed.
As the final icing on the cake, Antonio was able to come up with a patch to QWebEngine to disable the GPU if Nouveau was detected, preventing crashes at the price of reduced performance (and adding two environment variables to force or disable the behavior, respectively).
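The decision logic of such a workaround can be sketched in a few lines of Python. This is an illustration only, not the actual QWebEngine patch: the two environment variable names below are invented for this example (the real ones are not given here), and the rule that the "skip" variable wins when both are set is an arbitrary choice.

```python
import os

# Hypothetical variable names, made up for this sketch.
FORCE_VAR = "EXAMPLE_FORCE_GPU_WORKAROUND"
SKIP_VAR = "EXAMPLE_SKIP_GPU_WORKAROUND"


def should_disable_gpu(driver_name, env=None):
    """Return True if GPU use should be disabled for this driver."""
    env = os.environ if env is None else env
    if env.get(SKIP_VAR):
        # The user explicitly opted out of the workaround.
        return False
    if env.get(FORCE_VAR):
        # The user explicitly asked for the workaround on any driver.
        return True
    # Default: apply the workaround only when the driver is detected.
    return driver_name.strip().lower() == "nouveau"
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.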
The bottom line
What can I say: upstream-downstream collaboration is truly awesome, and even more so when such a difficult bug is tackled and fixed. The way my fellow KDE team members acted is truly commendable, and so is the behavior of upstream KDE (despite the false “they don’t listen” mantra), which helped and offered assistance in getting a proper solution out.
So if you ever meet Antonio, David, Fabian, and Martin, please offer them a beverage of their choice. They’re the heroes Free Software deserves.
Bottom note
Other noteworthy people need to be mentioned here due to their involvement:
- Dominique Leuenberger and Ludwig Nussel, the Tumbleweed and Leap release managers respectively, for keeping up with their jobs (that is, ensuring that awesome software is released properly and on time);
- The SUSE X11 developers, for their assistance on the Mesa side of things;
- The openSUSE community for bug reporting and testing, or this bug would’ve never been discovered.