Bisecting the kernel with git: some pitfalls

I like to call myself a git “expert”, but I failed pretty badly a few weeks ago when I needed to bisect the kernel source code to track down a bug.

The general git bisect process is this: you have two known commits, one good and one bad, and you search for the first bad commit by repeatedly splitting the history between them in two. Let's say you have the following changelog (not the actual changelog I was debugging, just an example):

45cc63002ec6e [UPSTREAM] remoteproc: qcom: enable in-kernel PD mapper
a8dd6dafe94a0 [UPSTREAM] soc: qcom: add pd-mapper implementation
ce3a6459fe7ad [UPSTREAM] soc: qcom: pdr: extract PDR message marshalling data
84a820a8b238e [UPSTREAM] soc: qcom: pdr: fix parsing of domains lists
18c193d52e85c [UPSTREAM] soc: qcom: pdr: protect locator_addr with the main mutex
b58dfb8ef0388 wifi: ath10k: make in-order rx amsdu buffers persistent
adf83ce902253 arm64: dts: qcom: sdm660-xiaomi-lavender: Split by display
024ea90587931 dt-bindings: arm: qcom: Add lavender variants
3676184086106 arm64: dts: qcom: sdm660-xiaomi-lavender: Enable display
147b94f8204f0 drivers: gpu: drm: panel: Add BOE TD4320
0e10b0b1b4143 drm/panel: simple: Add Tianma NT36672a panel used in Xiaomi Redmi Note 6 Pro
0ba269e4c525a drm/panel: simple: Add Tianma NT36672a panel used in Xiaomi Redmi Note 7
c272bba226304 iommu/arm-smmu-qcom: Add SDM630/660 mdp5 compatibles for identity

You know that c272bba226304 is a good commit and 45cc63002ec6e is a bad commit, so you start git bisect, and it gives you a commit roughly halfway between them, adf83ce902253, to test. If that commit is good, the search continues in the newer half (the commits above it in the list); if it is bad, it continues in the older half, and so on. This works pretty well for a simple linear history.

But it somehow got messy when I tried to bisect between two kernel releases, e.g. between a 5.15.x and a 5.16.x kernel. Here is the log of a real bisection, where I tested almost 13 revisions without thinking much about it, and the result was totally bogus.
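To make the halving concrete, here is a toy Python model of the search over the example changelog. It is only a sketch of the idea, not how git implements bisect. The culprit commit is picked arbitrarily for illustration, and bisect_linear and is_bad are made-up helper names.

```python
# Toy model of bisection over a linear history; not git's implementation.
COMMITS = [  # oldest first (reverse of the changelog order shown above)
    "c272bba226304",
    "0ba269e4c525a",
    "0e10b0b1b4143",
    "147b94f8204f0",
    "3676184086106",
    "024ea90587931",
    "adf83ce902253",
    "b58dfb8ef0388",
    "18c193d52e85c",
    "84a820a8b238e",
    "ce3a6459fe7ad",
    "a8dd6dafe94a0",
    "45cc63002ec6e",
]

# Hypothetical culprit, chosen only for this illustration.
CULPRIT = "84a820a8b238e"


def is_bad(commit):
    """Stand-in for 'build, boot and test': bad from the culprit onwards."""
    return COMMITS.index(commit) >= COMMITS.index(CULPRIT)


def bisect_linear(commits, is_bad):
    """Return (first_bad_commit, tested_commits).

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(commits[mid])
        if is_bad(commits[mid]):
            bad = mid   # culprit is at mid or older
        else:
            good = mid  # culprit is newer than mid
    return commits[bad], tested


first_bad, tested = bisect_linear(COMMITS, is_bad)
print("first bad commit:", first_bad)
print("tested:", tested)
```

In this model the first commit tested is adf83ce902253, the same midpoint as in the example, and the culprit is found after four tests. With the real tool the equivalent is git bisect start, git bisect bad 45cc63002ec6e, git bisect good c272bba226304, and then marking each commit git checks out with git bisect good or git bisect bad.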

git bisect start
# status: waiting for both good and bad commits
# bad: [ddcc536f6f35e2589ca24cc41c053931b1817674] Fix for gcc12 compile issues in ubcmd-util.h
git bisect bad ddcc536f6f35e2589ca24cc41c053931b1817674
# status: waiting for good commit(s), bad commit known
# good: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
git bisect good 8bb7eca972ad531c9b149c0a51ab43a417385813
# good: [84882cf72cd774cf16fd338bdbf00f69ac9f9194] Revert "net: avoid double accounting for pure zerocopy skbs"
git bisect good 84882cf72cd774cf16fd338bdbf00f69ac9f9194
# good: [6f2b76a4a384e05ac8d3349831f29dff5de1e1e2] Merge tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next
git bisect good 6f2b76a4a384e05ac8d3349831f29dff5de1e1e2
# good: [79ef0c00142519bc34e1341447f3797436cc48bf] Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect good 79ef0c00142519bc34e1341447f3797436cc48bf
# bad: [52cf891d8dbd7592261fa30f373410b97f22b76c] Merge tag 'kvm-riscv-5.16-2' of https://github.com/kvm-riscv/linux into HEAD
git bisect bad 52cf891d8dbd7592261fa30f373410b97f22b76c
# bad: [509bfe3d979672cd69c318d520420cf95b474fd9] KVM: X86: Cache CR3 in prev_roots when PCID is disabled
git bisect bad 509bfe3d979672cd69c318d520420cf95b474fd9
# bad: [e710c5f6be0eb36f8f2e98efbc02f1b31021c29d] KVM: x86/mmu: Pass the memslot around via struct kvm_page_fault
git bisect bad e710c5f6be0eb36f8f2e98efbc02f1b31021c29d
# bad: [f1c4a88c41ea04a7036409a37e17cf22a8dbe9e2] KVM: X86: Don't unsync pagetables when speculative
git bisect bad f1c4a88c41ea04a7036409a37e17cf22a8dbe9e2
# bad: [e8f65b9bb4832028cdbd5927ddb67f66c6ccdd27] KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create
git bisect bad e8f65b9bb4832028cdbd5927ddb67f66c6ccdd27
# bad: [baff59ccdc657d290be51b95b38ebe5de40036b4] KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()
git bisect bad baff59ccdc657d290be51b95b38ebe5de40036b4
# bad: [11476d277e06bbd7e1ba3315e0cfc78f529be9e2] KVM: use vma_pages() helper
git bisect bad 11476d277e06bbd7e1ba3315e0cfc78f529be9e2
# bad: [feb3162f9debbbeee5b00ad5a4e776f826dd9161] KVM: nVMX: Reset vmxon_ptr upon VMXOFF emulation.
git bisect bad feb3162f9debbbeee5b00ad5a4e776f826dd9161
# bad: [64c785082c21a88d3c25c2b95f16fe29eb5ee862] KVM: nVMX: Use INVALID_GPA for pointers used in nVMX.
git bisect bad 64c785082c21a88d3c25c2b95f16fe29eb5ee862
# first bad commit: [64c785082c21a88d3c25c2b95f16fe29eb5ee862] KVM: nVMX: Use INVALID_GPA for pointers used in nVMX.

I started with ddcc536f6f as the bad commit and 8bb7eca972a (v5.15) as the good commit, and it went pretty much downhill from there. The commit reported as the first bad commit was 64c785082c2, but what I was debugging had nothing to do with the KVM subsystem; heck, it was not even the right hardware! So what happened here?

Edit

After I published this on Mastodon, kernel developer Thorsten Leemhuis pointed out that the issue I was having had nothing to do with merged history and that something else must have gone wrong. They also pointed me to the official documentation on kernel.org for bisecting the kernel. I have therefore removed what I thought was the root cause of my problem and changed this post to describe what worked for me instead.

Thanks @kernellogger for the advice!

The answer to this question lies in how the Linux kernel development workflow works. If you look at the kernel history with git log --oneline --merges --graph --decorate, you will soon realize that it is not linear at all.

There are multiple levels of merging involved. Very roughly:

  • A developer prepares a branch and sends it to a maintainer
  • The maintainer either applies the patches to their tree or merges that branch into their tree
  • The maintainer may then send it on to a subsystem maintainer (e.g. KVM, PCI, arm etc.)
  • The subsystem maintainer merges all these branches into their own tree
  • Finally it gets sent to Linus/Greg to be merged into mainline

Given the complexity and size of the Linux kernel codebase, such a workflow is a necessity; no other workflow would scale to the number of changes that get mainlined.

I repeated the bisection a few more times, hoping I had simply made a human error and marked some revision as bad when it should have been good, or vice versa, but without much luck. So it was back to the git-bisect documentation.

There I found this:

--first-parent

Follow only the first parent commit upon seeing a merge commit. In detecting regressions introduced through the merging of a branch, the merge commit will be identified as introduction of the bug and its ancestors will be ignored. This option is particularly useful in avoiding false positives when a merged branch contained broken or non-buildable commits, but the merge itself was OK.

What this basically does is skip the commits that came in through a merge and use only the commits on the first-parent line, mostly merge commits, as test points. Using it already gave me pretty good results: instead of the 13-14 revisions I tested previously, I had to test only 6, and I found the merge commit that introduced the issue. Now there were two options: bisect again in a similar way, this time between the merge base and the merge commit, or go through the merged changes and find the bad one myself.

I decided to go the first route and ran bisect again between the merge base and the merge head. This time I had to test 3 more revisions, and the bad commit was finally found. Reverting it confirmed that the regression was fixed!

Git can be a pretty simple tool as well as a pretty advanced Swiss Army knife. Most of the time the git documentation covers what you want to do quite extensively, but sometimes you need to browse the internet looking for a solution too. Everyone has those days! 🥲
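Here is a toy Python model of what restricting the search to first parents means. It is a sketch under assumptions, not git's algorithm: the commit graph, all commit names, the culprit and the helper functions are invented for illustration. The only thing taken from real git is the flag itself, passed as git bisect start --first-parent.

```python
# Toy commit graph: each commit maps to its parents, first parent first.
# All names are made up for this illustration.
PARENTS = {
    "base": [],
    "a1": ["base"],
    "a2": ["a1"],
    "merge-a": ["base", "a2"],      # mainline merges topic branch a
    "b1": ["base"],
    "b2": ["b1"],
    "b3": ["b2"],
    "merge-b": ["merge-a", "b3"],   # mainline merges topic branch b
    "c1": ["merge-a"],
    "merge-c": ["merge-b", "c1"],
    "d1": ["merge-b"],
    "d2": ["d1"],
    "merge-d": ["merge-c", "d2"],
}

CULPRIT = "b2"  # hypothetical commit that introduced the regression


def ancestors(commit):
    """Return the commit itself plus everything reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def is_bad(commit):
    """A tree is bad once the culprit is part of its history."""
    return CULPRIT in ancestors(commit)


def first_parent_chain(good, bad):
    """Commits from good to bad following only first parents, oldest first.

    Assumes good lies on the first-parent line of bad.
    """
    chain = [bad]
    while chain[-1] != good:
        chain.append(PARENTS[chain[-1]][0])
    chain.reverse()
    return chain


def bisect_chain(chain, is_bad):
    """Binary search over a chain whose ends are known good and known bad."""
    good, bad = 0, len(chain) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(chain[mid])
        if is_bad(chain[mid]):
            bad = mid
        else:
            good = mid
    return chain[bad], tested


chain = first_parent_chain("base", "merge-d")
first_bad, tested = bisect_chain(chain, is_bad)
print("first-parent chain:", chain)
print("first bad commit:", first_bad)
print("tested:", tested)
```

In this model the search never visits b1, b2 or b3. It only tests points on the mainline and blames merge-b, the merge that brought the culprit in, which mirrors the "merge commit will be identified as introduction of the bug" wording in the documentation.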
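The second round can be modelled the same way. The sketch below is again a toy with an invented graph, invented commit names and made-up helpers, and it assumes the regression comes from a commit on the merged branch rather than from the merge itself. It walks back from the merged branch head to the merge base and bisects only that stretch.

```python
# Toy commit graph: each commit maps to its parents, first parent first.
# All names are made up for this illustration.
PARENTS = {
    "base": [],
    "a1": ["base"],
    "a2": ["a1"],
    "merge-a": ["base", "a2"],
    "b1": ["base"],
    "b2": ["b1"],
    "b3": ["b2"],
    "merge-b": ["merge-a", "b3"],  # the merge blamed by the first round
}

CULPRIT = "b2"  # hypothetical commit that introduced the regression


def ancestors(commit):
    """Return the commit itself plus everything reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def is_bad(commit):
    """A tree is bad once the culprit is part of its history."""
    return CULPRIT in ancestors(commit)


def branch_chain(merge):
    """Commits from the merge base up to the merged branch head, oldest first.

    Assumes the merged branch is itself linear.
    """
    mainline, branch_head = PARENTS[merge]
    on_mainline = ancestors(mainline)
    chain = [branch_head]
    # Walk back until we reach a commit the mainline side already contains;
    # in this toy graph that commit is the merge base.
    while chain[-1] not in on_mainline:
        chain.append(PARENTS[chain[-1]][0])
    chain.reverse()
    return chain


def bisect_chain(chain, is_bad):
    """Binary search over a chain whose ends are known good and known bad."""
    good, bad = 0, len(chain) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(chain[mid])
        if is_bad(chain[mid]):
            bad = mid
        else:
            good = mid
    return chain[bad], tested


chain = branch_chain("merge-b")
first_bad, tested = bisect_chain(chain, is_bad)
print("chain to bisect:", chain)
print("first bad commit:", first_bad)
print("tested:", tested)
```

With real git, the two parents of a merge commit can be written as <merge>^1 and <merge>^2, and git merge-base <merge>^1 <merge>^2 prints the merge base, so those are the natural good and bad endpoints for a second bisect run.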