抓虫日记之 kgdb 与 softlockup(1)

DDD  2009年11月09日 星期一 19:33 | 3780次浏览 | 1条评论

softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume

A: BUG现象
intel_mpcbl0040-standard-glibc_small has the simillar problem
# BUG: soft lockup - CPU#3 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]

Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0394844>] EFLAGS: 00000283 CPU: 3
EIP is at _spin_unlock_irq+0xb/0x28
EAX: 00000001 EBX: 0000007d ECX: 000a9a7e EDX: f78603f4
ESI: f7860000 EDI: 00000102 EBP: f7867e98 ESP: f7867e98
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 4c42b6e0 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c012e30f>] run_timer_softirq+0x142/0x1b4
 [<c0135113>] ? delayed_work_timer_fn+0x0/0x32
 [<c012a667>] __do_softirq+0xb6/0x19c
 [<c012a78c>] do_softirq+0x3f/0x45
 [<c013b55e>] ? hrtimer_interrupt+0x0/0x1bc
 [<c012a942>] irq_exit+0x48/0x55
 [<c0111e8c>] smp_apic_timer_interrupt+0xdf/0x129
 [<c0103cd8>] apic_timer_interrupt+0x28/0x30
 [<c0109452>] ? mwait_idle+0x32/0x37
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0390263>] start_secondary+0x156/0x1c3
 =======================
BUG: soft lockup - CPU#0 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]

Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 0
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: 00000004 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: c04bf300 EBP: c0475fa4 ESP: c0475fa4
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b80e62a8 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0384f32>] rest_init+0x62/0x70
 =======================
BUG: soft lockup - CPU#1 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]

Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 1
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: c04bb018 ECX: 00000000 EDX: 00000000
ESI: 00000001 EDI: c04bf300 EBP: f7851f7c ESP: f7851f7c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 4c39f6a0 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0390263>] start_secondary+0x156/0x1c3
 =======================
BUG: soft lockup - CPU#2 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]

Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 2
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: c04bb018 ECX: 00000000 EDX: 00000000
ESI: 00000002 EDI: c04bf300 EBP: f785df7c ESP: f785df7c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b807f000 CR3: 37bf6000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c01021ab>] cpu_idle+0x95/0x11a
 [<c0390263>] start_secondary+0x156/0x1c3
 =======================


B: BUG重现步骤&现场分析
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the
time from hardware, such as from TSC.  In this configuration kgdb will
report a softlock warning messages on resuming or detaching from a
debug session.

Sequence of events in the problem case:

1) "cpu sched clock" and "hardware time" are at 100 sec prior
   to a call to kgdb_handle_exception()

2) Debugger waits in kgdb_handle_exception() for 80 seconds and on
   exit the following is called
    ...
    touch_softlockup_watchdog()
       --> __raw_get_cpu_var(touch_timestamp) = 0;


3) "cpu sched clock" = 100s (it was not updated, because the interrupt
   was disabled in kgdb) but the "hardware time" = 180 sec

4) The first timer interrupt after resuming from kgdb_handle_exception
   updates the watchdog from the "cpu sched clock"

update_process_times()
{
    ...
    run_local_timers()
          --> softlockup_tick()
              --> check (touch_timestamp == 0)
                (it is "YES" here, we have set "touch_timestamp = 0"
                at kgdb)
                --> __touch_softlockup_watchdog()
              ***(A)--> reset "touch_timestamp" to "get_timestamp()"
                     (Here, the "touch_timestamp" will still be set to
                      100s.)
    ...

    scheduler_tick()
      ***(B)--> sched_clock_tick()
               (update "cpu sched clock" to "hardware time" = 180s)
    ...
}


5) The Second timer interrupt handler appears to have a large jump and
   trips the softlockup warning.

update_process_times()
{
    ...
    run_local_timers()
        --> softlockup_tick()
           --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s
           --> printk "soft lockup error messages"
    ...
}

note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)"

Why "touch_timestamp" is 100 sec, instead of 180 sec?

With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of
get_timestamp() is:

get_timestamp(this_cpu)
      -->cpu_clock(this_cpu)
         -->sched_clock_cpu(this_cpu)
            -->__update_sched_clock(sched_clock_data, now)

C: BUG触发原因

The __update_sched_clock() function uses the GTOD tick value to create
a window to normalize the "now" values.  So if "now" values is too big
for sched_clock_data, it will be ignored.


D: BUG解决方法
The fix is to invoke sched_clock_tick() to update "cpu sched clock" in
order to recover from this state.  This is done by introducing the
function touch_softlockup_watchdog_sync(), which allows kgdb to
request that the sched clock is updated when the watchdog thread runs
the first time after a resume from kgdb.

E: BUG简要总结
这里的软锁警告并不是由真正的软锁产生的,从使用角度来说可以忽略,但由于是kgdb引起的,对于kgdb来讲,所以还是有责任去解决的.
这个问题是因kgdb关闭了所有中断,导致sched_clock_data存储的时间没有与硬件时间同步造成的.
所以解决方法就是在kgdb退出后立刻更新sched_clock_data.

F: others:
This fix was placed at:
http://git.kernel.org/?p=linux/kernel/git/jwessel/linux-2.6-kgdb.git;a=commit;h=ab3f0db8b34a34cbf84d94a7c0df342354291edf

评论

我的评论:

发表评论

请 登录 后发表评论。还没有在Zeuux哲思注册吗?现在 注册 !
DDD

回复 DDD  2010年02月08日 星期一 10:38

经过6个月的持久战争,这个比较难看的patch已经集成到内核里面的.

commit id:
d6ad3e286d2c075a60b9f11075a2c55aeeeca2ad

0条回复

暂时没有评论

Zeuux © 2024

京ICP备05028076号