DDD 2009年11月09日 星期一 19:33 | 3780次浏览 | 1条评论
softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume
A: BUG现象
intel_mpcbl0040-standard-glibc_small has the simillar problem
# BUG: soft lockup - CPU#3 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0394844>] EFLAGS: 00000283 CPU: 3
EIP is at _spin_unlock_irq+0xb/0x28
EAX: 00000001 EBX: 0000007d ECX: 000a9a7e EDX: f78603f4
ESI: f7860000 EDI: 00000102 EBP: f7867e98 ESP: f7867e98
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 4c42b6e0 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c012e30f>] run_timer_softirq+0x142/0x1b4
[<c0135113>] ? delayed_work_timer_fn+0x0/0x32
[<c012a667>] __do_softirq+0xb6/0x19c
[<c012a78c>] do_softirq+0x3f/0x45
[<c013b55e>] ? hrtimer_interrupt+0x0/0x1bc
[<c012a942>] irq_exit+0x48/0x55
[<c0111e8c>] smp_apic_timer_interrupt+0xdf/0x129
[<c0103cd8>] apic_timer_interrupt+0x28/0x30
[<c0109452>] ? mwait_idle+0x32/0x37
[<c01021ab>] cpu_idle+0x95/0x11a
[<c0390263>] start_secondary+0x156/0x1c3
=======================
BUG: soft lockup - CPU#0 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 0
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: 00000004 ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: c04bf300 EBP: c0475fa4 ESP: c0475fa4
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b80e62a8 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c01021ab>] cpu_idle+0x95/0x11a
[<c0384f32>] rest_init+0x62/0x70
=======================
BUG: soft lockup - CPU#1 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 1
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: c04bb018 ECX: 00000000 EDX: 00000000
ESI: 00000001 EDI: c04bf300 EBP: f7851f7c ESP: f7851f7c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 4c39f6a0 CR3: 004c0000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c01021ab>] cpu_idle+0x95/0x11a
[<c0390263>] start_secondary+0x156/0x1c3
=======================
BUG: soft lockup - CPU#2 stuck for 78s! [swapper:0]
Modules linked in: kgdboe ipv6 [last unloaded: kgdboe]
Pid: 0, comm: swapper Not tainted (2.6.27.23-WR3.0ae_standard #1)
EIP: 0060:[<c0109452>] EFLAGS: 00000246 CPU: 2
EIP is at mwait_idle+0x32/0x37
EAX: 00000000 EBX: c04bb018 ECX: 00000000 EDX: 00000000
ESI: 00000002 EDI: c04bf300 EBP: f785df7c ESP: f785df7c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: b807f000 CR3: 37bf6000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c01021ab>] cpu_idle+0x95/0x11a
[<c0390263>] start_secondary+0x156/0x1c3
=======================
B: BUG重现步骤&现场分析
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the
time from hardware, such as from TSC. In this configuration kgdb will
report a softlock warning messages on resuming or detaching from a
debug session.
Sequence of events in the problem case:
1) "cpu sched clock" and "hardware time" are at 100 sec prior
to a call to kgdb_handle_exception()
2) Debugger waits in kgdb_handle_exception() for 80 seconds and on
exit the following is called
...
touch_softlockup_watchdog()
--> __raw_get_cpu_var(touch_timestamp) = 0;
3) "cpu sched clock" = 100s (it was not updated, because the interrupt
was disabled in kgdb) but the "hardware time" = 180 sec
4) The first timer interrupt after resuming from kgdb_handle_exception
updates the watchdog from the "cpu sched clock"
update_process_times()
{
...
run_local_timers()
--> softlockup_tick()
--> check (touch_timestamp == 0)
(it is "YES" here, we have set "touch_timestamp = 0"
at kgdb)
--> __touch_softlockup_watchdog()
***(A)--> reset "touch_timestamp" to "get_timestamp()"
(Here, the "touch_timestamp" will still be set to
100s.)
...
scheduler_tick()
***(B)--> sched_clock_tick()
(update "cpu sched clock" to "hardware time" = 180s)
...
}
5) The Second timer interrupt handler appears to have a large jump and
trips the softlockup warning.
update_process_times()
{
...
run_local_timers()
--> softlockup_tick()
--> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s
--> printk "soft lockup error messages"
...
}
note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)"
Why "touch_timestamp" is 100 sec, instead of 180 sec?
With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of
get_timestamp() is:
get_timestamp(this_cpu)
-->cpu_clock(this_cpu)
-->sched_clock_cpu(this_cpu)
-->__update_sched_clock(sched_clock_data, now)
C: BUG触发原因
The __update_sched_clock() function uses the GTOD tick value to create
a window to normalize the "now" values. So if "now" values is too big
for sched_clock_data, it will be ignored.
D: BUG解决方法
The fix is to invoke sched_clock_tick() to update "cpu sched clock" in
order to recover from this state. This is done by introducing the
function touch_softlockup_watchdog_sync(), which allows kgdb to
request that the sched clock is updated when the watchdog thread runs
the first time after a resume from kgdb.
E: BUG简要总结
这里的软锁警告并不是由真正的软锁产生的,从使用角度来说可以忽略,但由于是kgdb引起的,对于kgdb来讲,所以还是有责任去解决的.
这个问题是因kgdb关闭了所有中断,导致sched_clock_data存储的时间没有与硬件时间同步造成的.
所以解决方法就是在kgdb退出后立刻更新sched_clock_data.
F: others:
This fix was placed at:
http://git.kernel.org/?p=linux/kernel/git/jwessel/linux-2.6-kgdb.git;a=commit;h=ab3f0db8b34a34cbf84d94a7c0df342354291edf
Zeuux © 2024
京ICP备05028076号
回复 DDD 2010年02月08日 星期一 10:38
comm
d