M5 Bugs


FS#337 — Checkpoint Tester Identifies Mismatches (Bugs) for X86_FS

Attached to Project: M5 Bugs
Opened by Brad Beckmann (beckmabd) - Thursday, 13 January 2011, 04:47PM
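Task Type: Bug
Category: ISA Support
Status: Unconfirmed
Assigned To: No-one
Operating System: Linux
Severity: High
Priority: Normal
Reported Version: 2.0beta5
Due in Version: Undecided
Due Date: Undecided
Percent Complete: 0%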
While using the checkpoint tester script, I noticed that X86_FS encounters differences in the checkpoint state. This problem exists for both atomic and timing mode, as well as classic and Ruby memory systems.

A short run of the checkpoint tester script reproduces the problem:

% util/checkpoint-tester.py -i 2000 -- build/ALPHA_FS_MOESI_hammer/m5.debug configs/example/fs.py --script test/halt.sh

Identified differences in the checkpoint:

--- checkpoint-test/m5out/cpt.10000/m5.cpt Wed Jan 12 14:59:28 2011
+++ checkpoint-test/test.4/cpt.10000/m5.cpt Wed Jan 12 15:00:42 2011
@@ -10,20 +10,20 @@
so_state=2
locked=false
_status=1
-instCnt=10
+instCnt=9

[system.cpu.xc.0]
_status=0
-funcExeInst=16
+funcExeInst=15
quiesceEndTick=0
iplLast=0
iplLastTick=0
floatRegs.i=0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0
-intRegs=549755813888 0 2097152 0 0 0 590336 0 0 0 0 0 0 0 0 0 0 2097208 380 0 0 0 0 2097189 0 0 0 0
0 0 0 0 133 0 0 0 0 0 0
+intRegs=549755813888 0 2097152 0 0 0 590336 0 0 0 0 0 0 0 0 0
+18446743523955834880 2097182 380 0 0
0 0 2097189 0 0 0 0 0 0 0 0 133 0 0 0 0 0 0
_pc=2097202
-_npc=2097208
-_upc=1
-_nupc=2
+_npc=2097210
+_upc=0
+_nupc=1
regVal=3758096401 0 0 458752 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4294905840 1024 2 243392 0 1288 0
0 0 260 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1974748653749254 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1280 0 0 0 0 0 0 0 0 0 0 0 0 0 0 132609
0 0 0 0 67108864 0 0 0 0 0 16 8 16 16 16 16 0 0 0 0 0 24 0 0 0 0 0 0 0 0 0 483328 0 0 0 0 0 0 0 0 0
0 0 0 483328 0 0 0 0 983295 983295 983295 983295 983295 983295 65535 65535 23 65535 65535 983295 655
35 45768 43728 45768 45768 45768 45768 45952 0 45952 45952 45952 43976 45952 0 0 0 0 0 0 0 0 0 0 0 4
276095232 0

[system.cpu.tickEvent]
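
Editorial note: the unified diff above is hard to read when long register lines wrap. Since the checkpoint file shown here is laid out as INI-style sections with key=value pairs, a per-key comparison can make the mismatches easier to spot. The following is only a minimal sketch using Python's standard configparser; the helper names load_checkpoint and diff_checkpoints are hypothetical and are not part of util/checkpoint-tester.py, and the sample values are copied from the diff above.

```python
import configparser


def load_checkpoint(text):
    """Parse INI-style checkpoint text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case, e.g. funcExeInst
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def diff_checkpoints(a, b):
    """Return (section, key, value_a, value_b) for every differing key."""
    diffs = []
    for section in sorted(set(a) | set(b)):
        keys_a = a.get(section, {})
        keys_b = b.get(section, {})
        for key in sorted(set(keys_a) | set(keys_b)):
            value_a, value_b = keys_a.get(key), keys_b.get(key)
            if value_a != value_b:
                diffs.append((section, key, value_a, value_b))
    return diffs


# Values below are taken from the diff in this report.
uninterrupted = """
[system.cpu.xc.0]
funcExeInst=16
_pc=2097202
_npc=2097208
_upc=1
_nupc=2
"""
restored = """
[system.cpu.xc.0]
funcExeInst=15
_pc=2097202
_npc=2097210
_upc=0
_nupc=1
"""

diffs = diff_checkpoints(load_checkpoint(uninterrupted), load_checkpoint(restored))
for section, key, value_a, value_b in diffs:
    print(f"[{section}] {key}: {value_a} != {value_b}")
```

Keys that match (here, _pc) are not reported, so only the fields that actually diverged are listed.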


Comment by Gabe Black (gblack) - Saturday, 15 January 2011, 03:53AM

The differences in state look like basically one more (or one fewer) instruction/microop executed between the two instances. My very off-the-cuff hunch is that the macroop isn't being checkpointed, so when the CPU starts up again it has to fetch the instruction, decode it to a macroop, and then start executing the microops. When it's not from a checkpoint, the macroop is already there and ready to go.

If that is what's happening, no trivial fix springs to mind, though here are some non-trivial possibilities. One would be to force the current macroop to finish executing and consider that part of draining (right term?), although there's no guaranteed bound on how long a macroop can take. They would practically tend to be short, but "tend to" isn't something to design around. Alternatively, when dropping a checkpoint we could just forcefully lose track of the macroop so it has to be fetched again in both cases. A third option would be to serialize the macroop itself by serializing its ExtMachInst, although I suspect there would be complications, and it isn't clear it would be worthwhile unless checkpointing were frequent enough to cause statistically meaningful differences in behavior without that sort of thing.
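
Editorial note: the hunch above can be illustrated with a toy model. This is not M5 code, and it assumes (purely for illustration) that fetching and decoding a macroop costs one tick in which no microop executes. ToyCpu, its fields, and run_experiment are hypothetical names. The decoded macroop is deliberately left out of the checkpoint, and the drop_macroop flag models the second suggested option of discarding the macroop whenever a checkpoint is taken.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCpu:
    pc: int = 0                     # index of the current macroop
    upc: int = 0                    # index of the next microop in it
    uops_executed: int = 0
    macroop: Optional[int] = None   # decoded macroop (its microop count); not checkpointed

    def tick(self, program):
        if self.macroop is None:
            # Assumed cost: one tick to fetch/decode, no microop executes.
            self.macroop = program[self.pc]
            return
        self.uops_executed += 1
        self.upc += 1
        if self.upc == self.macroop:   # macroop finished
            self.pc += 1
            self.upc = 0
            self.macroop = None

    def checkpoint(self, drop_macroop=False):
        if drop_macroop:
            self.macroop = None        # force a refetch in the running CPU too
        return {"pc": self.pc, "upc": self.upc,
                "uops_executed": self.uops_executed}

    @classmethod
    def restore(cls, cpt):
        return cls(**cpt)              # macroop starts out as None


def run_experiment(drop_macroop):
    program = [4, 4]                   # two macroops of four microops each
    cpu = ToyCpu()
    for _ in range(2):                 # stop in the middle of the first macroop
        cpu.tick(program)
    restored = ToyCpu.restore(cpu.checkpoint(drop_macroop))
    for _ in range(2):                 # advance both by the same number of ticks
        cpu.tick(program)
        restored.tick(program)
    return cpu.checkpoint(), restored.checkpoint()


print(run_experiment(drop_macroop=False))
print(run_experiment(drop_macroop=True))
```

Under these assumptions the restored CPU spends its first tick refetching, so it ends up one microop behind the uninterrupted run, while dropping the macroop at checkpoint time makes both runs agree. Whether the real CPU models behave this way is exactly what the comment leaves open.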


Comment by Gabe Black (gblack) - Thursday, 20 January 2011, 10:45AM

Your command line to reproduce the problem looks like it's for ALPHA_FS, not X86_FS. Could you please provide the one you used for X86?


Comment by Brad Beckmann (beckmabd) - Thursday, 20 January 2011, 11:40AM

Oops...it is the same command line, just a different binary:

util/checkpoint-tester.py -i 2000 -- build/X86_FS_MOESI_hammer/m5.debug configs/example/fs.py --script test/halt.sh


Comment by Gabe Black (gblack) - Saturday, 22 January 2011, 01:42AM

I looked at this, and there are still some problems with your command line.

1. X86_FS_MOESI_hammer doesn't exist in the public repository. I created it by merging ALPHA_SE_MOESI_hammer and X86_FS.
2. The X86 FS files available publicly don't yet, to the best of my knowledge, support the --script option to fs.py. It's unnecessary anyway since the simulation is stopped long, long before it gets to user land.
3. The version of fs.py in the public repository doesn't seem to want to run without --kernel being specified. I don't remember for sure if I added a default, but apparently I didn't.

I have some basic ideas about what actually makes the checker upset, but I need to look at it again more carefully.


Comment by Brad Beckmann (beckmabd) - Sunday, 23 January 2011, 11:48AM

Responses below:

1. Actually I don't believe the MOESI_hammer part of the binary is important at all. It just happened to be the particular binary I had built when observing the issue. Since the test uses X86_FS in atomic mode, any X86_FS binary should behave the same.
2. Yes, the script option doesn't matter either. I only specified it because I copied it from the example listed in checkpoint-tester.py.
3. Since each user may keep their kernel in a different location, I believe it made sense not to specify a default. I'll send you a separate mail with the specific kernel I used.

So in summary, the following commands also lead to the exact same problem:

% scons -j 4 default=X86_FS build/X86_FS/m5.debug USE_MYSQL=False NO_FAST_ALLOC=1 EXTRAS=
% util/checkpoint-tester.py -i 2000 -- build/X86_FS/m5.debug configs/example/fs.py


Comment by Gabe Black (gblack) - Sunday, 23 January 2011, 10:55PM

Yeah, I was able to reproduce the problem, so that wasn't an issue. I just wanted to point out the differences in case somebody else wanted to reproduce it too. I poked at it a bit and have some idea what's going on, but I need to dig a little deeper into what I was seeing so I don't go charging off in the wrong direction.