> I'm certainly going to need your help with understanding the MOVX
> predictor, that is a beauty

It's simpler than it looks. The left side of the schematic produces just a single signal that gives a clock sync'd pulse when the 8051 is fetching one of the six opcodes that can be MOVX. The right side is a 5 bit counter that starts at 0, counts up to 31, and stops when it gets to 31. When it's at 31, the DMA_OK signal is asserted and it is OK for the controller to begin DMA and refresh operations. The pulse from the left side zeros the counter, so that no DMA operations can begin in the next 31 cycles. There are also a couple of "done" signals that force it to 31 so that DMA can begin again immediately after we've serviced the MOVX (this probably isn't necessary... I doubt it makes any real difference). On the left side, there's a several cycle delay on PSEN which ultimately enables those flip-flops to (hopefully) catch the opcode in the middle of the time it's available on the bus (this timing has never really been verified well). A bit of combinatorial logic detects whether the opcode was one of the 6 we need, and that same PSEN pulse is delayed another couple of cycles and AND'd with the opcode detection so that the counter only gets its reset for one cycle, and only when PSEN makes its low to high transition. The timing of all this stuff has never really been verified, and this might indeed be the thing that can cause instability. Does the circuit really capture the opcode when it's nice and stable? Does the 31 cycle delay begin too late, so that a DMA operation can begin and last long enough into the MOVX fetch that the controller doesn't service the MOVX in time? Is 31 cycles really long enough to wait for the worst case time from when we capture the 8051's opcode fetch to when it will assert RD or WR (and we see it, +/- 1 cycle since we're not on the same clock as the 8051)?
But I think this circuit really does work, because when I built the chip with bad timing constraints and ended up with a highly unstable FPGA, I saw read/write errors to DRAM when no DMA was running. Hmm... the refresh does always run, so maybe the problem could be here?

16th December 2002
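The predictor described above is easy to model in software. This is only an illustrative sketch: the class and signal names are made up, and the counter is modeled as starting "ready" at 31, which the schematic may or may not do after reset. The six MOVX opcodes are assumed to be the standard 8051 set (0xE0, 0xE2, 0xE3, 0xF0, 0xF2, 0xF3); the text only confirms 0xE0 and 0xF0 explicitly.

```python
# Hypothetical software model of the MOVX predictor described above.
# Names (MovxPredictor, DMA_OK) are illustrative, not from the real schematic.

MOVX_OPCODES = {0xE0, 0xE2, 0xE3, 0xF0, 0xF2, 0xF3}  # assumed: the six 8051 MOVX opcodes

class MovxPredictor:
    def __init__(self):
        self.count = 31          # saturating 5-bit counter, modeled as starting "ready"

    def clock(self, fetched_opcode=None, movx_done=False):
        """One FPGA clock. fetched_opcode is the byte captured on a PSEN fetch.
        Returns the DMA_OK signal: True when DMA/refresh may begin."""
        if fetched_opcode is not None and fetched_opcode in MOVX_OPCODES:
            self.count = 0       # a (possible) MOVX was fetched: hold off DMA
        elif movx_done:
            self.count = 31      # a "done" signal forces ready so DMA resumes at once
        elif self.count < 31:
            self.count += 1      # count up and stop at 31
        return self.count == 31  # DMA_OK

p = MovxPredictor()
assert p.clock() is True                      # idle: DMA allowed
assert p.clock(fetched_opcode=0xE0) is False  # MOVX fetch blocks DMA
for _ in range(30):
    assert p.clock() is False                 # still counting up
assert p.clock() is True                      # 31 cycles later, DMA_OK again
```

Note that, as the text says, this also fires on *operand* bytes that happen to match one of the six values, since the circuit cannot tell opcodes from operands.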
> I'm thinking that I will use ALE to determine the start of the next cycle
> rather than the 32 byte counter. This is just a different approach, however

The reason I went to the 32 cycle count was seeing the ALE signal stop pulsing when the 8051 executes from internal code memory (such as when we call those routines to write into the flash). I originally had a very simple scheme based on ALE, but it resulted in the refresh not getting to access the DRAM when ALE stopped pulsing. It sounds like you won't have this problem with your approach, but I just wanted to bring up the issue where the 8051 turns off ALE pulses while it's running from internal code.

17th December 2002
> Other questions, does the timer need to start immediately or can we start it
> at the start of the DRAM access?

Well, there's some flexibility. It sounds like you've done quite a bit more looking at the 8051 bus than I ever did. I never connected a logic analyzer to a real 8051 and I never simulated anything. I just dreamed the whole thing up based on the datasheet's timing diagrams. I sketched up some waveforms, made some little scribbles on bits of paper, and then I wired up gates on the schematic and downloaded bit files into the FPGA until it worked. So you can probably make changes and it might even work better. The 8051 bus is pretty slow, so there's some flexibility in how to do things. The main reason for 7.3728 MHz on the 8051 was to allow plenty of time in the RD and WR pulses to access the DRAM. Now, the way I originally intended it was for the timer to begin counting before the end of the PSEN pulse, and it would not reach zero before the RD or WR pulse begins. It does not matter if the counter reaches zero once the state machine enters the first state that services the request. All that matters is that the DMA_OK signal is de-asserted soon enough for whatever operation that might be in progress to completely finish, so the state machine will be waiting in the idle state and can respond immediately to the 8051's bus. While I'm thinking of this, I should mention a pitfall that you are probably already aware of. The MOVX prediction is quite simple... any code memory fetch of one of the six MOVX opcodes causes the timer to reset. We can't tell the difference between opcodes and operands, so any operand with one of those 6 bytes causes the DMA to stall for 31 cycles. This probably isn't a big deal, but then again, I've never really investigated how much time the DMA is being suspended needlessly. But I do know that my original choices for the inter-bank calling code were not so great.
The code to call bank1 is at 0x0FE0, and the code to return to bank0 is at 0x0FF0. These locations are fixed in the 87C52 and we can't change them because code that jumps to different places would not be compatible with existing boards. The 2 important MOVXs are opcodes 0xE0 and 0xF0, so all jumps between the two memory banks will stall the DMA. Hindsight.... I suspect this doesn't actually slow things down much, but I really don't know what the impact is. I thought a few times about changing the inter-bank calling to just use some code that exists in both banks, or to take a closer look at the calling conventions that Roger and Marco discussed at length some time ago. Anyway, this is (probably) a minor slowdown and it doesn't have any harmful effects. I just thought I'd mention it while I was thinking about the MOVX prediction circuitry.

> It looks like DMA_OK off can freeze a
> cycle (hold the state machine in the middle of a cycle). Is this really
> what we want?

That would definitely be bad, but I can't see how it could happen. The DMA_OK signal goes into a bunch of gates that use the request lines to assert exactly one of the DO_xxx signals. When there are no requests, either DO_WAIT or DO_NOTHING is asserted (they both have the same effect... it would be interesting to see if the xilinx translate step removes the redundant logic). The way I had intended it was that all the DO_xxx signals only affect the state transition at the idle state. Once the state machine enters a sequence of states like S_RD_DRAM_x, there is no way the DMA_OK signal or any of the request bits can alter the flow until it returns to the idle state. That is why I wanted the DMA_OK signal to begin as soon as possible.
The hope is that there are more cycles between the initial de-assertion of DMA_OK and the 8051's RD or WR pulse than there are in the longest operation (S_IDEXFER_x), so that in the worst case, where the state machine begins an operation in the same cycle where DMA_OK is de-asserted, that operation will return to the idle state before REQ_RD or REQ_WR are asserted due to the MOVX. This approach does have the drawback that none of these atomic operations can be really long, because they'd cause the controller to respond too late to the 8051's RD or WR pulse. It would have been really good to use a CPU with a wait state input. The other thing I had considered was using the FPGA to clock the 87C52. It's supposed to be fully static, so at least in theory, the FPGA could suspend the 8051's clock until it is ready to respond to the MOVX. This could also allow the 8051 to clock quite a bit faster. The main reason I did not pursue that was the difficulty of transitioning between the initial clock (before the FPGA is configured) and a clock from the FPGA.
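The "requests only matter at idle" rule described above can be sketched as a tiny simulation. This models only the DMA side (requests gated by DMA_OK); the state names follow the text, but the sequence length and event timing are invented for illustration.

```python
# Sketch of the rule that DO_xxx signals only affect the idle-state transition:
# once an operation starts, neither DMA_OK nor the request bits can alter it.
# "states" is the fixed sequence of an operation (e.g. a DRAM read);
# "events" is a list of (dma_ok, dma_req) pairs, one per clock.

def run(states, events):
    trace = []
    seq = []                         # remaining states of the current operation
    for dma_ok, dma_req in events:
        if seq:                      # mid-operation: nothing can alter the flow
            trace.append(seq.pop(0))
        elif dma_req and dma_ok:     # DO_xxx decided only at the idle state
            seq = list(states)
            trace.append(seq.pop(0))
        else:
            trace.append("S_IDLE")   # DO_WAIT / DO_NOTHING
    return trace

# A DMA read begins, then DMA_OK drops mid-operation: the operation still runs
# to completion, and a later DMA request with DMA_OK low is simply not started.
t = run(["S_RD_DRAM_1", "S_RD_DRAM_2", "S_RD_DRAM_3"],
        [(True, True), (False, False), (False, False), (False, True)])
assert t == ["S_RD_DRAM_1", "S_RD_DRAM_2", "S_RD_DRAM_3", "S_IDLE"]
```

This is exactly why the worst case matters: an operation that begins on the same clock DMA_OK drops still takes its full length before the machine is back at idle for the 8051's RD or WR.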
> I'm a little confused about the polarity of the RD_RAW signal driving the
> data bus output (DOUT => DBUS) enable. From the schematic it looks to be
> active high, but I can't see how this could possibly work. Can you help me
> here?

It is active low. Almost every signal is active high, but in this case Xilinx's OFDTX8 symbol requires an active low enable.
> 1. The write timing you've used seems to use a second CAS
> cycle to clock the data in. This is different from the
> fast mode cycles described in the micron data sheets and
> seems to be different from what you described (early
> on) on the simm page. Is the method you have used
> described anywhere

I've never documented it (until now). Question 2 is related.

> 2. I'm really confused about 8 bit write cycles. I can't see
> how these are possible since we always write 16 bits (I think).
> If this is the case then how do variables in the xdata space
> ever work?

The S_WR_DRAM_x states implement a read-modify-write operation to the DRAM. That is why you see two CAS pulses. The first one reads all 16 bits into the register in the FPGA (almost exactly the same as S_RD_DRAM_x). Then the 8051's 8 bits are written into whichever half of the register A0 specifies, and the second CAS pulse writes the modified 16 bits back out to the DRAM.

> DRAM was a black art.

I used to think that too before this project. It is a bit of a pain compared to normal SRAM and peripherals, but it's not really that bad.

> 3. Finally I take it the DRAM refresh address counter is internal
> to the chips, all we have to do is to keep asking it to refresh

Yes. All modern DRAM chips have a row address counter inside, so all you have to do is assert CAS before RAS and it refreshes the next row. The main thing to remember is to allow for the "precharge time" after any operation (de-asserting RAS). DRAM reads are always destructive, since the tiny charge on the little capacitors makes a little change on the column lines that is picked up by the sense amplifiers and written to the row buffer. The time after RAS is needed for the chip to write the entire row back to that row of the memory array. According to the datasheets, 60 ns ought to be enough, so in theory one 68 ns cycle should do it. But my experience was that some simms were problematic until I allowed 2 cycles for precharge time.
(I have never even attempted to overlap the precharge time with the IDEXFER for faster DMA... but it ought to be possible). Anyway, until it's working well, just leave an extra couple of cycles for the precharge time to be cautious.

15th October 2002
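The read-modify-write byte merge described above is simple to express in software. This sketch models the DRAM as a dict of 16-bit words; which half of the word A0 selects is an assumption here (the text only says A0 picks a half), so the polarity may be reversed on the real board.

```python
# Model of the 8-bit write via read-modify-write: the first CAS reads the full
# 16-bit word, the 8051's byte replaces the half selected by A0, and the second
# CAS writes the merged word back. A0=1 -> high byte is an assumption.

def write_byte(dram, word_addr, a0, byte):
    old = dram[word_addr]                     # first CAS: read 16 bits into the register
    if a0:
        new = (old & 0x00FF) | (byte << 8)    # replace high byte (assumed polarity)
    else:
        new = (old & 0xFF00) | byte           # replace low byte
    dram[word_addr] = new                     # second CAS: write merged 16 bits back

dram = {0x100: 0xABCD}
write_byte(dram, 0x100, a0=0, byte=0x12)
assert dram[0x100] == 0xAB12
write_byte(dram, 0x100, a0=1, byte=0x34)
assert dram[0x100] == 0x3412
```

This is why xdata byte variables work even though the SIMM is always accessed 16 bits at a time: every 8-bit write costs a read plus a write at the DRAM.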
> If you have some kind of description of the states in the transfer machines
> that would be great - I'm of course quite mystified by the RAS, CAS and
> column select logic too. At this stage I've just coded it blind from your
> schematics. My plan is to work towards getting a DRAM interface going
> leaving out the IDE and using its pins for debug.

I have some hand-drawn diagrams of the expected waveforms. Maybe that would help? I'll describe a bit.... The basic idea is that the DRAM requires the address provided in two parts (row and then column). The full address is formed from 12 bits from the 8051, and from the SRAM block that holds the page mapping. The 8051's A0 never makes it to the SIMM, since SIMM access is always 16 bits wide. The address mux logic takes the 24 bit SIMM address (16 meg of 16 bit wide locations), where 11 bits are from the 8051/DMA address, and the upper 13 bits are from the page mapping registers (11 bits of 16 bit data = a 4k block). The low 20 bits get muxed to the low 10 bits of the DRAM. The next two above that select which of the 4 RAS lines will be asserted, and the top two go to A11 (16 and 32 meg simms). A couple of signals force certain bits to 0 or 1 when accessing the IDE drive. I wrote quite a bit about how the SIMM access works some time ago, and it's archived here: http://www.pjrc.com/tech/mp3/simm/simm.html I should really dig up those waveform sketches. The basic idea is that the control signals are asserted at certain times in relation to other control signals, so that both halves of the address are output, and then after 1 idle state, the data is captured into the latch (for reading) or data is transmitted to the simm (for writing).
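The address decomposition described above can be written out explicitly. This sketch follows the prose (11 CPU bits, 13 page-mapping bits, low 20 bits muxed as row/column, next two selecting the RAS line, top two going to A11); the exact bit positions are inferred from the text, not checked against the schematic, so treat them as an assumption.

```python
# Rough model of the SIMM address mux: a 24-bit word address built from the
# 8051/DMA address (low 11 bits) and the page-mapping SRAM (upper 13 bits),
# then split into DRAM row, column, RAS select, and A11 bits.
# Bit assignments are read from the prose and may not match the schematic exactly.

def simm_address(page_bits, cpu_bits):
    addr = ((page_bits & 0x1FFF) << 11) | (cpu_bits & 0x7FF)  # 24-bit word address
    col = addr & 0x3FF                  # low 10 bits -> DRAM column address
    row = (addr >> 10) & 0x3FF          # next 10 bits -> DRAM row address
    ras = (addr >> 20) & 0x3            # two bits select one of the 4 RAS lines
    a11 = (addr >> 22) & 0x3            # top two bits -> A11 (16 and 32 meg simms)
    return row, col, ras, a11

assert simm_address(0, 0) == (0, 0, 0, 0)
assert simm_address(0x1FFF, 0x7FF) == (0x3FF, 0x3FF, 3, 3)  # all 24 bits set
```

The row half is driven first (with RAS), then the mux switches to the column half (with CAS), which is what the waveform sketches would show.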
> I need some info on what constraints you've been using. I can see there are
> some on the main schematic, but it's difficult to tell what they apply to.

Well, I've tried many different things in the constraints, and I've never really been happy with it. Maybe I didn't consider something, but the results were often so strange that I think the schematic "flow" has a lot of bugs. It could also be some sort of strange async thing between the clocked logic and the unsync'd 8051 signals (but I tried to sync all of them up right at the inputs). Anyway, the main constraint I used was the period on the clock. In the end I believe I also used two other constraints on the inputs and outputs of the chip relative to the clock. All over the design are TNM= attributes that assign registers to various timing groups. These are all unused. I originally had a bunch of constraints for things like control registers to data registers, etc. This produced very erratic results. Version 3.1i added the "offset" timing constraint relative to the clock. So there are only three constraints... the speed of the clock, the time we are willing to accept from the clock to when the outputs change, and the time before the clock that inputs must be stable. At least that's roughly how I remember it. Many times I would fiddle for hours with the constraints and ultimately get a lot of unreliable compiles, and then just switch off timing constraints altogether and get a quick compile that ran pretty well. When I revisit the FPGA, the next major thing I want to try (other than learning the simulator and setting up some good simulations) is the floorplanner and more relative location specs. The automatic placement makes a giant mess, and why it places things the way it does is a total mystery. There's a reason for the floorplanner.
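For reference, the three constraints described above look roughly like this in Xilinx UCF syntax of that era (the net name and the numbers are made up for illustration, not taken from the real design):

```text
# clock period constraint (the main one)
NET "clk" TNM_NET = "clk_grp";
TIMESPEC "TS_clk" = PERIOD "clk_grp" 40 ns HIGH 50%;

# global offset constraints relative to the clock (added in 3.1i):
OFFSET = IN 15 ns BEFORE "clk";    # inputs must be stable this long before the clock
OFFSET = OUT 25 ns AFTER "clk";    # outputs change within this long after the clock
```

The per-group constraints (register group to register group) that the unused TNM= attributes were meant to support would go on top of these, but as described above they were abandoned.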
> It took me a while to fight my way through the dram/ide address selection
> logic - it was much clearer once I checked the schematic to determine where
> the various ide signals were connected (must have suited the PCB since it
> makes the fpga look a little weird).

Yep, I did the board layout first, so it could fit into 2 layers. I also routed the whole board in 15 mil traces with 10 mil clearance, so it would be possible to etch with a hobby etching kit. In hindsight, that was a lot of extra routing work that I probably wouldn't bother to repeat. But it did put a lot of limits on which pins could route where. The address and data connections to the flash rom are also all scrambled to make the signals route (and the monitor has constants to adjust for them). If I could go back in time knowing what I know now, I'd definitely use the 5 strobes for two RAS, two CAS and one extra address bit (instead of 4 RAS and 1 CAS). That would allow a lot more SIMMs that aren't wired like the Micron datasheet to be used at full capacity. Oh well.

> I don't quite understand why the
> generation of A5 is ORed with ide_addr_z and others use an inverted input
> AND. I need to understand this well since I will be changing it to add the
> chip select for the CS8900.

A5 goes to pin 7 on the FPGA, which connects to pin 38 (CS1) on the IDE connector. When the IDE DMA reads from the drive, it needs to access the IDE data register, which is A0=0, A1=0, A2=0, CS0=0 and CS1=1. So the four AND gates force A0-A2 and CS0 low, and the OR gate forces CS1 high. Nothing is done to the other 6 lines, since they do not connect to the IDE bus. You'll almost certainly have to do similar things if you want to make the FPGA circuits access certain registers in the CS8900 without CPU bus cycles providing the address. As this function gets more complex, a nice way to minimize the logic might be to use initialized RAM blocks to hold the DMA addresses, and then follow each with a mux for the CPU addresses.
That ought to allow a single CLB (1/2 for the RAM, 1/2 for the mux, and the internal H mux to choose between them) to generate each address bit.
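The AND/OR forcing described above amounts to a simple override of the CPU address lines during IDE DMA. This sketch uses the signal levels given in the text (A0=A1=A2=0, CS0=0, CS1=1 for the IDE data register); the function name and dict representation are invented for illustration.

```python
# Sketch of the address-forcing gates: during an IDE DMA read, the FPGA must
# address the IDE data register regardless of what the CPU address lines carry.
# Four AND gates force A0-A2 and CS0 low; one OR gate forces CS1 high.

def ide_force(cpu_lines, ide_dma_active):
    """cpu_lines: dict of line levels driven by the CPU address logic.
    Returns the levels actually driven to the bus."""
    out = dict(cpu_lines)
    if ide_dma_active:
        for sig in ("A0", "A1", "A2", "CS0"):
            out[sig] = 0        # AND gates force these low
        out["CS1"] = 1          # OR gate forces this high
    return out                  # the other 6 lines pass through untouched

bus = ide_force({"A0": 1, "A1": 0, "A2": 1, "CS0": 1, "CS1": 0}, ide_dma_active=True)
assert bus == {"A0": 0, "A1": 0, "A2": 0, "CS0": 0, "CS1": 1}
```

Adding a CS8900 chip select, as discussed above, would mean more such overrides, which is where the RAM-block-plus-mux trick starts to pay off over discrete gates.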
> 1. The weird bit ordering in the shift register. At first I
> thought this was just an endian thing, but it seems to also
> reverse the order of the bits too.

Referring to this schematic: http://www.pjrc.com/tech/mp3/fpga/1.10/sr16.gif

I suppose that depends on what you consider "reverse". The STA013 wants to see the MSB first. FAT32 is little endian, so bits 0-7 are the first byte, and 8-15 are the second byte. That is why the shift register is connected in that order.

> 2. The operation of the bitcount block. It is really hard
> to read when you're not used to those logic blocks.

Referring to this schematic (is anyone else actually trying to follow these FPGA conversations??): http://www.pjrc.com/tech/mp3/fpga/1.10/bitcount.gif

First, ignore those FMAP symbols. They have absolutely no logic functionality. Their only purpose is to serve as a placeholder for RLOC constraints to the xilinx compiler. Once you get past that, it's just a simple 5-bit down counter with sync preload of 16. Yes, the carry logic is a bit strange and the symbols do a very poor job of conveying their function (I always print out the relevant pages from libguide.pdf regarding those CY4_xx symbols).

> I'm guessing it is a 4 bit counter
> with a clock enable and a parallel load (begin) with a
> non-zero output. However there seems to be more logic than
> that in the block.

Hmm... it's been quite a long time since I designed that.... I'm looking at it now....... Ok, it is a bit tricky and difficult to figure out by looking. That counter works together with a little state machine which is drawn on the main schematic. The state machine has two flip-flops, so ignore that IFD flip-flop on the MP3_REQ line, since all it does is sync the STA013's request signal to the FPGA's clock. That state machine has 3 valid states: 0/0, 0/1, 1/0. Both flip-flops should never go high (if they ever did, it would immediately go back to 0/0).
In this little informal notation for this message, the first number refers to the top flip-flop. Here's a conceptual way to think of the 3 states:

0/0 = Waiting for STA013 to be ready or for parallel load
1/0 = First half of clocking to STA013
0/1 = Second half of clocking to STA013
1/1 = (illegal state)

The state sequence when transferring data is: 0/0 -> 1/0 -> 0/1 -> 1/0 -> 0/1 ... If the STA013 de-asserts its data request signal, or if the counter reaches zero (it starts at 16 and counts down), then the 0/0 state is entered and it remains at 0/0 until the STA013 is ready AND the counter is non-zero. Of course, when the counter reaches zero, the "nonzero" output is inverted to become "DECODE_READY" on the main schematic, and that gets AND'd with the DMA request so the state machine receives a request to load another 16 bits whenever DECODE_READY is asserted AND we're still servicing a DMA request. The SR_LOAD signal (from the control state machine) is asserted when the 16 bits are transferred from DRAM to fill the shift register, and it sets the counter back to 16. On the next clock cycle, DECODE_READY causes that little state machine to begin clocking the bits out (of course, only if the STA013 is also requesting data... if not, it stays in the 0/0 waiting state and the counter remains at 16). It needs to be a 5 bit counter because there is one state for each bit, and 00000 is used to represent "empty".

> You also seem to have used RLOC constraints within the
> counter, can you remember why?

Not for any great reason. Mostly my low opinion of the xilinx compiler's placer. It's also a habit I got into with XACT 5.0 (before "foundation"), where the placer was not able to use carry logic with RLOCs. They seem to have fixed this sometime in the last several years... old habits die hard, I guess. But I didn't put lots of RLOCs inside the MOVX prediction, which was designed later on.
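The counter plus the two-flip-flop state machine described above can be modeled together. This is a behavioral sketch only: the class name is invented, states are written (top_ff, bottom_ff) as in the text, and the shift register itself is left out.

```python
# Software model of the 5-bit down counter (preload 16, 0 = "empty") working
# with the little 3-state machine: 0/0 waiting, 1/0 first half of the bit
# clock, 0/1 second half. Each bit to the STA013 takes two FPGA clocks.

class BitClocker:
    def __init__(self):
        self.count = 0           # 5-bit down counter; 00000 means "empty"
        self.state = (0, 0)      # (top flip-flop, bottom flip-flop): waiting

    def load(self):
        self.count = 16          # SR_LOAD: shift register refilled from DRAM

    def clock(self, mp3_req):
        """One FPGA clock; mp3_req is the sync'd STA013 data request.
        Returns True on the clocks where a bit is shifted out."""
        if not mp3_req or self.count == 0:
            self.state = (0, 0)              # wait for STA013 AND a nonzero count
            return False
        if self.state in ((0, 0), (0, 1)):
            self.state = (1, 0)              # first half of clocking a bit
            return False
        self.state = (0, 1)                  # second half: the bit goes out
        self.count -= 1
        return True

bc = BitClocker()
bc.load()
bits = sum(bc.clock(mp3_req=True) for _ in range(40))
assert bits == 16                            # 16 bits, two clocks each, then wait
assert bc.state == (0, 0) and bc.count == 0  # empty: "DECODE_READY" would assert
```

When `count` reaches zero the real circuit inverts the "nonzero" output into DECODE_READY, which (AND'd with the DMA request) asks the control state machine for the next 16 bits.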