
DDR3L & emmc issues

  • DDR3L & emmc issues

    Hello fellas,

    I'm working on a project that uses an ATSAMA5D2 processor with two DDR3L chips from Micron and an eMMC chip. The first version of this design was done before my time and was believed to work properly in the field. I was mainly involved in the next generation of the project, which made some improvements and added chips to provide more features (key generation etc.).
    I haven't made any changes to the DDR3L interface, which was imported (in the first revision) from the manufacturer's reference design.
    The only change in these critical parts was replacing the eMMC part, as it is becoming obsolete. I went for the Panasonic eMMC RP-SEMC08DA1. Given that the previous version worked, I didn't even change the stack-up and kept everything on the critical path (DDR3L) the same.

    However, when we got our first prototype boards, I noticed boards failing specifically during the eMMC test. DDR tests usually work fine even when left running for weeks on my desk. We did another spin adding some parts (again far away from the critical path) and unfortunately again ended up with 20-30% of boards failing the eMMC tests (bit-fade test: write all memory cells, wait some time, then read everything back). I ran another test to confirm that the good boards are really good, by temperature cycling them in a chamber. The profile (5-55C) was well within the limits of the most restrictive components, which were the batteries. After 48 hrs of temperature cycling, most boards that were "good" at ambient failed in a similar way at some point in the chamber. I ran a comparison experiment with V1 boards and found that the previous revision survived the chamber test (although the sample was limited to 2 boards).
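For readers unfamiliar with the bit-fade test mentioned above, here is a minimal host-side sketch in Python of the write / wait / read-back idea. It is not the actual bare-metal test suite: `FakeBlockDevice`, `write_block`, `read_block` and `bit_fade_test` are hypothetical names used only for illustration, and the patterns 0x00/0xFF are an assumption.

```python
import time


class FakeBlockDevice:
    """Hypothetical in-memory stand-in for an eMMC driver (not a real API)."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks

    def write_block(self, index, data):
        assert len(data) == self.block_size
        self.blocks[index] = bytes(data)

    def read_block(self, index):
        return self.blocks[index]


def bit_fade_test(dev, num_blocks, wait_s=0.0, patterns=(0x00, 0xFF)):
    """Write each pattern to every block, wait, then read everything back.

    Returns a list of (pattern, block, offset, expected, actual) tuples,
    one per mismatching byte. An empty list means the test passed.
    """
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * dev.block_size
        for i in range(num_blocks):
            dev.write_block(i, expected)
        time.sleep(wait_s)  # the "fade" interval; assumed configurable
        for i in range(num_blocks):
            actual = dev.read_block(i)
            if actual != expected:
                for off, (e, a) in enumerate(zip(expected, actual)):
                    if e != a:
                        errors.append((pattern, i, off, e, a))
    return errors
```

Recording the block, offset and both byte values for every mismatch (rather than just pass/fail) makes it possible to look for patterns in the failures afterwards.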

    When the processor's data cache is turned off, things improve and boards start to pass the eMMC test. However, I don't know whether the problem lies in the eMMC part, as the read-mismatch errors are detected when comparing the read and write buffers, both of which are stored in DDR. This made me believe that there might be an issue with the DDR3L interface, although as I said earlier nothing has changed there since V1.

    I compared V1 and V2 boards in terms of VDDIODDR, DDR_VREF and VDDCORE and found that the ripple Vp-p values are closely matched, if not improved, in V2, yet these problems seem to happen only on the V2 design.

    The errors I'm seeing are read mismatches when reading back from the eMMC (sometimes only a few bits differ, other times it's a whole byte).
    This happens and the test continues until the code hits a translation/data fault, undefined instruction or software interrupt.
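Since the mismatches range from a few bits to a whole byte, it may help to summarise them systematically. The following Python sketch (a host-side analysis helper, assuming the expected and actual buffers can be dumped from the target; `classify_mismatches` is a hypothetical name) counts how many bits flipped in each bad byte and which bit positions are hit most often.

```python
from collections import Counter


def classify_mismatches(expected, actual):
    """Summarise how two equal-length byte buffers differ.

    Returns a dict with:
      bad_offsets       - offsets of the bytes that differ
      flips_per_byte    - Counter: number of flipped bits -> number of bytes
      bit_position_hits - Counter: bit position (0..7) -> number of flips
    """
    if len(expected) != len(actual):
        raise ValueError("buffers must have the same length")

    bad_offsets = []
    flips_per_byte = Counter()
    bit_position_hits = Counter()

    for off, (e, a) in enumerate(zip(expected, actual)):
        diff = e ^ a  # set bits mark the positions that differ
        if diff == 0:
            continue
        bad_offsets.append(off)
        flips_per_byte[bin(diff).count("1")] += 1
        for bit in range(8):
            if diff & (1 << bit):
                bit_position_hits[bit] += 1

    return {
        "bad_offsets": bad_offsets,
        "flips_per_byte": flips_per_byte,
        "bit_position_hits": bit_position_hits,
    }
```

If one bit position dominated, that would hint at a single data line; bad offsets clustered in aligned runs would hint at something operating on larger units. These interpretations are heuristics, not conclusions.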

    I suspected that the software might be the cause, but the same software works well on V1, which pointed me back to a fundamental HW issue.

    I would really appreciate your thoughts on this issue, as it's getting really frustrating. I can provide snippets of my schematic/layout if needed.

  • #2
    What test software do you use? Did you also try different test software? Where are you running the test software from (from the eMMC or from a different "drive")?

    I am not sure how much eMMC can be stressed, but I do remember having problems when stressing SD cards, especially if the OS was running from the card: testing would completely damage the SD card and make the OS unreliable.



    • #3
      We developed an internal test suite based on memtest86: basically DDR tests (address testing, moving inversions, hammer and bit-fade).
      All of these pass on the "bad" boards, but when running similar tests on the eMMC (address, inversions and bit-fade), mismatches occur.


      - The driver is not complaining about any read/write operation.
      - The read/write buffers are in DDR, so a read mismatch means the read buffer differs from the write buffer.
      - Good boards don't have this issue; however, when put in the chamber they start to behave like bad boards.

      If we continue the test I get translation errors, undefined instructions etc. Although the issues show up during the MMC tests, there might be an underlying problem with DDR, so that the corruption is actually happening in DDR. We're running bare-metal with no OS. If we run the same test on a V1 board, it still survives the chamber test fine.
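One way to probe whether the corruption sits in the stored data or somewhere in the read path is to re-read a block as soon as it mismatches. The Python sketch below only illustrates the decision logic; `triage_mismatch` and the `read_block` callable are hypothetical, and on the real target each re-read would have to actually go back to the device rather than reuse a buffer that may already be corrupted.

```python
def triage_mismatch(read_block, index, expected, rereads=3):
    """Classify a read mismatch by re-reading the same block.

    read_block: callable taking a block index and returning bytes
                (assumed to fetch fresh data from the device each time).
    Returns one of:
      "ok"                - first read matched
      "transient-read"    - first read was wrong, every re-read was right
      "stored-data-wrong" - every re-read returned the same wrong data
      "unstable"          - re-reads disagree with each other
    """
    first = read_block(index)
    if first == expected:
        return "ok"

    repeats = [read_block(index) for _ in range(rereads)]
    if all(r == expected for r in repeats):
        return "transient-read"
    if all(r == first for r in repeats):
        return "stored-data-wrong"
    return "unstable"
```

A "transient-read" result would suggest the eMMC holds correct data and the corruption happened on the way into, or inside, the DDR buffer. A "stored-data-wrong" result would suggest the data was already corrupted when it was written. Both are hints rather than proof.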


      When we run our deployed application we get all sorts of translation faults etc.
      Last edited by bashar_aba; 07-31-2019, 07:14 AM.



      • #4
        We're using bare-metal with no OS running
        - Hmm, so, the test is running directly from DRAM or Cache or internal CPU RAM?

        Are you using exactly the same memory chips? This may still be some problem with register settings.

        Or, what I have seen: some boards were failing because of heat. Is the CPU temperature / board temperature higher for V2?

        PS: I tried memtest86, but I ended up using stressapptest ( https://github.com/stressapptest/stressapptest ) ... if the board was bad, stressapptest was able to fail it within 1 hour even when memtest86 was showing everything OK or only crashed occasionally.



        • #5
          Originally posted by robertferanec
          - Hmm, so, the test is running directly from DRAM or Cache or internal CPU RAM?
          Yes, the test code is loaded into DRAM and both the data and instruction caches are enabled. The reason I went this way is to replicate how the final product SW will behave. I did some testing from internal SRAM, but that doesn't run under the same conditions as the final SW.
          The DDR tests then test the available space (we have 512MB of DDR3L); usually the test code resides in the first quarter and the rest is continually tested.

          Originally posted by robertferanec
          Are you using exactly same memory chips? This still may be some problem with register settings.
          Yes, the same chips (MT41K128M16JT-125:K) used in the reference design. In the first spin we had to use the extended-temperature (XIT) variant of that chip and thought it might be causing the problem, but in the next spin we used exactly the same chips and the problems still appear. The tuning procedure is done automatically by the driver; I tried some different values but that didn't help either. The calibration is a fairly obscure process and I don't really understand how it's done. There is no specific tool that runs and finds the optimal values; I guess that's because the CPU doesn't support write leveling or ODT.

          Originally posted by robertferanec
          Or, what I have seen - some boards were failing, because of heat. Is CPU temperature/board temperature higher for V2?
          At ambient the failure rate is around 20-30%, but in the chamber it increases to 50-60%. The test profile is 5-55C, which should not be a problem for the CPU, the DRAM or the eMMC chips. The upper limit was dictated by the batteries' operating limit of 60C.
          The same test profile was used when testing both versions of the board (V1 & V2).

          Originally posted by robertferanec
          PS: I tried memtest86, but I ended up with using stressapptest ( https://github.com/stressapptest/stressapptest ) ... if the board was wrong, stressapptest was able to fail the boards within 1 hour even if memtest86 was showing everything ok or only crashed occasionally.

          Our test suite is not as extensive as memtest86 or stressapptest, but it employs the same test mechanisms (bit fading, hammering etc.). The problem is that porting these libraries to a bare-metal app is a very time-consuming project. Also, our test suite can be used as a production test, as it can test all interfaces while running from DRAM. The issue here is that we have test code that fails on V2 and doesn't on V1, in both chamber and ambient testing. The first question is why we have these failures given that the DDR interface is exactly the same.

          I did some checks on DDR_VREF: the ripple Vp-p on this rail while the board is performing DDR tests is very limited, ~10mV. However, when the MMC test starts I can see the peaks getting bigger, even though the supply for the MMC interface comes from an LDO on the original PMIC used in the ref design. I don't know if that can lead me somewhere.




          • #6
            the CPU doesn't support write leveling or ODT.
            - interesting. so, you had to use T-Branch topology for layout? Probably yes http://ww1.microchip.com/downloads/e...S00002717B.pdf

            - Temperature: what I meant is that in our case, when the CPU temperature crossed a certain point, the boards started failing the memory test. When we placed a heatsink on the CPU, everything was fine.

            - It can be power, but as you explained, not much really changed from the previous version, so it is interesting. However, I have seen memory layouts fail because of not enough decoupling capacitors between VTT/2 and VTT or VTT/2 and GND, but this is probably not your case (I don't think you are using termination resistors in your design).

            And I remembered, this may be something interesting for you:
            https://www.fedevel.com/designhelp/f...shing-nand-slc



            • #7
              - Yes, the DDR interface is exactly as in the Microchip ref design. The only changes were the stackup (6 layers vs 8 layers) and where DDR_VREF is done.

              - I understand your point. If you look at the temperature stress test that Microchip did on their ref design, they went up to 70C; in our V2 tests we didn't even go above 60C. Also, I had 2 V1 boards in the chamber running the same test code and they survived 48 hrs of continuous testing. This led me to believe there is an actual problem with the V2 board.

              - The decoupling scheme is taken directly from the Microchip ref design, cap for cap; there might be some changes in placement. The noise on the eMMC power rail appears when you're testing the eMMC: I noticed the ripple goes up to 30mV (Vp-p). DDR_VREF also shows more peaks. I'm worried that there might be coupling, especially since the eMMC power plane is on the same layer as DDR_VREF. However, I cut the power supply from that plane and manually wired the supply back to 3.3V (as it was supplied in V1), and that didn't help: the board was still failing.



              • #8
                Please, let us know when you find out what the problem was.



                • #9
                  Unfortunately, the noise thing was a bit of a red herring from the scope. I did some checks on the stability of VDDIODDR and DDR_VREF in case either dips at or before the moment the fault happens, but couldn't find anything wrong. The power rails seem adequate; I added a few decoupling caps on both IODDR and VREF and that didn't change anything either.

                  I tried running the test code over JTAG from DRAM; it works and the board passes some tests, but I can't really understand why the behaviour changes. It is possible that the JTAG setup uses a separate config file for the DDR controller, but copying those register values into our bootloader doesn't reproduce the same good results.
