Atmel SAM9 "boot from NAND" is a myth?

Reply by Stefan Reuther ●September 20, 2010

Vladimir Vassilevsky wrote:
> Marc Jet wrote:
>> Some blocks come already marked as bad from
>> factory.  It is recommended to preserve this information, as factory
>> testing is usually more exhaustive than what you can implement in a
>> typical embedded system.
> 
> You are making unfounded assumptions here.

No, he is citing usual data sheets. So while skeptics may still doubt
that factory testing happens, Marc's claim certainly is not unfounded,
because you can read it in every data sheet :-]

  Stefan

Reply by Grant Edwards ●September 20, 2010

On 2010-09-20, Ulf Samuelsson <nospam.ulf@atmel.com> wrote:
> 2010-09-20 16:18, Grant Edwards wrote:

> If you add an SPI flash (or a dataflash) and plan to boot Linux, you
> are probably better off by putting also u-boot and u-boot environment
> in the dataflash. You might also want to consider the kernel.

While the ROM bootloader supports the 25xx series "dataflash" parts we
got sold, there is no support in the AT91 bootstrap, U-Boot, or Linux
-- at least not that I could ever find.  I asked about it on the AT91
forum a few months back and got the usual response (IOW, none at all).

> Reason is the SAM-BA S/W which only knows how to erase the complete
> NAND flash.

Oh, I fixed that ages ago.

I added a few lines of code to the nand-flash, data-flash, and
serial-flash applets so that they can all erase a region of flash.
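Grant's applet changes are not posted in this thread. As a rough, hypothetical illustration of what an "erase region" command has to do, the sketch below rounds a byte range out to whole erase blocks and erases only those, optionally skipping blocks already known to be bad. The `FakeFlash` class, the 0xFF erased state, and the skip-bad-blocks policy are assumptions for the sketch, not the SAM-BA applet interface.

```python
class FakeFlash:
    """Hypothetical in-memory stand-in for a NAND/DataFlash device."""

    def __init__(self, n_blocks, block_size):
        self.block_size = block_size
        self.mem = bytearray(b"\x00" * (n_blocks * block_size))

    def erase_block(self, blk):
        # Erased NAND cells read back as 0xFF.
        start = blk * self.block_size
        self.mem[start:start + self.block_size] = b"\xff" * self.block_size


def blocks_for_region(offset, length, block_size):
    """Indices of the erase blocks that overlap [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def erase_region(flash, offset, length, bad_blocks=frozenset()):
    """Erase every block touched by the region and return the blocks erased.

    Whole blocks are erased, so any data sharing a block with either end
    of the region is lost too; callers should pass block-aligned ranges.
    """
    erased = []
    for blk in blocks_for_region(offset, length, flash.block_size):
        if blk in bad_blocks:
            continue  # assumption: leave known-bad blocks untouched
        flash.erase_block(blk)
        erased.append(blk)
    return erased
```

With 128 KiB erase blocks, `erase_region(flash, 0x20000, 0x40000)` touches only blocks 1 and 2, so whatever sits in block 0 is left alone instead of the whole chip being wiped.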

Then I wrote my own ROM-boot-protocol client in Python.

[Besides the lack of an "erase region" command, SAM-BA won't work at
all using a serial connection on a Linux host, it's not very usable
from the command-line, and it isn't very easy to use as a module from
other programs.]

> If you plan to program the NAND flash using another method, then of
> course, use the NAND flash for everything except bootstrap.

We'll initially use a "SAM-BA" replacement program to program
prototypes. Then for production, the plan is to have the distributor
ship them with U-Boot preprogrammed so that we can use the TFTP server
in U-Boot to do the rest.

-- 
Grant Edwards               grant.b.edwards        Yow! I Know A Joke!!
                                  at               
                              gmail.com

Reply by David Brown ●September 20, 2010

On 20/09/2010 19:19, Stefan Reuther wrote:

> Remember, we
> don't need 100% reliability. After all, all components have a finite
> life, and the flash just needs to live a little longer than the plug
> connectors or capacitors in the device :-) And by using many bits, I
> believe the chance that they all refuse to flip is low enough.
>

This is what some people here apparently have trouble understanding. 
/Nothing/ is 100% reliable - it's just a matter of taking the 
reliability of your parts into account when designing a complete system.

> It's a flash. It's electrons that tunnel out gradually. It's not an evil
> gnome sitting within the package, deciding "today, I'll annoy the
> engineer in an especially evil twisted way", so while the data sheet
> allows a NAND flash to keep its old contents unmodifiably in a bad
> sector, I assume this doesn't happen in practice. Or, at least, not
> often enough to be observable.
>
>
>    Stefan
>

Reply by Ulf Samuelsson ●September 20, 2010

2010-09-20 20:10, Grant Edwards wrote:
> On 2010-09-20, Ulf Samuelsson<nospam.ulf@atmel.com>  wrote:
>> 2010-09-20 16:18, Grant Edwards wrote:
>
>> If you add an SPI flash (or a dataflash) and plan to boot Linux, you
>> are probably better off by putting also u-boot and u-boot environment
>> in the dataflash. You might also want to consider the kernel.
>
> While the ROM bootloader supports the 25xx series "dataflash" parts we
> got sold, there is no support in the AT91 bootstrap, U-Boot, or Linux
> -- at least not that I could ever find.  I asked about it on the AT91
> forum a few months back and got the usual response (IOW, none at all).
>

It is hidden in a disused lavatory in the cellar marked:
"Beware of the Leopard".

There are three different AT91bootstraps around.

1) The obvious AT91bootstrap you can download from www.atmel.com
2) My derivative of AT91bootstrap which adds Kconfig etc.
    and is used by open source projects like Buildroot and OpenEmbedded.
3) There is normally an AT91bootstrap in the "Softpacks".
    This is different from (1) and (2).
    It supports the 25xx series SPI flash but relies
    on libraries not normally available in arm-linux compilers
    so you may have to compile it using arm-newlib, IAR or Keil.

>> Reason is the SAM-BA S/W which only knows how to erase the complete
>> NAND flash.
>
> Oh, I fixed that ages ago.
>
> I added a few lines of code to the nand-flash, data-flash, and
> serial-flash applets so that they can all erase a region of flash.
>

Nice, how about sharing!

> Then I wrote my own ROM-boot-protocol client in Python.
>

> [Besides the lack of an "erase region" command, SAM-BA won't work at
> all using a serial connection on a Linux host, it's not very usable
> from the command-line, and it isn't very easy to use as a module from
> other programs.]
>


>> If you plan to program the NAND flash using another method, then of
>> course, use the NAND flash for everything except bootstrap.
>
> We'll initially use a "SAM-BA" replacement program to program
> prototypes. Then for production, the plan is to have the distributor
> ship them with U-Boot preprogrammed so that we can use the TFTP server
> in U-Boot to do the rest.
>

Or, if you have an SD connector, you can boot from an SD card or
external SPI flash which then programs the internal flash.

-- 
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB

Reply by Grant Edwards ●September 21, 2010

On 2010-09-20, Ulf Samuelsson <nospam.ulf@atmel.com> wrote:
> 2010-09-20 20:10, Grant Edwards wrote:

>> While the ROM bootloader supports the 25xx series "dataflash" parts we
>> got sold, there is no support in the AT91 bootstrap, U-Boot, or Linux
>> -- at least not that I could ever find.  I asked about it on the AT91
>> forum a few months back and got the usual response (IOW, none at all).
>
> It is hidden in a disused lavatory in the cellar marked:
> "Beware of the Leopard".

Ah!  I just listened to Stephen Fry's audiobook of THHGTTG last
weekend while driving home from Chicago.


> There are three different AT91bootstraps around.
>
> 1) The obvious AT91bootstrap you can download from www.atmel.com

That's the one I looked at.

> 2) My derivative of AT91bootstrap which adds Kconfig etc. and is used
>    by open source projects like Buildroot and OpenEmbedded.

Though I am using Buildroot for my rootfs, I found it more convenient to
build other things (kernel, bootstrap, U-Boot) separately, so I never
really looked into that one.

> 3) There is normally an AT91bootstrap in the "Softpacks". This is
>    different from (1) and (2). It supports the 25xx series SPI flash
>    but relies on libraries not normally available in arm-linux
>    compilers so you may have to compile it using arm-newlib, IAR or
>    Keil.

That's interesting.  I'll keep that in mind.

>>> Reason is the SAM-BA S/W which only knows how to erase the complete
>>> NAND flash.
>>
>> Oh, I fixed that ages ago.
>>
>> I added a few lines of code to the nand-flash, data-flash, and
>> serial-flash applets so that they can all erase a region of flash.
>
> Nice, how about sharing!

Sure.  The changes to the applets can certainly be shared.  I'll have
to check with management regarding my sam-ba client replacement. I
just double-checked, and the erase-region command has been added to
the nandflash and dataflash applets, but it never got added to the
serialflash (AT25xx) applet.

>>> If you plan to program the NAND flash using another method, then of
>>> course, use the NAND flash for everything except bootstrap.
>>
>> We'll initially use a "SAM-BA" replacement program to program
>> prototypes. Then for production, the plan is to have the distributor
>> ship them with U-Boot preprogrammed so that we can use the TFTP
>> server in U-Boot to do the rest.
>
> Or, if you have an SD connector, you can boot from an SD card or
> external SPI flash which then programs the internal flash.

That's also an option, but since we'll have to connect an Ethernet
cable anyway as part of the normal production test process, we want to
use Ethernet as the programming interface as well.

-- 
Grant Edwards               grant.b.edwards        Yow! Is a tattoo real, like
                                  at               a curb or a battleship?
                              gmail.com            Or are we suffering in
                                                   Safeway?

Reply by Marc Jet ●September 21, 2010

> State of the art seems to be to use magic numbers for valid data, and
> destroy the ECC and/or magic numbers for blocks that are gone bad, so
> you can identify them later. That's the "bad block flag".

This seems to be "industry standard" from my experience as well.  But
IMHO it's not a good solution to the problem.

Typical NAND datasheets do not specify the behaviour of bad blocks.
The approach you mention relies on certain behaviour from the bad
blocks (e.g. the ability to be erased or overwritten).  This is why I
think it is a bad approach.

Another approach is the following:

The chip is partitioned into data blocks and spare blocks.  During
mount, all block headers are scanned in a specific order, e.g.
ascending order for data blocks, and descending order for spare
blocks.

Every data block contains a header which holds its physical block
number and a (cryptographic) hash signature.  Blocks without a valid
hash signature are considered bad or stale (e.g. after a power failure
during erase).  In the first pass, every data block that passes this
test is considered valid, until a spare block overrides it.

The spare block header contains its own physical block number, the
physical data block number of the block it replaces, and a hash
signature as well.  If a spare block exists for a data block, the data
block is degraded to "bad", regardless of what the data block's content
claims.  Likewise, if another valid spare block refers to the same data
block, it overrides the previously read spare block (hence the block
scanning order).  After all, what we arbitrarily designated to be
"spare" blocks could be bad blocks too.

This method is able to memorize any combination of up to N bad blocks,
no matter what the bad block behaviour is.  Up to the collision
resistance of the hash algorithm, of course.  You can achieve any
desired reliability by choosing the hash algorithm accordingly.

The key point to understand is that the bad block information should
be stored in the good blocks, not in the bad ones.  The good blocks
are the ones that have their behaviour specified.
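A minimal Python sketch of the mount-time scan described above, kept in memory rather than on real flash. The header layout (a one-byte kind, two 32-bit block numbers, a SHA-256 digest), the names `make_header`, `parse_header` and `mount`, and the choice to hash only the header fields are assumptions for illustration; a real implementation would probably also cover the block payload and read the headers from the device.

```python
import hashlib

HDR_LEN = 1 + 4 + 4 + 32  # kind, own block number, replaced block number, SHA-256


def _digest(kind, phys, target):
    return hashlib.sha256(kind + phys.to_bytes(4, "little")
                          + target.to_bytes(4, "little")).digest()


def make_header(kind, phys, target=0):
    """Build a header; kind is b"D" (data block) or b"S" (spare block)."""
    return (kind + phys.to_bytes(4, "little") + target.to_bytes(4, "little")
            + _digest(kind, phys, target))


def parse_header(raw):
    """Return (kind, phys, target) if the hash checks out, else None."""
    if len(raw) < HDR_LEN:
        return None
    kind = raw[0:1]
    phys = int.from_bytes(raw[1:5], "little")
    target = int.from_bytes(raw[5:9], "little")
    if raw[9:HDR_LEN] != _digest(kind, phys, target):
        return None
    return kind, phys, target


def mount(headers, n_data):
    """Blocks [0, n_data) are data blocks, the rest are spares.

    Returns (mapping, bad): mapping sends each data block number to the
    physical block currently holding it; bad is every untrusted block.
    """
    mapping, bad = {}, set()
    # Pass 1: data blocks, ascending.
    for phys in range(n_data):
        h = parse_header(headers[phys])
        if h is not None and h[0] == b"D" and h[1] == phys:
            mapping[phys] = phys
        else:
            bad.add(phys)  # bad, blank, or stale (e.g. power failure mid-erase)
    # Pass 2: spare blocks, descending; a valid spare scanned later
    # overrides whatever was recorded for the same data block before it.
    for phys in range(len(headers) - 1, n_data - 1, -1):
        h = parse_header(headers[phys])
        if h is None or h[0] != b"S" or h[1] != phys or h[2] >= n_data:
            continue
        replaced = mapping.get(h[2])
        if replaced is not None:
            bad.add(replaced)  # demoted, whatever its own header still claims
        mapping[h[2]] = phys
    return mapping, bad


if __name__ == "__main__":
    blank = b"\xff" * HDR_LEN  # erased block: no valid header
    headers = [make_header(b"D", 0), make_header(b"D", 1), blank,
               make_header(b"D", 3), make_header(b"S", 4, 1),
               make_header(b"S", 5, 1)]
    print(mount(headers, n_data=4))
```

In the example, block 1 still carries an intact data header (as a bad block that silently kept its old contents might), yet it ends up in the bad set because a spare claims it; spare 5 is in turn overridden by spare 4, so `mount` yields the mapping `{0: 0, 1: 4, 3: 3}` and the bad set `{1, 2, 5}`. The decision never depends on anything written into a bad block.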

Reply by Paul ●September 21, 2010

In article <i7aj23$1qo$1@reader1.panix.com>, invalid@invalid.invalid 
says...
> 
> On 2010-09-20, Ulf Samuelsson <nospam.ulf@atmel.com> wrote:
> > 2010-09-20 20:10, Grant Edwards wrote:
> 
> >> While the ROM bootloader supports the 25xx series "dataflash" parts we
> >> got sold, there is no support in the AT91 bootstrap, U-Boot, or Linux
> >> -- at least not that I could ever find.  I asked about it on the AT91
> >> forum a few months back and got the usual response (IOW, none at all).
> >
> > It is hidden in a disused lavatory in the cellar marked:
> > "Beware of the Leopard".
> 
> Ah!  I just listened to Stephen Fry's audiobook of THHGTTG last
> weekend while driving home from Chicago.

So you will also have discovered about

   "There is no UP for rain to fall from, therefore rainfall of the
    universe is none"

Let alone no sex...

Says he the geek with CDs of the original Radio series..
 
-- 
Paul Carpenter          | paul@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/>    PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/>  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate

Reply by rickman ●September 22, 2010

On Sep 19, 10:06 am, David Brown
<david.br...@hesbynett.removethisbit.no> wrote:
> On 19/09/2010 05:09, rickman wrote:
>
> > On Sep 18, 7:34 am, Stefan Reuther<stefan.n...@arcor.de>  wrote:
> >> Allan Herriman wrote:
> >>> On Fri, 17 Sep 2010 12:19:56 -0700, rickman wrote:
> >>>> On Sep 17, 7:52 am, Marc Jet<jetm...@hotmail.com>  wrote:
> >>>>> People commonly expect bad blocks to have more bit errors than their
> >>>>> ECC copes with.  However, nowhere in the datasheets is a guarantee for
> >>>>> this.
> >> [...]
> >>>> Looking back, I never actually used a NAND flash in a design.  I
> >>>> understand how the bad bits would be managed.  But what about bad
> >>>> blocks?  Is this a spec on delivery or is it allowed for blocks to go
> >>>> bad in the field?  I can't see how that could be supported without a
> >>>> very complex scheme along the lines of RAID drives.
>
> >>> It's pretty simple actually.  When the driver reads a block that has an
> >>> error, it copies the corrected contents to an unused block and sets the
> >>> bad block flag in the original block, preventing its reuse.
> >>> No software will ever clear the bad block flag, which means that the
> >>> effective size of the device decreases as blocks go bad in the field.
>
> >> But where do you store the "bad block" flag? It is pretty common to
> >> store it in the bad block itself. The point Marc is making is that this
> >> is not guaranteed to work.
>
> > Why do you need a bad block flag?  If the block has an ECC failure, it
> > is bad and the OS will note that.  You may have to read the block ECC
> > the first time it fails, but after that it can be noted in the file
> > system as not part of a file and not part of free space on the drive.
>
> Failures can be intermittent - a partially failed bit could be read
> correctly or incorrectly depending on the data stored, the temperature,
> or the voltage.  So if you see that you are getting failures, you make a
> note of them and don't use that block again.

That's fine.  But my point is that if the block is "bad", either you
can set a bad block flag or the ECC value will be invalid when
the media is read.  In either case you can flag it in your access
system (don't want to call it a file system) and not use that block
again until the next reboot.  This only has a performance impact at
boot time.  You don't have to *rely* on a bad block flag since that
can also be faulty.  But it can be used in addition to detecting an
ECC error.
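A sketch of the approach described here: rebuild the bad-block set in RAM at every boot from both the in-band flag and an integrity check, and add to it whenever a later read fails. CRC-32 (`zlib.crc32`) stands in for the ECC, so unlike real ECC it only detects errors and cannot correct them; the `Block` layout, the "0xFF means good" marker value, and the class and function names are assumptions for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    payload: bytes
    stored_crc: int      # check value written alongside the data
    marker: int = 0xFF   # assumption: any other value flags the block as bad


def looks_good(block):
    # Use the flag in addition to the check rather than relying on it:
    # reject the block if either one says it is bad.
    return block.marker == 0xFF and zlib.crc32(block.payload) == block.stored_crc


class BlockAccess:
    def __init__(self, blocks):
        self.blocks = blocks
        # Built at boot and kept only in RAM, never written back to flash.
        self.bad = {i for i, b in enumerate(blocks) if not looks_good(b)}

    def read(self, i):
        if i in self.bad:
            raise OSError(f"block {i} is known bad")
        block = self.blocks[i]
        if zlib.crc32(block.payload) != block.stored_crc:
            self.bad.add(i)  # remembered until the next boot-time rescan
            raise OSError(f"block {i} failed its check")
        return block.payload


def good_block(data):
    """Helper for building a block whose check value matches its payload."""
    return Block(data, zlib.crc32(data))
```

The only cost is the scan in `BlockAccess.__init__`, which matches the "performance impact at boot time" point above.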

Rick

Reply by rickman ●September 22, 2010

On Sep 20, 1:01 pm, Stefan Reuther <stefan.n...@arcor.de> wrote:
> [I haven't got rickman's post.]
>
> David Brown wrote:
> > On 19/09/2010 05:09, rickman wrote:
>
> >> Why do you need a bad block flag?  If the block has an ECC failure, it
> >> is bad and the OS will note that.  You may have to read the block ECC
> >> the first time it fails, but after that it can be noted in the file
> >> system as not part of a file and not part of free space on the drive.
>
> How do you mark it "in the file system" if your file system is actually
> inside the NAND flash? Thought experiment: your bad block table is
> stored in a particular block. That block goes bad. Where do you mark
> that this block is now bad?

Sure, if that is your file system, then it doesn't work very well for
NAND flash, does it?  The bad sector 0 problem is one that hard drives
have to this day, don't they?  Or maybe the internal controller can
remap that "invisibly" now that there are tons of embedded smarts in
them.  But that is the point.  If your device can't "fix" a bad block
in the lowest level of the file system on the drive, then it is
subject to failure.  If on boot, the software does what it has to do
to recover the structure of the file system, then it will be robust.

> State of the art seems to be to use magic numbers for valid data, and
> destroy the ECC and/or magic numbers for blocks that are gone bad, so
> you can identify them later. That's the "bad block flag".

Yes, and once you find a "bad block" the "access system" (I really
shouldn't call it a file system since you might not be working at that
level) will have to remember this block in memory, not on the drive.
Each time the system is booted, it will have to either read a valid
bad block table, or construct its own.  I suppose that each time the
device needs a new block to write data it could search for a working
block.  That would be a very primitive system as well as slow, but it
would work and would not require a bad block table.

BTW, I assume that in order to trust a block on a NAND drive each
write would need to be verified in some manner.  Is that also included
in a NAND access system?
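Besides the pass/fail status that NAND parts report after a program or erase operation, a simple, conservative check is to read the data back and compare. The sketch below combines that with the "search for a working block" idea above: try candidate blocks in order, erase, program, verify, and retire any block whose read-back does not match. `FakeNand` and its `stuck` blocks, which silently ignore erase and program and keep their old contents (the worst case Stefan mentions the data sheet allowing), are hypothetical, as are the function names.

```python
class FakeNand:
    """Hypothetical NAND model; blocks in `stuck` ignore erase and program."""

    def __init__(self, n_blocks, block_size, stuck=()):
        self.block_size = block_size
        self.stuck = set(stuck)
        # Stuck blocks keep whatever old contents they had (zeros here).
        self.data = [bytearray(b"\x00" * block_size) if b in self.stuck
                     else bytearray(b"\xff" * block_size)
                     for b in range(n_blocks)]

    def erase(self, blk):
        if blk not in self.stuck:
            self.data[blk][:] = b"\xff" * self.block_size

    def program(self, blk, payload):
        if blk in self.stuck:
            return
        # Programming can only clear bits (1 -> 0), hence the AND.
        for i, byte in enumerate(payload):
            self.data[blk][i] &= byte

    def read(self, blk):
        return bytes(self.data[blk])


def write_verified(nand, payload, candidates, bad):
    """Write payload to the first candidate block that verifies; return it."""
    if len(payload) > nand.block_size:
        raise ValueError("payload larger than one block")
    for blk in candidates:
        if blk in bad:
            continue
        nand.erase(blk)
        nand.program(blk, payload)
        if nand.read(blk)[:len(payload)] == payload:
            return blk
        bad.add(blk)  # failed verification: never use this block again
    raise OSError("no working block left")
```

Verification costs one extra read per write; in exchange, a block that misbehaves in any way the model allows is caught before anything depends on its contents.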

> > Failures can be intermittent - a partially failed bit could be read
> > correctly or incorrectly depending on the data stored, the temperature,
> > or the voltage.  So if you see that you are getting failures, you make a
> > note of them and don't use that block again.
>
> From what I've seen, those temporary failed bits are still within the
> specs of the NAND flash as long as you're running the part within specs.
> However, when you're way out of spec (say, 30°C over limit), all hell
> breaks loose.

Not sure what you mean by "within the specs".  Are you saying the spec
allows some level of intermittent failure on reads and/or writes?  If
so, there is still some level of intermittent failure that would be
outside the spec and would need to be flagged as bad.

Rick

Reply by David Brown ●September 22, 2010

On 22/09/2010 18:09, rickman wrote:
> On Sep 19, 10:06 am, David Brown
> <david.br...@hesbynett.removethisbit.no>  wrote:
>> On 19/09/2010 05:09, rickman wrote:
>>
>>
>>
>>> On Sep 18, 7:34 am, Stefan Reuther<stefan.n...@arcor.de>    wrote:
>>>> Allan Herriman wrote:
>>>>> On Fri, 17 Sep 2010 12:19:56 -0700, rickman wrote:
>>>>>> On Sep 17, 7:52 am, Marc Jet<jetm...@hotmail.com>    wrote:
>>>>>>> People commonly expect bad blocks to have more bit errors than their
>>>>>>> ECC copes with.  However, nowhere in the datasheets is a guarantee for
>>>>>>> this.
>>>> [...]
>>>>>> Looking back, I never actually used a NAND flash in a design.  I
>>>>>> understand how the bad bits would be managed.  But what about bad
>>>>>> blocks?  Is this a spec on delivery or is it allowed for blocks to go
>>>>>> bad in the field?  I can't see how that could be supported without a
>>>>>> very complex scheme along the lines of RAID drives.
>>
>>>>> It's pretty simple actually.  When the driver reads a block that has an
>>>>> error, it copies the corrected contents to an unused block and sets the
>>>>> bad block flag in the original block, preventing its reuse.
>>>>> No software will ever clear the bad block flag, which means that the
>>>>> effective size of the device decreases as blocks go bad in the field.
>>
>>>> But where do you store the "bad block" flag? It is pretty common to
>>>> store it in the bad block itself. The point Marc is making is that this
>>>> is not guaranteed to work.
>>
>>> Why do you need a bad block flag?  If the block has an ECC failure, it
>>> is bad and the OS will note that.  You may have to read the block ECC
>>> the first time it fails, but after that it can be noted in the file
>>> system as not part of a file and not part of free space on the drive.
>>
>> Failures can be intermittent - a partially failed bit could be read
>> correctly or incorrectly depending on the data stored, the temperature,
>> or the voltage.  So if you see that you are getting failures, you make a
>> note of them and don't use that block again.
>
> That's fine.  But my point is that if the block is "bad", either you
> can set a bad block flag or the ECC value will be invalid when
> the media is read.  In either case you can flag it in your access
> system (don't want to call it a file system) and not use that block
> again until the next reboot.  This only has a performance impact at
> boot time.  You don't have to *rely* on a bad block flag since that
> can also be faulty.  But it can be used in addition to detecting an
> ECC error.
>
> Rick

You can track bad blocks in all sorts of different ways.  Some will 
involve more work when the bad block is discovered, others will involve 
more checking before using a block.  But any file system, or "access 
system" if you like, has to have some way of tracking whether a block is 
in use or not.  If you think of bad blocks as being in use in a special 
file that can't be accessed normally, then you have got simple and 
efficient bad block tracking (at least, it's as simple and efficient as 
the rest of your file system).
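A minimal sketch of this idea, assuming a simple per-block ownership table: a retired block is assigned to a reserved owner (the hypothetical `BAD_BLOCKS` name below), so the ordinary allocator never hands it out and no separate bad-block check is needed. The table lives in RAM here; where and how it is persisted on the flash itself is the question Stefan raises earlier, and this sketch does not address it.

```python
FREE = None
BAD_BLOCKS = "<bad-blocks>"  # hypothetical reserved owner, never opened as a normal file


class BlockMap:
    def __init__(self, n_blocks):
        self.owner = [FREE] * n_blocks  # owner[blk] is a file id or FREE

    def allocate(self, file_id):
        for blk, owner in enumerate(self.owner):
            if owner is FREE:
                self.owner[blk] = file_id
                return blk
        raise OSError("no free blocks")

    def release(self, blk):
        # Bad blocks stay "in use" by the hidden file forever.
        if self.owner[blk] != BAD_BLOCKS:
            self.owner[blk] = FREE

    def retire(self, blk):
        # The caller is assumed to have moved any live data elsewhere first.
        self.owner[blk] = BAD_BLOCKS

    def bad_blocks(self):
        return [blk for blk, owner in enumerate(self.owner) if owner == BAD_BLOCKS]
```

Because `allocate` only ever looks for `FREE` entries, bad-block handling adds no code to the allocation path; it costs exactly as much as tracking any other file's blocks.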