Cache memory is a small, fast memory layer placed between a processor and slower main memory (Flash or DRAM) that transparently stores recently or frequently accessed instructions and data to reduce average access latency. On MCUs and application processors that include a cache, hits are served in one to a few cycles instead of the tens to hundreds of cycles a main-memory fetch would require.
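The effect on average latency can be estimated with the usual average-memory-access-time relation: hit time plus miss rate times miss penalty. The sketch below is illustrative only; the function name is hypothetical and the cycle counts in the note that follows are assumed values, not measurements from any particular part.

```c
#include <assert.h>

/* Average memory access time, in cycles:
 *   AMAT = hit_cycles + miss_rate * miss_penalty_cycles
 * miss_rate is the fraction of accesses that miss (0.0 to 1.0). */
double cache_amat_cycles(double hit_cycles,
                         double miss_rate,
                         double miss_penalty_cycles)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_cycles + miss_rate * miss_penalty_cycles;
}
```

With an assumed 1-cycle hit, a 5% miss rate, and a 40-cycle miss penalty, this works out to 1 + 0.05 × 40 = 3 cycles per access on average, compared with roughly 40 cycles if every access went to main memory.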
In practice
Cache memory appears most often in processors fast enough that main memory becomes a bottleneck. ARM Cortex-M7 cores include optional instruction and data caches (I-Cache and D-Cache, typically 4–64 KB each), and Cortex-A-class SoCs commonly carry L1 caches of 32–64 KB per core plus a shared L2 of 256 KB to several MB. Simpler Cortex-M0/M3/M4 cores and most 8/16-bit MCUs (PIC, MSP430, AVR) do not include a core-level cache, although some vendors add a flash accelerator or small system cache outside the core (the ART Accelerator on STM32F4, for example). Where no cache exists, the concepts in this entry do not apply.
The most common pitfall in embedded cache use is cache coherency with DMA. When a DMA controller writes to a memory region that the CPU has cached, the cache may still hold the old data, and the CPU will read stale values. Conversely, a CPU write to a cached region may sit in the D-Cache without reaching RAM before the DMA controller reads it. Correct practice is either to mark DMA buffers as non-cacheable (via the MPU on Cortex-M7) or to perform explicit cache maintenance around each transfer: clean the relevant cache lines before the DMA controller reads a buffer the CPU has written, and invalidate them before the CPU reads a buffer the DMA controller has written. Getting this wrong is a frequent source of intermittent, hard-to-reproduce bugs.
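A minimal sketch of the maintenance approach for a Cortex-M7, assuming a 32-byte D-Cache line. The helper names (`cache_range_for`, `cache_buffer_is_line_exclusive`, `dma_tx_prepare`, `dma_rx_complete`) are hypothetical. The range arithmetic compiles anywhere; the part that calls CMSIS is compiled only when the vendor device header has already been included and defines `__DCACHE_PRESENT`. The exact parameter types of the CMSIS by-address functions differ between CMSIS versions, so treat the casts as an assumption to check against your copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_BYTES 32u   /* Cortex-M7 D-Cache line size */

/* A cache-line-aligned address range that covers a buffer. */
typedef struct {
    uintptr_t start;    /* rounded down to a line boundary */
    size_t    length;   /* whole number of cache lines, in bytes */
} cache_range_t;

/* Expand [addr, addr + len) so it starts and ends on line boundaries. */
cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(DCACHE_LINE_BYTES - 1u);
    cache_range_t r;
    uintptr_t end = (addr + len + DCACHE_LINE_BYTES - 1u) & mask;

    r.start  = addr & mask;
    r.length = (size_t)(end - r.start);
    return r;
}

/* Nonzero if the buffer starts on a line boundary and fills whole lines,
 * so invalidating it cannot discard a neighbouring variable's dirty data. */
int cache_buffer_is_line_exclusive(uintptr_t addr, size_t len)
{
    return (addr % DCACHE_LINE_BYTES == 0u) && (len % DCACHE_LINE_BYTES == 0u);
}

#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1U)
/* Target-only: assumes the vendor device header (which pulls in the CMSIS
 * core header) was included before this point. */

/* Call before starting a DMA transfer that READS buf from memory. */
void dma_tx_prepare(const void *buf, size_t len)
{
    cache_range_t r = cache_range_for((uintptr_t)buf, len);
    SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.length);
}

/* Call after a DMA transfer that WROTE buf has completed,
 * before the CPU reads the received data. */
void dma_rx_complete(void *buf, size_t len)
{
    assert(cache_buffer_is_line_exclusive((uintptr_t)buf, len));
    SCB_InvalidateDCache_by_Addr((uint32_t *)buf, (int32_t)len);
}
#endif
```

Cleaning a slightly larger, line-aligned range is harmless, which is why the transmit path rounds outward. Invalidation is different: it discards dirty data, so the receive path insists that the buffer owns every cache line it touches. Receive buffers should also hold no dirty lines when the transfer starts, since a later eviction could overwrite data the DMA controller has just written.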
Memory-mapped peripheral registers should not be cached, because a cached read of a status register may return the value captured by an earlier read rather than the current hardware state. On Cortex-M7 and Cortex-A parts, peripheral address regions are typically configured as Device or Strongly-Ordered memory through the MPU/MMU, which prevents caching. This hazard is related to, but distinct from, compiler and hardware reordering of peripheral accesses; both must be handled for register access to be correct.
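As a rough illustration of how the memory type is selected on an ARMv7-M MPU (Cortex-M7), the sketch below builds the value for the region attribute and size register (MPU_RASR) from the TEX, C, and B attribute bits. The enum and function names are hypothetical, the access permission is fixed to full access for brevity, and the encoding does not apply to Cortex-A MMU page tables.

```c
#include <assert.h>
#include <stdint.h>

/* Memory types selectable through the TEX/C/B bits of MPU_RASR (ARMv7-M). */
typedef enum {
    MEM_STRONGLY_ORDERED,     /* TEX=000, C=0, B=0 */
    MEM_DEVICE_SHAREABLE,     /* TEX=000, C=0, B=1 */
    MEM_NORMAL_NONCACHEABLE   /* TEX=001, C=0, B=0 */
} mem_type_t;

/* Build an MPU_RASR value for an enabled region of 2^size_log2 bytes
 * with full read/write access (AP = 0b011). */
uint32_t mpu_rasr_value(mem_type_t type, uint32_t size_log2, int execute_never)
{
    uint32_t tex = 0u;
    uint32_t b = 0u;
    const uint32_t c = 0u;   /* C = 0 for all three types: never cacheable */

    assert(size_log2 >= 5u && size_log2 <= 32u);   /* 32 bytes to 4 GB */

    switch (type) {
    case MEM_STRONGLY_ORDERED:                 break;
    case MEM_DEVICE_SHAREABLE:    b = 1u;      break;
    case MEM_NORMAL_NONCACHEABLE: tex = 1u;    break;
    }

    return ((uint32_t)(execute_never ? 1u : 0u) << 28)  /* XN            */
         | ((uint32_t)3u << 24)                         /* AP: full access */
         | (tex << 19)                                  /* TEX           */
         | (c << 17)                                    /* C             */
         | (b << 16)                                    /* B             */
         | ((size_log2 - 1u) << 1)                      /* SIZE field    */
         | 1u;                                          /* region ENABLE */
}
```

For example, `mpu_rasr_value(MEM_DEVICE_SHAREABLE, 29u, 1)` describes a 512 MB Device region such as the peripheral space starting at 0x40000000, and `mpu_rasr_value(MEM_NORMAL_NONCACHEABLE, 15u, 1)` describes a 32 KB non-cacheable RAM region suitable for DMA buffers. On a target the value would be written to the MPU together with a region base address, for instance through the CMSIS `ARM_MPU_SetRegion()` helper, followed by the usual barrier instructions.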
Execution performance from cache is highly workload-dependent. Code with tight loops and localized data access (high temporal and spatial locality) benefits greatly; code with large lookup tables, scattered linked-list traversals, or frequent context switches between large working sets may see little benefit or even pathological behavior from cache thrashing. Benchmarking with the cache enabled and disabled is the reliable way to quantify impact on a specific target.
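One simple way to see the locality effect is to walk the same array in two different orders and time each walk with the cache enabled and then disabled. The sketch below is a hypothetical micro-benchmark kernel, not a calibrated test: the array dimensions are arbitrary, and the timing itself is left to whatever cycle counter the target provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 64u
#define COLS 64u

/* Row-major walk: consecutive accesses are adjacent in memory, so each
 * cache line fetched is fully used before moving on (spatial locality). */
uint32_t sum_row_major(const uint32_t *m)
{
    uint32_t sum = 0u;
    assert(m != NULL);
    for (size_t r = 0; r < ROWS; r++) {
        for (size_t c = 0; c < COLS; c++) {
            sum += m[r * COLS + c];
        }
    }
    return sum;
}

/* Column-major walk over the same data: consecutive accesses are
 * COLS * 4 bytes apart, so each one lands in a different cache line. */
uint32_t sum_col_major(const uint32_t *m)
{
    uint32_t sum = 0u;
    assert(m != NULL);
    for (size_t c = 0; c < COLS; c++) {
        for (size_t r = 0; r < ROWS; r++) {
            sum += m[r * COLS + c];
        }
    }
    return sum;
}
```

Both functions return the same sum, so any difference in run time comes from the access pattern. On a Cortex-M7 the DWT cycle counter (`DWT->CYCCNT`) is one common way to time them; how large the gap is depends on the cache size, the array size, and where the array is placed.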
Frequently asked
How do I know if my MCU has a cache?
Check the core-level documentation first. ARM Cortex-M0, M0+, M3, and M4 cores do not include a hardware cache. Cortex-M7 includes optional I-Cache and D-Cache; whether they are present depends on the SoC vendor (STM32H7 and STM32F7 include them; some cost-reduced M7 variants may not). Cortex-A and Cortex-R cores typically include L1 caches, though the exact configuration depends on the specific implementation and SoC vendor. The device reference manual or datasheet will list cache size under the memory system or core subsystem section.
Do I need to enable the cache, or is it on by default?
On most Cortex-M7 devices, both the I-Cache and D-Cache are disabled after reset and must be explicitly enabled by firmware, typically via SCB_EnableICache() and SCB_EnableDCache() in CMSIS. Cortex-A platforms managed by an OS usually enable caches early in the boot sequence. Verify your startup code or BSP, because running with the cache accidentally disabled on a 400+ MHz M7 can make code run three to ten times slower than expected.
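A small sketch of how startup code might verify the cache state on a Cortex-M7. The enable bits live in the Configuration and Control Register (`SCB->CCR` in CMSIS), with DC at bit 16 and IC at bit 17. The type and function names here are hypothetical, and the register value is passed in as a plain integer so the decoding logic can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Cache enable bits in the Cortex-M7 Configuration and Control Register. */
#define CCR_DC_MASK (UINT32_C(1) << 16)   /* D-Cache enable */
#define CCR_IC_MASK (UINT32_C(1) << 17)   /* I-Cache enable */

typedef struct {
    int icache_on;
    int dcache_on;
} cache_state_t;

/* Decode a CCR value; on target, pass SCB->CCR. */
cache_state_t cache_state_from_ccr(uint32_t ccr)
{
    cache_state_t s;
    s.icache_on = (ccr & CCR_IC_MASK) != 0u;
    s.dcache_on = (ccr & CCR_DC_MASK) != 0u;
    return s;
}

/* Debug-build check: trips an assertion if either cache is still off. */
void cache_assert_enabled(uint32_t ccr)
{
    cache_state_t s = cache_state_from_ccr(ccr);
    assert(s.icache_on && s.dcache_on);
    (void)s;   /* avoids an unused-variable warning when NDEBUG is defined */
}
```

A typical use would be to call `SCB_EnableICache()` and `SCB_EnableDCache()` early in `main()` and then `cache_assert_enabled(SCB->CCR)` in debug builds, so that a BSP change which silently drops the enable calls is caught immediately.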
What is the difference between a cache clean, a cache invalidate, and a cache flush?
Invalidate marks cache lines as invalid so the next access fetches fresh data from main memory; it discards any dirty (modified) data without writing it back. Clean writes dirty lines back to main memory but leaves them valid in the cache. Flush is an informal term that some vendors use to mean clean-and-invalidate and others use to mean just invalidate -- always check the vendor's documentation for the exact operation. On Cortex-M7, CMSIS provides SCB_CleanDCache_by_Addr(), SCB_InvalidateDCache_by_Addr(), and SCB_CleanInvalidateDCache_by_Addr() for range operations on specific buffers.
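The three operations are easier to tell apart with a toy model. The sketch below simulates one word of RAM and a single write-back cache line that may shadow it; it is a teaching aid with hypothetical names, not a model of any real cache controller. The DMA functions touch only RAM, while the CPU functions go through the cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One memory word plus the single write-back cache line that may shadow it. */
typedef struct {
    uint32_t ram;     /* value in main memory (what a DMA controller sees) */
    uint32_t line;    /* value held in the cache line */
    int      valid;   /* line holds a copy of the word */
    int      dirty;   /* line was modified and not yet written back */
} toy_cache_t;

uint32_t toy_cpu_read(toy_cache_t *c)
{
    assert(c != NULL);
    if (!c->valid) {              /* miss: fill the line from RAM */
        c->line  = c->ram;
        c->valid = 1;
        c->dirty = 0;
    }
    return c->line;               /* hit: RAM is not consulted */
}

void toy_cpu_write(toy_cache_t *c, uint32_t v)
{
    assert(c != NULL);
    c->line  = v;                 /* write-back: RAM is not updated yet */
    c->valid = 1;
    c->dirty = 1;
}

void toy_dma_write(toy_cache_t *c, uint32_t v) { c->ram = v; }
uint32_t toy_dma_read(const toy_cache_t *c)    { return c->ram; }

/* Clean: write a dirty line back to RAM, keep it valid. */
void toy_clean(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        c->ram   = c->line;
        c->dirty = 0;
    }
}

/* Invalidate: drop the line, discarding any dirty data. */
void toy_invalidate(toy_cache_t *c)
{
    c->valid = 0;
    c->dirty = 0;
}

/* Clean-and-invalidate: write back first, then drop the line. */
void toy_clean_invalidate(toy_cache_t *c)
{
    toy_clean(c);
    toy_invalidate(c);
}
```

In this model, a `toy_dma_write()` after the CPU has read the word leaves the CPU seeing the old value until `toy_invalidate()` is called, and a `toy_cpu_write()` is invisible to `toy_dma_read()` until `toy_clean()` is called. Calling `toy_invalidate()` on a dirty line loses the CPU's write, which is why invalidate alone is unsafe on lines that may hold modified data.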
Can the cache affect interrupt latency?
Yes, in two ways. A cache miss on the instruction fetch for an ISR entry point adds latency, though on Cortex-M7 the branch predictor and prefetch logic usually mitigate this for frequently taken interrupts. More subtly, a D-Cache clean or invalidate operation covering a large region (called from a low-priority task, for example) is not interruptible at the cache-line level on all implementations, which can introduce jitter. For hard real-time systems, keep DMA buffer regions small and prefer MPU-based non-cacheable regions over large software maintenance operations.
Does the cache affect the behavior of volatile variables?
The C volatile qualifier tells the compiler to perform every read and write of the variable as an actual memory access, which prevents the compiler from keeping the value in registers or optimizing accesses away. However, volatile does not control the hardware cache. On a Cortex-M7 with D-Cache enabled, a volatile read of a cached memory location still returns the value from the cache, not necessarily from RAM. For peripheral registers this is addressed by marking the peripheral region as non-cacheable in the MPU. For shared memory used with DMA or between cores, cache maintenance operations or non-cacheable MPU regions are needed in addition to volatile.
Differentiators vs similar concepts
Cache memory is sometimes confused with tightly coupled memory (TCM). TCM (ITCM and DTCM on Cortex-M7 and Cortex-R cores) is also fast, low-latency memory close to the core, but it is explicitly addressed -- the programmer places code or data there intentionally by linking to the TCM address range. Cache is transparent: the hardware decides what to store there based on access patterns, and software generally has no direct control over which lines are resident at any moment. TCM gives deterministic, jitter-free latency; cache gives probabilistic latency improvement that depends on access locality. The two can coexist on the same device (STM32H7, for example, has both ITCM/DTCM and I-Cache/D-Cache) and are complementary: TCM is preferred for ISRs and hard-real-time code; cache benefits general-purpose code running from Flash or SDRAM.