Scratchpad memory

Latest revision as of 21:25, 1 March 2024

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is an internal memory, usually high-speed, used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor (or CPU), scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. The name comes from the everyday scratchpad, a pad of paper for preliminary notes, sketches, or writings, which it resembles in both use and size. When the scratchpad is a hidden portion of the main memory, it is sometimes referred to as bump storage.

In some systems a scratchpad can be considered similar to the L1 cache in that it is the next closest memory to the ALU after the processor registers, but with explicit instructions to move data to and from main memory, often using DMA-based data transfer. In contrast to a system that uses caches, a system with scratchpads has non-uniform memory access (NUMA) latencies, because the access latencies to the different scratchpads and to main memory vary. Another difference from a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in main memory.
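The explicit movement of data described above can be sketched in C. This is only an illustration under stated assumptions: the scratchpad is modeled as an ordinary static array (`spm`, a hypothetical name), and `dma_copy` is a stand-in built on `memcpy` rather than a real DMA engine. The point it shows is that software, not hardware, decides what is resident in the scratchpad at any time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scratchpad of 1024 words, modeled as a static array.
   On real hardware this would be a fixed on-chip address range. */
#define SPM_WORDS 1024u
static uint32_t spm[SPM_WORDS];

/* Stand-in for a DMA transfer (assumption: a real system would program
   a DMA engine and wait for it to finish). */
static void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Sum an array that lives in main memory by staging it through the
   scratchpad one chunk at a time. */
uint32_t spm_sum(const uint32_t *main_mem, size_t count)
{
    assert(main_mem != NULL || count == 0);

    uint32_t total = 0;
    for (size_t done = 0; done < count; ) {
        size_t chunk = count - done;
        if (chunk > SPM_WORDS)
            chunk = SPM_WORDS;

        /* Explicit copy-in: nothing reaches the scratchpad unless
           software puts it there. */
        dma_copy(spm, main_mem + done, chunk * sizeof spm[0]);

        /* Compute only on scratchpad-resident data. */
        for (size_t i = 0; i < chunk; i++)
            total += spm[i];

        done += chunk;
    }
    return total;
}
```

Because the result here is a single value, nothing is copied back; a computation that modifies the staged data would need a matching explicit copy-out.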

Scratchpads are employed to simplify caching logic and to guarantee that a unit can work without main memory contention in a system employing multiple processors, especially in multiprocessor systems-on-chip for embedded systems. They are best suited to storing temporary results (such as those found in the CPU stack) that typically do not need to be committed to main memory; however, when fed by DMA, they can also be used in place of a cache to mirror the state of slower main memory. The same issues of locality of reference apply to efficiency of use, although some systems allow strided DMA to access rectangular data sets. Unlike caches, scratchpads are explicitly manipulated by applications. They may be useful for real-time applications, where predictable timing is hindered by cache behavior.
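A strided transfer of a rectangular data set can be illustrated as follows. This is a sketch, not any particular DMA controller's interface: `dma_copy_2d` is a hypothetical helper that emulates in software what a strided DMA engine would do in hardware, copying a tile row by row out of a larger row-major image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a rows x cols tile between two row-major buffers.
   src_stride and dst_stride are the row pitches, in elements, of the
   source and destination buffers. A strided DMA engine would perform
   the same per-row address stepping in hardware. */
void dma_copy_2d(uint16_t *dst, size_t dst_stride,
                 const uint16_t *src, size_t src_stride,
                 size_t rows, size_t cols)
{
    assert(cols <= src_stride && cols <= dst_stride);

    for (size_t r = 0; r < rows; r++) {
        /* Each row is contiguous; consecutive rows are one stride apart. */
        memcpy(dst + r * dst_stride,
               src + r * src_stride,
               cols * sizeof *dst);
    }
}
```

For example, fetching a 3×4 tile whose top-left corner is at row 2, column 3 of an 8-column image would pass `image + 2 * 8 + 3` as the source with a source stride of 8 and a destination stride of 4, so the tile lands densely packed in the scratchpad.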

Scratchpads are generally not used in mainstream desktop processors, where generality is required so that legacy software can run from generation to generation even as the available on-chip memory size changes. They are better suited to embedded systems, special-purpose processors and game consoles, where chips are often manufactured as MPSoCs and where software is often tuned to one hardware configuration.

Examples of use

  • Fairchild F8 of 1975 contained 64 bytes of scratchpad.
  • The TI-99/4A has 256 bytes of scratchpad memory on the 16-bit bus, which holds the processor registers of the TMS9900.
  • Cyrix 6x86 is the only x86-compatible desktop processor to incorporate a dedicated scratchpad.
  • SuperH, used in Sega's consoles, could lock cachelines to an address outside of main memory for use as a scratchpad.
  • Sony's PS1's R3000 had a scratchpad instead of an L1 cache. It was possible to place the CPU stack here, an example of the temporary workspace usage.
  • Adapteva's Epiphany parallel coprocessor features local-stores for each core, connected by a network on a chip, with DMA possible between them and off-chip links (possibly to DRAM). The architecture is similar to Sony's Cell, except all cores can directly address each other's scratchpads, generating network messages from standard load/store instructions.
  • Sony's PS2 Emotion Engine includes a 16 KB scratchpad; DMA transfers can be issued between it and both the GS and main memory.
  • Cell's SPEs are restricted purely to working in their "local-store", relying on DMA for transfers from/to main memory and between local stores, much like a scratchpad. In this regard, additional benefit is derived from the lack of hardware to check and update coherence between multiple caches: the design takes advantage of the assumption that each processor's workspace is separate and private. It is expected this benefit will become more noticeable as the number of processors scales into the "many-core" future. Yet because some hardware logic is eliminated, the data and instructions of applications on SPEs must be managed through software if the whole task on an SPE cannot fit in the local store.
  • Many other processors allow L1 cache lines to be locked.
  • Most digital signal processors use a scratchpad. Many past 3D accelerators and game consoles (including the PS2) have used DSPs for vertex transformations. This differs from the stream-based approach of modern GPUs, which have more in common with a CPU cache's functions.
  • NVIDIA's 8800 GPU running under CUDA provides 16 KB of scratchpad (NVIDIA calls it Shared Memory) per thread-bundle when being used for GPGPU tasks. A scratchpad was also used in the later Fermi GPUs (GeForce 400 series).
  • Ageia's PhysX chip includes a scratchpad RAM in a manner similar to the Cell; the theory of this specific physics processing unit is that a cache hierarchy is of less use than software-managed physics and collision calculations. These memories are also banked, and a switch manages transfers between them.
  • Intel's Knights Landing processor has a 16 GB MCDRAM that can be configured as either a cache, scratchpad memory, or divided into some cache and some scratchpad memory.
  • Movidius Myriad 2, a vision processing unit, is organized as a multicore architecture with a large multiported shared scratchpad.
  • Graphcore has designed an AI accelerator based on scratchpad memories.

Alternatives

Cache control vs scratchpads

Some architectures, such as PowerPC, attempt to avoid the need for cacheline locking or scratchpads through the use of cache control instructions. By marking an area of memory with "Data Cache Block: Zero" (allocating a line but setting its contents to zero instead of loading from main memory) and discarding it after use with "Data Cache Block: Invalidate" (signaling that main memory does not need to receive any updated data), the cache is made to behave as a scratchpad. Generality is maintained in that these are hints and the underlying hardware will function correctly regardless of actual cache size.
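The claim-then-discard pattern can be sketched in C as follows. Several things here are assumptions made for illustration: the 128-byte line size (real PowerPC parts differ, and `dcbz` always acts on the true line size), the helper names `scratch_claim` and `scratch_discard`, and the `SCRATCH_SUPERVISOR` macro, which is hypothetical and exists only because `dcbi` is a privileged instruction that cannot be issued from ordinary user code. On non-PowerPC targets the sketch falls back to a portable emulation of the visible effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache line size; must match the target CPU for the PowerPC
   path to be correct. */
#define LINE_BYTES 128u

/* Line-aligned region to be used as temporary workspace. */
static _Alignas(LINE_BYTES) uint8_t scratch[4 * LINE_BYTES];

/* Claim a line-aligned region: allocate each line in the cache zeroed,
   without fetching its old contents from main memory. */
void scratch_claim(void *p, size_t bytes)
{
    assert((uintptr_t)p % LINE_BYTES == 0 && bytes % LINE_BYTES == 0);
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        uint8_t *line = (uint8_t *)p + off;
#if defined(__powerpc__) || defined(__powerpc64__)
        __asm__ volatile ("dcbz 0,%0" : : "r"(line) : "memory");
#else
        memset(line, 0, LINE_BYTES);  /* emulates the visible effect only */
#endif
    }
}

/* Discard the region after use so its contents need not be written back.
   Without the (hypothetical) SCRATCH_SUPERVISOR build flag this is a
   no-op, and the data is simply evicted normally later. */
void scratch_discard(void *p, size_t bytes)
{
    assert((uintptr_t)p % LINE_BYTES == 0 && bytes % LINE_BYTES == 0);
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        uint8_t *line = (uint8_t *)p + off;
#if (defined(__powerpc__) || defined(__powerpc64__)) && defined(SCRATCH_SUPERVISOR)
        __asm__ volatile ("dcbi 0,%0" : : "r"(line) : "memory");
#else
        (void)line;
#endif
    }
}
```

After `scratch_discard`, the contents of the region should be treated as undefined; code must not rely on values written before the discard.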

Shared L2 vs Cell local stores

Regarding interprocessor communication in a multicore setup, there are similarities between the Cell's inter-local-store DMA and a shared L2 cache setup as in the Intel Core 2 Duo or the Xbox 360's custom PowerPC: the L2 cache allows processors to share results without those results having to be committed to main memory. This can be an advantage where the working set for an algorithm encompasses the entirety of the L2 cache. However, when a program is written to take advantage of inter-local-store DMA, the Cell has the benefit that each local store serves both as the private workspace for a single processor and as a point of sharing between processors; that is, viewed from one processor, the other local stores are on a similar footing to the shared L2 cache in a conventional chip. The tradeoff is memory wasted in buffering and added programming complexity for synchronization, though this would be similar to precached pages in a conventional chip. Domains where using this capability is effective include:

  • Pipeline processing (where one achieves the same effect as increasing the L1 cache's size by splitting one job into smaller chunks)
  • Extending the working set, e.g., a sweet spot for a merge sort where the data fits within 8×256 KB
  • Shared code uploading, such as loading a piece of code to one SPU and then copying it from there to the others to avoid hitting the main memory again
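Chunked processing of this kind is often structured with double buffering, so that one buffer is being filled while the other is being processed. The following C sketch shows the control flow only and is not the Cell's actual DMA interface: `dma_start` and `dma_wait` are hypothetical stand-ins (here `dma_start` is a synchronous `memcpy` and `dma_wait` does nothing), and the two local buffers are ordinary static arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 256u

/* Two hypothetical local-store buffers, used alternately. */
static int32_t buf[2][CHUNK];

/* Stand-ins for an asynchronous DMA engine (assumptions): a real
   dma_start would return immediately, and dma_wait would block until
   all outstanding transfers have completed. */
static void dma_start(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}
static void dma_wait(void) { }

static size_t chunk_len(size_t count, size_t off)
{
    return count - off < CHUNK ? count - off : CHUNK;
}

/* Multiply every element of data (in main memory) by factor, streaming
   the array through the two local buffers. */
void scale_stream(int32_t *data, size_t count, int32_t factor)
{
    assert(data != NULL || count == 0);
    size_t nchunks = (count + CHUNK - 1) / CHUNK;
    if (nchunks == 0)
        return;

    /* Prime the pipeline with the first chunk. */
    dma_start(buf[0], data, chunk_len(count, 0) * sizeof data[0]);

    for (size_t k = 0; k < nchunks; k++) {
        size_t cur = k & 1u;
        size_t off = k * CHUNK;
        size_t len = chunk_len(count, off);

        /* Chunk k is resident and the previous write-back has finished. */
        dma_wait();

        /* Start fetching chunk k+1 into the other buffer. */
        if (k + 1 < nchunks) {
            size_t noff = off + CHUNK;
            dma_start(buf[cur ^ 1u], data + noff,
                      chunk_len(count, noff) * sizeof data[0]);
        }

        /* Process chunk k while the next transfer would be in flight. */
        for (size_t i = 0; i < len; i++)
            buf[cur][i] *= factor;

        /* Write the finished chunk back to main memory. */
        dma_start(data + off, buf[cur], len * sizeof data[0]);
    }
    dma_wait();
}
```

The cost noted above is visible here: half of the local memory is always reserved for the transfer in flight, and the program is responsible for waiting before it reuses a buffer.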

It would be possible for a conventional processor to gain similar advantages with cache-control instructions, for example by allowing prefetching to the L1 that bypasses the L2, or an eviction hint that signals a transfer from L1 to L2 without committing to main memory. However, at present no systems offer this capability in a usable form, and such instructions would in effect mirror the explicit transfer of data among the cache areas used by each core.

Notes

  1. Some older systems used a hidden part of main storage, referred to as bump storage, as scratchpad. In other systems, e.g., UNIVAC 1107, all addressable registers were held in scratchpad.

References

  1. Steinke, Stefan; Lars Wehmeyer; Bo-Sik Lee; Peter Marwedel (2002). "Assigning Program and Data Objects to Scratchpad for Energy Reduction" (PDF). University of Dortmund. Retrieved 3 October 2013.: "3.2 Scratchpad model .. The scratchpad memory uses software to control the location assignment of data."
  2. "The TI-99/4A internal architecture". www.unige.ch. Retrieved 2023-03-08.
  3. J. Lu, K. Bai, A. Shrivastava, "SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)", Design Automation Conference (DAC), June 2–6, 2013
  4. K. Bai, A. Shrivastava, "Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures", Design Automation and Test in Europe (DATE), 2013
  5. K. Bai, J. Lu, A. Shrivastava, B. Holton, "CMSM: An Efficient and Effective Code Management for Software Managed Multicores", CODES+ISSS, 2013
  6. Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved 3 October 2013.
  7. Jia, Zhe; Tillman, Blake; Maggioni, Marco; Scarpazza, Daniele P. (December 7, 2019). Dissecting the Graphcore IPU Architecture via Microbenchmarking (PDF) (Technical report). Citadel Enterprise Americas, LLC. arXiv:1912.03413.
