Abstract

Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices, however, are designed primarily for density, optimizing capacity and throughput rather than resilience. A strict focus on memory performance at the expense of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. Compared with an electronically connected approach, we demonstrate significantly higher memory bandwidth and lower latency, in addition to a 700× improvement in resilience.

© 2012 OSA

2012 (3)

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

D. Brunina, C. P. Lai, and K. Bergman, “A data rate- and modulation format-independent packet-switched optical network test-bed,” IEEE Photon. Technol. Lett., vol. 24, no. 5, pp. 377–379, Mar.2012.
[CrossRef]

C. P. Lai and K. Bergman, “Broadband multicasting for wavelength-striped optical packets,” J. Lightwave Technol., vol. 30, no. 11, pp. 1706–1718, June2012.
[CrossRef]

2011 (1)

2009 (1)

2006 (1)

2005 (1)

R. Baumann, “Soft errors in advanced computer systems,” IEEE Des. Test Comput., vol. 22, no. 3, pp. 258–266, 2005.
[CrossRef]

2001 (1)

R. Ho, W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr.2001.
[CrossRef]

1996 (2)

E. Normand, “Single event upset at ground level,” IEEE Trans. Nucl. Sci., vol. 6, no. 43, pp. 2742–2750, 1996.
[CrossRef]

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

1984 (1)

C. Chen and M. Hsiao, “Error-correcting codes for semiconductor memory applications: A state-of-the-art review,” IBM J. Res. Dev., vol. 28, no. 2, pp. 124–134, 1984.
[CrossRef]

1981 (1)

H. Mine and K. Hatayama, “Reliability analysis and optimal redundancy for majority-voted logic circuits,” IEEE Trans. Reliab., vol. 30, no. 2, pp. 189–191, 1981.
[CrossRef]

1979 (2)

T. C. May and M. H. Woods, “Alpha-particle-induced soft errors in dynamic memories,” IEEE Trans. Electron Devices, vol. 26, no. 1, pp. 2–9, 1979.
[CrossRef]

J. F. Ziegler and W. A. Lanford, “Effect of cosmic rays on computer memories,” Science, vol. 206, pp. 776–788, 1979.
[CrossRef] [PubMed]

Ahn, J. H.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.

Alves, L. C.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Barroso, L. A.

L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” in Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.

Baumann, R.

R. Baumann, “Soft errors in advanced computer systems,” IEEE Des. Test Comput., vol. 22, no. 3, pp. 258–266, 2005.
[CrossRef]

Bergman, K.

D. Brunina, C. P. Lai, and K. Bergman, “A data rate- and modulation format-independent packet-switched optical network test-bed,” IEEE Photon. Technol. Lett., vol. 24, no. 5, pp. 377–379, Mar.2012.
[CrossRef]

C. P. Lai and K. Bergman, “Broadband multicasting for wavelength-striped optical packets,” J. Lightwave Technol., vol. 30, no. 11, pp. 1706–1718, June2012.
[CrossRef]

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Building data centers with optically connected memory,” J. Opt. Commun. Netw., vol. 3, no. 8, pp. A40–A48, 2011.
[CrossRef]

O. Liboiron-Ladouceur, B. A. Small, and K. Bergman, “Physical layer scalability of WDM optical packet interconnection networks,” J. Lightwave Technol., vol. 24, no. 1, pp. 262–270, Jan.2006.
[CrossRef]

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Borkar, S.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Brunina, D.

D. Brunina, C. P. Lai, and K. Bergman, “A data rate- and modulation format-independent packet-switched optical network test-bed,” IEEE Photon. Technol. Lett., vol. 24, no. 5, pp. 377–379, Mar.2012.
[CrossRef]

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Building data centers with optically connected memory,” J. Opt. Commun. Netw., vol. 3, no. 8, pp. A40–A48, 2011.
[CrossRef]

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

Campbell, D.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Carlson, W.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Chen, C.

C. Chen and M. Hsiao, “Error-correcting codes for semiconductor memory applications: A state-of-the-art review,” IBM J. Res. Dev., vol. 28, no. 2, pp. 124–134, 1984.
[CrossRef]

Chen, L.

Clarke, W. J.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Curtis, H. W.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Dally, W.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Dell, T. J.

T. J. Dell, A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory, IBM Microelectronics Division, 1997.

Denneau, M.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Emer, J.

S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: An architectural perspective,” in HPCA ’05: Proc. of the 11th Int. Symp. on High-Performance Computer Architecture, 2005.

Franzon, P.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Ganapathi, A.

D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail and what can be done about it?” in 4th USENIX Symp. on Internet Technologies and Systems, 2003.

Garg, A. S.

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Building data centers with optically connected memory,” J. Opt. Commun. Netw., vol. 3, no. 8, pp. A40–A48, 2011.
[CrossRef]

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

Harrod, W.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Hatayama, K.

H. Mine and K. Hatayama, “Reliability analysis and optimal redundancy for majority-voted logic circuits,” IEEE Trans. Reliab., vol. 30, no. 2, pp. 189–191, 1981.
[CrossRef]

Hiller, J.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Ho, R.

R. Ho, W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr.2001.
[CrossRef]

Hölzle, U.

L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” in Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.

Horowitz, M. A.

R. Ho, W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr.2001.
[CrossRef]

Hsiao, M.

C. Chen and M. Hsiao, “Error-correcting codes for semiconductor memory applications: A state-of-the-art review,” IBM J. Res. Dev., vol. 28, no. 2, pp. 124–134, 1984.
[CrossRef]

Johnson, J. S.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Jouppi, N. P.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.

Karp, S.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Keckler, S.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Klein, D.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Kogge, P.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Kozyrakis, C.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.

Kshemkalyani, A. D.

A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, New York, 2008, pp. 6–13.

Lai, C. P.

D. Brunina, C. P. Lai, and K. Bergman, “A data rate- and modulation format-independent packet-switched optical network test-bed,” IEEE Photon. Technol. Lett., vol. 24, no. 5, pp. 377–379, Mar.2012.
[CrossRef]

C. P. Lai and K. Bergman, “Broadband multicasting for wavelength-striped optical packets,” J. Lightwave Technol., vol. 30, no. 11, pp. 1706–1718, June2012.
[CrossRef]

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Building data centers with optically connected memory,” J. Opt. Commun. Netw., vol. 3, no. 8, pp. A40–A48, 2011.
[CrossRef]

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

Lanford, W. A.

J. F. Ziegler and W. A. Lanford, “Effect of cosmic rays on computer memories,” Science, vol. 206, pp. 776–788, 1979.
[CrossRef] [PubMed]

Lastras-Montano, L. A.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Leverich, J.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.

Liboiron-Ladouceur, O.

Lipson, M.

Liu, D.

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

Lucas, R.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Mai, W.

R. Ho, W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr.2001.
[CrossRef]

Manipatruni, S.

May, T. C.

T. C. May and M. H. Woods, “Alpha-particle-induced soft errors in dynamic memories,” IEEE Trans. Electron Devices, vol. 26, no. 1, pp. 2–9, 1979.
[CrossRef]

Meaney, P. J.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Mine, H.

H. Mine and K. Hatayama, “Reliability analysis and optimal redundancy for majority-voted logic circuits,” IEEE Trans. Reliab., vol. 30, no. 2, pp. 189–191, 1981.
[CrossRef]

Modine, A.

A. Modine, “Web startups crumble under Amazon S3 outage” [Online]. Available: http://www.theregister.co.uk/2008/02/15/amazon_s3_outage_feb_2008/.

Montrose, C. J.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Muhlfeld, H. P.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Mukherjee, S. S.

S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: An architectural perspective,” in HPCA ’05: Proc. of the 11th Int. Symp. on High-Performance Computer Architecture, 2005.

Nagappan, N.

K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proc. of the 1st ACM Symp. on Cloud Computing (SoCC ’10), New York, 2010, pp. 193–204.

Normand, E.

E. Normand, “Single event upset at ground level,” IEEE Trans. Nucl. Sci., vol. 6, no. 43, pp. 2742–2750, 1996.
[CrossRef]

O’Connor, J. A.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

O’Gorman, T. J.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Oppenheimer, D.

D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail and what can be done about it?” in 4th USENIX Symp. on Internet Technologies and Systems, 2003.

Papazova, V. K.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Patterson, D.

D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail and what can be done about it?” in 4th USENIX Symp. on Internet Technologies and Systems, 2003.

Pinheiro, E.

B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: A large-scale field study,” in ACM SIGMETRICS, 2009.

Preston, K.

Reinhardt, S. K.

S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: An architectural perspective,” in HPCA ’05: Proc. of the 11th Int. Symp. on High-Performance Computer Architecture, 2005.

Richards, M.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Ross, J. M.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Scarpelli, A.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Schreiber, R. S.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.

Schroeder, B.

B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: A large-scale field study,” in ACM SIGMETRICS, 2009.

Scott, S.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Singhal, M.

A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, New York, 2008, pp. 6–13.

Small, B. A.

Snavely, A.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Stephens, E.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.
[CrossRef]

Sterling, T.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Taber, A. H.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Vishwanath, K. V.

K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proc. of the 1st ACM Symp. on Cloud Computing (SoCC ’10), New York, 2010, pp. 193–204.

Walsh, J. L.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

Weber, W.-D.

B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: A large-scale field study,” in ACM SIGMETRICS, 2009.

Williams, R. S.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Woods, M. H.

T. C. May and M. H. Woods, “Alpha-particle-induced soft errors in dynamic memories,” IEEE Trans. Electron Devices, vol. 26, no. 1, pp. 2–9, 1979.
[CrossRef]

Yelick, K.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

Ziegler, J. F.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.
[CrossRef]

J. F. Ziegler and W. A. Lanford, “Effect of cosmic rays on computer memories,” Science, vol. 206, pp. 776–788, 1979.
[CrossRef] [PubMed]

IBM J. Res. Dev. (3)

C. Chen and M. Hsiao, “Error-correcting codes for semiconductor memory applications: A state-of-the-art review,” IBM J. Res. Dev., vol. 28, no. 2, pp. 124–134, 1984.

T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field testing for cosmic ray soft errors in semiconductor memories,” IBM J. Res. Dev., vol. 40, no. 1, pp. 41–50, 1996.

P. J. Meaney, L. A. Lastras-Montano, V. K. Papazova, E. Stephens, J. S. Johnson, L. C. Alves, J. A. O’Connor, and W. J. Clarke, “IBM zEnterprise redundant array of independent memory subsystem,” IBM J. Res. Dev., vol. 56, no. 1.2, pp. 4:1–4:11, 2012.

IEEE Des. Test Comput. (1)

R. Baumann, “Soft errors in advanced computer systems,” IEEE Des. Test Comput., vol. 22, no. 3, pp. 258–266, 2005.

IEEE Photon. Technol. Lett. (1)

D. Brunina, C. P. Lai, and K. Bergman, “A data rate- and modulation format-independent packet-switched optical network test-bed,” IEEE Photon. Technol. Lett., vol. 24, no. 5, pp. 377–379, Mar. 2012.

IEEE Trans. Electron Devices (1)

T. C. May and M. H. Woods, “Alpha-particle-induced soft errors in dynamic memories,” IEEE Trans. Electron Devices, vol. 26, no. 1, pp. 2–9, 1979.

IEEE Trans. Nucl. Sci. (1)

E. Normand, “Single event upset at ground level,” IEEE Trans. Nucl. Sci., vol. 43, no. 6, pp. 2742–2750, 1996.

IEEE Trans. Reliab. (1)

H. Mine and K. Hatayama, “Reliability analysis and optimal redundancy for majority-voted logic circuits,” IEEE Trans. Reliab., vol. 30, no. 2, pp. 189–191, 1981.

J. Lightwave Technol. (2)

J. Opt. Commun. Netw. (1)

Opt. Express (1)

Proc. IEEE (1)

R. Ho, W. Mai, and M. A. Horowitz, “The future of wires,” Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.

Science (1)

J. F. Ziegler and W. A. Lanford, “Effect of cosmic rays on computer memories,” Science, vol. 206, pp. 776–788, 1979.

Other (16)

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: An architectural perspective,” in HPCA ’05: Proc. of the 11th Int. Symp. on High-Performance Computer Architecture, 2005.

The ITRS Technology Working Groups, International Technology Roadmap for Semiconductors (ITRS) 2011 Edition [Online]. Available: http://www.itrs.net.

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” in Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.

D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail and what can be done about it?” in 4th USENIX Symp. on Internet Technologies and Systems, 2003.

A. Modine, “Web startups crumble under Amazon S3 outage” [Online]. Available: http://www.theregister.co.uk/2008/02/15/amazon_s3_outage_feb_2008/.

B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: A large-scale field study,” in ACM SIGMETRICS, 2009.

K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proc. of the 1st ACM Symp. on Cloud Computing (SoCC ’10), New York, 2010, pp. 193–204.

JEDEC Solid State Technology Association, DDR3 SDRAM Standard [Online]. Available: http://www.jedec.org/standards-documents/docs/jesd-79-3d.

A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, New York, 2008, pp. 6–13.

T. J. Dell, A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory, IBM Microelectronics Division, 1997.

“Intel E7500 chipset MCH Intel ×4 single device data correction (×4 SDDC) implementation and validation,” Intel Application Note AP-726, Aug. 2002.

“Servers and storage technology for the adaptive infrastructure,” HP Technology Advisor, 2006 [Online]. Available: http://h40089.www4.hp.com/integrity/pdf/4AA0-7545EEE.pdf.

J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Future scaling of processor-memory interfaces,” in Supercomputing (SC), Nov. 2010.
