Abstract

Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are designed primarily for density, optimizing memory capacity and throughput rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. Compared to an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.

© 2012 OSA


A. D. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, New York, 2008, pp. 6–13.

T. J. Dell, A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory, IBM Microelectronics Division, 1997.

“Intel E7500 chipset MCH Intel×4 single device data correction (×4 SDDC) implementation and validation,” Intel Application Note AP-726, Aug.2002.

“Servers and storage technology for the adaptive infrastructure,” HP Technology Advisor, 2006 [Online]. Available: http://h40089.www4.hp.com/integrity/pdf/4AA0-7545EEE.pdf.

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale Computing Study: Technology Challenges in Achieving Exascale Systems [Online]. Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.

S. S. Mukherjee, J. Emer, and S. K. Reinhardt, “The soft error problem: An architectural perspective,” in HPCA ’05: Proc. of the 11th Int. Symp. on High-Performance Computer Architecture, 2005.

L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” in Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.

D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail and what can be done about it?” in 4th USENIX Symp. on Internet Technologies and Systems, 2003.

A. Modine, “Web startups crumble under Amazon S3 outage” [Online]. Available: http://www.theregister.co.uk/2008/02/15/amazon_s3_outage_feb_2008/.

B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: A large-scale field study,” in ACM SIGMETRICS, 2009.

K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing hardware reliability,” in Proc. of the 1st ACM Symp. on Cloud Computing (SoCC ’10), New York, 2010, pp. 193–204.

The ITRS Technology Working Groups, International Technology Roadmap for Semiconductors (ITRS) 2011 Edition [Online]. Available: http://www.itrs.net.

D. Brunina, C. P. Lai, A. S. Garg, and K. Bergman, “Wavelength-striped multicasting of optically-connected memory for large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2011, OWH4.

D. Brunina, C. P. Lai, D. Liu, A. S. Garg, and K. Bergman, “Optically-connected memory with error correction for increased reliability in large-scale computing systems,” in Optical Fiber Communications Conf. (OFC), Mar. 2012.

Figures (7)

Fig. 1

(Color online) Compute nodes and memory nodes are connected using an optical interconnection network. Red arrows show a compute node multicasting to interleave data over two memory nodes. At each memory node, the data are further interleaved so each SDRAM stores a different portion of the data.

Fig. 2

(Color online) Block diagram of standard error-correction hardware used for main memory; 64-bit memory words are protected by 8 parity-check bits and stored in memory as 72-bit ECC words.

Fig. 3

(Color online) (a) Block diagram of the implemented 4 × 4 optical network topology, interconnecting four CPUs with four OCM nodes. (b) Photograph of the switching fabric test-bed.

Fig. 4

(Color online) Diagram of our wavelength-striped message format. The WDM memory transaction consists of eight separate wavelengths, each modulated at 10 Gb/s.

Fig. 5

(Color online) Experimental setup illustrating the communication path from the processor to two OCM nodes via wavelength-striped multicasting. The processor modulates five message-rate header wavelengths, in addition to eight payload wavelengths at 10 Gb/s, which are combined before being injected into the 4 × 4 optical network. Header wavelengths consist of frame (F), address bits 0 and 1 (A0, A1), and multicast bits 0 and 1 (M0, M1). The inset shows the contents of an OCM node, including receiver circuitry, FPGA (for SerDes), and SDRAM. The return path from OCM to processor is identical.

Fig. 6

(Color online) Experimentally recovered post-ECC BER as a function of induced pre-ECC BER for the implemented advanced ECC OCM system, for a varying number of interleaved nodes. All pre-ECC BERs less than $10^{-5}$ result in error-free post-ECC BERs (BERs $< 10^{-12}$).

Fig. 7

(Color online) Optical eye diagrams for the 8 × 10 Gb/s memory payload wavelength channels at one network input port (top) and at one network output (bottom).
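
As an illustration of the 72-bit ECC word mentioned in the Fig. 2 caption (64 data bits protected by 8 parity-check bits), the following is a minimal sketch of a textbook single-error-correcting, double-error-detecting (SECDED) extended Hamming code. The caption does not specify the code construction, so the bit layout, the function names `encode` and `decode`, and the status strings are assumptions made for illustration and are not taken from the authors' hardware.

```python
# Sketch of a (72,64) SECDED extended Hamming code.
# Assumed layout (not from the paper): index 0 holds the overall parity bit,
# indices 1, 2, 4, ..., 64 hold the seven Hamming check bits, and the
# remaining 64 indices in 1..71 hold the data bits.

PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]


def encode(word):
    """Encode a 64-bit integer as a list of 72 bits."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (word >> i) & 1
    # Each check bit makes the parity of all positions sharing its index bit even.
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 72):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity
    # The overall parity bit makes the whole 72-bit word have even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (word, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the indices of all set bits in positions 1..71.
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        # Odd overall parity: assume a single flipped bit at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity indicates a double error;
        # an out-of-range syndrome indicates three or more errors.
        return None, "uncorrectable"

    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= code[pos] << i
    return word, status
```

Calling `decode(encode(w))` returns the original word with status `"ok"`; flipping any one of the 72 bits before decoding still recovers the word, while flipping two bits is reported as uncorrectable rather than silently miscorrected.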