Straggler-Resilient Distributed Computing

Severinson, Lars Albin

dc.contributor.author	Severinson, Lars Albin
dc.date.accessioned	2022-08-31T12:44:16Z
dc.date.available	2022-08-31T12:44:16Z
dc.date.issued	2022-09-09
dc.date.submitted	2022-08-20T15:33:54.813Z
dc.identifier	container/34/13/aa/92/3413aa92-f041-47d8-b8c1-a4f0dde95250
dc.identifier.isbn	9788230868287
dc.identifier.isbn	9788230840726
dc.identifier.uri	https://hdl.handle.net/11250/3014726
dc.description	In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.	en_US
dc.description.abstract	Utbredelsen av distribuerte datasystemer har økt betydelig de siste årene. Dette skyldes først og fremst at behovet for beregningskraft øker raskere enn hastigheten til en enkelt datamaskin, slik at vi må bruke flere datamaskiner for å møte etterspørselen, og at det blir stadig mer vanlig at systemer er spredt over et stort geografisk område. Dette paradigmeskiftet medfører mange tekniske utfordringer. En av disse er knyttet til "straggler"-problemet, som er forårsaket av forsinkelsesvariasjoner i distribuerte systemer, der en beregning forsinkes av noen få langsomme noder slik at andre noder må vente før de kan fortsette. Straggler-problemet kan svekke effektiviteten til distribuerte systemer betydelig i situasjoner der en enkelt node som opplever en midlertidig overbelastning kan låse et helt system. I denne avhandlingen studerer vi metoder for å gjøre beregninger av forskjellige typer motstandsdyktige mot slike problemer, og dermed gjøre det mulig for et distribuert system å fortsette til tross for at noen noder ikke svarer i tide. Metodene vi foreslår er skreddersydde for spesielle typer beregninger. Vi foreslår metoder tilpasset distribuert matrise-vektor-multiplikasjon (som er en grunnleggende operasjon i mange typer beregninger), distribuert maskinlæring og distribuert sporing av en tilfeldig prosess (for eksempel det å spore plasseringen til kjøretøy for å unngå kollisjon). De foreslåtte metodene utnytter redundans som enten blir introdusert som en del av metoden, eller som naturlig eksisterer i det underliggende problemet, til å kompensere for manglende delberegninger. For en av de foreslåtte metodene utnytter vi redundans for også å øke effektiviteten til kommunikasjonen mellom noder, og dermed redusere mengden data som må kommuniseres over nettverket. I likhet med straggler-problemet kan slik kommunikasjon begrense effektiviteten i distribuerte systemer betydelig. De foreslåtte metodene gir signifikante forbedringer i ventetid og pålitelighet sammenlignet med tidligere metoder.	en_US
dc.description.abstract	The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is a problem caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems—a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications, distributed machine learning—in the form of a straggler-resilient first-order optimization method—and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.	en_US
dc.language.iso	eng	en_US
dc.publisher	The University of Bergen	en_US
dc.relation.haspart	Paper I: A. Severinson, A. Graell i Amat, and E. Rosnes, “Block-diagonal and LT codes for distributed computing with straggling servers,” IEEE Transactions on Communications, vol. 67, no. 3, 1739-1753, Mar. 2019. The article is available at: <a href="https://hdl.handle.net/1956/22296" target="blank">https://hdl.handle.net/1956/22296</a>	en_US
dc.relation.haspart	Paper II: A. Severinson, A. Graell i Amat, E. Rosnes, Francisco Lazaro, and Gianluigi Liva, “A droplet approach based on Raptor codes for distributed computing with straggling servers,” in Proc. International Symposium on Turbo Codes & Iterative Information Processing (ISTC), Hong Kong, Dec. 2018. The article is available in the thesis file. The article is also available at: <a href="https://doi.org/10.1109/ISTC.2018.8625292" target="blank">https://doi.org/10.1109/ISTC.2018.8625292</a>	en_US
dc.relation.haspart	Paper III: A. Severinson, E. Rosnes, S. El Rouayheb, and A. Graell i Amat, “DSAG: A mixed synchronous-asynchronous iterative method for straggler-resilient learning”. The article is available in the thesis file.	en_US
dc.relation.haspart	Paper IV: A. Severinson, E. Rosnes, and A. Graell i Amat, “Coded distributed tracking,” in Proc. IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, Dec. 2019. The article is available in the thesis file. The article is also available at: <a href="https://doi.org/10.1109/GLOBECOM38437.2019.9013446" target="blank">https://doi.org/10.1109/GLOBECOM38437.2019.9013446</a>	en_US
dc.relation.haspart	Paper V: A. Severinson, E. Rosnes, and A. Graell i Amat, “Improving age-of-information in distributed vehicle tracking,” in Proc. URSI General Assembly and Scientific Symposium (GASS), Rome, Italy, Aug./Sep. 2021. The article is not available in BORA. The published version is available at: <a href=" https://doi.org/10.23919/URSIGASS51995.2021.9560297" target="blank"> https://doi.org/10.23919/URSIGASS51995.2021.9560297</a>	en_US
dc.rights	In copyright
dc.rights.uri	http://rightsstatements.org/page/InC/1.0/
dc.title	Straggler-Resilient Distributed Computing	en_US
dc.type	Doctoral thesis	en_US
dc.date.updated	2022-08-20T15:33:54.813Z
dc.rights.holder	Copyright the Author. All rights reserved	en_US
dc.contributor.orcid	0000-0002-3450-4392
dc.description.degree	Doktorgradsavhandling
fs.unitcode	12-12-0

Files in this item

Name:: drthesis_2022_severinson.pdf
Size:: 3.060Mb
Format:: PDF
Description:: Main thesis, papers I-IV

View/Open

This item appears in the following Collection(s)

Department of Informatics [917]

Show simple item record