Show simple item record

dc.contributor.authorSeverinson, Lars Albin
dc.date.accessioned2022-08-31T12:44:16Z
dc.date.available2022-08-31T12:44:16Z
dc.date.issued2022-09-09
dc.date.submitted2022-08-20T15:33:54.813Z
dc.identifiercontainer/34/13/aa/92/3413aa92-f041-47d8-b8c1-a4f0dde95250
dc.identifier.isbn9788230868287
dc.identifier.isbn9788230840726
dc.identifier.urihttps://hdl.handle.net/11250/3014726
dc.descriptionIn reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.en_US
dc.description.abstractUtbredelsen av distribuerte datasystemer har økt betydelig de siste årene. Dette skyldes først og fremst at behovet for beregningskraft øker raskere enn hastigheten til en enkelt datamaskin, slik at vi må bruke flere datamaskiner for å møte etterspørselen, og at det blir stadig mer vanlig at systemer er spredt over et stort geografisk område. Dette paradigmeskiftet medfører mange tekniske utfordringer. En av disse er knyttet til "straggler"-problemet, som er forårsaket av forsinkelsesvariasjoner i distribuerte systemer, der en beregning forsinkes av noen få langsomme noder slik at andre noder må vente før de kan fortsette. Straggler-problemet kan svekke effektiviteten til distribuerte systemer betydelig i situasjoner der en enkelt node som opplever en midlertidig overbelastning kan låse et helt system. I denne avhandlingen studerer vi metoder for å gjøre beregninger av forskjellige typer motstandsdyktige mot slike problemer, og dermed gjøre det mulig for et distribuert system å fortsette til tross for at noen noder ikke svarer i tide. Metodene vi foreslår er skreddersydde for spesielle typer beregninger. Vi foreslår metoder tilpasset distribuert matrise-vektor-multiplikasjon (som er en grunnleggende operasjon i mange typer beregninger), distribuert maskinlæring og distribuert sporing av en tilfeldig prosess (for eksempel det å spore plasseringen til kjøretøy for å unngå kollisjon). De foreslåtte metodene utnytter redundans som enten blir introdusert som en del av metoden, eller som naturlig eksisterer i det underliggende problemet, til å kompensere for manglende delberegninger. For en av de foreslåtte metodene utnytter vi redundans for også å øke effektiviteten til kommunikasjonen mellom noder, og dermed redusere mengden data som må kommuniseres over nettverket. I likhet med straggler-problemet kan slik kommunikasjon begrense effektiviteten i distribuerte systemer betydelig. De foreslåtte metodene gir signifikante forbedringer i ventetid og pålitelighet sammenlignet med tidligere metoder.en_US
dc.description.abstractThe number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is a problem caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems—a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system. In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications, distributed machine learning—in the form of a straggler-resilient first-order optimization method—and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.en_US
dc.language.isoengen_US
dc.publisherThe University of Bergenen_US
dc.relation.haspartPaper I: A. Severinson, A. Graell i Amat, and E. Rosnes, “Block-diagonal and LT codes for distributed computing with straggling servers,” IEEE Transactions on Communications, vol. 67, no. 3, 1739-1753, Mar. 2019. The article is available at: <a href="https://hdl.handle.net/1956/22296" target="blank">https://hdl.handle.net/1956/22296</a>en_US
dc.relation.haspartPaper II: A. Severinson, A. Graell i Amat, E. Rosnes, Francisco Lazaro, and Gianluigi Liva, “A droplet approach based on Raptor codes for distributed computing with straggling servers,” in Proc. International Symposium on Turbo Codes & Iterative Information Processing (ISTC), Hong Kong, Dec. 2018. The article is available in the thesis file. The article is also available at: <a href="https://doi.org/10.1109/ISTC.2018.8625292" target="blank">https://doi.org/10.1109/ISTC.2018.8625292</a>en_US
dc.relation.haspartPaper III: A. Severinson, E. Rosnes, S. El Rouayheb, and A. Graell i Amat, “DSAG: A mixed synchronous-asynchronous iterative method for straggler-resilient learning”. The article is available in the thesis file.en_US
dc.relation.haspartPaper IV: A. Severinson, E. Rosnes, and A. Graell i Amat, “Coded distributed tracking,” in Proc. IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, Dec. 2019. The article is available in the thesis file. The article is also available at: <a href="https://doi.org/10.1109/GLOBECOM38437.2019.9013446" target="blank">https://doi.org/10.1109/GLOBECOM38437.2019.9013446</a>en_US
dc.relation.haspartPaper V: A. Severinson, E. Rosnes, and A. Graell i Amat, “Improving age-of-information in distributed vehicle tracking,” in Proc. URSI General Assembly and Scientific Symposium (GASS), Rome, Italy, Aug./Sep. 2021. The article is not available in BORA. The published version is available at: <a href=" https://doi.org/10.23919/URSIGASS51995.2021.9560297" target="blank"> https://doi.org/10.23919/URSIGASS51995.2021.9560297</a>en_US
dc.rightsIn copyright
dc.rights.urihttp://rightsstatements.org/page/InC/1.0/
dc.titleStraggler-Resilient Distributed Computingen_US
dc.typeDoctoral thesisen_US
dc.date.updated2022-08-20T15:33:54.813Z
dc.rights.holderCopyright the Author. All rights reserveden_US
dc.contributor.orcid0000-0002-3450-4392
dc.description.degreeDoktorgradsavhandling
fs.unitcode12-12-0


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record