

REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild

Rahul Duggal*1, Scott Freitas*1, Cao Xiao2, Duen Horng (Polo) Chau1, Jimeng Sun1,3
{rahulduggal,safreita,polo}@gatech.edu, [email protected], [email protected]

Georgia Institute of Technology1, IQVIA2, University of Illinois Urbana-Champaign3

ABSTRACT

In recent years, significant attention has been devoted towards integrating deep learning technologies in the healthcare domain. However, to safely and practically deploy deep learning models for home health monitoring, two significant challenges must be addressed: the models should be (1) robust against noise; and (2) compact and energy-efficient. We propose Rest, a new method that simultaneously tackles both issues via 1) adversarial training and controlling the Lipschitz constant of the neural network through spectral regularization while 2) enabling neural network compression through sparsity regularization. We demonstrate that Rest produces highly-robust and efficient models that substantially outperform the original full-sized models in the presence of noise. For the sleep staging task over single-channel electroencephalogram (EEG), the Rest model achieves a macro-F1 score of 0.67 vs. 0.39 achieved by a state-of-the-art model in the presence of Gaussian noise, while obtaining 19× parameter reduction and 15× MFLOPS reduction on two large, real-world EEG datasets. By deploying these models to an Android application on a smartphone, we quantitatively observe that Rest allows models to achieve up to 17× energy reduction and 9× faster inference. We open source the code repository with this paper: https://github.com/duggalrahul/REST.

CCS CONCEPTS

• Applied computing → Health informatics; • Computing methodologies → Supervised learning by classification.

KEYWORDS

deep learning, compression, adversarial, sleep staging

1 INTRODUCTION

As many as 70 million Americans suffer from sleep disorders that affect their daily functioning, long-term health and longevity. The long-term effects of sleep deprivation and sleep disorders include an increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke [1]. The cost of undiagnosed sleep apnea alone is estimated to exceed $100 billion in the US [28].

A central tool in identifying sleep disorders is the hypnogram, which documents the progression of sleep stages (REM stage, Non-REM stages N1 to N3, and Wake stage) over an entire night (see Fig. 1, top). The process of acquiring a hypnogram from raw sensor data is called sleep staging, which is the focus of this work.

* Both authors contributed equally to this research.

[Figure 1 panels: overnight hypnograms (stages W, N1, N2, N3, REM vs. time) scored by the Rest model, an expert, and the SOTA model in a noisy environment; efficiency bars showing inference time of 355 s (SOTA) vs. 57 s (Rest) and energy usage of 1143 J (SOTA) vs. 123 J (Rest).]

Figure 1: Top: we generate hypnograms for a patient in the SHHS test set. In the presence of Gaussian noise, our Rest-generated hypnogram closely matches the contours of the expert-scored hypnogram. The hypnogram generated by a state-of-the-art (SOTA) model by Sors et al. [32] is considerably worse. Bottom: we measure energy consumed (in Joules) and inference time (in seconds) on a smartphone to score one night of EEG recordings. Rest is 9X more energy efficient and 6X faster than the SOTA model.

Traditionally, to reliably obtain a hypnogram the patient has to undergo an overnight sleep study—called polysomnography (PSG)—at a sleep lab while wearing bio-sensors that measure physiological signals, which include electroencephalogram (EEG), eye movements (EOG), muscle activity or skeletal muscle activation (EMG), and heart rhythm (ECG). The PSG data is then analyzed by a trained sleep technician and a certified sleep doctor to produce a PSG report. The hypnogram plays an essential role in the PSG report, where it is used to derive many important metrics such as sleep efficiency and apnea index. Unfortunately, manually annotating this PSG is both costly and time consuming for the doctors. Recent research has proposed to alleviate these issues by automatically generating the hypnogram directly from the PSG using deep neural networks [6, 34]. However, the process of obtaining a PSG report is still costly and invasive to patients, reducing their participation, which ultimately leads to undiagnosed sleep disorders [33].

One promising direction to reduce undiagnosed sleep disorders is to enable sleep monitoring at the home using commercial wearables (e.g., Fitbit, Apple Watch, Emotiv) [21].



[Figure 2 panels: a noisy REM EEG signal is misclassified as Wake by a noise-vulnerable vanilla model but correctly classified as REM by the robust and sparse Rest model; the Rest pipeline comprises 1: Train Model (L_ADV + L_SPC + L_SPA), 2: Prune Model, and 3: Re-train Model (L_ADV + L_SPC).]

Figure 2: Rest Overview: (from left) When a noisy EEG signal belonging to the REM (rapid eye movement) sleep stage enters a traditional neural network, which is vulnerable to noise, it gets wrongly classified as a Wake sleep stage. On the other hand, the same signal is correctly classified as the REM sleep stage by the Rest model, which is both robust and sparse. (From right) Rest is a three step process involving (1) training the model with adversarial training, spectral regularization and sparsity regularization, (2) pruning the model, and (3) re-training the compact model.

However, despite significant research advances, a recent study shows that wearables using a single sensor (e.g., single lead EEG) often have lower performance for sleep staging, indicating a large room for improvement [3].

1.1 Contributions

Our contributions are two-fold—(i) we identify emerging research challenges for the task of sleep monitoring in the wild; and (ii) we propose Rest, a novel framework that addresses these issues.

I. New Research Challenges for Sleep Monitoring.

• C1. Robustness to Noise. We observe that state-of-the-art deep neural networks (DNN) are highly susceptible to environmental noise (Fig. 1, top). In the case of wearables, noise is a serious consideration since bioelectrical signal sensors (e.g., electroencephalogram "EEG", electrocardiogram "ECG") are commonly susceptible to Gaussian and shot noise, which can be introduced by electrical interferences (e.g., power-line) and user motions (e.g., muscle contraction, respiration) [5, 8, 10, 11]. This poses a need for noise-tolerant models. In this paper, we show that adversarial training and spectral regularization can impart significant noise robustness to sleep staging DNNs (see top of Fig. 1).

• C2. Energy and Computational Efficiency. Mobile deep learning systems have traditionally offloaded compute intensive inference to cloud servers, requiring transfer of sensitive data and assumption of available Internet. However, this data uploading process is difficult for many healthcare scenarios because of—(1) privacy: individuals are often reluctant to share health information as they consider it highly sensitive; and (2) accessibility: real-time home monitoring is most needed in resource-poor environments where high-speed Internet may not be reliably available. Directly deploying a neural network to a mobile phone bypasses these issues. However, due to the constrained computation and energy budget of mobile devices, these models need to be fast in speed and parsimonious with their energy consumption.

II. Noise-robust and Efficient Sleep Monitoring. Having identified these two new research challenges, we propose Rest, the first framework for developing noise-robust and efficient neural networks for home sleep monitoring (Fig. 2). Through Rest, our major contributions include:

• "Robust and Efficient Neural Networks for Sleep Monitoring." By integrating a novel combination of three training objectives, Rest endows a model with noise robustness through (1) adversarial training and (2) spectral regularization; and promotes energy and computational efficiency by enabling compression through (3) sparsity regularization.

• Extensive evaluation. We benchmark the performance of Rest against competitive baselines on two real-world sleep staging EEG datasets—Sleep-EDF from Physionet and the Sleep Heart Health Study (SHHS). We demonstrate that Rest produces highly compact models that substantially outperform the original full-sized models in the presence of noise. Rest models achieve a macro-F1 score of 0.67 vs. 0.39 for the state-of-the-art model in the presence of Gaussian noise, with 19× parameter and 15× MFLOPS reduction.

• Real-world deployment. We deploy a Rest model onto a Pixel 2 smartphone through an Android application performing sleep staging. Our experiments reveal Rest achieves 17× energy reduction and 9× faster inference on a smartphone, compared to uncompressed models.

2 RELATED WORK

In this section we discuss related work from three areas—(1) the task of sleep stage prediction, (2) robustness of deep neural networks and (3) compression of deep learning models.

2.1 Sleep-Stage Prediction

Sleep staging is the task of annotating a polysomnography (PSG) report into a hypnogram, where 30 second sleep intervals are annotated into one of five sleep stages (W, N1, N2, N3, REM). Recently, significant effort has been devoted towards automating this annotation process using deep learning [2, 6, 9, 29, 32, 39], to name a few. While there exists a large body of research in this area, two works in particular look at both single channel [6] and multi-channel [9] deep learning architectures for sleep stage prediction on EEG.


In [6], the authors develop a deep learning architecture (SLEEPNET) for sleep stage prediction that achieves expert-level accuracy on EEG data. In [9], the authors develop a multi-modal deep learning architecture for sleep stage prediction that achieves state-of-the-art accuracy. As we demonstrate later in this paper (Section 4.5), these sleep staging models are frequently susceptible to noise and suffer a large performance drop in its presence (see Figure 1). In addition, these DNNs are often overparameterized (Section 4.6), making deployment to mobile devices and wearables difficult. Through Rest, we address these limitations and develop noise robust and efficient neural networks for edge computing.

2.2 Noise & Adversarial Robustness

Adversarial robustness seeks to ensure that the output of a neural network remains unchanged under a bounded perturbation of the input; or in other words, to prevent an adversary from maliciously perturbing the data to fool a neural network. Adversarial deep learning was popularized by [17], where they showed it was possible to alter the class prediction of deep neural network models by carefully crafting an adversarially perturbed input. Since then, research suggests a strong link between adversarial robustness and noise robustness [15, 20, 35]. In particular, [15] found that by performing adversarial training on a deep neural network, it becomes robust to many forms of noise (e.g., Gaussian, blur, shot, etc.). In contrast, they found that training a model on Gaussian augmented data led to models that were less robust to adversarial perturbations. We build upon this finding of adversarial robustness as a proxy for noise robustness and improve upon it through the use of spectral regularization, while simultaneously compressing the model to a fraction of its original size for mobile devices.

2.3 Model Compression

Model compression aims to learn a reduced representation of the weights that parameterize a neural network, shrinking the computational requirements for memory, floating point operations (FLOPS), inference time and energy. Broadly, prior art can be classified into four directions—pruning [19], quantization [31], low rank approximation [37] and knowledge distillation [22]. For Rest, we focus on structured (channel) pruning thanks to its performance benefits (speedup, FLOP reduction) and ease of deployment with regular hardware. In structured channel pruning, the idea is to assign a measure of importance to each filter of a convolutional neural network (CNN) and achieve the desired sparsity by pruning the least important ones. Prior work demonstrates several ways to estimate filter importance—magnitude of weights [24], structured sparsity regularization [36], regularization on activation scaling factors [26], filter similarity [13] and discriminative power of filters [40]. Recently there has been an attempt to bridge the area of model compression with adversarial robustness through connection pruning [18] and quantization [25]. Different from previous work, Rest aims to compress a model by pruning whole filters while imparting noise tolerance through adversarial training and spectral regularization. Rest can be further compressed through quantization [25].

3 REST: NOISE-ROBUST & EFFICIENT MODELS

Rest is a new method that simultaneously compresses a neural network while developing both noise and adversarial robustness.

3.1 Overview

Our main idea is to enable Rest to endow models with these properties by integrating three careful modifications of the traditional training loss function: (1) the adversarial training term, which builds noise robustness by training on adversarial examples (Section 3.2); (2) the spectral regularization term, which adds to the noise robustness by constraining the Lipschitz constant of the neural network (Section 3.3); and (3) the sparsity regularization term, which helps to identify important neurons and enables compression (Section 3.4). Throughout the paper, we follow standard notation and use capital bold letters for matrices (e.g., A) and lower-case bold letters for vectors (e.g., a).

3.2 Adversarial Training

The goal of adversarial training is to generate noise robustness by exposing the neural network to adversarially perturbed inputs during the training process. Given a neural network f(X; W) with input X, weights W and corresponding loss function L(f(X; W), y), adversarial training aims at solving the following min-max problem:

\[ \min_{W} \; \mathbb{E}_{(X, y) \sim D} \Big[ \max_{\delta \in S} \; L\big(f(X + \delta; W),\, y\big) \Big] \tag{1} \]

Here D is the unperturbed dataset consisting of the clean EEG signals X ∈ R^{K_in × K_L} (K_in is the number of channels and K_L is the length of the signal) along with their corresponding label y. The inner maximization problem in (1) embodies the goal of the adversary—that is, produce adversarially perturbed inputs (i.e., X + δ) that maximize the loss function L. On the other hand, the outer minimization term aims to build robustness by countering the adversary through minimizing the expected loss on perturbed inputs.

Maximizing the inner loss term in (1) is equivalent to finding the adversarial signal X_p = X + δ that maximally alters the loss function L within some bounded perturbation δ ∈ S. Here S is the set of allowable perturbations. Several choices exist for such an adversary. For Rest, we use the iterative Projected Gradient Descent (PGD) adversary since it is one of the strongest first-order attacks [27]. Its operation is described below in Equation 2.

\[ X_p^{(t+1)} = X_p^{(t)} + \Pi_{\tau} \Big[ \epsilon \cdot \mathrm{sign} \Big( \nabla_{X_p^{(t)}} L\big(f(X_p^{(t)}; W),\, y\big) \Big) \Big] \tag{2} \]

Here X_p^{(0)} = X and at every step t, the previous perturbed input X_p^{(t-1)} is modified with the sign of the gradient of the loss, multiplied by ϵ (which controls the attack strength). Π_τ is a function that clips the input at the positions where it exceeds the predefined L∞ bound τ. Finally, after n_iter iterations we have the Rest adversarial training term L_adv in Equation 3.

\[ L_{adv} = L\big(f(X_p^{(n_{iter})}; W),\, y\big) \tag{3} \]
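
To make the PGD inner maximization concrete, below is a minimal PyTorch sketch of Equations 2 and 3. It is illustrative only: `model`, `loss_fn`, and the `eps`, `tau`, and `n_iter` values are placeholders rather than the authors' released implementation, and the projection simply clips the accumulated perturbation to the L-infinity ball of radius tau.

```python
import torch

def pgd_perturb(model, loss_fn, x, y, eps, tau, n_iter):
    """Iterative PGD adversary (Eq. 2): step in the sign of the input
    gradient, then clip the accumulated perturbation to the L-infinity
    ball of radius tau around the clean signal."""
    x_clean = x.detach()
    x_adv = x_clean.clone()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + eps * grad.sign()                          # ascend the loss
            x_adv = x_clean + torch.clamp(x_adv - x_clean, -tau, tau)  # project into S
    return x_adv.detach()

# L_adv (Eq. 3) is then the ordinary loss evaluated on the perturbed input:
# l_adv = loss_fn(model(pgd_perturb(model, loss_fn, x, y, eps, tau, n_iter)), y)
```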

3.3 Spectral Regularizer

The second term in the objective function is the spectral regularization term, which aims to constrain the change in the output of a neural network for some change in its input.


Algorithm 1: Noise Robust & Efficient Neural Network Training (Rest)

Input: Model weights W, EEG signal X and label y from dataset D, spectral regularization λ_o, sparsity regularization λ_g, learning rate α, perturbation strength ϵ, maximum PGD iterations n_iter and model sparsity s.
Output: Noise robust, compressed neural network.

(1) Train the full model with the Rest loss L_R:
    for epoch = 1 to N do
        for minibatch B ⊂ D do
            for X ∈ B do
                X_p^{(1)} = X
                for k = 1 to n_iter do
                    X_p^{(k+1)} = X_p^{(k)} + Π_τ( ϵ · sign( ∇_{X_p^{(k)}} L(f(X_p^{(k)}; W), y) ) )
            W_grad ← E_{X,y∼D} [ ∇_W L_R(X_p, y; W) ]
                where L_R = L(f(X_p; W), y)  [adversarial training]
                            + λ_o Σ_{l=1}^{N} ‖(W^{(l)})^T W^{(l)} − I‖_2  [spectral regularization]
                            + λ_g Σ_{l=1}^{N} ‖γ^{(l)}‖_1  [sparsity regularization]
            W ← W − α · W_grad

(2) Prune the trained model:
    Globally prune filters from W having the smallest γ values until n_f(W′) / n_f(W) ≤ s. Constrain the layerwise sparsity so that n_f(W′^{(l)}) / n_f(W^{(l)}) ≥ 0.1.

(3) Re-train the pruned model:
    Retrain the compressed network f(X; W′) using adversarial training and spectral regularization (no sparsity regularization).

The intuition is to suppress the amplification of noise as it passes through the successive layers of a neural network. In this section we show that an effective way to achieve this is via constraining the Lipschitz constant of each layer's weights.

For a real valued function f : R → R, the Lipschitz constant is a positive real value C such that |f(x1) − f(x2)| ≤ C|x1 − x2|. If C > 1 then the change in input is magnified through the function f. For a neural net, this can lead to input noise amplification. On the other hand, if C < 1 then the noise amplification effect is diminished. This can have the unintended consequence of reducing the discriminative capability of a neural net. Therefore our goal is to set the Lipschitz constant C = 1. The Lipschitz constant for the l-th fully connected layer parameterized by the weight matrix W^{(l)} ∈ R^{K_in × K_out} is equivalent to its spectral norm ρ(W^{(l)}) [12]. Here the spectral norm of a matrix W is the square root of the largest singular value of W^T W. The spectral norm of a 1-D convolutional layer parameterized by the tensor W^{(l)} ∈ R^{K_out × K_in × K_l} can be realized by reshaping it to a matrix W^{(l)} ∈ R^{K_out × (K_in K_l)} and then computing the largest singular value.

A neural network of N layers can be viewed as a function f(·) composed of N sub-functions f(x) = f1(·) ◦ f2(·) ◦ ... ◦ fN(x). A loose upper bound for the Lipschitz constant of f is the product of the Lipschitz constants of the individual layers, i.e., ρ(f) ≤ ∏_{i=1}^{N} ρ(f_i) [12]. The overall Lipschitz constant can grow exponentially if the spectral norm of each layer is greater than 1. On the contrary, it could go to 0 if the spectral norm of each layer is between 0 and 1. Thus the ideal case arises when the spectral norm of each layer equals 1. This can be achieved in several ways [12, 14, 38]; however, one effective way is to encourage orthonormality in the columns of the weight matrix W through the minimization of ‖W^T W − I‖, where I is the identity matrix. This additional loss term helps regulate the singular values and bring them close to 1. Thus we incorporate the following spectral regularization term into our loss objective, where λ_o is a hyperparameter controlling the strength of the spectral regularization.

\[ L_{\text{Spectral}} = \lambda_o \sum_{i=1}^{N} \big\lVert (W^{(i)})^{T} W^{(i)} - I \big\rVert_2 \tag{4} \]
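
As a concrete illustration, here is a minimal PyTorch sketch of the orthonormality penalty in Equation 4, applied to the 1-D convolutional and linear layers of a model. The layer traversal and the use of the squared Frobenius norm are assumptions of this sketch, not the authors' exact implementation.

```python
import torch

def spectral_reg(model, lambda_o):
    """Spectral regularization (Eq. 4): penalize ||W^T W - I|| for every
    conv/linear layer, with 1-D conv weights reshaped to (K_out, K_in * K_l).
    Assumption: the squared Frobenius norm realizes the ||.||_2 in Eq. 4."""
    reg = 0.0
    for module in model.modules():
        if isinstance(module, (torch.nn.Conv1d, torch.nn.Linear)):
            w = module.weight.reshape(module.weight.shape[0], -1)  # (K_out, K_in*K_l)
            gram = w.t() @ w                                       # W^T W
            eye = torch.eye(gram.shape[0], device=w.device)
            reg = reg + torch.norm(gram - eye, p='fro') ** 2
    return lambda_o * reg
```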

3.4 Sparsity Regularizer & Rest Loss Function

The third term of the Rest objective function consists of the sparsity regularizer. With this term, we aim to learn the important filters in the neural network. Once these are determined, the original neural network can be pruned to the desired level of sparsity.

The incoming weights for filter i in the l-th fully connected (or 1-D convolutional) layer can be specified as W_{i,:}^{(l)} ∈ R^{K_in} (or W_{i,:,:}^{(l)} ∈ R^{K_in × K_l}). We introduce a per-filter multiplicand γ_i^{(l)} that scales the output activation of the i-th neuron in layer l. By controlling the value of this multiplicand, we realize the importance of the neuron; in particular, zeroing it amounts to dropping the entire filter. Note that the L0 norm on the multiplicand vector ‖γ^{(l)}‖_0, where γ^{(l)} ∈ R^{K_out}, can naturally satisfy the sparsity objective since it counts the number of non-zero entries in a vector. However, since the L0 norm is a non-differentiable function, we use the L1 norm as a surrogate [23, 26, 36], which is amenable to backpropagation through its subgradient.

To realize the per-filter multiplicand γ_i^{(l)}, we leverage the per-filter multiplier within the batch normalization layer [26].


In most modern networks, a batchnorm layer immediately follows the convolutional/linear layers and implements the following operation:

\[ B_i^{(l)} = \left( \frac{A_i^{(l)} - \mu_i^{(l)}}{\sigma_i^{(l)}} \right) \gamma_i^{(l)} + \beta_i^{(l)} \tag{5} \]

Here A_i^{(l)} denotes the output activation of filter i in layer l, while B_i^{(l)} denotes its transformation through batchnorm layer l; μ^{(l)} ∈ R^{K_out} and σ^{(l)} ∈ R^{K_out} denote the mini-batch mean and standard deviation of layer l's activations; and γ^{(l)} ∈ R^{K_out} and β^{(l)} ∈ R^{K_out} are learnable parameters. Our sparsity regularization is defined on γ^{(l)} as below, where λ_g is a hyperparameter controlling the strength of the sparsity regularization.

\[ L_{\text{Sparsity}} = \lambda_g \sum_{l=1}^{N} \lVert \gamma^{(l)} \rVert_1 \tag{6} \]

The sparsity regularization term (6) promotes learning a subset of important filters while training the model. Compression then amounts to globally pruning the filters with the smallest values of the multiplicands in (5) to achieve the desired model compression. Pruning typically causes a large drop in accuracy. Once the pruned model is identified, we fine-tune it via retraining.
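
The following PyTorch sketch illustrates the sparsity term of Equation 6 and the subsequent global pruning criterion (dropping the filters with the smallest batch-norm multiplicands). The helper names and the threshold-only pruning are illustrative assumptions; the actual removal of channels depends on the network architecture and is omitted here.

```python
import torch

def sparsity_reg(model, lambda_g):
    """L1 penalty of Eq. 6 on the batch-norm scaling factors gamma,
    which act as the per-filter importance multiplicands."""
    reg = 0.0
    for module in model.modules():
        if isinstance(module, torch.nn.BatchNorm1d):
            reg = reg + module.weight.abs().sum()   # gamma^(l)
    return lambda_g * reg

def gamma_prune_threshold(model, sparsity=0.8):
    """Global pruning step (sketch): gather all gammas and return the
    magnitude threshold below which the desired fraction of filters
    would be pruned."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, torch.nn.BatchNorm1d)])
    k = int(sparsity * gammas.numel())
    return torch.sort(gammas).values[k]
```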

Now that we have discussed each component of Rest, we present the full loss function in (7) and the training process in Algorithm 1. A pictorial overview of the process can be seen in Figure 2.

\[ L_R = \underbrace{L(f(X_p; W), y)}_{\text{adversarial training}}
      + \underbrace{\lambda_o \sum_{i=1}^{N} \lVert (W^{(i)})^T W^{(i)} - I \rVert_2}_{\text{spectral regularization}}
      + \underbrace{\lambda_g \sum_{l=1}^{N} \lVert \gamma^{(l)} \rVert_1}_{\text{sparsity regularization}} \tag{7} \]
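
Combining the pieces, here is a hedged sketch of one training step with the full loss L_R (step 1 of Algorithm 1). The helper functions refer to the illustrative sketches above, not to the code released with the paper.

```python
def rest_training_step(model, loss_fn, optimizer, x, y,
                       eps, tau, n_iter, lambda_o, lambda_g):
    """One minibatch update with the full Rest loss L_R (Eq. 7)."""
    x_adv = pgd_perturb(model, loss_fn, x, y, eps, tau, n_iter)
    loss = (loss_fn(model(x_adv), y)           # adversarial training term
            + spectral_reg(model, lambda_o)    # spectral regularization
            + sparsity_reg(model, lambda_g))   # sparsity regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```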

4 EXPERIMENTS

We compare the efficacy of Rest neural networks to four baseline models (Section 4.2) on two publicly available EEG datasets—Sleep-EDF from Physionet [16] and the Sleep Heart Health Study (SHHS) [30]. Our evaluation focuses on two broad directions—noise robustness and model efficiency. Noise robustness compares the efficacy of each model when EEG data is corrupted with three types of noise: adversarial, Gaussian and shot. Model efficiency compares both static (e.g., model size, floating point operations) and dynamic measurements (e.g., inference time, energy consumption). For dynamic measurements, which depend on device hardware, we deploy each model to a Pixel 2 smartphone.

4.1 Datasets

Our evaluation uses two real-world sleep staging EEG datasets.

• Sleep-EDF: This dataset consists of data from two studies—age effect in healthy subjects (SC) and Temazepam effects on sleep (ST). Following [34], we use whole-night polysomnographic sleep recordings of 40 healthy subjects (one night per patient) from SC. It is important to note that the SC study was conducted in the subjects' homes, not a sleep center, and hence this dataset is inherently noisy. However, the sensing environment is still relatively controlled since sleep doctors visited the patients' homes to set up the wearable EEG sensors. After obtaining the data, the recordings are manually classified into one of eight classes (W, N1, N2, N3, N4, REM, MOVEMENT, UNKNOWN); we follow the steps in [34] and merge stages N3 and N4 into a single N3 stage and exclude the MOVEMENT and UNKNOWN stages to match the five stages of sleep according to the American Academy of Sleep Medicine (AASM) [4]. Each single channel EEG recording of 30 seconds corresponds to a vector of dimension 1 × 3000. Similar to [32], while scoring at time i, we include the EEG recordings from times i − 3, i − 2, i − 1, i. Thus we expand the EEG vector by concatenating the previous three time steps to create a vector of size 1 × 12000 (a minimal sketch of this windowing step follows the dataset descriptions below). After pre-processing the data, our dataset consists of 42,191 EEG recordings, each described by a 12,000 length vector and assigned a sleep stage label from Wake, N1, N2, N3 and REM using the Fpz-Cz EEG sensor (see Table 1 for the sleep stage breakdown). Following standard practice [34], we divide the dataset on a per-patient, whole-night basis, using 80% for training, 10% for validation, and 10% for testing. That is, a single patient is recorded for one night and can only be in one of the three sets (training, validation, testing). The final numbers of EEG recordings in their respective splits are 34,820, 5,345 and 3,908. While the number of recordings appears to differ from the 80-10-10 ratio, this is because the data is split over the total number of patients, where each patient is monitored for a time period of variable length (9 hours ± a few minutes).

• Sleep Heart Health Study (SHHS): The Sleep Heart Health Study consists of two rounds of polysomnographic recordings (SHHS-1 and SHHS-2) sampled at 125 Hz in a sleep center environment. Following [32], we use only the first round (SHHS-1), containing 5,793 polysomnographic records over two channels (C4-A1 and C3-A2). Recordings are manually classified into one of six classes (W, N1, N2, N3, N4 and REM). As suggested in [4], we merge the N3 and N4 stages into a single N3 stage (see Table 1 for the sleep stage breakdown). We use 100 distinct patients randomly sampled from the original dataset (one night per patient). Similar to [32], we look at the three previous time steps in order to score the EEG recording at the current time step. This amounts to concatenating the current EEG recording of size 1 × 3750 (equal to 125 Hz × 30 seconds) to generate an EEG recording of size 1 × 15000. After this pre-processing, our dataset consists of 100,065 EEG recordings, each described by a 15,000 length vector and assigned a sleep stage label from the same 5 classes using the Fpz-Cz EEG sensor.

Table 1: Dataset summary outlining the number of 30-second EEG recordings belonging to each sleep stage class.

Dataset   | W      | N1    | N2     | N3 (N4) | REM    | Total
Sleep-EDF | 8,168  | 2,804 | 17,799 | 5,703   | 7,717  | 42,191
SHHS      | 28,854 | 3,377 | 41,246 | 13,409  | 13,179 | 100,065


We use the same 80-10-10 data split as in Sleep-EDF, resulting in 79,940 EEG recordings for training, 9,999 for validation, and 10,126 for testing.
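
For illustration, a small NumPy sketch of the context-concatenation step described above (stacking the three preceding 30-second epochs with the current one: 3000 → 12000 samples for Sleep-EDF and 3750 → 15000 for SHHS). The edge-padding behaviour for the first three epochs is our assumption; the paper does not specify it.

```python
import numpy as np

def concat_context(epochs, context=3):
    """Given per-epoch EEG vectors of shape (n_epochs, epoch_len), build
    inputs that concatenate epochs i-3, i-2, i-1, i for each i.
    Assumption: missing history at the start reuses epoch 0."""
    n, length = epochs.shape
    out = np.zeros((n, (context + 1) * length), dtype=epochs.dtype)
    for i in range(n):
        idx = [max(i - j, 0) for j in range(context, -1, -1)]  # i-3 ... i
        out[i] = np.concatenate([epochs[j] for j in idx])
    return out
```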

4.2 Model Architecture and Configurations

We use the sleep staging CNN architecture proposed by [32], since it achieves state-of-the-art accuracy for sleep stage classification using single channel EEG. We implement all models in PyTorch 0.4. For training and evaluation, we use a server equipped with an Intel Xeon E5-2690 CPU, 250 GB RAM and 8 Nvidia Titan Xp GPUs. Mobile device measurements use a Pixel 2 smartphone with an Android application running Tensorflow Lite¹. With [32] as the architecture for all baselines below, we compare the following 6 configurations:

(1) Sors [32]: Baseline neural network model trained on unperturbed data. This model contains 12 1-D convolutional layers followed by 2 fully connected layers and achieves state-of-the-art performance on sleep staging using single channel EEG.

(2) Liu [26]: We train on unperturbed data and compress the Sors model using sparsity regularization as proposed in [26].

(3) Blanco [7]: We use the same setup as Liu above. During test time, the noisy test input is filtered using a bandpass filter with cutoff 0.5 Hz–40 Hz. This technique is commonly used for removing noise in EEG analysis [7].

(4) Ford [15]: We train and compress the Sors model with sparsity regularization on input data perturbed by Gaussian noise. The Gaussian training parameter c_g = 0.2 controls the perturbation strength during training; it is identified through a line search in Section 4.4.

(5) Rest (A): Our compressed Sors model obtained through adversarial training and sparsity regularization. We use the hyperparameters ϵ = 10 and n_iter = 5/10 (SHHS/Sleep-EDF), where ϵ is a key variable controlling the strength of the adversarial perturbation during training. The optimal ϵ value is determined through a line search described in Section 4.4.

(6) Rest (A+S): Our compressed Sors model obtained through adversarial training, spectral and sparsity regularization. We set the spectral regularization parameter λ_o = 3 × 10^-3 and the sparsity regularization parameter λ_g = 10^-5 based on a grid search in Section 4.4.

All models are trained for 30 epochs using SGD. The initial learning rate is set to 0.1 and multiplied by 0.1 at epochs 10 and 20; the weight decay is set to 0.0002. All compressed models use the same compression method, consisting of weight pruning followed by model re-training. The sparsity regularization parameter λ_g = 10^-5 is identified through a grid search with λ_o (after determining ϵ through a line search). A detailed analysis of the hyperparameter selection for ϵ, λ_o and λ_g can be found in Section 4.4. Finally, we set a high sparsity level s = 0.8 (80% of the neurons from the original networks are pruned) after observing that the models are overparametrized for the task of sleep stage classification.
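
A minimal PyTorch sketch of this optimization setup (SGD, initial learning rate 0.1 decayed by 0.1 at epochs 10 and 20, weight decay 0.0002). The placeholder model and the momentum value are assumptions; the text specifies only the optimizer, schedule, and weight decay.

```python
import torch

# Placeholder stand-in for the Sors CNN (the real architecture has 12 1-D
# conv layers and 2 fully connected layers).
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 8, kernel_size=7),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 5),
)

# Momentum is our assumption; lr, schedule and weight decay follow the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0002)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 20], gamma=0.1)

for epoch in range(30):
    # ... iterate over minibatches and apply the Rest training step here ...
    scheduler.step()
```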

¹ TensorFlow Lite: https://www.tensorflow.org/lite

4.3 Evaluation Metrics

Noise robustness metrics. To study the noise robustness of each model configuration, we evaluate the macro-F1 score in the presence of three types of noise: adversarial, Gaussian and shot. We select macro-F1 since it is a standard metric for evaluating classification performance on imbalanced datasets. Adversarial noise is defined at three strength levels through ϵ = 2/6/12 in Equation 2; Gaussian noise at three levels through c_g = 0.1/0.2/0.3 in Equation 8; and shot noise at three levels through c_s = 5000/2500/1000 in Equation 9. These parameter values are chosen based on prior work [20, 27] and empirical observation. For evaluating robustness to adversarial noise, we assume the white box setting where the attacker has access to the model weights. The formulations for Gaussian and shot noise are given in Equations 8 and 9, respectively.

\[ X_{gauss} = X + \mathcal{N}(0,\; c_g \cdot \sigma_{train}) \tag{8} \]

In Equation 8, σ_train is the standard deviation of the training data and N is the normal distribution. The noise strengths—low, medium and high—correspond to c_g = 0.1/0.2/0.3.
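
A one-line NumPy sketch of the Gaussian corruption in Equation 8, assuming `sigma_train` has already been computed from the training data.

```python
import numpy as np

def add_gaussian_noise(x, sigma_train, c_g):
    """Gaussian corruption (Eq. 8): add zero-mean noise whose standard
    deviation is c_g times the training-set standard deviation."""
    return x + np.random.normal(0.0, c_g * sigma_train, size=x.shape)
```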

\[ X_{norm} = \frac{X - x_{min}}{x_{max} - x_{min}}, \qquad
X' = \mathrm{clip}_{0,1}\!\left( \frac{\mathrm{Poisson}(X_{norm} \cdot c_s)}{c_s} \right), \qquad
X_{shot} = X' \cdot (x_{max} - x_{min}) + x_{min} \tag{9} \]

In Equation 9, x_min and x_max denote the minimum and maximum values in the training data, and clip_{0,1} is a function that projects the input to the range [0, 1].
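
Correspondingly, a NumPy sketch of the shot-noise corruption in Equation 9; `x_min` and `x_max` are assumed to be precomputed from the training data, and smaller `c_s` yields stronger noise.

```python
import numpy as np

def add_shot_noise(x, x_min, x_max, c_s):
    """Shot (Poisson) corruption (Eq. 9): rescale to [0, 1], sample a
    Poisson count with rate proportional to c_s, and map back to the
    original signal range."""
    x_norm = (x - x_min) / (x_max - x_min)
    x_noisy = np.clip(np.random.poisson(x_norm * c_s) / c_s, 0.0, 1.0)
    return x_noisy * (x_max - x_min) + x_min
```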

Model efficiency metrics. To evaluate the efficiency of each model configuration, we use the following measures:

• Parameter Reduction: Memory consumed (in KB) for storing the weights of a model.

• Floating point operations (FLOPS): Number of multiply and add operations performed by the model in one forward pass. Measurement units are Mega (10^6).

• Inference Time: Average time taken (in seconds) to score one night of EEG data. We assume a night consists of 9 hours and amounts to 1,080 EEG recordings (each of 30 seconds). This is measured on a Pixel 2 smartphone.

• Energy Consumption: Average energy consumed by a model (in Joules) to score one night of EEG data on a Pixel 2 smartphone. To measure the consumed energy, we implement an infinite inference loop over EEG recordings until the battery level drops from 100% down to 85%. For each unit percent drop (i.e., 15 levels), we log the number of iterations Ni performed by the model. Given that a standard Pixel 2 battery can deliver 2700 mAh at 3.85 Volts, we use the following conversion to estimate the energy consumed E (in Joules) for a unit percent drop in battery level: E = 2700/1000 × 3600 × 3.85. The total energy for inferencing over an entire night of EEG recordings is then calculated as E/Ni × 1080, where Ni is the number of inferences made in the unit battery drop interval. We average this for every unit battery percentage drop from 100% to 85% (i.e., 15 intervals) to calculate the average energy consumption.
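
To make the conversion concrete, the following sketch evaluates the formula above; attributing one hundredth of the total battery energy to each percent of drop, and the iteration count N_i, are illustrative assumptions rather than measured values from the paper.

```python
# Total energy delivered by a Pixel 2 battery (2700 mAh at 3.85 V):
e_battery = (2700 / 1000) * 3600 * 3.85        # = 37,422 J

# Energy attributed to one percent of battery drop (assumption: 1/100 of total).
e_per_percent = e_battery / 100                # = 374.22 J

# With a hypothetical N_i = 350 inferences logged during one percent of drop,
# the energy to score one night (1,080 thirty-second recordings) would be:
n_i = 350
energy_per_night = e_per_percent / n_i * 1080  # ~1,155 J
print(round(energy_per_night, 1))
```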


Figure 3: Line search results for ϵ on the Sleep-EDF and SHHS datasets. We select ϵ = 10, since it provides the best average macro-F1 score on both datasets.


4.4 Hyperparameter Selection

Optimal hyper-parameter selection is crucial for obtaining good performance with both the baseline and Rest models. We systematically conduct a series of line and grid searches to determine ideal values of ϵ, c_g, λ_o and λ_g using the validation sets.

Selecting ϵ. This parameter controls the perturbation strength of adversarial training in Equation 2. Correctly setting this parameter is critical, since a small ϵ value will have no effect on noise robustness, while too high a value will lead to poor benign accuracy. We follow standard procedure and determine the optimal ϵ on a per-dataset basis [27], conducting a line search across ϵ ∈ [0, 30] in steps of 2. For each value of ϵ we measure the benign and adversarial validation macro-F1 scores, where the adversarial macro-F1 is an average over three strength levels: low (ϵ=2), medium (ϵ=6) and high (ϵ=12). We then select the ϵ with the highest macro-F1 score averaged across the benign and adversarial macro-F1. Line search results are shown in Figure 3; we select ϵ = 10 for both datasets since it is the value with the highest average macro-F1.
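
A schematic sketch of this line search; `evaluate_macro_f1` is a placeholder for training a model with the given training ϵ and returning its validation macro-F1 under a benign (attack_eps=None) or adversarial evaluation.

```python
import numpy as np

def line_search_epsilon(evaluate_macro_f1, eps_grid=range(0, 31, 2)):
    """Return the training epsilon whose average of benign and adversarial
    validation macro-F1 (attack strengths 2, 6, 12) is highest."""
    best_eps, best_score = None, -np.inf
    for eps in eps_grid:
        benign = evaluate_macro_f1(eps, attack_eps=None)
        adv = np.mean([evaluate_macro_f1(eps, attack_eps=a) for a in (2, 6, 12)])
        score = (benign + adv) / 2
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps
```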

Selecting c_g. This parameter controls the noise perturbation strength of Gaussian training in Equation 8. Similar to ϵ, we determine c_g on a per-dataset basis, conducting a line search across c_g values: 0.1 (low), 0.2 (medium) and 0.3 (high). Based on the results in Table 2, we select c_g = 0.2 for both datasets since it provides the best average macro-F1 score while minimizing the drop in benign accuracy.

Selecting λ_o and λ_g. These parameters determine the strength of spectral and sparsity regularization in Equation 7. We determine the best values for λ_o and λ_g through a grid search across the parameter values λ_o = [0.001, 0.003, 0.005] and λ_g = [1E-04, 1E-05]. Based on the results in Table 3, we select λ_o = 0.003 and λ_g = 1E-05. Since these are model dependent parameters, we calculate them once on the Sleep-EDF dataset and re-use them for SHHS.

Table 2: Line search results for identifying the optimal c_g on the Sleep-EDF and SHHS datasets. Macro-F1 is abbreviated F1 in the table; average macro-F1 is the mean of all macro-F1 scores. We select c_g = 0.2 for both datasets as it represents a good trade-off between benign and Gaussian macro-F1.

Dataset | c_g | Benign F1 | Gaussian F1 (Low / Med / High) | Average F1
EDF     | 0.1 | 0.75      | 0.76 / 0.70 / 0.50             | 0.68
EDF     | 0.2 | 0.70      | 0.72 / 0.75 / 0.64             | 0.70
EDF     | 0.3 | 0.67      | 0.68 / 0.71 / 0.75             | 0.7025
SHHS    | 0.1 | 0.69      | 0.74 / 0.45 / 0.21             | 0.52
SHHS    | 0.2 | 0.68      | 0.69 / 0.68 / 0.43             | 0.62
SHHS    | 0.3 | 0.55      | 0.57 / 0.65 / 0.74             | 0.63

Table 3: Grid search results for λ_o and λ_g on the Sleep-EDF dataset. Macro-F1 is abbreviated as F1 in the table; average macro-F1 is the mean of all macro-F1 scores. We select the λ_o and λ_g with the highest average macro-F1 score.

λ_o   | λ_g   | Benign F1 | Adversarial F1 (Low / Med / High) | Avg. F1
0.001 | 1E-04 | 0.73      | 0.66 / 0.65 / 0.61                | 0.66
0.003 | 1E-04 | 0.72      | 0.64 / 0.63 / 0.59                | 0.65
0.005 | 1E-04 | 0.72      | 0.65 / 0.64 / 0.62                | 0.66
0.001 | 1E-05 | 0.73      | 0.66 / 0.65 / 0.62                | 0.67
0.003 | 1E-05 | 0.73      | 0.67 / 0.66 / 0.62                | 0.67
0.005 | 1E-05 | 0.73      | 0.64 / 0.64 / 0.62                | 0.66

4.5 Noise Robustness

To evaluate noise robustness, we ask the following questions—(1) what is the impact of Rest on model accuracy with and without noise in the data? and (2) how does Rest training compare to the baseline methods of benign training, Gaussian training and noise filtering? In answering these questions, we analyze the noise robustness of models at three scales: (i) meta-level macro-F1 scores; (ii) meso-level confusion matrix heatmaps; and (iii) granular-level single-patient hypnograms.

I. Meta Analysis: Macro-F1 Scores. In Table 4, we present a high-level overview of model performance through macro-F1 scores on three types and strength levels of noise corruption. The macro-F1 scores and standard deviations are reported by averaging over three runs for each model and noise level. We identify multiple key insights as described below:

(1) Rest Outperforms Across All Types of Noise. As demonstrated by the higher macro-F1 scores, Rest outperforms all baseline methods in the presence of noise. In addition, Rest has a low standard deviation, indicating that model performance is not dependent on weight initialization.


Table 4: Meta Analysis: Comparison of macro-F1 scores achieved by each model. The models are evaluated on the Sleep-EDF and SHHS datasets with three types and strengths of noise corruption. We bold the compressed model with the best performance (averaged over 3 runs) and report the standard deviation of each model next to the macro-F1 score. Rest performs better in all noise test measurements.

Data      | Method     | Compress | No noise    | Adversarial (Low / Med / High)          | Gaussian (Low / Med / High)             | Shot (Low / Med / High)
Sleep-EDF | Sors [32]  | ✗        | 0.67 ± 0.02 | 0.57 ± 0.02 / 0.51 ± 0.04 / 0.19 ± 0.06 | 0.66 ± 0.03 / 0.60 ± 0.03 / 0.39 ± 0.08 | 0.58 ± 0.04 / 0.42 ± 0.08 / 0.11 ± 0.03
Sleep-EDF | Liu [26]   | ✓        | 0.69 ± 0.02 | 0.52 ± 0.07 / 0.41 ± 0.07 / 0.09 ± 0.02 | 0.67 ± 0.02 / 0.53 ± 0.02 / 0.28 ± 0.04 | 0.52 ± 0.03 / 0.31 ± 0.04 / 0.06 ± 0.01
Sleep-EDF | Blanco [7] | ✓        | 0.68 ± 0.01 | 0.51 ± 0.06 / 0.40 ± 0.06 / 0.09 ± 0.02 | 0.65 ± 0.02 / 0.54 ± 0.04 / 0.31 ± 0.10 | 0.53 ± 0.04 / 0.34 ± 0.09 / 0.08 ± 0.02
Sleep-EDF | Ford [15]  | ✓        | 0.64 ± 0.01 | 0.59 ± 0.01 / 0.60 ± 0.02 / 0.31 ± 0.08 | 0.65 ± 0.01 / 0.67 ± 0.02 / 0.57 ± 0.03 | 0.67 ± 0.02 / 0.60 ± 0.02 / 0.10 ± 0.01
Sleep-EDF | Rest (A)   | ✓        | 0.66 ± 0.02 | 0.64 ± 0.02 / 0.64 ± 0.02 / 0.61 ± 0.02 | 0.66 ± 0.02 / 0.67 ± 0.01 / 0.66 ± 0.01 | 0.67 ± 0.01 / 0.66 ± 0.01 / 0.42 ± 0.06
Sleep-EDF | Rest (A+S) | ✓        | 0.69 ± 0.01 | 0.67 ± 0.02 / 0.66 ± 0.01 / 0.61 ± 0.03 | 0.69 ± 0.01 / 0.68 ± 0.01 / 0.67 ± 0.02 | 0.68 ± 0.01 / 0.67 ± 0.02 / 0.42 ± 0.08
SHHS      | Sors [32]  | ✗        | 0.78 ± 0.01 | 0.62 ± 0.03 / 0.46 ± 0.03 / 0.33 ± 0.00 | 0.64 ± 0.03 / 0.43 ± 0.02 / 0.35 ± 0.04 | 0.69 ± 0.02 / 0.59 ± 0.03 / 0.45 ± 0.01
SHHS      | Liu [26]   | ✓        | 0.77 ± 0.01 | 0.61 ± 0.02 / 0.49 ± 0.04 / 0.34 ± 0.03 | 0.66 ± 0.05 / 0.45 ± 0.05 / 0.34 ± 0.04 | 0.70 ± 0.04 / 0.62 ± 0.04 / 0.47 ± 0.05
SHHS      | Blanco [7] | ✓        | 0.77 ± 0.01 | 0.60 ± 0.03 / 0.47 ± 0.04 / 0.33 ± 0.02 | 0.64 ± 0.07 / 0.43 ± 0.05 / 0.34 ± 0.04 | 0.67 ± 0.06 / 0.59 ± 0.05 / 0.46 ± 0.04
SHHS      | Ford [15]  | ✓        | 0.62 ± 0.02 | 0.59 ± 0.01 / 0.62 ± 0.00 / 0.59 ± 0.05 | 0.66 ± 0.00 / 0.75 ± 0.04 / 0.47 ± 0.10 | 0.65 ± 0.00 / 0.68 ± 0.01 / 0.74 ± 0.04
SHHS      | Rest (A)   | ✓        | 0.70 ± 0.01 | 0.68 ± 0.00 / 0.70 ± 0.01 / 0.67 ± 0.01 | 0.72 ± 0.01 / 0.76 ± 0.01 / 0.58 ± 0.03 | 0.72 ± 0.01 / 0.74 ± 0.01 / 0.76 ± 0.01
SHHS      | Rest (A+S) | ✓        | 0.72 ± 0.01 | 0.69 ± 0.01 / 0.70 ± 0.01 / 0.69 ± 0.02 | 0.74 ± 0.01 / 0.77 ± 0.01 / 0.62 ± 0.03 | 0.73 ± 0.01 / 0.75 ± 0.01 / 0.78 ± 0.00

(2) Spectral Regularization Improves Performance. Rest (A+S) consistently improves upon Rest (A), indicating the usefulness of spectral regularization towards enhancing noise robustness by constraining the Lipschitz constant.

(3) SHHS Performance Better Than Sleep-EDF. Performance is generally better on the SHHS dataset compared to Sleep-EDF. One possible explanation is that the SHHS dataset is less noisy in comparison to the Sleep-EDF dataset. This stems from the fact that the SHHS study was performed in the hospital setting while Sleep-EDF was undertaken in the home setting.

(4) Benign & Adversarial Accuracy Trade-off. Contrary to the traditional trade-off between benign and adversarial accuracy, Rest performance matches Liu in the no noise setting on Sleep-EDF. This is likely attributable to the noise in the Sleep-EDF dataset, which was collected in the home setting. On the SHHS dataset, the Liu model outperforms Rest in the no noise setting, where data is captured in the less noise prone hospital setting. Due to this, Rest models are best positioned for use in noisy environments (e.g., at home), while traditional models are more effective in controlled environments (e.g., sleep labs).

II. Meso Analysis: Per-class Performance. We visualize and identify class-wise trends using confusion matrix heatmaps (Fig. 4). Each confusion matrix describes a model's performance for a given level of noise (or no noise). A model that is performing well should have a dark diagonal and light off-diagonal. We normalize the rows of each confusion matrix to accurately represent class predictions in an imbalanced dataset. When a matrix diagonal has a value of 1 (dark blue, or dark green) the model predicts every example correctly; the opposite occurs at 0 (white). Analyzing Figure 4, we identify the following key insights:

(1) Rest Performs Well Across All Classes. Rest accurately predicts each sleep stage (W, N1, N2, N3, REM) across multiple types of noise (Fig. 4, bottom 3 rows), as evidenced by the dark diagonal. In comparison, each baseline method has considerable performance degradation (light diagonal) in the presence of noise. This is particularly evident on the Sleep-EDF dataset (left half), where data is collected in the noisier home environment.

(2) N1 Class Difficult to Predict. When no noise is present (Fig. 4, top row), each method performs well, as evidenced by the dark diagonal, except on the N1 sleep stage class. This performance drop is likely due to the limited number of N1 examples in the datasets (see Table 1).

(3) Increased Misclassification Towards the "Wake" Class. On the Sleep-EDF dataset, shot and adversarial noise cause the baseline models to mispredict classes as Wake. One possible explanation is that the models misinterpret the additive noise as evidence for the Wake class, which has characteristically large fluctuations.

III. Granular Analysis: Single-patient Hypnograms. We want to more deeply understand how our Rest models counteract noise at the hypnogram level. Therefore, we select a test set patient from the SHHS dataset, and generate and visualize the patient's overnight hypnograms using the Sors and Rest models on three levels of Gaussian noise corruption (Figure 5). Each of these hypnograms is compared to a trained technician's hypnogram (expert scored in Fig. 5), representing the ground truth. We inspect a few more test set patients using the above approach, and identify multiple key representative insights:

(1) Noisy Environments Require Robust Models. As the data noise increases, Sors performance degrades. This begins at the low noise level, further accelerates at the medium level and reaches


[Figure 4 panels: row-normalized confusion matrices (classes W, N1, N2, N3, REM) for the Sors, Liu, Blanco, Ford and Rest (A+S) models on the Sleep-EDF and SHHS test sets under no noise and high-strength adversarial, Gaussian and shot noise; darker diagonals indicate more accurate predictions.]

Figure 4: Meso Analysis: Class-wise comparison of model predictions. The models are evaluated over the SHHS test set perturbed with different noise types. In each confusion matrix, rows are ground-truth classes while columns are predicted classes. The intensity of a cell is obtained by normalizing the score with respect to the class membership. When a cell has a value of 1 (dark blue, or dark green) the model predicts every example correctly; the opposite occurs at 0 (white). A model that is performing well would have a dark diagonal and light off-diagonal. Rest has the darkest cells along the diagonal on both datasets.


Figure 5: Granular Analysis: Comparison of the overnight hypnograms obtained for a patient in the SHHS test set. The hypnograms are generated using the Sors (left) and Rest (right) models in the presence of increasing strengths of Gaussian noise. When no noise is present (top row), both models perform well, closely matching the ground truth (bottom row). However, with increasing noise, Sors performance rapidly degrades, while Rest continues to generate accurate hypnograms.


[Figure 6 panels: inference time in seconds (shorter is better): 302 (Sors) vs. 31 (Rest, >9x faster) on Sleep-EDF, and 355 (Sors) vs. 57 (Rest, >6x faster) on SHHS; energy usage in Joules (shorter is better): 909 (Sors) vs. 53 (Rest, 17x more efficient) on Sleep-EDF, and 1143 (Sors) vs. 123 (Rest, >9x more efficient) on SHHS.]

Figure 6: Time and energy consumption for scoring a single night of EEG recordings. Rest (A+S) is significantly faster and more energy efficient than the state-of-the-art Sors model. Evaluations were done on a Pixel 2 smartphone.

nearly zero at the high level. In contrast, Rest effectively handles all levels of noise, generating an accurate hypnogram at even the highest level.

(2) Low Noise Environments Give Good Performance. In the no noise setting (top row), both the Sors and Rest models generate accurate hypnograms, closely matching the contours of the expert scoring (bottom).

4.6 Model Efficiency

We measure model efficiency along two dimensions—(1) static metrics: the amount of memory required to store the weights, and FLOPS; and (2) dynamic metrics: inference time and energy consumption. For dynamic measurements that depend on device hardware, we deploy each model to a Pixel 2 smartphone.

Analyzing Static Metrics: Memory & Flops. Table 5 describes the size (in KB) and computational requirements (in MFlops) of each model. We identify the following key insights:

(1) Rest Models Require the Fewest FLOPS. On both datasets, Rest requires the least number of FLOPS.

(2) Rest Models are Small. Rest models are also smaller than (or comparable to) the baseline compressed models while achieving significantly better noise robustness.

(3) Model Efficiency and Noise Robustness. Combining the insights from Section 4.5 and the above, we observe that Rest models have significantly better noise robustness while maintaining a competitive memory footprint. This suggests that robustness is more dependent on the training process than on model capacity.

Analyzing Dynamic Metrics: Inference Time & Energy. In Figure 6, we benchmark the inference time and energy consumption of a Sors and Rest model deployed on a Pixel 2 smartphone using Tensorflow Lite. We identify the following insights:

Table 5: Comparison of model size and the FLOPS required to score a single night of EEG recordings. Rest models are significantly smaller and comparable in size/compute to baselines.

Data      | Model      | Size (KB) | MFlops
Sleep-EDF | Sors [32]  | 8,896     | 1451
Sleep-EDF | Liu [26]   | 440       | 127
Sleep-EDF | Blanco [7] | 440       | 127
Sleep-EDF | Ford [15]  | 448       | 144
Sleep-EDF | Rest (A)   | 464       | 98
Sleep-EDF | Rest (A+S) | 449       | 94
SHHS      | Sors [32]  | 8,996     | 1815
SHHS      | Liu [26]   | 464       | 211
SHHS      | Blanco [7] | 464       | 211
SHHS      | Ford [15]  | 478       | 170
SHHS      | Rest (A)   | 476       | 160
SHHS      | Rest (A+S) | 496       | 142

(1) Rest Models Run Faster. When deployed, Rest runs 9× and 6× faster than the uncompressed model on the two datasets.

(2) Rest Models are Energy Efficient. Rest models also consume 17× and 9× less energy than an uncompressed model on the Sleep-EDF and SHHS datasets, respectively.

(3) Enabling Sleep Staging for Edge Computing. The above benefits demonstrate that model compression effectively translates into faster inference and a reduction in energy consumption. These benefits are crucial for deploying on the edge.

5 CONCLUSION

We identified two key challenges in developing deep neural networks for sleep monitoring in the home environment—robustness to noise and efficiency. We proposed to solve these challenges through Rest—a new method that simultaneously tackles both issues. For the sleep staging task over electroencephalogram (EEG), Rest trains models that achieve up to 19× parameter reduction and 15× MFLOPS reduction, with an increase of up to 0.36 in macro-F1 score in the presence of noise. By deploying these models to a smartphone, we demonstrate that Rest achieves up to 17× energy reduction and 9× faster inference.

6 ACKNOWLEDGMENTS

This work was in part supported by NSF awards IIS-1418511, CCF-1533768, IIS-1838042, CNS-1704701 and IIS-1563816; GRFP (DGE-1650044); and National Institute of Health awards NIH R01 1R01NS107291-01 and R56HL138415.

REFERENCES

[1] Bruce M Altevogt, Harvey R Colten, et al. 2006. Sleep disorders and sleep deprivation: an unmet public health problem. National Academies Press.

[2] F. Andreotti, H. Phan, N. Cooray, C. Lo, M. T. M. Hu, and M. De Vos. 2018. Multichannel Sleep Stage Classification and Transfer Learning using Convolutional Neural Networks. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 171–174. https://doi.org/10.1109/EMBC.2018.8512214


[3] Z Beattie, A Pantelopoulos, A Ghoreyshi, Y Oyang, A Statan, and C Heneghan. 2017. 0068 ESTIMATION OF SLEEP STAGES USING CARDIAC AND ACCELEROMETER DATA FROM A WRIST-WORN DEVICE. Sleep 40, suppl_1 (April 2017), A26–A26.

[4] Richard B Berry, Rita Brooks, Charlene E Gamaldo, Susan M Harding, Carole L Marcus, Bradley V Vaughn, et al. 2012. The AASM manual for the scoring of sleep and associated events. Rules, Terminology and Technical Specifications, Darien, Illinois, American Academy of Sleep Medicine 176 (2012).

[5] Vikrant Bhateja, Shabana Urooj, Rishendra Verma, and Rini Mehrotra. 2013. A novel approach for suppression of powerline interference and impulse noise in ECG signals. In IMPACT-2013. IEEE, 103–107.

[6] Siddharth Biswal, Joshua Kulas, Haoqi Sun, Balaji Goparaju, M. Brandon Westover, Matt T. Bianchi, and Jimeng Sun. 2017. SLEEPNET: Automated Sleep Staging System via Deep Learning. CoRR abs/1707.08262 (2017). arXiv:1707.08262 http://arxiv.org/abs/1707.08262

[7] S Blanco, S Kochen, OA Rosso, and P Salgado. 1997. Applying time-frequency analysis to seizure EEG activity. IEEE Engineering in medicine and biology magazine 16, 1 (1997), 64–71.

[8] Manuel Blanco-Velasco, Binwei Weng, and Kenneth E Barner. 2008. ECG signal denoising and baseline wander correction based on the empirical mode decomposition. Computers in biology and medicine 38, 1 (2008), 1–13.

[9] Stanislas Chambon, Mathieu N Galtier, Pierrick J Arnal, Gilles Wainrib, and Alexandre Gramfort. 2018. A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series. IEEE Transactions on Neural Systems and Rehabilitation Engineering 26, 4 (2018), 758–769.

[10] Kang-Ming Chang and Shing-Hong Liu. 2011. Gaussian noise filtering from ECG by Wiener filter and ensemble empirical mode decomposition. Journal of Signal Processing Systems 64, 2 (2011), 249–264.

[11] Yongjian Chen, Masatake Akutagawa, Takahiro Emoto, and Yohsuke Kinouchi. 2010. The removal of EMG in EEG by neural networks. Physiological measurement 31, 12 (2010), 1567.

[12] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. 2017. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 854–863.

[13] Rahul Duggal, Cao Xiao, Richard Vuduc, and Jimeng Sun. 2019. CUP: Cluster Pruning for Compressing Deep Neural Networks. arXiv preprint arXiv:1911.08630 (2019).

[14] Farzan Farnia, Jesse M Zhang, and David Tse. 2018. Generalizable Adversarial Training via Spectral Normalization. arXiv preprint arXiv:1811.07457 (2018).

[15] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. 2019. Adversarial Examples Are a Natural Consequence of Test Error in Noise. CoRR abs/1901.10513 (2019). arXiv:1901.10513 http://arxiv.org/abs/1901.10513

[16] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220.

[17] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).

[18] Yiwen Guo, Chao Zhang, Changshui Zhang, and Yurong Chen. 2018. Sparse DNNs with improved adversarial robustness. In Advances in neural information processing systems. 242–251.

[19] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.

[20] Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019).

[21] André Henriksen, Martin Haugen Mikalsen, Ashenafi Zebene Woldaregay, Miroslav Muzny, Gunnar Hartvigsen, Laila Arnesdatter Hopstock, and Sameline Grimsgaard. 2018. Using fitness trackers and smartwatches to measure physical activity in research: analysis of consumer wrist-worn wearables. Journal of medical Internet research 20, 3 (2018), e110.

[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).

[23] Vadim Lebedev and Victor Lempitsky. 2016. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2554–2564.

[24] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).

[25] Ji Lin, Chuang Gan, and Song Han. 2019. Defensive quantization: When efficiency meets robustness. arXiv preprint arXiv:1904.08444 (2019).

[26] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision. 2736–2744.

[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017).

[28] American Academy of Sleep Medicine et al. 2016. Economic Burden of Undiagnosed Sleep Apnea in US is Nearly $150 Billion per Year. Published on the American Academy of Sleep Medicine’s official website, on August 8 (2016).

[29] H. Phan, F. Andreotti, N. Cooray, O. Y. Chén, and M. De Vos. 2019. Joint Classification and Prediction CNN Framework for Automatic Sleep Stage Classification. IEEE Transactions on Biomedical Engineering 66, 5 (May 2019), 1285–1296. https://doi.org/10.1109/TBME.2018.2872652

[30] Stuart F Quan, Barbara V Howard, Conrad Iber, James P Kiley, F Javier Nieto, George T O’Connor, David M Rapoport, Susan Redline, John Robbins, Jonathan M Samet, et al. 1997. The sleep heart health study: design, rationale, and methods. Sleep 20, 12 (1997), 1077–1085.

[31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision. Springer, 525–542.

[32] Arnaud Sors, Stéphane Bonnet, Sébastien Mirek, Laurent Vercueil, and Jean-François Payen. 2018. A convolutional neural network for sleep stage scoring from raw single-channel EEG. Biomedical Signal Processing and Control 42 (2018), 107–114. https://doi.org/10.1016/j.bspc.2017.12.001

[33] Annette Sterr, James K Ebajemito, Kaare B Mikkelsen, Maria A Bonmati-Carrion, Nayantara Santhi, Ciro Della Monica, Lucinda Grainger, Giuseppe Atzori, Victoria Revell, Stefan Debener, et al. 2018. Sleep EEG derived from behind-the-ear electrodes (cEEGrid) compared to standard polysomnography: A proof of concept study. Frontiers in human neuroscience 12 (2018), 452.

[34] Akara Supratak, Hao Dong, Chao Wu, and Yike Guo. 2017. DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering 25, 11 (2017), 1998–2008.

[35] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. 2018. Robustness may be at odds with accuracy. stat 1050 (2018), 11.

[36] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems. 2074–2082.

[37] Jian Xue, Jinyu Li, and Yifan Gong. 2013. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech. 2365–2369.

[38] Yuichi Yoshida and Takeru Miyato. 2017. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941 (2017).

[39] Mingmin Zhao, Shichao Yue, Dina Katabi, Tommi S Jaakkola, and Matt T Bianchi. 2017. Learning Sleep Stages from Radio Signals: A Conditional Adversarial Architecture. 70 (2017), 4100–4109.

[40] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems. 875–886.

[40] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, QingyaoWu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware channel prun-ing for deep neural networks. In Advances in Neural Information ProcessingSystems. 875–886.