
DensePose From WiFi

by J.I SHIN 2023. 2. 8.

Comparison of image-based DensePose (left) and WiFi-based DensePose (right)




Advances in computer vision and machine learning techniques have led to significant development in 2D and 3D human pose estimation from RGB cameras, LiDAR, and radars. However, human pose estimation from images is adversely affected by occlusion and lighting, which are common in many scenarios of interest. Radar and LiDAR technologies, on the other hand, need specialized hardware that is expensive and power-intensive. Furthermore, placing these sensors in non-public areas raises significant privacy concerns. To address these limitations, recent research has explored the use of WiFi antennas (1D sensors) for body segmentation and key-point body detection. This paper further expands on the use of the WiFi signal in combination with deep learning architectures, commonly used in computer vision, to estimate dense human pose correspondence. We developed a deep neural network that maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions. The results of the study reveal that our model can estimate the dense pose of multiple subjects, with comparable performance to image-based approaches, by utilizing WiFi signals as the only input. This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.


📃 Quick Review


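Using WiFi signals for human pose detection has several advantages over existing approaches based on images or on LiDAR and radar sensors. It is less affected by lighting and occlusion, it is inexpensive because it needs no camera or specialized hardware, it is broadly accessible, it raises fewer privacy concerns, and it can be developed as a lightweight model.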


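Sensor prices have been falling quickly, but LiDAR still costs millions to tens of millions of won and radar costs hundreds of thousands to millions of won. Seen in that light, a paper that makes DensePose possible with two routers feels a little special. Elderly people living alone, industrial sites, and hospital rooms were the first applications that came to mind. I once worked with a company on a project that used radar to predict heart rate and respiration rate, and I hope to see more vision research built on this kind of signal processing rather than only on cameras. After all, capturing everyday life on camera is a heavy burden.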

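That said, this idea clearly still has limitations to resolve. The accuracy and reliability of the pose estimates are affected not only by variations in WiFi signal strength and quality but also by other objects in the environment. It is also unclear how well the method generalizes to WiFi environments that differ from those studied in the paper. Even so, because it relies only on WiFi signals, it could have a large impact on fields such as health, sports, and entertainment.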




Much progress has been made in human pose estimation using 2D [7, 8, 12, 22, 28, 33] and 3D [17, 32] sensors in the last few years (e.g., RGB sensors, LiDARs, radars), fueled by applications in autonomous driving and augmented reality. These traditional sensors, however, are constrained by both technical and practical considerations. LiDAR and radar sensors are frequently seen as being out of reach for the average household or small business due to their high cost. For example, the medium price of one of the most common COTS LiDARs, the Intel L515, is around 700 dollars, and the prices for ordinary radar detectors range from 200 dollars to 600 dollars. In addition, these sensors are too power-consuming for daily and household use. As for RGB cameras, narrow field of view and poor lighting conditions, such as glare and darkness, can have a severe impact on camera-based approaches. Occlusion is another obstacle that prevents the camera-based model from generating reasonable pose predictions in images. This is especially worrisome for indoor scenarios, where furniture typically occludes people. More importantly, privacy concerns prevent the use of these technologies in non-public places. For instance, most people are uncomfortable with having cameras recording them in their homes, and in certain areas (such as the bathroom) it will not be feasible to install them. These issues are particularly critical in healthcare applications, which are increasingly shifting from clinics to homes, where people are being monitored with the help of cameras and other sensors. It is important to resolve the aforementioned problems in order to better assist the aging population, which is the most susceptible (especially during COVID) and has a growing demand to keep living independently at home. We believe that WiFi signals can serve as a ubiquitous substitute for RGB images for human sensing in certain instances. 
Illumination and occlusion have little effect on WiFi-based solutions used for interior monitoring. In addition, they protect individuals’ privacy and the required equipment can be bought at a reasonable price. In fact, most households in developed countries already have WiFi at home, and this technology may be scaled to monitor the well-being of elderly people or just identify suspicious behaviors at home. The issue we are trying to solve is depicted in Fig. 1 (first row). Given three WiFi transmitters and three aligned receivers, can we detect and recover dense human pose correspondence in cluttered scenarios with multiple people (Fig. 1 fourth row)? It should be noted that many WiFi routers, such as TP-Link AC1750, come with 3 antennas, so our method only requires 2 of these routers. Each of these routers costs around 30 dollars, which means our entire setup is still far cheaper than LiDAR and radar systems. Many factors make this a difficult task to solve. First of all, WiFi-based perception [11, 30] is based on the channel state information (CSI), which represents the ratio between the transmitted signal wave and the received signal wave. The CSIs are complex decimal sequences that, unlike image pixels, have no spatial correspondence to locations in the scene. Secondly, classic techniques rely on accurate measurement of time-of-flight and angle-of-arrival of the signal between the transmitter and receiver [13, 26]. These techniques only locate the object’s center; moreover, the localization accuracy is only around 0.5 meters due to the random phase shift allowed by the IEEE 802.11n/ac WiFi communication standard and potential interference from electronic devices operating in a similar frequency range, such as microwave ovens and cellphones. To address these issues, we derive inspiration from recently proposed deep learning architectures in computer vision, and propose a neural network architecture that can perform dense pose estimation from WiFi. 
Fig 1 (bottom row) illustrates how our algorithm is able to estimate dense pose using only WiFi signal in scenarios with occlusion and multiple people.


📃 Quick Review


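The introduction discusses the limitations of current human pose estimation methods that rely on cameras, LiDAR, and radar. These methods are affected by factors such as cost, power consumption, lighting, and privacy concerns. The authors therefore propose using WiFi signals, which are more accessible and cost-effective and which protect personal privacy. For example, they say the setup works with just two TP-Link routers that cost around 30,000 won each.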


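WiFi signals are used for human pose estimation through the channel state information (CSI), which is computed as the ratio between the transmitted and received waves. However, this information does not encode spatial location. In addition, various factors limit the localization accuracy of WiFi signals to only about 0.5 m. To address these problems, the authors take inspiration from recent vision architectures and propose a neural network architecture that can perform dense pose estimation from WiFi.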





This section briefly describes existing work on dense estimation from images and human sensing from WiFi. Our research aims to conduct dense pose estimation via WiFi. In computer vision, the subject of dense pose estimation from pictures and video has received a lot of attention [6, 8, 18, 40]. This task consists of finding the dense correspondence between image pixels and the dense vertex indexes of a 3D human body model. The pioneering work of Güler et al. [8] mapped human images to dense correspondences of a human mesh model using deep networks. DensePose is based on instance segmentation architectures such as Mask-RCNN [9], and predicts body-wise UV maps for each pixel, where UV maps are flattened representations of 3D geometry, with coordinate points usually corresponding to the vertices of a 3D object. In this work, we borrow the same architecture as DensePose [8]; however, our input will not be an image or video, but we use 1D WiFi signals to recover the dense correspondence. Recently, there have been many extensions of DensePose proposed, especially in 3D human reconstruction with dense body parts [3, 35, 37, 38]. Shapovalov et al.’s [24] work focused on lifting dense pose surface maps to 3D human models without 3D supervision. Their network demonstrates that the dense correspondence alone (without using full 2D RGB images) contains sufficient information to generate a posed 3D human body. Compared to previous works on reconstructing 3D humans with sparse 2D keypoints, DensePose annotations are much denser and provide information about the 3D surface instead of 2D body joints. While there is an extensive literature on detection [19, 20], tracking [4, 34], and dense pose estimation [8, 18] from images and videos, human pose estimation from WiFi or radar is a relatively unexplored problem.
At this point, it is important to differentiate the current work on radar-based systems and WiFi. The work of Adib et al. [2] proposed a Frequency Modulated Continuous Wave (FMCW) radar system (broad bandwidth from 5.56GHz to 7.25GHz) for indoor human localization. A limitation of this system is the specialized hardware for synchronizing the transmission, refraction, and reflection to compute the Time-of-Flight (ToF). The system reached a resolution of 8.8 cm on body localization. In the following work [1], they improved the system by focusing on a moving person and generated a rough single-person outline with depth maps. Recently, they applied deep learning approaches to do fine-grained human pose estimation using a similar system, named RF-Pose [39]. These systems do not work under the IEEE 802.11n/ac WiFi communication standard (40MHz bandwidth centered at 2.4GHz). They rely on additional high-frequency and high-bandwidth electromagnetic fields, which need specialized technology not available to the general public. Recently, significant improvements have been made to radar-based human sensing systems. mmMesh [36] generates 3D human mesh from commercially portable millimeter-wave devices. This system can accurately localize the vertices on the human mesh with an average error of 2.47 cm. However, mmMesh does not work well with occlusions since high-frequency radio waves cannot penetrate objects.
Unlike the above radar systems, the WiFi-based solution [11, 30] used off-the-shelf WiFi adapters and 3dB omnidirectional antennas. The signals propagate as the IEEE 802.11n/ac WiFi data packages transmitted between antennas, which does not introduce additional interference. However, WiFi-based person localization using the traditional time-of-flight (ToF) method is limited by its wavelength and signal-to-noise ratio. Most existing approaches only conduct center mass localization [5, 27] and single-person action classification [25, 29]. Recently, Fei Wang et al. [31] demonstrated that it is possible to detect 17 2D body joints and predict a 2D semantic body segmentation mask using only WiFi signals. In this work, we go beyond [31] by estimating dense body pose, with much more accuracy than the 0.5m that the WiFi signal can provide theoretically. Our dense posture outputs push above WiFi’s signal constraint in body localization, paving the road for complete dense 2D and possibly 3D human body perception through WiFi. To achieve this, instead of directly training a randomly initialized WiFi-based model, we explored rich supervision information to improve both the performance and training efficiency, such as utilizing the CSI phase, adding a keypoint detection branch, and transfer learning from the image-based model.


📃 Quick Review


To understand this paper, we first need to look at DensePose.




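The DensePose-COCO training data was built by having human annotators manually add annotations for the 24 body parts to 50,000 COCO images, and the UV coordinates that represent the 3D surface are generated from these annotations. DensePose was trained on this large-scale dataset, in which images and surfaces are fully in correspondence.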


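One of the key pieces of DensePose is how the mapping from 2D points to 3D surface coordinates is scored, which is done with the GPS (Geodesic Point Similarity) formula. A geodesic here can be thought of as the shortest path between two points on a curved surface.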



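Here, Pj denotes the ground-truth annotations of person j, g computes the geodesic distance, ip and îp are the estimated and ground-truth values at point p, and κ is a normalizing parameter set to 0.255. DensePose from WiFi uses this formula unchanged.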


DensePose-RCNN is then trained on this data. It works by combining surface segmentation with point correspondence. It uses a Feature Pyramid Network (FPN) and ROI-Align pooling, and adopts the Mask-RCNN architecture to obtain the labels and coordinates for each selected region.



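The difference from Mask-RCNN lies in the output. Mask-RCNN outputs a mask indicating whether each pixel belongs to an object, which makes it well suited to instance segmentation, whereas DensePose-RCNN outputs a pose map for every pixel, which enables more precise pose estimation.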






Our approach produces UV coordinates of the human body surface from WiFi signals using three components: first, the raw CSI signals are cleaned by amplitude and phase sanitization. Then, a two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps that resemble images. The 2D features are then fed to a modified DensePose-RCNN architecture [8] to estimate the UV map, a representation of the dense correspondence between 2D and 3D humans. To improve the training of our WiFi-input network, we conduct transfer learning, where we minimize the differences between the multi-level feature maps produced by images and those produced by WiFi signals before training our main network. The raw CSI data are sampled at 100Hz as complex values over 30 subcarrier frequencies (linearly spaced within 2.4GHz±20MHz) transmitted among 3 emitter antennas and 3 reception antennas (see Figure 2). Each CSI sample contains a 3 × 3 real integer matrix and a 3 × 3 imaginary integer matrix. The inputs of our network contain 5 consecutive CSI samples under 30 frequencies, which are organized in a 150×3×3 amplitude tensor and a 150×3×3 phase tensor respectively. Our network outputs include a 17 × 56 × 56 tensor of keypoint heatmaps (one 56 × 56 map for each of the 17 keypoints) and a 25 × 112 × 112 tensor of UV maps (one 112 × 112 map for each of the 24 body parts with one additional map for background).
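The input layout described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the axis order (sample, subcarrier, emitter, receiver), the function name `csi_to_tensors`, and the random integer data are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def csi_to_tensors(csi):
    """Turn 5 consecutive CSI samples into amplitude and phase tensors.

    csi: complex array of shape (5, 30, 3, 3), assumed to be ordered as
         (sample, subcarrier, emitter, receiver).
    Returns two real arrays of shape (150, 3, 3).
    """
    if csi.shape != (5, 30, 3, 3):
        raise ValueError(f"expected shape (5, 30, 3, 3), got {csi.shape}")
    # Merge the sample and subcarrier axes: 5 * 30 = 150 channels.
    stacked = csi.reshape(150, 3, 3)
    amplitude = np.abs(stacked)    # sqrt(a^2 + b^2)
    phase = np.angle(stacked)      # four-quadrant arctangent, in (-pi, pi]
    return amplitude, phase

# Hypothetical raw data: integer real and imaginary parts, as in the text.
rng = np.random.default_rng(0)
real = rng.integers(-128, 128, size=(5, 30, 3, 3))
imag = rng.integers(-128, 128, size=(5, 30, 3, 3))
csi = real + 1j * imag

amplitude, phase = csi_to_tensors(csi)
print(amplitude.shape, phase.shape)
```

The phase returned here is still the raw, wrapped phase; it would need the sanitization of Section 3.1 before being used as network input.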


3.1 Phase Sanitization


The raw CSI samples are noisy with random phase drift and flip (see Figure 3(b)). Most WiFi-based solutions disregard the phase of CSI signals and rely only on their amplitude (see Figure 3(a)). As shown in our experimental validation, discarding the phase information has a negative impact on the performance of our model. In this section, we perform sanitization to obtain stable phase values to enable full use of the CSI information. In raw CSI samples (5 consecutive samples visualized in Figure 3(a-b)), the amplitude (A) and phase (Φ) of each complex element z = a + bi are computed using the formulation A = √(a² + b²) and Φ = arctan(b/a). Note that the range of the arctan function is from −π to π and the phase values outside this range get wrapped, leading to a discontinuity in phase values. Our first sanitization step is to unwrap the phase following [10]:
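A standard form of the unwrapping rule, written with the indices i (measurement) and j (subcarrier) used in this section, is given here as a reconstruction under that assumption rather than a verbatim copy of the paper's equation:

```latex
\begin{align}
\Delta\phi_{i,j} &= \Phi_{i,j+1} - \Phi_{i,j} \\
\text{if } \Delta\phi_{i,j} > \pi:\quad \Phi_{i,j+1} &= \Phi_{i,j} + \Delta\phi_{i,j} - 2\pi \\
\text{if } \Delta\phi_{i,j} < -\pi:\quad \Phi_{i,j+1} &= \Phi_{i,j} + \Delta\phi_{i,j} + 2\pi
\end{align}
```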



where i denotes the index of the measurements in the five consecutive samples, and j denotes the index of the subcarriers (frequencies). Following unwrapping, each of the flipping phase curves in Figure 3(b) is restored to a continuous curve in Figure 3(c). Observe that among the 5 phase curves captured in 5 consecutive samples in Figure 3(c), there are random jitters that break the temporal order among the samples. To keep the temporal order of signals, previous work [23] mentioned linear fitting as a popular approach. However, directly applying linear fitting to Figure 3(c) further amplified the jitter instead of fixing it (see the failed results in Figure 3(d)). Starting from Figure 3(c), we use median and uniform filters to eliminate outliers in both the time and frequency domains, which leads to Figure 3(e). Finally, we obtain the fully sanitized phase values by applying the linear fitting method following the equations below:



where F denotes the largest subcarrier index (30 in our case) and φ̂_f is the sanitized phase value at subcarrier f (the f-th frequency). In Figure 3(f), the final phase curves are temporally consistent.
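The three sanitization steps (unwrap, filter, linear fit) can be sketched with NumPy as follows. This is a minimal sketch under assumptions, not the paper's implementation: the helper names are hypothetical, only a median filter is shown (the uniform filter is omitted), the window size of 3 is arbitrary, and the linear fit uses a common detrending variant (slope from the endpoints, offset from the mean) whose constants may differ from the paper's equations.

```python
import numpy as np

def unwrap_phase(phase):
    """Remove 2*pi jumps along the subcarrier axis (last axis)."""
    return np.unwrap(phase, axis=-1)

def median_filter_1d(x, size=3, axis=-1):
    """Edge-padded sliding median along one axis (outlier removal)."""
    x = np.moveaxis(x, axis, -1)
    pad = size // 2
    padded = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(pad, pad)], mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size, axis=-1)
    return np.moveaxis(np.median(windows, axis=-1), -1, axis)

def linear_fit_sanitize(phase):
    """Subtract a fitted line a1 * f + a0 from each phase curve.

    Assumed variant: a1 is the endpoint slope, a0 is the mean phase.
    """
    num_subcarriers = phase.shape[-1]
    f = np.arange(num_subcarriers)
    a1 = (phase[..., -1:] - phase[..., :1]) / (num_subcarriers - 1)
    a0 = phase.mean(axis=-1, keepdims=True)
    return phase - (a1 * f + a0)

def sanitize_phase(raw_phase):
    """raw_phase: array of shape (5, 30) for one emitter-receiver pair."""
    phase = unwrap_phase(raw_phase)
    phase = median_filter_1d(phase, size=3, axis=0)   # time domain
    phase = median_filter_1d(phase, size=3, axis=-1)  # frequency domain
    return linear_fit_sanitize(phase)

# Synthetic check: a wrapped linear phase ramp becomes a flat curve.
subcarriers = np.arange(30)
wrapped = np.angle(np.exp(1j * (0.9 * subcarriers + 0.3)))
raw = np.tile(wrapped, (5, 1))
clean = sanitize_phase(raw)
print(clean.shape, float(clean.std()))
```

Filtering before the linear fit mirrors the order described above, where fitting directly on the unwrapped curves amplified the jitter.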



3.2 Modality Translation Network


In order to estimate the UV maps in the spatial domain from the 1D CSI signals, we first transform the network inputs from the CSI domain to the spatial domain. This is done with the Modality Translation Network (see Figure 4). We first extract the CSI latent space features using two encoders, one for the amplitude tensor and the other for the phase tensor, where both tensors have the size of 150×3×3 (5 consecutive samples, 30 frequencies, 3 emitters and 3 receivers). Previous work on human sensing with WiFi [30] stated that a Convolutional Neural Network (CNN) can be used to extract spatial features from the last two dimensions (the 3 × 3 transmitting sensor pairs) of the input tensors. We, on the other hand, believe that locations in the 3×3 feature map do not correlate with the locations in the 2D scene. More specifically, as depicted in Figure 2(b), the element that is colored in blue represents a 1D summary of the entire scene captured by emitter 1 and receiver 3 (E1 - R3), instead of local spatial information of the top right corner of the 2D scene. Therefore, we consider that each of the 1350 elements (in both tensors) captures a unique 1D summary of the entire scene. Following this idea, the amplitude and phase tensors are flattened and fed into two separate multi-layer perceptrons (MLP) to obtain their features in the CSI latent space. We concatenate the 1D features from both encoding branches, then the combined tensor is fed to another MLP to perform feature fusion. The next step is to transform the CSI latent space features to feature maps in the spatial domain. As shown in Figure 4, the fused 1D feature is reshaped into a 24 × 24 2D feature map. Then, we extract the spatial information by applying two convolution blocks and obtain a more condensed map with the spatial dimension of 6×6. Finally, four deconvolution layers are used to upsample the encoded feature map in low dimensions to the size of 3 × 720 × 1280. We set such an output tensor size to match the dimension commonly used in RGB-image-input networks. We now have a scene representation in the image domain generated by WiFi signals.
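The first half of this pipeline (flatten, two MLP encoders, fusion MLP, reshape to 24 × 24) can be traced with a small NumPy shape walkthrough. Everything beyond the shapes stated in the text is an assumption: the hidden width of 512, the number of layers, the ReLU activations, and the random placeholder weights are hypothetical, and the convolution and deconvolution stages are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, layer_sizes):
    """Placeholder MLP: random weights, ReLU between layers (untrained)."""
    for i, out_dim in enumerate(layer_sizes):
        weights = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
        x = x @ weights
        if i < len(layer_sizes) - 1:
            x = np.maximum(x, 0.0)
    return x

def csi_to_spatial_map(amplitude, phase, hidden=512):
    """Map two (150, 3, 3) CSI tensors to one 24 x 24 feature map."""
    # Each of the 1350 elements is treated as a 1D summary of the scene,
    # so the tensors are flattened instead of convolved over the 3 x 3 grid.
    amp_features = mlp(amplitude.reshape(-1), [hidden, hidden])
    phase_features = mlp(phase.reshape(-1), [hidden, hidden])
    # Feature fusion: concatenate both branches, then one more MLP.
    fused = mlp(np.concatenate([amp_features, phase_features]), [hidden, 24 * 24])
    return fused.reshape(24, 24)

amplitude = rng.random((150, 3, 3))
phase = rng.random((150, 3, 3))
feature_map = csi_to_spatial_map(amplitude, phase)
print(feature_map.shape)
```

In the full network this 24 × 24 map would then pass through the convolution blocks (down to 6 × 6) and the deconvolution layers (up to 3 × 720 × 1280) described above.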



3.3 WiFi-DensePose RCNN


After we obtain the 3×720×1280 scene representation in the image domain, we can utilize image-based methods to predict the UV maps of human bodies. State-of-the-art pose estimation algorithms are two-stage; first, they run an independent person detector to estimate the bounding box and then conduct pose estimation from person-wise image patches. However, as stated before, each element in our CSI input tensors is a summary of the entire scene. It is not possible to extract the signals corresponding to a single person from a group of people in the scene. Therefore, we decide to adopt a network structure similar to DensePose-RCNN [8], since it can predict the dense correspondence of multiple humans in an end-to-end fashion. More specifically, in the WiFi-DensePose RCNN (Figure 5), we extract the spatial features from the obtained 3 × 720 × 1280 image-like feature map using the ResNet-FPN backbone [14]. Then, the output goes through the region proposal network [20]. To better exploit the complementary information of different sources, the next part of our network contains two branches: the DensePose head and the Keypoint head. Estimating keypoint locations is more reliable than estimating dense correspondences, so we can train our network to use keypoints to restrict DensePose predictions from getting too far from the body joints of humans. The DensePose head utilizes a Fully Convolutional Network (FCN) [16] to densely predict human part labels and surface coordinates (UV coordinates) within each part, while the keypoint head uses an FCN to estimate the keypoint heatmap. The results are combined and then fed into the refinement unit of each branch, where each refinement unit consists of two convolutional blocks followed by an FCN. The network outputs a 17 × 56 × 56 keypoint mask and a 25 × 112 × 112 IUV map. The process is demonstrated in Figure 5. It should be noted that the modality translation network and the WiFi-DensePose RCNN are trained together.



3.4 Transfer Learning


Training the Modality Translation Network and WiFi-DensePose RCNN network from a random initialization takes a lot of time (roughly 80 hours). To improve the training efficiency, we conduct transfer learning from an image-based DensePose network to our WiFi-based network (see Figure 6 for details). The idea is to supervise the training of the WiFi-based network with the pre-trained image-based network. Directly initializing the WiFi-based network with image-based network weights does not work because the two networks get inputs from different domains (image and channel state information). Instead, we first train an image-based DensePose-RCNN model as a teacher network. Our student network consists of the modality translation network and the WiFi-DensePose RCNN. We fix the teacher network weights and train the student network by feeding them with the synchronized images and CSI tensors, respectively. We update the student network such that its backbone (ResNet) features mimic those of our teacher network. Our transfer learning goal is to minimize the differences between multiple levels of feature maps generated by the student model and those generated by the teacher model. Therefore we calculate the mean squared error between feature maps. The transfer learning loss from the teacher network to the student network is:
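Written out from the description in this section (a mean squared error at each of the four feature levels, added together), the loss would take the following form. It is a reconstruction from the surrounding definitions, with equal weighting of the levels as an assumption:

```latex
\begin{equation}
L_{tr} = MSE(P_2, P_2^{*}) + MSE(P_3, P_3^{*}) + MSE(P_4, P_4^{*}) + MSE(P_5, P_5^{*})
\end{equation}
```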



where MSE(·) computes the mean squared error between two feature maps, {P2, P3, P4, P5} is a set of feature maps produced by the teacher network [14], and {P2*, P3*, P4*, P5*} is the set of feature maps produced by the student network [14]. Benefiting from the additional supervision from the image-based model, the student network gets higher performance and takes fewer iterations to converge (please see results in Table 5).



3.5 Losses


The total loss of our approach is computed as:
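A form consistent with the five loss terms named in this section and the three weights λ_dp, λ_kp, λ_tr set in Section 4.3 is shown below. It is a reconstruction under that assumption, in particular that the classification and box losses carry unit weight:

```latex
\begin{equation}
L = L_{cls} + L_{box} + \lambda_{dp} L_{dp} + \lambda_{kp} L_{kp} + \lambda_{tr} L_{tr}
\end{equation}
```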



where L_cls, L_box, L_dp, L_kp, L_tr are losses for the person classification, bounding box regression, DensePose, keypoints, and transfer learning respectively. The classification loss L_cls and the box regression loss L_box are standard RCNN losses [9, 21]. The DensePose loss L_dp [8] consists of several sub-components: (1) Cross-entropy loss for the coarse segmentation tasks. Each pixel is classified as either belonging to the background or one of the 24 human body regions. (2) Cross-entropy loss for body part classification and smooth L1 loss for UV coordinate regression. These losses are used to determine the exact coordinates of the pixels, i.e., 24 regressors are created to break the full human into small parts and parameterize each piece using a local two-dimensional UV coordinate system that identifies the position of UV nodes on this surface part.

We add L_kp to help the DensePose head balance between the torso, which has more UV nodes, and the limbs, which have fewer UV nodes. Inspired by Keypoint RCNN [9], we one-hot-encode each of the 17 ground truth keypoints in one 56×56 heatmap, generating 17×56×56 keypoint heatmaps, and supervise the output with the cross-entropy loss. To closely regularize the DensePose regression, the keypoint heatmap regressor takes the same input features used by the DensePose UV maps.
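The one-hot keypoint encoding and its cross-entropy supervision can be sketched as follows. The function names, the assumption that keypoints are given as (x, y) coordinates normalized to the person box, and the omission of keypoint visibility flags are all simplifications made for illustration.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, size=56):
    """One-hot encode keypoints into per-keypoint heatmaps.

    keypoints: array of shape (17, 2) holding (x, y) in [0, 1),
               assumed to be normalized to the person bounding box.
    Returns an array of shape (17, size, size) with a single 1 per map.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    count = len(keypoints)
    cols = np.clip((keypoints[:, 0] * size).astype(int), 0, size - 1)
    rows = np.clip((keypoints[:, 1] * size).astype(int), 0, size - 1)
    heatmaps = np.zeros((count, size, size), dtype=np.float32)
    heatmaps[np.arange(count), rows, cols] = 1.0
    return heatmaps

def heatmap_cross_entropy(logits, heatmaps):
    """Softmax cross-entropy over the size*size locations of each map."""
    flat = logits.reshape(len(logits), -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    targets = heatmaps.reshape(len(heatmaps), -1).argmax(axis=1)
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
keypoints = rng.random((17, 2))
heatmaps = keypoints_to_heatmaps(keypoints)
uniform_loss = heatmap_cross_entropy(np.zeros((17, 56, 56)), heatmaps)
print(heatmaps.shape, uniform_loss)
```

With all-zero logits the prediction is uniform over the 56 × 56 locations, so the loss equals log(3136), which gives a quick sanity check for an untrained head.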





This section presents the experimental validation of our WiFi-based DensePose.


4.1 Dataset


We used the dataset described in [31], which contains CSI samples taken at 100Hz from receiver antennas and videos recorded at 20 FPS. Time stamps are used to synchronize CSI and the video frames such that 5 CSI samples correspond to 1 video frame. The dataset was gathered in sixteen spatial layouts: six captures in the lab office and ten captures in the classroom. Each capture is around 13 minutes with 1 to 5 subjects (8 subjects in total for the entire dataset) performing daily activities under the layout described in Figure 2(a). The sixteen spatial layouts differ in the relative locations and orientations of the WiFi-emitter antennas, person, furniture, and WiFi-receiver antennas. There are no manual annotations for the dataset. We use the MS-COCO-pre-trained dense model "R_101_FPN_s1x_legacy" and MS-COCO-pre-trained Keypoint R-CNN "R101-FPN" to produce the pseudo ground truth. We denote the ground truth as "R101-Pseudo-GT" (see an annotated example in Figure 7). The R101-Pseudo-GT includes person bounding boxes, person-instance segmentation masks, body-part UV maps, and person-wise keypoint coordinates.

Throughout the section, we use R101-Pseudo-GT to train our WiFi-based DensePose model as well as to fine-tune the image-based baseline "R_50_FPN_s1x_legacy".



4.2 Training/Testing protocols and Metrics


We report results on two protocols: (1) Same layout: We train on the training set in all 16 spatial layouts, and test on the remaining frames. Following [31], we randomly select 80% of the samples to be our training set, and the rest to be our testing set. The training and testing samples are different in the person’s location and pose, but share the same person’s identities and background. This is a reasonable assumption since the WiFi device is usually installed in a fixed location. (2) Different layout: We train on 15 spatial layouts and test on 1 unseen spatial layout. The unseen layout is in the classroom scenarios. We evaluate the performance of our algorithm in two aspects: the ability to detect humans (bounding boxes) and the accuracy of the dense pose estimation. To evaluate the performance of our models in detecting humans, we calculate the standard average precision (AP) of person bounding boxes at multiple IOU thresholds ranging from 0.5 to 0.95. In addition, by the MS-COCO [15] definition, we also compute AP-m for medium bodies that are enclosed in bounding boxes with sizes between 32 × 32 and 96 × 96 pixels in a normalized 640 × 480 pixels image space, and AP-l for large bodies that are enclosed in bounding boxes larger than 96 × 96 pixels. To measure the performance of DensePose detection, we follow the original DensePose paper [8]. We first compute Geodesic Point Similarity (GPS) as a matching score for dense correspondences:
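The GPS score as defined in the original DensePose paper [8], written with the symbols explained in this section, is reproduced here from memory and should be read as a reconstruction:

```latex
\begin{equation}
GPS_j = \frac{1}{|P_j|} \sum_{p \in P_j} \exp\left( \frac{-g(i_p, \hat{i}_p)^2}{2\kappa^2} \right)
\end{equation}
```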



where g calculates the geodesic distance, P_j denotes the ground truth point annotations of person j, i_p and î_p are the estimated and ground truth vertex at point p respectively, and κ is a normalizing parameter (set to be 0.255 according to [8]). One issue of GPS is that it does not penalize spurious predictions. Therefore, estimations with all pixels classified as foreground are favored. To alleviate this issue, masked geodesic point similarity (GPSm) was introduced in [8], which incorporates both GPS and segmentation masks. The formulation is as follows:
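A form of GPSm consistent with this description (GPS combined with the overlap of the foreground masks) is given below as a reconstruction, assuming the mask term is the intersection over union of the two masks:

```latex
\begin{equation}
GPSm = \sqrt{GPS \cdot I}, \qquad I = \frac{|M \cap \hat{M}|}{|M \cup \hat{M}|}
\end{equation}
```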



where M and M̂ are the predicted and ground truth foreground segmentation masks. Next, we can calculate DensePose average precision with GPS (denoted as dpAP·GPS) and GPSm (denoted as dpAP·GPSm) as thresholds, following the same logic behind the calculation of bounding box AP.
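Both scores are easy to compute once the geodesic distances and masks are available. The sketch below assumes the geodesic distances g(i_p, î_p) have already been computed on the body surface and are passed in as an array, and it assumes GPSm is the geometric mean of GPS and the mask IoU; the function names are hypothetical.

```python
import numpy as np

def gps(geodesic_distances, kappa=0.255):
    """Geodesic Point Similarity for one person.

    geodesic_distances: distances between estimated and ground truth
    surface points, one per annotated point (precomputed elsewhere).
    """
    d = np.asarray(geodesic_distances, dtype=float)
    return float(np.mean(np.exp(-(d ** 2) / (2 * kappa ** 2))))

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean foreground masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def gpsm(geodesic_distances, pred_mask, gt_mask, kappa=0.255):
    """Masked GPS: assumed here to be sqrt(GPS * IoU)."""
    return float(np.sqrt(gps(geodesic_distances, kappa) * mask_iou(pred_mask, gt_mask)))

# A prediction that marks every pixel as foreground is penalized by the mask term.
gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[:2, :2] = True
all_foreground = np.ones((4, 4), dtype=bool)
perfect_points = np.zeros(10)

print(gps(perfect_points), gpsm(perfect_points, all_foreground, gt_mask))
```

The example shows the motivation for GPSm: perfect point correspondences give a GPS of 1, but labeling the whole image as foreground drops the masked score.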



4.3 Implementation Details


We implemented our approach in PyTorch. We set the training batch size to 16 on a 4-GPU (Titan X) server. We empirically set λ_dp = 0.6, λ_kp = 0.3, λ_tr = 0.1. We used a warmup multi-step learning rate scheduler and set the initial learning rate to 1e−5. The learning rate increases to 1e−3 during the first 2,000 iterations, then decreases to 1/10 of its value every 48,000 iterations. We trained our final model for 145,000 iterations.
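The learning rate schedule described above can be written as a small function. The text does not say how the warmup ramps or from which iteration the 48,000-step decay is counted, so linear warmup and decay boundaries at multiples of 48,000 from iteration 0 are assumptions here, as is the function name.

```python
import math

def learning_rate(iteration, peak_lr=1e-3, start_lr=1e-5,
                  warmup_iters=2000, step=48000, gamma=0.1):
    """Warmup multi-step schedule (linear warmup is an assumption)."""
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        return start_lr + progress * (peak_lr - start_lr)
    # After warmup: multiply by gamma once per completed `step` iterations.
    return peak_lr * gamma ** (iteration // step)

for it in (0, 1000, 2000, 48000, 96000, 144999):
    print(it, learning_rate(it))
```

Under these assumptions the rate is 1e−3 until iteration 48,000, then 1e−4, 1e−5, and finally 1e−6 for the last 1,000 of the 145,000 iterations.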



4.4 WiFi-based DensePose under Same Layout


Under the Same Layout protocol, we compute the AP of human bounding box detections as well as dpAP·GPS and dpAP·GPSm of dense correspondence predictions. Results are presented in Table 1 and Table 2, respectively.


Method   AP     AP@50   AP@75   AP-m   AP-l
WiFi     43.5   87.2    44.6    38.1   46.4

Table 1: Average precision (AP) of WiFi-based DensePose under the Same Layout protocol. All metrics are the higher the better.


From Table 1, we can observe a high value (87.2) of AP@50, indicating that our model can effectively detect the approximate locations of human bounding boxes. The relatively low value (44.6) for AP@75 suggests that the details of the human bodies are not perfectly estimated.