

【404文庫】協和醫學院規培醫生董襲瑩博士論文,中國知網已刪除

CDT編輯註:日前,中日友好醫院醫生肖飛被其妻舉報婚外情,其中一位當事人為規培醫師、協和醫學院博士生董襲瑩。舉報信中還指出,肖飛曾在手術過程中與護士發生衝突,期間不顧患者安危,與董襲瑩一同離開手術室。隨後,董襲瑩的教育背景引發網絡關注。公開信息顯示,董襲瑩本科就讀於美國哥倫比亞大學下屬的女子學院——巴納德學院,主修經濟學,2019年被北京協和醫學院「4+4」臨床醫學長學制試點班錄取為博士研究生。相比中國傳統醫學教育體系通常更長的教學與規培周期,協和「4+4」模式在公平性與專業性方面引發質疑。與此同時,有網民指出,董襲瑩的博士論文與北京科技大學的一項發明專利存在多處雷同,涉嫌學術不端。事件持續發酵後,中國知網已將其博士論文下架,而這篇論文的下架流程是否遵循既定的撤稿標準亦受到質疑。有網友在此之前保存了論文PDF,並上傳至GitHub。以下文字和圖片由CDT通過PDF版轉錄存檔。

北京協和醫學院臨床醫學專業畢業論文

學校代碼:10023

學號: B2019012012

跨模態圖像融合技術在醫療影像分析中的研究

專業年級:北京協和醫學院臨床醫學專業2019級試點班

姓名:董襲瑩

導師:邱貴興(教授)

北京協和醫學院臨床學院(北京協和醫院)

骨科

完成日期:2023年5月

目錄

摘要
Abstract
基於特徵匹配的跨模態圖像融合的宮頸癌病變區域檢測
1.1. 前言
1.2. 研究方法
1.2.1. 研究設計和工作流程
1.2.2. 跨模態圖像融合
1.2.3. 宮頸癌病變區域檢測
1.3. 實驗
1.3.1. 臨床信息和影像數據集
1.3.2. 模型訓練過程
1.3.3. 評價指標
1.3.4. 目標檢測模型的結果與分析
1.4. 討論
1.5. 結論
基於特徵轉換的跨模態數據融合的乳腺癌骨轉移的診斷
2.1. 前言
2.2. 研究方法
2.2.1. 研究設計和工作流程
2.2.2. 骨轉移目標區域檢測
2.2.3. 基於特徵轉換的跨模態數據融合
2.2.4. 乳腺癌骨轉移的分類模型
2.3. 實驗
2.3.1. 臨床信息和影像數據集
2.3.2. 模型訓練過程
2.3.3. 評價指標
2.3.4. 單模態骨轉移灶檢測模型及基於特徵轉換的跨模態分類模型的結果與分析
2.4. 討論
2.5. 結論
全文小結
參考文獻
縮略詞表
文獻綜述
跨模態深度學習技術在臨床影像中的應用
3.1. Preface
3.2. Deep Neural Network (DNN)
3.2.1. Supervised learning
3.2.2. Backpropagation
3.2.3. Convolutional neural networks (CNN)
3.3. Cross-modal fusion
3.3.1. Cross-modal fusion methods
3.3.2. Cross-modal image translation
3.4. The application of cross-modal deep learning
3.5. Conclusion
參考文獻
致謝
獨創性聲明
學位論文版權使用授權書

摘要

背景

影像學檢查是醫療領域最常用的篩查手段,據統計,醫療數據總量中有超過90%是由影像數據構成[1]。然而,根據親身參與的臨床病例[2]可知,很多情況下,僅憑醫生的肉眼觀察和主觀診斷經驗,不足以對影像學異常作一明確判斷。而診斷不明引起的頻繁就醫、貽誤病情,則會嚴重影響患者的生活質量。

相較於傳統的主觀閱片,人工智能技術通過深度神經網絡分析大量影像和診斷數據,學習對病理診斷有用的特徵,在客觀數據的支持下做出更準確的判斷。為了模擬臨床醫生結合各種成像模式(如 CT、MRI和 PET)形成診斷的過程,本項目採用跨模態深度學習方法,將各種影像學模態特徵進行有機結合,充分利用其各自的獨特優勢訓練深度神經網絡,以提高模型性能。鑑於腫瘤相關的影像學資料相對豐富,本項目以宮頸癌和乳腺癌骨轉移為例,測試了跨模態深度學習方法在病變區域定位和輔助診斷方面的性能,以解決臨床實際問題。

方法

第一部分回顧性納入了220例有FDG-PET/CT數據的宮頸癌患者,共計72,602張切片圖像。應用多種圖像預處理策略對PET和CT圖像進行圖像增強,並進行感興趣區域邊緣檢測、自適應定位和跨模態圖像對齊。將對齊後的圖像在通道上級聯輸入目標檢測網絡進行檢測、分析及結果評估。通過與使用單一模態圖像及其他 PET-CT融合方法進行比較,驗證本項目提出的 PET-CT自適應區域特徵融合結果在提高模型目標檢測性能方面具有顯著性優勢。第二部分回顧性納入了233例乳腺癌患者,每例樣本包含 CT、MRI、或 PET一至三種模態的全身影像數據,共有3051張CT切片,3543張MRI切片,1818張PET切片。首先訓練YOLOv5對每種單一模態圖像中的骨轉移病灶進行目標檢測。根據檢測框的置信度劃分八個區間,統計每個影像序列不同置信度區間中含有檢出骨轉移病灶的個數,並以此歸一化後作為結構化醫療特徵數據,採用級聯方式融合三種模態的結構化特徵實現跨模態特徵融合。再用多種分類模型對結構化數據進行分類和評估。將基於特徵轉換的跨模態融合數據與特徵轉換後的單模態結構化數據,以及基於 C3D分類模型的前融合方式進行比較,驗證第二部分提出的方法在乳腺癌骨轉移診斷任務中的優越性能。

結果

第一部分的基於跨模態融合的腫瘤檢測實驗證明,PET-CT自適應區域特徵融合圖像顯著提高了宮頸癌病變區域檢測的準確性。相比使用CT或PET單模態圖像以及其他融合方法生成的多模態圖像作為網絡輸入,目標檢測的平均精確度分別提高了6.06%和8.9%,且消除了一些假陽性結果。上述測試結果在使用不同的目標檢測模型的情況下保持一致,這表明自適應跨模態融合方法有良好的通用性,可以泛化應用於各種目標檢測模型的預處理階段。第二部分基於特徵轉換的跨模態病例分類實驗證明,跨模態融合數據顯著提高了乳腺癌骨轉移診斷任務的性能。相較於單模態數據,跨模態融合數據的平均準確率和AUC分別提高了7.9%和8.5%,觀察 ROC曲線和 PR曲線的形狀和面積也具有相同的實驗結論:在不同的分類模型中,使用基於特徵轉換的跨模態數據,相比單模態數據,對於骨轉移病例的分類性能更為優越。而相較於基於 C3D的前融合分類模型,基於特徵轉換的後融合策略在分類任務方面的性能更優。

結論

本項目主要包含兩個部分。第一部分證實了基於區域特徵匹配的跨模態圖像融合後的數據集在檢測性能上優於單模態醫學圖像數據集和其他融合方法。第二部分提出了一種基於特徵轉換的跨模態數據融合方法。使用融合後的數據進行分類任務,其分類性能優於僅使用單模態數據進行分類或使用前融合方法的性能。根據不同模態醫學圖像的特徵差異與互補性,本項目驗證了跨模態深度學習技術在病變區域定位和輔助診斷方面的優勢。相比於只使用單模態數據進行訓練的模型,跨模態深度學習技術有更優的診斷準確率,可以有效地成為臨床輔助工具,協助和指導臨床決策。

關鍵詞:跨模態融合,深度學習,影像分析,宮頸癌,乳腺癌骨轉移

Abstract

Background

Imaging examinations serve as the predominant screening method in the medical field. As statistics reveal, imaging data constitute over 90% of the entire medical dataset. Nonetheless, clinical cases have demonstrated that mere subjective diagnoses by clinicians often fall short in making definitive judgments on imaging anomalies. Misdiagnoses or undiagnosed conditions, which result in frequent hospital visits and delayed treatment, can profoundly affect patients' quality of life.

Compared to the traditional subjective image interpretation by clinicians, AI leverages deep neural networks to analyze large-scale imaging and diagnostic data, extracting valuable features for pathology diagnosis, and thus facilitating more accurate decision-making, underpinned by objective data. To emulate clinicians' diagnostic process that integrates various imaging modalities like CT, MRI, and PET, a cross-modal deep learning methodology is employed. This approach synergistically merges features from different imaging modalities, capitalizing on their unique advantages to enhance model performance.

Given the ample availability of oncologic imaging data, the project exemplifies the efficacy of this approach in cervical cancer segmentation and detection of breast cancer bone metastasis, thereby addressing pragmatic challenges in clinical practice.

Methods

The first part retrospectively analyzed 72,602 slices of FDG-PET/CT scans from 220 cervical cancer patients. Various preprocessing strategies were applied to enhance PET and CT images, including edge detection, adaptive ROI localization, and cross-modal image fusion. The fused images were then concatenated on a channel-wise basis and fed into the object detection network for the precise segmentation of cervical cancer lesions. Compared to single modality images (either CT or PET) and alternative PET-CT fusion techniques, the proposed method of PET-CT adaptive fusion was found to significantly enhance the object detection performance of the model. The second part of the study retrospectively analyzed 3,051 CT slices, 3,543 MRI slices and 1,818 PET slices from 233 breast cancer patients, with each case containing whole-body imaging of one to three modalities (CT, MRI, or PET). Initially, YOLOv5 was trained to detect bone metastases in images across different modalities. The confidence levels of the prediction boxes were segregated into eight tiers, following which the number of boxes predicting bone metastases in each imaging sequence was tallied within each confidence tier. This count was then normalized and utilized as a structured feature. The structured features from the three modalities were fused in a cascaded manner for cross-modal fusion. Subsequently, a variety of classification models were employed to evaluate the structured features for diagnosing bone metastasis. In comparison to feature-transformed single-modal data and the C3D early fusion method, the cross-modal fusion data founded on feature transformation demonstrated superior performance in diagnosing breast cancer bone metastasis.

Results

The first part of our study delivered compelling experimental results, showing a significant improvement in the accuracy of cervical cancer segmentation when using adaptively fused PET-CT images. Our approach outperformed other object detection algorithms based on either single-modal images or multimodal images fused by other methods, with an average accuracy improvement of 6.06% and 8.9%, respectively, while also effectively mitigating false-positive results. These promising test results remained consistent across different object detection models, highlighting the robustness and universality of our adaptive fusion method, which can be generalized in the preprocessing stage of diverse object detection models. The second part of our study demonstrated that cross-modal fusion based on feature transformation could significantly improve the performance of bone metastasis classification models. When compared to algorithms employing single-modal data, models based on cross-modal data had an average increase in accuracy and AUC of 7.9% and 8.5%, respectively. This improvement was further corroborated by the shapes of the ROC and PR curves. Across a range of classification models, feature-transformed cross-modal data consistently outperformed single-modal data in diagnosing breast cancer bone metastasis. Moreover, late fusion strategies grounded in feature transformation exhibited superior performance in classification tasks when juxtaposed with early fusion methods such as C3D.

Conclusions

This project primarily consists of two parts. The first part substantiates that deep learning object detection networks founded on the adaptive cross-modal image fusion method outperform those based on single-modal images or alternative fusion methods. The second part presents a cross-modal fusion approach based on feature transformation. When the fused features are deployed for classification models, they outperform those utilizing solely single-modal data or the early fusion model. In light of the differences and complementarity in the features of various image modalities, this project underscores the strengths of cross-modal deep learning in lesion segmentation and disease classification. When compared to models trained only on single-modal data, cross-modal deep learning offers superior diagnostic accuracy, thereby serving as an effective tool to assist in clinical decision-making.

Keywords: cross-modal fusion, deep learning, image analysis, cervical cancer, breast cancer bone metastasis

1.基於特徵匹配的跨模態圖像融合的宮頸癌病變區域檢測

1.1.前言

宮頸癌是女性群體中發病率第四位的癌症,每年影響全球近50萬女性的生命健康[3]。準確和及時的識別宮頸癌至關重要,是否能對其進行早期識別決定了治療方案的選擇及預後情況[4]。氟代脫氧葡萄糖正電子發射計算機斷層顯像/電子計算機斷層掃描(fluorodeoxyglucose-positron emission tomography/computed tomography, FDG-PET/CT),因其優越的敏感性和特異性,成為了一個重要的宮頸癌檢測方式[5]。由於CT能夠清晰地顯示解剖結構,FDG-PET能夠很好地反映局灶的代謝信息形成功能影像,FDG-PET/CT融合圖像對可疑宮頸癌病灶的顯示比單獨使用高解像度 CT更準確,特別是在檢測區域淋巴結受累和盆腔外病變擴展方面[6],[7],[8]。然而,用傳統方法為單一患者的 FDG-PET/CT數據進行分析需要閱讀數百幅影像,對病變區域進行鑑別分析,這一極為耗時的過程已經妨礙了臨床醫生對子宮頸癌的臨床診斷。

隨着計算機硬件和算法的進步,尤其是以深度學習[9]、圖像處理技術[10],[11]為代表的機器學習技術的革新,這些人工智能算法在臨床醫學的許多領域中起着關鍵作用[12]。基於其強大有效的特徵提取能力[13],[14],深度學習中的卷積神經網絡可以通過梯度下降自動學習圖像中的主要特徵[15],極大地提高目標識別的準確性[16],使深度學習成為計算機圖像處理領域的主流技術[17],[18]。利用深度學習技術對宮頸癌影像進行分析可以輔助臨床醫生做出更為準確的判斷,減輕臨床醫生的工作負擔,提高診斷的準確性[19]。

目前已經有很多在單一模態圖像中(CT或 PET)基於深度學習技術進行病變檢測的工作:Seung等使用機器學習技術依據PET圖像預測肺癌組織學亞型[20];Sasank進行了基於深度學習算法檢測頭 CT中關鍵信息的回顧性研究[21];Chen使用隨機遊走(random walk)和深度神經網絡對 CT圖像中的肺部病變進行分割[22],[23]。但很少有關於使用跨模態圖像融合深度學習方法進行病變檢測的研究。

基於 PET/CT融合圖像的病變檢測項目包括三個研究任務:區域特徵匹配[24],跨模態圖像融合[25]和目標病變區域檢測[26]。Mattes使用互信息作為相似性標準,提出了一種三維PET向胸部CT配準的區域特徵匹配算法[27]。Maqsood提出了一種基於雙尺度圖像分解和稀疏表示的跨模態圖像融合方案[28]。Elakkiya利用更快的基於區域的卷積神經網絡(Faster Region-Based Convolutional Neural Network, FR-CNN)進行頸部斑點的檢測[29]。目前還沒有將上述三個研究任務,即區域特徵匹配、跨模態圖像融合、病變區域檢測任務,結合起來的研究工作。

為了減輕臨床醫生的工作負擔,基於跨模態深度學習方法,本項目的第一部分提出了一個統一的多模態圖像融合和目標檢測框架,用於宮頸癌的檢測。

1.2.研究方法

1.2.1.研究設計和工作流程

本項目旨在檢測 CT和 PET圖像中宮頸癌的病變區域,工作流程如圖1-1所示:掃描設備對每位患者進行PET和CT圖像序列的採集;通過區域特徵匹配和圖像融合來合成清晰且信息豐富的跨模態圖像融合結果;採用基於深度學習的目標檢測方法在融合圖像中對可疑宮頸癌的病變區域進行目標檢測。在圖1-1的最後一行中,矩形框出的黃色區域及圖中右上角放大的區域中展示了檢測出的宮頸癌病變區域。

圖1-1工作流程

img

1.2.2.跨模態圖像融合

圖1-2展示了跨模態圖像融合算法的流程圖。根據計算發現兩種模態圖像的比例和位置不同,如僅進行簡單的融合會錯誤地將處於不同位置的組織影像重疊,從而使組織發生錯位,定位不准,產生不可接受的誤差。因此,第一部分提出了一種跨模態圖像融合策略,其中的步驟包括對感興趣區域(region of interest, ROI)的自適應定位和圖像融合。

在PET和CT圖像中,自適應ROI定位能夠精準識別待分析處理的關鍵目標,即人體組織影像,然後計算不同模態圖像下組織影像之間的比例和位移。依據上述計算結果通過縮放、填充和裁剪的方式來融合 PET和CT圖像。

圖1-2 CT和PET跨模態圖像融合算法的流程圖

img

1.2.2.1.自適應ROI定位

鑑於數據集中 PET圖像與 CT圖像的黑色背景均為零像素值填充,ROI內非零像素值較多,而 ROI邊緣的非零像素值較少,因此,選用線檢測方法來標畫兩種模態圖像中的 ROI,最終標劃結果如圖1-2中的綠色線框出的部分所示,這四條線是 ROI在四個方向上的邊界。在不同方向上計算比例尺。在將 PET圖像放大後,根據ROI實現CT和PET圖像的像素級對齊。裁剪掉多餘的區域,並用零像素值來補充空白區域。如圖1-3(a)所示,線檢測方法從中心點出發,向四個方向即上下左右對非零像素值進行遍歷,並記錄下行或列上的非零像素值的數量。如圖1-3(b)所示,紅色箭頭代表遍歷的方向。在從 ROI中心向邊緣進行遍歷時,沿遍歷經線上的非零像素值數量逐漸減少,如果某一線上非零像素值的計數低於預設的閾值,那麼意味着該線已經觸及到 ROI的邊緣,如圖1-3(c)所示。然而,如果直接對未經預處理的圖像應用線檢測方法,會因受模糊邊緣及其噪聲的影響,得到較差的對齊結果,難以設置閾值。因此,需對PET和CT圖像單獨執行圖像增強預處理,以優化 ROI標化結果,改善跨模態融合效果。
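下面給出按上述線檢測方法重構的自適應ROI定位示意代碼(基於NumPy的示意實現,並非論文原始代碼;函數名與閾值 `min_count` 均為示例假設):

```python
import numpy as np

def locate_roi(image: np.ndarray, min_count: int = 5) -> tuple:
    """按正文描述的線檢測方法定位ROI:從圖像中心向上下左右遍歷,
    統計每行/每列的非零像素數,低於閾值即視為觸及ROI邊緣。"""
    h, w = image.shape
    cy, cx = h // 2, w // 2
    binary = (image > 0).astype(np.uint8)

    # 向上、向下遍歷:統計每一行的非零像素數
    top, bottom = 0, h - 1
    for r in range(cy, -1, -1):
        if binary[r, :].sum() < min_count:
            top = r
            break
    for r in range(cy, h):
        if binary[r, :].sum() < min_count:
            bottom = r
            break

    # 向左、向右遍歷:統計每一列的非零像素數
    left, right = 0, w - 1
    for c in range(cx, -1, -1):
        if binary[:, c].sum() < min_count:
            left = c
            break
    for c in range(cx, w):
        if binary[:, c].sum() < min_count:
            right = c
            break

    return top, bottom, left, right  # ROI在四個方向上的邊界
```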

由於PET和CT圖像具有不同的紋理特徵,應用不同的預處理策略,分別對圖像進行增強處理,以強化 ROI的邊緣特性,同時消除噪聲產生的干擾,再在兩種不同模態圖像中進行 ROI定位,如圖1-2所示。

圖1-3 ROI檢測示意圖

img

CT圖像是用 X射線對檢查部位一定厚度的組織層進行掃描,由探測器接收透過該層面的 X射線,經數字轉換器及計算機處理後形成的解剖學圖像。CT圖像通常比 PET圖像更清晰。為了提取 ROI,需利用圖像增強技術對 CT圖像進行預處理:首先,通過圖像銳化增強邊緣特徵和灰度跳變部分,使 CT圖像的邊緣(即灰度值突變區域)信息更加突出;由於銳化可能導致一定的噪聲,再使用高斯模糊濾波器(Gaussian blur)[30]進行圖像平滑去噪,將噪聲所在像素點處理為周圍相鄰像素值的加權平均近似值,消除影響成像質量的邊緣毛躁;並執行Canny邊緣檢測(Canny edge detection)[31]來設定閾值並連接邊緣,從而在圖像中提取目標對象的邊緣。儘管Canny邊緣檢測算法已包含高斯模糊的去噪操作,但實驗證實兩次高斯模糊後的邊緣提取效果更優。在對圖像進行銳化處理後,將提取的邊緣圖像與高斯模糊後的圖像進行疊加。具體地,對兩個圖像中的每個像素直接進行像素值相加,最終得到邊緣更清晰且減輕噪聲影響的增強後 CT模態圖像。
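以下是按上述流程(銳化、高斯模糊去噪、Canny邊緣檢測、邊緣圖與模糊圖疊加)整理的CT圖像增強示意代碼,基於OpenCV;銳化核與Canny閾值等具體參數論文未給出,此處為示例假設:

```python
import cv2
import numpy as np

def enhance_ct(ct: np.ndarray) -> np.ndarray:
    """CT圖像增強預處理:銳化、平滑去噪、邊緣提取與疊加(示意)。"""
    # 1. 拉普拉斯型銳化核,突出灰度跳變的邊緣信息
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    sharpened = cv2.filter2D(ct, -1, kernel)

    # 2. 高斯模糊,抑制銳化引入的噪聲
    blurred = cv2.GaussianBlur(sharpened, (5, 5), 0)

    # 3. Canny邊緣檢測(其內部也含一次高斯平滑,即兩次模糊後再提取邊緣)
    edges = cv2.Canny(blurred.astype(np.uint8), 50, 150)

    # 4. 邊緣圖與模糊後的圖像逐像素相加,得到增強後的CT圖像
    enhanced = cv2.add(blurred.astype(np.uint8), edges)
    return enhanced
```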

PET圖像是基於間接探測到的由正電子放射性核素發射的γ射線,經過計算機進行散射和隨機信息的校正,形成的影像,能夠顯示體內代謝活動的信息。儘管PET可以顯示分子代謝水平,但由於成像原理的差異,PET圖像相較於 CT圖像顯得模糊。對PET的預處理方式與對CT圖像的類似,但省略了高斯模糊處理圖像噪聲的步驟,因為在銳化 PET模態圖像後產生的噪聲較少,為防止有效特徵信息的丟失,略過這一環節。

為了將兩個模態的圖像進行區域特徵匹配,使用PET和CT圖像中的矩形ROI框來計算縮放比例和位移參數,並通過縮放、填充和裁剪操作對PET和CT圖像中的ROI進行對齊。

1.2.2.2.圖像融合

CT和 PET圖像的尺寸分別為512×512像素和128×128像素,ROI特徵區域位於圖像的中心位置。通過縮放、零值填充和剪切,放大PET圖像的尺寸以與CT圖像的尺寸保持一致,並且將兩個模態圖像之間的 ROI對齊,以便後續的融合處理。經處理的PET和CT圖像轉化為灰度形式,分別進行加權和圖像疊加,將其置於不同通道中,作為網絡的輸入層。由於 PET圖像能展示體內分子層面的代謝水平,其對於腫瘤檢測的敏感性高於CT圖像。因此,本研究的圖像融合方法為PET圖像的ROI分配了更多權重,以提高宮頸癌檢測任務的表現。
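下面按上文描述給出PET-CT自適應對齊與加權融合的示意代碼(縮放、零值填充、裁剪與加權疊加;其中PET權重 `pet_weight` 為假設值,僅示意為PET的ROI分配更高權重,對齊細節亦作了簡化):

```python
import cv2
import numpy as np

def align_and_fuse(ct: np.ndarray, pet: np.ndarray,
                   ct_roi: tuple, pet_roi: tuple,
                   pet_weight: float = 0.6) -> np.ndarray:
    """根據兩個模態的ROI邊界(top, bottom, left, right)計算縮放比例和位移,
    對齊後按權重疊加,輸出與CT同尺寸的融合灰度圖(示意)。"""
    t_c, b_c, l_c, r_c = ct_roi
    t_p, b_p, l_p, r_p = pet_roi

    # 不同方向上的縮放比例:使PET的ROI與CT的ROI大小一致
    scale_y = (b_c - t_c) / max(b_p - t_p, 1)
    scale_x = (r_c - l_c) / max(r_p - l_p, 1)
    pet_resized = cv2.resize(pet, None, fx=scale_x, fy=scale_y,
                             interpolation=cv2.INTER_LINEAR)

    # 零值填充到CT尺寸,並按ROI位移放置,使兩模態ROI大致對齊(簡化處理)
    h, w = ct.shape
    canvas = np.zeros((h, w), dtype=np.float32)
    dy = t_c - int(t_p * scale_y)
    dx = l_c - int(l_p * scale_x)
    y0, x0 = max(dy, 0), max(dx, 0)
    y1 = min(y0 + pet_resized.shape[0], h)
    x1 = min(x0 + pet_resized.shape[1], w)
    canvas[y0:y1, x0:x1] = pet_resized[:y1 - y0, :x1 - x0]

    # 加權疊加:PET(代謝信息)權重更高
    fused = pet_weight * canvas + (1 - pet_weight) * ct.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```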

圖1-4比較了本項目提出的自適應圖像融合的結果和直接融合的結果,選取人體不同部位的 CT、PET和 PET/CT圖像的融合結果進行展示。第一和第二列分別展示了未經處理的原始 CT和 PET圖像。簡單融合算法僅將兩個圖像的像素點相加,並未執行特徵匹配過程,得到的融合圖像無任何實用價值。由於通道拼接融合後的圖像轉變為高維多模態數據,而非三通道數字圖像,因此圖1-4並未展示通道拼接融合方法所得圖像。而本項目提出的自適應圖像融合方法實現了跨模態圖像的精準融合,可用於進一步的觀察和計算。

圖1-4不同圖像融合方式的可視化結果

img

1.2.3.宮頸癌病變區域檢測

先由兩位臨床醫生對跨模態融合圖像中的病變區域進行人工標註,並訓練YOLOv5[32]目標檢測網絡來識別融合圖像中的病灶區域,如圖1-5所示。模塊骨架用於提取圖像的深層特徵,為減少通過切片操作進行採樣過程中的信息損失,採用聚焦結構,並使用跨階段局部網絡(cross-stage partial network, CSPNet)[33]來減少模型在推理中所需的計算量。頭模塊用於執行分類和回歸任務,採用特徵金字塔網絡(feature pyramid network, FPN)和路徑聚合網絡(path aggregation network, PAN)[34]。

為了提高對極小目標區域的檢測效果,輸入層採用了mosaic數據增強(mosaic data augmentation)[35]方法,將四個隨機縮放、剪切和隨機排列的圖像拼接在一起。模塊骨架包括 CSPNet和空間金字塔池化(spatial pyramid pooling, SPP)[36]操作。輸入圖像通過三個 CSP操作和一個 SPP操作,生成了一個四倍於原始大小的特徵圖。頭模塊有三個分支網絡,分別接收來自不同層的融合特徵、輸出各層的邊界框回歸值和目標類別,最後由頭模塊合併分支網絡的預測結果。

圖1-5目標檢測網絡結構

img

1.3.實驗

1.3.1.臨床信息和影像數據集

本項目選取符合以下條件的患者開展研究:1)於2010年1月至2018年12月期間在國家癌症中心中國醫學科學院腫瘤醫院被診斷為原發性宮頸癌的患者;2)有FDG-PET/CT圖像;3)有電子病歷記錄。總共入組了220名患者,共計72,602張切片圖像,平均每位患者有330張切片圖像入組實驗。其中,CT切片圖像的高度和寬度均為512像素,而PET切片圖像的高度和寬度均為128像素,每個模態的數據集都包含了6,378張切片圖像,即平均每位患者有29張切片圖像,用於訓練和測試。在入組進行分析之前,所有患者數據都已去標識化。本研究已獲得北京協和醫學院國家癌症中心倫理委員會的批准。

該數據集包含220個患者的全身 CT和全身 PET圖像數據,因入組的每位患者均確診為宮頸癌,數據集中各例數據均包含病變區域,如表1-1所示。鑑於所有患者的CT和PET均在同一時間且使用相同設備採集,因此 CT和PET展示的解剖信息與代謝信息來自同一時刻患者身體的同一區域,其特徵具有一對一對應且可匹配的特性。根據腫瘤大小、浸潤深度、盆腔臨近組織侵犯程度、腹盆腔淋巴結轉移的情況可將宮頸癌的進展程度進行分期,主要包括四期,每期中又進一步細分為更具體的期別。國際婦產科聯盟(International Federation of Gynecology and Obstetrics, FIGO)於2018年10月更新了宮頸癌分期系統的最新版本[37]。本項目數據集囊括了 FIGO分期全部四個期別的宮頸癌影像。為了保持訓練和測試的公平性,納入訓練集和測試集的不同期別影像的分佈,即不同 FIGO分期的劃分比例,需保持一致,否則可能會導致某些 FIGO期別的數據集無法進行訓練或測試。因此,在保證處於不同期別的患者數據的劃分比例的基礎上,採用五折交叉驗證方法將220名患者的數據進行五等分,每個部分大約包括了45例患者的數據,在每輪驗證中隨機選擇一個部分作為測試集。所有模型都需要進行5次訓練和評估,以獲取在測試集上表現出的性能的平均值和標準差。

表1-1數據集中的病例數及臨床分期

img

1.3.2.模型訓練過程

在按上述步驟準備好數據集後,首先將圖像從512×512像素調整為1024×1024像素,然後使用多種數據增強方法,包括 mosaic增強[38]、HSV(Hue, Saturation, Value)顏色空間增強[39]、隨機圖像平移、隨機圖像縮放和隨機圖像翻轉,增加輸入數據集對噪聲的魯棒性。在每次卷積後和激活函數前進行批歸一化(Batch Normalization, BN)[40]。所有隱藏層都採用 Sigmoid加權線性單元(Sigmoid-Weighted Linear Units, SiLU)[41]作為激活函數。訓練模型所用的學習率設置為1e-5,並在起始訓練時選擇較小的學習率,然後在5個輪次(epoch)後使用預設的學習率。每個模型使用PyTorch框架在4個Nvidia Tesla V100-SXM2 32G GPU上進行50個輪次的訓練。使用0.98的動量和0.01的權值衰減通過隨機梯度下降法(Stochastic Gradient Descend, SGD)來優化各網絡層的權重目標函數。在訓練過程中,網絡在驗證集上達到最小的損失時,選擇最佳參數。所有實驗中的性能測量都是在採用最優參數設置的模型中對測試集進行測試得到的,詳見表1-2。
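依據上文給出的超參數(學習率1e-5、動量0.98、權值衰減0.01、前5個輪次使用較小學習率、共50個輪次),可寫出如下PyTorch訓練配置示意;模型與數據加載部分從略,預熱的具體形式論文未給出,此處按線性預熱假設:

```python
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module):
    """按正文描述的超參數構建SGD優化器與學習率預熱策略(示意)。"""
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=1e-5,            # 預設學習率
        momentum=0.98,      # 動量
        weight_decay=0.01,  # 權值衰減
    )

    warmup_epochs, total_epochs = 5, 50

    def lr_lambda(epoch: int) -> float:
        # 前5個輪次線性預熱(假設),之後使用預設學習率
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler, total_epochs
```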

為了進一步證明本項目所提出的模型的普適性,選擇了六個基於深度學習的目標檢測模型作為基準,並測試了所有模型在輸入不同的圖像融合結果時的性能。每個模型的輸入完全相同,而唯一的區別是神經網絡中的超參數來自每個模型的官方設置,而這些超參數因模型而異。

表1-2網絡訓練的超參數

img

1.3.3.評價指標

本項目使用「平均精度50」(Average Precision 50, AP50)來評估目標檢測的性能。AP50是當交並比(Intersection over Union, IOU)閾值為0.5時的平均精度,如公式3所定義,其中P和R分別是精度(Precision)和召回率(Recall)的縮寫。模型的預測結果會有不同的召回率和精度值,這取決於置信度閾值。將召回率作為橫軸,精度作為縱軸,可以繪製 PR曲線,而 AP是該曲線下的面積。IOU是將真實標註區域和模型預測區域的重疊部分除以兩區域的合併部分(即真實區域和預測區域的並集)得到的結果,如公式4所示。精度和召回率的計算方式分別在公式1和2中列出,其中真正例表示預測為正例的正樣本,假正例和假負例代表的概念以此類推。精度表明在模型預測結果里,被判斷為正例的樣本中有多少實際是正例,而召回率表示實際為正例的樣本中多少被預測為正例。表1-3記錄了圖像數據集交叉驗證後各個目標檢測模型的 AP50的平均值和方差。

img
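原文此處的公式以圖片形式存檔。根據正文對精度、召回率、AP與IOU的文字描述,公式1–4可重構如下(重構自文字描述,非原始排版):

精度(公式1):$P = \dfrac{TP}{TP + FP}$

召回率(公式2):$R = \dfrac{TP}{TP + FN}$

平均精度(公式3):$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$,AP50即IOU閾值取0.5時的AP

交並比(公式4):$IOU = \dfrac{|A_{pred} \cap A_{gt}|}{|A_{pred} \cup A_{gt}|}$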

1.3.4.目標檢測模型的結果與分析

本項目採用不同的目標檢測模型,包括單階段目標檢測模型(YOLOv5[32]、RetinaNet[42]、ATSS[43])和二階段目標檢測模型(Faster-RCNN[44]、CascadeRCNN[45]、LibraRCNN[46]),在五折交叉驗證下比較了使用 CT圖像、PET圖像、PET-CT簡單融合圖像、PET-CT通道拼接融合圖像(concat fusion)和本項目所提出的 PET-CT自適應區域特徵融合圖像作為輸入數據集時,每個模型的目標檢測性能。其中,CT和PET是單模態圖像,而PET-CT簡單融合圖像、PET-CT通道拼接圖像和PET-CT自適應區域特徵融合圖像是跨模態融合圖像。簡單融合是指將 PET圖像簡單地縮放到與 CT圖像相同的大小後進行像素值的疊加,而通道拼接融合是直接將兩種模態圖像在通道上串聯在一起作為網絡的輸入。

如表1-3所示,加粗的數字代表每行中最好的實驗結果。與使用單一模態數據進行腫瘤檢測模型分析(如只使用CT或PET圖像)相比,本項目所提出的自適應跨模態圖像融合方法在目標檢測任務中展現出了更高的檢測精度。由於自適應融合方法能夠在跨模態融合之前將兩種模態圖像的信息進行預對齊,對 CT圖像和PET圖像的結構特徵進行一一配准,因此,與簡單融合方法和通道拼接融合方法相比,自適應融合方法的性能最佳。上述針對不同模態圖像及使用不同跨模態融合方法作為輸入得到的測試性能結果在使用不同的目標檢測模型的情況下保持一致,這表明本項目所提出的跨模態自適應融合方法有良好的通用性,可以泛化應用到各種目標檢測模型的預處理中。

表1-3五折交叉驗證目標檢測實驗的結果(「*」表示交叉驗證中的某一折在訓練過程中出現梯度爆炸,數值為目標檢測模型的 AP50的均值和方差)

img

圖1-6將不同模態圖像下目標宮頸癌病變區域的檢測結果和實際標註的癌灶區域進行了可視化。其中綠色框是由醫師標註的真實病變區域,黃色框是目標檢測模型的預測結果。分析圖像模態信息可知,CT圖像既包含了人體正常結構的信息,也包含病灶的解剖信息,前者可能會干擾宮頸癌病變區域特徵的識別和檢測。因此,在單一 CT模態下會有一些漏檢。與 CT模態的預測框相比,PET模態下的預測框與標註框的 IOU更高,或許是由於 PET影像有更多能表現宮頸癌區域特徵的信息。在 PET-CT區域特徵跨模態融合圖像中檢測效果最佳,因為 PET-CT融合圖像融合了兩種模態的不同特徵,從而大大提高檢測的準確性。

圖1-6跨模態融合圖像的目標檢測結果

img

1.4.討論

本項目旨在評估深度學習算法是否可以跨模態融合 FDG-PET和 CT圖像,並在融合圖像中實現宮頸癌病灶區域的自動檢測。我們提出了一個基於跨模態融合圖像的檢測框架,包括區域特徵匹配、圖像融合和目標檢測等步驟。融合 CT和PET圖像可以最大程度地提取各個模態中包含的信息,因此 PET-CT跨模態融合圖像含有豐富的解剖和功能信息。目標檢測實驗證明,本項目提出的跨模態融合方法得到的融合圖像顯著提高了目標檢測的準確性,相比單模態和其他融合方法得到的多模態圖像,目標檢測平均精確度分別提高了6.06%和8.9%。

表1-3展示了基於不同的圖像融合方法形成的多模態圖像,不同檢測模型在五折交叉驗證下的結果。因在解剖和功能影像中均有異常表現的區域更可能是癌變,我們推測,圖像信息對齊有利於對宮頸癌病灶的目標區域檢測。圖1-6展示了在不同目標檢測模型和不同輸入圖像數據模態下目標檢測效果的可視化圖像。基於本項目提出的跨模態融合方法生成的圖像進行的目標檢測的檢測結果更為準確,並消除了一些假陽性結果。根據醫生的日常診斷習慣,生成了以紅色和黃色為主色的融合圖像。

利用 FDG-PET/CT對宮頸癌進行及時、準確的分期能夠影響患者的臨床治療決策,進而延緩疾病進展,並減少腫瘤治療相關的整體財務負擔[47]。對 FDG-PET/CT圖像的解釋在很大程度上依賴臨床上獲得的背景信息,並需要綜合臨床分析來確定是否發生癌症的浸潤和轉移[48]。在某些情況下,核醫學科閱片醫師可以迅速識別局部擴展和淋巴栓塞。而多數情況下,核醫學科醫師分析一位患者的FDG-PET/CT影像學檢查結果平均需要三個小時。比起佔用醫師昂貴且稀缺的時間,利用計算機進行此項工作既能節約成本,預計耗時又短,且可以全天候運行。本項目的目標是通過人工智能方法實現PET和CT圖像的自動融合,並利用目標檢測技術識別宮頸癌的浸潤和轉移,作為輔助工具加速 FDG-PET/CT的閱片過程,從而使臨床醫生能夠在最短的時間內按照 FIGO指南對宮頸癌進行分期。

這項研究仍存在一些局限性。雖然本項目對基於 PET-CT自適應融合圖像的目標檢測方法與其他最先進的基於深度學習的目標檢測方法進行了比較,但將該方法拓展應用到其他病種的影像學分析的可行性仍需評估。此外,我們提出的跨模態融合框架在圖像融合時並未考慮每種模態圖像的權重分佈。未來可以設計一種特殊的損失函數來調整 ROI內每個像素的權重分佈,以提高目標檢測結果的準確性。

1.5.結論

本項目提出了一種基於跨模態圖像融合的多模態圖像進行病變區域檢測的深度學習框架,用於宮頸癌的檢測。為了應對醫學影像中單一模態圖像在腫瘤檢測方面的性能不足,提出了一種基於區域特徵匹配的自適應跨模態圖像融合策略,將融合後的多模態醫學圖像輸入深度學習目標檢測模型完成宮頸癌病變區域檢測任務,並討論了深度學習模型在每種模態圖像輸入間的性能差異。大量的實驗證明,與使用單一模態的影像及基於簡單融合方法或通道拼接融合方法的多模態影像相比,自適應融合後的多模態醫學圖像更有助於宮頸癌病變區域的檢測。

本項目所提出的技術可實現 PET和CT圖像的自動融合,並對宮頸癌病變區域進行檢測,從而輔助醫生的診斷過程,具備實際應用價值。後續將基於第一部分的目標檢測模型基礎,利用特徵轉換的方法,將圖像數據轉換為結構數據,將跨模態融合方法應用於分類問題。

2.基於特徵轉換的跨模態數據融合的乳腺癌骨轉移的診斷

2.1.前言

骨骼是第三常見的惡性腫瘤轉移部位,其發生率僅次於肺轉移和肝轉移,近70%的骨轉移瘤的原發部位為乳腺和前列腺[49],[50]。骨轉移造成的骨相關事件非常多樣,從完全無症狀到嚴重疼痛、關節活動度降低、病理性骨折、脊髓壓迫、骨髓衰竭和高鈣血症。高鈣血症又可導致便秘、尿量過多、口渴和疲勞,或因血鈣急劇升高導致心律失常和急性腎功能衰竭[51]。骨轉移是乳腺癌最常見的轉移方式,也是患者預後的分水嶺,其診斷後的中位生存期約為40個月[52],[53]。因此,及時發現骨轉移病灶對於診斷、治療方案的選擇和乳腺癌患者的管理至關重要。目前,病灶穿刺活檢是診斷骨轉移的金標準,但鑑於穿刺活檢有創、存在較高風險、且假陰性率高,臨床常用影像學檢查部分替代穿刺活檢判斷是否發生骨轉移。

Batson的研究表明,乳腺的靜脈回流不僅匯入腔靜脈,還匯入自骨盆沿椎旁走行到硬膜外的椎靜脈叢[54]。通過椎靜脈叢向骨骼的血液回流部分解釋了乳腺癌易向中軸骨和肢帶骨轉移的原因。因潛在骨轉移灶的位置分佈較廣,影像學篩查需要覆蓋更大的區域,常要求全身顯像。常用的骨轉移影像診斷方法包括全身骨顯像(whole-body bone scintigraphy, WBS)、計算機斷層掃描(computed tomography, CT)、磁共振成像(magnetic resonance imaging, MRI)和正電子發射斷層顯像(positron emission tomography, PET)[55]。CT可以清晰地顯示骨破壞、硬化沉積和轉移瘤引起的軟組織腫脹;MRI具有優異的骨和軟組織對比解像度;因[18F]氟化鈉會特異性地被骨組織吸收、代謝,PET可以定位全身各處骨代謝活躍的區域。然而,單一模態影像常不足以檢測骨轉移,且用傳統方法綜合單一患者的 CT、MRI、PET數據篩查骨轉移病灶需要對上千幅影像進行解讀,這一極為耗時的過程可能影響臨床醫生對乳腺癌骨轉移的診斷,造成誤診、漏診。而骨轉移的漏診會誤導一系列臨床決策,導致災難性後果。

作為一種客觀評估體系,人工智能輔助骨轉移自動診斷系統通過減少觀察者間和觀察者內的變異性,提高了診斷的一致性和可重複性,降低了假陰性率。在減輕臨床醫師的工作負擔的同時,提高診斷的準確性。目前已經有很多在單一模態圖像中(CT、MRI或 PET)基於深度學習技術進行骨轉移病變檢測的工作: Noguchi等人開發了一種基於深度學習的算法,實現了在所有 CT掃描區域中對骨轉移病灶的自動檢測[56];Fan等人用 AdaBoost算法和 Chan-Vese算法在 MRI圖像上對肺癌的脊柱轉移病灶進行了自動檢測和分割[57];Moreau等人比較了不同深度學習算法在 PET/CT圖像上分割正常骨組織和乳腺癌骨轉移區域的性能[58]。但很少有使用跨模態數據融合的深度學習方法,判斷是否存在骨轉移灶的相關研究。

旨在減輕臨床醫生的工作負擔,本章提出了基於特徵轉換的跨模態數據融合方法,用於分析 CT、MRI和 PET圖像,以判斷其中是否存在乳腺癌骨轉移病灶。

基於特徵轉換的 CT、MRI和 PET跨模態圖像數據融合,進行骨轉移病變分類(即存在骨轉移病灶和不存在骨轉移病灶兩類)項目包括三個研究任務:目標病變區域檢測,特徵構造及轉換和分類任務。具體地,採用目標檢測模型對不同模態的醫學圖像序列數據進行單獨的骨轉移瘤目標檢測,再對這些檢測結果進行特徵提取。所提取的特徵包括不同模態下檢測結果置信度的區間佔比、檢測框的面積大小、檢測框在圖像中的空間位置分佈等。這些特徵被整理成結構化數據格式,完成了從非結構化影像數據到結構化數據特徵的特徵轉換和融合過程。最後,將轉換後的特徵輸入分類模型進行分類任務。實驗比較了基於特徵轉換的跨模態數據融合方法在乳腺癌骨轉移腫瘤分類任務中的性能與僅使用單模態數據執行分類任務的性能。同時,還將本項目提出的基於特徵轉換的融合策略與其他融合方法進行了對比。

2.2.研究方法

2.2.1.研究設計和工作流程

本項目旨在判斷 CT、MRI、PET圖像序列中是否存在乳腺癌骨轉移病灶。工作流程如圖2-1所示:掃描設備對每位患者進行 CT、MRI或 PET圖像序列的採集;使用目標檢測模型分別在不同模態圖像中對可疑乳腺癌骨轉移灶進行目標檢測;對檢測結果進行特徵提取、構造和融合,得到具有可解釋性的結構化醫療數據;用分類模型對結構化數據進行分類任務,得出預測結果,從而判斷乳腺癌骨轉移是否發生。

圖2-1工作流程

img

2.2.2.骨轉移目標區域檢測

先由兩位臨床醫師對多模態數據集圖像中的骨轉移病灶進行人工標註,並對患者進行分類(標籤分為乳腺癌骨轉移和非乳腺癌骨轉移),並訓練 YOLOv5目標檢測網絡,以識別各個單一模態圖像中的乳腺癌骨轉移病灶。

2.2.3.基於特徵轉換的跨模態數據融合

在本項目的數據集中,各種模態序列影像的掃描範圍均涵蓋了患者的全身。某患者的影像序列(不論是單模態圖像還是多模態圖像)中檢測到含有骨轉移病灶的切片圖像數量越多,則意味着該患者發生乳腺癌骨轉移的概率越大。根據這一基本推理,採用後融合方法,將一個影像序列中含有腫瘤切片圖像的比例(百分比)作為結構化的數據特徵,作為後續分類任務的依據。

具體操作如下:在每個模態的圖像中完成骨轉移區域的目標檢測任務訓練後,統計每個圖像序列中檢測到轉移瘤目標區域的檢測框數量。按照檢測框的置信度劃分為8個區間:10%~20%、20%~30%、30%~40%、40%~50%、50%~60%、60%~70%、70%~80%和大於80%。在每個區間內,分別統計各模態圖像序列中轉移瘤檢測框數量,再除以該序列中切片圖像的總數,得到每個置信度區間內每種模態圖像序列中含有轉移瘤檢測框的百分比。接着將三種模態圖像提取出的統計特徵拼接,組成結構化數據,實現跨模態數據融合。若患者缺失某種模態數據,相應的統計特徵(百分比)將被置為零。特徵轉換後的結構化數據如圖2-2所示,每種模態數據包括8個特徵,即不同的置信區間,最後一列為標籤值,其中「0」表示負例,「1」表示正例。
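按上述特徵轉換流程,可將每個影像序列的檢測結果整理為結構化特徵,示意代碼如下(置信度區間劃分與歸一化方式依正文描述;函數與變量名為示例假設):

```python
import numpy as np

# 8個置信度區間:10%~20%、20%~30%、…、70%~80%、大於80%
BIN_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.01]
MODALITIES = ["CT", "MRI", "PET"]

def sequence_features(confidences: list, n_slices: int) -> np.ndarray:
    """統計一個影像序列中各置信度區間內檢測框的數量,
    並除以該序列切片總數進行歸一化,得到8維特徵。"""
    counts, _ = np.histogram(confidences, bins=BIN_EDGES)
    return counts / max(n_slices, 1)

def fuse_patient_features(detections: dict, slice_counts: dict) -> np.ndarray:
    """級聯三種模態的8維特徵得到24維跨模態結構化特徵;
    缺失模態對應的特徵置為零。"""
    features = []
    for m in MODALITIES:
        if m in detections:
            features.append(sequence_features(detections[m], slice_counts[m]))
        else:
            features.append(np.zeros(len(BIN_EDGES) - 1))
    return np.concatenate(features)
```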

圖2-2特徵轉換後的結構化數據

img

2.2.4.乳腺癌骨轉移的分類模型

利用構建好的結構化醫療特徵進行乳腺癌骨轉移分類任務,融合跨模態圖像數據特徵判斷是否發生乳腺癌骨轉移。本項目採用的分類模型主要以模式識別基礎模型為主,包括SVM[59]、AdaBoost[60]、RandomForest[61]、LightGBM[62]、GBDT[63]。SVM是一種基於核函數的監督學習模型,用於解決分類問題,通過尋找最優超平面在特徵空間中將樣本分為不同類別,決策函數映射輸入特徵到輸出標籤,核函數將特徵映射到新空間,損失函數度量決策函數性能,最大化超平面與樣本間距離實現分類,可使用不同核函數處理高維特徵。AdaBoost是一種疊代算法,於1995年由 Freund Y等人提出,能夠將多個弱分類器結合成一個強分類器,通過選擇初始訓練集、訓練弱分類器、加權重新分配樣本和重複訓練直到訓練完成所有弱學習器,最後通過加權平均或投票得出最終決策。由 Breiman L等人於2001年提出的 RandomForest是一種基於決策樹的機器學習算法,可用於分類和回歸任務。通過構建多個決策樹並對它們的預測結果進行平均或投票來得出最終預測結果,訓練過程中隨機選擇特徵,以避免過擬合並減少計算量。機器學習模型 LightGBM是一種基於決策樹的梯度提升機算法,由Ke G等人在2017年提出,適用於結構化數據的分類任務。具有高效、內存友好、支持並行處理和多 CPU等特點,能快速處理大量特徵,通過基於直方圖的決策樹算法減少訓練時間和內存使用量。通過損失函數的泰勒展開式近似表示殘差,以此計算損失函數。由 Friedman J H等人於2001年提出的 GBDT是一種疊代的決策樹算法,通過構建多個決策樹來擬合目標函數,每一步都在上一步的基礎上構建新的決策樹,以不斷減小誤差,流程包括選取子集、訓練弱學習器、梯度下降法最小化誤差,最終將弱學習器加入總體模型,重複以上步驟直至達到最優解。

2.2.4.1.基於C3D的跨模態數據融合分類模型

本項目採用C3D[64]分類模型作為對照模型,該模型是基於3D卷積神經網絡的深度學習方法,使用跨模態數據融合中的前融合策略。如圖2-1所示,該融合策略從每個模態的圖像序列中篩選出一部分,合併為一個完整的多模態圖像序列,並在通道上進行級聯,進行跨模態數據融合。融合後的數據作為3D卷積神經網絡的輸入,經過多個3D卷積層提取特徵,最終在全連接層中執行分類任務,以判斷影像中是否存在乳腺癌骨轉移病灶。

2.3.實驗

2.3.1.臨床信息和影像數據集

本項目選取符合以下條件的患者開展研究:1)於2000年01月至2020年12月期間在北京協和醫院或國家癌症中心中國醫學科學院腫瘤醫院被診斷為原發性乳腺癌的患者;2)有 CT、PET或 MRI其中任一模態的全身影像數據;3)有電子病歷記錄。入組患者中有145名被確診為乳腺癌骨轉移,作為正例樣本,有88名患者未發生乳腺癌骨轉移,作為負例樣本。每例樣本數據包含一至三種不同模態的圖像序列,其圖像尺寸和切片圖像數量各異。乳腺癌骨轉移的多模態醫學圖像數據集對患者的全身進行採樣,由於患者的 CT、MRI或 PET是不同時間、在不同設備上採集的,不同模態間的特徵並非一一匹配。其中,CT模態共有3051張切片,MRI模態共有3543張切片,而 PET模態共有1818張切片。在入組進行分析之前,所有患者數據都已去標識化。本研究已通過北京協和醫院倫理委員會批准。

該數據集可以用於執行目標檢測任務和分類任務。

骨轉移目標檢測任務僅分析數據集中的正例樣本,進行五折交叉驗證:將145例患者的數據按模態分為三組(CT組、MRI組、PET組),在每個組內對數據進行五等分,在每輪驗證中選取一部分作為測試集。為獲得測試性能的平均值,所有模型都需進行5次訓練和評估。

在利用結構化數據執行分類任務時,需要平衡正負樣本數量,因此要擴充數據集。將具有多種模態的樣本拆分為包含較少模態的樣本,如將「CT+MRI+PET」類型拆分為「CT+MRI」或「CT+PET」等。如表2-1所示,擴充後共有380例樣本數據,包括188個正樣本和192個負樣本。下一步,合併五折交叉驗證的目標檢測結果,此後,進行特徵構建和轉換,從而獲得適合跨模態數據融合和分類任務的結構化數據;對於負樣本數據,也需要在合併骨轉移目標檢測模型的推理結果後,對數據進行結構化處理。

為證實在乳腺癌骨轉移判斷的分類任務中,基於特徵轉換的跨模態融合數據性能優於單一模態數據,需要進行多模態融合數據與單模態數據的對照實驗。如表2-1所示,單模態數據包括僅有 CT、僅有 MRI和僅有 PET三種類型的數據集合,總計212個樣本,而多模態數據涵蓋了CT+MRI、CT+PET、MRI+PET和CT+MRI+PET四種類型,共計168個樣本。分別對單模態數據和多模態數據進行獨立劃分,將每種模態數據進行五等份,進行五折交叉驗證。在每輪驗證中,選擇一部分作為測試集。利用 SVM、AdaBoost、RandomForest、LightGBM、GBDT以及 C3D模型進行實驗,每個模型都需進行5輪訓練和評估,以獲得測試集上性能的平均值。

為適應 C3D模型對圖像統一尺寸的要求,針對不同患者切片數量、大小的差異,進行預處理。在每種模態圖像序列中等間隔抽取60張圖像切片,並進行縮放,使其組合為180張128×128像素的切片。對於缺失的模態數據,用60張零像素值的黑色圖像切片進行填充。從180張切片中隨機選取一個起始位置,連續抽取120張切片作為模型的最終輸入,確保輸入尺寸為128×128×120像素。
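下面按上述預處理步驟(每模態等間隔抽取60張切片、縮放為128×128、缺失模態以零值切片填充、從拼接後的180張中隨機起點連續抽取120張)給出示意代碼;插值方式等細節論文未註明,為示例假設:

```python
import cv2
import numpy as np

def build_c3d_input(ct=None, mri=None, pet=None,
                    per_modality=60, depth=120, size=128):
    """將三種模態的切片序列組裝成C3D的輸入體數據(示意),
    輸出形狀為 (size, size, depth)。"""
    stacks = []
    for seq in (ct, mri, pet):
        if seq is None or len(seq) == 0:
            # 缺失模態:用零像素的黑色切片填充
            stacks.append(np.zeros((per_modality, size, size), dtype=np.float32))
            continue
        # 等間隔抽取per_modality張切片,並縮放到size×size
        idx = np.linspace(0, len(seq) - 1, per_modality).astype(int)
        resized = [cv2.resize(seq[i].astype(np.float32), (size, size)) for i in idx]
        stacks.append(np.stack(resized))

    volume = np.concatenate(stacks, axis=0)          # 共180張切片
    start = np.random.randint(0, volume.shape[0] - depth + 1)
    clip = volume[start:start + depth]               # 連續抽取120張
    return np.transpose(clip, (1, 2, 0))             # 128×128×120
```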

表2-1擴充後的分類數據集

img

2.3.2.模型訓練過程

在按上述步驟準備好數據集後,進行目標檢測任務訓練時,將每個模態的圖像大小統一到1024×1024像素,然後使用多種數據增強方法,增加輸入數據集對噪聲的魯棒性。

目標檢測模型採用 YOLOv5模型,並使用 PyTorch深度學習框架在2個Nvidia Tesla V100-SXM2 32G GPU上進行70個輪次的訓練。初始學習率為0.00001,使用0.98的動量和0.01的權值衰減通過 SGD來優化各網絡層的權重目標函數。在訓練過程中,網絡在驗證集上達到最小的損失時,選擇最佳參數。

進行分類任務時,採用SVM、AdaBoost、RandomForest、LightGBM以及GBDT等機器學習模型。因其超參數會對預測結果產生較大影響,在訓練過程中,使用網格搜索策略為這些模型尋找最佳參數。網格搜索策略在一定範圍的超參數空間內尋找最佳的超參數組合,通過枚舉各種可能的組合並評估模型預測結果,最終選擇表現最優的超參數組合。要搜索的超參數包括學習率、樹的最大深度、葉子節點數量、隨機抽樣比例、權重的L1正則化項和權重的L2正則化項等。實驗結果將基於最優超參數設定下的預測模型。模型訓練的網絡結構圖如圖2-3所示。
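以LightGBM為例,網格搜索可示意如下(超參數取值範圍為示例假設,並非論文給出的搜索空間):

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# 要搜索的超參數:學習率、樹的最大深度、葉子節點數量、
# 隨機抽樣比例、L1/L2正則化項(取值範圍僅為示例)
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "num_leaves": [15, 31, 63],
    "subsample": [0.8, 1.0],
    "reg_alpha": [0.0, 0.1],
    "reg_lambda": [0.0, 0.1],
}

search = GridSearchCV(
    estimator=LGBMClassifier(),
    param_grid=param_grid,
    scoring="roc_auc",   # 以AUC作為選擇最優超參數的指標
    cv=5,                # 與正文的五折交叉驗證一致
)
# X為24維跨模態結構化特徵,y為是否發生骨轉移的標籤(0/1)
# search.fit(X, y); best_model = search.best_estimator_
```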

用於對照的 C3D模型使用 PyTorch深度學習框架在1個 NVIDIA Tesla V100-SXM2 32GB GPU上訓練100個輪次,初始學習率為0.00001,使用動量為0.9,權值衰減為0.0005的 SGD梯度下降優化器對各網絡層權重的目標函數進行優化。

圖2-3網絡結構圖

img

2.3.3.評價指標

本項目中的骨轉移目標檢測任務採用 AP50作為評價指標,其介紹詳見上一章節。

而在分類任務中,採用準確率(Accuracy, Acc)、敏感性(Sensitivity, Sen)、特異性(Specificity, Spe)、AUC(Area Under Curve, AUC)作為評價指標,並採用ROC曲線和PR曲線對模型進行評估。準確率是指對於給定的測試集,分類模型正確分類的樣本數佔總樣本數的比例,如公式5所示,其中真正例(True Positive, TP)表示預測為正例且標籤值為正例,假正例(False Positive, FP)表示預測為正例但標籤值為負例,假負例(False Negative, FN)和真負例(True Negative, TN)代表的概念以此類推。如公式6和公式7所示,敏感性和特異性的定義分別為:預測正確的正例佔所有正例的比例,以及預測正確的負例佔所有負例的比例。ROC曲線是一種評估二分類模型的方法,其橫軸為假陽性率(False Positive Rate, FPR),縱軸為真陽性率(True Positive Rate, TPR),其中TPR的計算方式與上一章的召回率(Recall)相同,TPR和 FPR的計算方式詳見公式8和公式9。ROC曲線展示了在不同閾值下,TPR與 FPR的變化關係。左上角點對應的假陽性率為0,真陽性率為1,表明模型將所有正例樣本分類正確,且未將任何負例樣本誤判為正例,因此若 ROC曲線靠近左上角,提示模型性能較好。AUC代表ROC曲線下的面積,即從(0,0)到(1,1)進行積分測量ROC曲線下二維區域的面積。AUC綜合考慮所有可能的分類閾值,提供了一個全面的性能度量。AUC值表示隨機從正負樣本中各抽取一個,分類器正確預測正例得分高於負例的概率。AUC值越接近1,說明模型性能越優秀。PR曲線的繪製方法詳見上一章,PR曲線在不同分類閾值下展示了分類器在精度(Precision, P)和召回率(Recall, R)方面的整體表現。

img
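原文此處的公式同樣以圖片形式存檔。根據上文文字描述,公式5–9可重構如下(重構自文字描述):

準確率(公式5):$Acc = \dfrac{TP + TN}{TP + TN + FP + FN}$

敏感性(公式6):$Sen = \dfrac{TP}{TP + FN}$

特異性(公式7):$Spe = \dfrac{TN}{TN + FP}$

真陽性率(公式8):$TPR = \dfrac{TP}{TP + FN}$

假陽性率(公式9):$FPR = \dfrac{FP}{FP + TN}$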

2.3.4.單模態骨轉移灶檢測模型及基於特徵轉換的跨模態分類模型的結果與分析

本項目對乳腺癌骨轉移多模態醫學圖像數據集(包括 CT、MRI、PET)進行了單模態腫瘤檢測實驗和基於特徵轉換的跨模態病例分類實驗。其中,單模態腫瘤檢測實驗是多模態腫瘤分類實驗的前置步驟。

採用單階段目標檢測模型 YOLOv5,在五折交叉驗證下比較了使用單模態 CT圖像、PET圖像、MRI圖像作為輸入數據集時,模型的目標檢測性能。並在將目標檢測結果進行特徵轉換後,採用不同的分類模型,包括後融合分類模型(LightGBM、GBDT、AdaBoost、RandomForest、SVM)和前融合分類模型(C3D),在五折交叉驗證下比較了使用單模態數據和跨模態融合數據作為輸入時,每個模型的分類性能。

表2-2展示了在不同單一模態數據上,五折交叉驗證得到的骨轉移病灶檢測結果,評估指標為 AP50。實驗結果表明,PET模態的檢測精度較高,而 CT模態的檢測精度最低。輸入數據量較少、檢測目標面積小、轉移瘤的特徵難以與正常骨組織區分是提高檢測精度的難點。圖2-4將不同單一模態圖像下目標骨轉移病變區域的檢測結果和實際標註的癌灶區域進行了可視化。綠色框由醫師標註,目標檢測模型標註的預測框為黃色。

表2-2單模態骨轉移灶檢測五折交叉驗證結果

img

圖2-4可視化單模態目標檢測結果

img

將 CT、MRI和 PET的數據組成單模態子數據集進行單模態分析,而將兩種及兩種以上的數據組成跨模態子數據集進行多模態分析。表2-3和表2-4展示了在上述兩種子數據集中進行五折交叉驗證的結果,對比了6種不同模型每一折的準確率、AUC,及其平均值。這6類模型中的前5種模型使用後融合策略,而作為對照的C3D模型採用前融合策略。對比表2-3和表2-4的實驗結果可知,在任一模型(包括前融合模型)中,基於特徵轉換的跨模態融合數據在乳腺癌骨轉移分類任務上相較於僅使用單模態數據的性能有所提高:平均準確率提高了7.9%;平均AUC提高了8.5%。如表2-5和2-6所示,跨模態融合方法比單模態方法的平均敏感性提高了7.6%,平均特異性提高了9.4%。

表2-3基於單模態子數據集進行分類任務的準確率和 AUC

img

表2-4基於跨模態子數據集進行分類任務的準確率和 AUC

img

表2-5單模態數據分類的敏感性和特異性

img

表2-6跨模態融合數據分類的敏感性和特異性

img

圖2-5和圖2-6分別展示了6個模型利用單模態數據進行分類實驗和利用特徵轉換和融合後的跨模態數據進行分類實驗的 PR曲線。可以根據曲線形狀和曲線下方面積來評估不同模型的性能表現,曲線下面積越大,提示模型的性能越優秀。綜合觀察單模態和跨模態分類實驗的P-R曲線圖,可以發現,基於跨模態數據的分類任務的P-R曲線下面積大於基於單模態數據的分類任務的P-R曲線下面積,提示跨模態數據作為輸入時分類模型的表現更加出色。

比較基於單模態數據進行分類的模型的 P-R曲線,可見3D卷積網絡的訓練方式相較於其他後融合模型的性能表現更優。然而,在基於跨模態數據進行分類的模型的 P-R曲線中,基於特徵轉換的跨模態後融合策略相對於基於3D卷積的前融合方法具有更好的性能。

圖2-5基於單模態數據不同分類模型的PR曲線

img

圖2-6基於跨模態數據不同分類模型的PR曲線

img

圖2-7和圖2-8展示了6種分類模型在使用單模態數據進行分類實驗和使用跨模態數據進行分類實驗的情況下的ROC曲線。通過對比觀察六個模型的ROC曲線的形狀和面積來評估不同模型的性能。靠近左上角的 ROC曲線表示假陽性率接近0,真陽性率接近1,趨近於左上角的 ROC曲線提示模型性能優越。對比圖2-7和圖2-8可知,使用基於特徵轉換的跨模態數據的骨轉移病例分類模型的性能更為優越。

圖2-7基於單模態數據不同分類模型的ROC曲線

img

圖2-8基於跨模態數據不同分類模型的ROC曲線

img

本文提出的跨模態數據融合方法是基於特徵轉換的後融合策略,相較於前融合策略具有更好的性能。實驗表明,無論採用前融合或後融合策略,基於跨模態融合數據的實驗都表現出了顯著的優勢。相較於多模態數據,單一模態數據所捕獲的特徵較為單一(如僅有結構信息),可能由於缺乏關鍵和全面的特徵信息導致模型性能不佳,而跨模態融合方法則能從不同模態中獲取更多的有效特徵,並將其融合,從而提高準確率。

2.4.討論

本項目旨在評估基於特徵轉換的跨模態數據融合方法是否可以跨模態融合 CT、MRI和 PET圖像的有效特徵,以對乳腺癌患者進行是否發生骨轉移的評估。本項目提出了一個基於特徵轉換的跨模態融合圖像數據框架,用於對骨轉移病變進行分類,包括目標病變區域檢測、特徵構造及融合形成可解釋的結構化數據以及跨模態融合數據分類步驟。融合 CT、MRI和 PET的轉換特徵數據能夠充分利用各個模態中的信息,為分類任務提供更多的數據支持,並增加輔助判斷的特徵線索。基於特徵轉換的跨模態病例分類實驗證明,本項目提出的跨模態融合數據顯著提高了對影像序列進行二分類任務的性能,相較於單模態數據,平均準確率和 AUC分別提高了7.9%和8.5%。

如表2-2所示,單模態目標檢測模型在 PET圖像中的檢測精度較高,而在 CT和 MRI圖像中的精度相對較低。圖2-4展示了在 YOLOv5目標檢測模型中不同單一輸入圖像模態下乳腺癌骨轉移檢測效果的可視化圖像。分析各種圖像模態信息可知,CT和 MRI圖像不僅包含病灶的解剖信息,還包含了人體正常組織的結構信息,而後者可能會干擾骨轉移病變區域特徵的識別和檢測,導致單一CT或MRI模態下出現漏檢現象。與之不同,PET圖像展示的是組織代謝信息。骨轉移病灶通常伴隨着頻繁的成骨和破骨活動,在 PET影像中呈高代謝,而正常骨組織的代謝相對較緩慢,通常不會顯示在圖像中。因此,PET對背景組織的干擾較 CT、MRI更不敏感,有助於目標檢測模型識別異常代謝區域。因早期無症狀骨轉移病灶通常體積較小,可能因目標區域面積過小影響目標檢測結果。在模型處理過程中,池化(pooling)操作可能導致特徵或圖像信息的損失,從而造成特徵缺失。為了克服這一問題,後續研究可以關注提高模型在處理小目標區域時的性能。

表2-4和表2-6展示了在各種分類模型中,基於跨模態結構化數據在五折交叉驗證下的分類性能。通過對比分析發現,相對於基於 C3D的前融合分類模型,基於特徵轉換的後融合策略在性能方面有所提高。醫學影像數據有數據量較少、維度高、結構複雜以及樣本識別難度大等特點,這導致將特徵提取任務交由模型完成,直接輸入原始數據或經過簡單預處理的數據,讓模型自主進行特徵提取並生成最終輸出的這種端到端的前融合方法效果不盡如人意。由於患者在體型、身高等方面的個體差異,一個圖像序列內的CT和MRI切片數量也有所不同。因此,需要對圖像進行歸一化處理,將其轉換為統一的標準格式,如調整到相同尺寸、修正切割後圖像中心的位置等。歸一化操作旨在對數據進行統一格式化和壓縮,但這可能會導致圖像未對齊、圖像與特徵錯位、數據壓縮過度以及特徵丟失等問題。因此,採用基於特徵轉換的後融合策略可能更合適本項目。前融合所採用的 C3D分類模型是一種在三維數據上進行分析的網絡模型。三維數據具有尺度高、維度大以及信息稀疏等特點。儘管 C3D網絡訓練過程中增加了一個維度的信息,但同時也提高了算法分析的複雜性,特別是在模型訓練過程中,佔用了大量顯存等硬件資源,可能導致批歸一化不理想和網絡收斂不完全的問題。與 C3D相比,本文提出的二階段後融合方法實現了特徵壓縮,提取置信度這種可解釋的特徵,並去除了無關的稀疏特徵。在有限的硬件資源和數據量的限制下,這種方法能更好地學習數據特徵,起到了類似正則化(通過在損失函數中添加約束,規範模型在後續疊代過程中避免過度擬合)的效果。

乳腺癌骨轉移病灶在代謝和結構方面都較正常骨組織顯著不同。因此,通過融合 CT、MRI和 PET圖像的特徵信息,實現解剖和功能信息的跨模態融合,能更有效地完成分類任務,幫助診斷乳腺癌骨轉移。然而,綜合分析全身 CT、MRI和PET圖像信息需要醫師投入大量時間,且存在較大的觀察者間差異。一旦發生漏診,會導致嚴重後果。利用計算機輔助醫師判斷乳腺癌是否發生骨轉移不僅可以節省成本和時間,還能提供更加客觀的評估標準。計算機輔助診斷工具可以綜合多模態圖像的結果進行特徵轉換和分析,預防漏診的發生。因此,在未來的研究中,可以重點關注開發此類計算機輔助診斷系統,以提高乳腺癌骨轉移診斷的準確性和效率。

這項研究仍存在一些局限性。從單個影像模態中提取的特徵較為單一,僅有置信區間,可以在後續的訓練中從臨床角度出發加入更多可能影響骨轉移判斷的因素作為分類特徵,如檢測目標的面積,或增加中軸骨檢出目標的權重。因本項研究中具有多模態影像數據的病例量不足,未來可以嘗試除五折交叉驗證之外其他的模型訓練方法,以降低數據量對分類模型性能的影響。

2.5.結論

本項目提出了一種基於特徵轉換的跨模態數據融合方法進行分類任務的深度學習框架,用於判斷是否發生乳腺癌骨轉移。首先獨立對不同模態的醫學圖像數據進行腫瘤檢測,根據目標檢測結果進行特徵構造,並將其組織成結構化數據的形式,完成從非結構化數據特徵到結構化數據特徵的轉換與融合。最終,將結構化數據特徵輸入分類器,進行骨轉移的分類任務,並對照 C3D前融合模型,討論了基於特徵轉換方法進行跨模態數據後融合的優勢。大量的實驗證明,使用基於特徵轉換的跨模態融合數據進行分類任務的性能優於基於單模態數據的分類性能;使用本項目提出的後融合策略執行分類任務較使用前融合策略的分類模型(C3D)的性能更好。

本項目所提出的技術可綜合 CT、MRI和 PET模態數據的特徵,對乳腺癌患者是否發生骨轉移進行判斷,輔助臨床醫師進行乳腺癌骨轉移病灶的篩查,具備實際應用價值,也為在醫學圖像分析任務中更有效地應用跨模態融合方法,提供了關鍵的理論支持。

全文小結

目前,醫學影像學的解讀大量依賴臨床醫生個人的主觀診斷經驗,人工閱片易漏診小目標,難以推廣及表述,具有一定的局限性。與此相比,人工智能技術可以通過深度神經網絡對大量積累的影像數據和診斷數據進行分析,學習並提取數據中對病理診斷有用的特徵,從而在數據支持下做出更客觀的判斷。按成像方式不同,醫學影像數據可分為多種模態,如B超、CT、MRI、PET。為了最大限度模擬臨床醫生結合不同模態影像檢查結果形成診斷的過程,設計人工智能模型時,應將各種影像學模態的特徵進行有效的融合,即本項目中應用的跨模態深度學習方法,充分利用不同模態圖像的獨特優勢訓練深度神經網絡,從而提高模型性能。本項目以宮頸癌和乳腺癌骨轉移為例,驗證了跨模態深度學習方法在病變區域定位和輔助診斷方面的性能。

在第一部分中,我們回顧性納入了220例有FDG-PET/CT數據的宮頸癌患者,共計72,602張切片圖像。通過圖像增強、邊緣檢測,實現 PET和 CT圖像的 ROI自適應定位,再通過縮放、零值填充和剪切的方式,將兩種模態圖像的 ROI對齊。經過加權和圖像疊加,進行圖像融合,將融合後的圖像作為目標檢測網絡的輸入層,進行宮頸癌病變區域檢測。實驗證明,相比使用單一 CT圖像、單一 PET圖像、PET-CT簡單融合圖像、PET-CT通道拼接融合圖像作為網絡輸入,PET-CT自適應區域特徵融合圖像顯著提高了宮頸癌病變區域檢測的準確性,目標檢測的平均精確度(AP50)分別提高了6.06%和8.9%,且消除了一些假陽性結果,展現出可觀的臨床應用價值。

在第二部分中,我們回顧性納入了233例乳腺癌患者,每例樣本數據包含 CT、MRI、或 PET一至三種模態的全身影像數據,共有3051張 CT切片,3543張 MRI切片,1818張 PET切片。首先訓練 YOLOv5目標檢測網絡,對每種單一模態圖像中的骨轉移病灶進行目標檢測。統計每個影像序列中含有檢出骨轉移病灶的個數和置信度,將每個置信區間內含有目標檢測框的百分比作為結構化醫療特徵數據。採用級聯方式融合三種模態的結構化特徵,得到具有可解釋性的結構化醫療數據,再用分類模型進行分類,預測是否發生骨轉移。實驗證明,相較於單模態數據,跨模態融合數據顯著提高了乳腺癌骨轉移診斷任務的性能,平均準確率和 AUC分別提高了7.9%和8.5%,觀察 ROC曲線和 PR曲線的形狀和面積也有相同的實驗結論。在不同的分類模型(SVM、AdaBoost、RandomForest、LightGBM、GBDT)中,使用基於特徵轉換的跨模態數據,相比單模態數據,對於骨轉移病例的分類性能更為優越。而相較於基於 C3D的前融合分類模型,基於特徵轉換的後融合策略在分類任務方面的性能更優。

綜上所述,本文基於人工智能深度學習算法,針對不同模態醫學圖像的特徵差異與互補性,進行多模態醫學影像數據的跨模態融合,提高了模型的腫瘤檢測和分類性能,檢測模型和分類模型可以輔助影像學閱片過程,具有顯著的臨床實際應用價值。

參考文獻

[1]陳思源,譚艾迪,魏雙劍,蓋珂珂.基於區塊鏈的醫療影像數據人工智能檢測模型[J].網絡安全與數據治理,2022,41(10):21-25.

[2] Dong X, Wu D. A rare cause of peri-esophageal cystic lesion[J]. Gastroenterology,2023,164(2):191-193.

[3] Arbyn M, Weiderpass E, Bruni L, et al. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis[J]. The Lancet Global Health,2020,8(2): e191-e203.

[4] Marth C, Landoni F, Mahner S, et al. Cervical cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up[J]. Annals of Oncology,2017,28:iv72-iv83.

[5] Gold M A. PET in Cervical Cancer—Implications for Staging, Treatment Planning, Assessment of Prognosis, and Prediction of Response[J]. Journal of the National Comprehensive Cancer Network,2008,6(1):37-45.

[6] Gandy N, Arshad M A, Park W H E, et al. FDG-PET imaging in cervical cancer[C].Seminars in nuclear medicine. WB Saunders,2019,49(6):461-470.

[7] Grigsby P W. PET/CT imaging to guide cervical cancer therapy[J]. Future Oncology,2009,5(7):953-958.

[8] Mirpour S, Mhlanga J C, Logeswaran P, et al. The role of PET/CT in the management of cervical cancer[J]. American Journal of Roentgenology,2013,201(2): W192-W205.

[9] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. nature,2015,521(7553):436-444.

[10] Szeliski R. Computer vision: algorithms and applications[M]. Springer Science & Business Media,2010.

[11] Ma B, Yin X, Wu D, et al. End-to-end learning for simultaneously generating decision map and multi-focus image fusion result[J]. Neurocomputing,2022,470:204-216.

[12] Anwar S M, Majid M, Qayyum A, et al. Medical image analysis using convolutional neural networks: a review[J]. Journal of medical systems,2018,42:1-13.

[13] Ma B, Ban X, Huang H, et al. Deep learning-based image segmentation for Al-La alloy microscopic images[J]. Symmetry,2018,10(4):107.

[14] Li Z, He J, Zhang X, et al. Toward high accuracy and visualization: An interpretable feature extraction method based on genetic programming and non-overlap degree[C].2020 IEEE International Conference on Bioinformatics and Biomedicine(BIBM). IEEE,2020:299-304.

[15] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J]. nature,1986,323(6088):533-536.

[16] He K, Gkioxari G, Dollar P, et al. Mask R-CNN[C]. International Conference on Computer Vision. IEEE Computer Society,2017, pp.2980-2988.

[17] Ma B, Wei X, Liu C, et al. Data augmentation in microscopic images for material data mining[J]. npj Computational Materials,2020,6(1):125.

[18] Ma B, Zhu Y, Yin X, et al. Sesf-fuse: An unsupervised deep model for multi-focus image fusion[J]. Neural Computing and Applications,2021,33:5793-5804.

[19] Kermany D S, Goldbaum M, Cai W, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning[J]. cell,2018,172(5):1122-1131. e9.

[20] Hyun S H, Ahn M S, Koh Y W, et al. A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer[J]. Clinical nuclear medicine,2019,44(12):956-960.

[21] Chilamkurthy S, Ghosh R, Tanamala S, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study[J]. The Lancet,2018,392(10162):2388-2396.

[22] Chen C, Xiao R, Zhang T, et al. Pathological lung segmentation in chest CT images based on improved random walker[J]. Computer methods and programs in biomedicine,2021,200:105864.

[23] Chen C, Zhou K, Zha M, et al. An effective deep neural network for lung lesions segmentation from COVID-19 CT images[J]. IEEE Transactions on Industrial Informatics,2021,17(9):6528-6538.

[24] Hill D L G, Batchelor P G, Holden M, et al. Medical image registration[J]. Physics in medicine& biology,2001,46(3): R1.

[25] Du J, Li W, Lu K, et al. An overview of multi-modal medical image fusion[J].Neurocomputing,2016,215:3-20.

[26] Watanabe H, Ariji Y, Fukuda M, et al. Deep learning object detection of maxillary cyst-like lesions on panoramic radiographs: preliminary study[J]. Oral radiology,2021,37:487-493.

[27] Mattes D, Haynor D R, Vesselle H, et al. PET-CT image registration in the chest using free-form deformations[J]. IEEE transactions on medical imaging,2003,22(1):120-128.

[28] Maqsood S, Javed U. Multi-modal medical image fusion based on two-scale image decomposition and sparse representation[J]. Biomedical Signal Processing and Control,2020,57:101810.

[29] Elakkiya R, Subramaniyaswamy V, Vijayakumar V, et al. Cervical cancer diagnostics healthcare system using hybrid object detection adversarial networks[J]. IEEE Journal of Biomedical and Health Informatics,2021,26(4):1464-1471.

[30] Al-Ameen Z, Sulong G, Gapar M D, et al. Reducing the Gaussian blur artifact from CT medical images by employing a combination of sharpening filters and iterative deblurring algorithms[J]. Journal of Theoretical and Applied Information Technology,2012,46(1):31-36.

[31] Canny J. A computational approach to edge detection[J]. IEEE Transactions on pattern analysis and machine intelligence,1986(6):679-698.

[32] Jocher G, Stoken A, Borovec J, et al. ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations[J]. Zenodo,2021.

[33] Wang C Y, Liao H Y M, Wu Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops,2020:390-391.

[34] Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation[C]. Proceedings of the IEEE conference on computer vision and pattern recognition,2018:8759-8768.

[35] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934,2020.

[36] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE transactions on pattern analysis and machine intelligence,2015,37(9):1904-1916.

[37] Lee S I, Atri M. 2018 FIGO staging system for uterine cervical cancer: enter cross-sectional imaging[J]. Radiology,2019,292(1):15-24.

[38] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934,2020.

[39] Smith A R. Color gamut transform pairs[J]. ACM Siggraph Computer Graphics,1978,12(3):12-19.

[40] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[C]. International conference on machine learning. PMLR,2015:448-456.

[41] Elfwing S, Uchibe E, Doya K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning[J]. Neural Networks,2018,107:3-11.

[42] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C].Proceedings of the IEEE international conference on computer vision.2017:2980-2988.

[43] Zhang S, Chi C, Yao Y, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.2020:9759-9768.

[44] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems,2015,28.

[45] Cai Z, Vasconcelos N. Cascade r-cnn: Delving into high quality object detection[C].Proceedings of the IEEE conference on computer vision and pattern recognition.2018:6154-6162.

[46] Pang J, Chen K, Shi J, et al. Libra r-cnn: Towards balanced learning for object detection[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.2019:821-830.

[47] Cohen P A, Jhingran A, Oaknin A, et al. Cervical cancer[J]. The Lancet,2019,393(10167):169-182.

[48] Lee S I, Atri M.2018 FIGO staging system for uterine cervical cancer: enter cross-sectional imaging[J]. Radiology,2019,292(1):15-24.

[49] Coleman R E. Metastatic bone disease: clinical features, pathophysiology and treatment strategies[J]. Cancer treatment reviews,2001,27(3):165-176.

[50] Cecchini M G, Wetterwald A, Van Der Pluijm G, et al. Molecular and biological mechanisms of bone metastasis[J]. EAU Update Series,2005,3(4):214-226.

[51] Cuccurullo V, Lucio Cascini G, Tamburrini O, et al. Bone metastases radiopharmaceuticals: an overview[J]. Current radiopharmaceuticals,2013,6(1):41-47.

[52] Emens L A, Davidson N E. The follow-up of breast cancer[C]. Seminars in oncology. WB Saunders,2003,30(3):338-348.

[53] Chen W Z, Shen J F, Zhou Y, et al. Clinical characteristics and risk factors for developing bone metastases in patients with breast cancer[J]. Scientific reports,2017,7(1):1-7.

[54] Batson O V. The function of the vertebral veins and their role in the spread of metastases[J]. Annals of surgery,1940,112(1):138.

[55] O'Sullivan G J, Carty F L, Cronin C G. Imaging of bone metastasis: an update[J]. World journal of radiology,2015,7(8):202.

[56] Noguchi S, Nishio M, Sakamoto R, et al. Deep learning–based algorithm improved radiologists' performance in bone metastases detection on CT[J]. European Radiology,2022,32(11):7976-7987.

[57] Fan X, Zhang X, Zhang Z, et al. Deep Learning on MRI Images for Diagnosis of Lung Cancer Spinal Bone Metastasis[J]. Contrast Media & Molecular Imaging,2021,2021(1):1-9.

[58] Moreau N, Rousseau C, Fourcade C, et al. Deep learning approaches for bone and bone lesion segmentation on 18FDG PET/CT imaging in the context of metastatic breast cancer[C]. 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE,2020:1532-1535.

[59] Boser B E, Guyon I M, Vapnik V N. A training algorithm for optimal margin classifiers[C]. Proceedings of the fifth annual workshop on Computational learning theory,1992:144-152.

[60] Freund Y, Schapire R E. A decision-theoretic generalization of on-line learning and an application to boosting[J]. Journal of computer and system sciences,1997,55(1):119-139.

[61] Breiman L. Random forests[J]. Machine learning,2001,45:5-32.

[62] Ke G, Meng Q, Finley T, et al. Lightgbm: A highly efficient gradient boosting decision tree[J]. Advances in neural information processing systems,2017,30:52.

[63] Chen T, Guestrin C. Xgboost: A scalable tree boosting system[C]. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining,2016:785-794.

[64] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks[C]. Proceedings of the IEEE international conference on computer vision,2015:4489-4497.

縮略詞表

img

文獻綜述

3.跨模態深度學習技術在臨床影像中的應用

The Application of Deep Learning and Cross-modal Fusion Methods in Medical Imaging

Abstract

Deep learning technology is gaining widespread prominence across various fields in this era. In the realm of medical imaging, it has steadily assumed a pivotal role in tasks such as feature recognition, object detection, and image segmentation, since its inception. With the continuous evolution of imaging techniques, the individual patient often possesses an expanding wealth of multi-modal imaging data. It is evident that deep learning models utilizing cross-modal image fusion techniques will find diverse applications in a lot more clinical scenarios. In the future, deep learning will play a significant role in the medical sector, encompassing screening, diagnosis, treatment, and long-term disease management. To provide a reference for future research, this review aims to present a concise overview of the fundamental principles of deep learning, the nature of cross-modal fusion methods based on deep learning, with their wide-ranging applications, and a comprehensive survey of the present clinical uses of single-modal and cross-modal deep learning techniques in medical imaging, with a particular emphasis on bone metastasis imaging.

Keywords: deep learning, cross-modal, tumor imaging, bone metastasis

3.1 Preface

Nowadays, data are generated in massive quantities in the healthcare sector, from sources such as high-resolution medical imaging, biosensors with continuous output of physiologic metrics, genome sequencing, and electronic medical records. The limits on the analysis of such data by humans alone have clearly been exceeded, necessitating an increased reliance on machines. The use of artificial intelligence (AI), the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage. One field that has attracted particular attention for the application of AI is radiology, as the cost of medical scans is declining, and the use of imaging studies is now at least as common as physical examinations, even surpassing the latter in surgical emergencies out of humanism and accuracy concerns. AI can greatly aid clinicians in diagnosis, especially in the interpretation of radiographic images, the accuracy of which heavily relies on the clinical experience and scrutiny of their interpreters, thus freeing clinicians to devote more of their attention to providing bedside healthcare. The radiologic screening and staging of tumors rely heavily on radiologists' subjective judgments. For some minuscule or ambiguous lesions, it is often difficult to arrive at a definitive diagnosis based solely on clinical experience. A case reported by Dong et al. proves the vulnerability of relying on error-prone human judgment[1]. AI methods mainly analyze medical images through image processing and deep learning techniques. As an assistant for clinicians, deep neural networks can be trained with large datasets of radiologic images or clinical information, automatically learning features key to the revelation of pathology or lesion localization. In addition to deep learning models based on images of a single modality, researchers have also proven the feasibility of integrating multi-modal medical imaging data in algorithms, with improved model robustness. A combination of feature representations from different imaging modalities can effectively improve the performance of tumor detection, classification, and segmentation. Artificially generating relatively scarce imaging data from more easily accessible radiographs by way of cross-modal image translation can not only aid diagnoses but also improve the performance of deep learning models.

This article will briefly review the relevant background of deep neural networks(DNNs), as well as the up-to-date development of cross-modal fusion and image translation methods. An overview of the current clinical applications of single-modal and cross-modal deep learning in tumor imaging, especially in bone metastasis imaging, is also provided.

3.2. Deep Neural Network(DNN)

Traditional machine learning methods face limitations in handling data in its raw form, as creating a suitable internal representation or feature vector requires a meticulous feature extractor designed manually to convert raw data, such as image pixels. Only then could a classifier, every detail of which was manually set and adjusted, detect or classify patterns in the input and spell out its outcome. Because of the varying qualities of images, many intricate image enhancement or filtering algorithms, such as adaptive Gaussian learning, histogram equalization, and de-noising, are designed solely for the purpose of pre-processing images to be ready for the feature extractor. Another downside of conventional machine learning is that manually coded algorithms would only, for images with finer details or better contrast, allow for automatic execution of the thought processes that can best mimic, not surpass, those of a clinician. All the key features to be extracted and used in encoding the classifier were essentially the same set of「inputs」 a clinician would use to make his or her judgment.

In contrast,「representation learning」 is a set of techniques utilized in deep learning, which enable machines to analyze raw data without manual feature engineering. This allows the system to automatically identify the relevant patterns or features needed for classification or detection. Pattern recognition using deep neural networks(DNNs) can help interpret medical scans, pathology slides, skin lesions, retinal images, electrocardiograms, endoscopy, faces, and vital signs. Deep learning algorithms employ multiple tiers of representation through the composition of nonlinear modules, which transform the input representation from one level to the next, beginning with the raw input and continuing to higher and more abstract levels. It is helpful to think of the entire network as, nonetheless, a「function」, that takes in a set of inputs and spills out an output, though with absurdly complicated parameters and transformations. Irrelevant variations or noise can be lost as stepping up towards the higher layers of representation that amplify only features important for discrimination. By this「layering」 method, very complex functions can be learned. A key differentiating feature of deep learning compared with other subtypes of AI is its autodidactic quality, i.e. neither the number of layers nor features of each layer is designed by human engineers, unencumbered by either the essence or the flaws of the human brain.

3.2.1. Supervised learning

Supervised learning is the process of learning with training labels assigned by the supervisor, i.e. the training set of examples has its raw input data bundled with their desired outputs. When used to classify images into different categories, the machine is shown an image and produces an output in the form of a vector of scores, one for each category during training. In supervised learning, the machine receives immediate feedback on its performance when its output does not match the expected output. The aim is to assign the highest score to the desired category among all categories. This is achieved by calculating a cost function that measures the average error or distance between the output scores and the expected pattern of scores. The inputs of the cost function are the parameters of the machine. Much like the「update rule」 used in the Mixed Complementarity Problem(MCP), the perceptron learning algorithm updates its internal adjustable parameters to minimize errors when it predicts the wrong category[2]. The adjustable parameters consist of weights and biases, where weights control the input-output function of the machine. The algorithm learns from its mistakes rather than successes, and the weights can number in the hundreds or millions. Weights are assigned to the connections between neurons from the input layer and one of the neurons in the next layer, in some sense representing the「strength」 of a connection. The activation of a single notch of neurons in the next layer was computed by taking the weighted sum of all the activations of the first layer, e.g., greyscale values of the pixels. Just like biological neurons may have different thresholds that the graded sum of electric potentials at the cell body needed to reach for axonal propagation, the algorithm may not want its neuron to light up simply over a sum greater than0. So, a「bias for inactivity」 is introduced into the weighted sum formula. For example, if a neuron is designed to be active only if the weighted sum exceeds10, then a10 is subtracted from the formula before the transformation that follows. Weights represent the pixel pattern(weight assigned to each pixel can be visualized as a pixel pattern) that the algorithm identifies, while biases provide a threshold indicating the required level of weighted sum for a neuron to become meaningfully active. When the goal is to have the value of activations of the next layer between0 and1 and were the mapping to be smooth and linear, then the weighted sum can be pumped into a sigmoid function, i.e.1/(1+ exp(−w)), where w is the weighted sum. A sigmoid transformation compresses the continuum of real numbers, mapping them onto the interval between0 and1, effectively pushing negative inputs towards zero, and positive inputs towards1, and the output steadily increases around the input0. Say the mapping is from p neurons from the one layer to the q neurons of the next layer, there would be p×q number of weights and q biases. These are all adjustable parameters that can be manipulated to modify the behavior of this network. In a deep learning system, changing the parameters may reflect a shift in the location, size, or shape of the representations to find「better」 features to travel through the layers to get to the desired output.

At present though, preferred mappings in DNN are neither smooth nor linear. By applying a non-linear function to the input, the categories become separable by the last output layer in a linear way, resulting in a definitive category output, unlike the previously mentioned range of numerical values that would require arbitrary cut-off points to finalize the categorization. A sigmoid transformation was once popular in the era of Multilayer Perceptron, during which a machine was simply an executor of commands, and the feature detected by each layer was designed and programmed by human engineers, such that the final output, as a continuous variable, would be interpretable[3]. The rectified linear unit(ReLU) is currently the most widely used non-linear function, which introduces non-linearity into the network by setting all negative values to zero. This is in contrast to the smoother non-linearities, such as tanh(z) or1/(1+ exp(−z)), used in previous decades.

ReLU has proven to be a faster learner in deep networks compared to these other non-linear functions and allows for the training of deep supervised networks without the need for unsupervised pre-training[4].

This would not work in DNN, as hidden layers would not be picking up edges and patterns based on our expectations. How the machine gets to the correct output is still an enigma, and its intelligence still awaits revelation.
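As a concrete illustration of the two non-linearities discussed above, here is a minimal NumPy sketch (illustrative only, not taken from the thesis):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Smooth non-linearity: squashes real numbers into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z: np.ndarray) -> np.ndarray:
    # Rectified linear unit: sets all negative values to zero
    return np.maximum(z, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # values strictly between 0 and 1
print(relu(z))     # [0.  0.  0.  0.5 2. ]
```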

The essence of learning by neural networks is to minimize the cost function. It is important for this cost function to have a nice and smooth output so that the local minimum can be obtained by taking little steps downhill, rather than being either on or off in a binary way the way biological neurons are. To adjust the weight and bias values of the parameter vector in a high-dimensional space, the learning algorithm computes a gradient vector that specifies how much the error, or cost, would increase or decrease if each parameter were slightly modified. In mathematical terms, this is similar to taking derivatives of a function with respect to a variable to observe the trend of the function between the two infinitesimally close values of that variable. In multivariate calculus, the gradient of a function indicates the path of the steepest incline, pointing towards the direction in the input space where one should move to minimize the output of this cost function with the utmost speed, and the length of the vector indicates exactly how steep the steepest ascent is. The weight vector is modified by shifting it in the opposite direction of the gradient vector, and the size of the adjustments is proportional to the slope of the gradient vector. When the slope of the gradient vector approaches the minimum, the step size decreases to prevent overshooting. This is the so-called「gradient descent」 that converges on some local minimum. Minimizing the cost function can guarantee better performance across all training samples. Viewed from a different perspective, the gradient vector of the cost function encodes the relative importance of weights and biases, i.e., which changes to which weights matter the most for minimizing the cost. The magnitude of each component represents how sensitive the cost is to each weight and bias.

In practice, most practitioners use a procedure called stochastic gradient descent(SGD). It involves randomly selecting a few input vectors as mini-batches, computing the corresponding outputs, errors, and the gradient descent step. The weights were adjusted accordingly. This process is repeated for many small subsets of examples from the training set until the average cost function stops decreasing. Each small subset of examples gives a noisy estimate of the average gradient over all examples, and thus the「stochasticity」. Despite its simplicity, SGD often achieves good results with far less computation time than more complex optimization techniques[5].
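A minimal sketch of the mini-batch stochastic gradient descent loop described above (illustrative only; the gradient function and data here are placeholders, not part of the original text):

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD: sample a mini-batch, estimate the gradient,
    and step the parameters in the opposite direction."""
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]
            grads = grad_fn(params, batch)      # noisy estimate of the gradient
            for key in params:
                params[key] -= lr * grads[key]  # step against the gradient
    return params
```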

3.2.2. Backpropagation

Recursively adjusting the weights in proportion to the activation of the second-to-last layer, and vice versa, or altering the biases to decrease the cost for a single training sample is a single round of digital learning. In a nutshell, the backpropagation procedure is an algorithm for computing the gradient descent efficiently. Calculating the gradient of a cost function with respect to the weights in a stacked multilayer module is a practical application of the chain rule of derivatives. A key insight is that the derivative of the cost function concerning the input can be obtained by reversing the order of the layers, working from the higher to the lower layers. The process of backpropagation entails computing gradients through all layers, from the uppermost layer where the network generates predictions down towards the lowermost layer where the external input is introduced. Once these gradients have been calculated, it is straightforward to determine the gradients with respect to the weights and biases of each module. The average of desired changes, obtained by traversing the backpropagation route for alternate training samples, is the optimal adjustment that the parameters can make so that the model performs better on the training set.

It was commonly thought that simple gradient descent would get trapped in suboptimal local minima, that is, weight configurations for which no small change would reduce the cost function, and that finding the global minimum would be an intractable task. Recent theoretical and empirical results strongly suggest that the landscape of the cost function is in fact filled with a huge number of saddle points where the gradient is zero, indicating that the optimization problem is more complex than originally thought, but that most of these points have similar cost function values[6]. In other words, the depth of the local minima is almost the same across these points, so it is not crucial which one the algorithm gets stuck at.

3.2.3. Convolutional neural networks (CNN)

Convolutional neural networks (CNNs) are easier to train and generalize better than other feedforward networks with fully connected layers. They are specifically designed to process data that come in the form of multiple arrays, such as a grayscale image consisting of a single 2D array of pixel intensities.

The four key ideas behind CNNs are inspired by the properties of natural signals and visual neuroscience: local connections, shared weights, pooling, and the use of many layers. The convolutional and pooling layers are directly inspired by the concepts of simple cells and complex cells, respectively, in the visual cortex, and the overall architecture is reminiscent of the LGN-V1-V2-V4-IT hierarchy in the ventral pathway of the visual cortex[7][8]. Local groups of values in array data often exhibit high correlation and form characteristic local motifs that can be readily identified. This is why CNNs are most useful for pattern recognition in images.

3.2.3.1. Convolution

The main function of the convolutional layer is to identify and extract local combinations of features from the preceding layer (Fig. 1). The actual matching is accomplished through filtering in the convolutional layer: a filter can be thought of as a small matrix of real numbers representing a feature, whose number of rows and columns of pixels, e.g. n × n, is set in advance. The filter and an image patch are lined up, each image pixel is multiplied by the corresponding filter element, and the results are summed and divided by the total number of pixels in the filter to arrive at a feature value. The feature value indicates how well the feature is represented at that position. Sliding the window across the image, the same procedure is repeated for every n × n block of pixels of the entire input, and a feature map, a "map" of where the filter's feature occurs, is obtained. All units in a feature map share the same filter bank; therefore, the local characteristics of images and other signals can be detected regardless of their location. In simpler terms, a pattern that appears in one part of the image can appear in any other part as well. Hence, units with identical weights are employed to identify corresponding patterns across various sections of the array. In a convolutional layer, filtering can be performed with a set of features to create a stack of filtered images; each feature map in a layer employs its own filter bank. From a mathematical standpoint, the filtering operation performed by a feature map is a discrete convolution, hence the name.

img

Fig. 1: Example of a filter (kernel) convolution. Note that the new pixel value shown in the figure has not yet been divided by the number of pixels in the filter window.
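
A minimal sketch of the filtering step described above, written in NumPy with a hypothetical 3×3 filter. Following the text and Fig. 1, each local sum is divided by the number of pixels in the filter, and the window here slides one pixel at a time.

```python
import numpy as np

def filter_image(image, filt):
    # Slide a small filter over the image and record how well it matches at each position.
    n, m = filt.shape
    H, W = image.shape
    feature_map = np.zeros((H - n + 1, W - m + 1))
    for i in range(H - n + 1):
        for j in range(W - m + 1):
            patch = image[i:i + n, j:j + m]
            # Element-wise product, summed, then divided by the filter size (as in the text)
            feature_map[i, j] = np.sum(patch * filt) / filt.size
    return feature_map

image = np.random.rand(8, 8)                       # dummy grayscale image
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])         # hypothetical vertical-edge feature
print(filter_image(image, edge_filter).shape)      # (6, 6) feature map
```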

3.2.3.2. Pooling

The aim of the pooling layer is to reduce the size of a feature map by merging similar features into a single one through the following steps: (1) choose an appropriate window size, usually 2×2 or 3×3 pixels; (2) pick a stride (how many pixels the window steps as it runs through a feature map) accordingly, usually 2 pixels; (3) walk the window by its stride across the filtered images; (4) take the maximum value in each window as the pooling result and form a "pooled map". By coarsening the position of each feature across all the feature maps fed into the layer, pooling makes motif detection more robust. Pooling lets the algorithm ignore exactly where in each window the maximum value occurs, making it less sensitive to small translational or rotational shifts: an image region that strongly matches the filter will still be picked up.
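
The four steps listed above translate directly into code. The sketch below (with a hypothetical 2×2 window and a stride of 2) performs max pooling over a single feature map.

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    # Take the maximum value in each window as it strides across the feature map.
    H, W = feature_map.shape
    out_h = (H - window) // stride + 1
    out_w = (W - window) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = feature_map[i * stride:i * stride + window,
                                j * stride:j * stride + window]
            pooled[i, j] = block.max()
    return pooled

fmap = np.random.rand(6, 6)        # dummy feature map from a convolutional layer
print(max_pool(fmap).shape)        # (3, 3) pooled map
```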

3.2.3.3. Normalization

To keep the math from blowing up, the output of a layer is then passed through a non-linearity such as a ReLU, which sets all negative values to 0. This procedure of nonlinear transformation is referred to as "normalization" in deep learning terms.

The CNN architecture involves stacking multiple stages of convolution, non-linearity (normalization), and pooling on top of each other, followed by a final fully connected layer (Fig. 2). The filter banks of the convolutional layers and the voting weights of the fully connected layer are all learned through the backpropagation algorithm. In the fully connected layer, also known as the dense layer because its many neurons are densely connected to one another, the list of feature values becomes a list of votes; multiplied by the weights that map this layer to the output layer, these votes give the final answer. It is worth noting that this list of votes in the fully connected layer looks a lot like a list of feature values. Indeed, the output of this layer can serve as intermediate categories and feed into the input of the next layer, continuing the cycle rather than becoming the final votes.

img

Fig. 2: Example of a CNN with various types of layers. The convolutional layer does not decrease the size, i.e., the number of pixels, of its input; rather, it encodes the features of that input. The pooling layer does decrease the size of its input; the amount by which the size decreases depends on the size of the pooling window and the stride.
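
As a toy illustration of the stacked architecture shown in Fig. 2 (a generic example, not the network used in this thesis), the following PyTorch sketch chains convolution, ReLU, pooling, and a final fully connected layer; all layer sizes and the 28×28 input are hypothetical.

```python
import torch
import torch.nn as nn

# Toy CNN: two convolution/non-linearity/pooling stages followed by a fully connected layer.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution: encode local features
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # pooling: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),                                # feature maps -> a flat list of feature values
    nn.Linear(16 * 7 * 7, 2),                    # fully connected "voting" layer, 2 classes
)

x = torch.randn(4, 1, 28, 28)                    # dummy batch of grayscale images
print(model(x).shape)                            # torch.Size([4, 2])
```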

3.3. Cross-modal fusion

Cross-modal fusion refers to the process of integrating data from different modalities. PET/CT is a classic example of cross-modal fusion. CT is an imaging modality that provides high-resolution, cross-sectional images with excellent clarity and density resolution. PET, on the other hand, is a nuclear imaging technique that generates images showing the spatial distribution of positron-emitting radiopharmaceuticals within the body. Although PET images provide less precise structural detail, they are well suited to displaying metabolic activity. PET/CT fuses CT with PET, combining anatomical detail with metabolic information. As each information stream possesses unique characteristics, single-modal data often do not contain all the effective features needed to produce accurate results, whether for data analysis or prediction tasks. Cross-modal deep learning models combine data from two or more modalities, learning different feature representations from different modalities and facilitating communication and transformation among information streams, to accomplish specific downstream tasks. This type of deep learning can improve the accuracy of predictions and enhance the robustness of models.

3.3.1. Cross-modal fusion methods

Cross-modal fusion methods can be categorized into three types: early fusion, late fusion, and hybrid fusion. In early fusion, unimodal features are combined into a single representation before the feature extraction or modeling process[9]. In late fusion, feature extraction or modeling is first performed separately on each modality, and the outputs are then integrated to learn concepts and obtain the final prediction[10]. Hybrid fusion combines early and late fusion, performing fusion at both the feature level and the output layer[11].

There are various methods of early fusion, including operating on elements at the same position in different modalities. For example, in the field of medical imaging, different imaging modalities can be fused into an integrated image. Nefian et al. proposed a cross-modal early fusion method that used both the factorial and the coupled hidden Markov model for audio-visual integration in speech recognition[12]. Early fusion was done by multiplying the corresponding elements of visual features, which capture mouth deformation over consecutive frames, with a vector representation of the frequency of audio observations learned by long short-term memory neural networks. A dimensionality reduction was then performed on the observation vectors obtained by concatenating the audio and visual features. Indeed, early fusion methods are often simple in structure with low computational complexity. However, the resulting feature is often high-dimensional, which can impose a significant computational burden on the subsequent model if dimensionality reduction is not performed.

As an example of late fusion, in 2014 Simonyan et al. proposed an architecture with separate spatial and temporal recognition streams for video, where the spatial stream recognizes actions from still video frames, whilst the temporal stream recognizes actions from motion in the form of dense optical flow[13]. The learned feature outputs are combined by late fusion via either averaging or a linear support vector machine (SVM). Since fusion significantly improves on either stream alone, the result demonstrates the complementary nature of the spatial and temporal streams and shows that cross-modal fusion indeed preserves more useful information for the algorithm. Late fusion does not explicitly consider inter-modality correlation at the feature level, which may result in a lack of interaction among different modalities at that level. Consequently, the feature representations obtained after cross-modal fusion may not be rich enough, potentially limiting the effectiveness of the fusion approach.

There is no single optimal solution for all cases, and the choice of fusion method should be made case by case.
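
A minimal sketch contrasting the two extremes, assuming co-registered CT and PET slices as two hypothetical single-channel inputs (this is not the fusion pipeline used in this thesis). Early fusion concatenates the modalities along the channel axis before a single shared model; late fusion runs one model per modality and combines the outputs, here by simple averaging.

```python
import torch
import torch.nn as nn

def small_cnn(in_channels, num_classes=2):
    # A tiny CNN backbone shared by both fusion strategies (assumes 64x64 inputs).
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(8 * 32 * 32, num_classes),
    )

early_model = small_cnn(in_channels=2)             # early fusion: one model over stacked channels
ct_model, pet_model = small_cnn(1), small_cnn(1)   # late fusion: one model per modality

ct = torch.randn(4, 1, 64, 64)                     # dummy CT batch
pet = torch.randn(4, 1, 64, 64)                    # dummy PET batch (assumed co-registered)

early_logits = early_model(torch.cat([ct, pet], dim=1))   # fuse at the input level
late_logits = (ct_model(ct) + pet_model(pet)) / 2         # fuse at the output level
print(early_logits.shape, late_logits.shape)              # both torch.Size([4, 2])
```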

3.3.2. Cross-modal image translation

Cross-modal image translation has gradually matured in the field of computer vision. Given sufficient training data, deep learning models are capable of learning discriminative features from images of different modalities, and the process of image-to-image translation can be viewed as transforming one potential representation of a scene to another.

In 2017, Isola et al. released Pix2Pix, a software tool that is effective at various image translation tasks, such as synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images[14]. A conventional CNN learns to minimize a hand-designed loss function, the crafting of which takes considerable manual effort. Pix2Pix adopts conditional Generative Adversarial Networks (GANs) that automatically learn the loss function used to train the mapping from input to output images, in addition to learning the mapping itself, as a generic solution to pixel-to-pixel prediction. These networks address a whole genre of problems that used to require very different loss functions. Conditional GANs differ from other formulations in that they treat output pixels as mutually dependent and thus learn a structured loss, which penalizes the joint configuration of the output. Pix2Pix performs well on many image translation tasks, but its ability to generate high-resolution images is suboptimal. In 2018, Wang et al. improved upon Pix2Pix by proposing a new image translation framework for synthesizing high-resolution photo-realistic images from semantic label maps using conditional GANs. Compared to Pix2Pix, this method has two main improvements: image translation at 2048×1024 resolution and semantic editing of images. To generate high-resolution images, their method uses a coarse-to-fine generator, composed of a global generator for coarse low-resolution conversion and a local enhancer for fine high-resolution conversion, together with a multi-scale discriminator architecture and a robust adversarial learning objective.

Additionally, it adds a low-dimensional feature channel to the input, which allows diverse result images to be generated from the same input label map. In 2017, Zhu et al. proposed the BicycleGAN model, built on Pix2Pix, which combines the conditional Variational Autoencoder GAN approach and the conditional Latent Regressor GAN approach[15]. BicycleGAN is a technique for multi-modal image translation that not only accomplishes the primary objective of mapping the input, together with a latent code, to the output, but also concurrently learns an encoder that maps the output back to the latent space. The bijection between the output and the latent space prevents multiple distinct latent codes from producing the same output (a non-injective mapping). BicycleGAN allows the generator to model a distribution over high-dimensional outputs, producing diverse and realistic results while remaining faithful to the input.

Accurately transforming specific objects between modalities is the main challenge of cross-modal image translation. Most cross-modal image translation methods require paired data as input, and because paired data are scarce, the translated images are often suboptimal or suffer from mode collapse, where the output represents only a limited number of real samples. Therefore, how to achieve high-quality cross-modal image translation with a small amount of paired data is a valuable direction for research.

3.4. The application of cross-modal deep learning

AI is increasingly being studied in metastatic skeletal oncology imaging, and deep learning has been assessed for tasks such as detection, classification, segmentation, and prognosis. Zhao et al. developed a deep neural network-based model to detect bone metastasis on whole-body bone scan (WBS), irrespective of the primary malignancy[16]. Compared to experienced nuclear medicine physicians, the deep learning model not only saved 99.88% of the time for the same workload, but also had better diagnostic performance, with improved accuracy and sensitivity. To overcome the constraint of the time-consuming effort required for precise labeling of large datasets, Han et al. proposed a 2D CNN classifier-tandem architecture named GLUE, which integrates whole-body and local patches, for WBS of prostate cancer patients[17]. 2D CNN modeling is the best fit for planar nuclear medicine scans, provided a massive amount of training data is available. The GLUE model had significantly higher AUCs than a whole-body-based 2D CNN model when the labeled dataset used for training was limited. Noguchi et al. developed a deep learning-based algorithm, with high lesion-based sensitivity and low false positives, to detect bone metastases in CT scans[18]. An observer study was also done to evaluate its clinical efficacy, which showed improved radiologists' performance when aided by the model, with higher sensitivity in both lesion-based and case-based analyses and less interpretation time. Fan et al. used AdaBoost and Chan-Vese algorithms to detect and segment sites of spinal metastasis of lung cancer on MRI images[19]. The Chan-Vese algorithm had the best performance: the accuracy of the segmentation, expressed in terms of DSC and Jaccard coefficient scores, was 0.8591 and 0.8002, respectively. Liu et al. built a deep learning model based on 3D U-Net algorithms for the automatic segmentation of pelvic bone and sites of prostate cancer metastases on MRI-DWI and T1-weighted MRI images[20]. The model was found to work best on patients with few metastases, supporting the use of CNNs as an aid in M-staging in clinical practice. Multiple deep classifiers were developed by Lin et al. to automatically detect metastases in 251 thoracic SPECT bone images[21]. The performance of the classifiers was found to be excellent, with an AUC of 0.98. Moreau et al. compared different deep learning approaches to segment bones and metastatic lesions in PET/CT images of breast cancer patients[22]. The results indicated that the U-NetBL-based approach for bone segmentation outperformed traditional methods, with a mean DSC of 0.94±0.03, whereas the traditional methods struggled to distinguish metabolically active organs from the bone draft.

Compared to the aforementioned deep learning examples, the more avant-garde cross-modal image fusion and translation techniques have not been widely investigated in bone metastasis imaging. Xu et al. adopted two different convolutional neural networks for lesion segmentation and detection and combined the spatial feature representations extracted from the two modalities of PET and CT[23]. Their cross-modal method accomplished three-dimensional detection of multiple myeloma, outperforming traditional machine learning methods. The research conducted by Wang et al. revealed that texture features extracted from multiparametric prostate MRI before intervention, when combined with clinicopathological risk factors such as free PSA level, Gleason score, and age, could effectively predict bone metastasis in patients with prostate cancer[24]. The outcome of this study can be seen as a proof of concept for the significance of cross-modal data.

Even though cross-modal investigations of bone metastases have been limited so far, there is ample evidence of the utility of cross-modal fusion in oncological imaging, and these applications and lines of reasoning can be readily extrapolated to the field of osseous metastasis imaging. Cross-modal fusion can be applied to tasks such as tumor detection, segmentation, and classification to improve the performance of deep learning models, and cross-modal image translation can be used for data augmentation to facilitate various downstream tasks.

Cross-modal fusion methods are often employed to enrich models with cross-modal image features, thus improving the performance of tumor detection. In deep learning-based cross-modal tumor detection algorithms, convolutional neural networks are further used to capture the relationships between adjacent pixels and extract effective features from the image. In 2021, Huang et al. proposed a ResNet-based framework, AW3M, that used ultrasonography of four different modalities jointly to diagnose breast cancer[25]. By combining the cross-modal data, AW3M, built upon a multi-stream CNN equipped with a self-supervised consistency loss, extracts both modality-specific and modality-invariant features, with improved diagnostic performance.

As for tumor segmentation, many researchers rely on either the four MRI image modalities or the two modalities of PET/CT, encompassing anatomical and metabolic information, to perform cross-modal fusion and improve segmentation performance. For instance, Ma et al. explored CNN-based cross-modal approaches for automated nasopharyngeal carcinoma segmentation[26]. Their proposed multi-modality CNN utilizes CT and MRI to jointly learn a cross-modal similarity metric and fuse complementary features at the output layer to segment paired CT-MR images, demonstrating exceptional performance. Additionally, the study combines the features extracted from each modality's single-modality CNN and the multi-modality CNN to create a combined CNN that capitalizes on the unique characteristics of each modality, thereby further improving segmentation performance. Fu et al. introduced a deep learning-based framework for multimodal PET-CT segmentation that leverages PET's high tumor sensitivity in 2021[27]. Their approach utilized a multimodal spatial attention module to highlight tumor regions and suppress normal regions with physiologically high uptake in the PET input. The spatial attention maps generated by the PET-based module were then used to target a U-Net backbone for the segmentation of areas with higher tumor likelihood at different stages from CT images. Results showed that their method surpasses the state-of-the-art lung tumor segmentation approach by 7.6% in the Dice similarity coefficient.

As the diagnostic procedure often requires the integration of multi-modal information, such as chief complaints, physical examinations, medical histories, laboratory tests, and radiological imaging, cross-modal fusion methods are also commonly utilized in disease classification tasks. Cross-modal fusion synthesizes data from different modalities to enrich effective feature representations, enabling deep learning models to extract useful information from different modalities to aid in diagnosis. Zhang et al. proposed a technique for prostate cancer diagnosis using a multi-modal combination of B-mode ultrasonography and sonoelastography[28]. Quantitative features such as intensity statistics, regional percentile features, and texture features were extracted from both modalities, and an integrated deep network was proposed to learn and fuse these multimodal ultrasound imaging features. The final step of disease classification was completed by a support vector machine.

Due to the relative scarcity of medical images, cross-modal image translation is often used to synthesize part of the training data, serving as a data augmentation method that yields a better-performing deep learning model with a small sample size. Since integrated data from different modalities often yield better performance in deep learning models, the multi-modal image data generated by cross-modal image translation methods can be directly used as targets for tumor detection. A two-step approach for semi-supervised tumor segmentation using MRI and CT images was proposed by Jiang et al.[29]. The first step is a tumor-aware unsupervised cross-modal adaptation using a target-specific loss to preserve tumors on MRIs synthesized from CT images. The second step involves training a U-Net model with synthesized and limited original MRIs using semi-supervised learning. Semi-supervised learning boosted the accuracy of tumor segmentation to 80% by combining labeled pre-treatment MRI scans with synthesized MRIs, whereas training with synthesized MRIs alone had an accuracy of 74%. The proposed approach demonstrated the effectiveness of tumor-aware adversarial cross-modal translation for accurate cancer segmentation from limited imaging data.

In general, there has been a wealth of research supporting the application of deep learning in bone metastasis imaging, but the specific application of cross-modal fusion methods is still lacking. Yet clinical evaluations of bone metastasis often require multi-modal data, such as a chief complaint of lower back pain, a past medical history of pathological fractures, a positive genetic test for specific mutations indicating a higher risk of bone metastasis, or increased blood calcium and alkaline phosphatase concentrations in laboratory reports. Therefore, evaluating osseous lesions with multi-modal data can improve the specificity of diagnosis and reduce false positive rates in the diagnostic and treatment process. The application of cross-modal deep learning methods in the field of bone metastasis imaging and diagnosis is worth further exploration.

3.5. Conclusion

The above review covers the definition and basic principles of deep learning and of cross-modal image generation and fusion methods, briefly describes some common cross-modal deep learning algorithms, and summarizes current research on the application of deep learning models in medical imaging, especially bone metastasis imaging. Compared to traditional deep learning models fed with single-modality input, multi-modal methods are more recent, with a limited body of relevant research. Given the increasing prevalence of cancer screening and the significant surge in patient-specific clinical data, including radiographs and laboratory tests, it is reasonable to anticipate an unparalleled demand for advanced, intelligent cross-modal deep learning methods in the future. Nevertheless, the use of AI in medical imaging analysis faces various challenges and limitations. These include the need for extensive and diverse datasets for training and validation, the potential for bias and overfitting, as well as the inherent black-box nature of deep learning algorithms[30]. Even though the demand for a large training set reiterates the merit of cross-modal deep learning, which enables the automatic generation of sample images through cross-modal image translation, the size of the training set still has a profound impact on the performance of algorithms. In parallel, the demand for "explainability" has led to the notion of "interpretable machine learning", which utilizes heat maps and metrics to track the focus of deep neural networks[31]. Overall, there is still much to be investigated regarding the application of cross-modal deep learning in the field of medical imaging.

In summary, this project is founded on the application of cross-modal deep learning techniques to offer practical solutions to challenges encountered in the clinical setting.

參考文獻

[1] Dong X, Wu D. A rare cause of peri-esophageal cystic lesion[J]. Gastroenterology, 2023, 164(2): 191-193.

[2] Aswathi R R, Jency J, Ramakrishnan B, et al. Classification Based Neural Network Perceptron Modelling with Continuous and Sequential data[J]. Microprocessors and Microsystems, 2022: 104601.

[3] Gardner M W, Dorling S R. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences[J]. Atmospheric Environment, 1998, 32(14-15): 2627-2636.

[4] Glorot X, Bordes A, Bengio Y. Deep Sparse Rectifier Neural Networks[J]. Journal of Machine Learning Research, 2011, 15: 315-323.

[5] Bottou L, Bousquet O. The tradeoffs of large scale learning[J]. Advances in Neural Information Processing Systems, 2007, 20: 1-8.

[6] Dauphin Y, et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization[J]. Advances in Neural Information Processing Systems, 2014, 27: 2933-2941.

[7] Hubel D H, Wiesel T N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex[J]. Journal of Physiology, 1962, 160(1): 106-154.

[8] Cadieu C F, Hong H, Yamins D, et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition[J]. PLoS Computational Biology, 2014, 10(12): e1003963.

[9] Nefian A V, Liang L, Pi X, et al. Dynamic Bayesian networks for audio-visual speech recognition[J]. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): 1-15.

[10] Snoek C G M, Worring M, Smeulders A W M. Early versus late fusion in semantic video analysis[C]. Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005: 399-402.

[11] Wu Z, Cai L, Meng H. Multi-level fusion of audio and visual features for speaker identification[C]. International Conference on Biometrics. Springer, Berlin, Heidelberg, 2005: 493-499.

[12] Nefian A V, Liang L, Pi X, et al. Dynamic Bayesian networks for audio-visual speech recognition[J]. EURASIP Journal on Advances in Signal Processing, 2002, 2002(11): 1-15.

[13] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[J]. Advances in Neural Information Processing Systems, 2014, 27: 568-576.

[14] Isola P, Zhu J Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1125-1134.

[15] Zhu J Y, Zhang R, Pathak D, et al. Toward multimodal image-to-image translation[J]. Advances in Neural Information Processing Systems, 2017, 30: 465-476.

[16] Zhao Z, Pi Y, Jiang L, Xiang Y, Wei J, Yang P, et al. Deep neural network based artificial intelligence assisted diagnosis of bone scintigraphy for cancer bone metastasis[J]. Scientific Reports, 2020, 10(1): 17046.

[17] Han S, Oh J S, Lee J J. Diagnostic performance of deep learning models for detecting bone metastasis on whole-body bone scan in prostate cancer[J]. European Journal of Nuclear Medicine and Molecular Imaging, 2021, 49(2): 1-11.

[18] Noguchi S, Nishio M, Sakamoto R, Yakami M, Fujimoto K, Emoto Y, et al. Deep learning-based algorithm improved radiologists' performance in bone metastases detection on CT[J]. European Radiology, 2022, 32(11): 7976-7987.

[19] Fan X, Zhang X, Zhang Z, Jiang Y. Deep learning on MRI images for diagnosis of lung cancer spinal bone metastasis[J]. Contrast Media & Molecular Imaging, 2021, 2021(1): 1-9.

[20] Liu X, Han C, Cui Y, Xie T, Zhang X, Wang X. Detection and segmentation of pelvic bones metastases in MRI images for patients with prostate cancer based on deep learning[J]. Frontiers in Oncology, 2021, 11: 773299.

[21] Lin Q, Li T, Cao C, Cao Y, Man Z, Wang H. Deep learning based automated diagnosis of bone metastases with SPECT thoracic bone images[J]. Scientific Reports, 2021, 11(1): 4223.

[22] Moreau N, Rousseau C, Fourcade C, Santini G, Ferrer L, Lacombe M, et al. Deep learning approaches for bone and bone lesion segmentation on 18FDG PET/CT imaging in the context of metastatic breast cancer[C]. 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society, 2020: 1532-1535.

[23] Xu L, Tetteh G, Lipkova J, et al. Automated whole-body bone lesion detection for multiple myeloma on 68Ga-pentixafor PET/CT imaging using deep learning methods[J]. Contrast Media & Molecular Imaging, 2018, 2018: 2391925.

[24] Wang Y, Yu B, Zhong F, Guo Q, Li K, Hou Y, et al. MRI-based texture analysis of the primary tumor for pre-treatment prediction of bone metastases in prostate cancer[J]. Magnetic Resonance Imaging, 2019, 60: 76-84.

[25] Huang R, Lin Z, Dou H, et al. AW3M: An auto-weighting and recovery framework for breast cancer diagnosis using multi-modal ultrasound[J]. Medical Image Analysis, 2021, 72: 102137.

[26] Ma Z, Zhou S, Wu X, et al. Nasopharyngeal carcinoma segmentation based on enhanced convolutional neural networks using multi-modal metric learning[J]. Physics in Medicine & Biology, 2019, 64(2): 025005.

[27] Fu X, Bi L, Kumar A, et al. Multimodal spatial attention module for targeting multimodal PET-CT lung tumor segmentation[J]. IEEE Journal of Biomedical and Health Informatics, 2021, 25(9): 3507-3516.

[28] Zhang Q, Xiong J, Cai Y, et al. Multimodal feature learning and fusion on B-mode ultrasonography and sonoelastography using point-wise gated deep networks for prostate cancer diagnosis[J]. Biomedical Engineering/Biomedizinische Technik, 2020, 65(1): 87-98.

[29] Jiang J, Hu Y C, Tyagi N, et al. Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation[C]. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2018: 777-785.

[30] Castelvecchi D. Can we open the black box of AI?[J]. Nature, 2016, 538(7623): 20.

[31] Kuang C. Can A.I. Be Taught to Explain Itself?[J]. The New York Times, 2017, 21.

致謝

First, I thank my supervisor, Academician Qiu Guixing; his constant guidance will always be remembered. I thank my senior colleague Wu Nan, whose recognition and support I can never fully repay. I thank Dr. Wu Dong; it has been my great fortune to have you accompany me through a stretch of wind and snow. I also thank all the research collaborators on this project and the fellow students in our group; your help allowed this research to proceed smoothly.

I thank all the teachers I have met at Peking Union Medical College; though an unpromising student, I hope that in the future I, like you, will prove worthy of the white coat.

Finally, I thank my family. Wherever I go, this family is my fortress.

In this world of long roads and hurried horses, may you and I simply be safe and happy.
