
Deep learning-based weak micro-defect detection on an optical lens surface with micro vision

Open Access

Abstract

To solve the limited efficiency and reliability of current manual quality control processes in optical lens (OL) production environments, we propose an automatic micro vision-based inspection system, named MVIS, used to capture surface defect images, build the OL dataset, and perform predictive inference. Owing to their low resolution and recognizability, OL defects are weak, with ambiguous morphology and micro size, which leads to poor detection by existing methods. We propose a deep-learning weak micro-defect detector named ISE-YOLO, which makes the best use of deep layers, utilizes the ISE attention mechanism module in the neck, and introduces a novel class loss function to extract richer semantics from the convolution layers and learn more information. Experimental results on the OL dataset show that ISE-YOLO delivers better performance, with mean average precision, recall, and F1 score increasing by 3.62%, 6.12% and 3.07% respectively, compared to YOLOv5. In addition, compared with YOLOv7, the latest version in the YOLO series, the mean average precision of ISE-YOLO is improved by 2.58%, while the weight size is decreased by more than 30% and the speed is increased by 16%.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As carriers of optical functions, OLs play a vital role in the development of various optical instruments, for example lasers [1,2], projectors [3] and other novel devices. Therefore, the quality inspection of OLs is significant. Generally speaking, the most common lens defects are scratches, bursting cracks and spots [4], which are of micro size and irregular shape. Until now, much of the defect inspection work on lenses, including classification and recording, has been done manually, which leads to the following drawbacks: it is time-consuming, highly subjective, and causes eye strain from long-term work. Thus, replacing traditional manual detection with machine vision is an inevitable result of industrial development.

Defect detection of glass consists of feature extraction and classification. With the rapid development of computer technology, defect detection based on computer vision has attracted considerable attention. However, a lens is a kind of glass, which poses several difficulties for defect detection because a transparent optical lens makes the object areas dark and uneven. (1) The uneven distribution of light has a large impact on the detection of the glass substrate, causing local differences in the image. (2) Because of the micro size and irregular shape of lens defects, and because the gray value of a defect in the acquired image is very close to the background, defects are difficult to detect. (3) The background texture of the acquired image is complex and frequently crosses defects, making detection more difficult. To overcome these difficulties, Yousefian-Jazi et al. proposed a method utilizing wavelet filters and an improved Support Vector Machine (SVM) to extract features and classify them, respectively [5]. Jian et al. extracted defect features to reduce the feature dimensionality based on the multifractal spectrum and addressed imbalanced defect classification using NSVs (non-support vectors) in the minority classes [6]. In the industrial production of glass, however, the above traditional image processing methods suffer from low accuracy, poor real-time performance, and poor robustness. Therefore, a higher-accuracy approach, the deep learning-based convolutional neural network (CNN), is now more widely studied.

In recent years, the object detection algorithm, a kind of CNN algorithm, has received extensive attention in the field of computer vision due to its important research value for applications. Such models are capable of high-speed object detection and object bounding box segmentation, integrating defect classification and localization. Yu et al. proposed an end-to-end place recognition model based on a VGG-16 or ResNet-18 network to form a novel end-to-end deep neural network that can be easily trained via standard backpropagation, and their experimental results show that the proposed model achieves a strong recognition effect [7]. Paszkiel et al. used ConvNets for feature extraction from complex EEG signals and achieved high accuracy, demonstrating the superiority of deep learning-based convolutional neural networks in the field of image processing [8]. In the field of object detection, Faster RCNN [9] (faster region-based convolutional neural network) and YOLO [10] (you only look once) offer high accuracy and fast speed. Unlike the Faster RCNN model, which uses a two-stage detection framework, the YOLO model employs a regression mechanism, meaning that the target can be inferred in one step. Therefore, YOLO is superior to Faster RCNN in detection speed, while slightly worse in detection accuracy, especially for weak and small objects. As the fifth iteration of YOLO, YOLOv5 was proposed with a series of improvements, such as auto-learning bounding box anchors and the cross-stage partial network, giving it fast convergence, high precision, and strong customizability. It also has strong real-time processing capability and low hardware computing requirements, so it can be easily ported to mobile devices. These advantages help ensure object detection accuracy, but the speed and accuracy requirements of OL defect detection in this paper are very high, and detection with existing methods remains unsatisfactory due to the irregular shapes, multiple scales and weak features of the defects. Therefore, we propose a new method combined with the ISE attention network, ISE-YOLO, built on YOLOv5 to achieve high performance on OL images.

The main contributions of this work are as follows:

  • 1. We set up a platform, MVIS, which can automatically capture OL images and meet the actual production needs of enterprises, and used it to provide all the images required by the experiments in this paper.
  • 2. For high-recall OL detection, we designed the ISE-YOLO model to force the network to pay more attention to weak micro-defects and to make the model more robust by learning more distinguishable features, obtaining higher performance than other structures.
  • 3. The proposed ISE attention module is a universal plug-and-play CNN module, which can be utilized in the neck to improve detection accuracy significantly.
  • 4. To address data imbalance, we improved the ISE-YOLO detection performance using the introduced class loss function.

2. Principle

2.1. Micro vison-based image acquisition system

An image acquisition system commonly includes illumination equipment, optical components, a sensor and a signal processing unit [11]. The choice of imaging method determines the efficiency and accuracy of defect detection. However, since an OL is a high-precision device, its defects are small, down to the micro/nano scale, and cannot be detected by normal machine vision. To solve this problem effectively, we propose an OL image acquisition system based on micro vision.

Figure 1 illustrates the schematic diagram of the proposed micro vision-based OL surface weak-defect detection system, named the micro vision-based inspection system (MVIS) for short. The system consists of three main parts: the microscopic imaging unit (MIU), the image processing unit (IPU), and the motion control unit (MCU). The MIU is based on the microscope objective, the light source, and the charge-coupled device (CCD) sensor, and generates magnified images of the detected defects. The IPU, comprising the image acquisition card (IAC), which converts the analog signal acquired by the sensor into a digital signal, and the personal computer (PC), transmits and processes the image data efficiently. The MCU, composed of the XY-platform and a controller, can capture images of all areas of the OL automatically. In the optical path, the beams generated by the light source pass through a lens and become parallel. The parallel beams then pass through the beam splitter prism (BS) and are focused on the OL through the objective lens. Because the surface of the OL is smooth, the beams are reflected back through the objective lens. Finally, the beams are focused on the CCD sensor by a lens, producing the image data. Concretely, to capture the image information of weak micro-defects effectively while reducing the computational burden, a 30× objective lens and a CCD of 640 pixels × 480 pixels with a pixel size of 4.375 µm were selected. As the light source, a white LED is a great option for MVIS, being free from instrument interference and low in energy consumption. For automatic inspection, the XY-platform moves the OL in the X and Y directions, traversing the whole surface. The controller can command the movement locus and speed to adapt to various workpieces.
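As a back-of-the-envelope check (our own estimate, assuming the nominal 30× magnification maps the sensor plane directly onto the object plane), the sampling on the OL surface is

$$\Delta = \frac{4.375\,\mathrm{\mu m}}{30} \approx 0.146\,\mathrm{\mu m}\;\textrm{per pixel},\qquad \textrm{FOV} \approx (640 \times 0.146\,\mathrm{\mu m}) \times (480 \times 0.146\,\mathrm{\mu m}) \approx 93\,\mathrm{\mu m} \times 70\,\mathrm{\mu m},$$

so each capture resolves sub-micron detail over a field of roughly 93 µm × 70 µm, which is why the XY-platform must traverse the whole lens surface.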

Fig. 1. The micro vision-based surface detection system of multiple defects for OL.

2.2. Inspection process

Figure 2 shows the flow chart of the proposed ISE-YOLO methodology, which mainly includes training and evaluation stages. Firstly, a large number of OL images are captured by MVIS, and we screen out those with defects. Then, we aggregate all images to define and annotate the defect categories, which are described in Section 2.3. In the training stage, mosaic data augmentation [12] is used to enrich the dataset and enhance network robustness. After that, ISE-YOLO is designed for defects of irregular shape, multiple scales and weak features; Section 3 gives a detailed description of it, covering the network architecture, the ISE attention module and the PolyLoss function. To get a better model, we adjust hyperparameters such as the learning rate and the number of iterations to optimize performance. Besides, we evaluate each training epoch and save the model weights with the best performance. In the evaluation stage, the images are clipped to the same shape as during training to ensure that the inference logic matches the training process. Then, the trained model is loaded and used for defect detection on the evaluation set. Finally, to demonstrate the advantage of the proposed method and the effectiveness of inspection on the production line, we conduct analytical experiments in Section 4 by comparing the detection performance with other deep learning-based defect detection algorithms.

Fig. 2. Overall methodology of implementing the OL defect detection model based on ISE-YOLO.

2.3. Defects dataset

In this study, the training and evaluation images are captured by MVIS. We collected enough common defect images covering eight classes: scratch, bursting crack, spots, bubble, mura, calymma, degumming and demoulding. In addition to the defects, to cooperate with the XY-platform for automatic image capture, we added a class named edge to detect the edge of the OL. In total, 1059 images of the eight defects and the edge were obtained and stored in BMP format at a resolution of 640 pixels × 480 pixels, of which 898 are used as training images and 161 as evaluation images. The surface defects in OL images are hard to detect due to their ambiguous boundaries, great variety of morphologies and micro size, which are the major concern of this paper. The distribution of each category and its defect features is shown in Table 1.

Table 1. Dataset information: Train/val represent the number of images of the corresponding category in the training and evaluation datasets, respectively. Real figure shows an example of a zoomed-in image. Defect feature describes the characteristics of the nine categories.

These defects are caused by mechanical errors or improper operation by workers, which cannot be completely avoided in actual production. Because certain defect types rarely appear in the actual industrial process, the sample distribution is uneven, with scratch the most numerous class and bubble and calymma the smallest, resulting in data imbalance. To address this problem, mosaic data augmentation is used in this paper. This augmentation method combines four images into one in a certain proportion, applying transforms such as horizontal mirror flipping, random brightness adjustment, random small crops and fixed scaling, to increase background complexity, reduce computation and training time, and improve robustness; a code sketch follows Fig. 3. Examples are shown in Fig. 3. In addition, the LabelImg tool is used to mark the category and location information of all image targets in this paper and save them to txt files for network training; a sample annotation is shown below.
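For concreteness, a hypothetical annotation file in LabelImg's YOLO txt export might look like the following; the class indices and box values are purely illustrative, with each line giving a class id followed by a box center, width and height normalized to [0, 1]:

```
0 0.512 0.304 0.041 0.188
3 0.173 0.722 0.025 0.027
```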

Fig. 3. An example of mosaic augmentation.
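To make the augmentation step concrete, below is a minimal sketch of the four-image mosaic in Python/NumPy. It assumes each source image is at least as large as the output canvas; label remapping, mirror flipping and brightness jitter are omitted for brevity, and the function name is our own:

```python
import random
import numpy as np

def mosaic4(images, out_size=480):
    """Combine four HxWx3 uint8 images into one mosaic around a random center."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    cx = random.randint(out_size // 4, 3 * out_size // 4)  # random mosaic center x
    cy = random.randint(out_size // 4, 3 * out_size // 4)  # random mosaic center y
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        sy = random.randint(0, img.shape[0] - h)  # random crop from the source
        sx = random.randint(0, img.shape[1] - w)
        canvas[y1:y2, x1:x2] = img[sy:sy + h, sx:sx + w]
    return canvas
```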

3. Proposed method

3.1. Network architecture

Figure 4 shows the network architecture of ISE-YOLO. Firstly, the input image changes from $480 \times 480 \times 3$ to $240 \times 240 \times 12$ after the focus operation (a minimal sketch of this slicing is given below). Then, in the backbone network, CSPDarkNet53 [13] is used to enhance the feature fusion of the model through CSP residual structures and extract rich information from shallow and deep feature maps. In the shallow layers, the resolution of the feature map is high, but the feature semantics extracted by the network are weak. In contrast, the resolution of the high-level feature maps is low, whereas the convolutional neural network extracts the semantic information of high-level features better as the network deepens. Therefore, the neck adopts a feature pyramid linked with PANet [14] (Path Aggregation Network), whose top-down and bottom-up fusion structures effectively deepen the network, fuse the multiscale features extracted from the backbone, and strengthen the feature semantics of defects at three different scales. For weak micro-defects with few features in OL images, deep convolution layers readily extract rich semantic information about defect category and location. Hence, we introduce the improved Squeeze-and-Excitation (ISE) module at the three deepest layers of the neck, which is discussed in Section 3.2. In addition, to maximize the extraction of features conducive to detecting weak micro-defects in OL images, it is necessary to take full advantage of the high-level semantic information in the deep layers of the convolutional neural network, so the three output dimensions of the neck are set to ${{{\mathbb R}}^{60 \times 60 \times 384}}$, ${{{\mathbb R}}^{30 \times 30 \times 384}}$ and ${{{\mathbb R}}^{15 \times 15 \times 768}}$ respectively. In this way, without substantially increasing the size of the network model or the complexity of the algorithm, the ability to extract deep feature information is enhanced, which is conducive to detecting weak micro-defects. Afterward, in the detection head, three sets of output feature maps are processed, and the anchor boxes are applied to the inference results to generate the final output vectors with a class probability score, a confidence score, and a bounding box. Finally, we exploit the PolyLoss [15] function to calculate the class loss. In the experiments, the proposed network shows a qualitative improvement over the YOLO family of object detectors in detection accuracy, inference speed, and network parameters.
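For reference, the following minimal sketch (assuming the standard YOLOv5-style space-to-depth slicing) shows how the focus operation turns a $480 \times 480 \times 3$ tensor into $240 \times 240 \times 12$:

```python
import torch

def focus(x: torch.Tensor) -> torch.Tensor:
    """Stack the four interleaved pixel sub-grids along the channel axis."""
    return torch.cat([x[..., ::2, ::2],     # even rows, even cols
                      x[..., 1::2, ::2],    # odd rows, even cols
                      x[..., ::2, 1::2],    # even rows, odd cols
                      x[..., 1::2, 1::2]],  # odd rows, odd cols
                     dim=1)

x = torch.randn(1, 3, 480, 480)
print(focus(x).shape)  # torch.Size([1, 12, 240, 240])
```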

Fig. 4. The network architecture of ISE-YOLO.

3.2. ISE attention module

The attention mechanism helps capture weak features in most computer vision tasks and has been widely used in object detection; it effectively selects object information, and more weight can be assigned to small and weak objects to improve the feature expression of weak defects for accurate detection [16]. The Squeeze-and-Excitation [17] attention module allows the network to perform feature recalibration, increasing the accuracy of a classification model using only a one-dimensional channel self-attention map. Therefore, in the neck of ISE-YOLO, we utilize the ISE attention module at the deepest layers, through which the network learns to use global information to selectively emphasize informative features and suppress less useful ones during the deepest feature extraction.

Figure 5 shows the specific structure of the ISE attention mechanism, which is mainly divided into three steps: squeeze-and-excitation, scale, and fuse. In the top branch, the squeeze operation exploits global average pooling (${F_{sq}}$), expressed as Eq. (1), to compress the input $X \in {{{\mathbb R}}^{H \times W \times C}}$ to a $1 \times 1 \times C$ matrix, squeezing its global spatial information, and then two fully connected layers add nonlinearity. The relationship between the channels is made flexible by these two fully connected layers, which fit the complex correlations between channels, keep the computational burden as low as possible, and produce the weight values. In this way, the network can significantly enhance the learning of convolutional features by explicitly modelling the interdependencies between their channels while adding only a slight increase in model parameters.

$${F_{sq}}(X) = \frac{1}{{H \times W}}\sum\limits_{i = 1}^H {\sum\limits_{j = 1}^W {X(i,j)} }$$
$${F_{ex}}({F_{sq}},\omega ) = \sigma ({\omega _1}\delta ({\omega _2}{F_{sq}}))$$

Fig. 5. The architecture of the ISE attention module.

In Eq. (1), the $1 \times 1 \times C$ matrix is generated from the input $X$ by averaging over its spatial dimensions $H \times W$, and $i$, $j$ index the $i$-th row and $j$-th column of the feature maps, respectively. In Eq. (2), $\sigma$ denotes the sigmoid activation function, $\delta$ represents the ReLU activation function, $\omega$ is the weight, and ${\omega _1}$ and ${\omega _2}$ represent the two different fully connected layers. After that, to create a self-attention function over channels whose relationships are not confined to the local receptive field the convolutional filters respond to, the weight values calculated in the first step are multiplied back onto the input $X$ through the scale operation given by

$$V = {F_{scale}}(X,{F_{ex}}) = X \cdot {F_{ex}}, $$
where ${F_{scale}}(X,{F_{ex}})$ denotes channel-wise multiplication of the input $X$ with the weight values obtained in the ${F_{ex}}$ stage.

In the bottom branch, after a 3 × 3 CBL block, which includes a standard two-dimensional convolution with a 3 × 3 kernel, batch normalization, and a ReLU activation, the input $X$ is transformed into deeper high-level feature maps $U \in {{{\mathbb R}}^{H \times W \times C}}$ for the subsequent operation. In the fuse operation, the result $Y \in {{{\mathbb R}}^{H \times W \times C}}$ is generated by passing the concatenation $[V,U] \in {{{\mathbb R}}^{H \times W \times 2C}}$ of $V$ and $U$ through a 1 × 1 CBL block, which restores the channel size of the input and integrates the information of the two stages for richer semantics.

ISE has a strongly adaptive structure that makes it easy to embed directly in a network. We believe that the small increase in parameters incurred by the ISE module is justified by its contribution to model performance.
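Putting Eqs. (1)–(3) and the two branches together, a minimal PyTorch sketch of the ISE module might look as follows; the class name, the channel reduction ratio r, and other implementation details are our assumptions rather than the paper's exact code:

```python
import torch
import torch.nn as nn

class ISE(nn.Module):
    """Sketch of the ISE module of Section 3.2: an SE branch (V) and a
    3x3 CBL branch (U), fused by concatenation and a 1x1 CBL block."""
    def __init__(self, channels: int, r: int = 16):  # r is an assumed reduction ratio
        super().__init__()
        # Top branch: squeeze (global average pool) + excitation (two FC layers)
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )
        # Bottom branch: 3x3 CBL block (conv + batch norm + ReLU)
        self.cbl3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Fuse: 1x1 CBL restores the input channel count after concatenation
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        v = x * w                     # Eq. (3): channel-wise rescaling of X
        u = self.cbl3(x)              # deeper high-level feature maps U
        return self.fuse(torch.cat([v, u], dim=1))  # Y from [V, U]
```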

3.3. PolyLoss

To address the class imbalance caused by objective production factors, we exploit a better class loss function, PolyLoss, which builds on the strong multi-class performance of the focal loss, improves accuracy on that basis, and has been shown to be robust to label noise in the training data. It decomposes the focal loss into a series of weighted polynomial bases, given by

$${L_{poly \textrm{-} N}} = {L_{FL}}({P_t},\gamma ) + \sum\limits_{n = 1}^N {\varepsilon _n}{(1 - {P_t})^{n + \gamma }},$$
where ${P_t}$ is the model's predicted probability of the target ground-truth class; ${\varepsilon _n} \in [ - 1/n,\infty )$ is the coefficient of the $n$-th perturbation term; and ${L_{FL}}$ and $\gamma$ represent the focal loss [18] and its coefficient, respectively.

${L_{poly \textrm{-} N}}$ allows the polynomial coefficients to be tuned for different tasks and datasets, which experiments have shown to outperform the focal loss. We keep only its first term, truncating all higher-order polynomial terms, as shown in Eq. (5), which introduces two hyperparameters. To find appropriate values for them, we conducted a series of controlled-variable experiments and finally set $\varepsilon$ to 1 and $\gamma$ to 1.1. Thus, the class loss function in this paper can be expressed as Eq. (6).

$${L_{poly \textrm{-} 1}} = {L_{FL}}({P_t},\gamma ) + \varepsilon {(1 - {P_t})^{1 + \gamma }}$$
$$Los{s_{cls}} = {L_{FL}}({P_t},1.1) + {(1 - {P_t})^{2.1}}$$
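For illustration, a minimal PyTorch sketch of Eqs. (5)–(6), assuming a sigmoid-based one-vs-all class head as in YOLO-style detectors (the function name and the numerical-stability clamp are our additions):

```python
import torch

def poly1_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                     epsilon: float = 1.0, gamma: float = 1.1) -> torch.Tensor:
    """Poly-1 class loss: focal loss plus one polynomial perturbation term."""
    p = torch.sigmoid(logits)
    # pt: predicted probability of the ground-truth class (targets in {0, 1})
    pt = targets * p + (1.0 - targets) * (1.0 - p)
    focal = -((1.0 - pt) ** gamma) * torch.log(pt.clamp(min=1e-8))  # L_FL(pt, gamma)
    poly1 = epsilon * (1.0 - pt) ** (gamma + 1.0)                   # eps * (1 - pt)^(1+gamma)
    return (focal + poly1).mean()
```

With epsilon = 1 and gamma = 1.1 this reduces to Eq. (6).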

In addition, the bounding box loss (Lossbox) and the confidence loss (Lossconf) are calculated through BCE (binary cross entropy) with logits loss, and the total loss for each detection layer is given by

$$Loss = Los{s_{cls}} + Los{s_{box}} + Los{s_{conf}}$$

4. Experimental analysis

To verify the performance of ISE-YOLO, the same OL dataset is used for training comparisons with Faster RCNN, YOLOv3 [19], YOLOv6 [20], two YOLOv5 models (YOLOv5m and YOLOv5l), and YOLOv7 [21]. In this section, we use the same experimental configuration and evaluation metrics to compare the latest detection algorithms with the model proposed in this paper in terms of detection accuracy, speed, parameters, etc. Then, an ablation experiment is carried out on the proposed model to test the performance gains brought by the different improvements.

4.1. Experimental platform

Figure 6 shows the experimental MVIS setup, built according to the framework of Fig. 1. For the detection algorithm, the PC hardware configuration includes an AMD Ryzen 7 5800X 8-core processor at 3.80 GHz and an NVIDIA GeForce RTX 3080 Ti graphics card (single GPU) with 44 GB of memory. The software environment is the Windows 10 Professional 64-bit operating system. The detection model was built with the PyTorch framework, using Python 3.8.12 as the programming language, CUDA 11.1 as the GPU computing platform, and cuDNN 10.1 as the deep learning GPU acceleration library.

Fig. 6. The experimental setup of MVIS.

4.2. Evaluation metrics

To evaluate the performance of the detection model, Precision (P), Recall (R), mean Average Precision (mAP), and the F1 score are used for quantitative analysis. The detection speed of the model is evaluated by Frames Per Second (FPS). P, R, and F1 are given by Eqs. (8)–(10), respectively.

$$P = \frac{{TP}}{{TP + FP}}, $$
$$R = \frac{{TP}}{{TP + FN}}, $$
where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative.
$$F1 = \frac{{2P \cdot R}}{{P + R}}. $$

As the most important object detection metric, mAP is frequently utilized in defect detection studies; it averages the precision values over all classes and is formulated by Eq. (11). In particular, mAP in this paper defaults to an IoU threshold of 0.5, and mAP95 denotes the AP averaged over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05.

$$mAP = \int\limits_0^1 {P(R)dR}$$
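As a worked example of Eqs. (8)–(10) (the counts below are illustrative, not taken from our experiments):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from detection counts, per Eqs. (8)-(10)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(detection_metrics(90, 8, 6))  # P ~ 0.918, R ~ 0.938, F1 ~ 0.928
```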

FPS refers to the number of images of a specified size that the model can process per second and is used to measure the running speed of the object detection model. The higher the FPS, the faster the model runs and the better its real-time performance.
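A rough FPS measurement could be sketched as follows, assuming a CUDA-capable GPU as in the setup above; warm-up iterations and explicit synchronization keep the timing honest:

```python
import time
import torch

def measure_fps(model: torch.nn.Module, iters: int = 100) -> float:
    """Estimate inference FPS for a single 480x480 image (batch size 1)."""
    x = torch.randn(1, 3, 480, 480).cuda()
    model = model.eval().cuda()
    with torch.no_grad():
        for _ in range(10):            # warm-up passes, excluded from timing
            model(x)
        torch.cuda.synchronize()       # wait for queued GPU work before timing
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```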

4.3. Comparison of the detection performance

To ensure the fairness of the experiments, the same initial training parameters and dataset are used for each group of experiments. In the training stage, the input resolution is uniformly resized to 480 pixels × 480 pixels, and the Stochastic Gradient Descent (SGD) optimizer is used with momentum and weight decay set to 0.9 and 0.0005 respectively. The batch size is set to 8, the number of epochs to 500, and the learning rate follows a monotonically decreasing schedule from 0.01 to 0.002.
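A minimal sketch of this training setup in PyTorch is shown below; cosine annealing is our assumption for the decay shape, since the text only states the 0.01 → 0.002 endpoints, and the one-layer model is a placeholder:

```python
import torch

model = torch.nn.Conv2d(3, 12, 3)  # placeholder for the ISE-YOLO model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=0.002)  # decays lr from 0.01 to 0.002

for epoch in range(500):
    # ... one pass over the 480x480 training batches (batch size 8),
    # with loss.backward() before each optimizer step ...
    optimizer.step()
    scheduler.step()  # one scheduler step per epoch
```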

Figure 7 shows the mAP curves during training for YOLOv5m, YOLOv5l, YOLOv7 and the proposed ISE-YOLO. The mAP value increases gradually with the number of training epochs, and all four groups follow a similar rapid upward trend in the initial phase of training. Around epoch 70, the mAP of ISE-YOLO continues to rise significantly and surpasses that of YOLOv7, owing to the ISE attention mechanism and the improved loss function, which produce a smaller loss value. The accuracy of the models improves slowly in the middle phase of training and finally converges, with ISE-YOLO maintaining the best performance. Regarding the maximum mAP of each group, the network architecture of YOLOv5l is deeper and wider than that of YOLOv5m, so YOLOv5m's mAP is lower than YOLOv5l's. Because YOLOv7 improves on the feature extraction network and sample assignment strategy of YOLOv5, its detection mAP is slightly better than that of YOLOv5l. However, the detection method proposed in this paper achieves higher performance than YOLOv7, and the number of training epochs required to reach a stable state is also relatively low. Furthermore, other classical object detection networks were trained with the same initial training settings, and the detailed results of all experiments are shown in Table 2.

Fig. 7. The mAP change curves of the four YOLO models.

Table 2. Comparison of the results with six other recent methods.

Intuitively, Table 2 shows that the proposed ISE-YOLO improves on many indicators compared with the other YOLO series models and Faster RCNN. Meanwhile, our method also achieves a relatively high FPS among the mentioned models, meeting the requirements of real-time defect detection. In terms of the mAP, mAP95 and F1 indicators, the proposed ISE-YOLO performs better than the other methods. Compared with Faster RCNN, the proposed model has an overwhelming advantage in all indicators. Although the weight of ISE-YOLO is 22.8% larger than that of YOLOv5m, the FPS is only 0.17% lower. Meanwhile, the proposed model is 0.16%, 6.12%, and 3.07% higher than YOLOv5m in the detection accuracy indicators P, R, and F1, and the mAP and mAP95 are also improved by approximately 3.62% and 4.63% respectively. The YOLOv5m and YOLOv5l experiments show that although accuracy is improved by simply deepening the network, detection speed drops greatly. Compared with YOLOv3 and YOLOv6, the mAP of ISE-YOLO is increased by 3.5% and 2.6%, and the weight is reduced by approximately 58% and 56%, making the FPS faster by 10% and 87%, respectively. In addition, the mAP and F1 of our model are approximately 2.58% and 2.22% higher than those of YOLOv7, currently the latest version in the YOLO series, while the parameter count is decreased by more than 30% and the FPS increases by 16%, greatly improving detection precision while reducing the hardware computation burden.

Figure 8 shows the test results of the nine categories inferred by the proposed model, with some micro-defects zoomed in. Examples of the test results in a qualitative comparison of different methods are shown in Fig. 9. For the weak scratch detected in the first column, (b1) and (c1) miss the defect in the lower half of the example. Although (a1) shows the weak scratch, the region of interest is too large. By contrast, the result in (d1) is excellent. As shown in (d2), the proposed method solves the missed detections of the others by utilizing the ISE attention module to focus on useful semantics and suppress less useful ones. In terms of the number of detected defects, the proposed method detects more than the others in the third column, where (d3) detects the tiny spots while the others do not. This shows that the proposed method can detect micro-defects effectively by exploiting deep-layer information. Although the defects are weak and small and difficult to distinguish from the background, the proposed model still achieves high recall and precision by enhancing deep features, expanding layers, and improving the loss function.

Fig. 8. Examples of test results for each category.

Fig. 9. Examples of detection results on the OL evaluation dataset. The first row shows the original images; the defects from left to right are scratch, degumming, and spots. Rows two through five show the test results of YOLOv3, YOLOv5l, YOLOv7 and ISE-YOLO. In particular, the inference results in (a3)-(d3) are enlarged from the red-bordered regions of the originals.

4.4. Ablation study

To see more directly the effect of the different improvements on model performance, we conduct ablation experiments as shown in Table 3. We consider different combinations of three factors: the SE attention module, the ISE attention module, and the PolyLoss class loss function.

Table 3. Ablation study of detection performance.

In Table 3, the first row is the baseline, the YOLOv5m model. Compared with YOLOv5m, the performance of the second row is improved: AP and F1 increase by 1.92% and 1.94% respectively, and the FPS increases by more than 22%. This shows that using an attention mechanism enhances the learning ability, generalization and robustness of the CNN while reducing computational complexity. The second and third groups of experiments verify the attention module; the results show that ISE has more network parameters and is deeper than SE. Consequently, replacing the SE module with the ISE module increases AP and F1 by 3.37% and 2.73%, respectively. Further, it is worth noting that the fourth group significantly enhances performance compared to the baseline: the model obtains a maximum AP of 94.23%, an increase of 3.62%, with F1 improved by 2.73%. Finally, we choose the ISE module to enrich deep-layer information in the neck of the network and PolyLoss as the class loss function, ultimately obtaining the ISE-YOLO network. Comparing the fourth group with the baseline clearly demonstrates the superiority of ISE-YOLO over YOLOv5. Therefore, using the ISE attention module to extract deep semantic layers and PolyLoss as the class loss function are both necessary for the OL inspection algorithm.

5. Conclusions

In this paper, a novel approach for OL inspection was proposed. To improve the efficiency and accuracy of automatic detection, we built a micro vision-based OL surface weak micro-defect inspection system named MVIS for collecting the OL dataset and performing practical defect detection, a task that is challenging because of the weak morphological features and micro size of the defects and the high recall required in actual production. Based on YOLOv5m, we proposed a novel variant denoted ISE-YOLO. Firstly, we designed the neck architecture to take full advantage of the deep layers. Secondly, we improved the attention module into the ISE attention module to extract richer semantics. Finally, we introduced PolyLoss as the class loss function to make the network learn more information. To evaluate the robustness of our method, we carried out comprehensive experiments comparing it with other prevalent methods. The experimental results show that our method achieves high accuracy with practical computational complexity. Overall, our approach offers high precision and speed for OL defect detection; it can automatically inspect a single OL in a short time and can be applied efficiently on large-scale production inspection lines with mechanical arms to load and unload the OLs.

In this paper, we focused mainly on accuracy for OL images and used a large number of parameters to extract features; as a result, some speed is sacrificed for accuracy. In the future, we plan to further refine the feature extraction structure and reduce the model weight size while maintaining accuracy.

Funding

Natural Science Foundation of Guangdong Province (2021A1515011817, 2022A1515010005, 2022A1515011636); Science and Technology Program of Guangzhou (202201010258); National Natural Science Foundation of China (22127810, 61727810).

Acknowledgments

The authors would like to thank Matsubayashi Optics (Guangzhou) Co., Ltd, President Zhengyang Lin, and Engineer Yongming Ma for their full assistance and for providing the required data.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but can be obtained from the authors upon reasonable request.

References

1. Y. Zhai, Q. Feng, and B. Zhang, “A simple roll measurement method based on a rectangular-prism,” Opt. Laser Technol. 44(4), 839–843 (2012). [CrossRef]  

2. J. Chen, M. Chen, H. Wu, S. Xie, and T. Kiyoshi, “Distortion spot correction and center location base on deep neural network and MBAS in measuring large curvature aspheric optical element,” Opt. Express 30(17), 30466–30479 (2022). [CrossRef]  

3. D. A. Roberts, N. Kundtz, and D. R. Smith, “Optical lens compression via transformation optics,” Opt. Express 17(19), 16535–16542 (2009). [CrossRef]  

4. W. Ming, F. Shen, X. Li, Z. Zhang, J. Du, Z. Chen, and Y. Cao, “A comprehensive review of defect detection in 3C glass components,” Measurement 158, 107722 (2020). [CrossRef]  

5. A. Yousefian-Jazi, J. Ryu, S. Yoon, and J. J. Liu, “Decision support in machine vision system for monitoring of TFT-LCD glass substrates manufacturing,” J. Process Control 24(6), 1015–1023 (2014). [CrossRef]  

6. C. Jian, J. Gao, and Y. Ao, “Imbalanced defect classification for mobile phone screen glass using multifractal features and a new sampling method,” Multimed. Tools Appl. 76(22), 24413–24434 (2017). [CrossRef]  

7. J. Yu, C. Zhu, J. Zhang, Q. Huang, and D. Tao, “Spatial Pyramid-Enhanced NetVLAD With Weighted Triplet Loss for Place Recognition,” IEEE Trans. Neural Netw. Learning Syst. 31(2), 661–674 (2020). [CrossRef]  

8. S. Paszkiel and P. Dobrakowski, “The Use of Multilayer ConvNets for the Purposes of Motor Imagery Classification,” Springer International Publishing, 10–19 (2021).

9. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems 28, (2015).

10. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788 (2016).

11. M. Chen, Z. Zhang, H. Wu, S. Xie, and H. Wang, “Otsu-Kmeans gravity-based multi-spots center extraction method for microlens array imaging system,” Opt. Laser Eng. 152, 106968 (2022). [CrossRef]  

12. A. Bochkovskiy, C. Wang, and H. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv, arXiv:2004.10934 (2020).

13. C. Wang, H. M. Liao, Y. Wu, P. Chen, J. Hsieh, and I. Yeh, “CSPNet: A new backbone that can enhance learning capability of CNN,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 390–391 (2020).

14. S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8759–8768 (2018).

15. Z. Leng, M. Tan, C. Liu, E. D. Cubuk, X. Shi, S. Cheng, and D. Anguelov, “PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions,” arXiv, arXiv:2204.12511 (2022).

16. D. Zhang, J. Han, G. Cheng, and M. Yang, “Weakly supervised object localization and detection: A survey,” IEEE Trans. Pattern Anal. Mach. Intell. 44, 5866–5885 (2021). [CrossRef]  

17. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7132–7141 (2018).

18. T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2980–2988 (2017).

19. J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv, arXiv:1804.02767 (2018).

20. C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, and W. Nie, “YOLOv6: a single-stage object detection framework for industrial applications,” arXiv, arXiv:2209.02976 (2022).

21. C. Wang, A. Bochkovskiy, and H. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” arXiv, arXiv:2207.02696 (2022).
