Adequacy of manual endoscopic inspection of the upper and lower gastrointestinal mucosa is operator-dependent: it is common knowledge that the likelihood of finding lesions and the degree of cancer prevention achieved by manual endoscopic procedures depend on the skillset and effort of endoscopists. Recent developments in artificial intelligence allow measurement of skillset and effort during actual endoscopic procedures. Many endoscopic features representing skillset and effort, such as clarity of image, absence of stool, looking sideways, performing retroflexion and obtaining all possible mucosal views, can be classified or counted, and the results presented in real-time to endoscopists and stored at the end of the procedure as an automated, objective report with representative images documenting adequacy of inspection. Real-time feedback gives endoscopists the option, when measurements suggest inadequate or incomplete inspection, to change technique or to repeat and expand inspection. However, responding to real-time feedback and obtaining the best possible measurements or all possible mucosal views do not equate to careful inspection; endoscopists may be focused on obtaining the best measurements instead of inspecting the mucosa. Therefore, prospective studies with long-term follow-up will be required to determine whether artificial intelligence-driven real-time feedback will lead not only to better intra-procedural measurements but also to improved patient outcome, for example, a decrease in gastrointestinal cancer incidence and mortality.
The goal of any gastrointestinal endoscopy is to completely inspect all of the lining and contents, if present, of the target organs, to describe the condition of the mucosa as variants of normal or abnormal, to detect any mucosal, submucosal or external lesions affecting function or shape of the target organs and to perform diagnostic or therapeutic procedures. In the beginning most endoscopy was simple and focused on inspection and diagnostic sampling; lately, however, very complex endoscopy is being performed in which the focus has moved to combining diagnostic and definitive therapeutic procedures into a single endoscopy when possible. Examples of more advanced, complex endoscopy include endoscopic ultrasound, endoscopic submucosal dissection, endoscopic stenting, rendezvous procedures, transmural procedures between organs or the skin, and endoscopic suturing.
The reality of endoscopy is that the vast majority of endoscopies are screening or diagnostic esophagogastroduodenoscopies (EGDs) and colonoscopies. These are done by endoscopists: human beings with more or less training in endoscopy, more or less experience, and greater or lesser endoscopic skillsets. Unlike machines such as CT or MR scanners, human beings make errors. Indeed, endoscopy is an operator-dependent procedure where the likelihood of detecting lesions and the success of any form of interventional procedure such as Barrett’s mucosa ablation, treatment of esophageal varices or completeness of colon polyp resection depend on the skillset and effort of the endoscopist. As a result, lesions are being missed or incompletely removed during EGD as well as colonoscopy; numerous articles have described this during the past decades. Therefore, there is a proven need for technology that can document what was done and can prevent lesions from being missed or incompletely removed during endoscopy.
The foremost potential contribution of artificial intelligence (AI) in endoscopy is objective analysis of what was seen and done during endoscopy. This contribution is most likely to materialize for simple and frequent endoscopies where the organs to be inspected are predictable, the order of mucosal views is rather fixed, and only a few diagnostic or therapeutic procedures are likely to occur. In practice this means EGDs, colonoscopies and to a lesser extent video capsule endoscopies. AI or deep learning needs, as the term learning indicates, to learn, and does so from large collections or corpora of examples, preferably carefully annotated for the features to be learned. Video files of endoscopies are relatively easy to obtain for EGDs, colonoscopies and video capsule endoscopies as many are done on a day-to-day basis. Manual video file annotation is more difficult to achieve for a large video file corpus, as many endoscopists with the needed expertise do not have time for, are not rewarded for or have no interest in this type of work. Furthermore, any AI system will only be able to evaluate endoscopies with features that are readily present in the learning material used to train the system; therefore images from different endoscope manufacturers or different types of endoscopes (color wheel vs color chip) or endoscopies from different units with a different protocol for inspection may not result in truly objective documentation. A second potential benefit of AI is real-time feedback: what was seen or not seen, what type of mucosa was seen, what type of polyp was seen, whether it needs to be removed, and if so whether it was completely removed. In this review we will outline the components of an AI-based system for endoscopy to improve adequacy of inspection. We will discuss examples of different methods aimed at improving and documenting mucosal inspection of stomach and colon.
We will not discuss imaging of the small bowel, as current video capsules are not driven by endoscopists but instead rely on clear fluid intake by the patient and spontaneous motility of the small bowel. Technical details of AI methodology will also not be discussed – for these the reader is referred to the appropriate literature.
Components of an AI system
As mentioned, it is unlikely that an AI system can be created that will work in every endoscopy unit without some adjustments for the type of endoscopic instrument, light source, imaging method, and endoscopy protocol used. Therefore the following will be a general discussion, not focused on any specific brand of endoscope or protocol. Development of an AI system requires a development phase in which a network is trained; this is a compute-intensive process that most of the time happens on a large server or in the cloud (Figure 1). The result is a static model that can be used anywhere. However, real-time feedback cannot tolerate any noticeable time lag, that is, latency, and therefore inference with the model has to be performed on local workstations when real-time feedback is desired. Such workstations typically consist of multicore CPUs, possibly with Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) for repetitive, calculation-intensive tasks, a sufficient amount of RAM to hold a series of images in memory for processing, a video capture card for signal input and a graphics card for signal output.
Video file capture
In general, creation of an AI system requires capture of the training and test set video files and, once an AI system is implemented, capture and analysis of the stream of images generated during routine endoscopy. For data capture, there are 3 possibilities: none, manual, and automated. The first option, to discard images at the time of capture, does not allow manual annotation and repeated training and testing of AI software. Therefore, this is only a valid option for analysis of video files during routine practice using an already trained AI system. The second option, manual capture, has been used most often. It requires the presence of an analog or digital video recorder and personnel starting and stopping the recording process. An example of this option is video capsule endoscopy, where the capsule is activated prior to placement or swallowing by the patient. The third option, automated capture, can itself be AI software, where images obtained using manual capture are used to train an automated AI software module. Non-AI, rules-based software has been created that analyzes the color, shape and movement components of video streams, and decides based on rules whether an “inside-the-patient” status exists; if this is the case, video capture and analysis will be performed. AI software has also been created using training sets of inside-the-patient and outside-the-patient images.
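As a rough illustration of the rules-based option, the sketch below labels frames as inside or outside the patient from color alone. It is a minimal sketch under stated assumptions, not the published software: the red-dominance rule, the `threshold` and `min_run` values, and the function names are all hypothetical, and the shape and movement components mentioned above are left out.

```python
from typing import Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255
Frame = Sequence[Pixel]       # a frame flattened to a sequence of pixels


def reddish_fraction(frame: Frame) -> float:
    """Fraction of pixels in which red clearly dominates (hypothetical rule)."""
    if not frame:
        return 0.0
    reddish = sum(1 for r, g, b in frame
                  if r > 100 and r > 1.3 * g and r > 1.3 * b)
    return reddish / len(frame)


def inside_patient_states(frames: Iterable[Frame],
                          threshold: float = 0.6,
                          min_run: int = 3) -> List[bool]:
    """Label each frame as inside (True) or outside (False) the patient.

    The state flips only after `min_run` consecutive frames disagree with it,
    so a single odd frame does not start or stop capture.
    """
    state = False  # assume the recording starts outside the patient
    run = 0        # consecutive frames that disagree with the current state
    states: List[bool] = []
    for frame in frames:
        looks_inside = reddish_fraction(frame) >= threshold
        if looks_inside != state:
            run += 1
            if run >= min_run:
                state = looks_inside
                run = 0
        else:
            run = 0
        states.append(state)
    return states
```

The run-length requirement trades a short delay at each transition for stability; a real system would have to choose these values from annotated video.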
The value of detailed, manually annotated training and testing video file corpora cannot be overstated: they are crucial for development of any AI system; they can be reused if newer AI methods become available, and the video files can be re-annotated for additional features if new features need to be discovered . Examples of features that can be annotated range from simple to complex. A simple feature is the inside-the-patient or outside-the-patient status of the endoscope; annotation of this feature is not difficult and does not require extensive endoscopic expertise. A more complex feature is clear vs blurry frame detection; what percentage of a frame needs to be clear to define it as a clear frame, as only clear frames and not blurry frames undergo additional image analysis .
However, although a feature such as inside-the-patient vs outside-the-patient status appears simple, starting or stopping a new video file with each transition between outside and inside the patient may not be the right thing to do. First, if the first image of a video file is inside the patient, how do we know that other inside-the-patient images preceding the first image were not accidentally omitted? Second, during endoscopic procedures where the endoscope is repeatedly inserted and withdrawn, for example large polypectomy during colonoscopy or variceal band ligation during EGD, the video stream ideally should consist of all parts of the endoscopy, including the outside-the-patient parts between all inside-the-patient parts of a single patient. The same can be said about the second example: blurry frame detection and removal from subsequent analysis. It is quite possible that a key finding such as an arteriovenous malformation, an ulcer or a polyp is only or mostly present on video images that are partly blurry; therefore any feature classification for blurry frames needs to consider that frame removal may result in removal of potentially critical information about the condition of the patient.
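One way to respect this caution is to label frames with a clarity measure rather than delete them. The sketch below estimates the fraction of a frame that is clear by scoring small tiles for local contrast. It is an illustration only: the gradient-based sharpness measure, the tile size, and both cutoffs (`min_sharpness`, `clear_cutoff`) are assumptions, not values from any published system.

```python
from typing import Dict, Sequence, Union

Gray = Sequence[Sequence[float]]  # 2D grayscale image as rows of intensities


def tile_sharpness(img: Gray, top: int, left: int, size: int) -> float:
    """Mean absolute horizontal plus vertical intensity step inside one tile."""
    total, count = 0.0, 0
    for y in range(top, top + size - 1):
        for x in range(left, left + size - 1):
            total += abs(img[y][x + 1] - img[y][x])
            total += abs(img[y + 1][x] - img[y][x])
            count += 1
    return total / count if count else 0.0


def clear_fraction(img: Gray, tile: int = 4,
                   min_sharpness: float = 10.0) -> float:
    """Fraction of full tiles whose sharpness reaches `min_sharpness`."""
    rows, cols = len(img), len(img[0])
    scores = [tile_sharpness(img, top, left, tile)
              for top in range(0, rows - tile + 1, tile)
              for left in range(0, cols - tile + 1, tile)]
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= min_sharpness) / len(scores)


def label_frame(img: Gray,
                clear_cutoff: float = 0.5) -> Dict[str, Union[float, bool]]:
    """Return clarity metadata; the frame itself is kept, never dropped."""
    fraction = clear_fraction(img)
    return {"clear_fraction": fraction, "blurry": fraction < clear_cutoff}
```

Because the output is metadata, a partly blurry frame that happens to show a lesion remains available to later analysis steps, which can decide how much weight to give it.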
Analyzing the video file for basic components
Before features of interest, such as adequacy of inspection, are extracted from a streaming video file, first a number of preparatory steps may need to be performed where basic features are detected. Many of these steps humans perform automatically, but a computerized system needs to be instructed to perform these preparatory steps to detect these basic features. These are more or less the same for all parts of the hollow gastrointestinal tract and summarized in Table 1 .
|Basic feature|Classification (type)|
|---|---|
|Image clarity|Clear or blurry (percentage)|
|Tip movement|Forward or backward (binary)|
|Tip speed|Fast or slow (continuous variable)|
|Tip view direction|Along gut axis or lateral wall (continuous variable)|
|Location within GI tract|Esophagus, Stomach, Small bowel, Appendix, Cecum, IC valve, Rectum (multiple choice)|
|Remaining luminal debris|None or present (percentage)|
These preparatory steps may not be as straightforward as they seem to the casual reader, because classification of most basic features is frequently not binary, that is, yes/no, but a continuous range between a minimal and maximal value for which optimal thresholds have to be determined and set. For example, in the colon remaining luminal debris can be estimated per frame based on color (brown), and can range from 0 to 100 percent of the frame. The location in the GI tract is a challenge; there are only a few characteristic hallmarks that provide location information with great certainty: the vocal cords, the EG junction, the pylorus, the major papilla, the villi of the small bowel, the IC valve, the appendiceal opening within the crowfoot of the cecum, and the rectum on retroflexion. External modalities such as fluoroscopic or electromagnetic determination of endoscope tip location may help but do not provide absolute certainty for small bowel and colon due to the mobility of these organs in the abdominal cavity.
Analyzing the video file for desired features
There is no standard for adequacy of inspection in GI endoscopy; there are recommendations and guidelines of varying strength, which differ by professional society and country. Guidelines in general cover 6 features (Table 2) that have a weaker or stronger relationship with what was seen: (1) degree of organ preparation; (2) extent of intubation; (3) time spent on inspection; (4) number and location of images taken during endoscopy; (5) an estimate of the amount of mucosa visualized; (6) frequency with which lesions are detected during screening. For video capsule endoscopy there are few or no recommendations, as the passage of current capsules is not operator-dependent.
|Feature|EGD|Colonoscopy|
|---|---|---|
|Degree of organ preparation|Not defined|Numerous scales|
|Extent of intubation|Duodenum, major papilla|Cecum|
|Time spent on inspection|Total time ≥7 min; ≥1 min/cm in circumferential Barrett’s epithelium|Withdrawal time ≥6 min|
|Images taken|At least 8-10 (ESGE)|At least 8 (ESGE)|
|Estimate of mucosa visualized|Not defined, 100% assumed|90-95% expected|
|Frequency of lesions|Inlet patch in ∼10% expected|ADR ≥25% during screening|
The degree of organ preparation is mostly a subjective opinion of the endoscopist. For EGD, where fasting in general results in an empty stomach, this is most of the time not a major issue. But for colonoscopy numerous scales exist that document the estimated state of colonic preparation, a subjective opinion of the endoscopist, before or after washing and removal of remaining debris [ , ]. The eventual colonic preparation depends greatly on the willingness of the endoscopist to spend time and effort on removing remaining debris. The extent of intubation is well-defined; for colonoscopy this means reaching the cecum and for EGD this means reaching the second part of the duodenum with visualization of the major papilla [ , ]. The time spent on inspection varies per organ. For colonoscopy the average withdrawal time in negative-result screening colonoscopies is recommended to be at least 6 minutes . This recommendation however allows very fast colonoscopies as only an average, not a per procedure time, is used. For EGD, US societies do not specify a specific time spent on inspection but articles from Europe and Asia state that longer inspection time results in detection of more lesions . Furthermore, based on data of 16 slow (N = 8) and fast (N = 8) endoscopists Teh et al. recommend a minimum EGD time of 7 minutes per EGD for a first EGD in patients at risk for neoplasms as slow endoscopists detected 3-fold more neoplastic lesions . When Barrett’s esophagus is present, a minimum inspection time of 1 minute per centimeter of circumferential Barrett’s mucosa is recommended .
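The point that an average-based withdrawal recommendation permits very fast individual procedures can be made concrete with a few lines of arithmetic. The sketch below is illustrative only; the function name and the example times are invented, and the 6-minute value is simply the recommendation quoted above applied per procedure.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def withdrawal_summary(times_min: Sequence[float],
                       minimum: float = 6.0) -> Dict[str, Union[float, bool, int]]:
    """Compare the average withdrawal time with a per-procedure check."""
    average = mean(times_min)
    below = [t for t in times_min if t < minimum]
    return {
        "mean": average,
        "meets_average_rule": average >= minimum,
        "n_below_minimum": len(below),
        "fraction_below": len(below) / len(times_min),
    }


# Invented example: three fast and three slow withdrawals.
example = withdrawal_summary([3.0, 3.5, 4.0, 9.0, 10.5, 12.0])
```

In this invented series the mean is 7 minutes, so the average-based rule is met even though half of the withdrawals took less than 6 minutes.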
The only truly objective forms of documentation are images and video files. Retention of video files is not recommended by any endoscopic society, but documentation using images is. US societies do not specify a specific number of images for any procedure. For EGD the minimal number of images is 8 in Great Britain and 10 in the latest European Society of Gastrointestinal Endoscopy (ESGE) guidelines [ , ]. The ESGE specifies that accurate photo documentation includes at least one representative image of each of the following anatomical landmarks: duodenum, major papilla, antrum, incisura minor, corpus, retroflexion of the fundus, diaphragmatic indentation, upper end of the gastric folds, squamo-columnar junction, distal and proximal esophagus. In East Asia, where there is a high incidence of gastric cancer, up to 48 images are recommended to be taken systematically from vocal cords to the second part of the duodenum with most attention paid to the stomach; indeed, Yao proposed that systematic screening of the stomach requires at least 12 antegrade and 10 retroflexed images, but this is only for areas with a high incidence of gastric cancer . For screening colonoscopy the ESGE recommends 8 images: rectum, middle sigmoid, proximal descending colon, distal transverse colon, proximal transverse colon, distal ascending colon, ileocecal valve and cecum .
Another subjective measure of adequacy of inspection during GI endoscopy is estimation by the endoscopist of the percentage of mucosa visualized. Documenting such an estimate is not recommended by any endoscopy society, but for EGD it is assumed that all mucosa of the esophagus, stomach, and first half of the duodenum is inspected, even though the bulb and first part of the duodenum may be very difficult or impossible to inspect completely due to anatomical restrictions. For colonoscopy the consensus among experts is that at least 90%-95% of mucosa can be inspected in a normal colon during screening colonoscopy. The last measure of adequacy of inspection is calculation of the average number of lesions identified across a series of endoscopies. For EGD the only infrequently present lesion that requires careful inspection to find is a mucosal inlet patch; such a lesion is present in the upper esophagus in at least 10% of people. Keeping track of this finding as part of a quality control process for adequacy of mucosal inspection is in general not advised or implemented. Since 2006 the ASGE/ACG Taskforce on Quality in Endoscopy has recommended using the proportion of screening colonoscopies in which at least one adenomatous polyp was detected, called the adenoma detection rate (ADR), as the major determinant of adequacy of mucosal inspection. This was based on prevalence rates of adenomas in colonoscopy screening studies that have been consistently over 25% in men and 15% in women more than 50 years old.
Two things rapidly became evident. First, endoscopists would look for an adenomatous-appearing polyp and biopsy or remove it, but then would not find any additional polyps in most cases, a phenomenon called “one-and-done”. Indeed, preliminary data from the CORI database from 2018 show that “one-and-done” represents 60% of all colonoscopies in which at least one adenoma is found; this suggests that “gaming” of the quality control features continues unabated today. Second, an absolute ADR of 0.2 is not an appropriate measure of adequacy of mucosal inspection. In 2015, the ADR threshold was raised from 0.2 to 0.25, but that number too is not a guarantee of a high-quality inspection. Data from a large retrospective analysis from California showed that each 1% increase in the ADR decreased the interval colorectal cancer rate by 3%. Therefore, there is no fixed ADR that differentiates a suboptimal from an adequate inspection of the colon.
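For readers who want the two quantities above in computable form, the sketch below derives the ADR and the share of adenoma-positive examinations with exactly one adenoma from per-examination adenoma counts. The function and key names are hypothetical. Note the limitation stated in the code comment: a count of exactly one adenoma cannot by itself distinguish “one-and-done” behavior from a colon that truly contains a single adenoma.

```python
from typing import Dict, Sequence


def adr_summary(adenomas_per_exam: Sequence[int]) -> Dict[str, float]:
    """ADR and single-adenoma share for a series of screening colonoscopies.

    `adenomas_per_exam` holds the number of adenomas found in each exam.
    The single-adenoma share is only a proxy for "one-and-done": it also
    counts patients who genuinely have exactly one adenoma.
    """
    n_exams = len(adenomas_per_exam)
    if n_exams == 0:
        raise ValueError("at least one screening colonoscopy is required")
    with_adenoma = [n for n in adenomas_per_exam if n >= 1]
    exactly_one = [n for n in with_adenoma if n == 1]
    adr = len(with_adenoma) / n_exams
    single_share = len(exactly_one) / len(with_adenoma) if with_adenoma else 0.0
    return {"adr": adr, "single_adenoma_share": single_share}
```

For example, counts of `[0, 1, 0, 2, 1, 0, 0, 1, 3, 0]` give an ADR of 0.5 and a single-adenoma share of 0.6.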
Developing AI software
After one has captured the video, performed the basic analysis, selected the desired features and created a large corpus of annotated video files with adequate representation of the desired features, the deep learning process can start. This is not a simple process; it requires expertise in computer science in addition to an understanding of the intricacies of endoscopy. Given this range of expertise, any AI project in endoscopy requires a team of computer science and endoscopy experts working closely together. As mentioned before, the focus of this article is not to go into the details of the AI methodology; however, one should have a basic understanding of the complexity of a computer-based system that analyzes a video stream within milliseconds, feeds the results from one algorithm into another algorithm, and eventually combines all the results in a decision module, which decides whether or not to provide feedback and, if so, about which feature and in what format. All of this has to happen within about 33 milliseconds per frame, as most video files are recorded at a frame rate of about 30 frames per second.
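The per-frame time budget follows directly from the frame rate (1000 ms divided by 30 frames is about 33.3 ms). The sketch below checks whether a chain of processing stages fits within that budget. The stage names and latencies are invented for illustration, and the calculation assumes the stages run strictly one after another on each frame; a pipelined or parallel design would be budgeted differently.

```python
from typing import Dict


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def headroom_ms(stage_latencies_ms: Dict[str, float], fps: float = 30.0) -> float:
    """Budget left after all stages run sequentially (negative means too slow)."""
    return frame_budget_ms(fps) - sum(stage_latencies_ms.values())


# Hypothetical stage timings, in milliseconds per frame.
stages = {"capture": 4.0, "clarity": 5.0, "feature_models": 18.0, "decision": 2.0}
remaining = headroom_ms(stages)  # about 4.3 ms of slack at 30 frames per second
```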
Applications of AI to improve adequacy of mucosal inspection
The goal of every endoscopic examination is to obtain the maximal possible amount of information about the organs being examined. To define quality of endoscopy we created the acronym CLEAR; it stands for (1) Clean mucosa; (2) Look Everywhere; and (3) Abnormality Removal . Adequate mucosal inspection involves the first 2 components of CLEAR: cleaning the mucosa and then looking everywhere. AI applications that will improve adequacy of inspection will have to address these 2 components of CLEAR.
Recently a first attempt was published to use AI as a method to measure whether the entire stomach is inspected and to stimulate endoscopists to do so. We know from numerous studies that, just as in the colon, lesions are being missed in the stomach. In Asia gastric cancer remains a major cause of death, and screening programs to find gastric cancer early are common. A team of investigators from China has addressed the issue of completeness of inspection by successfully developing and testing a convolutional neural network (CNN) based method to measure and document in real-time whether all mucosa of the stomach is inspected. The method is simple and uses a number of standard views which are typical of forward, sideways, and retroflexed inspection of the stomach. During the endoscopy, the software checks whether all views representing gastric locations are detected and gradually, with each view detected, changes the color of those locations in a schematic representation of the stomach: a homogeneous coloration at the end of gastric inspection means that with very high likelihood the entire gastric mucosa has been inspected. The software also stores images of all locations inspected. The stomach is an ideal organ for this method: it is homogeneous, has a clear beginning and end (the EG junction and pylorus, respectively) and consists of an asymmetric cavity with a relatively thick wall without deep folds. With adequate inflation every stomach will have a similar shape, and if the stomach is devoid of remaining food, bile and saliva, images of standard views should be fairly similar, which explains why the AI method works.
Compared to the stomach, the colon is a challenging organ, as it is long, has several angulations and flexures, and contains numerous deep or shallow haustra. The colon has mobile segments (sigmoid and transverse) and nearly always contains remaining debris. Lastly, unlike the stomach, the colon does not have a predictable, recognizable shape after a certain degree of insufflation and therefore does not allow a firm determination of anatomical location. This makes measuring adequacy of inspection a much greater challenge than for the stomach. Therefore multiple features, such as presence of remaining debris, maximal insertion, retroflexion and mucosa seen, need to be measured in order to report on adequacy of colon inspection.
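The bookkeeping behind such a view checklist can be sketched independently of the neural network. In the sketch below the CNN is not implemented; its per-frame output (a site label and a confidence) is simply passed in. The class name, the confidence cutoff and the site names used in any example are assumptions for illustration and are not taken from the published system.

```python
from typing import Dict, Iterable, List


class CoverageTracker:
    """Track which expected views have been seen during one examination."""

    def __init__(self, expected_sites: Iterable[str],
                 min_confidence: float = 0.8) -> None:
        self.expected: List[str] = list(expected_sites)
        self.min_confidence = min_confidence
        # site -> index of the first frame in which it was accepted;
        # that frame can be stored as the representative image.
        self.seen: Dict[str, int] = {}

    def update(self, frame_index: int, site: str, confidence: float) -> None:
        """Record a classifier output if it is expected, confident and new."""
        if (site in self.expected
                and confidence >= self.min_confidence
                and site not in self.seen):
            self.seen[site] = frame_index

    def coverage(self) -> float:
        """Fraction of expected sites seen so far (drives the schematic coloring)."""
        return len(self.seen) / len(self.expected) if self.expected else 0.0

    def missing(self) -> List[str]:
        """Expected sites not yet seen, in their original order."""
        return [s for s in self.expected if s not in self.seen]
```

A display layer could color each site in the schematic as soon as it appears in `seen`, and the list returned by `missing()` is what real-time feedback would prompt the endoscopist to revisit.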
The first feature of adequacy of inspection of colon mucosa is determination of the amount of debris present, and whether the endoscopist actually does remove this during endoscopy. At least 2 groups have tested real-time fecal debris detection; both methods were simple and based on detection of green-brown color pixels rather than expected red-orange-pink colon mucosa [ , , ]. Color detection does not involve deep learning but rather consists of setting thresholds for color features.
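A threshold-based color rule of this kind can be sketched in a few lines. The example below converts each pixel to hue and saturation and counts pixels in a green-brown hue band. The hue band (20 to 150 degrees) and the saturation cutoff are assumptions chosen for illustration, not the thresholds used by either group, and real frames would also need handling of glare and dark regions.

```python
import colorsys
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def is_debris_colored(pixel: Pixel,
                      hue_range_deg: Tuple[float, float] = (20.0, 150.0),
                      min_saturation: float = 0.3) -> bool:
    """True if the pixel falls in an assumed green-brown hue band."""
    r, g, b = (channel / 255.0 for channel in pixel)
    hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    hue_deg = hue * 360.0
    return (hue_range_deg[0] <= hue_deg <= hue_range_deg[1]
            and saturation >= min_saturation)


def debris_percentage(frame: Sequence[Pixel]) -> float:
    """Percentage (0-100) of pixels in a frame classified as debris-colored."""
    if not frame:
        return 0.0
    count = sum(1 for pixel in frame if is_debris_colored(pixel))
    return 100.0 * count / len(frame)
```

Tracking this percentage over time, before and after washing, would indicate whether debris was actually removed rather than merely present.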
Reaching the cecum is required in order to inspect the entire colon. Detection of several features can mark the cecum: detection of the appendiceal orifice and the crowfoot, detection of the ileocecal (IC) valve and detection of the villi of the small bowel after IC valve entry. For each of these features computer algorithms can be and have been developed, allowing detection of the cecum as the colon segment of maximal insertion with 90%-95% accuracy.
Retroflexion allows a different view of the mucosa, both in the right colon and in the rectum; instead of looking proximally at haustra and the rectum, the endoscopist looks distally along the colon axis. This permits detection of lesions that are not seen, or not easily seen, using the conventional proximal view. Retroflexion has a characteristic feature: the shaft of the endoscope is seen running from somewhere in the picture toward the edge of the image. Both rules-based and AI methods can detect this.
Amount of mucosa seen
Two methods are in development or have been developed already that provide indirect or direct information about the amount of mucosa seen. The ADR – currently in use clinically as a marker of quality – is a third method; it can be measured using AI by combining 2 methods: polyp detection and polyp type estimation. We will not discuss these methods here as ADR is not a direct or indirect measurement of amount of mucosa seen; ADR assumes that adequate inspection has occurred.
The first method measures the effort of the endoscopist to withdraw the endoscope in a gradual, circumferential fashion (Figure 2). As mentioned, the colon is anything but a straight hollow tube; in order to see behind angulations and haustra, the tip of the endoscope has to be actively deflected away from the center of the lumen towards the lateral wall. This deflection away from the center can be measured in 2D images as long as the location of the center of the lumen is known. Because endoscopes are equipped with wide-angle lenses (140 degrees or more), tip deflection of 60-70 degrees will result in adequate lateral wall inspection while still maintaining a view of the distal, central lumen. Therefore, the distance between the center of the (endoscope) image and the location of the center of the lumen in the image is directly related to the amount of tip deflection away from the central colon axis. By measuring this distance away from the center of the lumen over the entire 360 degrees or circumference in 2D images, one measures whether the tip of the endoscope was circumferentially deflected in the actual 3D colon. Once a circumferential inspection has been completed, this can be reported and recorded as a single complete circumferential inspection. By counting the number of circumferential inspections during the withdrawal phase, one obtains a good representation of the amount of effort spent by the endoscopist to completely inspect the colon. By providing real-time feedback as a “spiral score”, the system lets the endoscopist see the actual number of circumferential inspections already completed. The higher the spiral score, the greater the likelihood of an adequate inspection of the colon.
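The counting logic described above can be sketched as follows, assuming that a separate step has already located the lumen center in each frame. This is a simplified illustration and not the published algorithm: dividing the circumference into 8 sectors, the `min_offset` pixel threshold, and the rule of resetting after every sector has been visited are all assumptions.

```python
import math
from typing import Iterable, Optional, Set, Tuple

Point = Tuple[float, float]


def spiral_score(lumen_centers: Iterable[Optional[Point]],
                 image_center: Point,
                 min_offset: float,
                 sectors: int = 8) -> int:
    """Count completed circumferential inspections.

    Each frame contributes the lumen center in image coordinates, or None
    when no lumen was found. A frame counts toward a sector only when the
    lumen center lies at least `min_offset` pixels from the image center,
    that is, when the tip is deflected toward the wall. One inspection is
    counted each time every angular sector has been visited.
    """
    cx, cy = image_center
    covered: Set[int] = set()
    completed = 0
    for point in lumen_centers:
        if point is None:
            continue
        dx, dy = point[0] - cx, point[1] - cy
        if math.hypot(dx, dy) < min_offset:
            continue  # tip is looking down the lumen, not at the wall
        angle = math.atan2(dy, dx) % (2 * math.pi)
        covered.add(int(angle / (2 * math.pi) * sectors) % sectors)
        if len(covered) == sectors:
            completed += 1
            covered.clear()
    return completed
```

The running value of `completed` is what would be shown to the endoscopist as the spiral score during withdrawal.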