The videos below are recordings of a live biosecurity training course hosted by BioMADE and held virtually on March 31, 2023. This course was developed by BioMADE members Signature Science, LLC, Rice University, and Aclid, with the objectives of increasing familiarity with important biosecurity concepts and regulations, introducing participants to sequence screening using open-source software, and helping participants understand how to interpret sequence screening results. The course materials are publicly available on GitHub, and a playlist containing all of the supporting videos below can be found on YouTube.
Course Introduction
Hi everyone, I am the Principal Investigator of the contract that was partially funded by BioMADE to bring you this course, which is a biosecurity sequence screening training course for bioengineers. And while we developed this course with bioengineers in mind, we definitely welcome anyone with an interest in these topics to join us here and we look forward to getting everyone’s feedback. So thank you so much for taking the time to check out this online course.
The objectives for this course are for participants to become familiar with important biosecurity concepts and regulations, as well as gain experience by performing sequence screening with open source software that we’re going to walk you through. We hope that you’ll learn how to interpret sequence screening results and we encourage you to complete all portions of this course. So please watch all the lectures, participate in all the exercises and case studies, and try running the software on your own if you’re able to. The software that we’re going to use for this course does require Linux servers and if you do not have access to Linux or aren’t comfortable with running command line tools, there is no pressure to do that. We also provide on our GitHub site the outputs from the software so you can watch the examples of how the software is run and then you can see what the outputs look like and you can walk through the interpretation portions of the course with us despite not having generated the results yourself.
But either way, we encourage you to jump in and participate in interpreting the outputs and we also encourage you to fill out the pre- and post-course surveys. We will provide links to that at the end of this introduction and then at the conclusion of the course we’ll have a post course survey and, spoiler alert, the pre- and post-course surveys have exactly the same questions. So you’ll be asked the same questions before the course and then again after the course. And the idea is that we’d like to see how your answers change after you watch and participate in all the different course material. This helps us out a lot. It gives us feedback and it helps us to be able to gain funding to continue to improve the course in the future. So please do participate in those course surveys and it helps make the material better moving forward.
A big note and announcement that I’d like to make is that the views, conclusions, software, databases, and all the material in this course are solely from the authors and they do not represent the views or opinions or endorsement of the US government or any government agency. This is just coming from the authors. It is solely our opinion, but we believe our opinion and our work could be helpful for you, which is why we’re making it openly available. So please keep that in mind as you go through this material. I also wanted to raise awareness that during the original recording of this course in March of 2023, the HHS guidance was slightly different than it is today, when we’re making these online materials in October 2023, because a new guidance document was released. So some of the lectures that you’ll see, such as those from Dr. Matt Sharkey, were recorded back in March of 2023, before the updated guidance document came out. I definitely encourage you to take a look at this guidance document, which is linked below, because there were some substantial changes to the guidance in October of 2023. Those changes include looking at sequences of concern. So rather than just pathogen genomes or toxin sequences, we are now looking at anything that could be considered a sequence of concern, meaning a sequence that, based on its function, might introduce or contribute to pathogenicity. So please keep that in mind.
It’s also very relevant to this course because, to our knowledge, ours is the only open source software available to help label sequences of concern. That labeling is done through machine learning and manual curation. And this is also the only training course of its kind. I don’t know of another biosecurity training course that’s out there in an open way that helps people access software and learn how to interpret the software outputs for biosecurity purposes. There’s often a hesitation to do that because of concerns about dual use. So in this introductory lecture, I also wanted to touch on reasons why we’ve decided to make our material openly available. That material includes our software, which is readily available and installable through Conda and other ways of downloading and installing bioinformatics software. Our database is also freely available. And this training is freely available. And the reason that we’ve decided to be so transparent and open about sharing our work is that we believe this fills a gap for legitimate basic research purposes. There is not a lot out there in this space that’s free and accessible to people. So we felt like this was really filling a gap there.
The sequence of concern ontology labels that you’ll hear about throughout this course were developed by our team of scientists. They were not endorsed by any regulatory authority or government agency. These, again, are just our opinions on how things should be labeled, and all of the labeling was done based on publicly available data, so publications and experimental evidence based on functions. That’s what went into our labeling as well as our machine learning processes. All of that is published. It’s out there. It has undergone its own scientific review, but it’s not specifically endorsed by any regulatory authority. So please do not take anything in this course as being something that anyone other than the authors is recommending that you use. This is simply a tool that we are putting out there to fill a gap. And while individual sequences of concern are labeled by SeqScreen, I wanted to make it clear that there are no automated threat calls made by SeqScreen, and that is a very important distinction. So our SeqScreen software and databases will take users through taxonomic assignments and functional calls and characterization, which includes adding these pathogenic ontology labels to sequences of concern. But we do not go so far as to say, “this sequence is a threat, this is not a threat.” We don’t do green, yellow, red classifications with SeqScreen or any other high-level categorization of threat. That is something that is very nuanced and challenging and specific to different end users and what they might be doing. Their application and other metadata considerations need to go into that final threat call. That’s not something that we do, but we believe that we’re giving you very important fundamental information that helps you as a user to make those calls.
And that’s something that we’re going to work through in this course, but I believe that’s also something that is worth mentioning here, because if we were making automated threat calls and making that publicly available, I think that would be under a higher degree of scrutiny as far as dual use and whether we should be making all of that fully transparent. So hopefully that makes sense. All of that said, we believe that the concepts involved in making threat calls and making these final determinations are very important to teach people. So we need people to understand how to do this, what the processes are, and what the considerations are, and this is all going to help us be transparent and promote good biosecurity practices.
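As a rough illustration of the distinction drawn above, the following Python sketch shows how an end user might combine per-sequence labels with their own metadata to prepare a summary for a human reviewer. It is not part of the recorded lecture and does not use SeqScreen's actual output format: the `ScreenedSequence` and `OrderContext` classes, every field name, and the example label are hypothetical placeholders. The sketch deliberately stops at flagging a record for human review and makes no threat call.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScreenedSequence:
    """One screened sequence. Field names are hypothetical, not real SeqScreen columns."""
    seq_id: str
    taxon: str
    concern_labels: List[str] = field(default_factory=list)


@dataclass
class OrderContext:
    """User-side metadata that a screening tool knows nothing about (hypothetical)."""
    customer_verified: bool
    stated_use: str


def summarize_for_reviewer(seq: ScreenedSequence, ctx: OrderContext) -> Dict[str, object]:
    """Gather evidence for a human reviewer without assigning any threat category."""
    return {
        "seq_id": seq.seq_id,
        "taxon": seq.taxon,
        "concern_labels": sorted(seq.concern_labels),
        "customer_verified": ctx.customer_verified,
        "stated_use": ctx.stated_use,
        # Any label, or an unverified customer, routes the record to a person.
        "needs_human_review": bool(seq.concern_labels) or not ctx.customer_verified,
    }


if __name__ == "__main__":
    unlabeled = ScreenedSequence("seq_001", "Escherichia coli")
    labeled = ScreenedSequence("seq_002", "unknown", ["example_label"])
    context = OrderContext(customer_verified=True, stated_use="enzyme engineering")
    for record in (unlabeled, labeled):
        print(summarize_for_reviewer(record, context))
```

The final determination stays with the reviewer, who can weigh the labels against the application and any other context that the software cannot see.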
If you have any questions about that, please feel free to reach out to me. I’m happy to talk with you more about any of these concerns that anyone has. Wherever you are coming from today, we believe that this course is going to be valuable to you and you’re going to get something out of it. So if you are a bioinformatician running in-house command line tools, I definitely think this course is going to be useful to you because we’re going to be telling you all about an open source tool that you can integrate into your pipeline for biosecurity sequence screening. So you will enjoy hearing about SeqScreen and learning more about how you can use that and how to interpret its outputs. If you are a biologist who is interested in interpreting outputs either from commercial software or from open source software like SeqScreen, I think you will also get something out of this course. Even if you’re not going to be running the software yourself at the command line, it’s very important as a biologist that you understand how the interpretation process goes, because that may be something that you’re asked to weigh in on as a subject matter expert, or maybe you’re going to be asked to help determine how that automated process is going to work for your institution in making those threat calls. And these pieces of information that we’re going to share with you are going to be important factors in that. So I think it’s going to be great for you to work through examples of those concepts and learn how others are interpreting these outputs. That’s going to be a wonderful experience for you and the work that you’re doing, which is incredibly important.
If you are a commercial entity who happens to be outsourcing everything, from the mechanics of sequence screening to actually interpreting the outputs, and you are hands off in all of those categories, I think this is still going to benefit you, because from a management perspective, I think it’s really important to understand what you’re managing at least at some level. You should know what’s going on, and that’s going to help you to be a better manager of what’s happening. When things are complex or the answers aren’t clear, you’re going to have a better appreciation for why that’s happening, and you’re going to be able to help guide decision makers through the process that needs to happen when those results come back as yellow or ambiguous. And so if you are in that boat, I think you’re also going to benefit from this course and just learning more about all of this and seeing how you can be a better manager of this whole process.
I definitely want to acknowledge everyone that has contributed to this course. Thank you so much to all of my collaborators, including Todd from Rice University, Kevin from Aclid, Todd’s team of graduate students and postdocs who have contributed to SeqScreen software, particularly Advait, who spent a good deal of his PhD writing this code for us. Thank you, Advait, as well as Gene Godbold from Signature Science, who has done a great deal of biocuration, which we then use to help expand on with machine learning. He continues to do biocuration and pioneer the way for different ontology frameworks for describing pathogenesis. So thank you so much, Gene, for all of your work. You’ll hear a guest lecture from him as well. In addition to Todd and Kevin throughout this course, Beth from BioMADE is also going to give a lecture. She is the director of 4S, and you will hear more about her role and how this is relevant to BioMADE shortly. And I also wanted to thank all of our different contract program management team and technical leadership team at BioMADE. Thank you to Steve, Louise, Kristen for all of the work that you have done to help support us and make sure that this course is staying on track and staying relevant to the bioengineering community. And special thanks to Dr. Matthew Sharkey of the U.S. Department of Health and Human Services for giving a guest lecture during this course. We appreciate you helping to promote and teach the concepts in your guidance document and continue to push the community towards best practices in biosecurity for sequence screening. So thank you for taking the time to share those insights with us.
The course agenda for today is listed on the slide, but I definitely encourage you to go to our GitHub repository, where on the README you’ll see an overview of all of our different lectures as well as the exercises and case studies, the materials associated with them, and slides for people’s lectures. Whatever you might want to see, I think you’ll find in our GitHub repository, so please check that out for more detail.
And, also, at this moment I would encourage you all to go fill out the pre-course survey. The link is below and I would love to hear everyone’s feedback. There’s an open comments section at the bottom for feedback, so if none of the questions encompass what you want to communicate to us, we would love to hear whatever you want to share there. So thank you so much for taking the time to do that. And again, I hope you enjoy this course. We put a lot into it and we really believe it’s filling a gap where not a lot else is out there to help teach people how to do this. So, I hope it benefits you. I think it will. Alright, thank you so much.
Relevance of Sequence Screening to the BioMADE Community
I just want to start out by saying this is a really super exciting day. I want to give a little bit of history here. I remember the moment that Krista brought up this idea to Todd and me as an EWD proposal. It was kind of in the early days of BioMADE and we were collaborating on how we could make this sequence screening more accessible and applicable to industry, and so the BioMADE opportunity was awesome. And you know, this plan started then and I really appreciate the support of BioMADE along the way. Krista and the SigSci team have done a stellar job leading it. Thanks to all of the SigSci folks and the Rice team, to Kevin for stepping up with Aclid for the industry role, and yeah, to everyone who’s here. Okay, so once you start naming names, it’s everybody.
And I will start out here with a quick introduction. I’m trying to get a laser pointer up here, so sorry for the little delays that I may need. So yeah, there’s a lot on this quick overview of BioMADE. It’s a bioindustrial manufacturing ecosystem, and really we’re bringing together a diverse range of member organizations, industry, academia, and nonprofits, and we’re all working together with government to ensure (sorry, I’m still trying to get my laser to start up here, but maybe that’s not going to happen. Oh, there we go, okay). And really, we’re working together to bridge this gap between lab-scale research and at-scale manufacturing. So, you know, we want to bring bio-based products into everyday use to help solve problems, all towards securing America’s future through innovation, education, and collaboration. And the mission also includes ensuring that the biomanufacturing workforce of the future is prepared and inspired and ready to fill the new jobs. So the vision is to build a sustainable U.S. end-to-end bioindustrial manufacturing ecosystem that will benefit all Americans. Some logistics: it’s a public-private partnership launched in 2021, catalyzed and sponsored by the DOD, and it’s one of the 16 Manufacturing USA innovation institutes. So there are almost 200 member organizations now across the U.S., and really we’re propelling new biotechnology products from the laboratory to everyday, safe, sustainable solutions for everybody.
Okay, so, what do we mean by bioindustrial manufacturing, or sometimes just biomanufacturing? So, really, we’re using biology to solve problems and fill needs. So, living organisms can be coaxed to make new products that are sustainable and environmentally friendly and that can be used by all Americans, so equity is important. Biomanufacturing is key to advancing the bioeconomy, which is forecast to be up to four trillion dollars a year within 10 to 20 years, and those numbers may be updated now. So, these are the bio-based products that will help keep supply chains resilient, counter the impacts of carbon emissions on our environment, and protect the health of everyone. The range of applications is really huge, limited only by imagination. Really, the products that members of the ecosystem are working together now to manufacture include clothing, bioplastics, skincare, concrete, and sustainable aviation fuel. So, it’s really amazing what you can do when folks join forces and work together. Okay, so, the BioMADE approach and the source of the acronym: Manipulate, Accumulate, De-risk and Execute are the four areas of action to bridge this gap from lab to real use. I won’t talk through those four areas in detail, but I want to point out really why we’re focusing on that mid-range.
So, there have been billions in federal investment, careers, and tremendous and ingenious efforts in R&D for biotechnology over the last 10, 20, really three decades. But this innovative technology hasn’t always been able to get past the scale-up needed to reach commercial readiness and get into the hands of users. And so, it’s called the “Valley of Death”. And that’s where BioMADE is really focusing. It’s in small letters here, but you may be able to see the biomanufacturing readiness levels, which correlate with the technology readiness levels, TRLs, that some of you may be familiar with. And a paper produced by the BioMADE team with considerable input from members describes these bio-MRLs. But we don’t work in a vacuum. So, you know, especially for biosecurity and other components of 4S, which I’ll explain here in a bit, we must consider this entire manufacturing life cycle, with transparency and communication, so that we’re all driving to the end goal of a safe and secure bioeconomy. And so, we have at BioMADE three interconnected focal areas. Technology and innovation, which is the largest group, is driving the advancements and helping realize the potential of that biotechnology innovation. And then education and workforce development is ensuring that we have the people and the resources as we scale up fast across the United States, and that these opportunities are available for all. And then there is 4S, which is Safety, Security, Sustainability, and Social responsibility. Really, they underlie all the activities and pursuits of the BioMADE ecosystem. And this course, this biosecurity training, really helps meet the goals of all three areas. And it’s also relevant across all the MRLs I just showed on the previous slide. And so, I will give just a brief overview of the three areas and then highlight the relevance of biosecurity sequence screening.
So, there’s more information on the website and other presentations about BioMADE and the activities of the different areas.
And so, technology and innovation. In a nutshell, it reduces the barriers to scale-up and commercialization, and it strengthens the mid-MRLs, as I mentioned, through these major focal areas. And really, I won’t talk through all of them, just point out that they all involve, or are impacted somehow by, microbial manipulation or evolution, really, at some level. So that’s what we want to think about with sequence screening, you know, genetic change at some sequence level. So mindfulness of what we are producing, what the intermediates and the products are, and what they are capable of doing is the relevance there. And so, education and workforce development. Kristen mentioned she’s with this group, and this is the group that is sponsoring this project. And we all recognize that the rapid advancement of biomanufacturing is opening up the need for an inspired, trained, prepared workforce, leveraging all the talent and individuals across the country. And BioMADE is building the workforce of the future through creative, inclusive partnerships. And of the three areas here, this course is relevant to all of them, starting with career awareness, and that includes careers in biosafety and biosecurity, as well as innovative education and professional development. Biosecurity is really important for all of these, and the course fits into all three.
And so this brings us to this third area, which is what I direct. I’m very proud to be a part of the 4S program. And the objective, in a nutshell, is really that 4S is building mechanisms and partnerships to enable the entire bioindustrial ecosystem and beyond to embrace and integrate safety, security, sustainability and social responsibility into all biomanufacturing pursuits. And we have three areas of action, highlighted here, in what we’re building and doing. With the BioMADE team and the members, we’re integrating 4S in all projects and activities, which involves member engagement and empowerment, providing resources such as this course, and developing frameworks and guidelines to support 4S for the members. And then there are public engagement and communication pathways, so that we don’t stay within BioMADE, involving everyone in the communication and also interacting with regulatory agencies, and I’m grateful that Matt Sharkey is here today. More about that later. So the 4S landscape here on the bottom covers a lot of ground, and this course really supports all of these areas. Briefly, for safety and security, it’s clear why biosecurity sequence screening is important: to block risk scenarios and protect against harm to people. And sustainability means we protect against harm to the environment. Also, we need to ensure and demonstrate the safety of bioproducts, so that biomanufacturing is understood and accepted by consumers, for economic sustainability. And social responsibility is broad and doesn’t fit in a box, but learning and sharing tools to empower responsible use of our technology belongs here, including adherence to regulatory guidelines and best practices. I’ll also point out that Matt Sharkey has been responsive and forward thinking towards biosecurity as the technology is evolving.
Okay, so I wanted to zoom in on biosecurity. What does this mean? For a high-level definition, there are many variations out there, but it includes measures to protect humans, animals, plants, and other living organisms from potential harm from a biological agent. In other words, biosecurity manages and guards against biorisk. What do we mean by biorisk? At a basic level, a biorisk scenario carries potential harm caused by a bio agent. We have lists of known biorisk-causing bugs and harmful biomolecules, namely toxins. These are just a few examples pictured here. But we must also consider novel organisms or products that we haven’t seen before. They might arise from natural sources that we’re not currently aware of, or from engineered sources. It’s this novel risk from engineered sources that is especially important to bioengineered technologies. In order to be responsible in biosecurity, we really ought to address two key questions. I have them here, and I’m going to elaborate a little bit on them.
The first one is, where does a biorisk come from? There are really two main ways, deliberate or accidental, or we could say mischief or mistake, or, in other words, bio-terror or bio-error. In biomanufacturing, we don’t think the risk of nefarious attempts is large. We’re all working towards a common goal, plus we’re typically not working with pathogens but with benign organisms. So this accidental creation of a biorisk is important. We need to consider all sources.
Then the second bullet is really what we’re doing today. We need to be able to identify the biorisk. What does it look like? That can be a moving target because science is moving fast, and biosecurity must keep pace. We’ll be addressing that also as we go forward. On this question of how we recognize biorisk, I’d like to highlight a study that was done back in 2018 and is still very relevant. Some of you may have been involved in this. It’s from a National Academies of Sciences committee in 2018. They laid out a process for evaluating biorisk potential in new synthetic biology technologies, in the report “Biodefense in the Age of Synthetic Biology”. It provided an often-cited framework for risk assessment, including various means of bioengineering and genome engineering. The report highlighted drawbacks in the current pathogen-based, “bad” gene-based risk assessment. We’ll hear more today from Gene that risks are not only related to pathogens. We need to consider the functional genotypes that may be harmful. I’m thinking ahead to what Gene is presenting. You can look forward to that. This is a daunting task. We look to the community for help with that. To assess biorisk, we really need to consider what an organism can do. That may be more important than what it is.
I also want to mention a paper from just last week that is building on this framework, “Safety by Design: Biosafety and Biosecurity in the Age of Synthetic Genomics”. James is one of the co-authors here. There may be others here too that were involved in that. SeqScreen is mentioned, as well as the guidelines we’ll hear about from Matt later. It’s still very timely and relevant. Also very recent, I want to highlight the exciting news last week, the announcements by the Biden-Harris administration during the presidential forum on advancing biotechnology and biomanufacturing innovation. It was last Wednesday. I think we all recall the executive order on advancing biotechnology and biomanufacturing innovation for a sustainable, safe, and secure American bioeconomy, which was last fall. It was also the day that this project was announced in September. This is an all-of-government, all-of-country approach to really ensure biomanufacturing. It is the way of the future to solve many of our problems. Last week marked the release of the strategies from the agencies, in this all-of-government approach, to fulfill the executive order. In that, to the point of this slide, there are priorities related to safety, security, sustainability, societal impact and responsible use of our technology. They came up often and they’re highlighted in the report document, “Bold Goals for U.S. Biotechnology and Biomanufacturing.” Safety and security were addressed by all the agencies, and education and workforce development came up.
And there are a couple of quotes I want to pull out that really highlight the importance of sequence screening. “Biologic materials and systems will be used in new ways while being engineered to make both new and familiar products”. Biosafety and biosecurity should be included in these new technologies and processes early in the design phase. And then another one, “Mitigating the risks of accidental or purposeful design or release of harmful organisms requires an expanded evidentiary basis to enable the prediction of biosafety risks of any synthetic sequence part or organism”. So, there’s a tangible action to screen for predicted functions and pathogens, and we should look closer at what we’re finding. So, there’s going to be a more detailed plan for strengthening and innovating biosafety and biosecurity that will be forthcoming. And lastly, the last point here: the collaborative spirit and strength of the BioMADE ecosystem was highlighted in the announcements last week. So, we have important work to do.
Okay, so where is bioengineering occurring in biotechnology and biomanufacturing, and where are the opportunities for sequence screening? That is really what I wanted to highlight here. And so on the left are some of the application areas of bioengineering. And you can think about the bioengineered sequence. The sequence is involved in or impacted by these different applications: strain engineering in particular, genomic sequence analysis, bioprospecting (foreshadowing what is coming today), protein engineering, and then, in a sense, some of the products of biomanufacturing, the strains themselves. I’ll point out how oligonucleotides and CRISPR involve repair templates, which may be benign sequence but may have an insert of a foreign sequence. And so what could that do to the organism? These are all important. And then there are some others that are not represented here where functional screening plays a role, such as biosurveillance or looking for contamination in biochemistry and so on. So you can be thinking about all the different parts of the workflow through the MRLs where it will be important to carry out sequence screening.
All right, so moving on. I wanted to look closer and highlight SeqScreen, which is the open source tool that we’ll learn today, and some of its value to industry. Well, first I’ll highlight, and I’m sure it’ll come up later, that it was supported initially by IARPA, the Intelligence Advanced Research Projects Activity, under the Functional Genomic and Computational Assessment of Threats program, really for the purpose of biosecurity and biorisk prediction. And so it can be used to help demonstrate the safety of bioengineered products and to help meet regulatory guidance criteria, because it does screen against the controlled agents, which you’ll hear more about too. Identifying potential biorisk sequences was the goal towards biosecurity, but it will also help predict biological function. So the use of the Pfam and Gene Ontology terms that you’re hearing about is important beyond just biosecurity, for all bioengineering, to advance product development. And then there are some of the practical attributes: the ability to screen hundreds of thousands of sequences in parallel, and a variety of sequence inputs, which will be seen in the case studies today. It can be downloaded and is accessible to any researcher, as we’ll all do here.
And I think the minimal cost of being open source is a huge asset, because cost is a barrier for biosecurity in industry. And so, yeah, the buy-in is important. All right, so my final slide. I’m probably getting close to time here too. I want to end by emphasizing that biosecurity is a shared responsibility. BioMADE is built on collaboration, transparency, and working together. Let’s bring all stakeholders to the discussions and to the work and all the advancements in training, enabling bioengineers to take ownership of the safety of the engineered molecules they are producing, and to work together to share insights and strategies. And so that’s why it is so valuable that we’re all together here today for this. Sequence screening needs to be considered at all components of the workflow in biomanufacturing, starting with early research and development. And BioMADE is building and empowering the workforce of the future. And this innovative, accessible biosecurity training is a key part of that. And so again, it’s about integration across all of the activities of the bioindustrial ecosystem. And so that is the end of this slide. Any questions on BioMADE or bioindustrial manufacturing? Okay. And thank you.
Screening Framework Guidance
Okay, so thanks a lot for having me today. I’m Matt Sharkey, biologist at ASPR. I’ve introduced myself before. I want to say that I’m particularly happy to be here because when I was a much younger man with many fewer gray hairs, I worked for Gene at Battelle on these issues. And that was one of the most formative jobs I’ve had, and I took that experience through several subsequent jobs, and was lucky enough to have ended up working on the federal guidance that suggests that these types of screening protocols be used by providers of synthetic nucleic acids. So I’m going to give an overview of the 2010 guidance from HHS, and then I’m going to give a forward-looking summary of changes that should be expected in the guidance in the coming year and the issues that have come up given the evolving technology landscape.
So, kind of “the why”: why does the federal government want providers of synthetic DNA at this point to screen for sequences of concern? Probably you guys don’t need this background, but prior to the advances in synthetic biology that occurred over the past 20 years (30 years, really) there was access control inherent in the availability of entire pathogens, especially entire viruses. So, prior to 2001, when the first synthetic polio virus was produced in HeLa cells, we didn’t have to worry that, if we had access control for a small RNA virus, people who were unauthorized to obtain those viruses could get their hands on them. After the synthetic polio virus work was published in ’91, and then a synthetic polio virus itself was produced in 2001, there was more of a concern that the select agent program and the Department of Commerce’s control list would not be able to prevent unauthorized people from accessing these agents. So, I’ll talk subsequently about the advances in 2017 that led to a revision of the guidance, but… (Sorry, you guys see the wrong set of text). Okay, so in 2010, in order to reduce the risk that unauthorized individuals or individuals with malicious intent could obtain regulated agents (those listed by the Federal Select Agent Program or by the Department of Commerce through the Commerce Control List) through the use of commercial DNA synthesis, the Department of Health and Human Services issued a screening framework guidance for providers of synthetic DNA. It was kind of a soft touch. It’s not in any way a regulation, it isn’t even tied to funding; it is a suggestion to the synthetic double-stranded DNA industry that they be on the lookout for sequences that could encode regulated pathogens.
I’ll mention that the reason this is not a regulatory framework is because the Federal Select Agent Program statutorily only regulates nucleic acids that could themselves give rise to select agents, so that would be full-length genomes of negative- or positive-sense single-stranded RNA. Those genomes themselves are regulated as select agents, but the components of any select agent are not themselves regulated entities. I think it wasn’t entirely clear in 2010 what kind of impact this would have on the DNA industry. But in subsequent slides I’ll point out that I think the guidance itself provided a good benchmark for the industry and has led to wide uptake, thanks to our colleagues at IGSC. James, thanks for that.
So, oh, I’ll mention, before, Beth was overly kind and said that Matt Sharkey was very forward thinking in framing the future perspectives of guidance. I’m just the scribe. I write down what my colleagues from the interagency come up with and I try to aggregate it all into a policy document, so I don’t want to take any credit for any of this. Also, I wasn’t involved in the 2010 guidance; that was Jessica Tucker, who led the process on behalf of ASPR. She is now director of the Office of Science Policy over at NIH. So the 2010 guidance encourages providers to screen 200 base pair windows for regulated sequences. As I said, it’s voluntary, and throughout it encourages providers to know who their customers are and, if they’re providing their customers with a sequence that contains a 200 base pair window that is unique to a select agent, or toxin, or Commerce Control List agent, to do some follow-up and make sure that the customer is legitimate and has a legitimate use. If malicious intent is suspected, it says you should call the FBI. My FBI colleagues have let me know that they’re contacted about these only a few times a year. It’s not every week that they get a call like this. So it looks as though there’s not a huge problem with people with malicious intent trying to obtain synthetic nucleic acids. I’ll talk later about the increasing availability of bench-top nucleic acid synthesizers and how we’re trying to address that market. So thank you to the International Gene Synthesis Consortium, which self-reports that they encompass about 80% of commercial gene synthesis capacity worldwide. I don’t think that they, or we, know that that’s the case, but they’re certainly a big organization. They have all the big players in synthetic double-stranded DNA provision, and they require a harmonized screening protocol that is consistent with the recommendations in the 2010 guidance.
So through the International Gene Synthesis Consortium, and without US regulations, international regulations, or a prescriptive funding mechanism, IGSC and others have managed to do a great job encouraging uptake of the guidance, both in the US and internationally.
I’ll mention that although HHS and the federal government have not pursued a regulatory approach towards this guidance, states have. So California passed Assembly Bill 70 in 2021. It was vetoed by Governor Newsom, who (I’ve read a summary of his remarks about this) didn’t want California to have a regulatory framework that was out of step with the rest of the United States. I’ll note that California has regulatory frameworks that are out of step with the rest of the United States on many other issues, including carcinogens and vehicle emissions. But this is an emerging field. I think the risk is felt acutely by political leaders, so I understand their reasoning. As an employee of the Federal Government, I have absolutely no position on whether Maryland or California should adopt such regulations. But I want to mention, because I think it’s relevant to the group, that this has been investigated by states. There’s a great overview of this from West and Gronvall, published in 2020, talking about the bill in California. I’ll also mention that G.K. Gronvall spoke in favor of the Maryland bill in 2021 in front of the Maryland Assembly. Okay, so, I’ll mention also that those bills themselves pointed to either the IGSC harmonized protocol or to the HHS guidance. I think the Maryland bill pointed to the HHS guidance, whereas the California bill pointed to the IGSC protocol. Once again, neither of these is law in either state.
So, in 2017, Canadian researchers published an article indicating that they had purchased synthetic double-stranded DNA from a company, I think it was a Dutch company out of California, so it was GeneArt, I think it was Novartis, could be in the future. But in any case, they ordered synthetic double-stranded DNA that they used to chemically synthesize horsepox virus. This was a significant and unexpected advance in that horsepox virus has more than 200,000 base pairs in its double-stranded DNA genome, and it has some structures similar to telomeres that were thought to be difficult to reproduce in the laboratory in order to give a transfectable full horsepox or orthopox genome that could generate horsepox virus. This, of course, is important because horsepox is 95% similar in sequence to variola virus, which causes smallpox, and it is interesting for the application of the 2010 screening framework guidance. Horsepox has many 200 base pair windows that are shared with smallpox, meaning that they are identical in sequence between the two viruses. But since the guidance says that these 200 base pair windows must be unique to regulated agents, those windows are not unique to smallpox; they are of course also found in horsepox. In follow-up, it turns out that everyone was doing their job. GeneArt flagged the order, and they realized that these sequences were smallpox sequences. They understood the guidance well enough to know that since they are also horsepox sequences, there is no violation of the guidelines in that case, and they also knew that they were providing these sequences to legitimate researchers, and so they went ahead and provided them. The interagency at the time noticed this and thought, like the little bullseye diagram on the right, we have a set of layers in how we want to effect best practices in this industry.
We have regulations, which are the export controls and the Federal Select Agent Program; we have the NIH guidelines, which are tied to the funding mechanism; and we have the sequence screening framework for providers. And we were thinking, what is the best way to address this? Nobody really wants to expand the select agent list to encompass a bunch of pathogens that are not essentially former BW program pathogens.
But there are definitely risks associated with the provision of sequences from unregulated pathogens that can contribute to pathogenicity or toxicity, and so we decided to pursue revisions of the 2010 guidance that would encompass and mitigate the risks associated with this work and work like it. And the impact, the “so what” about the 2017 horsepox research, is that there are probably only a couple of thousand people in the world who could carry out the work given what was published, but that is not just a couple of people, and it lets people around the world use synthetic double-stranded DNA to potentially recreate smallpox virus. Smallpox virus is the most regulated and access-controlled material in the world. Of course it only exists in two laboratories, one in Novosibirsk and one in Atlanta, and it is heavily regulated by the Biological Weapons Convention, so having this kind of material outside of access control puts everyone at risk. So that is why we sought to address this.
Of course there are other changes in the biotech landscape that we wanted to address and mitigate with any revision to this guidance: the increasing ease of large-scale assembly, the increasing ease of conversion between different types of oligonucleotides, and the falling cost of synthesis. There are way more companies producing these materials in 2022 than there were in 2010, the market is expanding rapidly, and then there are completely new risks such as bench-top synthesizers and an emerging DIY community. I’ll say that as we considered all of these things we tried to take as light a hand as possible, and, I’ll explain, we tried to spread the responsibility out across all of the entities that are involved in the provision of synthetic nucleic acids, in order to take some of the burden off of providers and also in order to focus on building a culture of responsibility around what we are going to term sequences of concern.
So we issued a couple of Federal Register notices to get public feedback about whether, and how, we should modify the 2010 guidance. So in 2020, we issued a Federal Register notice that asked a lot of questions about, you know, in these five or six different topical areas, what could or should we do to change the current guidance? And then in 2022, after deliberating on the feedback from that first Federal Register notice for several years, we published a draft proposed guidance and once again solicited public feedback on whether we’re on the right track. We held a workshop in 2022, the EBRC helped us out with that, and we got a lot of great feedback. And as we have considered finalizing the draft and issuing this new guidance, we’ve reached out to some people who are on the call here, including Sig Science and Aclid, in order to figure out if we’re right-sized and doing the right thing here.
So the links to those Federal Register notices are the short URLs. Okay, so these are the seven topical areas that we requested feedback on in 2020. That kind of explains itself, so we wanted to know how we could modify the guidance in these areas to address risk. And these are the broad areas where we modified the 2010 guidance. We are getting ready to issue a final guidance; the document is in clearance at the White House currently. So what we expect will happen on this topic is in no way meant to commit or to make evident the intent of the US government, but what we think is going to happen is the White House will approve this draft or a modified version of it, and then it will come to HHS where we’ll put it through clearance. Essentially we send it up the chain to our Assistant Secretary, see if they like what they see, and then we’ll issue it as HHS guidance. And this is kind of a sneak peek into what should be expected. This is very similar to the Federal Register notice that was published in 2022, but there are some modifications.
I’ll note that the scope has been expanded to provide recommendations for providers, manufacturers, third-party vendors, institutions, principal users, and end users. By which I mean we are asking users to preemptively identify if their order contains sequences of concern. We’re asking third-party vendors, principal users and end users to verify the legitimacy of the people they transfer materials purchased from providers to, and to record those transfers. We’re asking institutions to engage in some oversight of the use of bench-top nucleic acid synthesizers that are operated on their premises, and of course we’re asking manufacturers of bench-top nucleic acid synthesizers to start to verify the legitimacy of their customers. Not that we’re saying they haven’t been, but we didn’t recommend it before. So we’re just trying to be as proactive as possible in establishing best practices in this industry. So the definition of sequences of concern in the revised guidance will likely include all sequences that can confer pathogenicity or toxicity. This is potentially a problematic definition, which people smarter than me, such as Gene and the other bioinformaticists here, can speak to, and I’m going to have some slides to talk about how we hope to mitigate the potential challenges in the application of that definition. Of course the 2010 guidance is limited to sequences that are unique to Biological Select Agents and Toxins and to agents listed on the Commerce Control List. The types of oligonucleotides that are subject to the recommendations now include single- and double-stranded DNA and RNA. We’re suggesting that the sequence screening window decrease to 50 base pairs, given the increasing ease of making larger constructs of DNA.
And then we also suggest a screening window that’s very small for batch orders if those orders can be ligated and would have the appropriate overlaps for construction of larger sequences than our 50 base pairs would. So we also assume that that may be a little bit difficult for providers to implement. There are some recommended management and cybersecurity suggestions in the guidance as well that weren’t present in the 2010 guidance.
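The window sizes mentioned in this talk (200 base pairs in the 2010 guidance, 50 in the proposed revision) can be pictured with a short Python sketch. This is an illustration only, not part of the guidance or of any screening tool; the function name `windows` and the `step` parameter are assumptions made for the example.

```python
def windows(seq, size=200, step=1):
    """Yield (start, subsequence) for every window of `size` bases.

    Hypothetical helper for illustration: a sequence shorter than the
    window is returned whole so that short orders are still looked at.
    """
    if len(seq) <= size:
        yield 0, seq
        return
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


order = "ACGT" * 75  # a made-up 300 bp order

# 200 bp windows, as in the 2010 guidance (step of 50 chosen only for brevity)
print([start for start, _ in windows(order, size=200, step=50)])

# 50 bp windows, as in the proposed revision
print(len(list(windows(order, size=50, step=50))))
```

Each window would then be compared against a reference database; a smaller window means more, shorter comparisons for the same order, which is part of why the change affects how providers implement screening.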
So I want to talk about some expected difficulties in implementing. So the sequence of concern definition is inclusive of the 2010 definition. We suggest that at a minimum you adhere to the 2010 definition. And then as soon as it’s practical to do so, we recommend that sequences known to contribute to pathogenicity and toxicity, whether from regulated or unregulated agents, are treated as SOCs. So that’s kind of a subset of the BSAT or CCL sequences. In the 2010 guidance, it was just suggested that anything from BSAT or CCL that is unique to those agents should be considered a sequence of concern. Whereas now we’re saying, you know, if you’re going to go beyond this list, the only sequences that you should consider to be of concern are those that contribute to pathogenicity or toxicity. I’m going to leave it to the bioinformaticists here to let you know how you might want to do that. And I just want to point out that it’s further expanded in that we’re talking about pathogenicity or toxicity that threatens public health, agriculture, plants, animals, animal or plant products, or the environment. So that’s a huge definition.
I’ll note that it doesn’t seem that the US government is likely to host a unified screening database. We think that the US government will develop standards against which privately housed screening databases can be assessed for their suitability, something like an ISO standard, but probably issued by the US government, not by ISO. But that is, of course, a major outstanding issue: how could and should providers and users figure out what we mean by sequences of concern? We hope to mitigate the implementation of this by asking users to preemptively let providers or bench-top devices know if their order contains sequences of concern. So they should preemptively say, yeah, I’m ordering a sequence of concern. You know, and here’s some information to verify my legitimacy. We think that users are best positioned to know if their sequences are sequences of concern. I mean, as a former laboratory scientist, I knew whether I was working with a viral envelope protein or a bacterial antibiotic resistance gene. That should be clear to any scientist who’s placing these orders.
So we also are hoping to address the risk associated with bench-top synthesizers. So this guidance indicates that manufacturers should screen all customers purchasing bench-top nucleic acid synthesizers and validate customer legitimacy, and they should only provide the synthesizers to customers that have legitimate mechanisms in place to ensure that they are only operated by legitimate users. We are also suggesting that the devices contain either hardwired or cloud screening capabilities and that they not be able to manufacture SOCs, at least, without that screening taking place. We have been meeting with the bench-top synthesis community to determine if that is possible. It turns out that all of the bench-top synthesizer manufacturers that we’ve met with already have a mechanism like this in place insomuch as they’re trying to comply with the 2010 guidance. We understand that this mechanism becomes more complicated as they try to implement the revised definition of sequences of concern.
Okay, so there are some questions here. You know, we also try to establish an institutional responsibility to ensure that only legitimate users access the synthesizers and are able to manufacture sequences of concern. There are issues of how to deal with equipment sold prior to the guidance. One of the ways to deal with this is to verify legitimacy in purchasing reagents, especially reagents that are solely used in bench-top synthesizers. And then we understand that some institutions have institutional security, biosecurity or cybersecurity policies that prevent their devices from being able to access the internet. So we’re suggesting that the manufacturers come up with work-arounds, and we’ve spoken with some manufacturers who already have work-arounds for those issues in particular.
So the last thing I’d like to point out, and it looks like I’m running out of time, that’s great, is that we included something in there that I think is pretty forward looking. It didn’t come from me, it came from my national security colleagues at FBI and DHS. And it’s that, you know, with the increased ability of machine learning algorithms to produce functionalities using entirely novel amino acid sequences, we expect that some actors with malicious intent may try to submit sequences that are completely unique, but which encode protein functionalities that are not themselves unique, you know, like an analog of a bacterial toxin that has a completely novel sequence. And so we’re suggesting that the industry work to develop algorithms that can predict the structure and the function of sequences that aren’t known otherwise. And just today, a fantastic group from Collaborative Pharmaceuticals, King’s College London and the Spiez Laboratory in Switzerland published an article. I’ve got to move forward a slide, sorry. There’s ChatGPT for biology. What could go wrong. It’s in the Bulletin of the Atomic Scientists and it points to two articles published in 2022. They describe deep language models for protein design, with published results showing that they are able to make functional proteins, or regenerate the functions of natural proteins, with extremely low identity in protein sequence to those proteins themselves. So I think that this kind of validates where we were thinking, that it’s increasingly becoming possible to design new sequences that have a function that is already known, and we envision that actors with malicious intent may use these models to order bad functions or sequences of concern, functional sequences of concern, that have unique sequences.
And I would like to point out that some of the same individuals involved in that, including the folks from the Spiez Laboratory and King’s College, published an article, at the top here, in Nature Machine Intelligence about the dual use of artificial intelligence for drug discovery, in which scientists at Collaborative Pharmaceuticals in North Carolina used, or intentionally misused, a toxicity-predicting machine learning algorithm to generate a bunch of novel chemical weapons structures. So we think that the risk is very real, and we’re pretty sure that the industry will act totally responsibly to mitigate these risks, and we are so grateful to have so many talented people working on this. That’s all I got. If you guys want to learn more about ASPR, here are some links for you to look at. Feel free to email me. I’ll say that I co-led the development of this revised guidance with Dr. Miriam Blackfish-Feely, who’s also at ASPR, and Miriam and I are going to field any questions that anyone has about this. And here’s my disclaimer once again, and I’ll stop sharing this screen. So thanks a lot for your time today.
DNA Screening Concepts
Read video transcript
In this section, we’ll be chatting about DNA screening concepts. You’ll be using some of these later in the hands-on exercises, and this should help frame how some of this is implemented. I work at Aclid, where we’re building security and compliance automation for gene synthesis and governments. We help identify pathogens and toxins in DNA sequences that are either synthetic or natural, and we help industry with customer verifications and compliance checks, really helping automate a lot of the process, making sure they protect their liability and stay compliant.
When we think about biological risk, we often use a few buckets. Sequences in the red and yellow buckets typically derive from highly pathogenic agents, while sequences in the green bucket are either common to all of life or are unique to non-pathogens. Sequences in the red bucket are both from highly pathogenic agents and known to directly cause harm, such as by encoding a toxin or by being responsible for pathogenicity, for example, invading a host cell or evading the immune response. Sequences in the yellow bucket similarly derive from highly pathogenic agents, but don’t neatly fall into a bucket for pathogenicity or a bucket for housekeeping or metabolic genes. Their function may not be known, or these sequences might be in a gray area where a scientific opinion is required, depending on the use case. Sequences in the green bucket, on the other hand, are sequences that are considered relatively safe. These are either housekeeping or metabolic genes from highly pathogenic agents, so they’re conserved and maybe found across all of life, or they aren’t really found in a pathogenic agent, and so can be considered to be relatively safe.
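As a rough illustration of the three buckets described above, here is a minimal Python sketch. The `Annotation` fields and the function labels ("toxin", "virulence", "housekeeping", "metabolic") are assumptions invented for this example; they are not the categories used by any particular screening product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical summary of what is known about a flagged sequence."""
    from_high_risk_agent: bool
    function: Optional[str]  # e.g. "toxin", "virulence", "housekeeping", "metabolic", or None if unknown


def bucket(a: Annotation) -> str:
    """Map an annotation to the red / yellow / green buckets from the talk."""
    if not a.from_high_risk_agent:
        return "green"   # common to all of life, or unique to non-pathogens
    if a.function in {"toxin", "virulence"}:
        return "red"     # from a highly pathogenic agent and known to cause harm
    if a.function in {"housekeeping", "metabolic"}:
        return "green"   # conserved genes, considered relatively safe
    return "yellow"      # unknown or gray-area function: needs scientific review


print(bucket(Annotation(True, "toxin")))
print(bucket(Annotation(True, None)))
print(bucket(Annotation(True, "housekeeping")))
```

In practice the yellow bucket is where human review is concentrated, since the function is unknown or depends on the use case.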
Some of the most highly pathogenic agents are also found on regulatory lists. Many countries have export controls that govern what types of material can be sold abroad, and countries that participate in the Australia Group all agree to control a common set of materials at a bare minimum. These include countries in Europe, Asia, and North America, such as Canada, the UK, the US, Japan, Korea, and many others. In the US there’s additionally the Federal Select Agent Program, which monitors the possession and transfer of certain agents, including some of their DNA sequences, as well as a screening framework by the Department of Health and Human Services, which you may have heard about earlier in the course, that governs how the synthesis of DNA is done in industry and research.
When sequences are flagged due to regulatory or biological risk, it often warrants a review. The review is used to determine legitimacy and what if any safety precautions need to be taken. You’ll probably want to document the results to reuse for “look back” later, and these will oftentimes be used for standardization and for helping make sure that you’re conducting research safely and responsibly.
Effects of Sensitivity & Specificity on False Positives & Negatives
Read video transcript
Screening sequences is often done using bioinformatics tools that perform similarity searches, align two sequences together, or maybe even analyze common patterns found in the queries and references that are used in your database. There’s no best tool or pipeline, and each has their own unique tradeoffs, their own unique set of measurements and results. Typically, there are tradeoffs between power, between speed, sensitivity, and specificity, and while tools don’t share a common measurement scheme, there are some common identity metrics like query coverage and statistical measures like E-value. There aren’t any standard thresholds for these, and so a lot of times you’ll have to experiment to find which parameters work best or which thresholds will provide the best results for your use case. These can oftentimes vary between the parameters used, the database, and the tools. It really does depend on what particular use case you’re going to be using it for and how it works within your pipeline, within your unique set of sequences that you’ll be screening with your own unique set of reference sequences that are going to be in your database.
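To make the idea of thresholds on identity, query coverage, and E-value concrete, here is a small Python sketch. The `Hit` fields and the default cutoffs are placeholders chosen for illustration; as noted above, there are no standard thresholds, so real values would have to be tuned for a given tool, database, and use case.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one similarity-search result."""
    subject: str
    percent_identity: float  # 0-100
    query_coverage: float    # 0-100
    evalue: float


def passes(hit, min_identity=80.0, min_coverage=50.0, max_evalue=1e-5):
    """Return True if a hit clears all three (illustrative) thresholds."""
    return (
        hit.percent_identity >= min_identity
        and hit.query_coverage >= min_coverage
        and hit.evalue <= max_evalue
    )


hits = [
    Hit("reference A", 98.2, 91.0, 1e-60),
    Hit("reference B", 62.5, 35.0, 0.02),
]
kept = [h.subject for h in hits if passes(h)]
print(kept)
```

Loosening any of the three defaults lets more hits through (more sensitive, more to review), while tightening them does the opposite.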
Sensitivity and specificity can have complex effects on the rates of false positives and negatives depending on the tools used, depending on the databases, and on the use case as well. Sometimes small changes in one can lead to large changes in the other, so a small change in sensitivity can lead to a large change in specificity and vice versa. Because different tools rely on various heuristics to make them faster, there can be these nonlinear relationships between how one affects the other. In general, specificity and sensitivity are inversely related. With higher specificity, there’s a higher chance of missing something, leading to a false negative, while with higher sensitivity, there is a higher chance of matching to a sequence with low homology or a sequence that’s not actually related or similar, leading to a false positive. As much as possible, we want to minimize both. False positives and false negatives are in a natural tug of war, and so there is a natural challenge of balancing the sensitivity and specificity of different tools so that we can reduce the number of false negatives and false positives together.
Reducing false positives will reduce the burden for review, and so the less manual time that’s spent, the more likely you’re focusing on the most important risks. And reducing false negatives will reduce liability and overall improve compliance, making sure that you’re not missing anything and that you’re doing the best job, the most responsible job, to protect the business or to make sure your research is done safely.
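The tug of war between false positives and false negatives can be seen in a toy Python example. The scores and labels below are invented purely for illustration; sensitivity is computed as TP / (TP + FN) and specificity as TN / (TN + FP), the standard definitions.

```python
def rates(scored, threshold):
    """Return (sensitivity, specificity) when flagging scores >= threshold.

    `scored` is a list of (score, is_truly_of_concern) pairs.
    """
    tp = sum(1 for s, truth in scored if s >= threshold and truth)
    fn = sum(1 for s, truth in scored if s < threshold and truth)
    fp = sum(1 for s, truth in scored if s >= threshold and not truth)
    tn = sum(1 for s, truth in scored if s < threshold and not truth)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up percent-identity scores with made-up ground truth
toy = [(95, True), (88, True), (72, True), (70, False), (60, False), (40, False)]

strict = rates(toy, threshold=90)  # few flags: nothing benign flagged, but real hits missed
loose = rates(toy, threshold=65)   # many flags: nothing missed, but a benign hit is flagged
print(strict, loose)
```

With the strict threshold, two of the three true sequences of concern are missed (false negatives); with the loose threshold, all three are caught at the cost of one false positive to review.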
False Positives
Read video transcript
There are two major types of false positives: similarity based and function based. False positives due to similarity can arise from remote matches or from near neighbors, that is, matches between highly pathogenic agents and those that aren’t necessarily pathogenic or aren’t pathogenic at all. Some common reasons for this are inappropriate thresholds or very short sequences, which can cause spurious alignments, or sometimes the sequences themselves can be highly repetitive, leading to the wrong identification. False positives can also be caused by incorrect assessments of function. A sequence from a highly pathogenic agent may still be safe if it’s related to a housekeeping function, for example DNA repair or a ribosomal protein, or if the function or sequence is conserved across different types of organisms, some of which may be non-pathogenic or less severely pathogenic.
In the next few slides we’ll go through a few examples showing how this may show up in DNA screening. Here we have an example of a sequence that identifies as deriving from anthrax but the gene is a DNA directed RNA polymerase. DNA directed RNA polymerases are enzymes that help with synthesizing RNA from a DNA template and can often times be found in lots of different organisms. In this case while we have a match to anthrax we also have a gene that’s typically considered housekeeping.
In this next example we have a match to anthrax and also matches to other related organisms within the Bacillus genus, like Bacillus cereus. While some of them are pathogens and may cause disease, they’re not nearly as severe as Bacillus anthracis. These genes that are shared amongst them wouldn’t typically correspond with what makes Bacillus anthracis much more pathogenic than, for example, Bacillus cereus. And so in this case, given the best match being to anthrax and some of its relatives, this is an indication that the gene may be conserved and is probably not responsible for the main reason that anthrax is highly pathogenic.
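The near-neighbor reasoning in this example can be sketched in Python. The function name, the 2-point identity margin, and the hit values are all assumptions for illustration; a real pipeline would use its own taxonomy and thresholds.

```python
def shared_with_near_neighbors(hits, regulated, margin=2.0):
    """Return True if a non-regulated organism matches about as well as a regulated one.

    `hits` is a list of (organism, percent_identity) pairs; `regulated` is a
    set of organism names treated as controlled for this example.
    """
    regulated_ids = [pid for org, pid in hits if org in regulated]
    other_ids = [pid for org, pid in hits if org not in regulated]
    if not regulated_ids or not other_ids:
        return False
    return max(other_ids) >= max(regulated_ids) - margin


regulated = {"Bacillus anthracis"}

conserved = [("Bacillus anthracis", 99.5), ("Bacillus cereus", 98.9), ("Bacillus thuringiensis", 98.7)]
specific = [("Bacillus anthracis", 99.0), ("Bacillus cereus", 71.0)]

print(shared_with_near_neighbors(conserved, regulated))
print(shared_with_near_neighbors(specific, regulated))
```

A `True` result is only a hint that the gene may be conserved across relatives; it does not replace checking the gene’s function.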
False Negatives
Read video transcript
False negatives can be caused by alignment errors or by obfuscations in sequence that make it difficult to determine the correct organism or gene. Because many bioinformatics tools rely on heuristics to make alignments faster, tools might miss correct results or alignments or surface incorrect ones. New research can also expose gaps when new genes are discovered or annotated or new variants are found. On the other hand, with sequence obfuscation, a sequence can become so dissimilar from its original that tools might not be able to detect the similarity.
Sequence obfuscation is a complicated topic of its own, but some common examples are codon optimization, engineered plasmids, and oligo pool assemblies. These oftentimes have a good scientific purpose like optimizing yield of a protein or creating a custom construct that might be easier to assemble in-house. We’ll go through an example now of an initial DNA screening implementation that misses a controlled region followed by a reparameterization that corrects this.
Here we seem to have a caspase-5 homolog found in a few different primates with relatively high coverage of 72%. At first glance, this might be a unique protein that probably has some similar functions to the caspase enzyme but just hasn’t been annotated yet. Going to the next example, after reparameterization, we find that there’s actually a small 150 base pair region that matches to Rift Valley fever virus with high coverage and high identity. This was initially hidden by the much larger alignment to the caspase enzyme, and may have been caused by either too-strict thresholds for significance or so much optimization for speed or power that the heuristics missed the alignment altogether. Because we don’t always readily know where a gene starts and ends, we may need to use more conservative measures to account for these kinds of cases. This wraps up the section on DNA screening concepts. It should hopefully give you a good overview of things to watch out for and help guide you through some of the hands-on exercises.
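One way to picture how a short controlled region can hide behind a larger alignment is the Python sketch below. The hit coordinates and bit scores are invented to loosely mirror the example in this section, and `watchlist` and `min_length` are hypothetical names; this is not how any specific tool reports results.

```python
# Made-up hits on a 1,500 bp query: a long benign alignment and a short one of concern
hits = [
    {"subject": "caspase-5 homolog", "q_start": 1, "q_end": 1080, "bitscore": 900.0},
    {"subject": "Rift Valley fever virus", "q_start": 1200, "q_end": 1349, "bitscore": 250.0},
]


def top_hit(hits):
    """Naive reporting: keep only the single best-scoring hit."""
    return max(hits, key=lambda h: h["bitscore"])


def hits_of_concern(hits, watchlist, min_length=50):
    """More conservative reporting: surface every watchlisted hit of useful length."""
    return [
        h for h in hits
        if h["subject"] in watchlist and h["q_end"] - h["q_start"] + 1 >= min_length
    ]


print(top_hit(hits)["subject"])
flagged = hits_of_concern(hits, watchlist={"Rift Valley fever virus"})
print([(h["subject"], h["q_end"] - h["q_start"] + 1) for h in flagged])
```

Reporting only the top hit describes the query as a caspase homolog, while scanning all hits surfaces the 150 base pair region as well.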
Alignment to Biothreats – Presentation
Read video transcript
Hi everyone, it’s Krista and I am excited for us to get started with our very first exercise of the course. So let’s jump in. Before we start, I wanted to give you some foundational information about SeqScreen and how to install and run it. Todd from Rice University is going to give us a more detailed presentation in an upcoming lecture that will provide more detail about the SeqScreen components and an overview of how the software works. But at a very high level, I wanted to point out this set of publications. The one on the left is our first SeqScreen software publication that describes fast and sensitive mode. The one on the right is our latest publication that describes what we call SeqScreen Nano or ONT mode. ONT mode assumes that you’re working with long query sequences from a nanopore sequencing device that have multiple open reading frames within the single sequence, whereas fast and sensitive modes assume that each query sequence only has one gene or open reading frame present within it. Otherwise, the SeqScreen workflows are very similar regardless of what mode you run it in, and the outputs are also going to be very similar regardless of your mode. There are differences and we’re not going to go into all that today, but as far as the final reports go, you should see fairly similar information regardless of how you run the software.
The second set of publications that I wanted to draw your attention to is all about sequences of concern. So this is a set of publications that was authored by Dr. Gene Godbold and our biocuration team, and it’s going to describe how we are labeling and categorizing sequences of concern, why we’re doing that, and how that might be relevant to both work in understanding microbial pathogenesis as well as policy considerations. So please check that out if you’re interested in sequences of concern and some of the fundamental concepts behind what we’re doing in the SeqScreen software.
And then jumping right in to SeqScreen installation. This is going to be a very quick overview of how to install SeqScreen, and with these instructions I am also assuming that you have Linux servers or Linux resources and familiarity with the command line. If you do not have those things, it’s okay. You don’t need to install and run the software to participate in this course. We are going to provide you with the outputs, but I’d like you to pay attention for this part anyway, just so you can learn more about what’s involved in installing software and databases and running it and how we generated the outputs that you’ll be looking at in the exercise. So first things first, please look at the documentation for the software that is located in the link on this first bullet point. And within your Linux server or environment, you will need quite a bit of RAM and that’s because SeqScreen fast mode runs DIAMOND, which is a bioinformatics software package that requires quite a bit of RAM. We recommend at least 256 gigs for fast mode, but of course that can change depending on how large the files are that you’re trying to process. If you’re working with really large genomes, you may want even more than that. Today we’re going to be running pretty small data sets, so not that much RAM will be needed, but in general, if you’re using SeqScreen for other applications, that’s a consideration. As far as disk space goes, we recommend that you have about 234 gigs of disk space. At least that’s what’s needed for all the database and software dependencies that come with SeqScreen once they’re all uncompressed and set up. That doesn’t count how much space you might need for your output files. Again, for this course, our output files are going to be pretty small as well as our input files, but if you’re using SeqScreen in your work for other things, you should consider how much space you’d like those output files to take up.
There are a lot of intermediate outputs that SeqScreen will generate when you run it, and there is an option that you can set during execution to not save those intermediate files if that’s important to you and disk space is a consideration. Other people find those intermediate files to be really valuable for troubleshooting and looking for additional detail about alignments and other analysis that’s been done on the dataset. So, that’s up to you. I do recommend installing SeqScreen through Conda or Mamba if you are running at an institution that will allow you to use Conda or Mamba. That makes it very easy because when you install SeqScreen, you also automatically install all of its dependencies, and it’s not a complicated process to get it up and running. If you’re not able to use Conda, maybe you don’t have regular internet access at your location, I would recommend that you check out the BioContainers at Quay.io. I’ve included a couple example commands here for how you can download and run SeqScreen as a Singularity image or run it through Docker. The good folks at Quay.io and BioContainers have made this very easy so that any time there’s a Conda package submitted, they basically bring over a version of that here so that we can download it in other ways too. So, thank you to them.
And the final thing that I’ll say here is that if you are downloading the SeqScreen database, which you will need to do if you’re going to run SeqScreen, please make sure that you check the MD5 sum of the compressed database file before you continue. This is a big source of error for people in that this is a huge database file, and if there’s any kind of an interruption in your internet connectivity during the download and things become corrupted, the software’s not going to work. You’re going to get error messages. It’s going to be angry. So please check that MD5 sum before you continue or before you uncompress the database. But as long as it matches and everything is as expected, then go ahead and uncompress it and keep all the subdirectories exactly as they are after you’ve downloaded the database files. Don’t rename things, don’t move things. The software is going to have some expectations for where different files are located and what they’re named within the database directory. So don’t mess around with that. That’ll cause more errors. Other than that, it’s pretty straightforward to install and run everything. So hopefully you won’t have too much trouble getting it up and going. And thank you for giving SeqScreen a try. We’re always interested in feedback from users of our software. You can of course email us, reach out to us in whatever way you typically contact us, or preferably post issues on the SeqScreen GitLab since it’s a public repository. We welcome you to post issues on the GitLab and those can be about bugs that you’ve found or about feature requests that you might have if there’s something that you’d like the software to do that it doesn’t currently do.
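The checksum step described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of SeqScreen: the function names are made up for this example, and the expected MD5 value has to come from wherever the database download is published.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading matters here because the compressed database
    is far too large to load into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the file's MD5 with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If `checksum_matches` returns False, the safest course is to download the file again before uncompressing anything.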
So, within this exercise today, we are going to focus on TSV outputs. There are a couple different types of outputs that SeqScreen will give you in the report generation folder and the output directories. And I like the TSV files because they contain more information so they have more columns of output than the HTML outputs. However, the HTML has some positives. One is that it’s easier to look at as a human. If you’re not a computer, you will like the HTML output visually better than you like looking at this TSV file on the left. But an easy thing that you can do with the TSV is copy and paste it into Excel or look at it in another way that’s easier for you. The HTML also has other visualization options like this alignment that’s shown here on the screen so you can see how your query sequence in amino acid space aligns with the reference. And sometimes that is nice to look at visually just to get a sense of how similar your query is to the subject that it’s being aligned to. And that alignment visualization is not part of the TSV output. So there are extra things in the HTML, but by and large I prefer the TSV files.
Okay, I mentioned that the documentation for SeqScreen is available at the Wiki address below, but just to emphasize, the GitLab repo where the SeqScreen code is saved is connected to the Wiki. So if you go to the GitLab, there is a Wiki associated with this repository and that’s where you’ll find the SeqScreen documentation. This is an actively maintained piece of open source software, which means that people are often updating it, fixing things, adding new features. So it’s an actively moving target, but we do try to keep the documentation current with the software for everyone. If you have a question or you notice something about the code that isn’t clear from the existing documentation, feel free to reach out to us, post an issue and we will get that resolved for you.
Okay, so today we are going to be looking at a FASTA file of single gene sequences. And that is located in the GitHub repo for this course, so you should see a folder called single gene sequences. If you go in and look at that, there’s a FASTA file and that’s going to be the input for SeqScreen. So if you’re running SeqScreen, please go ahead and use that input file to run it. And the command that I used to run SeqScreen that you’ll want to replicate is located in single gene sequences and then in the single gene seqs pre-computed results folder. It’s that SeqScreen_command.txt. So you can copy and paste that, assuming you have internet access. I think part of that command did use an online flag, so an optional online flag. SeqScreen will natively assume that you do not have internet access, but if you do, there is an optional flag that you can run. So you may want to edit what’s in the command, depending on the infrastructure where you’re running SeqScreen.
Beyond that, I have included the results that I got from running these sequences in this pre-computed results sub-directory that includes the report_generation.zip folder. Essentially, that is the sub-directory within your SeqScreen output folder that contains all the final reports. So if you unzip that, you’re going to see all the native things that would come from SeqScreen running the specific file with that particular command. We are going to be running in something called sensitive mode. That means that we’re using BLAST instead of DIAMOND. And I’ve also run with BLASTN, which means we’re looking at the nucleotide level, as well as BLASTX, which is looking more at coding sequences. But in any case, the output files are all here. I have copied the SeqScreen report pathgo.tsv file into an Excel file and saved that on GitHub just for ease of interpretation. I highlighted that here because that’s in our next video. We are going to be looking at that particular file. So if you want to jump ahead, make sure you have that downloaded and opened up because we’re going to look at that next. But that Excel file was derived from this pathgo.tsv output. And then I also included single_gene_sequences_key. That’s not something that comes from SeqScreen. It is just something that I included if you wanted more information about where these FASTA sequences were derived from. All of the sequences originally came from public sources. Nothing is proprietary or has any kind of non-open source nature to it. Everything is public and we’re going to talk more about those results next. So see you soon.
Alignment to Biothreats – SeqScreen “SeqMapper Workflow” Results
Read video transcript
Hi everyone, welcome to the second portion of the first exercise. And we left off in the prior presentation by discussing where the FASTA file is located as well as the pre-computed results on our GitHub repository for this training course. So what we’re looking at now is that GitHub repository and the single gene sequences directory. You see here’s our FASTA file. If you’re not familiar with FASTA files, this is what they look like. There’s a header row where we have the name of the sequence, and then the sequence itself. And it’s just repeated as many times as there are sequences in the file. It’s very simple and this is the input file type that SeqScreen requires. So if you have a different type of file, you will need to convert it to FASTA before running it through SeqScreen. Okay, and then in the pre-computed results sub-directory, there’s a number of different things including the SeqScreen command that you can run if you are going to execute SeqScreen yourself. So you can just copy and paste this with the exception of this particular flag. If you don’t have internet access, go ahead and delete that. That won’t impact your results for this particular portion of the course, so no worries there. You can use all of the other portions of the command when you execute SeqScreen with this input file. If you are not going to do that, these are the output files in the report generation directory that you can download and unzip. Take a look at everything that’s in it if you’re interested in what would have happened if you had run the software. And this is more information about the sequences that were used for testing. But this is the file that I really want you to download and be ready to look at with me. So it’s called final_tsv_report_single_gene_seqs.xlsx, and it’s a spreadsheet. Please go ahead and download that if you haven’t already. To download something, you can just click on it and then download.
Or you can clone the whole repository, whatever is easiest for you go ahead and just grab that. Get it onto your local system and we will take a look at it in just a moment.
Before we do, I’d like to orient you to where we’re at in the SeqScreen workflow process. So this is the SeqScreen GitLab. And you’ll find this workflow figure both on the readme and in the wiki documentation. It gives you an overview of within Nextflow what the different workflows are that SeqScreen is executing to analyze your query data. The first workflow is an initialization step. And essentially all that’s doing is just making sure that your FASTA file is in a valid format. So is it formatted in such a way that there is a header row as appropriate with the sequence underneath each header? And there’s no duplication in the naming of the headers? That kind of thing. It’s checking to make sure that everything looks good so that it can be successfully processed in the downstream steps. What we’re going to really focus on in this exercise is the SeqMapper workflow, which is kicked off in sensitive mode directly after the initialization step.
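To make the initialization checks described above concrete, here is a small Python sketch that tests a FASTA file for the same kinds of problems: sequence lines without a header, headers without a sequence, and duplicated header names. It is an illustration written for this transcript, not the code that SeqScreen itself runs, and the function name is hypothetical.

```python
def validate_fasta(lines):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    seen_names = set()
    current_name = None      # name of the record currently being read
    current_has_seq = False  # whether that record has any sequence lines yet

    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if current_name is not None and not current_has_seq:
                problems.append(f"record '{current_name}' has no sequence")
            fields = line[1:].split()
            current_name = fields[0] if fields else ""
            current_has_seq = False
            if not current_name:
                problems.append(f"line {line_number}: empty header")
            elif current_name in seen_names:
                problems.append(
                    f"line {line_number}: duplicate header '{current_name}'"
                )
            seen_names.add(current_name)
        elif current_name is None:
            problems.append(f"line {line_number}: sequence before any header")
        else:
            current_has_seq = True

    # The last record also needs a sequence.
    if current_name is not None and not current_has_seq:
        problems.append(f"record '{current_name}' has no sequence")
    return problems
```

Passing an open file handle works because the function only iterates over lines, for example `validate_fasta(open("my_sequences.fasta"))`, where the file name is a placeholder.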
If you’re wondering what the different colors are in this figure, those correspond to different modes. So sensitive mode is green, fast mode is yellow, ONT mode is purple, and blue means it corresponds to all the different modes. So they all have those pieces. You’ll notice a lot of green boxes in this workflow figure, and that’s because not only does sensitive mode execute BLAST to give you more sensitive results, but it also has all of these optional pieces and extra modules that are going to give you more information about your query. All of that takes a little bit more time, a little bit more computational resources. But it can be worth it if you want to get as much information as possible from a query sequence. So when I have a small number of sequences that I want to analyze in as detailed a way as possible, I will always use sensitive mode. If I have a large number of sequences that I’m just trying to get through, I’m going to use fast mode, and if I have sequences that are long with multiple open reading frames within each query, I’m going to use ONT mode. So for this example, we have single gene sequences. We don’t have a lot of them. I’ve chosen sensitive mode, and we’re going to focus in on this part right here of sensitive mode in the SeqMapper workflow, where we take the query sequence and we align it in both nucleotide space and amino acid space. So Bowtie2 aligns in nucleotide space and RAPSearch2 aligns in amino acid space to this custom database of something we call BSAT hits or BSAT organisms. Essentially it means any biological select agents or toxins that were listed in the U.S. lists or the Australian lists, as well as commerce control lists. Basically any kind of sequence that’s publicly available that has been noted by a regulatory authority as being something that we should be paying attention to when we do this type of screening. So taxonomic lists, gene lists that are noted.
We put them in this custom database and do some alignments. This is a sensitive alignment. It is not a specific alignment. So that means we’re really just looking for anything that has similarity to a biological select agent or toxin. We are not saying that just because it does hit to this database that it is something that should be regulated or we should be concerned about because there are genes like housekeeping genes and other types of genes that are shared by organisms that are closely related to those biological select agents. So it doesn’t mean that we need to be concerned about it just because we get a hit here, but it is something worth noting. So we’re going to take a look at that next.
Let’s go ahead and open up that output file that I encouraged you to download. And then we will briefly give you an orientation to the different columns here. We have copied and pasted the output from a TSV report in SeqScreen into this spreadsheet and that TSV report ended in pathgo.tsv. And it is exactly the same as the normal SeqScreen final report. It just has this extra column appended to the end of it called PathGO. And I like this because PathGO is an abbreviation for pathogenesis gene ontology. It is an ontology that was developed by Johns Hopkins University Applied Physics Laboratory under the IARPA FunGCAT program, and Gene Godbold and others from our biocuration team collaborated with the PathGO developers to help review and refine this ontology. Essentially within the SeqScreen framework what PathGO means is if there is a PathGO annotation that appears with one of your query sequences, that means that there has been a manual annotation of that sequence for pathogenesis terms. So terms that describe why this sequence contributes to pathogenicity at the functional level. A human has gone through and reviewed that and recorded a PubMed ID that shares the experimental evidence for that particular function for that sequence. And there are not a lot of PathGO annotations just because of the amount of expert curation needed to assign these terms. Most sequences will not have a PathGO annotation and in that case you’re just going to see this little dash there. If it has been reviewed by an expert but determined not to have any PathGO term assignments, you’ll see a zero. So that is also worth noting. When we look at these two zeros, I would expect these to not be genes involved in pathogenesis and sure enough this is a gene that’s involved in the structural basis of the vaccine virus. So it’s not directly contributing to anything regarding pathogenesis. It’s just how the framework of that virus is created. And then this has to do with tRNA.
So basic processes or you might call this a housekeeping gene or something that’s basically not involved in pathogenesis. So that makes sense and this is something that I like to look at. So that’s why I copied and pasted this TSV report into this spreadsheet for us to look at together. Now let’s go back and look at a couple other fields.
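The three cases just described for the PathGO column (a dash, a zero, or one or more terms) can be read programmatically. The sketch below is an illustration only: the column names `query` and `pathgo`, the sample rows, and the term identifier format are assumptions made for this example, so check the header row of your own pathgo.tsv report before reusing it.

```python
import csv
import io


def interpret_pathgo(value):
    """Translate one PathGO cell into the three cases from the lecture."""
    value = value.strip()
    if value == "-":
        return "not curated"
    if value == "0":
        return "curated, no PathGO terms"
    return "curated, PathGO terms assigned"


# Tiny made-up table standing in for a real report (hypothetical columns).
SAMPLE_TSV = (
    "query\tpathgo\n"
    "seq_a\t-\n"
    "seq_b\t0\n"
    "seq_c\tPATHGO:0000278\n"
)


def summarize(tsv_text, name_column="query", pathgo_column="pathgo"):
    """Map each query name to its PathGO curation status."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[name_column]: interpret_pathgo(row[pathgo_column]) for row in reader}
```

The useful distinction is between the dash and the zero: a zero means an expert looked and found nothing to assign, while a dash only means nobody has looked yet.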
The first is query sequence. So if you remember back to when we were looking at the FASTA file, that header information is now listed here. So these are our query sequences. I’ve named these in an intuitive way for the purposes of this exercise so you can see where the sequence came from. These are all publicly available sequences. Some of them have gone through and been codon optimized for different organisms just to show how the results may change or not change based on codon optimization. And if that is true, if it has been codon optimized, you’ll see that here it’s been optimized for gene expression in E. coli K12. Here the same gene has been optimized for expression in matrices and so forth. So there is the query sequence name. Multi-taxid confidence is another thing that I think would be good to pay attention to. And the reason for that is that when we look at the BSAT_hit column, which is the output of SeqMapper, so SeqMapper is here, remember that we’re paying attention to whether or not something hits the custom database based on mapping in either nucleotide or amino acid space. And it’s a binary yes/no classification. So yes, it did match to something. No, it did not match to something. Here we see a lot of yeses because we’re looking at the Bacillus_anthracis genes and those are organisms that, at the taxonomic level, are on the biological select agent or toxin lists. So we see yeses. What’s interesting is that there are genes shared among the Bacillus genus and likewise, you know, other threat organisms have genes that are shared with non-threat organisms that are closely related to them. And we see examples of this here. So I have highlighted in red a couple of things just to illustrate this. These don’t natively show up in red when you get your SeqScreen outputs or you copy/paste things into Excel. I did this.
These are genes not from Bacillus anthracis, but we see NCBI taxids here that correspond to a Bacillus anthracis strain. This is Bacillus anthracis strain H9401. That’s what this NCBI taxid refers to. And then we see a colon and 1.0 meaning that this is a perfect match to this particular organism. It has high confidence that it belongs to this organism. But when we look at all the other information in this field, we see that it has high confidence in other things too, right? It’s equally shared across a number of different taxids. So that tells me that this is a conserved gene, likely among the Bacillus genus. Maybe further. I have not gone through and looked at all these different taxids, but I can tell you that a lot of these are Bacillus. So this is something that is shared between Bacillus anthracis, which is on the BSAT list, and these other Bacillus organisms. And so when we look at the SeqMapper outputs, we see yes, but that doesn’t mean that this is a threat. That just means that this sequence has similarity to something that is being regulated at the taxonomic level. So that’s a very important point. Hopefully that was clear. And everyone understands that we’re not working from the bsat_hit column to determine whether or not something should be concerning. We’re just paying attention to that as a piece of information that we may want to consider when we’re starting to synthesize our results and decide how to interpret them and reach a final decision.
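As a rough illustration of reading that taxid field, the Python sketch below splits a string of taxid and confidence pairs and reports which taxids share the top confidence. The comma-separated `taxid:confidence` layout is an assumption based on the "colon and 1.0" described above, and the taxids in the usage example are placeholders rather than real NCBI identifiers.

```python
def parse_taxid_confidence(field):
    """Parse 'taxid:confidence' pairs (assumed comma-separated) into a dict."""
    pairs = {}
    for item in field.split(","):
        item = item.strip()
        if not item or item == "-":
            continue  # skip empty entries and the 'no value' dash
        taxid, confidence = item.split(":")
        pairs[taxid] = float(confidence)
    return pairs


def top_taxids(field):
    """Return every taxid tied for the highest confidence, sorted as text."""
    pairs = parse_taxid_confidence(field)
    if not pairs:
        return []
    best = max(pairs.values())
    return sorted(taxid for taxid, conf in pairs.items() if conf == best)
```

For a placeholder field such as `"1001:1.0,1002:1.0,1003:0.5"`, `top_taxids` returns two taxids. Several taxids tied at the top is the pattern described above for a gene conserved across related organisms, which is why a "yes" in the bsat_hit column alone does not indicate a threat.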
Okay, so we see a lot of yeses. I’m looking for some noes, so we can just talk about those too. All right, so at the very bottom, we have some noes. These were not matching to anything in the biological select agent or toxin list. And when we look at these, what you might notice is that these all have in common that they come from eukaryotic organisms. So these are snakes, spiders, scorpions, and they produce toxins that obviously have some pathogenic effects because they’re toxins. However, eukaryotic toxins in these cases are not on any of those regulatory lists. So they may show up at the functional level as having some toxic capabilities, but they are not going to show up in our bsat_hit column here. And that’s really the distinction that I wanted to show you. Coming back to just the global idea here, SeqScreen is trying to give you additional information about sequences of concern. But while it’s doing that and sharing all this functional information with you, it’s also providing you with taxid matches, with bsat_hit potential, and lots of other things. So those are things to keep in mind as we move forward and you begin to interpret outputs. So that’s it. The basic premise of this first exercise is just to become familiar with the SeqMapper outputs and how those show up in the final report, and I hope that was helpful. And we will move on to the next module. Thank you so much.
What are “Bad” Sequences?
Read video transcript
So I’m going to talk about what a bad sequence is. And the work that we’ve done on FunGCAT informs basically this entire presentation. So one of the conclusions was that encoded biological threats are primarily, and also just in terms of sheer numbers, from microbes that cause infectious diseases and secondarily from metazoan, that’s animal mostly, and plant venom components and toxins. So the pretty pictures on the screen differentiate some kinds of toxins. So ciguatoxin, you see down there, it’s the complicated chemical figure, you know, that’s not an encoded toxin. You know, the organism, I think it’s a dinoflagellate, needs to build that thing from different precursors. On the other hand, dendrotoxin, which is from the handsome green mamba above the ciguatoxin figure, that is an encoded sequence and, you know, dendrotoxin is a really nasty, nasty toxin. The middle picture shows a castor plant, including the seed pods, and of course ricin comes from castor plants and those pretty seed pods. And if you mash them up and everything, you can make ricin, and that is an encoded sequence, while batrachotoxin and tetrodotoxin, those are not encoded sequences.
Next slide, please. Okay. So bad bugs, essentially, are parasites and of course you have to differentiate, you know, the parasites infecting what. As Matt referred to, we’re mainly concerned about those that affect the stuff that humans are interested in. So us and our livestock and our crops are the main things. So there are different kinds of parasites you can have. The ones that fall into the microbial category are viruses, bacteria, a lot of fungi, I don’t think that one in particular, and protozoal parasites. There’s also worms, like multicellular parasites, and I’m going to talk a little bit about worms. Actually, I’m not going to talk much about worms at all except maybe on the next slide. So these are things that live in or on larger hosts and cause disease, which is, you know, loss of homeostasis. So how many bad bugs are there? So there’s one trillion microbial species on earth. That’s microbes, so that doesn’t include the worms. Of them, 1300, roughly, this is my own accounting from the literature. There’s 1300 viruses, bacteria, fungi, protozoa and worms that can infect immune normal, that’s a term of art, immune normal humans. 300 of those are worms. So a thousand of them are microbes. An additional 700 bacteria and fungi can cause disease in the immune compromised, who are about 4% of the population. Immune compromised includes folks that have some barrier dysfunction or have HIV or are taking immunosuppressive drugs for transplants and things like that. Or there’s other diseases that can lead to immune compromise. Now, if you’re of that tiny, tiny, tiny percentage of people that has severe combined immunodeficiency, which is essentially no adaptive immune system at all, and because of the defects, the genetic defects, hardly any innate immune system, then there’s probably billions of microbes that can cause disease in that sort of human.
So the susceptibility of the host in terms of how good its immunity is does determine what kind of, how bad the bug is or what things can harm us.
Next slide, please. Okay, so how do hosts resist parasites? Obviously, I’ve given some of this away a little bit. So we’ve got barriers. We’ve got physical barriers that protect us, the skin, epithelia, from things that are going to try to get in. There’s lots of ways that microbes have to subvert those things and to get around them. We have chemical barriers: stomach acid, mucus, antimicrobial peptides that are secreted by our immune cells. And then there’s biological barriers, in this case, like the microbiome on the skin and that we’ve got in other places. Okay, so that’s one component. But the biggest component, well, a really important component is the immune system. There’s two kinds of flavors in the immune system. There’s the innate immune system, which is found in literally all organisms on Earth, except for viruses. And it’s constant, it’s invariant. And then the vertebrates have an adaptive immune system, which, you know, is housed in part in the long bones, the arm, the leg. And that develops over time and it’s specific to the malady that’s afflicting us at the time. Okay, so the figure shows basically a bunch of innate immune detectors and where they are on a cell. Nearly every cell in your body has these things. Some of them are specific for bacteria, bacterial proteins, others fungal. And the innate immune system recognizes things that are common to all microbial life and also some worms.
All right, so to be a parasite requires a susceptible host, but what makes a host susceptible? Yeah, okay, and the sequences concerned with immunity make up about 6% of the human genome and are under the greatest selective pressure of all the genes that we have, which means that they’re constantly changing. Why? Because this is a constant threat. Next slide, please. Yeah, so the outcome of the encounter between the host and the parasite is chiefly governed by the immune defense of the host, all those molecular components and physical components that comprise the immune system, and the virulence factors of the parasite. You know, what would happen if we lacked an immune system is kind of answered in the SCID example I gave above. And our hypothesis is that there are sequences that subvert innate immunity and essentially force the host to be susceptible to the parasite. Why is this important? Because we have to look at the function of what the sequence is doing. Next slide, please.
Yeah, so I’m going to talk about immune suppressing sequences or immune subverting sequences. There’s a lot of sequences in microbes, thousands of sequences in microbes that suppress host cell immune signaling. So if they can suppress the instigation of the immune response saying, ‘hey, there’s something here in me, maybe somebody should look at it and let the immune cells come in.’ If they can suppress that, then they’ve gone a long way toward causing infection. Another thing to do is to subvert antigen presentation. So once some of the bad bugs have been chewed up, the pieces of them can be displayed on the outside of the cell. Like I’m doing with my hands here, you know, they’ve got peptides in here and then the T cells and other things can come and look at this and say, ‘oh, yeah, that’s a parasite. We’re going to come and kill you now and stop infection.’ And then there’s other immune effectors that can be countered, which I’m going to talk about in the next slide. So this is some example, and I know a lot of you are bioinformatics people, this is going to make you crazy. But this shows some of the effectors that our immune system has and it shows some things that the streptococci, a particular genus of bacteria, can do to avoid them. So you can make these neutrophil extracellular traps; in the left-hand corner is degradation of NETs. And in response to being trapped in these nets of DNA, the streptococci secrete these proteins, specific proteins to degrade that DNA and chew it up. And so the streptococci are liberated from their prison. I think I mentioned before that there’s antimicrobial peptides that our immune cells can put out, and other cells, not just immune cells.
And there are ways that the streptococci, with SpeB and SIC, will neutralize those, cut them in half or do other things to those antimicrobial peptides so they can’t reach the bacterial surface. So there’s ways that the bacteria have to frustrate [unknown], which is the way that the bacteria and fungi can be recognized and killed. The streptococci can destroy antibodies in different ways. They can also degrade chemokines, which are a way that the immune system signals. And they can target directly the neutrophils and other phagocytes that are trying to eat them up with nasty things like SLA and streptolysin O and streptolysin S.
Okay, next slide. Okay, so I’m going to talk a little bit about how, when we built the database, we find these things. So we look them up in the literature, you know, do search strings and stuff, and these are some questions we ask. Does it act on the parasite or the host? Is it extracellular or intracellular? How does it get to where it acts? Is it secreted? Is there some adherence and invasion? What is the targeted molecule that it’s going after? And which host activity or pathway does it affect? And what’s the general effect on the host? Is it messing up an organ? Is it destroying a tissue? Things like that. Next slide. Okay, so what are the functions? So adherence: nothing’s going to happen if something doesn’t adhere to the host. And then some proteins also enable invasion, basically being pulled into the cell that it’s trying to attack. Next slide, please. I think that’s dissemination. Okay, no, no. So we’re going to jump to damage though. So the damage is serious, you know, if the sequence in question is causing damage. There’s different kinds of damage you can specify: cytotoxicity, where it interferes with the cell’s metabolism or kills it; it can disrupt the cell membrane; it can permeabilize tissue; it can destroy the extracellular matrix of the organism on which the cells are sitting; it can release the cell from the substrate, from the ECM, it just like pops it off and then it floats away. It can disable the organ and it can induce inflammation. And that example involves the T cell receptor: it basically turns everything on and makes a cytokine storm, and that’s bad for you.
All right, next slide, please. This is our little ontology, the little homebrew ontology for how we label these. So damaging SOCs are cytotoxic, they degrade tissues, they disable organs or they induce inflammation. They can adhere, they can disseminate, they can enable host cell invasion or they can move between host cells. So these are all ones that operate directly: direct-acting sequences target, hit and adhere and do something directly to a host molecule. There are some that operate indirectly also, and immune-subverting SOCs are also important direct-acting sequences of concern. They do a variety of things, as the streptococcal example, I hope, showed. Next slide, please. So there are 32 machine learning predicted functions of sequences of concern in SeqScreen. They’re going to come out as an output. So Krista’s output shows those different ones in that file she showed, and these are on GitLab on Todd’s lab site. Yeah, and next slide, please. Yeah, so it appears as an output. If there’s no FunSoC for that, if that sequence doesn’t have a function of a sequence of concern for that entry, like disable organs, then it shows up as zero, but if it’s got an entry like cytotoxicity for the fibrobacterium, mongum, hemolysis, then it shows up as a one because it’s cytolytic. Next slide. We’ve also got pathogenesis gene ontology terms. They were nice enough to let me contribute some sequences and some terms. It can be found at that website down there and the slide is in the PDFs. Next slide, I think, gives some examples, or maybe not. So here are some examples in the last slide over there. Anyway, the pathGO terms give greater granularity of molecular function. So pathGO 278 is stimulating small GTPases in another organism, 255 there is mediating guanine nucleotide exchange for a small GTPase in another organism, and the last one is modulating transcription in another organism. So they give you a different level. In addition to the effect, the ones I’ve shown give you some of what the protein being targeted in the host is, which I think is useful. Okay, next slide. Okay, we’re done.
SeqScreen Software Overview
Read video transcript
Yeah, we can go ahead and get started here. I’m just going to give a brief overview. I’m not sure I’ll need the whole 20 minutes, but feel free to ask questions as I go along, or if you’re holding them to the end I’m happy to answer them at the end. So I’m going to briefly describe the SeqScreen software, which, I have to say before I get started here, is obviously a joint collaborative effort with Signature Science, of course, Krista and Gene and many others. And I also want to make sure to shout out Advait, who’s the primary developer of the software, and Bryce, who was with my group previously. So a huge team effort over many years, and I’m excited to share some high-level details with you all. And I also want to say this is really meant to be community-facing software. And I think there’s a lot of space in this area, and I’ve been excited to partner with Kevin from Aclid and see all the great things he’s doing. So please take today’s conversation more as a general overview and framework that I think is a great place for community feedback and input and collaboration, rather than me saying that this is the one and only way to do this. I do not feel that way. I don’t think any of us feel that way. I think this is a community effort. I think we need all hands on deck here. And it’s exciting to see innovative methods like the ones that Kevin’s developing at Aclid come out as well. So I think it’s all complementary, collaborative, and exciting teamwork in terms of where we go in biosecurity.
This has already been mentioned a few times, so I won’t belabor this point, but this software was born out of the FunGCAT program, an IARPA program focused on functional genomic and computational assessment of threats. So I won’t revisit old history, I just wanted to shout out to IARPA for funding this work. And at this high level, to give you some overview of what this is capable of: we’ve been running through some test cases today. Obviously there are the wonderful resources that Rice provides; they provide a place to run some of these analyses.
But there are many different environments where you may want to run the software tool. For some larger datasets you may want a better equipped server, or in some cases you might want to be running things on a laptop. So this slide is talking about that. If we’re running in our fast and sensitive modes, say you want to run it on many, many sequences, or say you want to run it even on a set of metagenomic reads, which can be a very large dataset, you’re obviously going to want a server, a well equipped server, and maybe a computational cluster. And then we have this new mode. There’s a preprint now available that describes SeqScreen ONT mode, SeqScreen-Nano. Again, a collaboration with Signature Science of course, and Advait was the first author on this. This is where we took some of the ideas and optimized them and tailored them to the Oxford Nanopore sequencing platform, and it can run on just a laptop. So again, there are different modes you can run the software in. It’s meant to be configurable. There are obviously different ways you may want to run this, but for today’s purposes, we’re tuning it so that you can run it on the nodes that are provided to all of you. But needless to say, the use cases go beyond those that you’re running today.
And then the only other comment I’ll make is, obviously we depend on a database. Gene actually set me up well, he described how that was created. And that’s really a key part of all this. I’ll talk a little bit more about this in additional slides here. There is one key innovation of the software, given all of these new and exciting machine learning techniques coming out now specific to protein folding, structure and function prediction. What we really spent lots of time on was a human-in-the-loop informed machine learning process where we had experts, biocurators, and the expertise that Gene and the team provided to set the foundation for the machine learning. And that was important. And I can get into some of the technical details, which I just don’t have time for today, but I’m happy to follow up on that. But really, when it comes down to a limited set of training data, it can be noisily labeled. You really do need to help seed the process with accurate labels at the beginning and an accurate training data set that you can iterate on and improve. And that’s really what this database is. And of course we have the SeqScreen software in the middle, which we’ll talk about, which is open source machine learning-based functional annotation software, and then we have S2FAST, which is the government-purpose threat detection software that Signature Science maintains, and we can go back to Krista if she wants to share any additional details, but that’s where all this stuff is. If you are wondering and you’ve heard some of these terms thrown around, that’s a high-level overview.
So yeah, the paper was first published last year in Genome Biology. So we have additional details there. As you can see by the number of co-authors, this is a very large collaborative effort. And unique in the sense that I would say every author, irrespective of the order on this paper, made significant contributions to this software. So I just want to highlight that. I’m thankful for all of the team’s contributions for all the different aspects because this is a very highly multi-disciplinary project, carried out over four or five years. And without any one of these individuals on the paper, this method would not exist. And then what’s been highlighted here is that we’re doing, it’s available on GitLab. And so it’s under my GitLab site where I have a lot of open source software and then one of them is SeqScreen.
And then just high level. So I just got a question in the chat. Krista, feel free to, I’m not reading those, but if anyone wants to read those off to me, feel free. This is again a pretty informal discussion. So, SeqScreen is easiest to install via Conda or Mamba, but if you’re not familiar with that, we also provide Docker/Singularity containers, especially if you’re in an air-gapped system. And feedback from government partners while we were performing on the contract really helped us streamline the installation process. But if you do run into any difficulty when installing it, or are interested in installing it in your home environment, just reach out to any one of us. You can post an issue on the GitLab site that I shared or feel free to reach out over email. We’re happy to help work through any issues you might have. And importantly, there’s a check install option that lets you make sure that the install was correct. It’s not just an MD5 sum that says, did all the files copy over correctly; it actually looks and says, is everything where it should be, etc. And so you do that after you do your install, and if that checks out you’re ready to go.
And then I’d also say that we do have a reasonably rich set of documentation on a wiki page. I say “reasonably” because obviously there’s always so many ways you run the software but I do feel that we have a lot of attention to detail into the documentation. But obviously if you find anything that isn’t clear either to today’s workshop or when you’re running the software, if you decide to run it, feel free to reach out again and say, please provide more detail, or this is unclear, etc. And always keep in mind you have the paper as well, which can serve as a complement to the information on the wiki page.
And then if you have a question, you know, “which mode should I use?” This is a common question. If you just back up to a high level, generally speaking in bioinformatics, you’re going to be facing this question. So, forgetting about SeqScreen for the moment, you’re often faced with these questions. Do I need a fast answer? Do I need a sensitive one? Or, the last one, do I need to tailor my tool to a specific type of data? That’s what the third one is. So, the fast answer. Certainly there are tools that can get fast and sensitive answers, but usually it’s a trade-off. Usually with speed you’re sacrificing sensitivity. For every approach that exists nowadays, for really any application, if you are going faster, usually you’re trading off sensitivity, but not always. Sometimes there are really clever algorithms that allow you to get a lot faster and maintain the exact same sensitivity. But typically speaking, you need to think about, when you’re running through your analyses, how much time you would like to invest in the computation. Do you need to ensure that you get all the potential answers, if there are any, in terms of the hits, or are you okay with maybe missing some of those in exchange for a greatly reduced run time? That’s what the fast mode is. We set it as the default mode because it’s a good choice for large datasets. It uses DIAMOND. DIAMOND’s a wonderful tool; if you’re not aware of it, you should check it out. We cite it obviously, and it helps us accelerate our BLASTX searches. And one thing about fast mode is important: it only expects one protein coding region per read. So if you have very, very long reads and you try to run fast mode, you’re not going to get a good result, or you’ll just get one coding region annotated per read, which you probably don’t want. So that’s why the ONT mode was created.
So if that’s clear, I’ll move on to sensitive mode. Sensitive mode is really where SeqScreen is unique: it will carefully characterize and annotate read by read. So you can imagine, if your data set involves tens of thousands, hundreds of thousands, tens of millions of reads, etc., this will take a very long time. This is not the standard way to think about this; in most bioinformatics, you would not do a read-by-read analysis. But for the design of SeqScreen, this is a key aspect of how it operates: we really want to look at each read individually and characterize them separately, not combine them beforehand, either by assembly or binning or whatever. We really want to look at each read independently. We have some reasons for doing that. But mostly it’s for sensitivity and also thinking about the use cases that are most common, which are, you know, DNA screening where you have to look at every single read. Or in the case of, let’s say, a metagenomic run, you don’t want to make any assumptions about the importance of something in the sample based on its abundance. There could be one read or two reads or three reads in your sample that you really need to be able to characterize, and you can’t skip over them. So that’s a high-level, non-technical reason for that, but I’m happy to have a sidebar conversation with anyone about that in more detail.
And then for ONT mode, this is what you’re running today. This is for when your memory is limited. We spent a lot of time optimizing this such that it can run in memory-constrained environments, so it can run on laptops or workstations that have 32 gigs of RAM, around that. And it allows for analysis of sequences that have more than one protein coding region, again, two, three, and it will look at all of them. This is the key part here, particularly if anyone is interested in running these afterwards. If you understand these three different modes, you’re in good shape. But I’ll pause here if there are any questions.
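The rules of thumb for choosing among the three modes can be written down as a small decision helper. This is only a sketch of the reasoning in the talk, not part of SeqScreen: the function name, its arguments, and the 32 GB cut-off are assumptions made for illustration, and the returned strings are just labels for the modes described above.

```python
def suggest_mode(max_coding_regions_per_read: int,
                 ram_gb: float,
                 must_not_miss_hits: bool) -> str:
    """Suggest 'ont', 'sensitive', or 'fast' from the talk's rules of thumb.

    Hypothetical helper: the 32 GB threshold is an assumed cut-off for a
    "memory-constrained" machine, taken loosely from the talk.
    """
    # Fast mode expects one coding region per read, so long reads with
    # several coding regions (or a laptop-sized machine) point to ONT mode.
    if max_coding_regions_per_read > 1 or ram_gb <= 32:
        return "ont"
    # Read-by-read characterization when no hit may be skipped.
    if must_not_miss_hits:
        return "sensitive"
    # Otherwise the default: faster, at some cost in sensitivity.
    return "fast"


print(suggest_mode(max_coding_regions_per_read=1, ram_gb=256, must_not_miss_hits=False))
```

For a real deployment, the documentation and the default parameter settings remain the place to confirm which mode fits the data.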
Okay, I’ll keep going. I see some chat, but I’ll keep on the presentation.
Beth: The chat is, I’ve been posting links to the pubs and the GitLab and such. Yeah, so those aren’t questions coming in.
Yep, thank you, Beth. Yeah, thank you for clarifying. I just don’t have it open because if I spend too much time on chat, I’d probably get distracted and have an interactive conversation with everyone. And then on to Nextflow. For those of you who are not familiar with Nextflow, it’s a workflow management system, basically for bioinformatics. There are two, probably three things that people do. They use Snakemake, they use Nextflow, or they just use their own Python script to do that. We like Nextflow; Snakemake is great as well. So really you pick one or the other.
Here’s the preprint I mentioned previously, recently posted. This is really about the ONT mode that I just described, which is really meant to enable in-field characterization. I’m only highlighting it here because, if you want more details on how this works, that’s where you go to get them.
Okay, so then parameters, the bane of all bioinformatics users and bioinformaticians. You download a tool, you have your data set, you get ready to run it. The paper says it’s easy to run and you get good results, and then you look at “help” and you see all of these darn parameters, and you have no idea how to set them, what reasonable thresholds are, or how much they influence your results. So I’m not gonna fully solve that problem by any stretch of the imagination. I’m just here to say we’ve thought really hard about the problem for specific use cases to set reasonable defaults, and I highly encourage any users that want to deploy this to look carefully at each of these parameter settings and look at the defaults and see if they make sense to you. We’ve certainly validated them, and we thought a lot about setting the defaults, but that doesn’t mean they’re appropriate in all settings. One, for example, is threads: the default is one, but obviously if you’re in a multi-threaded environment, you know, multiple cores, you may want to increase that. And if you’re more of a power user here and you want to get into changing thresholds, you may want to change the E-value threshold, or you may have a bit score threshold you want to change, etc. I’m not going to read you all of these, but we’ve included a lot of parameters. We have documentation of all of these parameters, but if and when you run this and you find that your questions are not answered or you’re confused, this is a good point either to ping us over email or post an issue on the GitLab page.
Okay, I’ve already hinted to this, so I don’t really need to spend a ton of time here, but I’ll just kind of go carefully through this just very briefly, just to highlight again and thank Gene for all the great work he’s done on this, not only Gene, but many, many, many scientists and folks at Signature Science and the team writ large. And yeah, so we just kind of did this iterative machine learning process. The machine learning code is part of the gene repository, and so if anyone has any … this is just, I guess, a moment to briefly pause and kind of say, obviously, part of the inspiration for this model, this process actually, is thinking about this more from the community framework and how we would, I think, inspired by previous efforts that really engaged different communities to collectively look at problems like these and discuss and kind of do crowd-based annotation. In this case, it’s expert-based, of course, but you can see a system like this also being adapted to more of a crowd-based where we have many experts at many different places that would like to expand out and think about different types of problem like this and this framework that we built, I think really speaks to that. And again, just as a shout out to previous efforts like this that have worked in a collaborative community-minded framework which is what SeqScreen is.
In terms of machine learning, I hate to always just throw up plots and hand wave. I’m gonna do that here. The high-level message is that for this specific problem, at the time when we were implementing a solution, given the limited amount of training data even after that iterative process, we had to deploy an ensemble machine learning approach. And what is that? It means that no single machine learning approach that we tried would always perform well, because of characteristics of the data. So based on how you would train on your data, you could get different results depending on the FunSoC and the amount of training data that we had for each one of those. So Advait is the one who innovated this, and he really had to come up with an ensemble strategy using different classifiers, both neural nets and support vector machines, etc., and look at the pros and cons of each of these approaches, carefully looking at the data for a very long time and then deciding that we would combine these and then execute a majority voting scheme, and that resulted in the best performance that we saw. And so the best performance is what we see here, a jumble of lines. So let me just summarize it for you very quickly. You don’t have to stare at all the lines. The red line is the ensemble call. The circles represent precision and the X’s represent recall, and then anything that’s not red is gonna be a classifier on its own. And what you see typically, not always, but typically, is that the red line is above all the other colors. That means the ensemble is working. The only thing you wanna note is that there is variable performance across each of these FunSoCs that Gene had described. So why show them here? If you end up using the software and you happen to really care about a certain one of these FunSoCs that are listed, you might want to note the performance in that category. For example, for bacterial counter signaling, you see there that without the ensemble approach we would be doing really poorly. There are just some categories, such as induce inflammation, that globally don’t do as well as some of the other categories. And if I were to summarize in one word why there is variable performance, this all comes back to the amount of training data and the quality of training data we have in each of the FunSoCs. And that’s probably the quickest and easiest way to describe that.
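To make the majority voting idea concrete, here is a minimal sketch of how binary calls from several classifiers could be combined per FunSoC. It is not the SeqScreen implementation: the classifier names, the tie rule (a tie counts as 0), and the data layout are all assumptions chosen for illustration.

```python
def majority_vote(predictions):
    """Combine 0/1 calls from several classifiers for one FunSoC.

    Assumption: a strict majority of 1s is needed for a 1; ties give 0.
    """
    ones = sum(1 for p in predictions if p == 1)
    return 1 if ones > len(predictions) / 2 else 0


def ensemble_labels(per_classifier):
    """per_classifier maps classifier name -> {funsoc name: 0 or 1}."""
    funsocs = list(next(iter(per_classifier.values())).keys())
    return {
        funsoc: majority_vote([calls[funsoc] for calls in per_classifier.values()])
        for funsoc in funsocs
    }


# Hypothetical calls from three classifiers for a single protein.
calls = {
    "neural_net": {"cytotoxicity": 1, "induce_inflammation": 0},
    "svm": {"cytotoxicity": 1, "induce_inflammation": 1},
    "third_model": {"cytotoxicity": 0, "induce_inflammation": 0},
}
print(ensemble_labels(calls))  # {'cytotoxicity': 1, 'induce_inflammation': 0}
```

The point of the vote is that a single classifier that does badly on one FunSoC (because of sparse or noisy training data) can be outvoted by the others.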
So I’m hopeful, with these new machine learning approaches that are coming out, that there are ways of augmenting training data and improving this. But overall, the performance feels quite good. But obviously there’s always room for improvement. And another comment I wanna make is, again, I’m not pitching this to be the end-all-be-all for everything. You see the antibiotic resistance FunSoC: while we see good performance in our hands, there are many tools that exist that do a great job with antibiotic resistance prediction. And you can also run those, obviously, instead of relying on this information. This is more of a comprehensive view through SeqScreen, but it doesn’t mean that for any specific category, like antibiotic resistance prediction, there aren’t other tools or better tools that you can use on your data.
And I just wanna stop here by saying, you know, I wanna thank my group, obviously Signature Science, I wanna thank, and I think, you know, but specifically the group, Advait and you, Gene, for sure, the huge, huge efforts made not only for software, but the workshop today. So obviously I’m running around half the time not really knowing where I’m going. So thanks to them for all this work that’s getting done and really seeing the vision that we had and the team executed and implemented. But obviously all this stuff is still work in progress in the sense that we can always improve it. So all of your feedback is definitely welcome. If you end up using and enjoying SeqScreen, I think the best feedback we can get is your just honest feedback about anything you like or dislike about the software so that we can keep it in mind for the future. And with that, I think that’s all I have here. I’ll just hand it back over to you and pause for questions.
Known Test Sequences
Read video transcript
Hi everyone, welcome back. Now that you’ve had a chance to listen to Gene tell you about sequences of concern and Todd tell you more about SeqScreen and details about the different workflows and how it operates, I wanted to revisit this example SeqScreen output that we have looked at previously, to tell you more about the different columns of output now that you have more context for what is going on here. So to refresh your memory, this is an output from looking at a FASTA file of single gene sequences. So each row is a different query sequence. And previously we talked about this bsat_hit column, which is a binary yes or no according to whether or not the query sequence had a hit to one of the sequences in our custom database. That database includes all the publicly available sequence data we compiled according to taxonomic lists from the US or from the Australia Group or the US Commerce Control List that describe organisms that should be regulated. So we talked about how the bsat_hit column is sensitive but not specific, and we looked at the multi taxids confidence field to understand how sometimes there are ties where a gene sequence can be equally likely to have been derived from a select agent or one of its closely related microbes that’s not pathogenic. And in that case we’re not worried about that particular sequence because it’s not involved specifically in pathogenesis as it relates to that select agent.
So we talked through that, and just moving forward on that thread, I’ll mention that this taxid field is another thing related to the multi taxid confidence, in that some of our end users previously wanted to see a single taxid. So they didn’t want this long list, they just wanted one number that they could go from that was the most likely organism that that sequence came from, the most likely taxon of origin. What SeqScreen will do in this case is report the highest probability taxid in this field. If there is a tie it will just take the first taxid on the list. So in this case we see 1428 is 1.0, and that’s the highest probability that something came from that particular taxon. But oh no, there are other 1.0s, right, there are a lot of 1.0s. Since we only put one taxid in this field, SeqScreen simply takes the first one on the list and puts it there. So that’s something to be aware of when you’re interpreting these outputs: if you only look at this taxid field you will be missing out on any ties that might have happened in the probability and the confidence here in column D. So, do be aware of that when you interpret things.
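One way to avoid missing ties is to read the confidence column yourself instead of relying on the single taxid field. The sketch below assumes the confidence field is a comma-separated list of `taxid:confidence` pairs; that layout and the example values are assumptions for illustration, so check the actual column format in your own output before reusing it.

```python
def parse_confidences(field):
    """Parse an assumed 'taxid:confidence,taxid:confidence' string."""
    pairs = []
    for item in field.split(","):
        taxid, confidence = item.strip().split(":")
        pairs.append((taxid, float(confidence)))
    return pairs


def top_taxids(field):
    """Return every taxid that shares the highest confidence, in list order."""
    pairs = parse_confidences(field)
    best = max(confidence for _, confidence in pairs)
    return [taxid for taxid, confidence in pairs if confidence == best]


# Hypothetical field with a two-way tie at 1.0.
tied = top_taxids("1428:1.0,1392:1.0,1396:0.5")
print(tied)           # ['1428', '1392']
print(len(tied) > 1)  # True: the single taxid field would hide this tie
```

If more than one taxid comes back, the value in the single taxid field is only the first of several equally likely taxa.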
GO stands for gene ontology so this is going to give you the gene ontology numbers which correspond to specific labels. Those are not only for pathogenicity these can be for anything. SeqScreen will by default report molecular function or biological process GO terms. There is an optional flag that I ran in the example command for this exercise called include CC that includes the cellular component GO terms. So in that case cellular component GO terms will be reported here as well. But these have to do with all kinds of different biological processes, molecular functions, cellular components so where in the cell these are going to take place. All kinds of things show up in GO terms right so this is a very broad collection of function and we’ve actually used this in other studies to look at things beyond pathogenicity just to collect functional labels for sequences and do other types of analysis. So that’s an interesting angle here but not something specific to biosecurity.
Okay this is a field that you can ignore it’s always going to say one. Don’t worry about that that is nothing and then here we have all the different FunSoCs. So for the purposes of this illustration I added a filter to my spreadsheet and I recognize that that does cut off these labels for the FunSoCs which could be frustrating but fear not because when you open your spreadsheet you will be able to see these labels. And if you want more information about what the FunSoCs are I would definitely encourage you to look at the software documentation. We have a wiki page dedicated to functions of sequences of concern. There’s this handy table here that will tell you the FunSoC definition and more information about it. So please check that out if you want to know more information about the specific FunSoCs listed in the screen. And please also refer to the publications that Gene Godbold authored about sequences of concern and categorizing them where he goes into more detail about all these concepts.
Related to that there’s also the pathGO terms which tend to be more detailed and numerous than the FunSoC terms but the big differentiator between pathGO terms here in this column and FunSoCs here is that FunSoCs in SeqScreen were all derived from machine learning. So these are all assigned by machine learning whereas these pathGO terms are assigned by subject matter experts. So humans and machines. We included both because humans and machines are both important for these kinds of applications. Humans can only annotate so many things. It’s a time-consuming process. It takes some level of expertise. You have to read through the literature and find experimental evidence to make the different assignments and that’s after the ontology has been developed. The ontology itself takes some time to refine and review and create. So there’s a lot that goes into the manual annotation and assignment. But once you have enough of those manual assignments you can use machine learning to scale them and iterate and go through and improve it. And that’s essentially the model that we used here.
We did manual assignments of FunSoCs in our original SeqScreen publication. Those were then scaled with machine learning to assign the labels to many other things. So when we look here at these FunSoC labels there are zeros and ones, so binary yes/no. A dash means that it was not determined, so the machine learning algorithm did not assign either zero or one to it. So if we look at the ones here, and in SeqScreen you won’t see them in color, again, I just added some color to my spreadsheet for the purposes of this course. Here are our ones, and we see disable organ, which is the most severe FunSoC. And here’s a little expert tip on SeqScreen output. The FunSoCs are roughly listed in descending order according to how severe they are, or the negative effect they might have in their pathogenicity potential. So disable organ is listed first because if something can shut down an organ that’s very bad. And so what we see here for disable organ are a lot of the eukaryotic toxins. They can be very damaging and the machine learning is flagging those as highly concerning. If we go to see if there are any pathGO terms associated with those, we also see some different pathGO labels. And they’re not going to be completely equivalent to the FunSoC labels, for one because they’re just called different things in the pathGO ontology and the FunSoC ontology, but mainly because the humans are going through and very accurately finding publications and references and making assignments here, whereas machine learning is more broadly just looking for evidence of how to assign these different terms.
Okay so that’s disabled organ. We see the eukaryotic toxins having that. We see that the bsat_hit column is “no” because these are not on any regulatory lists. But they have functional evidence of being damaging. And then we see some pathGO terms here that talk about how those things may be damaging as well. And again this dash here in the pathGO field does not mean that it’s not a pathogenic sequence. It just means that the manual curation team did not curate that particular protein. That’s all that means. So dash means we didn’t look at it. It’s also worth noting that the Uniprot evalues here are very low. These are strong evalues. These look good. And when we look at the different Uniprot gene names these all look very reasonable based on what we know these query sequences were. So this is a solid result here for these particular eukaryotic toxins. This is pretty expected. Right.
And then if we take a high level view just seeing what else is here in FunSoC land. You can see my green ones indicate that we have a smattering of different FunSoCs assigned to these different genes. And if we look at the overlap between bsat_hit and FunSoCs we see things like here’s a dangerous gene from the Bacillus anthracis the lethal factor gene. And it’s got some FunSoC assignments. It’s labeled as lethal factor. I mean the name just sounds bad. Right. And then it also has pathGO terms that describe how we know it’s functionally damaging to the host. So this is clearly something of concern. It’s overlapping with the bsat_hit and functional labels that are concerning. So that’s something that I would definitely pay attention to. And there are a number of other things in here. If you look at what I collected for this list I collected a lot of things that overlap with the bsat list. But a number of the genes aren’t necessarily damaging. And the reason I did that is I wanted everyone to see how you know often there’s no FunSoCs. Right. This is common because most genes are not pathogenic. If you were to just take a random sampling of genes out there and run them through SeqScreen you would see a lot of zeros. Right. It’s not common to see pathogenic qualities in a sequence. And in this case we’re looking at things from Clostridium botulinum. So there obviously there’s a big toxin that can be made there. But there’s also just a lot of normal everyday housekeeping genes that aren’t concerning. So that’s something to kind of look through as you’re learning how to interpret everything.
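The interpretation logic walked through above (a regulated-list hit, FunSoC ones, and how the two overlap) can be sketched as a small triage function over one row of output. Only `bsat_hit` and the 0/1/dash convention come from the walkthrough; the FunSoC column names, the "yes"/"no" spelling, and the triage labels are assumptions for illustration.

```python
def triage(row, funsoc_columns):
    """Classify one output row by list hit and FunSoC flags.

    Returns (label, flagged_funsocs). A dash means "not determined",
    so only an explicit "1" counts as a flag.
    """
    flagged = [col for col in funsoc_columns if row.get(col) == "1"]
    on_list = row.get("bsat_hit", "").strip().lower() == "yes"
    if on_list and flagged:
        label = "list+function"   # e.g. a toxin gene from a listed organism
    elif flagged:
        label = "function-only"   # e.g. a eukaryotic toxin not on any list
    elif on_list:
        label = "list-only"       # often housekeeping genes; check taxid ties
    else:
        label = "none"
    return label, flagged


# Hypothetical rows using two assumed FunSoC column names.
columns = ["disable_organ", "cytotoxicity"]
rows = [
    {"bsat_hit": "yes", "disable_organ": "0", "cytotoxicity": "1"},
    {"bsat_hit": "no", "disable_organ": "1", "cytotoxicity": "-"},
    {"bsat_hit": "yes", "disable_organ": "0", "cytotoxicity": "0"},
]
for row in rows:
    print(triage(row, columns))
```

A label like this is only a starting point for review; the pathGO terms, E-values, and gene names in the same row still need a human look.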
It’s also interesting to see how some of the different codon optimization processes for different organisms, so basically adapting the codons to optimize expression in different organisms, didn’t necessarily change results. Right. These all look pretty similar even though we’ve optimized them for different organisms, which I think is a good thing and a strength of SeqScreen, reporting things consistently. Here we see different parts of the gene. So I basically broke this up into smaller pieces. There is an output here that tells you the size of the query. And so this is a fairly small query size. But we’re still getting pretty good results with small sequence sizes. I would say that SeqScreen performs pretty well with sequences down to 50 base pairs. Lower than that it can still return results. It’s just that the smaller you get, since we’re talking about coding sequences, you start to get into very short amino acid segments that can be pretty ambiguous. So 50 base pairs or above SeqScreen should give you reasonable results. But feel free to look at as low as you want. Obviously the longer the sequences, the stronger the hits are going to be.
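If you want to check your own input against that 50 base pair guideline before running anything, a small sketch like the following would do it. This is plain Python and not part of SeqScreen; the function names are made up for this example. For a sense of scale, 50 base pairs covers at most 16 complete codons, which is why hits on shorter fragments get ambiguous.

```python
MIN_BP = 50  # guideline from the discussion above, not a hard limit


def read_fasta_lengths(text):
    """Return {sequence name: length in bases} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def short_queries(lengths, min_bp=MIN_BP):
    """Names of sequences shorter than min_bp, sorted alphabetically."""
    return sorted(name for name, n in lengths.items() if n < min_bp)


example = ">long_one\n" + "ACGT" * 20 + "\n>short_one\n" + "ACGT" * 5 + "\n"
print(short_queries(read_fasta_lengths(example)))
```

Short queries are still worth running; the list just tells you which results to read with extra caution.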
I think that’s the majority of what I had to say other than the machine learning process in SeqScreen is not run in real time. It’s run ahead of time and then we pre-populate our database with all of those results, so that every time you run a query sequence through SeqScreen you should get the same results. You shouldn’t see different results each time. And the machine learning is not perfect. No doubt you will find instances where you think you disagree with the machine learning results for one reason or another. And that’s fair. We definitely don’t claim that the machine learning is a perfect system. But it is the best system that we could develop given the limitations with scaling manual curations. And we do hope that as we improve labeling and ontologies and we get more and more curations in databases that are high quality, our machine learning performance is going to be improved. So we have really high hopes for that in the future. And we’re pretty happy with how this is performing now, but we definitely recognize it’s not perfect. If I had a choice between going with machine learning results or these manual pathGO results, I would definitely take a close look at the pathGO results because I know that those were assigned and reviewed by experts in the field. And then one other final thing I’ll mention as we wrap up this is that there is an HTML report that I haven’t shown you yet. So this is what the HTML report looks like. You do need to load that file before you view it. So when you open up the output file, you will not see anything. You will see a shell of an HTML framework. So you need to choose the file and then choose the particular data file that you want to load. Alright. We will pause there and come back to this in just a moment. Thank you.
Introduction to Case Studies
Read video transcript
Hi everyone, it is time for the case study portion of this course. So for the case studies within this repository, you’re going to find different folders for each case study. And within that, you will find a slide that describes the case study scenario. And these are scenarios that we put together that we think are gonna be relevant and interesting for the bioengineering community. Within each scenario is a question that you’re gonna try to answer. So there’s an input FASTA file provided with each case study. And the goal is for you to run that FASTA file through SeqScreen and answer that question that’s provided in each case study with the SeqScreen output files. If you’re not able to run SeqScreen yourself, that’s okay. We have provided those output files for you. So please go ahead and look at those and take the interpretation steps that you’ve learned with the known test sequences to try to understand what SeqScreen is telling you and answer the questions as best you can. And finally, I’ve also included videos with each of the case studies that describe how I went through and answered these questions. So if it’s helpful for you to see my thought process in answering the different scenarios and figuring out what’s going on, please feel free to watch those. I would encourage you to try it on your own first because I think it’ll be more impactful for you to try to answer these questions independently before you just see how I did it. However, I know it’s also helpful to learn from each other. So that’s why those videos are there. Feel free to watch all of them. No problem.
And also, I would love to hear back from you if you have other ideas for case studies. So if there’s anything here that you were hoping to see that you didn’t see or you have ideas or especially if you have test data sets that you could share publicly with this course in the future, we would love to hear about that and hear more information about those test data sets that you could share. Please feel free to email me with that information or you can post an issue to the GitHub and share more detail about any case study ideas that you have. Beyond that, I think the case studies in some cases are going to go into more detail than we’ve gone into before in the known test sequence exercise. But I think that’s good because you’re going to learn more about the different types of outputs that SeqScreen has, the intermediate files and other types of approaches that you can take to analyze the output. And it’ll be more interesting learning about all the different SeqScreen output files this way, rather than me just telling you about a bunch of intermediate outputs and what they are, which can be pretty dry and boring. So hopefully this will be more interesting to learn more about the software and what it can tell you in the context of answering the question. That’s the goal anyway. So I hope you enjoy it and thank you again for your participation.
Case Study 1
Read video transcript
Hi everyone, welcome to case study number one. I hope you had a chance to read through this slide and think about the scenario as well as find the example test data set that was provided for case study one, which in this case just consisted of a single sequence. And further, I hope that you were able to run that sequence through SeqScreen yourself or download the pre-computed output to interpret. And as you were looking at the outputs, I hope that you had a chance to think through what was going on on your own, and now we are going to go through and hear what I thought about the outputs and my thought process as I was trying to figure out what was going on with this sequence. I also wanted to mention that this is a real sequence that was provided to me from a bio-engineering and biotechnology company. It was something that one of their customers was ordering, using, and they wanted to know what our software could do to analyze the biosecurity concerns that might be present within this particular sequence. It was a challenging sequence, so other bioinformatics software packages were struggling with it, and it was an interesting use case that I thought would be worth mentioning. I did keep the company’s name out of this case study, and I also was pretty vague and made up the scenario and experiments around why this sequence existed, but that was to protect the anonymity of the company. They did give me permission to post the stuff they needed in the course. So thank you to them for submitting this and sharing it, and let’s jump in.
So I am going to open up the spreadsheet where we have pasted the outputs from the SeqScreen final report with the pathgo.tsv extension. So this is similar to what we looked at in the known sequence exercise, but now this is the output from case study one. And I have a number of different tabs in the spreadsheet, if you downloaded it and look at it on your own. Some of the tabs correspond to the different modes in SeqScreen that I ran in this case, and others correspond to intermediate files that came with the different modes that I thought might be helpful to look at. And we’ll talk through all that as we go. I just want to give you an overview of what’s in the spreadsheet, and that they’re all in multiple tabs for a reason. I want to keep things separate and look at things one-by-one because that is really how the output files are generated. They’re all separate output files.
Okay, so in fast mode, when we look at these outputs, we see that there was only one assignment made by Diamond, and it is a particular protein from yeast. And that protein had a pretty good e-value. It did not have any pathGO terms assigned, and there were no FunSoCs. So no obvious pathogenic functions here, although if you remember from my previous discussion about pathGO, there is a dash there that means that it wasn’t annotated by our biocuration team. And I can tell you that the reason that this one wasn’t annotated is because they really focus their annotations on proteins that represent sequences of concern. And this is not a sequence of concern. And that’s not to say that everything they have not annotated is not a concern and everything they have annotated encompasses all sequences of concern. But in this case, that’s why they didn’t annotate it. And so nothing concerning here in terms of the protein itself. The only thing that jumped out at me was this, that under the multi taxid confidence field, we see that the value for the confidence is 0.578. And that is relatively low on a scale of 0 to 1. That’s about right in the middle, which is not a very high confidence. So that’s something that I found strange and worth looking into further.
So the next thing that I did is that I looked at the HTML report output for fast mode. And that is what this is that I’m showing on the screen. If you look at the HTML report and you click on the query sequence, you should see the alignment visualization for this particular sequence against the best reference that it found. And if you are familiar with alignments, you’ll notice immediately that there’s a very poor alignment in the middle of the sequence. So that’s this area right here. If you’re not familiar with alignments, a good alignment will show that the same amino acid abbreviation is present between the query and the subject. And that is denoted by the same amino acid letter showing up in the middle here. So this is all a perfect alignment. And this is a perfect alignment at the beginning and the end of the sequence. But the middle one’s not. We see one shared T here and a few pluses, which means that the amino acids shared some commonalities but they were not equivalent. So not a very good alignment in the middle of this sequence. Something else was going on here in the middle that is not yeast.
So to further analyze it, I ran sensitive mode. And before I move on to sensitive mode results, I should mention that I also included the fast mode Diamond outputs here, so these are intermediate files and you can find in your SeqScreen output folder within the taxonomic identification sub directory. And then within that sub directory, you go to the BLASTX sub directory and that’s where you’ll find these outputs. This is the B tab output, B T A B. And it essentially has all the different fields of output that you would expect for a BLASTX kind of result, which is what SeqScreen reports for Diamond in this case. And I added the headers to this particular file for ease of interpretation during the course. And hopefully that helps everything make a little bit more sense. The .578 number that we were talking about previously that I said looked strange for the confidence of the multi taxid, so the confidence for this being yeast was really (???). That is reflected here.
Other things that you might pay attention to include the e-value, the bit score, these are all significance indicators. So that’s our Diamond output. Fast mode also runs centrifuge, but in this case, centrifuge did not come to any good conclusions about the sequence. It was unclassified. And I think it’s worth noting that the centrifuge database that we use with SeqScreen is only a microbial database. So if any of the sequences come from eukaryotes or other non-microbial organisms, that will not appear in the centrifuge database and it will not be able to be detected with centrifuge in the way that we’re using it here. Our Diamond and BLAST databases that are used with SeqScreen are the big protein database and the NT database. Those are all moving beyond just microbes. We have all kinds of things there. So they should be able to be more broadly detecting different toxins whereas centrifuge will just be focused on microbes.
Okay, moving to the sensitive results. I jumped right into the detail here. So we’re looking at BLASTX outputs for sensitive mode. These are very similar to the Diamond outputs, but one of the first things you might notice here when comparing BLASTX to Diamond is that Diamond only has one row whereas BLASTX has a lot of rows. That’s because BLASTX is, as the mode name would suggest, more sensitive in sensitive mode, and it’s going to return more results. Diamond runs very rapidly and efficiently. BLASTX is going to be more detailed in the number of outputs that it can provide in the way that we run it in SeqScreen. And we see that reflected here. So one thing that jumped out to me as I looked at this is that the first row in the BLASTX output is identical to the first row in Diamond, which is the same protein that’s being returned on top. The next protein does not look like a terrible hit either. I mean the percentage of identical matches is fairly high. The expected value or evalue looks pretty comparable between these two. The bit scores look pretty comparable. So as far as strength of the hit, these first two rows look like they’re about the same strength. You know, some differences, but both would be worth paying attention to when we’re trying to figure out what’s going on here. As I glance through this and when I look at the actual name of what these proteins and information about the sequences might be, you see that the first one is the yeast protein that we’ve talked about before. The one below it is actually a snake toxin. So this here is a protein associated with a snake toxin. So that’s interesting from a biosecurity standpoint. It’s worth mentioning that the list of regulatory organisms and toxins does not include eukaryotic toxins like snake toxins. But from a biosecurity standpoint, it’s something worth paying attention to.
If you weren’t expecting this customer to be working with something like a snake toxin, that’s going to raise a red flag, I think in my mind, just asking questions and figuring out what’s going on here, and making sure that it’s safe and appropriate for use in the case of the company that provided the sequence as they’re analyzing this from one of their customers.
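To illustrate the idea of looking past the first row for hits of comparable strength, here is a minimal Python sketch. It assumes the standard 12-column BLAST tabular layout; the btab file that SeqScreen writes may order or name its columns differently, so treat the column list as an assumption. The rows and the 0.8 cutoff are invented for this example and are not the case study values.

```python
# Standard BLAST tabular (outfmt 6) column order, assumed here.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Invented rows for illustration only.
SAMPLE_HITS = (
    "query1\tsubject_1\t95.0\t45\t2\t0\t1\t135\t10\t54\t1e-20\t90.0\n"
    "query1\tsubject_2\t88.0\t23\t3\t0\t40\t108\t5\t27\t1e-18\t80.0\n"
    "query1\tsubject_3\t40.0\t20\t12\t0\t10\t69\t1\t20\t0.5\t30.0\n"
)


def parse_tabular(text):
    """Parse tab-separated hit lines into dictionaries with typed fields."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(BLAST_TAB_COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        for key in ("qstart", "qend"):
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


def comparable_to_top(hits, fraction=0.8):
    """Hits whose bit score is at least `fraction` of the best bit score."""
    if not hits:
        return []
    best = max(hit["bitscore"] for hit in hits)
    return [hit for hit in hits if hit["bitscore"] >= fraction * best]


for hit in comparable_to_top(parse_tabular(SAMPLE_HITS)):
    print(hit["sseqid"], hit["pident"], hit["evalue"], hit["bitscore"])
```

Anything this returns beyond the top row is worth looking up by name, which is how the second hit in this case study turned out to be interesting.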
So there it is. Another thing that can be analyzed here just to clarify where these particular matches appear in the alignment is the start of alignment in query and the end of alignment in query field. So for the top row, which is the yeast, we see that it starts at position three and ends at position 137. The reason for that, which by the way is the full length of the sequence, or less, the reason for that is that when we think about alignments, it’s returning the beginning of the alignment and the end and then it’s not necessarily worrying about the middle part, not being aligned within the metric that we’re describing here where we’re talking about start and end. Where did it start then, for yeast, it started here, and it ended here. However, for the snake toxin protein, it starts at position 48 and ends at position 116. So that’s actually closer to the middle of the sequence. And so there you have it. We have yeast flanking a snake toxin protein.
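The start and end comparison I just described can be written down as a small check. This is only a sketch in Python, not something SeqScreen does for you; the `Hit` class and function name are made up for this example, and the coordinates are the ones read off the spreadsheet above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    organism: str
    qstart: int  # start of alignment in query
    qend: int    # end of alignment in query


def nested_hits(hits):
    """Return (outer organism, inner organism) pairs where one hit sits
    strictly inside the query span of a hit from a different organism."""
    pairs = []
    for outer in hits:
        for inner in hits:
            if outer is inner or outer.organism == inner.organism:
                continue
            if outer.qstart < inner.qstart and inner.qend < outer.qend:
                pairs.append((outer.organism, inner.organism))
    return pairs


case_study_hits = [
    Hit("yeast", 3, 137),
    Hit("snake toxin", 48, 116),
]
print(nested_hits(case_study_hits))
```

A pair coming back from this check is a prompt to open the alignment and look, since the outer span only records where the alignment began and ended and says nothing about how well the middle matched.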
I went ahead and analyzed it with the other modes in SeqScreen just for fun to see what would happen. I also included the BLASTN results here from sensitive mode, which are primarily yeast, but you can see that the yeast is showing up at the beginning and at the end of the sequence here in the BLASTN results. For ONT mode, it was interesting to me that it really keyed in on this middle snake toxin sequence. And I think the reason for that is that ONT mode is made to look at long sequences and try to infer the open reading frames within them. And since this middle sequence was its own open reading frame of a particular protein, it focused on that and it’s reporting it here in the ONT mode SeqScreen results. And we can look at how the pathogenicity labels look for this particular snake toxin. We see that disable organ is the FunSoC that’s assigned. So that’s the most severe, most damaging FunSoC possible within SeqScreen. And then we also see a pathGO term here at the end, and that is suppresses coagulation in another organism. So the FunSoCs were assigned by machine learning, so the machine learning algorithm assigned disable organ, whereas a subject matter expert assigned the pathGO term suppresses coagulation in another organism. And that subject matter expert biocurator also assigned a PubMed ID, which is a reference to a scientific publication that provides functional evidence for why that particular pathogenic function was assigned to that protein.
And we jump to that particular publication. Here it is. And so if you are using SeqScreen and you see a pathGO term like this and you want to learn more about it, you can go to this publication and learn more about the experimental and functional evidence for the pathogenic activity of that protein. So here’s this. That’s helpful, I think. And then the final tab on the spreadsheet is giving you more detail about the alignments in the Diamond output that correspond to the snake toxin protein that was identified in the middle of the sequence.
Hopefully that was helpful. I think that overall, SeqScreen provided the information that was needed to figure out this. This is what we refer to in the bioinformatics community as a chimeric sequence, so it’s a sequence with more than one organism present within it. So there’s more than one protein from different organisms in the sequence. And that is challenging because you’re not getting, in this case, a really long sequence that’s contiguous from one organism. You’re getting small pieces that are mixed up and that often confuses our algorithms and our reporting. So it is a tricky case, but I think SeqScreen did a pretty good job in explaining what was going on. It did take some looking, we had to look through the results and think about it. But I think overall I was pretty happy with this performance in this case. So I hope that was helpful and you enjoyed Case Study 1. Thank you so much for your time. Bye.
Case Study 2
Read video transcript
Hi everyone, welcome to case study number two. Similar to case study one, I hope you had a chance to download the input FASTA file for this case study and run it through SeqScreen on your own. If you weren’t able to run SeqScreen, I hope that you were able to download the pre-computed outputs and practice interpreting those results on your own. I selected a publicly available cloning vector sequence for this particular case study because I believe it’s representative of a common challenge that I’ve seen with biosecurity screening of viral sequences, particularly viral vector sequences if they’ve been derived from a pathogenic virus. So we’re going to talk about that more.
But before we do, I wanted to point out that within this example, I did provide you with the name of the cloning vector sequence and the GenBank ID that you could use to look it up and learn more information about it before you analyze the results and that is completely fine. As a bioengineer, you are going to be using many different cloning vectors in your work and I think in just about every case, you will have lots of information about that cloning vector and I encourage you to learn about that and consider it and look at the results of that cloning vector through your biosecurity pipeline so you understand what the expected results are going to be as you begin to modify sequences and use it, you can then interpret results within that context. So if we go to this cloning vector, GenBank ID here, we see that this is a cloning vector sequence from EV71 and it’s meant to, so here’s EV71, polyprotein 1 and then a particular peptide within polyprotein 1 of the EV71 virus and its purpose was to be a candidate dual vaccine for both EV71 and CVA16 and if you don’t know what all those letters and numbers mean, that’s okay. Basically these are two viruses that are involved in hand, foot, and mouth disease in children so they cause this disease in children and the purpose of this cloning vector is to try to help develop a vaccine against this particular disease. And it’s very common in vaccine development to only use part of a pathogenic viral sequence in the development of that vaccine. The reason for that is that we want your immune system to be able to recognize that particular virus by looking at part of it but not infect individuals with the entire pathogenic virus within the vaccine so we’re looking for something that will be effective in generating immunity without being pathogenic and making someone sick. And that’s exactly what this is. 
So this cloning vector itself is not pathogenic, it is safe, it’s only using part of the sequence of these viruses, specifically the EV71 virus. And to give you some additional context for that, this is a publication that had a nice figure, so I grabbed that, and just to give them kudos and a reference, here’s the publication that we’re looking at. We are looking at polyprotein 1 of the EV71 virus, specifically a region of peptides in this area of polyprotein 1, so there you have it. This is a portion of a pathogenic virus, but the big question is what does this look like in a SeqScreen output file, so what could we expect to see when we’re looking at a viral polyprotein through the SeqScreen pipeline.
So glad you asked. Let’s take a look. So this is my output file that you can download on our GitHub repo within the case study 2 directory, and once again I have a number of different tabs in the spreadsheet. Those tabs correspond to different SeqScreen output formats as well as different intermediate files to give you more information about the alignments that were done to generate these outputs. In fast mode we see there is a single hit to a protein that’s called genome polyprotein, but interestingly the organism is not a virus. That is because, unlike in case study 1 where centrifuge did not return a result for the input sequence, here centrifuge did return a result, and that particular 1405 NCBI taxid corresponds to a Bacillus species. But when we look at the confidence for it, it’s a very low confidence, so I think this is likely due to differences in the database composition between centrifuge and Diamond. Centrifuge did not find a very good match to this sequence; however, since it reported something, that is what automatically got reported and populated in the fast mode results. We can see in the Diamond multi taxid field that there are a number of other taxids reported by Diamond, and when we look at those in more detail in the Diamond results, there are a number of top hits that have similar bit scores and e-values, and those all are genome polyproteins from Enterovirus A71. So this definitely looks like Enterovirus A71 from these confidence values and these results compared to the centrifuge results. You may be asking, well, why did SeqScreen report the centrifuge result then if that was lower confidence and Diamond was higher confidence, and the reason can be found in the SeqScreen documentation. So if you remember within the three different modes of SeqScreen we have lots of different tools running.
Within sensitive mode you can optionally run BLASTN in addition to BLASTX, within fast mode we run both Diamond and centrifuge and then within ONT mode we do just run Diamond but we have a couple different Diamond databases one is a higher quality database with higher quality functional annotations that’re similar to what we use in fast mode and sensitive mode but then we also have a lower quality database if there’s no hits to that first one in ONT mode we’ll go to the other one to increase sensitivity of results.
All that to say there’s a lot going on in SeqScreen lots of different tools are being run and the developers needed to come up with a way to decide which tool to go with in reporting the final results if they both gave you different results which one to select and when we go to the taxonomic identification workflow wiki page here in the SeqScreen documentation we can get more information about all things taxonomic identification workflow related. There’s a handy table here that’ll tell you what databases are being used with the different tools. There’s also some asterisks here that give more detail about how the databases were constructed, where you can find them all kinds of details here. What I want to draw your attention to though is at the very end of the documentation there’s some discussion about the final taxonomic logic and for sensitive mode when both BLASTN and BLASTX return taxids BLASTN is given preference so BLASTN results will be listed first before BLASTX in the taxonomic field and then in fast mode centrifuge is given preference so the only time that Diamond results are listed in the single taxid field or the multi taxid confidence field is when centrifuge does not return a result and in case study one we saw centrifuge not return a result so we were looking at the Diamond multi taxid confidences in case study two centrifuge did return a result and even though it was low confidence, it was something so it was output in that field. But this goes back to just knowing how to interpret the outputs I would definitely recommend that you look at the confidence values for the taxid assignments and if centrifuge happens to be low then take a look at the Diamond ones and see what those look like that’s what I just did in this example.
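The preference logic described above, together with my recommendation to check the confidence, could be sketched like this in Python. This is only my paraphrase of the documented rule, not SeqScreen’s actual code. The `TaxCall` type, the 0.8 confidence cutoff, the example confidences, and the placeholder taxid are all assumptions made for illustration; only the 1405 taxid comes from the case study.

```python
from typing import NamedTuple


class TaxCall(NamedTuple):
    tool: str
    taxid: int
    confidence: float


# Order of preference per mode, as described in the documentation above.
PREFERENCE = {
    "fast": ["centrifuge", "diamond"],
    "sensitive": ["blastn", "blastx"],
}


def reported_call(mode, calls):
    """Return the call from the first preferred tool that returned anything.

    `calls` maps tool name to a TaxCall, or to None if the tool had no result.
    """
    for tool in PREFERENCE[mode]:
        call = calls.get(tool)
        if call is not None:
            return call
    return None


def review(mode, calls, min_confidence=0.8):
    """Return (reported call, other calls worth checking).

    The second item is only filled in when the reported call falls below
    min_confidence, which is an assumed cutoff for this sketch.
    """
    reported = reported_call(mode, calls)
    if reported is None or reported.confidence >= min_confidence:
        return reported, []
    others = [c for c in calls.values() if c is not None and c is not reported]
    return reported, others


PLACEHOLDER_TAXID = 0  # stands in for the virus taxid; not a real lookup
example = {
    "centrifuge": TaxCall("centrifuge", 1405, 0.1),            # invented confidence
    "diamond": TaxCall("diamond", PLACEHOLDER_TAXID, 0.9),     # invented confidence
}
print(review("fast", example))
```

With inputs shaped like case study two, the centrifuge call is still the one reported, and the Diamond call comes back as the thing to go look at, which mirrors what I did by hand.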
Alright one other point is that I don’t know that one tool always gives you the correct answer. Perhaps that’s a dirty secret in bioinformatics that different tools will perform well in different scenarios and sometimes centrifuge gives you a great result and sometimes Diamond gives you a better result and we needed to pick a decision path here to consistently have our software give answers in a uniform way but we included all these different results in here so that you would have more information as a user so that if one tool did not give you a very confident result you could always go back and look at the other one. So that’s the high level rationale here and hopefully that helps clarify things.
Okay so before we move on from these fast mode results I did want to go through and just talk about the functional annotations that were assigned to the UniprotID. So even though we just focused a lot on taxonomic classification and discussing why this bacillus species is showing up here the actual UniprotID is from the virus EV71 and it’s a genome polyprotein which means that there are many smaller functional units within this polyprotein that are actually used for different functional purposes for the viral processes and that is a point of confusion and challenge within databases right now as we try to do functional annotations and assignments and look at biosecurity through a functional lens because if the best match for a protein for your particular sequence is a viral polyprotein that means it’s going to have a really long list of functions. But not all of those functions actually correspond to your small sequence because now we’re looking at the entire polyprotein and listing all of those possible functions rather than those that are just specific to your short sequence and that has to do with the way that these proteins are organized within the database. I think this is definitely an area where more research and focus could be done in breaking up these polyproteins into smaller functional units and just organizing our databases and reporting a little bit differently in the future but for right now this is a challenge and it’s something to be aware of so that if you get a result like this where you’re getting a genome polyprotein and it’s got all kinds of FunSoCs assigned to it and that raises alarm bells in your mind that this could be a pathogenic protein, just keep in mind that this is a polyprotein and not all of these functions necessarily correspond to the short sequence that you were analyzing. So the functions that were getting returned here are corresponding to the whole viral polyprotein.
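One way to picture the polyprotein problem is to imagine the polyprotein split into labeled regions and ask which of them your alignment actually touches. The sketch below does that in Python. To be clear, this is not something SeqScreen reports; it is just an illustration of the idea of smaller functional units, and the region names, coordinates, and function labels are all hypothetical.

```python
# Hypothetical layout of a polyprotein:
# (region name, first residue, last residue, functions annotated to it)
EXAMPLE_REGIONS = [
    ("region_A", 1, 300, ["function_1"]),
    ("region_B", 301, 600, ["function_2", "function_3"]),
    ("region_C", 601, 900, ["function_4"]),
]


def functions_for_alignment(sstart, send, regions):
    """Return (region name, functions) for each region that the aligned
    part of the subject, from sstart to send, overlaps."""
    lo, hi = sorted((sstart, send))
    covered = []
    for name, first, last, functions in regions:
        if lo <= last and first <= hi:
            covered.append((name, functions))
    return covered


# A short query aligned to residues 650-700 touches only one region,
# even though the whole polyprotein carries four function labels.
print(functions_for_alignment(650, 700, EXAMPLE_REGIONS))
```

Until databases are organized this way, the practical takeaway is the same as above: a long list of functions on a polyprotein hit describes the whole polyprotein, not necessarily your short sequence.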
Hopefully that made sense we’re going to see a similar story with the other modes so here we have sensitive mode. So some differences that are worth mentioning here so for the organism or the taxid and the confidence that’s listed accordingly we see really high confidence here from the BLASTN results and it’s assigned to the exact cloning vector that we expected it to be so BLASTN is doing exactly what it should here for BLASTN against the NT database and it’s finding a perfect match to the cloning vector which is exactly what our input sequence was. So good job BLASTN you did it. You got the right answer and we see with the confidence value that this is a very confident answer and there’s only one taxid listed so it was the best match from BLASTN. It definitely looks like it could be concerning I mean in addition to all the FunSoCs we also see that it could be a BSAT so lots of concern going on here, however I’ll just reiterate again that even though it might have some similarity to something on the BSAT list and it has a lot of FunSoCs assigned to it, this is not of high concern from a biosecurity standpoint because we are looking at all the labels right now from a genome polyprotein and this UniProtID is once again a polyprotein from the EV71 virus. It’s a different UniProtID than the one we saw previously but if we look at the list that sensitive mode is returning this is the one we looked at previously this is the one that’s on top for BLASTX in sensitive mode and when you look at the evalues and the bit scores they’re all very similar. So similar story slightly different UniProtID in BLASTX compared to Diamond but same take-home message here.
And then with ONT mode once again we see the top hit being a genome polyprotein from enterovirus A71 and the UniProtID is similar to what we were just discussing. Something I haven’t mentioned is that pathGO annotations will rarely appear for viral polyproteins because our viral curation team does not bother to annotate polyproteins because of all the challenges I just mentioned. It would be much more meaningful to analyze smaller functional units and report those from a biocuration standpoint than from the entire viral polyprotein so they don’t even bother with these you’ll rarely get a pathGO list of terms for a polyprotein. You will however get that for FunSoCs because the machine learning algorithm will assign those and these aren’t wrong for the polyprotein but when you interpret the results you just have to remember that what is true for the polyprotein is not necessarily true for the smaller subset of the sequence that you were analyzing. Other things that we see in ONT mode, relatively high taxonomic confidence here so we’re looking at the Diamond result for the ONT mode and that gives us the enterovirus A71 so whereas in fast mode we were looking at the centrifuge result now in ONT mode we’re looking at the Diamond result that gives you the correct organism.
And then we see a bunch of other stuff here in the ONT results and that corresponds to other open reading frames that it’s finding within this cloning vector and it’s trying to decide what other genes are present within the cloning vector sequence that was provided. It thinks it has found some different antimicrobial genes, antibiotic resistance genes, other genes just involved in normal processes of cellular metabolism, so nothing particularly concerning here but there are things that ONT mode is picking up on that the other modes didn’t as far as other possible open reading frames present within the sequence. And then these are the Diamond results for ONT mode which look similar to what we’ve talked about before where the top hit is this one.
Okay, so that was my example of a viral polyprotein result. Again, this is something that commonly happens with viruses, and it is an artifact of the way that we have organized our databases around viral polyproteins. When we’re looking for functional labels, we have to keep in mind that if our functional label is coming from the polyprotein, it’s likely not specific to the small subset of the sequence that we were analyzing; it’s looking at the entire universe of possible functions for that virus. Okay, great, thank you so much, and I will see you at the next case study. Bye.
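The polyprotein caveat from this case study can be sketched as a simple review rule. This is only an illustration under stated assumptions: the `Hit` fields, the `needs_polyprotein_review` helper, the 0.5 coverage cutoff, and the example lengths and labels are all hypothetical and are not SeqScreen's output schema or method.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One protein hit for a query (hypothetical fields, not SeqScreen's schema)."""
    description: str        # free-text protein name from the reference database
    query_aa_length: int    # length of the aligned query region, in amino acids
    subject_aa_length: int  # full length of the matched reference protein
    labels: tuple           # functional labels reported for the reference protein


def needs_polyprotein_review(hit: Hit, max_coverage: float = 0.5) -> bool:
    """Return True when labels come from a polyprotein that the query covers
    only partially, so they may not apply to the query region itself."""
    if "polyprotein" not in hit.description.lower():
        return False
    coverage = hit.query_aa_length / hit.subject_aa_length
    return coverage < max_coverage and len(hit.labels) > 0


# Illustrative values only; they are not taken from the course outputs.
hits = [
    Hit("Genome polyprotein", 120, 2193, ("label_a", "label_b")),
    Hit("Beta-lactamase", 280, 286, ("label_c",)),
]
flagged = [h.description for h in hits if needs_polyprotein_review(h)]
print(flagged)  # ['Genome polyprotein']
```

A flag here does not mean the sequence is safe or unsafe. It only marks hits where the functional labels describe the whole polyprotein and a person should check which part of it the query actually matches.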
Future Directions
So, the challenges first. We hit a lot of these points already, but I just wanted to describe some of them. We saw a lot of value in, and a lot of reasons for, doing biosecurity screens, and I’m trying to bring that high-level view down to some of the points I wanted to make here. There are some ongoing challenges, and a lot of them we can solve by working together, but these are also areas where we could be seeking potential funding and support, where there may be some projects where we can work together to improve and advance SeqScreen and other computational programs that carry out functional prediction as well as taxonomic prediction.
It was great that Kevin described some other programs as well. And so I just put out some of the challenges here, thinking about them. So there are context impacts, and we saw that in some of the examples, and actually Kevin mentioned one too, where there’s a sequence that is in a pathogen and is known to contribute to virulence, but if you take that sequence into a benign E. coli or Streptomyces, well, the entire virulence pathway is not intact, and so it’s not harmful. But the only way that we can determine that is by doing some research. And actually, Gene mentioned that with the FunSoCs there is some information about context, about combinations that need to be present. And so that’s getting there. The more that we can do that, you know, at the genome level, the better we’ll be able to determine whether a sequence in that context is going to be a concern.
And similarly with the pools of amino acids too, oh sorry, pools of oligonucleotides: you know, look at the interactions and, collectively, what bio might be there. And then there’s environmental impact and application: take that engineered organism and put it into a different environment, like the GI tract. The gut will be a very different environment, and there could be ecological disruption. So that’s another area that, you know, we can’t see now when looking at an individual sequence, but it has to be considered in the manual evaluation of a hit. And how could we make that more computational?
Test sets, and more complex tests, are always needed to train the algorithms. I know there are some efforts now to put together some standardized sets, but that’s always an area where we can use more help. And then there’s the chimeric sequence and handling that. There’s some work you’re doing on that with SeqScreen, but just in general, that’s representative of industry applications. There will be these chimeric sequences that are engineered, repair templates for CRISPR applications for example, where there is an insert flanked by benign sequence; that was case study one. And then annotation crowdsourcing: Todd mentioned that for FunSoCs, the more information we’re getting from empirical results and research, the more we can continue to improve pathGO variety; we can just keep working together to improve that. Scale-up continues to bring more challenges. And then evolving business models: what are the different applications, what are the input sequences, and in different models, or different businesses, who does the biosecurity screening? That’s why it’s great that there’s such a diverse group of people here with different jobs. But it’s important that we all know what biosecurity is, what is involved, and what are some ways we can carry out the sequence screening. We hit on these topics, and all of you have too, and I think that sums up very well the value of, you know, working together. And then there’s the awareness, too, that this course is providing of why we need biosecurity.
The regulations, or the guidance: that was so helpful to hear about, and how they are taking input from the community and revising. And it is guidance, so we can as a community work together. We have some resources now, and we can improve on those resources to adhere to the recommendations and the guidance, and involve everyone, because now the guidance is including the users, not just the providers of the sequence. And then, you know, working together on resource building and sharing case studies. Where we can share case studies, it saves time, because (and Kevin mentioned this too) the manual evaluation of these hits is very time consuming. If we could share what we’re learning, you know, common false positives, and keep building up that allow list, that’s helpful, and it’s thanks to the efforts of many, many friends when we work together to solve problems. And moving forward, there’s the ongoing training, this course, and, you know, bringing it back to the education and workforce development at BioMADE, and Krista’s leading up this project, it’s in her portfolio. This is a great resource, and continuing to bring some of these into classes is really wonderful. Sheila asked, hey, can these slides be available for my biotech class? And that’s fantastic. That’s what these resources are for: we can bring these into the classroom and keep biosecurity moving forward. Anyone have any other ideas?
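The idea of a shared allow list of common false positives can be sketched as a small pre-filter in front of manual review. This is a minimal sketch under stated assumptions: the `split_by_allow_list` helper, the dictionary keys, and the `REF_...` identifiers are hypothetical placeholders, not real accessions or any screening tool's actual format.

```python
def split_by_allow_list(hits, allow_list):
    """Separate hits already judged to be false positives from hits
    that still need manual evaluation."""
    known_false_positives, to_review = [], []
    for hit in hits:
        if hit["reference_id"] in allow_list:
            known_false_positives.append(hit)
        else:
            to_review.append(hit)
    return known_false_positives, to_review


# Shared allow list: reference identifiers previously reviewed and judged
# to be common false positives (placeholder IDs, not real accessions).
allow_list = {"REF_0001", "REF_0002"}

hits = [
    {"query": "order_17", "reference_id": "REF_0001"},
    {"query": "order_18", "reference_id": "REF_0099"},
]
known, review = split_by_allow_list(hits, allow_list)
print([h["query"] for h in review])  # ['order_18']
```

Matching on a single identifier is the simplest possible design. In practice, whether a hit is a false positive can depend on context, as discussed above, so a real allow list would likely need to record more than an identifier.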
Course Conclusion
Hi everyone, it’s Krista again and congratulations! You have made it to the end of the course. I am so proud of you for making it this far and getting through all of that material. I know those were a lot of lectures to listen to and a lot of different exercises and case studies and outputs to work through, but you did it. You got to the end and I am so happy for you.
As we complete this course, I just wanted to review a few things and give you a couple updates. One is reviewing the objectives that we had for this course. Throughout all of this, I hope that you’ve gone through and completed the course in its entirety, including all the different portions and you’ve tried interpreting sample software outputs as well as maybe even trying to run the SeqScreen software on your own. I hope you filled out that pre-course survey and shortly you’re going to go fill out that post-course survey. Through all of that, you have become familiar with important biosecurity concepts and regulations. You’ve also gained experience by performing sequence screening and/or interpreting results on your own. Another important note just to reiterate is that SeqScreen software and all of the views and opinions about it presented here in this course are solely those of the authors. They do not represent any opinions from the US government or any government agency and certainly are not endorsed by any government agency. This is just from us to you, but thank you so much for joining us on this course.
Here’s the link to that post-course survey. We really appreciate you filling it out, so thank you so much in advance for doing that. And as another thing that you could do optionally, if you’d like to get more credit for the work that you have done here, please feel free to add this certification to your LinkedIn profile. I’m going to show you a few slides here just to demonstrate how I went ahead and did that. It is slightly more of a manual process, but it doesn’t take that much effort. So hopefully it’s easy for you to do. What you’re going to need to do is log into LinkedIn, then go to your profile page, and under the Licenses and Certification section, you’re going to click on that plus button. Once you do that, you’ll have some fields that you need to fill out. You’ll enter the name of the course, which is Biosecurity Sequence Screening Training Course for Bioengineers. Signature Science LLC will be the issuing organization, so you should be able to type that out and then it’ll find Signature Science and enter that there. Next, you’ll include the issue date. So for me, when I entered this on my LinkedIn profile, my issue date was when we originally delivered this course live to BioMADE Members, which was in March 2023. For you, your issue date is going to be whenever you completed this course. Because this course is online and self-directed, people can complete it at any time, so not everyone’s date is going to be the same. You go ahead and use whatever issue date is most reflective of when you finished this course. From there, you can also include the URL for the BioMADE Biosecurity Training Course if you’d like to. That’s an optional part. That’s not necessary. What I would definitely encourage you to do, though, beyond just entering the certification information on LinkedIn, is, while you’re logged in to your profile, to go to the skills section and add biosecurity as a skill.
And that will be a wonderful thing to do because then others can endorse you for that skill, like me. And if you’d like me to personally give you an endorsement for biosecurity on your LinkedIn profile, I would be happy to do that. Please send me a note by any means necessary. You can contact me. The easiest thing to do might be just to send me an email and let me know that you have completed this online course and you would love for me to give you an endorsement on LinkedIn, and I’d be happy to do so. So send me a note and, hopefully, we will connect in the future. And again, thank you so much for participating in this course. We really appreciate your time, and we hope that you found it valuable. Have a great day.