Biotechnology Searching Best Practices
The field of biotechnology involves natural and synthetic biological molecules, cells, tissues, organs and even whole organisms. Searching for biotechnology information typically requires a considerable amount of flexibility from the searcher, because subjects in biotechnology can be broad, necessitating an understanding of multidisciplinary fields. A searcher in this area may need knowledge of biology, biochemistry, chemistry, chemical engineering, pharmacology, or even mechanical engineering in order to perform a thorough search on a given invention. In addition, searchers are often required to use specialized databases and tools to perform nucleic acid and polypeptide sequence searching.
For many biotechnology searches, a person considered to “have ordinary skill in the art” holds a Ph.D. in one of the areas listed above. For other biotechnology searches, some graduate work or a bachelor’s degree may suffice.
Obstacles Facing the Searcher
The main obstacle facing an at-large searcher in biotechnology is the need to quickly become familiar with related technologies in which the searcher has little or no experience. For example, a molecular biologist might have to perform a search on well-drilling fluids, or a microbiology expert might have to become familiar with a drug and the disease(s) for which it is used.
In addition, searchers must be able to quickly decipher complex disclosures that may contain very field-specific jargon. For example, a searcher with more of a chemistry background might have to become familiar with terms such as FISH (fluorescent in situ hybridization), FRET (fluorescence resonance energy transfer), FRAP (fluorescence recovery after photobleaching) or RT-PCR (reverse transcription polymerase chain reaction) for a molecular biology-related search.
Disclosures in this field can include many technical details that are not relevant from a search strategy perspective, such as a very detailed description of a molecular pathway when the invention is a nucleotide or protein sequence. The searcher and the search requester often need to clarify the points of novelty and other important concepts.
Identifying the proper classifications for concepts involved in biotechnology inventions is another challenge. Sometimes the subject is not correctly categorized, or the subject matter can be found in many different classifications. In the following example, each phrase represents an area of subject matter that would fall into a different classification area:
- a transgenic animal,
- with an exogenous nucleotide sequence,
- producing a protein,
- and an antibody for that protein produced by an organism.
Searching Patent Documents
One important facet of biotechnology searching involves text-based queries of information databases. For this reason, a customizable interface that easily highlights the keywords in context is valuable for efficiently reviewing large numbers of documents. For searches which require image searching, such as for medical devices and the like, USPTO EAST is superior, but requires proximity to the USPTO. For nucleic acid and protein sequence searching, GenomeQuest, STN sequence databases, and NCBI-BLAST can be used to identify patent sequences.
Biotechnology related prior art tools containing patent data include, but are not limited to:
NCBI is the National Center for Biotechnology Information. It provides a pat (“patent”) file that is searchable using BLAST (Basic Local Alignment Search Tool) algorithms for polypeptide and nucleic acid sequence similarity searching. NCBI does not provide details about the pat database other than that it adds the sequences that the United States Patent and Trademark Office (USPTO) sends, and it makes no warranties about completeness. The file includes sequences from US, WO, EP, and some JP and KR patent documents, and it is mirrored on various web resources. Other non-patent sequence databases are also searchable through NCBI: GenBank, a large database and data repository of nucleic acid and protein sequences at the National Library of Medicine; RefSeq, a curated database of GenBank's genomes, mRNAs and proteins; Swiss-Prot, a non-redundant, annotated and cross-referenced protein sequence database (a subdivision is TrEMBL); and PDB, a database and file format describing the 3D structure of a protein or nucleic acid, as determined by X-ray crystallography and nuclear magnetic resonance. Sequence searching on NCBI is free and can be used as a preliminary search, but it cannot be considered complete for establishing novelty. NCBI also provides many other searching tools, including PubMed for finding peer-reviewed journal articles and PubChem for identifying compounds. NCBI’s counterpart in Europe is the European Bioinformatics Institute (EBI).
GenomeQuest is a proprietary peptide and nucleic acid sequence search service, available on a pay-per-use or subscription basis. GenomeQuest maintains its own patent sequence database, GQPAT, which is very complete. Patent sequence data typically enter the database 7 to 15 days after publication. Many non-patent data files are also simultaneously searchable, such as GenBank, RefSeq, Swiss-Prot, and other NCBI and EBI files. Sequences are searchable using four options. The GenePAST identity search is usually the preferred method for intellectual property searching. A BLAST similarity search is also available, as are two other algorithms, fragment search and motif search. GenomeQuest provides a DrugBank Pro tool to search annotation information on all drugs, targets and their metabolizing enzymes. This information is available 6 months ahead of the University of Alberta’s public DrugBank website. It may be linked to GenomeQuest sequence searching.
DGENE (GENESEQ) is a database of nucleic acid and peptide sequences extracted from the basic patent documents of the 47 issuing authorities covered by the Derwent World Patents Index (file WPINDEX/WPIDS/WPIX). The file covers patent literature from 1981 to present and is updated every two weeks. GENESEQ is used routinely by examiners at the USPTO and other major patent offices around the world.
USGENE covers all peptide and nucleic acid sequences from the published applications and issued patents of the United States Patent and Trademark Office (USPTO). The USGENE database includes extensive bibliographic and text search options, including abstracts and patent claims. The file covers data from 1981 to present and is updated weekly; new sequences are available within 3 days of publication. As of January 28, 2011, USGENE was known to provide a more complete collection of U.S. patent sequence documents than GenomeQuest's GQPAT.
PCTGEN covers nucleotide and amino acid sequence information submitted electronically to the World Intellectual Property Organisation (WIPO) by patent applicants. The file contains nucleotide and peptide sequences and bibliographic data. Sequence and patent application information is compiled in this file as given by the patent applicant. The file comprises data from August 2001 to present and is updated weekly. New sequences are available within 24 hours of publication.
Searching Non-Patent Literature
For non-patent literature searching, PubMed, NCBI’s free public MEDLINE search system, is the place to start for identifying peer-reviewed journal abstracts. Titles freely available in full-text are searchable with PubMed Central. More detailed non-patent searching may require subscription and/or pay-per-use services.
Other sources for non-patent literature are numerous and varied in their content. A short list of related non-patent literature tools that biotechnology searchers may wish to consult includes:
For a more complete listing of biotechnology resources, see the Resource Finder.
Sequence Search Tools
Genetic sequence searching is one area where the use of commercially produced and curated databases is essential. There are various obstacles to producing electronically searchable sequence databases, not the least of which is ensuring that the data is appropriately digitized. As late as 2008, at least one major patent data collection was not fully produced electronically and had to be digitized through Optical Character Recognition (OCR) scanning: the World Intellectual Property Organisation (WIPO) published some Patent Cooperation Treaty (PCT) applications in print form only in its gazette. Additionally, genetic sequences in PCT applications may be published apart from the patent text, as a separate document or an addendum to the patent “pamphlet,” so database producers must take care to obtain and scan these separate portions properly.
The timeliness of the sequence data is also a concern, especially because heavily curated databases are likely to be slower to upload new sequence data.
Public databases such as GenBank/NCBI also have a role to play in a thorough sequence search because they have the potential to be more timely and more economical. However, these databases are not secure, which presents a difficulty for highly confidential investigations.
- NCBI-BLAST (nr [non-redundant] database)
- STN (DGENE, USGENE, PCTGEN, REGISTRY)
- European Bioinformatics Institute (EBI)
- Nucleic Acids: EMBL_All, GenseqNA
- Proteins: UNIPROT, GenSeqAA, EPOP, USPOP, JPOP
- Some “exotic” databases: HGVBase, PDB, etc.
- GenomeQuest’s KERR
- GenomeQuest’s BLW (fragment search)
- GenomeQuest’s translated search algorithms (tRNA to prot and vice versa)
Specific Search Strategies
For a search in this subject area that does not require sequence searching, the following strategies are applicable:
- Identify synonyms for compounds, devices, and genetic elements
- Identify relevant US Patent Classification subclasses (class 424 for pharmaceuticals; classes 514, the 520 series and the 532-570 series for chemical compounds; the 600 series and class 128 for medical devices; etc.). This can be done by running very specific keyword searches to identify relevant subclasses, or by examining the class/subclass of any patents provided.
- Exhaustive subclass searching, noting synonyms of keywords previously used.
- Classification searching (including US Patent Classification, IPC, ECLA, DEKLA, and Japanese F-Index and F-Terms) with keyword limiters and global text search.
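The synonym and classification steps above can be sketched in code. The following is a minimal illustration only, assuming a generic Boolean syntax with `AND`/`OR` operators and quoted phrases. The `CLASS:` field prefix and the helper names `or_group` and `build_query` are hypothetical and do not correspond to any particular search system, so the output would need to be adapted to the syntax of the tool actually used.

```python
def or_group(terms):
    """Join synonyms into one parenthesised OR group, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts, classes=None):
    """AND together one OR group per concept.

    `classes` is an optional list of classification codes used as a limiter.
    The "CLASS:" prefix is a made-up field label for illustration only.
    """
    parts = [or_group(terms) for terms in concepts]
    if classes:
        parts.append("CLASS:" + or_group(classes))
    return " AND ".join(parts)


# One list of synonyms per concept in the invention.
concepts = [
    ["transgenic", "genetically modified"],
    ["mouse", "mice", "murine"],
]

# Keyword groups limited to two US classes from the list in this article.
query = build_query(concepts, classes=["800", "435"])
print(query)
```

Keeping each concept as its own synonym list makes it easy to reuse the same groups for an exhaustive subclass search and for a classification search with keyword limiters.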
For sequence searching:
- Find the sequence for the gene or protein in GenBank (unless provided).
- Review your sequence query*.
- Determine search limits (percent identity threshold, polymer length range for answers)
- Upload your sequence into GenomeQuest, STN, NCBI-BLAST or other chosen search engine.
- Run your queries for the sequence in the appropriate databases.
- Note: When first performing sequence searches, have another person review your queries before you run them, to confirm that the queries are correct and the chosen databases are appropriate. These database services are usually very expensive, so a second pair of eyes helps avoid costly mistakes.
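Two of the steps above, reviewing the query and applying search limits, lend themselves to a small sketch. The code below is an illustration under stated assumptions, not the method used by GenomeQuest, STN or NCBI-BLAST: the `Hit` record, its field names and the accession strings are hypothetical, and percent identity is assumed to be identical positions divided by alignment length (tools define it differently, so check the documentation of the engine in use).

```python
from dataclasses import dataclass

NUCLEOTIDES = set("ACGTUN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZ")


def clean_sequence(raw):
    """Drop FASTA header lines, then keep only letters, upper-cased."""
    lines = [ln for ln in raw.splitlines() if not ln.startswith(">")]
    return "".join(ch for ch in "".join(lines).upper() if ch.isalpha())


def guess_type(seq):
    """Guess the polymer type from the residue alphabet.

    A short protein made only of A, C, G, T and N would be reported as
    nucleotide, so the result is a prompt for review, not a guarantee.
    """
    if not seq:
        raise ValueError("empty sequence")
    residues = set(seq)
    if residues <= NUCLEOTIDES:
        return "nucleotide"
    if residues <= AMINO_ACIDS:
        return "protein"
    raise ValueError(f"unexpected characters: {sorted(residues - AMINO_ACIDS)}")


@dataclass
class Hit:
    """Hypothetical record for one search result."""
    accession: str
    identical: int        # identical positions in the alignment
    aligned_length: int   # total alignment length
    subject_length: int   # length of the sequence that was found

    @property
    def percent_identity(self):
        # Assumed definition: identities over alignment length.
        return 100.0 * self.identical / self.aligned_length


def filter_hits(hits, min_identity, min_len, max_len):
    """Keep hits at or above the identity threshold and within the length range."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity and min_len <= h.subject_length <= max_len
    ]


query = clean_sequence(">query 1\nacgt acgt\n12 NNAC")
print(query, guess_type(query))

hits = [
    Hit("HYPO_1", 95, 100, 120),
    Hit("HYPO_2", 70, 100, 110),
    Hit("HYPO_3", 99, 100, 5000),
]
kept = filter_hits(hits, min_identity=90.0, min_len=50, max_len=500)
print([h.accession for h in kept])
```

Running a check like this locally before uploading a query costs nothing, which fits the advice above about catching mistakes before using an expensive service.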
Key Classification Areas
Identify classes and subclasses by developing a detailed search string designed to find a small group of closely related patents. A phone conversation with a United States Patent and Trademark Office (USPTO) Examiner is also beneficial after identifying some initial US classes and subclasses.
The following US and International Patent Classification (IPC) classes are common to biotechnology:
- 047 Plant husbandry
- 119 Animal husbandry
- 127 Sugar, starch, and carbohydrates
- 131 Tobacco
- 424 Drug, bio-affecting and body treating compositions
- 426 Food or edible material: processes, compositions, and products
- 435 Chemistry: molecular biology and microbiology
- 436 Chemistry: analytical and immunological testing
- 504 Plant protecting and regulating compositions
- 514 Drug, bio-affecting and body treating compositions
- 800 Multicellular living organisms and unmodified parts thereof and related processes
- 930 Peptide or protein sequence
- 977 Nanotechnology
- Section A – Human Necessities
- Section C – Chemistry, Metallurgy
Refer to Other Articles
It is recommended that one read the corresponding best practices articles on Chemistry and Pharmacology, Chemical Engineering, and Medical Devices. Due to some overlap in the fields, the best practices and sources disclosed in those articles may also be applicable to Biotechnology searches.