Controlled vocabulary is metadata produced by a method of identifying, standardizing, and codifying the language used to tag individual records. Every database or collection indexed with controlled vocabulary comes with a corresponding thesaurus: a list of accepted, agreed-upon terms that can be used as tags. In intellectual property databases, controlled vocabulary is an enhancement applied by a team of human indexers employed by the database producer (as opposed to the authors of the individual publications). Prominent databases that use controlled vocabularies include Compendex, Inspec, and NTIS.
The need for controlled vocabulary arises for a number of reasons. First, English and most other languages have a number of words that mean different things in different contexts. A word such as “socket” may have more than one meaning: it could represent an electrical outlet, a hollow part of a wrench or other mechanical device, or a single connection between two network applications in computing. Words with the same spelling but different meanings are called homographs, and they can make professional intellectual property searching difficult by eliciting false hits. Using controlled vocabulary, a separate thesaurus entry can be defined for each potential use of the word “socket.” The entries for the above example might be “electrical socket,” “socket (structure),” and “network socket.” By using a thesaurus of controlled vocabulary, the indexing staff will consistently use the same standardized term to express a certain meaning, and searchers will know exactly which keywords to use to find the subject matter they are looking for.
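The homograph problem can be illustrated with a minimal Python sketch. The records, field names, and both search functions are hypothetical and do not reflect the interface of any real database; they only contrast a full-text match on “socket” with a match on one controlled term.

```python
# Hypothetical records: free text plus controlled terms assigned by an indexer.
RECORDS = [
    {"id": 1, "text": "A wall socket rated for 240 V.",
     "controlled_terms": ["electrical socket"]},
    {"id": 2, "text": "The wrench socket grips a hex nut.",
     "controlled_terms": ["socket (structure)"]},
    {"id": 3, "text": "The client opens a socket to the server.",
     "controlled_terms": ["network socket"]},
]


def keyword_search(records, word):
    """Full-text search: match any record whose text contains the word."""
    word = word.lower()
    return [r["id"] for r in records if word in r["text"].lower()]


def controlled_search(records, term):
    """Controlled search: match only records tagged with the exact term."""
    return [r["id"] for r in records if term in r["controlled_terms"]]


if __name__ == "__main__":
    print(keyword_search(RECORDS, "socket"))             # all three senses
    print(controlled_search(RECORDS, "network socket"))  # computing sense only
```

In this toy collection the keyword search returns every record, including the two false hits, while the controlled term isolates the single computing record.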
Second, many words in English and most other languages have synonyms, or distinctly spelled and pronounced words that mean the same or similar things. For instance, a flange at the base of a metal tube may be called a flange, a rib, a rim, or a lip, among other things. Controlled vocabulary lets an indexing board decide on a standardized term for that collection of structural elements and use that controlled term to tag all of the records that have that feature, regardless of what the original text calls it. By searching with controlled terms, records with inconsistent or vague language can be found more effectively, because they are grouped together under a standardized term that is made clear to both the indexers and the users.
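The synonym consolidation described above can be sketched as a lookup from variant words to one preferred term. The mapping and the function name below are hypothetical, and the automatic word match is a simplification: a human indexer would judge from context whether “rib” or “lip” really denotes a flange.

```python
import re

# Hypothetical thesaurus fragment: each variant points to one preferred term.
PREFERRED_TERM = {
    "flange": "flange",
    "rib": "flange",
    "rim": "flange",
    "lip": "flange",
}


def index_record(text):
    """Return the sorted controlled terms for variant words found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({PREFERRED_TERM[w] for w in words if w in PREFERRED_TERM})


if __name__ == "__main__":
    print(index_record("A metal tube with a rim at its base"))
    print(index_record("The tube has a lip and a rib."))
```

Both sample texts receive the same controlled term even though neither contains the word “flange,” so a single controlled search would retrieve both.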
The benefits of controlled vocabulary are 1) clearer and more specific search terms, and 2) standardization of the terms being searched. Indexing records with controlled terms clarifies technical language and removes the confusion that can arise when words have more than one meaning. This in turn can lessen the burden on the searcher, who may create shorter, more targeted search strings and use fewer keywords than are necessary in a full text search. Standardization through consolidating many different synonyms into one controlled term leads to broader collections of similar subject matter, increasing the number of useful hits returned by the search.
However, controlled vocabulary has some inherent problems. First, since the entire system of controlled vocabulary is run by humans, there can be some human error. This can occur when deciding on terms to put in the thesaurus (the terms are too narrow or too broad, leaving gaps) or deciding with which terms to tag a record (an indexer overlooks an important part of the original document, or just chooses the wrong term). The indexers in charge of maintaining and applying the controlled vocabulary cannot be relied upon to be perfect in these aspects and this can lead to either missed hits or false hits.
Second, there is a timeliness problem inherent in this method. Controlled vocabulary needs to be agreed upon and codified into a thesaurus before indexers can use it. This leads to a lag between the arrival of new technology and the time that new terms of art are coined and accepted as industry standards. In the interim, indexers cannot accurately describe the contents of a record for which no corresponding thesaurus entries exist. Unless the database is retroactively indexed to reflect new thesaurus entries, this leaves a gap of records that a searcher may later be unable to find. To compensate for this problem, some sources that use controlled vocabulary (such as Compendex and Inspec) have an additional field for “uncontrolled vocabulary.” In the uncontrolled vocabulary field, indexers are allowed to insert their own terms, which may include terms that do not occur in the controlled vocabulary thesaurus due to gaps or timeliness problems. In addition to containing new or uncodified terms, the uncontrolled index often contains overflow from other indexing fields, and serves as a catchall. Searching in this field as well as the controlled field can sometimes allow users to pick up additional relevant records.
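A minimal Python sketch of searching the controlled and uncontrolled fields together follows. The field names, sample records, and index terms are hypothetical and are not taken from Compendex or Inspec; the sketch assumes one record was indexed before a term entered the thesaurus and one after.

```python
# Hypothetical records: record 1 was indexed before "3D printing" became a
# controlled term, so the indexer placed it in the uncontrolled field.
RECORDS = [
    {"id": 1, "controlled": ["rapid prototyping"], "uncontrolled": ["3D printing"]},
    {"id": 2, "controlled": ["3D printing"], "uncontrolled": []},
    {"id": 3, "controlled": ["injection molding"], "uncontrolled": ["overflow note"]},
]


def search_fields(records, term, fields=("controlled", "uncontrolled")):
    """Return ids of records carrying the term in any of the given fields."""
    term = term.lower()
    hits = []
    for record in records:
        for field_name in fields:
            if any(term == t.lower() for t in record.get(field_name, [])):
                hits.append(record["id"])
                break  # count each record once, whichever field matched
    return hits


if __name__ == "__main__":
    print(search_fields(RECORDS, "3D printing", fields=("controlled",)))
    print(search_fields(RECORDS, "3D printing"))
```

Restricting the search to the controlled field finds only the later record; adding the uncontrolled field also recovers the record indexed during the lag.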
Third, controlled vocabulary is not standardized across different search systems. Individual thesauri must be consulted by searchers to understand the agreed-upon language of that search database. This leads to relearning the technical language used for controlled vocabulary for each different search database, which can take up valuable time.
Controlled vocabulary is a great additional searching tool if the searcher knows its strengths and weaknesses. It can help users eliminate false hits because its terms are more specifically applied than general keywords, and/or attract more useful hits by retrieving multiple synonyms with one standardized controlled term. However, it can also cause a careless searcher to miss useful hits if they are not aware that controlled vocabulary can be incomplete or incorrectly applied.
Where controlled indexing is available, the user should adopt a specific search strategy that takes its inherent strengths and weaknesses into account. The feature’s strengths are apparent: controlled vocabulary allows searchers to find additional hits that would be missed by keyword-only searching or classification searching. For example, a searcher building a keyword-only query may forget to include a specific synonym, and the search will then miss every record that uses only that synonym. Classification searching cannot find a record that has been misclassified. Controlled vocabulary searching helps fill in those gaps where useful hits may fall.
On the other hand, exclusively searching by controlled vocabulary could expose the weaknesses inherent in controlled vocabulary systems, discussed above. Relying on human indexers as intermediaries can introduce additional human error between the author of the record and the searcher. To address this weakness, it is highly recommended that searchers use controlled vocabulary searching as a supplement to established search methods, rather than a total replacement.
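The recommendation to supplement, not replace, established methods can be sketched as a union of two result sets. The records and the function below are hypothetical; record 2 uses only a synonym in its text, and record 3 is assumed to have been left untagged by an indexer.

```python
# Hypothetical collection illustrating both kinds of missed hit.
RECORDS = [
    {"id": 1, "text": "A flange welded to the tube.", "controlled_terms": ["flange"]},
    {"id": 2, "text": "A lip at the base of the pipe.", "controlled_terms": ["flange"]},
    {"id": 3, "text": "A flange with bolt holes.", "controlled_terms": []},
    {"id": 4, "text": "A gear housing.", "controlled_terms": ["gear"]},
]


def combined_search(records, keywords, controlled_terms):
    """Union of keyword hits and controlled-term hits, as sorted record ids."""
    keyword_hits = {
        r["id"] for r in records
        if any(k.lower() in r["text"].lower() for k in keywords)
    }
    controlled_hits = {
        r["id"] for r in records
        if set(controlled_terms) & set(r["controlled_terms"])
    }
    return sorted(keyword_hits | controlled_hits)


if __name__ == "__main__":
    print(combined_search(RECORDS, ["flange"], ["flange"]))
```

Here the keyword search alone would miss record 2 and the controlled search alone would miss record 3; the union retrieves both while still excluding the unrelated record.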
An additional search strategy related to controlled vocabulary is getting to know the thesaurus associated with the database one is using. Thesauri sometimes have additional information that can help users understand the history of how a term has been applied and when it was introduced, so that they know which dates would not be covered by hit results and can compensate with additional controlled terms. Inspec’s thesaurus, for example, includes information about broader, narrower, prior, and top terms, as well as the year in which the term was introduced.
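One way to act on a term's introduction date can be sketched as follows. The entry structure, the sample term, its year, and its prior term are all invented for illustration and are not taken from Inspec's thesaurus; the sketch assumes a prior term is the one indexers used before the newer term existed.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str
    introduced: int  # year the term entered the thesaurus (hypothetical)
    prior_terms: list = field(default_factory=list)
    broader_terms: list = field(default_factory=list)
    narrower_terms: list = field(default_factory=list)


# Hypothetical entry; the year and related terms are made up.
THESAURUS = {
    "network socket": ThesaurusEntry(
        term="network socket",
        introduced=1995,
        prior_terms=["interprocess communication"],
        broader_terms=["computer networks"],
    ),
}


def terms_for_range(thesaurus, term, start_year):
    """Return the term, plus its prior terms when the search range starts
    before the term was introduced."""
    entry = thesaurus[term]
    terms = [entry.term]
    if start_year < entry.introduced:
        terms.extend(entry.prior_terms)
    return terms


if __name__ == "__main__":
    print(terms_for_range(THESAURUS, "network socket", 1980))
    print(terms_for_range(THESAURUS, "network socket", 2000))
```

A search reaching back before the introduction year picks up the prior term as well, while a search confined to later years uses the current term alone.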