Report:VantagePoint/Pre-Analysis Tasks/Data Cleaning/Automated Data Cleaning
Automated Data Cleaning
Typically, data cleaning removes duplicative or unnecessary items from a list, groups equivalent records together, or both. Examples include removing multiple instances of the same document (e.g. if the same document was included in two different data sets), or combining multiple names for the same company into one common company name.
Individual fields can be cleaned automatically by selecting Fields --> List Cleanup. The List Cleanup window allows users to select the field to be cleaned (only one field can be cleaned at a time) and the options for how to clean it, including which Fuzzy Matching file to use. The List Cleanup window is shown below.
A Fuzzy Matching file is a set of guidelines for how the system will sort the data. A good example would be cleaning patent assignee names. Often there will be many different entries for the same company, such as General Motors, Gen Motors Corp, or GM Corp. The matching file helps the system determine that all three entries, while different, represent the same company and should therefore be grouped together. The same principle applies to all fields that need to be cleaned, although slightly different fuzzy logic is used for each data field. The file that best matches the type of data to be cleaned should be selected (for example, “Organization Names.fuz” would be an appropriate choice when cleaning assignees, and “Person Names.fuz” could be selected to clean inventor names).
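To make the grouping idea concrete, here is a minimal Python sketch of fuzzy name grouping. It is not VantagePoint's algorithm and does not read .fuz files; the noise-word list, the 0.8 threshold, and the function names are all assumptions chosen for illustration, and the similarity measure is the standard library's difflib.

```python
from difflib import SequenceMatcher

# Hypothetical noise words; not taken from any VantagePoint matching file.
NOISE_WORDS = {"corp", "corporation", "inc", "co", "ltd", "company"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop corporate suffixes."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(w for w in cleaned.split() if w not in NOISE_WORDS)


def similarity(a, b):
    """Character-level similarity of the normalized names, between 0 and 1."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_names(names, threshold=0.8):
    """Place each name in the first group whose lead entry is similar enough."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group matched, so this name starts a new one.
            groups.append([name])
    return groups


names = ["General Motors", "Gen Motors Corp", "GM Corp", "Ford Motor Co"]
groups = group_names(names)
```

With these assumed settings, "General Motors" and "Gen Motors Corp" land in one group, while the abbreviation "GM Corp" does not, because character similarity alone cannot expand an abbreviation. That gap is the kind of case a thesaurus of known variants is meant to cover.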
Once the cleanup has been processed by VantagePoint, a list confirmation is displayed. In the figure below, the display shows the details behind how the system cleaned the list for assignee names. Variations of a company name are grouped together and one common name is assigned for those records. Users can manipulate this list and make changes. Finally, as a result of the list cleanup, the matches can be saved as a thesaurus that can be used later.
The cleaned list becomes a new field within VantagePoint. A sample cleaned list is shown below. This example shows assignees listed in order of occurrence.
The system also has two quick functions that work very similarly: Combine Duplicate Records and Remove Duplicate Records. With either function, the user selects the fields to combine or remove (the interface is the same for both functions), along with the desired matching type (exact or fuzzy). For fields where fuzzy matching is desired, the user selects a Fuzzy Matching file as described above. Either function creates a new dataset.
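The difference between the two functions can be sketched in Python using exact matching on the selected fields. The report does not spell out how each function treats the non-matching fields, so the behavior here is an assumption: "remove" keeps the first record for each key and drops the rest, while "combine" merges the remaining field values of matching records into lists. The function and field names are hypothetical.

```python
def remove_duplicates(records, key_fields):
    """Return a new list keeping only the first record for each key."""
    seen = set()
    kept = []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def combine_duplicates(records, key_fields):
    """Return a new list with one record per key; other fields become lists."""
    merged = {}
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        # Start the merged record with the key fields on first sight of a key.
        target = merged.setdefault(key, {f: rec.get(f) for f in key_fields})
        for field, value in rec.items():
            if field in key_fields:
                continue
            values = target.setdefault(field, [])
            if value not in values:
                values.append(value)
    return list(merged.values())


records = [
    {"patent": "US1", "source": "A"},
    {"patent": "US1", "source": "B"},
    {"patent": "US2", "source": "A"},
]
removed = remove_duplicates(records, ["patent"])
combined = combine_duplicates(records, ["patent"])
```

Both functions return a new list and leave the input untouched, mirroring the point that a new dataset is created.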
Fuzzy Matching Editor
The Fuzzy Matching Editor is accessed through Tools --> Fuzzy Editor. This tool allows users to create and edit their own Fuzzy Matching files. As an example, the figure below shows the settings applied to the “Organization Names” fuzzy matching file in VantagePoint. Notice that the system uses various algorithms to clean the data, such as stemming, but also allows for outside information sources, such as a thesaurus, to be incorporated during processing. The ability to load an outside thesaurus allows analysts to use customized information that is not totally reliant on the computer algorithm. For instance, a list of common spellings of assignee names could be included, thus ensuring that commonly misspelled variants are correctly grouped together.
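The interplay of an outside thesaurus with an algorithmic step such as stemming can be illustrated with a short Python sketch. The thesaurus entries, the crude plural-stripping rule, and the function names below are assumptions for illustration only; they are not the contents or logic of the “Organization Names” file.

```python
# Hypothetical thesaurus mapping known variants to a preferred name.
THESAURUS = {
    "gm": "general motors",
    "gen motors": "general motors",
}


def crude_stem(word):
    """Strip a trailing plural 's'; a stand-in for a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def canonical(name, thesaurus=THESAURUS):
    """Clean the name, apply the thesaurus if it has an entry, then stem."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    key = " ".join(cleaned.split())
    preferred = thesaurus.get(key, key)
    return " ".join(crude_stem(w) for w in preferred.split())


variants = ["GM", "Gen. Motors", "General Motors", "Ford Motor"]
canonical_forms = [canonical(v) for v in variants]
```

Applying the thesaurus before the algorithmic step lets curated knowledge (here, that "GM" means General Motors) override what string processing alone could infer.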
Furthermore, a weighted matching function can be applied. The user assigns a weight to each word in a phrase before the percent match between records is calculated: a weight of 100 percent marks a word as very important within the phrase, while a weight of 20 percent marks it as only marginally important. For example, in the phrase "solar energy reactor", the first word "solar" could be weighted at 100 percent, and "energy" and "reactor" at 20 percent each. The percentages do not need to add up to 100 percent, as each percentage stands alone. Thus, when the Fuzzy Matching file is applied, records containing the term "solar" would score a closer overall percentage match than records containing only the terms "energy" or "reactor".
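The report does not give the formula VantagePoint uses, so the following Python sketch shows only one plausible way such weights could feed into a percent match: the score is the sum of the weights of the phrase words found in a record, divided by the sum of all weights. The function name and the normalization are assumptions.

```python
def weighted_match(phrase_weights, record_text):
    """Percent match: weight of matched phrase words over total weight."""
    words = set(record_text.lower().split())
    total = sum(phrase_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w for term, w in phrase_weights.items() if term in words)
    return 100.0 * matched / total


# Weights from the "solar energy reactor" example above.
weights = {"solar": 100, "energy": 20, "reactor": 20}

solar_only = weighted_match(weights, "solar panel array")
energy_reactor = weighted_match(weights, "energy reactor design")
full_phrase = weighted_match(weights, "solar energy reactor")
```

Under this assumed formula, a record containing only "solar" scores 100/140, or about 71 percent, while a record containing both "energy" and "reactor" but not "solar" scores 40/140, or about 29 percent.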
The List Cleanup feature is strong enough for new users to get started with VantagePoint right away; however, its true value appears over time as the user becomes acquainted with the system. Users can gradually improve the Fuzzy Matching files, for example by adding more relevant thesaurus entries and "ignore" lists, to customize the feature. Essentially, each time a list cleanup is run, some data can be saved that will improve the results should a similar data set be analyzed in the future.