Quick and easy search for information in scientific books
Imagine you have a container packed with numerous digitized documents. Finding one specific piece of information in this data is akin to the proverbial search for a needle in a haystack. A department of the Swiss pharmaceutical company Novartis faced exactly this problem and identified an urgent need for a sophisticated, fast and reliable search function.
Searching for specific information within documents in a browser sounds easy at first. By the time our experts got involved, Novartis already had a comprehensive body of digitized documents and a Solr index (provided by third parties).
Using a newly developed web UI, users can enter one or more search terms in an integrated form. The matching PDF files are displayed and can be opened. Due to copyright concerns, Novartis wanted to disable features such as printing or saving documents. Using the PDF viewer built into conventional browsers was therefore not a viable option. Instead, the open source library PDF.js was integrated and adapted.
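As an illustration of how search terms from such a form might be turned into a request against a Solr index, here is a minimal sketch in Python. It uses only Solr's standard select handler and its common parameters (q, fl, rows, wt). The core URL, the field names content, id and title, and the choice to combine terms with AND are assumptions made for this example and are not taken from the Novartis project.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, terms: list[str], rows: int = 20) -> str:
    """Build a Solr /select URL that requires every search term to match.

    Assumptions for this sketch: the index has a full-text field named
    "content" and stored fields "id" and "title".
    """
    cleaned = [term.strip() for term in terms if term.strip()]
    if not cleaned:
        raise ValueError("at least one search term is required")

    # Wrap each term in double quotes so multi-word input is treated as a
    # phrase, escaping backslashes and quotes the user may have typed.
    quoted = [
        '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for term in cleaned
    ]
    query = " AND ".join(f"content:{phrase}" for phrase in quoted)

    params = {"q": query, "fl": "id,title", "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr core named "books".
    print(build_search_url("http://localhost:8983/solr/books", ["kinase", "protein binding"]))
```

The web UI would send this request, read the document list from the JSON response, and show the matching PDF files as results.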
To display a file correctly, the client-side PDF viewer had to wait until the entire PDF had been downloaded to the target computer, which resulted in poor performance. To solve this problem, the project team decided to use page splitting, in which a multi-page PDF is split into individual pages so that the viewer does not have to load the complete document before displaying a page.
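Page splitting of this kind can be sketched in a few lines of Python. The example below uses the third-party pypdf library, which is an assumption for illustration: the text does not say which tool the project team used. The file naming scheme in page_filename is likewise hypothetical.

```python
from pathlib import Path


def page_filename(doc_id: str, page_number: int) -> str:
    """Hypothetical naming scheme: one file per page, 1-based and zero-padded."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"{doc_id}-p{page_number:04d}.pdf"


def split_pdf(source: Path, out_dir: Path) -> list[Path]:
    """Write every page of `source` to `out_dir` as its own single-page PDF."""
    # Third-party dependency (pip install pypdf), imported lazily so the
    # naming helper above can be used without it.
    from pypdf import PdfReader, PdfWriter

    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(source))
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # copy exactly one page into a new document
        target = out_dir / page_filename(source.stem, number)
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

With the pages stored as separate files, the viewer can request only the file for the page being displayed, for example page_filename("handbook", 7) for page 7 of a document called handbook, instead of downloading the whole book first.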
The version of the PDF viewer customized for Novartis also includes a zoom function and page navigation for the complete PDF document. The keyword search within a page, including highlighting of the search results, is carried out via the browser’s integrated search function.
One of the next steps is to define how new documents are uploaded and indexed so that they can also be searched.