Case Study

Overcoming Language Barriers in Legal Publishing

Based in Puerto Rico, the client is a leading publisher of case law content, offering legal citations such as opinions and rulings issued by the courts. The client needed its existing case law archive repurposed so that the Spanish special characters (specifically tildes, accents, and the like), which the repository represented with standard ASCII characters, were included throughout its Puerto Rican content.

The client needed a customized service that met all of its content-enrichment requirements while delivering high quality within a turnaround time of 120 days.

  • Enriching and repurposing 74,000 case law citations comprising 1.27 billion characters within a four-month deadline
  • Linking the repository of 1 million XML files with content scattered across hardbound books, PDFs, RTFs, and scanned images
  • Inconsistencies in linking many of the input files to the XML files
  • Requirement to send the content repository of over 2,500 legal documents periodically in batches via file transfer protocol (FTP)
  • Scanning hardbound books and performing OCR on them before embarking on the content-enrichment process
  • Locating the content source, finding the exact match in the XML content, and then substituting the accented word with the entity value as required to represent the accent
  • Adhering to 99.95% accuracy

Solution and Approach

Using its experience in digitization and legal content, Lumina Datamatics drew up a customized approach for this project:

  • Extracting and validating data with a proprietary toolset, reducing manual proofreading while still achieving the required accuracy
  • Developing tools and programmatically managing different source input files
  • Deploying a trained team of XML programmers to manage the conversion
  • Applying manual proofreading where needed to enhance the case law content
  • Employing a large workforce to carry out the process

Lumina Datamatics implemented the following approach:

  • A combination of partially and largely automated solutions was used to enhance the contents of the input source files, which were in RTF/Word and hardcopy formats
  • Scanned hardcopy books were converted into electronic format through a hard copy conversion system (HCCS)
  • The input files sent by the client in hardcopy and electronic format were compared with XML documents and special characters were added to the XML documents
  • Standard ASCII characters in the XML contents were replaced with the corresponding special characters wherever the original hardcopy source called for them
  • The existing Spanish text was fixed by adding the appropriate special characters
  • The final dispatch files were placed in a predetermined FTP location
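
The comparison step described above (matching accented source text against ASCII-only XML text and restoring the special characters) can be sketched roughly as follows. This is a minimal illustration in Python, not the proprietary toolset used on the project: the function names `strip_accents`, `build_accent_map`, and `restore_accents` are hypothetical, and the sketch assumes it is given plain text content already extracted from the XML, not raw markup.

```python
import re
import unicodedata

# Runs of letters, Unicode-aware (digits and underscores excluded).
WORD = re.compile(r"[^\W\d_]+")


def strip_accents(word: str) -> str:
    """Remove combining marks, e.g. 'opinión' becomes 'opinion'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_map(source_text: str) -> dict[str, str]:
    """Map each unaccented spelling to the accented form seen in the source.

    Spellings that the source writes in more than one way (for example
    'el' and 'él') are skipped, since they need a human decision.
    """
    source_text = unicodedata.normalize("NFC", source_text)
    forms: dict[str, set[str]] = {}
    for word in WORD.findall(source_text):
        forms.setdefault(strip_accents(word).lower(), set()).add(word.lower())
    return {
        plain: next(iter(variants))
        for plain, variants in forms.items()
        if len(variants) == 1 and plain not in variants
    }


def restore_accents(ascii_text: str, accent_map: dict[str, str]) -> str:
    """Replace unaccented words with their accented form, keeping capitalization."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        accented = accent_map.get(word.lower())
        if accented is None:
            return word
        if word.isupper():
            return accented.upper()
        if word[0].isupper():
            return accented[0].upper() + accented[1:]
        return accented

    return WORD.sub(fix, ascii_text)


source = "La opinión del tribunal se basó en la sección 5."
xml_text = "La opinion del tribunal se baso en la seccion 5."
print(restore_accents(xml_text, build_accent_map(source)))
```

A dictionary lookup like this cannot tell apart words that differ only by an accent when the source happens to contain just one of them, which is one reason a manual proofreading pass remains part of the workflow.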


  • The tools ran the input files against the source content and highlighted the accented characters in the different texts
  • They flagged sequences of words in the XML files that had no special characters
  • The tools also replaced the matching strings in the XML files and, at the same time, assigned a hexadecimal code (entity value) to each accented character
  • Effective use of technology and industry knowledge saved the client time and effort in the enrichment of this project
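
To illustrate what assigning an entity value to an accented character can look like, here is a small Python sketch. The exact entity format the client required is not stated in this case study; the sketch assumes XML hexadecimal numeric character references (such as `&#xF3;` for ó), and the function name `to_hex_entities` is hypothetical.

```python
import unicodedata


def to_hex_entities(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    # Compose first so that a letter plus a combining accent becomes one character.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


print(to_hex_entities("opinión"))  # opini&#xF3;n
print(to_hex_entities("señaló"))   # se&#xF1;al&#xF3;
```

If decimal references are acceptable, Python's built-in `"opinión".encode("ascii", "xmlcharrefreplace")` produces the equivalent `opini&#243;n` without any custom code.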


Lumina Datamatics delivered high-quality content-enhancement services to the legal publisher, leading to more big-ticket projects.