Complex Desktop Publishing: Hieroglyphic PDF File Recreation for Translation

#DIGITAL TRANSFORMATION #DESKTOP PUBLISHING PREPARATION #OPTICAL CHARACTER RECOGNITION

This case study describes how a world-leading language service provider and Palex Group worked together to make a highly difficult PDF file “live”, that is, fully editable, on a tight schedule.

Background

The client urgently needed 14 scanned pages of a 50-year-old medical manuscript in Japanese and English to be recreated in Word format for further translation. Such a complicated request might seem flatly impossible to many teams, but not to Palex.

Source text in PDF

When does one need to perform this kind of job?

You have a non-editable source file (image, video, PDF, etc.) whose content you need to copy, edit or translate freely. Simply saving the file as an editable copy is not an option here. The file may convert to an editable format, but the result is often badly distorted, so you still have to work hard to make the text and formatting match the original file.

What is OCR and preparation for translation?

To make an editable file look just like the source file, you always need to perform two steps.

1. Optical Character Recognition or OCR

OCR software is used to convert the text and other content of the source into a live, editable format. This step introduces many inaccuracies in both content and formatting, which is why human input is still required.

2. Desktop Publishing Preparation or DTPP

During this step, a DTPP professional prepares the editable file for further use or translation. For example, they make sure that all the required text is visible, that the necessary formatting is applied consistently, that no extra line breaks are left, and so on.
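One of the cleanup actions mentioned above, removing extra line breaks, can be partly automated before the manual review. The following is a minimal Python sketch, assuming the OCR output is plain text in which a blank line separates paragraphs. The function name `join_wrapped_lines` and its `joiner` parameter are illustrative and do not come from any Palex tool.

```python
import re


def join_wrapped_lines(raw: str, joiner: str = " ") -> str:
    """Merge hard line breaks that fall inside a paragraph.

    Assumption: a blank line marks a real paragraph break, and any
    other line break is an artifact of the OCR or PDF conversion.
    """
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Drop empty fragments and surrounding whitespace on each line.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        cleaned.append(joiner.join(lines))
    return "\n\n".join(cleaned)


sample = (
    "During this step the DTPP\nprofessional prepares the file.\n\n"
    "No extra line\nbreaks are left."
)
print(join_wrapped_lines(sample))
```

For Japanese text, which is written without spaces between words, `joiner=""` is likely the better choice. A script like this only handles the mechanical part; the consistency of formatting still needs a human check.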

Can I do OCR and DTPP by myself?

Sure, you can. Just take a few factors into account: the locale and fonts for the language must be installed on your PC, you need appropriate OCR software, and you need plenty of time for formatting the file and for Quality Assurance of the content and format.

When working on easily formattable European languages, Palex DTP engineers can reach 20-25 pages per hour for formatting and about 1 page per minute for QA.

Difficult Asian and Right-to-Left languages with a sophisticated layout run at about 7-9 pages per hour, and QA takes more than 2 minutes per page. Production can take even longer, depending on the file.
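The rates above can be turned into a rough first estimate. The sketch below is only an illustration under simple assumptions: formatting and QA are done one after the other, the rates stay constant across the file, and the function name `estimate_hours` is hypothetical. The 14-page example size matches the manuscript in this case study.

```python
def estimate_hours(pages: int, dtp_pages_per_hour: float, qa_minutes_per_page: float) -> float:
    """Return a rough effort estimate in hours for formatting plus QA."""
    dtp_hours = pages / dtp_pages_per_hour
    qa_hours = pages * qa_minutes_per_page / 60
    return dtp_hours + qa_hours


# Lower-bound rates quoted in the text for each language group.
european = estimate_hours(14, dtp_pages_per_hour=20, qa_minutes_per_page=1)
asian_rtl = estimate_hours(14, dtp_pages_per_hour=7, qa_minutes_per_page=2)

print(f"European layout: about {european:.1f} h")
print(f"Asian or RTL layout: about {asian_rtl:.1f} h")
```

These figures describe files that OCR handles reasonably well. A file that needs manual character recognition can take far longer than such an estimate suggests.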

If you do not have much time for this job, contact an expert, who will estimate the turnaround time and budget within 1 business day.

Can I skip the OCR and DTPP steps for translation?

Yes. One option is an “at sight” translation typed directly into an editable file. It works well if plain target text without formatting is acceptable and translation quality is not a priority. It hardly works with tables, formulas, flowcharts and similar elements.

This solution also precludes the use of CAT tools and comes with a high risk of omissions, additions, inconsistent terminology and other translation issues that could otherwise be easily checked and corrected with a QA tool.

If you then use the target text as the source file for further target languages, the risks and issues multiply with the number of languages.

We recommend contacting an expert, who will estimate all the risks and propose the best solution within your budget.

Let us get back to our 14-page medical manuscript and the challenges we enjoyed.

Japanese language facts

~2,800 common characters

50,000+ characters in the total set

A Japanese font should support Kanji characters and the Kana scripts (Hiragana and Katakana), as Japanese uses all three.
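The three scripts occupy separate ranges in Unicode, so a short script can show how they are mixed in a piece of recognized text. This is a minimal sketch, assuming only the main Unicode blocks are relevant: Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), and the CJK Unified Ideographs block plus Extension A for Kanji. The function names and the sample string are illustrative.

```python
from collections import Counter


def classify_char(ch: str) -> str:
    """Classify one character by the Unicode block it belongs to."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    # CJK Unified Ideographs and Extension A; rarer extensions are ignored here.
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"
    return "other"


def script_profile(text: str) -> Counter:
    """Count the characters of each script, skipping whitespace."""
    return Counter(classify_char(ch) for ch in text if not ch.isspace())


profile = script_profile("医学のテスト")
print(dict(profile))
```

A profile like this can hint at whether a font or an OCR engine has to cope with all three scripts on a given page, but it says nothing about whether each character was recognized correctly.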

Facts about case study file

✅ Text in the file could be highlighted and copied.
✅ All the fonts are available.

❌ Automated OCR by regular software fails: hieroglyphs are corrupted.

The problem is that Japanese OCR is still far from ideal. Methods used for the Latin alphabet do not perform well with Japanese because of the complexity and the sheer number of Japanese characters.
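Some kinds of corruption in copied or recognized text can be flagged automatically before a linguist looks at the file. The sketch below is a heuristic under stated assumptions, not the method used in this project: it treats the Unicode replacement character, private-use code points, unassigned code points and lone surrogates as suspect. The function name `find_suspect_chars` is hypothetical.

```python
import unicodedata


def find_suspect_chars(text: str) -> list[tuple[int, str]]:
    """Return (position, code point) pairs for characters that look corrupted.

    Assumption: U+FFFD and the categories Co (private use), Cn (unassigned)
    and Cs (surrogate) never belong in clean Japanese or English text.
    """
    suspects = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in {"Co", "Cn", "Cs"}:
            suspects.append((index, f"U+{ord(ch):04X}"))
    return suspects


print(find_suspect_chars("医\ufffd学\ue000"))
```

A check like this cannot detect a valid character that is simply the wrong one, which is the typical OCR error for similar-looking Kanji. That is why review by a native reader remains necessary.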

The challenge

Having analyzed the file, the Palex DTP team qualified it as “highly complicated”:

The medical content has to be treated carefully to avoid any errors

Columns and an Asian language require additional time for OCR and for formatting the file

Automated OCR with regular software fails, which means manual work on the hieroglyphs

No Japanese DTP resources are available within the requested budget

High quality risk

Medical content

Columns and mix of languages

Asian language

Two days' turnaround time

Limited budget

Failed automated OCR of hieroglyphs

Non-native DTP team

The solution

Pre-production

The team

The team is the key factor that affects the time and quality of this project.

Native Japanese DTP&QA team

All the steps are outsourced

Lower language quality risks

Risk of turnaround time (TAT) failure, formatting issues due to untested quality, an unprofitable result, and limited time to find the resources

Palex non-native DTP&QA team

All the steps are in-house

Lower formatting and budget risks

High language quality risks and TAT risks due to heavy manual work

💡 HYBRID SOLUTION

Palex DTP&QA team + native Japanese QA

In-house OCR, outsourced QA by a native Japanese linguist + in-house final DTP&QA

Lower language quality risks
Lower formatting and budget risks

Risk of TAT failure due to manual work, but it is not very high

The Project Manager always analyzes the risks, pros and cons of every workflow. Both single-team options were too risky, which is how the third variant came about.

Production

Step 1

Automated and Manual OCR

Palex DTP engineer
6+ hours of work

Automated OCR followed by manual recognition. Difficult hieroglyphs and formatting were skipped at this stage.

Step 2

Native QA

Native Japanese linguist
2+ hours of work
700+ corrections on 14 pages

Adding the missing hieroglyphs and checking the rest of the recognized content.

Step 3

Final DTP and QA

Palex DTP engineer & QA specialist
6+ hours of work
100+ comments

Implementation of the linguist's corrections, followed by a quality check and formatting.

The results

2 days' turnaround time

13 hours of DTP work

2 hours of native checks

5 hours of non-native checks

100% fit in budget

Client benefits

01 An excellent-quality source file for translation, recreated from an uneditable PDF as an editable Word file as part of a cost-effective solution

02 Complex Asian OCR added to the client's list of services

03 Strong expert reputation and customer loyalty (with the help of Palex)

04 Old manuscripts revitalized for effective worldwide use

Overall, the work on the file took 1+ hour per page. It is hardly the fastest result, but we are still proud of it. We learned our lessons. The native Asian DTP team was successfully tested, and we are ready for new challenges!
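The per-page figure can be checked against the step durations reported in the Production section. The short calculation below is a sketch that treats the "6+", "2+" and "700+" values as lower bounds, so the results are minimums rather than exact measurements. The dictionary labels are descriptive placeholders.

```python
PAGES = 14

# Lower-bound hours per production step, as reported in the case study.
STEP_HOURS = {
    "step 1: automated and manual OCR": 6,
    "step 2: native QA": 2,
    "step 3: corrections, quality check and formatting": 6,
}
NATIVE_CORRECTIONS = 700  # reported as "700+ corrections on 14 pages"

total_hours = sum(STEP_HOURS.values())
hours_per_page = total_hours / PAGES
corrections_per_page = NATIVE_CORRECTIONS / PAGES

print(f"At least {total_hours} h in total, {hours_per_page:.1f} h per page")
print(f"At least {corrections_per_page:.0f} native corrections per page")
```

At these lower bounds the three steps add up to one hour per page, consistent with the "1+ hour per page" stated above, and the native linguist made at least 50 corrections on an average page.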

Expert says

Some languages, such as hieroglyphic and right-to-left ones, require more time for the Desktop Publishing Preparation and Desktop Publishing steps. When the task is to process these languages as source or target, you need to thoroughly review the content and layout to estimate risks, turnaround time and budget.

The Palex team of Project Managers, Desktop Publishing Engineers and Quality Control Specialists has a deep understanding of the most complicated localization engineering tasks, extensive experience and some magic tricks.

We take care of the most difficult files that are rejected by other teams and that would seem impossible to process within a short time frame.

Denis Sergeev

Localization Engineering Team Leader

Why PALEX?

Localization Engineering Services

We know everything about recent Desktop Publishing and Multimedia trends and technologies.

We support our clients on their way to conquer the world. We help localize all types of materials, from simple Microsoft Office files to non-editable PDF files and complex e-Learning courses with video, interactive tasks, subtitles, voice-over and questionnaires.

Subject Expertise

Palex has expertise and a dedicated multilingual team for about 50 markets and is ready to support your international expansion.

Reliable Partner

Crystal-clear reputation

Experienced player on the market (since 2002)

Deep understanding of localization solutions

Localization Engineering and QA departments

$3M-liability insurance

Talented team committed to supporting the client’s mission

ISO 17100 and ISO 9001 certified
