This module converts text in images and the content of PDF, Word, Excel and PowerPoint documents into plain text and saves it in a field. For an image, the text is saved in the title and alt of the image field; for a document file, it is saved in the description. You can also map a textarea field to store the extracted text, which helps a lot when searching with Views.
You can add a text field and hide it in the form. When you upload a file, the converted text is written to the mapped form field.
Key features and benefits of this module include:
- Text extraction from common image formats such as JPG, PNG and TIFF, as well as PDF documents and office files such as doc, docx, xls, xlsx, ppt and pptx. The extracted text can be stored and manipulated within Drupal.
- Integration with Views for searching and filtering content based on text extracted from images. No need for external OCR services.
- Bulk update of the title and alt of existing files with Views and Views Bulk Operations (VBO).
- Support for multiple languages using available OCR engines like Tesseract.
- A robust set of APIs and hooks to leverage OCR capabilities throughout the site.
How to Get Started
Install Tesseract in your environment.
Example on Ubuntu:
sudo apt-get install tesseract-ocr
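After installing, it can help to confirm that the binary is reachable and to see which OCR languages are available. The snippet below is a minimal sketch; the function name `ocr_check` is made up for illustration, while `--version` and `--list-langs` are standard Tesseract command-line options.

```shell
# Report whether the tesseract binary is reachable and which languages it has.
# ocr_check is a hypothetical helper name, not part of the module.
ocr_check() {
  if command -v tesseract >/dev/null 2>&1; then
    # First line of the version banner.
    tesseract --version 2>&1 | head -n 1
    # Installed language data, one code per line (eng, fra, ...).
    tesseract --list-langs 2>&1 || true
  else
    echo "tesseract not found on PATH"
  fi
}

ocr_check

# Extra languages come as separate packages on Ubuntu, for example French:
# sudo apt-get install tesseract-ocr-fra
```

The language codes listed here (such as eng) are the values to use for the language setting and the $language argument.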
The main purpose of this module is to read the content of an image or document and assign it to the title, alt or description of the field, or to a specified text field.
If you use a file field, the module can also read the contents of uploaded files: images, PDF and office files (doc, docx, xls, xlsx, ...).
How it works
- Set up permissions so that PHP can run Tesseract.
- Install the "OCR Image" module with Composer, which also pulls in its dependencies.
- Add an image or file field.
- With the image field:
  - Turn on "Enable Alt field" and "Enable Title field".
  - Under "Manage form display", select the "OCR image" widget.
- With the file field:
  - Turn on "Enable Description field".
  - Under "Manage form display", select the "OCR / parser file" widget.
- In the widget settings, select your language and the text limit (set 0 for full text).
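The install step above can be sketched as follows. This assumes the project's machine name is `ocr_image` (inferred from the service IDs used on this page, so please check it against the project page) and that Drush is available. The function name `install_ocr_image` is hypothetical.

```shell
# Assumed machine name, inferred from the service IDs (ocr_image.OcrImage).
MODULE="ocr_image"

# Hypothetical helper: download the module with Composer, then enable it.
install_ocr_image() {
  composer require "drupal/${MODULE}"
  drush en "${MODULE}" -y
}

# Run from the Drupal project root:
# install_ocr_image
```

The field and widget settings in the remaining steps are configured in the Drupal admin UI.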
Use with services
This module provides services that can be used by your own module.
For example, to parse the text of a document (pdf, doc, excel, powerpoint, ...):
// $file_path is the path or URL of the document to parse.
$doc_parser_service = \Drupal::service('ocr_image.DocParser');
// Arguments: file path, language, text limit (0 for full text).
$line_text_array = $doc_parser_service->getText($file_path, 'eng', 0);
For example, to OCR an image:
$file_path = 'https://example.com/photo.jpg';
$ocr_image_service = \Drupal::service('ocr_image.OcrImage');
// Arguments: file path, language, text limit.
$image_text_array = $ocr_image_service->getText($file_path, 'eng', 500);
This returns an array with the following keys: full_text (everything, as it appears on the image), title (only the first line), alt (everything but the first line) and array (one line of text per value).
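To make the relationship between these keys concrete, here is a small shell sketch that splits a block of text the same way: first line as title, the rest as alt. It is an illustration only and not the module's code. The helper name `split_ocr_text` and the sample text are made up.

```shell
# Sketch only: mimics how the returned keys relate, using plain shell tools.
split_ocr_text() {
  full_text="$1"
  # title: only the first line
  title="$(printf '%s\n' "$full_text" | head -n 1)"
  # alt: everything but the first line
  alt="$(printf '%s\n' "$full_text" | tail -n +2)"
}

# Sample text standing in for OCR output (made up for illustration).
split_ocr_text "$(printf 'Invoice 2024\nAcme Corp\nTotal: 42 EUR')"
printf 'title: %s\n' "$title"
printf 'alt: %s\n' "$alt"

# With Tesseract installed and a local image (hypothetical file name):
# split_ocr_text "$(tesseract photo.jpg stdout -l eng)"
```

A short first line makes a reasonable title, while the remaining lines carry the detail that suits alt text.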
Update all existing images
This requires the Views Bulk Operations module.
Optionally add a text field to the entity that your image field belongs to.
Go to the "Manage Form Display" tab for the entity with the image field.
Change the widget to OCR Image. Configure the widget as desired.
Create a view that lists the entities with the image field. Add the Bulk Operations field.
Save the view.
Now use it to select all your entities and choose the "Update empty image text (Image OCR)" action.
Do you like this module? Show your appreciation by buying me ☕.
Project information
- Project categories: Media
- Ecosystem: Bootstrap 5 admin
26 sites report using this module
- Created by lazzyvn
Stable releases for this project are covered by the security advisory policy.
