Objective : The objective of this documentation is to provide a step-by-step guide to extract data from a PDF file and save it into a Google Sheet using Google Apps Script. The process involves creating an API that converts a PDF file into a Google Doc file and then extracts the data using OCR (Optical Character Recognition).
Step 1: Create an API that Converts a PDF File into a Google Doc and Extracts Data Using OCR
The following Google Apps Script code snippet performs the conversion and extraction :
function readPDFUsingOCR(pdfFileId) {
var file = DriveApp.getFileById(pdfFileId);
// Set the OCR options
var resource = {
title: file.getName().replace('.pdf', ''),
mimeType: MimeType.GOOGLE_DOCS
};
// Use the Drive API to perform OCR on the file
var ocrOptions = {
ocr: true,
ocrLanguage: 'en'
};
// Convert the file
var convertedFile = Drive.Files.copy(resource, pdfFileId, ocrOptions);
// Retrieve the text from the Google Doc
var docId = convertedFile.id;
var doc = DocumentApp.openById(docId);
var text = doc.getBody().getText();
var fileUrl = 'https://docs.google.com/document/d/' + docId;
// Extract data points from the text
var data = extractDataPoints(text);
DriveApp.getFileById(docId).setTrashed(true);
return {
data : {
"Invoice Date" : data.invoiceDate,
"Invoice Number" : data.invoiceNumber,
"Description" : data.description,
"Google Docs URL": fileUrl
}
}
}
function extractDataPoints(text) {
var data = {};
// Regular expressions to match the desired fields
var invoiceDateMatch = text.match(/Invoice Date\s+(\d+\s\w+\s\d+)/);
var invoiceNumberMatch = text.match(/Invoice Number\s+([A-Z0-9-]+)/);
var descriptionBlockMatch = text.match(/IDR\s+([\s\S]*?)\s+Subtotal/);
data.invoiceDate = invoiceDateMatch ? invoiceDateMatch[1] : '';
data.invoiceNumber = invoiceNumberMatch ? invoiceNumberMatch[1] : '';
data.description = descriptionBlockMatch ? descriptionBlockMatch[1].trim() : '';
return data;
}
function doGet(e) {
var pdfFileId = e.parameter.pdfFileId;
var result = readPDFUsingOCR(pdfFileId);
Logger.log(result); // Log the JSON response for debugging
return ContentService
.createTextOutput(JSON.stringify(result))
.setMimeType(ContentService.MimeType.JSON);
}
Step 2: Deploy the App Script Code as a Web App and use the Web App URL as a GET Request Endpoint
Step 3. Create a new workflow in Sheetgo - and select the option to import data from external services by sending request. Look for the option highlighted in the image below
**
Step 4. You’ll find a form that is required to be filled to call the API**
To call the API and pass the file ID as a parameter, use the following URL format:
https://script.google.com/macros/s/WEB_APP_URL/dev?pdfFileId=YOUR_PDF_FILE_ID
Replace WEB_APP_URL
with the URL you received during the deployment and YOUR_PDF_FILE_ID with the ID of the PDF file you want to process.
Generate Your Apps Script Web App Access Token, you can use the following code snippet :
Add the fetchData()
function to your script. Check the execution log to find the generated access token.
function fetchData() {
console.log(ScriptApp.getOAuthToken())
var url = "https://script.google.com/a/macros/sheetgo.com/s/WEB_APP_URL/dev";
var options = {
"method": "get",
"muteHttpExceptions": true,
"headers": {
"Authorization": "Bearer " + ScriptApp.getOAuthToken()
}
};
var response = UrlFetchApp.fetch(url, options);
console.log(response)
var responseCode = response.getResponseCode();
console.log(responseCode)
console.log(JSON.parse(response.getContentText()))
}
In the “Headers (JSON)” field, enter the following:
{
"Authorization": "Bearer YOUR_ACCESS_TOKEN"
}
Replace YOUR_ACCESS_TOKEN
with the token you generated.
Step 5. Destination Sheet : Create a new Google Sheet or choose an existing one where the extracted data will be stored.
So, to conclude, this documentation provides a detailed guide to creating an API for extracting data from a PDF and writing that data to a sheet. The key steps involve creating the API, deploying it as a web app, and setting up a connection to process the data and write it to the sheet.