A long time ago I had the idea to get rid of all the paper documents stored in my office. Apparently, here in Germany that is not possible for all documents, which is good on the one hand but bad on the other... (privacy vs. environment)

So that you get an idea of how much stuff is piling up, you can see the folders full of documents that I may need some day. That is why I keep them organized, but this needs a lot of space.

To be honest, I have rarely had to look something up, but it does happen sometimes.

So I want a setup where I have the documents in digital form, not on some platform but on my own NAS (network-attached storage). The originals should then be stored in chronological order in one big pile (in case someone needs the original).

There are professional solutions for this use case, called DMS (document management systems). But they are either expensive (for private use) or the data is hosted somewhere else.

I'll show you how to set up such a system in a really simple way.

Supplies

One old scanner

One Raspberry Pi

A display (preferably with touch)

Step 1: The System Design

The whole system consists of a scanner to digitize the documents and client software that handles the OCR and everything else needed to search across all documents.

- Use a Raspberry Pi with a scanner to get a digital twin of the documents

- Stitch several pages together

- Run OCR over the result

- Store the PDFs on my NAS

- Store the originals in a document box (to throw them away in several years)

Step 2: Software

The software is really simple and was just hacked together.

It mounts the NAS:

os.system('sudo mount -t cifs //where/ever /home/pi/DRIVE/share')
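A slightly more defensive variant of the mount step could look like the sketch below. It is not the code from the attached script. The helper names build_mount_command and ensure_mounted are made up for illustration, and mount options such as credentials are left out, just as in the line above.

```python
import os
import subprocess


def build_mount_command(share, mount_point):
    """Return the argument list for mounting a CIFS share (hypothetical helper)."""
    return ["sudo", "mount", "-t", "cifs", share, mount_point]


def ensure_mounted(share, mount_point):
    """Mount the share unless something is already mounted at mount_point.

    Returns True if a mount was performed, False if it was already mounted.
    """
    if os.path.ismount(mount_point):
        return False  # already mounted, nothing to do
    os.makedirs(mount_point, exist_ok=True)
    # check=True raises CalledProcessError if the mount command fails,
    # instead of silently continuing as os.system() would.
    subprocess.run(build_mount_command(share, mount_point), check=True)
    return True
```

Calling ensure_mounted('//where/ever', '/home/pi/DRIVE/share') at startup would then avoid a second mount attempt when the share is already available.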

It scans one page:

subprocess.check_output(['scanadf', '--device-name', self.deviceName, '--output-file', outputFile, '--resolution', resolution, '--mode', mode, '-e', '1'])
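To make the scan call easier to reuse, the argument list can be built in a small function. This is only a sketch with the same flags as above. The function names and the default values for resolution and mode are assumptions, since the valid values depend on the scanner backend.

```python
import subprocess


def build_scan_command(device_name, output_file, resolution="300",
                       mode="Gray", end_count=1):
    """Build the scanadf argument list (defaults are assumptions)."""
    return [
        "scanadf",
        "--device-name", device_name,
        "--output-file", output_file,
        "--resolution", str(resolution),
        "--mode", mode,
        "-e", str(end_count),  # number of the last page to scan
    ]


def scan_page(device_name, output_file, **options):
    """Run scanadf and return its output; raises CalledProcessError on failure."""
    cmd = build_scan_command(device_name, output_file, **options)
    try:
        # Merge stderr into the captured output so scanner messages are not lost.
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except FileNotFoundError as exc:
        raise RuntimeError("scanadf is not installed or not on PATH") from exc
```

Keeping the command builder separate from the call makes it possible to check the arguments without a scanner attached.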

And it runs the OCR:

p = subprocess.Popen(['ocrmypdf', '-l', 'deu', '--output-type', 'pdfa', '--image-dpi', '300', outputFile, outputFile_ocr])
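Popen returns immediately, so the OCR result is only usable once the process has finished. The sketch below shows one way to wait for it and check the exit code. The function names are hypothetical, and a GUI would probably poll the process with p.poll() instead of blocking.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="deu", image_dpi=300):
    """Build the ocrmypdf argument list with the same flags as above."""
    return [
        "ocrmypdf",
        "-l", language,
        "--output-type", "pdfa",
        "--image-dpi", str(image_dpi),
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf, output_pdf, timeout=None, **options):
    """Start ocrmypdf, wait for it to finish and return the output path."""
    proc = subprocess.Popen(build_ocr_command(input_pdf, output_pdf, **options))
    # wait() blocks until the process exits; a non-zero code means OCR failed.
    returncode = proc.wait(timeout=timeout)
    if returncode != 0:
        raise RuntimeError("ocrmypdf exited with code %d" % returncode)
    return output_pdf
```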

The code is basically just the GUI; the tools "scanadf" and "ocrmypdf" are the workhorses of the whole thing. The hardware setup is even simpler: plug the scanner and a touch display into the Raspberry Pi. Because no input devices were connected, I was not able to take proper screenshots.

scanAndOCR.py

Step 3: Searching for Files

The information from the OCR is stored in the PDF as an additional text layer. This makes it possible to search inside the PDFs themselves. Now I can easily search for the invoice of one of my tools.
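Any PDF viewer or desktop search can use that text layer. As a rough sketch of a search across the whole share, the code below walks a folder and looks for a term in each PDF. It assumes the pdftotext command from poppler-utils is installed, which is not part of my setup above, and the function names are made up for illustration.

```python
import os
import subprocess


def extract_text_with_pdftotext(pdf_path):
    """Return the text layer of a PDF via the pdftotext CLI (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def search_pdfs(root, term, extract_text=extract_text_with_pdftotext):
    """Return sorted paths of all PDFs below root whose text contains term."""
    term = term.lower()
    matches = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(folder, name)
            try:
                text = extract_text(path)
            except subprocess.CalledProcessError:
                continue  # skip PDFs that pdftotext cannot read
            if term in text.lower():
                matches.append(path)
    return sorted(matches)
```

For example, search_pdfs('/home/pi/DRIVE/share', 'invoice') would list every scanned PDF on the mounted share that mentions an invoice.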

The originals are now simply stored in a box.