As part of my participation in the SAPO Summerbits program, I created a media filter for DSpace called OCR4DSpace.
With this media filter you can submit document images (scanned documents) and search their contents without having to type those contents in manually at submission time. The image contents are read by the OCR engines installed on your system.
The media filter is really simple to configure and use, and hopefully it will make some people’s lives easier!
The web page is currently only in Portuguese and will be translated to English soon, but rest assured, the included README is in English.
To check out the code, all you need to do is run:
svn co svn://svn.softwarelivre.sapo.pt/ocrd/trunk/OCR4DSpace
Read the README file carefully and enjoy the automation that Optical Character Recognition can provide!
Portugal needs more events like this one. I hope that next year more companies will join SAPO and Associação Ensino Livre and make a second edition of Summerbits happen, with more projects!
So, today I continued working on my SAPO Summerbits project. It consists of developing an OCR plugin for DSpace.
It’s been a while since I touched Java; the last time I did something with it was for BluePad, which has a much simpler code base.
Anyway, I don’t think I am rusty in Java; it’s just that if you spend a while writing code in Python, there are some things you just take for granted. I mean, taking a look back at the Java API docs (good docs, BTW) just makes me think: “Why the hell so much stuff for such a simple task (say, reading from a text file)!?”
But anyway, it is GOOD to switch programming languages once in a while so you don’t get too stuck in your habits. One of the things I like doing most is *learning*, so I am looking forward to writing more code in C once I get time, and maybe learning C++ as well.
I was selected to be part of the first SAPO Summerbits, an initiative inspired by Google Summer of Code but open only to people studying in Portugal.
I really wanted to take part in Google Summer of Code this year but didn’t apply since I was supposed to be working in Spain. So, when I left Spain I was of course disappointed not to be part of Google SoC.
That said, I am really happy to participate in Summerbits and be part of this great initiative by SAPO.
This first edition of Summerbits consists of only 10 projects, two of them from my university: mine and Paulo’s.
My work will build on DSpace: I will develop a system that performs OCR on a scanned document, retrieves the printed words from it and sets them as the document’s tags. This will hopefully automate and save a lot of work for some people.
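To give a rough idea of that last step, here is a minimal Java sketch of turning OCR output into tags. It assumes the OCR engine has already produced plain text. The class name OcrTags, the method extractTags and the minimum-length cutoff are made up for illustration; none of them come from the actual plugin or from the DSpace API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class OcrTags {

    // Hypothetical helper: reduces raw OCR text to a list of distinct,
    // lowercased words, in order of first appearance.
    public static List<String> extractTags(String ocrText, int minLength) {
        Set<String> seen = new LinkedHashSet<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String token : ocrText.split("[^\\p{L}\\p{Nd}]+")) {
            // Skip very short tokens (and the empty token a leading separator produces).
            if (token.length() >= minLength) {
                seen.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by an OCR engine.
        String sample = "Annual report 2008: the report of the Library.";
        System.out.println(extractTags(sample, 3));
    }
}
```

A real plugin would probably also want to drop common stop words and cope with OCR misreads, but that depends on the engine and the documents, so it is left out here.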
For those who don’t know:
SAPO is the major Portuguese ADSL provider and is considered by some to be the Portuguese Yahoo (the company, not the adjective 🙂 ), with a homepage offering search, media, shopping, information and so on.