This past weekend I participated in my first (and hopefully NOT last) hackathon. The event was hosted and sponsored by the American Museum of Natural History and drew about 150 hackers from across NY and beyond.
The hackers assembled for the first time on Friday, November 20th, when they were introduced to a wide array of dinosaur-related technical challenges that the museum and its researchers are hoping to solve. Challenges ranged from creating educational experiences/games and transcribing various datasets to developing methods of collecting and presenting existing/new data.
I hopped on a team with two employees of NYPL who were also interested in crowdsourcing the transcription of handwritten digital assets. NYPL had previously created a framework called ScribeAPI, which they are using for a similar project involving real estate records for NYC (http://emigrantcity.nypl.org/#/). We met back up Saturday at 3pm and began hacking away on a 24-hour coding binge. Once the museum closed to the public we were allowed to fan out and work in the dinosaur wing. Participants were allowed to sleep over among the exhibits. I chose to curl up underneath the megalodon jaws in the marine exhibit.
The ScribeAPI framework gave us a solid platform to start with, and we were able to quickly scaffold a working prototype that addressed two datasets the museum was looking to transcribe: a card catalog / fossil inventory of about 400,000 cards, and shipping records from the Frick Fossil Collection (about 200,000 cards).
Using our system, users can choose to participate in one of three tasks: identifying regions on the card that contain specific data points (e.g. description areas, catalog number areas, etc.), transcribing data from regions that have been marked, or verifying transcriptions that have conflicting data. Each card gets transcribed by multiple people, and when a configurable consensus (75%) agrees on what is listed the record is marked accordingly. Cards can also be put into a contentious state if a consensus isn’t met within a certain threshold (10 transcriptions).
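To make the consensus rule a bit more concrete, here is a rough Python sketch of how a card's transcriptions might be classified. This is not ScribeAPI's actual code (the real framework is built on Rails), and the function name, status labels, minimum of two transcriptions, and the whitespace/case normalization are all my own assumptions for illustration. Only the 75% consensus and the 10-transcription threshold come from our setup.

```python
from collections import Counter

CONSENSUS_RATIO = 0.75    # configurable consensus level described above
MAX_TRANSCRIPTIONS = 10   # threshold after which a card is contentious
MIN_TRANSCRIPTIONS = 2    # assumption: "multiple people" means at least two


def classify(transcriptions):
    """Return a (status, value) pair for one marked region on a card.

    Hypothetical helper: status is "pending", "complete" or "contentious".
    """
    total = len(transcriptions)
    if total < MIN_TRANSCRIPTIONS:
        return ("pending", None)

    # Assumption: ignore case and surrounding whitespace when comparing.
    normalized = [t.strip().lower() for t in transcriptions]
    value, votes = Counter(normalized).most_common(1)[0]

    if votes / total >= CONSENSUS_RATIO:
        return ("complete", value)
    if total >= MAX_TRANSCRIPTIONS:
        return ("contentious", None)
    return ("pending", None)


if __name__ == "__main__":
    print(classify(["AMNH 5027", "amnh 5027 ", "AMNH 5027", "AMNH 5021"]))
    print(classify(["a"] * 5 + ["b"] * 5))
```

In this sketch a card stays pending until either enough people agree or the transcription limit is reached, at which point it would be handed to the verification task.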
The project exposed me to Ruby on Rails, ReactJS and MongoDB and provided a fun excuse to really dive in headfirst and explore/tinker with the technologies.
Our team wrote additional tools to pre-process the cards and to potentially expose regions to an OCR component (not developed) so that an automated process might provide a good initial pass, though we feel that for most of the data OCR just won’t cut it due to the contextual nature, inconsistent formatting and handwritten nature of the dataset. I wrote the framework of a leaderboard that would publish the main contributors to the project and might help “gamify” the experience to help drive further use / competitiveness.
In the end, teams presented their tools and I was genuinely impressed with what was developed in a 24-hour period. The hackathon represented a total of about 3600 man-hours, and while time will tell how much of our projects will be adopted, I think it's safe to say that from at least an R&D perspective the hackathon was a great success.
While the project only contains sample data, you can check it out at http://crowdsaurus.herokuapp.com and http://frickfossils.herokuapp.com