Ciência Aberta, Questões Abertas/Oficinas/Content Mining

Description

At The Content Mine (contentmine.org) we are creating open source software that enables the crawling and scraping of academic journals, and the extraction of key facts. The software is relatively easy to use and we are running a series of workshops to show people the basic principles of text and data mining, how to use our software and customise it to be relevant to specific journals or disciplines of interest.

While the workshop is aimed at people interested in science and uses scholarly articles, you don’t have to be a practising academic. We prefer a mix of people with a mix of skills, including technical skills, discipline-specific knowledge, and an interest in information extraction.

The workshop will be in English but we can ensure any materials are translated into Portuguese.

Activities to be undertaken

Understand concepts of content mining, including how to identify terms you may wish to extract
Understand the relevant legal issues.
Setting up Content Mine software on a virtual machine (no experience with VMs required).
Implementing existing scrapers to capture information from articles.
How to build a scraper for the journal of your choice.
How to run AMI to extract key facts from articles, including data from images.
Understand how to use regular expressions to identify words and phrases in articles.
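
As an illustration of the last activity, here is a minimal sketch of using regular expressions to pick out words and phrases. It uses Python's standard re module, not the Content Mine software itself; the sample sentence, the two patterns and the extract_facts helper are assumptions made up for this example, and real articles would need more careful patterns.

```python
import re

# Hypothetical sample sentence, invented for this example (not from a real article).
TEXT = (
    "Samples of Escherichia coli and Bacillus subtilis were incubated "
    "at 37 °C for 24 h, then diluted to 0.5 mg/mL."
)

# Binomial species names: a capitalised genus followed by a lowercase
# epithet of at least three letters. This is a rough heuristic and will
# also match ordinary phrases such as "The sample".
SPECIES = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")

# A number (optionally with a decimal part) followed by a unit taken from
# a small fixed list; the lookahead stops "24 h" from matching "24 hours".
QUANTITY = re.compile(r"\b(\d+(?:\.\d+)?)\s*(°C|mg/mL|h)(?![A-Za-z])")


def extract_facts(text):
    """Return the species names and (value, unit) pairs found in text."""
    return {
        "species": SPECIES.findall(text),
        "quantities": QUANTITY.findall(text),
    }


if __name__ == "__main__":
    facts = extract_facts(TEXT)
    print(facts["species"])
    print(facts["quantities"])
```

Keeping each pattern small and named makes it easier to adapt the list of terms to a particular journal or discipline.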

Duration:

4 hrs

Numbers:

15 (although this number can be increased if Peter can have support either from other Content Mine team members, or someone else who has worked through parts of the workshop before)

Equipment

Needed from room:

A projector and whiteboard or other suitable surface, and a means of connecting Peter Murray-Rust's laptop to the projector.
Sufficient power points to allow attendees to plug in their laptops.
If there is no whiteboard, a flip chart would be useful.

Wherever possible, attendees should bring their own laptop and power supply to carry out the activities. Anyone without a laptop may be at a disadvantage, but there will be some hand-based exercises, and all exercises can be carried out in pairs or groups.

We shall provide collaborative online tools such as Etherpads and Google Spreadsheets.

Lead

Dr. Peter Murray-Rust

Reader in Molecular Informatics, University of Cambridge

Shuttleworth Fellow - http://contentmine.org/

contact@contentmine.org