New year, new work and new posts about Bioinformatics, NGS sequencing and Machine Learning!

Friday, March 21, 2014

Hi all,

It has been a while since my last post on the blog. No, I didn't abandon it! So much has happened in my life since November that I decided to pause my posting marathon for a bit and organize my work life. The best news this year is that I am now facing new challenges in machine learning, data mining, big data and, from now on, bioinformatics! That's right: I am now CTO of Genomika Diagnósticos, a Brazilian genetics laboratory in Recife, Pernambuco. The laboratory combines state-of-the-art genetic testing with comprehensive interpretation of test results by specialists and geneticists, providing clinically relevant molecular tests for a variety of genetic disorders and risk factors.

My work there involves NGS (next-generation sequencing) tools to support exome and genome sequencing, analysing genes and exons in panels to detect any significant genetic variations that are candidates to cause the patient's phenotype. There is a lot of work to do, so in the coming weeks I will post some tutorials about bioinformatics, machine learning, parallelism and big data applied to genome sequencing.

This is a novel field of study with many applications in disease detection, prevention and treatment. Can you imagine that sequencing DNA cost more than $10,000 in 2001? Since then, the cost of the procedure has been decreasing exponentially.

My next posts will talk about how DNA sequencing works and how machine learning and data mining can be applied in this exciting and promising field!


Marcel Caraciolo

Review of the new book: Practical Data Analysis

Tuesday, November 26, 2013

Hi all,

I was invited by the PacktPub team to review the book Practical Data Analysis by Hector Cuesta. I have been reading it over the last two weeks and I really enjoyed the topics covered and the book's approach.

The book goes through hot data science topics by presenting several practical examples of data exploration, analysis and even some machine learning techniques. Python as the main platform for the sample code was a perfect choice, in my opinion, since Python has become increasingly popular in the scientific community.

The book brings examples from several fields of study, such as stock prices, sentiment analysis on social networks, biology modelling scenarios, social graphs, MapReduce, text classification and data visualisation. Many useful libraries and tools are presented, including NumPy, SciPy, PIL, Pandas, NLTK, IPython and Wakari (I really liked that a dedicated chapter was given to this excellent on-line environment for scientific Python). It also covers NoSQL databases such as MongoDB and visualisation libraries like D3.js.

I believe the biggest value proposition of this book is that it brings together in one place several tools and shows how they can be applied to data science. Many of the tools mentioned lack further examples or documentation, so this book can assist any data scientist with that task.

However, the reader must not expect to learn machine learning and data science theory from this book. The theory is scarce, and I believe that was not the author's main goal. For anyone looking to learn data science from scratch, this is not the right book. But for anyone who wants an extra resource for sample code and inspiration, it will be a great pick!

The source code is available on GitHub, but it is better explained inside the book, with illustrations. To sum up, I have to congratulate Hector for his effort in writing this book. The scientific community, including the Python group, will really enjoy it! I did miss more material about installing the scientific software stack, since for beginners that can be really painful. But overall, it is well written and focused on practical problems: a guide for any scientist.

For me the best chapters were Chapter 6, Simulation of Stock Prices (the visualisation using D3.js was great), and the last chapter, 14, about On-line Data Analysis with IPython and Wakari. It's the first time I have seen Wakari covered in a book! Everyone who works with scientific Python today should give this on-line tool a try some day. It's awesome!

Congratulations to PacktPub and Hector for the book!


Marcel Caraciolo

Non-Personalized Recommender systems with Pandas and Python

Tuesday, October 22, 2013

Hi all,

At the last PythonBrasil I gave a tutorial about Python and data analysis focused on recommender systems, the main topic I've been studying in recent years. There is a popular Python package among statisticians and data scientists called Pandas. I had watched several talks and keynotes about it, but had never tried it myself. The tutorial gave me that chance, and afterwards the audience and I felt quite excited about the potential and power this library offers.

This post starts a series of articles that I will write about recommender systems, and it even serves as an introduction to the refreshed library that I am working on: Crab, a Python library for building recommender systems. :)

This post covers the first topic of the theme, non-personalized recommender systems, with several examples using the Python package Pandas. In the future I will also post an alternative version of this article referencing Crab, showing how the same examples work with it.

But first let's introduce what Pandas is.

Introduction to Pandas

Pandas is a data analysis library for Python that is great for data preparation, joining and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For further introduction about pandas, check this website or this notebook.
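To make this concrete, here is a minimal sketch of the kind of data preparation Pandas excels at, using a tiny made-up table of course ratings (the data and column names are just illustrative):

```python
import pandas as pd

# Two small toy tables: individual ratings and course metadata.
ratings = pd.DataFrame({
    'course_id': [1, 1, 2, 2, 2],
    'rating':    [5, 4, 3, 5, 4],
})
courses = pd.DataFrame({
    'course_id': [1, 2],
    'title':     ['Machine Learning', 'Data Analysis'],
})

# Join the tables and aggregate: well-formed tabular data in a few lines.
merged = ratings.merge(courses, on='course_id')
summary = merged.groupby('title')['rating'].mean()
print(summary)
```

The `merge` plus `groupby` pattern shown here is the backbone of most of the analyses in the rest of this post.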

Non-personalized Recommenders

Non-personalized recommenders can recommend items to consumers based on what other consumers have said about the items on average. That is, the recommendations are independent of the customer, so each customer gets the same recommendation. For example, if you go to Amazon as an anonymous user, it shows items that are currently being viewed by other members.

Generally the recommendations come in two flavours: predictions or recommendations. Predictions are simple statements presented in the form of scores, stars or counts. Recommendations, on the other hand, are generally just a list of items shown without any number associated with them.

Let's walk through an example:

Simple Prediction using Average

The score on a scale of 1 to 5 for the book Programming Collective Intelligence was 4.5 stars out of 5.
This is an example of a simple prediction: it displays a simple average of other customers' reviews of the book.
The math behind it is quite simple:

Score = (65 * 5 + 18 * 4 + 7 * 3 + 4 * 2 + 2 * 1) / (65 + 18 + 7 + 4 + 2)
Score = 428 / 96
Score ≈ 4.46 ≈ 4.5 out of 5 stars
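The same weighted average can be computed in a few lines of Python, using the star counts from the example above:

```python
# Star counts from the example: 65 five-star reviews, 18 four-star, and so on.
star_counts = {5: 65, 4: 18, 3: 7, 2: 4, 1: 2}

total_reviews = sum(star_counts.values())                                  # 96
weighted_sum = sum(stars * count for stars, count in star_counts.items())  # 428

score = weighted_sum / total_reviews
print(round(score, 2))  # 4.46, displayed as 4.5 stars
```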

The same page also displays the other books that customers bought after buying Programming Collective Intelligence: a list of recommended books presented to anyone who visits the product's page. This is an example of a recommendation.

But how did Amazon come up with those recommendations? There are several techniques that could be applied. One is association rule mining, a data mining technique that generates a set of rules over combinations of items that were bought together. Or it could be a simple measure based on the proportion of customers who bought both x and y among those who bought x. Let's explain using some math:

Let X be the book Programming Collective Intelligence and consider the customers who purchased it. For each other book Y they purchased, compute the ratio given below, sort the books by it in descending order, and finally pick the top K books to show as related. :D

Score(X, Y) =  Total Customers who purchased X and Y / Total Customers who purchased X

Using this simple score function for all the books you will get:

Python for Data Analysis — 100%
Startup Playbook — 100%
MongoDB: The Definitive Guide — 0%
Machine Learning for Hackers — 0%

As we imagined, the book Python for Data Analysis makes perfect sense. But why did the book Startup Playbook come out on top, when it is also purchased by many customers who have not purchased Programming Collective Intelligence? This is a famous trap in e-commerce applications called the banana trap. Let's explain: in a grocery store, most customers will buy bananas. If someone buys a razor and a banana, you cannot conclude that the purchase of the razor influenced the purchase of the banana. Hence we need to adjust the math to handle this case as well. The modified version:

Score(X, Y) =  (Total Customers who purchased X and Y / Total Customers who purchased X) / 
         (Total Customers who did not purchase X but got Y / Total Customers who did not purchase X)

Substituting the numbers we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1/3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1

The denominator acts as a normalizer, and you can see that Python for Data Analysis clearly stands out. Interesting, isn't it?
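Both score functions are easy to sketch in Python. The purchase data below is hypothetical, constructed to match the numbers in the text: two customers bought Programming Collective Intelligence (PCI), and three did not.

```python
# Toy purchase data (hypothetical customers A-E).
# PCI = Programming Collective Intelligence, PDA = Python for Data Analysis,
# SP = Startup Playbook.
purchases = {
    'A': {'PCI', 'PDA', 'SP'},
    'B': {'PCI', 'PDA', 'SP'},
    'C': {'PDA', 'SP'},
    'D': {'SP'},
    'E': {'SP'},
}

def naive_score(x, y):
    """Fraction of buyers of x who also bought y."""
    bought_x = [c for c, items in purchases.items() if x in items]
    return sum(1 for c in bought_x if y in purchases[c]) / len(bought_x)

def adjusted_score(x, y):
    """Naive score normalized by how often non-buyers of x bought y."""
    not_x = [c for c, items in purchases.items() if x not in items]
    baseline = sum(1 for c in not_x if y in purchases[c]) / len(not_x)
    return naive_score(x, y) / baseline

print(naive_score('PCI', 'PDA'))     # 1.0  (100%)
print(naive_score('PCI', 'SP'))      # 1.0  (100%)
print(adjusted_score('PCI', 'PDA'))  # 3.0
print(adjusted_score('PCI', 'SP'))   # 1.0
```

The naive score cannot tell the two books apart, while the adjusted score filters out the "banana" effect of Startup Playbook's general popularity.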

In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed for ranking professors. :)

Examples with real dataset (let's play with CourseTalk dataset)

To present non-personalized recommenders, let's play with some data. I decided to crawl data from Course Talk, a popular ranking site for MOOCs. It is an aggregator of several MOOCs where people can rate the courses and write reviews. The dataset is a mirror from 10/11/2013 and is only used here for study purposes.

Let's use Pandas to read all the data, show what we can do with Python, and present a list of top courses ranked by some non-personalized metrics :)
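As a sketch of the approach, here is how a non-personalized ranking could be built with Pandas. The sample rows and column names below are hypothetical stand-ins for the crawled CourseTalk data; the real dataset's schema may differ.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample of the crawled data, inlined as CSV for illustration.
csv_data = StringIO("""course,provider,rating
Machine Learning,Coursera,5
Machine Learning,Coursera,4
Intro to CS,Udacity,5
Intro to CS,Udacity,5
Gamification,Coursera,3
""")

df = pd.read_csv(csv_data)

# Non-personalized metric: mean rating and number of ratings per course,
# sorted so the best-rated (and most-rated) courses come first.
ranking = (df.groupby('course')['rating']
             .agg(['mean', 'count'])
             .sort_values(['mean', 'count'], ascending=False))
print(ranking)
```

With the real dataset, `pd.read_csv` would point at the crawled file instead of an in-memory string, but the groupby-and-rank step stays the same.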

Update: for better analysis, I hosted all the code in an IPython Notebook at the following link, using nbviewer.

All the data and source code will be provided on Crab's GitHub; the idea is to build on those notebooks towards a future book about recommender systems :)

I hope you enjoyed this article. Stay tuned for the next one about another type of non-personalized recommender: ranking algorithms for vote up/vote down systems!

Special thanks to Diego Manillof for the tutorial :)


Marcel Caraciolo

Ruby in the world of Recommendations (also machine learning, statistics and visualizations)

Tuesday, September 17, 2013

Hello everyone!

I am back with lots of news and articles! I've been quite busy, but I have returned. In this post I'd like to share the presentation I gave at the Frevo On'Rails Pernambuco Ruby Meeting in Recife, PE. My Ruby developer colleagues around Recife invited me to give a lecture there. I was quite excited about the invitation and instantly accepted.

I decided to research scientific computing with Ruby and recommender libraries written for Ruby as well. Unfortunately the ecosystem for scientific computing in Ruby is still at a very early stage. I found several libraries, but most of them had been abandoned by their developers or committers. But I didn't give up and decided to refine my searches. I found a spectacular and promising project for scientific computing with Ruby called SciRuby. Its goal is to port several matrix and vector representations to Ruby, using C and C++ as the backend. It reminded me a lot of the early days of NumPy and SciPy :)

As for recommenders, I didn't find any work as deep as Mahout, but I found a library called Recommendable that uses memory-based collaborative filtering. I really liked the design of the library and the developer's approach of performing the linear algebra operations with Redis instead of pure Ruby :D

I put all those considerations and more insights in my slides; feel free to share :)

I hope you enjoyed it, and even though I love Python, I really like programming in other languages :)



Slides and Video from my talk about Big Data with Python in Portuguese

Thursday, July 4, 2013

Hi all,

It has been a while since my last technical posts, but it's for a great cause! I am currently writing a book about recommender systems, and it is taking some dedicated time! But great posts are coming to the blog.

I'd like to publish an on-line talk that I gave on June 18 at #MutiraoPython, a great initiative that I started with my startup PyCursos, an on-line school for teaching Python and its applications. It's like Coursera MOOCs for programming, in Portuguese! This talk was part of a series of keynotes happening every week, on-line and for free, using Hangouts on Air!

I gave a lecture of about two hours about Big Data with Python and presented some tools used for data analysis. I know I didn't explore Pandas, IPython and scikit-learn as much as I could have; instead, I decided to cover the Hadoop architecture, the MapReduce paradigm and some code examples with Python.

It's in Portuguese, but all the content is available for free! I hope you enjoy it!

Video and code