Introduction to large scale data analytics

PyConES 2014, Zaragoza, Spain

Sun 09 November 2014

Abstract

NumPy, Pandas and Matplotlib, among other libraries, have revolutionized data processing, munging and visualization in the Python ecosystem. But what happens when our dataset is too big to fit in memory? We could use a database like Postgres or MongoDB, store it on disk with PyTables or BColz, or use a distributed system like Hadoop or Spark. Each of those options has its own pros and cons. Learning each of those tools is time-consuming, and that time would be better spent gaining insights than worrying about how to spell the computation we need. Blaze provides a common interface to a variety of backends, along with useful tools for data processing, transformation and conversion. When dealing with large datasets, we also need libraries that can actually help us display them. Bokeh is a Python visualization library that targets the browser and includes interesting features like abstract rendering and server-side down-sampling. This talk will introduce large scale data analytics and visualization, and focus on how Blaze and Bokeh can help us. For more information, visit http://bokeh.pydata.org/ and http://blaze.pydata.org/
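To make the "too big to fit in memory" problem concrete, here is a minimal sketch of out-of-core processing using only the Python standard library. It is not taken from the talk and does not use Blaze or Bokeh. The function name `chunked_mean`, the column name `value` and the sample data are all hypothetical, chosen for illustration. The idea is to stream the data in fixed-size chunks and keep only running totals, so memory use is bounded by the chunk size rather than the dataset size.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_size=1000):
    """Mean of one CSV column, holding at most chunk_size rows in memory.

    Hypothetical helper for illustration; `lines` is any iterable of
    CSV text lines, such as an open file object.
    """
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    while True:
        # Pull the next chunk of rows; earlier chunks are already discarded.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Small in-memory stand-in for a file that would not fit in memory.
sample = io.StringIO("value\n1\n2\n3\n4\n")
result = chunked_mean(sample, "value", chunk_size=2)
print(result)
```

The same pattern (read a chunk, update an aggregate, discard the chunk) is what on-disk and distributed backends automate at a much larger scale.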

Slides

Pictures