Friday, June 27, 2014

Becoming a Data Analyst/Data Scientist?

I'm looking into the possibility of changing my career focus slightly and the idea of becoming a Data Analyst is somewhat intriguing. Here's what I've found that might help to get started down this path:

Required Skills

  1. Microsoft Excel
  2. Programming Skills
    Tools to manipulate data (programming ability, SQL, statistics tools, etc.)
    - Python, Ruby, or another similar programming language
    - R, Stata, SAS, or some other statistical programming language for analyzing data
    - SQL or a similar querying/manipulation language, with an understanding of fairly complex joins, nested queries, etc.
    - Hive, Hadoop, etc. are really useful, albeit not essential for getting hired
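
To give a rough feel for what "joins" and "nested queries" look like in practice, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all made up for illustration; they don't come from any of the resources below, and a real job would involve much larger and messier data.

```python
import sqlite3

# Hypothetical in-memory tables, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0);
""")


def totals_per_customer(conn):
    """Join: total order amount per customer.

    LEFT JOIN keeps customers who have no orders; COALESCE turns
    their missing total into 0.
    """
    return conn.execute("""
        SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """).fetchall()


def above_average_orders(conn):
    """Nested query: orders larger than the average order amount."""
    return conn.execute("""
        SELECT id, amount
        FROM orders
        WHERE amount > (SELECT AVG(amount) FROM orders)
        ORDER BY id
    """).fetchall()


print(totals_per_customer(conn))
print(above_average_orders(conn))
```

The same two queries would run with little or no change on most relational databases; SQLite is used here only because it ships with Python and needs no setup.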

Resources
Here are some books I've found recommended in various locations. I can't speak to their quality/usefulness, but others have found them helpful. I've added them here for future reference and your convenience:


  • Data Mining Techniques by Michael Berry and Gordon Linoff
    It starts off by defining data mining in the current business context and then summarizes some of the best practices in data mining.
  • Data Mining Cookbook by Olivia Parr Rud
    It lists out several best practices that any good analyst would swear by.
  • Competing on Analytics by Thomas Davenport
    This book does not deal with any statistical equations or complex algorithms. The book, instead, describes how some of the leading companies in the world are using analytics to out-smart their competition. 



Quick Reference

    Q: What is SAS? 
    A: SAS is an integrated system of software solutions that enables you to perform the following tasks: data entry, retrieval, and management; report writing and graphics design; and statistical and mathematical analysis.

    Q: What is R?
    A: R is a free programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years.

    Q: What is SQL?
    A: SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.

    Q: What is Python?
    A: Python is a high-level general-purpose programming language. Find out more information about Python here.

    Q: What is Ruby?
    A: Ruby is a dynamic, reflective, object-oriented, general-purpose programming language. Find out more information about Ruby here.

    Q: What is Hadoop?
    A: Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.

    Q: What is Hive?
    A: Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Other, More In-depth Articles
    This article talks about the importance of the field and the high demand for people who work in it. The pay and benefits are worth consideration.

    This article gives an in-depth curriculum covering the steps to becoming a Data Analyst.

    Finally, this article, written by someone working for a fairly large company, gives lots of good detail about the suggested path to becoming a Data Analyst.