September 25, 2017

How to convert milliseconds or seconds into date format in Presto?

Milliseconds:
DATE_FORMAT(FROM_UNIXTIME(column_name /1000),'%Y-%m-%d')
Seconds:
DATE_FORMAT(FROM_UNIXTIME(column_name),'%Y-%m-%d')

Note that the division by 1000 is needed only when the column stores milliseconds, because FROM_UNIXTIME expects seconds.
Suppose the column "purchased_date_epoch" is stored in numeric format, and we want to convert its value "1442287036" to a human-readable format.


SELECT purchased_date_epoch FROM table
return: 1442287036
SELECT DATE_FORMAT(FROM_UNIXTIME(purchased_date_epoch),'%Y-%m-%d %T') FROM table
return: 2015-09-15 03:17:16
SELECT DATE_FORMAT(FROM_UNIXTIME(purchased_date_epoch),'%Y-%m-%d') FROM table
return: 2015-09-15
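
To sanity-check the conversion outside Presto, here is a minimal Python sketch of the same arithmetic. The helper name format_epoch and its unit parameter are made up for this illustration, and the sketch assumes the timestamps should be read as UTC, which matches the output shown above.

```python
from datetime import datetime, timezone

def format_epoch(value, unit="s", fmt="%Y-%m-%d %H:%M:%S"):
    """Format a Unix epoch value (seconds or milliseconds) as a UTC string.

    `format_epoch` and `unit` are illustrative names, not part of Presto.
    """
    # Milliseconds must be divided by 1000 first, mirroring column_name / 1000.
    seconds = value / 1000 if unit == "ms" else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)

print(format_epoch(1442287036))                                # 2015-09-15 03:17:16
print(format_epoch(1442287036000, unit="ms", fmt="%Y-%m-%d"))  # 2015-09-15
```

If the Presto session uses a time zone other than UTC, the Presto output may differ from this sketch by the zone offset.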



September 24, 2017

How to perform two-sample one-tailed t-test in Python

In Python, we can use ttest_ind from SciPy to perform a two-sample one-tailed test. Assume that our hypotheses are:
H0 (Null Hypothesis): P1 >= P2
Ha (Alternative Hypothesis): P1 < P2

In this case, we have a first normal distribution with mean 3 and standard deviation 2, sampled with 400 data points. The second normal distribution has mean 6 but the same sigma and sample size as the first.




How can we interpret the results?

According to Stat Trek, when the null hypothesis is 6 >= 3, the t-score should equal 21.2, with 798 degrees of freedom and a standard error (SE) of 0.1414. The Stat Trek calculator gives us a p-value of 1.
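
The Stat Trek figures can be reproduced by hand. The sketch below assumes the population values from this post (means 6 and 3, sigma 2, 400 points per sample) and the usual equal-variance two-sample formulas. The variable names are illustrative only.

```python
import math

# Population parameters taken from the post; names are illustrative.
mean1, mean2 = 6.0, 3.0
sigma = 2.0
n1 = n2 = 400

# Standard error of the difference between the two sample means.
se = math.sqrt(sigma**2 / n1 + sigma**2 / n2)
# t-score for the difference in means.
t_score = (mean1 - mean2) / se
# Degrees of freedom for the equal-variance two-sample test.
df = n1 + n2 - 2

print(round(se, 4), round(t_score, 1), df)  # 0.1414 21.2 798
```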



You might notice that whether we write ttest_ind(P1,P2) or ttest_ind(P2,P1), the sign of the t-statistic changes but the p-value does not. Why? By default, ttest_ind in the SciPy library performs a two-tailed two-sample test, so the p-value it returns is a two-tailed p-value.
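
A small sketch of this behavior, assuming NumPy and SciPy are installed. The seed and the names forward and reverse are arbitrary choices for the illustration, so the exact numbers will differ from those elsewhere in this post.

```python
import numpy as np
from scipy import stats

# Fixed seed (an arbitrary choice) so the run is repeatable.
rng = np.random.default_rng(0)
P1 = rng.normal(6, 2, 400)  # mean 6, sigma 2, 400 points
P2 = rng.normal(3, 2, 400)  # mean 3, sigma 2, 400 points

forward = stats.ttest_ind(P1, P2, equal_var=True)
reverse = stats.ttest_ind(P2, P1, equal_var=True)

# Swapping the arguments flips the sign of t but leaves the two-tailed p-value alone.
print(forward.statistic, forward.pvalue)
print(reverse.statistic, reverse.pvalue)
```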

Therefore, the correct way to test our null hypothesis in Python is as follows.
import numpy as np
from scipy import stats
P1 = np.random.normal(6,2,400)
P2 = np.random.normal(3,2,400)
stats.ttest_ind(P1, P2, axis=0, equal_var=True)
You will see results similar to the following (the exact numbers vary because the samples are random):
Ttest_indResult(statistic=21.374858126615408, pvalue=1.6807582123709593e-80)

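Because the t-statistic is positive, the real (one-tailed) p-value for our null hypothesis P1 >= P2 is

result = stats.ttest_ind(P1, P2, axis=0, equal_var=True)
real_t_score = result.statistic
real_pvalue = 1 - result.pvalue/2   # = 1 - 1.6807582123709593e-80/2 = 1 - 0.84e-80, which is approximately 1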

Because the real p-value is so close to 1, we cannot reject the null hypothesis that P1 >= P2 (6 >= 3).
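
The conversion from the two-tailed p-value to the one-tailed one depends on the sign of the t-statistic. Here is a minimal sketch of that rule for the alternative P1 < P2. The function name one_tailed_pvalue_less is hypothetical and not part of SciPy.

```python
def one_tailed_pvalue_less(t_stat, two_tailed_p):
    """One-tailed p-value for Ha: P1 < P2, derived from a two-tailed t-test result.

    Hypothetical helper for illustration; not a SciPy function.
    """
    if t_stat > 0:
        # The data point away from Ha, so most of the probability mass counts.
        return 1 - two_tailed_p / 2
    # The data point toward Ha, so only the lower tail counts.
    return two_tailed_p / 2

# Values reported earlier in this post.
print(one_tailed_pvalue_less(21.374858126615408, 1.6807582123709593e-80))  # 1.0 in floating point
```

Newer SciPy releases also accept an alternative argument on ttest_ind (for example alternative='less'), which, if your installed version supports it, avoids the manual conversion.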

October 19, 2016

Why do you need to take Tableau Certificate Desktop Exam?


You might have already read my previous blog post about How to prepare for the Tableau Certificate Exam. Before you invest time and money in the exam, you may wonder whether it is worth taking.


The fact is that there are not many data science certificates on the market. Tableau Software has become a very popular business intelligence tool over the past couple of years, despite its falling stock price. According to the Tableau Report Fiscal Year 2015, "88% of Fortune 500 companies, such as Cisco, Wells Fargo and Capital One, use Tableau, which bodes well for our land and expand strategy".


October 5, 2016

How to Add Mixpanel or Google Analytics on a Shiny App Correctly?

Before you start reading this post, I recommend reading the blog post on shiny.rstudio.com about how to add Google Analytics to your Shiny app. It's very likely that you followed all the steps in that post and it still did not work.

Three steps are missing from the blog post on the RStudio website. I will walk you through, step by step, how to add Mixpanel to a Shiny app correctly:


January 23, 2016

Data tells you the secrets of extramarital affairs

Recently I found an interesting dataset about women's extramarital affairs in 1974. The dataset was built from a survey and is now available to download via the Pandas package in Python.

Some interesting facts emerge from data visualization alone. If you are interested in an interactive dashboard, please click here