Wednesday, May 30, 2018

Looking for great visualizations with R? Look no further...

Ever needed a great graph for an R analysis to impress your ever-complaining boss? Well, here it is: http://www.r-graph-gallery.com.

R-GRAPH-GALLERY gathers (almost) all possible graphs and displays the possibilities in an organized way.

Here is a sample I found interesting, showing the correlation between two variables:

library(ggplot2)
# Simulated data: my_x and my_y both follow 1:100 plus Gaussian noise
data <- data.frame(cond = rep(c("condition_1", "condition_2"), each = 10),
                   my_x = 1:100 + rnorm(100, sd = 9),
                   my_y = 1:100 + rnorm(100, sd = 16))
# Base scatter plot, reused by the three variants below
p <- ggplot(data, aes(x = my_x, y = my_y)) + geom_point(shape = 1)
p                                                          # points only
p + geom_smooth(method = "lm", color = "red", se = FALSE)  # add linear regression line
p + geom_smooth(method = "lm", color = "red", se = TRUE)   # regression line with 95% confidence band
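To put a number on the relationship the plots show, here is a minimal sketch that computes the Pearson correlation and fits the same kind of linear model the red line represents. The fixed seed and the names r, fit and slope are my own assumptions for illustration; they are not part of the gallery example.

```r
set.seed(42)  # assumption: fixed seed so the numbers are reproducible

# Same simulated variables as in the plots above
data <- data.frame(
  my_x = 1:100 + rnorm(100, sd = 9),
  my_y = 1:100 + rnorm(100, sd = 16)
)

r <- cor(data$my_x, data$my_y)          # Pearson correlation coefficient
fit <- lm(my_y ~ my_x, data = data)     # linear model: my_y explained by my_x
slope <- unname(coef(fit)["my_x"])      # slope of the fitted regression line

print(r)
print(slope)
```

For a simple linear regression the slope equals the correlation scaled by sd(my_y) / sd(my_x), so the two numbers tell the same story from different angles.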




Tuesday, April 17, 2018

How to create and populate a Time Dimension?

A Time dimension is always a useful asset in a Data Warehouse. A simple way to create and populate one on SQL Server is the idea below, which inserts one row per hour:

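-- Start from an empty table
TRUNCATE TABLE DimContact.DimTime;

DECLARE @dtstart datetime;
-- After the TRUNCATE the table is empty, so MAX() returns NULL here
SELECT @dtstart = MAX(DimDate) FROM DimContact.DimTime;

-- Fallback start: 50000 days (about 137 years) before today
SELECT @dtstart = COALESCE(@dtstart, GETDATE() - 50000);

-- Generate one row per hour until 5000 days after today
WHILE @dtstart < GETDATE() + 5000
BEGIN

    INSERT INTO DimContact.DimTime (DimDate
        ,DateYear
        ,DateQuarter
        ,DateMonth
        ,DateWeek
        ,DateDay
        ,DateHour
        ,DateWeekDay)
    SELECT
        @dtstart,
        YEAR(@dtstart),
        DATEPART(QUARTER, @dtstart),
        MONTH(@dtstart),
        DATEPART(WEEK, @dtstart),
        DAY(@dtstart),
        DATEPART(HOUR, @dtstart),
        DATEPART(WEEKDAY, @dtstart);

    -- Progress message for each inserted row
    PRINT 'data ' + CAST(@dtstart AS VARCHAR);

    -- Move to the next hour
    SELECT @dtstart = DATEADD(HOUR, 1, @dtstart);
END;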
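If you want to prototype the same columns outside the database, here is a rough sketch in R that builds the hourly rows in one step instead of a loop. The start date, the 48-hour window and the UTC time zone are assumptions chosen only to keep the example small. Note that "%U" is R's own week numbering and may not match DATEPART(WEEK), and that wday + 1 gives Sunday = 1, which lines up with DATEPART(WEEKDAY) only when DATEFIRST is 7.

```r
# Assumption: fixed UTC start and a 48-hour window, for illustration only
start <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
dim_date <- seq(start, by = "hour", length.out = 48)  # one value per hour
parts <- as.POSIXlt(dim_date)                         # broken-down date parts (UTC)

dim_time <- data.frame(
  DimDate     = dim_date,
  DateYear    = parts$year + 1900,                   # POSIXlt years count from 1900
  DateQuarter = parts$mon %/% 3 + 1,                 # POSIXlt months are 0-based
  DateMonth   = parts$mon + 1,
  DateWeek    = as.integer(format(dim_date, "%U")),  # R's week numbering (weeks start on Sunday)
  DateDay     = parts$mday,
  DateHour    = parts$hour,
  DateWeekDay = parts$wday + 1                       # 1 = Sunday ... 7 = Saturday
)

print(head(dim_time, 3))
```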

Can you share your idea for a Time dimension?

Tuesday, April 10, 2018

How to deal with missing values in R

There are different ways to treat missing values in R. One of the most common is to replace the NA values with the mean (or mode). You can do that in a vector (or a column/row of a data frame) using the is.na function:

> testena <- c(2,3,8,NA,9)
> testena[is.na(testena)] <- mean(testena, na.rm=TRUE)
> testena
[1] 2.0 3.0 8.0 5.5 9.0
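The same idea carries over to data frame columns. Here is a minimal sketch that fills a numeric column with its mean and a text column with its mode. R has no built-in function for the statistical mode, so stat_mode below is a hypothetical helper written for this example, and the data frame df and its columns are made up.

```r
# Hypothetical helper: most frequent non-NA value (ties go to the first one seen)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up data frame with one NA in each column
df <- data.frame(
  score = c(2, 3, 8, NA, 9),
  group = c("a", "b", NA, "b", "b"),
  stringsAsFactors = FALSE
)

# Numeric column: replace NA with the mean of the observed values
df$score[is.na(df$score)] <- mean(df$score, na.rm = TRUE)

# Text column: replace NA with the most frequent observed value
df$group[is.na(df$group)] <- stat_mode(df$group)

print(df)
```

The mean only makes sense for numeric columns; for categories the mode is usually the simpler choice.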

On RStudio...



Check out my new book about the R language: http://www.amazon.com/dp/B00SX6WA06

Friday, February 23, 2018

Top 10 must-know Data Science algorithms

There are tons of algorithms and techniques you should learn in Data Science, but these are essential:

• Linear regression - https://lnkd.in/gZsqvir
• Logistic regression - https://lnkd.in/guQ9upd
• SVM - https://lnkd.in/g3NxzjC
• Random forest - https://lnkd.in/gMrNiRR
• Gradient boosting - https://lnkd.in/ggmktnB
• PCA - https://lnkd.in/gj_f4mp
• K-means clustering - https://lnkd.in/g9FZFfk
• Collaborative filtering - https://lnkd.in/gtE5HRB
• kNN - https://lnkd.in/gUvEqsR
• ARIMA - https://lnkd.in/grwBJNd