In the midst of a typhoon, another TokyoR meetup! … well, not really; it turned out to be a false alarm, and the weather was a wonderful 30 degrees Celsius with 800% humidity, as usual for Tokyo. My gripes about the weather aside, this month’s meetup was held at the Shinagawa, Tokyo headquarters of Cresco, an IT management strategy company.

In line with my previous round-up posts:

I will be going over around half of the talks. Hopefully, my efforts will help spread the vast knowledge of Japanese R users to the wider R community. Throughout, I will also link to helpful blog posts and other resources if you are interested in learning more about a certain talk’s topic. You can follow Tokyo.R by searching for the #TokyoR hashtag on Twitter.

Anyways…

Let’s get started!

BeginneR Session

As with every TokyoR meetup, we began with a set of beginner-focused talks:

Main Talks

ill_identified: Econometrics vs. Machine Learning

@ill_identified gave a talk on a topic more general than R itself: the differences between econometrics and machine learning/A.I. To start off, he listed some good resources (in English) from the economics/econometrics side of the discussion:

Looking at the methods both fields use, such as multiple linear regression, logistic regression (GLMs), Markov Chain Monte Carlo (MCMC), and non-parametric regression, it seems as though they might be the same thing… but an important distinction can be made:

  • Economics/Econometrics == Causal Inference
  • Machine Learning == Prediction

Frank Harrell, in Road Map for Choosing Between Statistical Modeling and Machine Learning, as well as Miguel Hernan, in Data science is science’s second chance to get causal inference right: A classification of data science tasks, discuss this at length if you want to take a look.

@ill_identified then took us through a couple of examples of causal inference from a variety of economics research, focusing on the spread and popularity of Randomized Controlled Trials (RCTs) and Difference-in-Differences (DID) in the field.
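
To make the DID idea concrete, here is a toy sketch of my own (entirely made-up data, not from the talk): with two periods and two groups, the DID estimate is just the interaction coefficient in a linear model.

```r
# toy difference-in-differences data: entirely made up, not from the talk
set.seed(123)
df <- data.frame(
  treated = rep(c(0, 1), each = 100),  # treatment group indicator
  post    = rep(c(0, 1), times = 100)  # before/after indicator
)
# outcome with parallel trends and a true treatment effect of 2
df$y <- 1 + 0.5 * df$treated + 1 * df$post +
  2 * df$treated * df$post + rnorm(200)

# the coefficient on treated:post is the DID estimate (~2)
summary(lm(y ~ treated * post, data = df))
```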

There was a talk on DID at last year’s Japan.R Conference that you can find here!

From the machine learning side, Athey (2018) talked about how, up to the present, economists have been trying to fit their models to the entirety of the data available to them, leading to potential problems of over-fitting. To counteract this problem, Athey notes how the field can learn from machine learning by using cross-validation techniques and penalized models.
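
As a rough illustration of what “penalized model plus cross-validation” means in practice (my own toy example, not Athey’s), here is a lasso fit with {glmnet} where the penalty is chosen by 10-fold cross-validation:

```r
library(glmnet)

# toy data: 20 candidate regressors, only 2 actually matter
set.seed(42)
x <- matrix(rnorm(100 * 20), nrow = 100)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

# lasso (alpha = 1) with the penalty chosen by 10-fold cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")  # most coefficients are shrunk to zero
```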

For causal inference from the machine learning side of the discussion, @ill_identified talked about Judea Pearl’s answer on Quora to the question, “What are the differences between econometrics, statistics, and machine learning?”.

In the answer, Judea Pearl makes a distinction between standard ML and advanced ML. The former (in which he specifically includes deep learning and neural networks) “fits a function to a stream of data and plays the same role as statistical analysis, taking us from samples to properties of distribution functions”, while the latter “goes beyond distributions onto the process that generates the data, and so, allows us to manage policy interventions and counter-factual reasoning”. He then points to two of his works, 7 Tools of Causal Inference with Reflections on Machine Learning and The Book of Why (written with Dana Mackenzie), for further reading. It is in the former work that Judea Pearl describes the three levels of the causal hierarchy (a small code sketch follows the list below):

  • Level 1: Association
  • Level 2: Intervention
  • Level 3: Counterfactuals
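
To make the causal-graph idea a bit more concrete, here is a small sketch with the {dagitty} package, using a hypothetical DAG of my own (not from the talk): you write down your causal assumptions as a graph, then ask which variables you would need to adjust for.

```r
library(dagitty)

# hypothetical DAG: encode causal assumptions as a graph
g <- dagitty("dag {
  education -> income
  ability   -> education
  ability   -> income
}")

# which variables must be adjusted for to identify the effect
# of education on income? (here: ability, the classic confounder)
adjustmentSets(g, exposure = "education", outcome = "income")
```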

There has been a lot of debate between the “Potential Outcomes” framework posited by Neyman & Rubin and Pearl’s “Causal Graphs/SEM” approach; Andrew Gelman has also talked about the issue here (2009) and here (2011). Very recently, Guido Imbens submitted an article, Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics, that discusses this at length and is probably worth checking out as well!

From the Rubin side of the debate, Guido Imbens, Susan Athey, and Victor Chernozhukov stand out as the primary researchers.

Athey:

Chernozhukov:

In the final section, @ill_identified went over a few new methods in A.I./ML that provide some evidence that A.I./ML does indeed share similarities with econometrics, mainly through the application of the structural estimation approach to modeling. A good overview is Artificial Intelligence as Structural Estimation: Economic Interpretations of Deep Blue, Bonanza, and AlphaGo by Mitsuru Igami, and you can read about the specific AIs discussed in more detail below.

kilometer00: R interface to Python

TokyoR organizer and frequent BeginneR session speaker @kilometer00 talked about using Python with R.

To familiarize the audience with Python, he went over quite a number of slides showing the similarities and differences in syntax between the two languages.

Next, @kilometer00 talked about the {reticulate} package, which allows you to call Python from R and provides translation between R and Python objects (such as R and Pandas data frames, or R matrices and NumPy arrays). Using {reticulate}, he stressed the importance of keeping Python in an isolated, “sandboxed” virtual environment for security and reproducibility. To do this, @kilometer00 likes to use Pipenv.

Once you’re done with all the set-up, you can install {reticulate} from CRAN, attach your Python virtualenv with reticulate::use_python(), and then finally start doing stuff! But be wary of type errors when you’re coding:
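
For example, a minimal sketch (the virtualenv path below is illustrative; use the one Pipenv created for you):

```r
library(reticulate)

# point {reticulate} at the Python inside your virtualenv
# (illustrative path, not a real one)
use_python("~/.local/share/virtualenvs/my-project/bin/python")

np <- import("numpy")

# R's 3 is a double, so this raises
# "TypeError: 'float' object cannot be interpreted as an integer"...
# np$zeros(3)

# ...while an explicit integer works fine
np$zeros(3L)
#> [1] 0 0 0
```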

You can also use Python in an R Markdown document by setting a code chunk’s language to Python. With recent developments in R Markdown you can now also share objects between languages by putting a prefix in front of the object name!
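
A minimal sketch of what this looks like (chunk contents made up by me): `r.` reaches R objects from Python, and `py$` reaches Python objects from R.

````
```{r}
# an R chunk: define an R data frame
df <- head(mtcars)
```

```{python}
# a Python chunk: R objects are reachable via the r. prefix
# (data frames arrive as pandas DataFrames)
msg = "rows: " + str(r.df.shape[0])
```

```{r}
# back in R: Python objects are reachable via the py$ prefix
py$msg
```
````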

Funnily enough, you can also run R in Python in R:
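
Something like the following shows the idea (my own sketch, assuming rpy2, Python’s interface to R, is installed in the active Python environment):

```r
library(reticulate)

# import rpy2 from inside R (assumes rpy2 is installed in the
# active Python environment)
robjects <- import("rpy2.robjects")

# R calling Python calling R
robjects$r('paste("Hello from", R.version.string)')
```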

Pythonception!

More resources on {reticulate}:

LTs

wkwk_soprano: Creating network graphs with R!

It’s been a while since @wkwk_soprano used R (5 years!), but he came back with aplomb, talking about network graphs at Tokyo.R! Network graphs are used in all sorts of fields, including physics, chemistry, linguistics, and the social sciences. In industry you might see them as part of a recommendation graph between customers and products on sale. Frustrated by the fact that he didn’t have a fun data set to use the {network} package on, @wkwk_soprano decided to create his own, based on his favorite manga, One Piece!

By counting up the times a character appeared in a manga panel with another character, he slowly built up a co-occurrence matrix of all the characters from Volume 1 to Volume 23. It took him about one hour per volume to create this data set; now that’s dedication!
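
As a rough illustration of the kind of data this produces (made-up characters and counts, not his actual data), a co-occurrence matrix can be turned into a graph with the {network} package he mentioned:

```r
library(network)

# toy co-occurrence matrix (made-up counts)
chars <- c("Luffy", "Zoro", "Nami", "Usopp")
mat <- matrix(c( 0, 42, 35, 20,
                42,  0, 28, 15,
                35, 28,  0, 18,
                20, 15, 18,  0),
              nrow = 4, dimnames = list(chars, chars))

# an undirected network with an edge wherever two characters co-occur
net <- network(mat > 0, directed = FALSE)
plot(net, displaylabels = TRUE)
```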

You can find the gum-gum fruits of his labor here.

After creating the data, @wkwk_soprano wanted to do some analysis on it, like graph embedding via DeepWalk or Large-scale Information Network Embedding (LINE). There’s actually an R package called Rline that implements this method, but he found it difficult to install and it hadn’t been updated in a while, so he went with the original C++ implementation from Jian Tang et al. The result was a distributed representation (embedding) of all the characters in the data.

Lastly, @wkwk_soprano wanted to find similarities between One Piece characters, so he computed cosine similarity using this code snippet, which allows you to extract the top N similar items from network embedding matrices. Taking a look at some popular characters, he was somewhat disappointed in the results, as from his extensive knowledge of the story he knew some of the character similarities just weren’t right!
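
For reference, here is a minimal sketch of the technique (my own version, not the snippet from the talk): cosine similarity between the rows of an embedding matrix, returning the top N matches.

```r
# cosine similarity between rows of an embedding matrix
# (rows = items, columns = embedding dimensions)
cosine_topn <- function(emb, item, n = 5) {
  norms <- sqrt(rowSums(emb^2))
  sims  <- drop(emb %*% emb[item, ]) / (norms * norms[item])
  head(sort(sims[names(sims) != item], decreasing = TRUE), n)
}

# toy usage: 4 characters embedded in 3 dimensions
set.seed(1)
emb <- matrix(rnorm(12), nrow = 4,
              dimnames = list(c("Luffy", "Zoro", "Nami", "Usopp"), NULL))
cosine_topn(emb, "Luffy", n = 3)
```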

More resources on network analysis in R:

gepuro: Translating tidyverse.org into Japanese!

After being involved in the Japanese translation of Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, @gepuro thought about trying his hand at translating tidyverse.org into Japanese!

In recent months there have been big changes to major tidyverse packages such as {tidyr} and {ggplot2}, with accompanying articles to boot. These articles, especially the new pivot functions vignette, are the ones @gepuro and fellow TokyoR community members such as @Atsusy have started working on in the past few weeks. To do a translation there are three key steps:

  1. Create a GitHub account
  2. Log into GitLocalize
  3. Access the specific GitLocalize repo where your translation project is located

GitLocalize looks like this:

Once you’re done, you create a “Review Request”, which is checked by the maintainer, @gepuro, for any errors. He’ll receive the “Review Request” as a Pull Request on the R Lang Document JA repo and, if everything is OK, it’ll be merged in!

There are other ways to contribute to the project as well such as:

  • Helping to make the text sound more natural in Japanese
  • Creating the blogdown website for the Japanese translation
  • Creating a vocab list of common R terminology in Japanese to use as a reference
  • and more!

I enjoyed this talk as it was similar to the talk by Riva Quiroga on translating the “R for Data Science” book and R data sets into Spanish that I heard at useR!2019 a few weeks ago (the talk is covered in my blog post here). If you’re good at English and Japanese, you can join the #translation channel on the Tokyo.R slack!

In other news, there was an announcement that this year’s Japan.R Conference will be on December 7th!

Other Talks

Food, Drinks, and Conclusion

TokyoR happens almost monthly and it’s a great way to mingle with Japanese R users, as it’s the largest regular meetup here in Japan. We’re finally taking a break next month, so the next meetup will be on September 28, and it will be a special session on {Shiny}!

Talks in English are also welcome, so if you’re ever in Tokyo, come join us!