SQL and Relational Databases SQL Code

You are to use SQL commands to query the database. You should strive to do as many of the computations
in SQL within the database rather than bringing back large amounts of data to R and then performing
computations there. There will be occassions when it is significantly simpler, or only feasible, to do
some of the computations in R, after an initial SQL query. However, you should try hard to come up with
the most comprehensive SQL query before you fall back to doing computations in R.
R Code
After completing the questions using SQL commands, implement questions 9, 10, 11, 12 using only R
commands and the dbReadTable() function to retrieve an entire table. (If you run into problems with the
size of the data, you can work with a large subset of each table.) Show the R code and verify that you
get the same results as with the SQL commands.
The Data
We will explore the Internet Movie Database (IMDB) available on-line at imdb.com and off-line via the
Alternative Interfaces. We can only do exploration on this data as a) it is essentially an entire
population, b) there are not very many insightful/interesting questions as there are very few numeric
variables, but most are qualitative. (We would like additional data such as revenues, reviews, audience
ratings, etc.) However, we will use it to explore SQL.
The SQLite database is available here
There are other datasets that are also interesting to explore when learning SQL.
• Lahman’s Baseball Database is terrific (if you have any interest in baseball). (There is a
chapter on this in Data Science in R: A Case Studies Approach available electronically via the UC
library on campus & off campus.)
• StackOverflow questions and answers, users, etc. Data are available at
https://archive.org/details/stackexchange
The following image shows schema for the tables in the database

READ ALSO :   Academic help online

We used the imdb-to-sql to convert the raw text files to the database and this uses regular
expressions, which we learned in the last assignment. Very big thanks to Ameer Ayoub for his code to
create the database from the raw files.
Note
• When I use the term “actor”, that includes both male and female actors.
• Similarly, by “movie”, we do not include TV series.
Questions
1. How many actors are there in the database? How many movies?
2. What time period does the database cover?
3. What proportion of the actors are female? male?
4. What proportion of the entries in the movies table are actual movies and what proportion are
television series, etc.?
5. How many genres are there? What are their names/descriptions?
6. List the 10 most common genres of movies, showing the number of movies in each of these genres.
7. Find all movies with the keyword ‘space’. How many are there? What are the years these were
released? and who were the top 5 actors in each of these movies?
8. Has the number of movies in each genre changed over time? Plot the overall number of movies in
each year over time, and for each genre.
9. Who are the actors that have been in the most movies? List the top 20.
10. Who are the actors that have had the most number of movies with “top billing”, i.e., billed as
1, 2 or 3? For each actor, also show the years these movies spanned?
11. Who are the 10 actors that performed in the most movies within any given year? What are their
names, the year they starred in these movies and the names of the movies?
12. Who are the 10 actors that have the most aliases (i.e., see the aka_names table).
13. Networks: Pick a (lead) actor who has been in at least 20 movies. Find all of the other actors
that have appeared in a movie with that person. For each of these, find all the people they have
appeared in a move with it. Use this to create a network/graph of who has appeared with who. Use the
igraph or statnet packages to display this network.
If you want, you can do this with individual SQL commands and the process the results in R to generate
new SQL queries. In other words, don’t spend too much time trying to create clever SQL queries if there
is a more direct way to do this in R.
14. What are the 10 television series that have the most number of movie stars appearing in the
shows?

READ ALSO :   Research Essay