SQL and Relational Databases

SQL and Relational Databases
SQL Code
You are to use SQL commands to query the database. You should strive to do as many of the computations in SQL within the database rather than bringing back large amounts of data to R and then performing computations there. There will be occassions when it is significantly simpler, or only feasible, to do some of the computations in R, after an initial SQL query. However, you should try hard to come up with the most comprehensive SQL query before you fall back to doing computations in R.
R Code
After completing the questions using SQL commands, implement questions 9, 10, 11, 12 using only R commands and the dbReadTable() function to retrieve an entire table. (If you run into problems with the size of the data, you can work with a large subset of each table.) Show the R code and verify that you get the same results as with the SQL commands.
The Data
We will explore the Internet Movie Database (IMDB) available on-line at imdb.com and off-line via the Alternative Interfaces. We can only do exploration on this data as a) it is essentially an entire population, b) there are not very many insightful/interesting questions as there are very few numeric variables, but most are qualitative. (We would like additional data such as revenues, reviews, audience ratings, etc.) However, we will use it to explore SQL.
The SQLite database is available here
There are other datasets that are also interesting to explore when learning SQL.
• Lahman’s Baseball Database is terrific (if you have any interest in baseball). (There is a chapter on this in Data Science in R: A Case Studies Approach available electronically via the UC library on campus & off campus.)
• StackOverflow questions and answers, users, etc. Data are available at https://archive.org/details/stackexchange
The following image shows schema for the tables in the database

READ ALSO :   Scientific thinking for acquisision of reserch skill

We used the imdb-to-sql to convert the raw text files to the database and this uses regular expressions, which we learned in the last assignment. Very big thanks to Ameer Ayoub for his code to create the database from the raw files.
Note
• When I use the term “actor”, that includes both male and female actors.
• Similarly, by “movie”, we do not include TV series.
Questions
1. How many actors are there in the database? How many movies?
2. What time period does the database cover?
3. What proportion of the actors are female? male?
4. What proportion of the entries in the movies table are actual movies and what proportion are television series, etc.?
5. How many genres are there? What are their names/descriptions?
6. List the 10 most common genres of movies, showing the number of movies in each of these genres.
7. Find all movies with the keyword ‘space’. How many are there? What are the years these were released? and who were the top 5 actors in each of these movies?
8. Has the number of movies in each genre changed over time? Plot the overall number of movies in each year over time, and for each genre.
9. Who are the actors that have been in the most movies? List the top 20.
10. Who are the actors that have had the most number of movies with “top billing”, i.e., billed as 1, 2 or 3? For each actor, also show the years these movies spanned?
11. Who are the 10 actors that performed in the most movies within any given year? What are their names, the year they starred in these movies and the names of the movies?
12. Who are the 10 actors that have the most aliases (i.e., see the aka_names table).
13. Networks: Pick a (lead) actor who has been in at least 20 movies. Find all of the other actors that have appeared in a movie with that person. For each of these, find all the people they have appeared in a move with it. Use this to create a network/graph of who has appeared with who. Use the igraph or statnet packages to display this network.
If you want, you can do this with individual SQL commands and the process the results in R to generate new SQL queries. In other words, don’t spend too much time trying to create clever SQL queries if there is a more direct way to do this in R.
14. What are the 10 television series that have the most number of movie stars appearing in the shows?
Alternative Database
Since the gender is messed up in the imdb_data created via imdb-to-sql, I have created a different version of the database using the imdbpy scripts. This creates a much more complete representation of the IMDB data, but with different tables and columns, i.e., a different schema. The full database is now 8.5 gigabytes. I have dropped some of the tables that we don’t need for our exploration and this brings it to 7.5 gigabytes.
You can elect to use this new version of the data (and hence get better results). The databases are available as
• the full version
• with 8 tables dropped
The schema for the full version is displayed below

READ ALSO :   open currency swap lines

________________________________________