56:219:521 DATA VISUALIZATION (dat sci)
56:824:728 DATA VISUALIZATION (soc sci/pub pol)
56:834:653 DATA VISUALIZATION (soc sci/pub pol)

https://theaok.github.io/vis most current syllabus (class materials updated continuously)
rucvis@googlegroups.com listserv (everyone in class gets these emails, use it often!) [if you didn't get the welcome email, email me: it is critical that you're on the list!]


Spring 2025; Thu 6.00-8.50pm BSB-134

prerequisites

No prerequisites, but the ability to learn programming is necessary. You need to be comfortable using a computer. Knowledge of Python and/or computer science/programming/scripting is helpful but not necessary. We will cover the basics. Social science/humanities students: this class is mostly coding/programming/scripting. If you do not like programming, this class is not for you. But you may not yet know whether you like it, and you may start liking it in this class: it has often happened before! Warning for people new to coding: don't get behind!

course description

This is an interdisciplinary applied data science class focused on visualization, an integral part of data science. We will also cover online visualization from within Python (a glue language). Visualization is perhaps the most rewarding part of data science as it produces insight, "aha moments." It is also perhaps the only part of data science that involves art: designing graphics. Some data management will also be covered as necessary to process data for visualization. We will mostly use Pandas and Matplotlib (and others building on it).

The course is relevant for the natural and social sciences and for quantitative/digital humanities.
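
As a rough illustration of the Pandas and Matplotlib workflow mentioned above, here is a minimal sketch. The tiny data frame and its column names (`year`, `happiness`) are made up for illustration; in class you would use your own data, and the class notebook may set things up differently.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, made up for illustration only
df = pd.DataFrame({
    "year": [2019, 2020, 2021, 2022],
    "happiness": [6.9, 6.5, 6.7, 7.0],
})

# pandas plotting draws on Matplotlib; passing ax lets us customize with Matplotlib
fig, ax = plt.subplots(figsize=(5, 3))
df.plot(x="year", y="happiness", marker="o", legend=False, ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("average happiness (made-up scale)")
ax.set_title("A first line chart")
fig.tight_layout()
```

The pandas call does the quick drawing and the Matplotlib `ax` methods handle the labels and title, which is roughly the division of labor suggested by the description above.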

learning objectives/outcomes

  • data visualization/story telling using graphics (most of the class)
  • about data (sources, best practices, tips and tricks): this class is all about data (you will use data you choose, which will serve you well beyond this class!)
  • the basics of computer programming (Python)
  • The key is the mastery of "data story-telling": 1) what the data are telling, 2) what I want to say, and 3) what the audience needs to know

    required textbooks and materials

    No required textbooks. All required materials (code, readings) will be provided.

    recommended course materials

    galleries [Py]

    general
  • comprehensive https://www.python-graph-gallery.com/
  • matplotlib (if you want to really customize it, most powerful/versatile way to do graphs but sometimes complicated code)
  • https://matplotlib.org/stable/gallery/index.html
  • and see notebook sec 'basics / setup with matplotlib'
  • pandas (very easy syntax, we use pandas for data management anyway; full documentation, rather dry and boring)
  • https://pandas.pydata.org/docs/user_guide/visualization.html
  • basic plot function http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
  • can also try pandas' rplot, especially trellis (old pandas versions only; the link is to the 0.14.1 docs): http://pandas.pydata.org/pandas-docs/version/0.14.1/rplot.html and http://pandasplotting.blogspot.com
  • others
  • seaborn (easy, fast, pretty!) https://seaborn.pydata.org/examples/index.html
  • plotly (interactive: pan, hover, pop up) https://plotly.com/python
  • galleries [concept/theory]

  • https://chart.guide
  • https://datavizproject.com
  • https://www.smashingmagazine.com/2023/01/guide-getting-data-visualization-right/ A useful intro read, i like the breakdown: Comparison, Composition, Distribution, and Relationship; also by level of measurement. And see a bunch of useful links at the bottom. Quite comprehensive/exhaustive, almost like python-graph-gallery.com, but in general you don't really need all of that: say you get 90% of the functionality with 10% of the charts. In general, there's much fanciness/novelty seeking, resulting in a proliferation of vis with little value added and an increasing probability of getting overwhelmed.
  • more like a blog but lots of great stuff https://flowingdata.com
  • online books/tutorials (traditional, lengthy, overly detailed, but great if you want a textbook/full elaboration)

  • looks great https://realpython.com/tutorials/data-viz/
  • by the creator of Pandas, up to date https://wesmckinney.com/book/, incl notebooks: https://github.com/wesm/pydata-book
  • maybe especially ch3 and ch4 https://jakevdp.github.io/PythonDataScienceHandbook/
  • see dat sci, gis, etc a-gallery-of-interesting-jupyter-notebooks
  • software

    [right before the break so can troubleshoot during the break]
    python
  • We will use Python 3.x (>=3.10).
  • It is free for Linux, Chromebook, Mac, and Win; say you can get anaconda or get it from python.org: https://www.anaconda.com/products/distribution or https://www.python.org/downloads and you can run it in an RStudio- or Stata-like environment with http://spyder-ide.org
  • BUT no need to download or install any software: we will run Python online in a web browser in the cloud, the so-called "Colab" (2 sections down). But first let's get GitHub running.
  • GitHub
    We will use GitHub to store the Python code in the form of a notebook; we will edit the notebook in colab (next sec).
  • sign up or login at github.com
  • (depending on os, browser) on top left hit "New" or "Create repository" or top right under plus "+" select "New repository"
  • pick some repository name, say "vis" ; keep selected 'Public'; important!: under "Initialize this repository with" check "Add a README file"; and hit at the bottom "Create repository"
  • then hit "Settings" towards the middle-top right; on the left select "Collaborators" tab and hit "Add people" : "theaok", and hit "Add theaok to this repository"
    workflow: my comments, diffs, inline response [let's go over this next week again]
  • i will run it in my Colab, edit, and upload back
  • diff and response to my comments: actually cleaner and better in colab: File-Revision history; or clunky in GitHub: you can click my commit message and see the so-called diff--the difference between your version and my version: important! do make sure to fix it up for the next ps; you may even have an inline response to my comments in your next ps (especially if sth is complex or if you disagree)
  • don't forget about a meaningful commit message; you can keep on uploading newer versions as many times as you like
  • note: when you click the file, you can then click 'History' and see how the file evolved over time :)
  • a thought about file naming: ps1.ipynb, ps2.ipynb, etc, or sections in one file; or just one file that you keep updating throughout with new stuff as we go!
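
    To make the idea of a "diff" in the workflow above concrete, here is a small sketch using Python's standard-library `difflib`. This is not how GitHub or Colab compute or display diffs; the two "versions" below are made-up notebook lines, used only to show what a line-by-line difference looks like.

```python
import difflib

# Hypothetical two versions of the same notebook cell, made up for illustration
yours = ["import pandas as pd", "df = pd.read_csv(url)", "df.plot()"]
mine = ["import pandas as pd", "df = pd.read_csv(url)", "df.plot(x='year', y='income')  # name the axes"]

# Lines starting with '-' were removed, lines starting with '+' were added
diff = list(difflib.unified_diff(yours, mine, fromfile="your version",
                                 tofile="my version", lineterm=""))
print("\n".join(diff))
```

    Reading a diff this way (minus lines are the old version, plus lines are the new one) is the same skill needed when looking at a commit on GitHub or at the revision history in Colab.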
  • colab
    You can just run Py notebook in Colab and save subsequent versions in Github that will keep track of changes [stick with this for the ps]
  • go to https://github.com/theaok/vis/blob/main/all.ipynb and hit 'open in colab' OR go to https://colab.research.google.com and on popup pick GitHub, search for:
    https://github.com/theaok/vis/blob/main/all.ipynb
    (it should find it; click "all.ipynb" and it should load into colab; then follow the instructions at the top of the file, ie save it in your GitHub etc)
  • and best class vis:
  • https://github.com/theaok/vis/blob/main/bestStudentVis.ipynb
  • https://github.com/ewattudo/vis1
  • data

    The class is a bit like an independent study: you will carry out some research (by doing visualizations). You need your own data for this class ASAP, the more data and the more complex, the better. Software will need to load the data straight up from online! Some data are easily downloadable from online eg https://gss.norc.org/get-the-data/stata, but many are not. Then you have to put data online yourself [just go over Git<25mb]: https://theaok.github.io/generic/howToPutDataOnline.html
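
    Loading data "straight from online" can look roughly like the sketch below. The URL in the comment is a hypothetical placeholder (not a real file), and the helper name `load_csv` is made up for illustration; `pd.read_csv` itself accepts a URL, a local path, or a file-like object. An in-memory file stands in for the online file so the sketch runs without a network connection.

```python
import io
import pandas as pd

def load_csv(source):
    """Read a CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# In class the source would be a URL, e.g. (hypothetical placeholder):
# df = load_csv("https://raw.githubusercontent.com/<user>/<repo>/main/data.csv")

# Stand-in for an online file, with made-up values
fake_online_file = io.StringIO("state,income\nNJ,89\nPA,72\nNY,81\n")
df = load_csv(fake_online_file)
print(df.head())
```

    Other formats have their own readers (for example `pd.read_stata` for Stata files), which can be pointed at a URL in a similar way.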

    icpsr: biggest repository of survey data; check out also var search
    google is great for data search; and it has data search, too
    google cloud/big query has data, too
    kdnuggets listing of sources, a lot!; kdnuggets is great in general for data science
    another kdnuggets listing; maybe actually better start here, easier to wrap your head around
    kaggle

    NOAA
    NASA

    datasets on GitHub
    pew

    advice/requirements and grading

  • 2 keys to success: start early AND ask many questions often; (and study groups: get a couple of people on zoom, screenshare notebooks, etc) This is a software class. It is different from typical soc sci classes! You will get stuck often and whenever stuck, email the listserv, ask me, ask your classmates, as opposed to pulling your hair out! And stop by my office, too. Googling (and built-in Gemini) solves most problems, but for many things it's better to talk to me and your classmates; it's also more social/human: if you talk to a computer all the time, it's not healthy.
  • There are several problem sets (ps) due the following week or typically in 2 weeks after being posted (as indicated in ps). You will be asked to write some computer code that does something that we covered in the class to your data. You may work in groups (<=2), but say who you worked with, and the more people in the group, the better/longer the code must be.
  • The final project (ps5) is like a final paper (doing some useful empirical quantitative research), except that I only grade code; in fact, you can submit code only.
  • 100% (5ps x 20%) problem sets [just Py notebook], may cowrite code (up to 2 people) but then the project should be 2 times better than a single-authored one
  • bonus/extra up to 5% engagement, class participation eg answering/asking questions, helping others, listserv discussions
  • bonus/extra up to 5% civic engagement (see bottom of the syllabus)

  • calendar


    [*] = bonus (extra/not required)
    jan23 intro vidSp25 vidSp23
  • ps0.pdf
  • see some vids, can see screen with good resolution for coding steps:)
  • intro.pdf
  • https://github.com/theaok/vis/blob/main/all.ipynb
  • [*] if time: final_project.pdf: just skim through TOC
  • [*] Data revolution! economist data data everywhere

  • data management

    jan30 data management 1 vidSp25 vidSp23
  • ps1.pdf
  • data.pdf
  • continue with notebook
  • feb6 data management 2 vidSp25 vidSp23
  • revisit what we did so far, esp difficult topics like merge: run sec 'merge' again
  • make it interactive: q and a, work on ps1, wrap up dat man
  • do dive into vis: discuss what folks did so far in their ps in terms of vis, what worked and what didn't; get going with next week's class, at least vis tables and mpl setup first main cell
  • feb13: no class; instead register and come to Bailey lecture: https://dppa.camden.rutgers.edu/bailey-memorial/
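
    Since merge is flagged above as one of the difficult data management topics, here is a minimal sketch of a many-to-one merge in pandas. The two toy tables and their column names are made up for illustration, and the notebook's sec 'merge' may do it differently.

```python
import pandas as pd

# Hypothetical toy tables, made up for illustration
people = pd.DataFrame({"county": ["A", "A", "B", "C"], "income": [40, 55, 38, 61]})
counties = pd.DataFrame({"county": ["A", "B", "D"], "population": [1000, 2500, 400]})

# Many-to-one left merge: keep every person, attach county info where available
merged = pd.merge(
    people, counties,
    on="county", how="left",
    validate="m:1",   # raise an error if 'county' is duplicated in counties
    indicator=True,   # add a _merge column showing where each row matched
)
print(merged["_merge"].value_counts())
```

    Checking the `_merge` column right after merging is a quick way to see how many rows failed to match before moving on to the vis.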

    VIS

    feb20 dive into vis: notebook vidSp23
  • ps2.pdf
  • go over ps1 comments from listserv; and diff: https://colab.research.google.com/github/soymlk94/datavis_sp24/blob/main/Copy_of_ps1.ipynb
  • a point about merging and in general data management/processing: we do not have time to be thorough, this is vis class and we have to move on! again, simplify, have fewer obs, subset, do only easier part [btw i teach dat man class]
  • notebook
  • feb27 vis in notebook vid vidSp23
  • go over magics and themes/styles and mpl setup again
  • flip the class and work on ps2; present https://github.com/erikaguiracocha/Data-Visualization-2025/blob/main/PS2erikaguiracocha.ipynb
  • notebook
  • mar6 vis theory vid vidSp23
  • wrap up mpl: revisit key stuff, q and a (mostly done, then really focus on your projects/vis: presentations/discussions)
  • vis by others: examples
  • theory.pdf
  • pull up some of your vis / flip the class and work on ps2
  • mar13 ps2 presentations !!zoom only!! https://rutgers.zoom.us/j/8892839953?pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09 vidSp23
  • ps3.pdf
  • present ps2 (just focus on key/best vis, typically 3-5 graphs) 10min sharp (i will cut you off) and 10min discussion
  • look into your github repo for my comments (and explore your own progress): can diff in github but clunky, better in colab! File-Revision history; let's do like 3 examples incl some of your repos
  • mar20 sp break no class
    mar27 ps3 presentations 10min (sharp!) + 10min discussion vid
  • ps4.pdf
  • ps BONUS! [up to 5pts extCre] present someone else's vis, incl python code (not too difficult!) (email me your ipynb to get ok and schedule date; and then email ipynb to rucvis@googlegroups.com ahead of presentation): can do it next week or later; again email me first about it

  • your vis projects (and advanced vis)

    we do slow down and focus on your vis projects, flip the class, present work in progress, etc
    we will also cover a few more advanced vis topics (bonus/not required): interactive vis, maps, etc
    apr3 advanced mpl and interactive/plotly/d3 vid vidSp23
  • clustering and advanced mpl
  • https://github.com/theaok/vis/blob/main/plotly.ipynb
  • if time: flip class and work on ps4; revisit theory
  • apr10 ps4 presentations vid vidSp23
  • ps5.pdf
  • time: 9, discussions: 9
  • apr17 maps vid
  • let's quickly go over ps5.pdf again
  • 10min Erika extra credit presentation
  • https://github.com/theaok/vis/blob/main/map.ipynb
  • apr24 ols vis and wrapup vid vidSp23 vidSp23
  • let me go over Erika and Sai ps5 draft, plus general comments for everyone: https://colab.research.google.com/github/erikaguiracocha/Capstone-Project/blob/main/Ps5erikaguiracocga.ipynb and https://github.com/SaiAnirudh659/Vis/blob/main/ps5/SaiAnirudh_PS5.ipynb
  • 10min Shirley extra credit presentation
  • finish last week's class and revisit, q and a
  • ad http://theaok.github.io/swb
  • final_project.pdf: just skim through TOC
  • check out my working paper and vis notebook: this is important! vis in the real world! (0) start with theory/lit/idea; (1) it is always necessary to manipulate the data for the right vis!; (2) it takes a bunch of vis to find the right one; and while trying to find the best way to tell the story, let the data speak, don't force it!
  • also see https://link.springer.com/article/10.1007/s11482-019-09719-y create var that is ratio in 1st vis 2nd panel; different levels of measurement for robustness: country, region, state, county; cross-section and time series
  • theory.pdf quickly revisit secs
  • revisit the class material, q and a: wrap.pdf
  • if time: ols.ipynb
  • flip the class and work on ps5
  • may1 last class: ps5/final presentations !!zoom only!! https://rutgers.zoom.us/j/8892839953?pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09
  • time: 9, discussions: 9
  • see Canvas for your predicted course grade so far
  • just to be safe, you may want to delete the data you have posted online; you never know: someone may be picky about it

    rules

    do not share or link to class videos! These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.

    attendance. Attendance is recommended. Be advised that you are responsible for any material covered in the class, whether or not it was in the readings or lecture notes. You are also responsible for any announcements made in class. For most students, attendance is simply essential to learning the material. If you do need to miss a class, be sure to consult with a fellow student to learn what transpired.

    incompletes: Generally speaking, the material in this course is best learned as a single unit. I will grant incompletes only in cases where a substantial change in life circumstances occurs that is beyond the control of the student, and only with appropriate documentation.

    study groups. You are encouraged to form a regular study group. Many students over the years have found the study groups to be very helpful. Study groups are permitted and encouraged to work on the problem sets together. However, each individual student should write up his or her own answer to hand in, based on his or her own understanding of the material. Do not hand in a copy of another person’s problem set, even a member of your own group. Writing up your own answer helps you to internalize the group discussions and is a crucial step in the learning process.

    academic integrity. I am very serious about this. Make no mistake--I may appear accommodating and informal--but I am extremely strict about academic integrity. Violations of academic integrity include cheating on tests or handing in assignments that do not reflect your own work and/or the work of a study group in which you actively participated. Handing in your own work that was performed not for this class (e.g. other class, any other project) is cheating, too. I have a policy of zero tolerance for cheating. Violations will be referred to the appropriate university authorities. For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy

    accommodating students with disabilities. Any student with a disability affecting performance in the class should contact the disability office ASAP: https://success.camden.rutgers.edu/success-services/disability-services/