Fall 2023: cross-listed with Data Processing with Python. This year most of this class is in Python (*not* Stata): https://theaok.github.io/datManPy

56:824:718 data management
(56:834:651 special problems in pub pol and adm)

https://theaok.github.io/dm most current syllabus (class materials updated continuously)
labs: during office hours

Fall 2023 Tue 6.00-8.50pm BSB-134

prerequisites

You need to be comfortable using a computer. Knowledge of Python (or Stata, R, etc.) and data management/computer science is helpful but not necessary. We will cover the basics.

course description

Most of this class is writing computer programs; if you do not like programming, this class is not for you... But you may not yet know whether you like it, and you may start liking it in this class: it has often happened before!

There are two major components to the class: (1) data management, (2) simple programming. We will use Python and Stata. I'd discourage R for data management (unless you have small, simple datasets and already use R).

The class teaches tools for data cleaning, organization, quality control, and automation. Topics include data types, text/math functions, labeling, recoding, documentation, merging, reshaping, and programming (macros, loops, and branching).

This class should be named "programming Stata for social science" or "intro to data science for social science using Stata"

Stata is excellent software for data management. But sometimes you need to use a general purpose programming language for data management. Python is both powerful and easy to use. We will use it for file manipulation, text processing, interacting with APIs, and scraping websites. [Python is optional]

learning objectives/outcomes

The key is mastery of "data storytelling": 1) what the data are telling, 2) what I want to say, and 3) what the audience needs to know.

You'll learn:
  • about data (sources, best practices, tips and tricks): this class is as much about Stata as about data (you'll use data you choose, which will serve you well beyond this class!!)
  • the basics of computer programming
  • the practice of data management (there will be some theory, too)
  • how to conduct reproducible research
  • how to automate research by programming
  • basics of Python for data management [optional]
  • Git, a version control system
  • You'll demonstrate mastery of the material by writing code for a project/paper using learned techniques; you may cowrite code (up to 2 people), but then the project should be 2 times better than a single-authored paper

    required textbooks and materials

    No required textbooks. All required materials (code, readings) will be provided.

    recommended course materials

  • Most of the class is based on: Mitchell "Data Management Using Stata: A Practical Handbook" http://www.Stata.com/bookstore/dmus.html (nice)
  • A similar book, but with a focus on organization: Long "The Workflow of Data Analysis Using Stata" http://www.Stata.com/bookstore/wdaus.html (much boilerplate, outdated)
  • Programming, specifically, is covered in Chris Baum "An Introduction to Stata Programming" https://www.stata.com/bookstore/introduction-stata-programming/ (detailed)
  • For beginners: "A Gentle Introduction to Stata, 3rd Edition" https://www.stata.com/bookstore/gentle-introduction-to-stata/
  • Also for beginners: "Statistics with STATA: Version 12" by L Hamilton https://www.stata.com/bookstore/statistics-with-stata/ (good but overpriced)
  • But there is no need to buy books; there is plenty of superb free material online:
    UCLA is the best website: https://stats.idre.ucla.edu/stata/
    UCLA for data management: https://stats.idre.ucla.edu/stata/seminars/stata-data-management/
  • love this guy https://www.princeton.edu/~otorres and many more links here: http://www.Stata.com/links/resources1.html and here

  • software

    stata
  • We will use Stata 16 or 17 (Intercooled/IC or higher: SE or MP).
  • Some lucky people can download it for free at https://software.rutgers.edu; but you probably have to be an RU employee
  • You can buy a Stata/BE (Basic Edition) perpetual license for $225 https://www.stata.com/order/new/edu/profplus/student-pricing
  • There is not much difference between versions: Stata >=12 is fine; a new version is out every 1-2 years
  • Or you can just run it remotely: https://apps.camden.rutgers.edu/novnc/; note: just hit connect at the top right, type your netid (it won't show anything typed), hit enter, type your password, and hit enter; see the howto at https://it.camden.rutgers.edu/help/remote-x/, especially how to resize the geometry to fit your screen (important!), e.g. for 1280×1024: netid:geom=1280x1024

    an alternative is bad windows via RU Camden Virtual Lab https://rcit.rutgers.edu/virtlab

    sometimes may need to run those within VPN vpn1.rutgers.edu, and sometimes may need to activate apps first (maybe even vpn too): on the left 'Service Activation' https://netid.rutgers.edu/index.htm
    git
    howto get started with git (very quick HOWTO on basic use of github.com):
    [right before the break so can troubleshoot during the break] (first make a quick ex1.do with say just 'sysuse auto, clear' and 2nd line 'd')
  • sign up or login at github.com (can also use bitbucket.org or something else)
  • may see on your right 'New repository' button: click it; or on the left go to 'repositories', click 'new'; then pick some name for your repository, keep selected 'public', important!: must check 'Initialize this repository with a README', and click button 'Create repository'
  • now can simply hit button 'Upload files' and choose your dofile, say ps0.do, important: add some meaningful commit message, say: 'first try on importing and exporting data' and hit 'Commit changes' button
  • then hit 'Settings' towards the top-right, and then on the left select the 'Collaborators' tab, add me "theaok", and hit 'Add collaborator'; that's it!
  • I will download it, edit, and upload back
  • you can click my commit message and see the so called diff--the difference between your version and my version
  • you can download this latest version first, edit it, and upload it back when done--don't forget a meaningful commit message--and you can keep uploading newer versions as many times as you like
  • note: when you click the file, you can then click 'History' and see how the file evolved over time :)
  • a thought about file naming: ps0.do, ps1.do, etc or just substantive name and keep it updating with new stuff as we go! say "incomeInequalityAcrossCounties.do"
  • below are general references on how to get started using it fully, probably the first two are most useful
  • http://www.sitepoint.com/git-for-beginners/
  • http://rogerdudler.github.io/git-guide/
  • http://stackoverflow.com/questions/315911/git-for-beginners-the-definitive-practical-guide
  • https://backlogtool.com/git-guide/en/intro/intro1_1.html
  • recommended software

    Python
  • there will be at least 2 classes in the second part of the semester about Python
  • Python is a general purpose programming language that can do much more than stata (statistical software)
  • Python is the most user friendly and easy to use general programming language
  • Stata 16 or 17 can embed Python
  • advice/requirements

  • 2 keys to success: start early AND ask many questions often. This is a software class. It is different from all other classes! You will get stuck often, and whenever you are stuck, email me, as opposed to pulling your hair out! And stop by my office, too.
  • There are 6 problem sets (ps) due the following week after being posted (unless indicated otherwise; some ps will be due in 2 weeks). You will be asked to write some computer code that does something that we covered in the class to your data. You may work in groups (<=2), but say who you worked with, and the more people in the group, the better/longer the code should be.
  • The final project is like a final paper (doing some useful empirical quantitative research), except that I only grade code; in fact, you can submit code only.
  • grading

  • problem sets 60% (6ps x 10%) [just computer code (dofile)]
  • final project [just another computer code dofile wrapping previous ps] 40%

  • calendar

    [*] = bonus (extra/not required)

    jan20 intro [old vid] vid pwd: d#&6^bw5
  • ps0.pdf
  • intro_to_course.pdf   intro.do
  • replication.pdf
  • final_project.pdf: just skim through TOC
  • [*] Data revolution! economist data data everywhere
  • [*] "The end of theory" http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

  • data management

    jan27 data formats and conversion (quick stata lab at 5.30 and will stay 15 min after the class if needed)
    zoom vid pass: Yzcm*6pX
    We will talk about the basic data formats, conversion between them, and how they can be imported into and exported from Stata.
  • ps1.pdf
  • dataFormats.pdf   dataFormats.do
  • https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesentering-data/
  • present ps0
  • [*] flip the class: we will flip the last half an hour or so
  • [*] Mitchell ch2
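
To make the idea of converting between data formats concrete, here is a minimal sketch in Python using only the standard library. The table (counties, years, incomes) is made up for illustration, and the function names are hypothetical; it is not the class code, just one possible way to go from CSV to JSON and back.

```python
import csv
import io
import json

# Hypothetical data: a tiny CSV table kept in a string so the example is self-contained.
CSV_TEXT = """county,year,income
Camden,2020,52000
Essex,2020,61000
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts, storing the numeric columns as int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["year"] = int(row["year"])
        row["income"] = int(row["income"])
        records.append(row)
    return records


def records_to_csv(records):
    """Write a list of dicts back to CSV text, using the first record's keys as the header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


records = csv_to_records(CSV_TEXT)
json_text = json.dumps(records, indent=2)            # CSV -> JSON
round_trip = records_to_csv(json.loads(json_text))   # JSON -> CSV

print(json_text)
print(round_trip == CSV_TEXT)
```

Note that CSV stores everything as text, so the numeric columns have to be converted explicitly; JSON keeps the distinction between numbers and strings.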
  • feb3 [IN PERSON! NO MORE ZOOM!] data manipulation vid [old vid] [zoom vid] pass: rxi@s!x5
  • again, push stuff early to github and ask for comments!!!
  • start looking at github and ps1
  • ps2.pdf
  • manipulate.pdf   manipulate.do
  • https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesmodifying-data
  • if time or at home: do Example 1 (p12) from https://www.stata.com/manuals14/dmerge.pdf; at home: read that whole file and make sure you run examples; try to come to labs where we can discuss and practice more!
  • [*] Mitchell ALL
  • feb10 combining data vid [old vid] zoom vid pass: 4D%^X8zj
    This class covers the key command for this class: merge.
  • ps3.pdf
  • (start with a look at git repos) mergeAppendReshape.pdf   mergeAppendReshape.do
  • merge conceptual setup: https://www.princeton.edu/~otorres/Merge101.pdf
  • merge practice: (make sure you run examples that start on p12!): https://www.stata.com/manuals14/dmerge.pdf
  • a quick overview of what we have done so far and doing today http://dss.princeton.edu/training/DataPrep101.pdf
  • reshape (and see "help reshape"!) https://stats.idre.ucla.edu/stata/modules/reshaping-data-wide-to-long/    https://stats.idre.ucla.edu/stata/modules/reshaping-data-long-to-wide/    
  • [*] Mitchell ALL
  • flip the class
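
As a conceptual illustration of the merge covered in this class, here is a minimal sketch in plain Python of a one-to-one merge on a key. The data are made up, the function name merge_one_to_one is hypothetical, and the `_merge` codes only imitate the ones Stata reports (1 = master only, 2 = using only, 3 = matched); Stata's own merge does much more.

```python
# Hypothetical data: two small tables that share the key "county".
master = [
    {"county": "Camden", "income": 52000},
    {"county": "Essex", "income": 61000},
]
using = [
    {"county": "Camden", "population": 523000},
    {"county": "Ocean", "population": 637000},
]


def merge_one_to_one(master_rows, using_rows, key):
    """One-to-one merge on `key`, adding a `_merge` code to every output row."""
    using_by_key = {row[key]: row for row in using_rows}
    master_keys = {row[key] for row in master_rows}
    # A one-to-one merge requires the key to be unique in both tables.
    if len(using_by_key) != len(using_rows) or len(master_keys) != len(master_rows):
        raise ValueError(f"{key!r} does not uniquely identify rows")

    merged = []
    for row in master_rows:
        match = using_by_key.pop(row[key], None)
        if match is None:
            merged.append({**row, "_merge": 1})            # master only
        else:
            # Master values win if both tables carry the same variable.
            merged.append({**match, **row, "_merge": 3})   # matched
    for row in using_by_key.values():
        merged.append({**row, "_merge": 2})                # using only
    return merged


merged = merge_one_to_one(master, using, "county")
for row in merged:
    print(row)
```

Tabulating the `_merge` codes after every merge is a quick check that the merge did what you expected.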
  • feb17 continue with merge: digest, practice; and: organization, documentation [old vid] vid
    We will continue with last week's class. We covered the key topic, merge, last week, so today we make sure that everything is crystal clear. Also: labeling data, variables, and values, and keeping your data organized.
  • if needed: revisit the code from last class
  • organize.pdf organize.do
  • present ps2 in the middle of the class
  • [*] https://stats.idre.ucla.edu/stata/modules/labeling-data/
  • [*] Scott Long "The Workflow of Data Analysis Using Stata" ALL
  • [*] Mitchell ch5
  • if time: wrapup/exercises/tutorials/flip as per next class
  • feb24 ps3 presentations and revisit/wrap up
    vid Revisit/wrap up what we did so far, esp import/export and manipulating data; next week new module: visualization/graphs.
  • do organize.pdf and organize.do from last week
  • present ps3 15min and 15min discussion
  • wrap up what we did so far, review, revisit, maybe do http://dss.princeton.edu/training/DataPrep101.pdf
  • flip the class: flip the last 45min or so; maybe: exercises from last few classes; and tutorials from links from last classes

  • mar3 visualization 1 and exporting results vid [old vid] zoom vid pass: yJyr3+k2
    NOTE: there are 2 classes on this important topic of visualization. Graphics are critical in understanding data, and understanding data is critical in data management.

    Results from Stata are data too, and we need to manage them too!
  • ps4.pdf
  • c7_graphics.pdf   c7_graphics.do
  • mar10 Visualization 2 vid [old zoom vid] pass: K#BaXN7%
  • (traditional tables: table_jargo.do) exp.do and see at home http://dss.princeton.edu/training/Outreg2.pdf AND great resource per regressions: just follow these examples in your dofile! https://stats.idre.ucla.edu/stata/webbooks/reg/
  • theory.pdf: try to do as much as we can; possibly just get to the key slides in the first part, based on Scott Long, on theory for social science
  • 15min zach presentation zachPresentation.do and zachPresentation.pptx
  • flip the class: we spend most of the class on your graphs, so please have some and be ready to present: the more you have, the more help you will get!

  • mar17 spring break

    programming

    mar24 programming elements: macros, loops vid [old vid] zoom vid pass: 2T^1pk@B
    Introduction to the elements of programming in Stata: macros and loops. Macros are the building blocks of Stata programs. Loops are very useful for automating repetitive tasks in Stata.
  • ps5.pdf
  • macrosLoops.pdf   macrosLoops.do
  • ps4 10 min (plus 10min discussion) presentations: focus on graphs
  • a very basic introduction to basic programming http://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
  • present ps4/graphs as time allows
  • [*] https://stats.idre.ucla.edu/stata/faq/how-can-i-reshape-doubly-or-triply-wide-data-to-long
  • [*] an introduction to programming https://stats.idre.ucla.edu/stata/seminars/stata-programming/
  • [*] foreach examples https://stats.idre.ucla.edu/stata/modules/working-across-variables-using-foreach
  • [*] more examples http://fmwww.bc.edu/ec-p/wp612.pdf
  • [*] Mitchell ch9
  • [*] Baum first few chapters
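
The idea behind macros and loops carries over to any language. Below is a minimal sketch in Python (not Stata) with made-up data: the list `varlist` plays a role similar to a Stata local macro holding a variable list, and the loop repeats the same summary for every variable instead of copying and pasting the code three times. All names and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical data: one dict per observation.
rows = [
    {"income": 52000, "rent": 1200, "age": 36},
    {"income": 61000, "rent": 1500, "age": 41},
    {"income": 58000, "rent": 1350, "age": 39},
]

# Plays a role similar to a Stata local macro that stores a list of variable names.
varlist = ["income", "rent", "age"]

# The loop body is written once and runs for each variable in the list.
summary = {}
for var in varlist:
    values = [row[var] for row in rows]
    summary[var] = {"mean": mean(values), "min": min(values), "max": max(values)}

for var, stats in summary.items():
    print(f"{var}: mean={stats['mean']:.1f} min={stats['min']} max={stats['max']}")
```

Adding a variable to the summary now means changing one line (the list), which is the main payoff of automating repetitive tasks.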
  • mar31 advanced macros and loops; [*] replication/practice using my dofiles vid [old vid] zoom vid pass: $H3=Bc@4
  • anyone would like to present some loops, macros you've made so far?
  • advMacLoo.do
  • merging and data management project: replicateMiComp.zip
  • cars and happiness paper: replicateLsCar.zip
  • [*] work and happiness paper: REPLICATION.tar.bz2
  • [*] examples of replication materials http://myweb.uiowa.edu/fboehmke/methods.html
  • [*] Baum first few chapters
  • flip the class: work on ps5
  • apr7 text as data and quick start with python vid [old vid] [old zoom vid] pass: !=#7*BtD
  • ps6.pdf
  • note: do brief presentations of ps5
  • stata_text.pdf   stata_text.do
  • we are finishing stata, anything to revisit, eg loops, any questions? btw did we go too fast or too slow?
  • a quick dive into python (if time)
  • basPy.pdf
  • colab (probably thru bas des sta)

  • python

    apr14 python for social science data management vid [old vid]
  • general point for ps6 and final project: the code you run must make substantive sense, too! don't just run stuff for the sake of it...we code to accomplish something
  • start with py from last class in colab: 'basic descriptive statistics' and continue throughout apiPy.pdf api: get data from internet
  • colab: pulling data from wb, fred
  • theory.pdf [very important for ps6 and final project!!]
  • if time: discuss the final project for this class; take a quick look and skim through https://theaok.github.io/dm/final.pdf
  • apr21 graphs/maps vid [old vid]
  • [sp22 added new sec PANDAS: start with that]
  • colab (matplotlib, gis)
  • >>>note: the following will be updated>>>

    apr28 last class presentations and wrap up [old vid]
  • ps6 15min (sharp; I'll cut you off) student presentations, focus on the bottom line/results (eg des sta, substantive findings in your soc sci research)!, which we will discuss and brainstorm; just code, no need for ppt
  • student presentations: what data; why? what is special about those data? any limitations?; show a nice chunk of code you're proud of; show some interesting des stats or graphs or maps or network analysis etc; also ask us questions!
  • have a look in Canvas at your predicted course grade so far; remember the final project is 40% of the grade
  • if you get >=9.5 on ps6 AND your total is >=95% in Canvas, you're done: you can just submit ps6 as the final project and it's an A
  • check out my paper on happiness and pop growth across us counties: pdf and colab
  • wrap.pdf
    final project due on may5 at 6pm
  • final.pdf
  • final_project.pdf
  • just to be safe, delete the data you have posted online, you never know: someone may be picky about it

    rules

    do not share or link to class videos! These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.

    attendance: Attendance is recommended. Be advised that you are responsible for any material covered in the class, whether or not it was in the readings or lecture notes. You are also responsible for any announcements made in class. For most students, attendance is simply essential to learning the material. If you do need to miss a class, be sure to consult with a fellow student to learn what transpired.

    incompletes: Generally speaking, the material in this course is best learned as a single unit. I will grant incompletes only in cases where a substantial change in life circumstances occurs that is beyond the control of the student, and only with appropriate documentation.

    study groups. You are encouraged to form a regular study group. Many students over the years have found the study groups to be very helpful. Study groups are permitted and encouraged to work on the problem sets together. However, each individual student should write up his or her own answer to hand in, based on his or her own understanding of the material. Do not hand in a copy of another person’s problem set, even a member of your own group. Writing up your own answer helps you to internalize the group discussions and is a crucial step in the learning process.

    Academic Integrity. I am very serious about this. Make no mistake--I may appear accommodating and informal--but I am extremely strict about academic integrity. Violations of academic integrity include cheating on tests or handing in assignments that do not reflect your own work and/or the work of a study group in which you actively participated. Handing in your own work that was performed not for this class (e.g. other class, any other project) is cheating, too. I have a policy of zero tolerance for cheating. Violations will be referred to the appropriate university authorities.

    For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy

    Accommodating Students with Disabilities. Any student with a disability affecting performance in the class should contact the disability office ASAP: http://learn.camden.rutgers.edu/disability/disabilities.html